Uniquely identify URLs by hash of url instead of archive timestamp
My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).
The first time I run archive.py, I end up with several archive directories named like 1317249309, 1317249309.0, 1317249309.1, …. These directory names correspond properly with entries in index.json, as expected.
If I run archive.py a second time with the same input, it appears to rewrite index.json, assigning different numerical suffixes to the 1317249309 timestamp. The entries in index.json no longer correspond with the contents of those archive directories on disk.
You can reproduce this with the following JSON file (pinboard.json):
[{"href":"http:\/\/www.flickr.com\/groups\/photoshopsupport\/discuss\/72157600201629413\/","description":"Flickr: Discussing Index Of Topics: Compliments of LifeLive~ in Photoshop Support Group","extended":"","meta":"c9aa62c0eaa3c35a587903100870df43","hash":"8dd9951810c0eae6af67651341af5110","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography photoshop retouching"},
{"href":"http:\/\/allinthehead.com\/retro\/345\/whats-in-your-utility-belt","description":"What's In Your Utility Belt? \u2014 All in the head","extended":"","meta":"746e69822f36f2e78c16fc789a7545b5","hash":"ac4d0527bca6c7d6741fee117f45f631","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"php"},
{"href":"http:\/\/www.tyndellphotographic.com\/plasticwallet.html","description":"Plastic Wallet Boxes for Wallet sized photos","extended":"","meta":"c133eb53f29d97c35c3f31768ff7ce45","hash":"60bbf228c559518b818ed7d0ff997a69","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography supply"},
{"href":"http:\/\/www.arduino.cc\/","description":"Arduino - HomePage","extended":"","meta":"a80835b5f374965f5f8a5990da6cf2be","hash":"78532ff2155cd9feeac11aba18739bdc","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"arduino elecdiy"},
{"href":"http:\/\/mbed.org\/","description":"Rapid Prototyping for Microcontrollers | mbed","extended":"","meta":"644e8e0c9ae522eb1ca025c2af604f7d","hash":"fd2d014879e63a9aca6c18eb11e19b02","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy"},
{"href":"http:\/\/www.tasankokaiku.com\/jarse\/?p=268","description":"Jarse \u00bb Blog Archive \u00bb Kohtauskone","extended":"","meta":"8483f7b4d0423ddd0930142c55c909e3","hash":"e971d3670f0fe1b2638c343e458f88bd","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy arduino dmx512"}]
Run the following commands:
./archive.py ~/path/to/pinboard.json
# contents on disk match up with contents of index.json
./archive.py ~/path/to/pinboard.json
# timestamp suffixes in index.json have been changed and no longer match content on disk
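As a rough illustration of the idea in the issue title, the sketch below contrasts a timestamp-derived key with a key derived from a hash of the URL. The helper name `url_key`, the choice of SHA-256, and the 16-character truncation are all assumptions made for this example; they are not taken from archive.py.

```python
import hashlib

def url_key(url: str, length: int = 16) -> str:
    """Hypothetical helper: a stable identifier derived only from the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]

# Two of the bookmarks from the report; they share one timestamp.
bookmarks = [
    {"href": "http://www.arduino.cc/", "time": "2011-09-28T18:35:09Z"},
    {"href": "http://mbed.org/", "time": "2011-09-28T18:35:09Z"},
]

# Keying by timestamp collapses both bookmarks onto one key, so a suffix
# has to be invented, and that suffix can depend on processing order.
ts_keys = {b["time"] for b in bookmarks}

# Keying by URL hash gives each bookmark its own key, and the key is the
# same on every run because it depends on nothing but the URL.
hash_keys = {url_key(b["href"]) for b in bookmarks}

print(len(ts_keys), len(hash_keys))
```

Because the key depends only on the URL, re-running the archiver with the same input would map each bookmark to the same directory name, regardless of the order in which entries are processed.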
Issue Analytics
- State:
- Created 6 years ago
- Comments:14 (14 by maintainers)
Top GitHub Comments
@aurelg fyi you might be interested in following this issue
Oh I’m already halfway through the migration process away from timestamps, I forgot to update this issue 😃
Most of these problems go away as we start to use django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single-source-of-truth with safe migrations.
In v0.4.0 I’ve already added hashes, and in a subsequent version they will become the primary unique key.
The archive will be served by django, with static folder exports becoming optional-only. This allows us to provide both timestamp and hash-based URLs via django, and static export format can be selected by specifying a flag like:
I might even add an option to do both with symlinks, as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system, we can decide how to proceed with export formatting.
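One way the "both with symlinks" option could look is sketched below: the real directory is named after the URL hash, and a timestamp-named symlink points at it. The function name `export_with_symlink`, the SHA-256 truncation, and the folder layout are assumptions for illustration only, not the project's actual export code.

```python
import hashlib
import os
import tempfile

def export_with_symlink(root: str, url: str, timestamp: str) -> str:
    """Hypothetical layout: <root>/<hash>/ plus a <root>/<timestamp> symlink to it."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    target = os.path.join(root, key)
    os.makedirs(target, exist_ok=True)  # safe to call again on a re-run

    link = os.path.join(root, timestamp)
    if not os.path.lexists(link):
        # Relative link, so the export folder can be moved as a whole.
        os.symlink(key, link)
    return target

root = tempfile.mkdtemp()
first = export_with_symlink(root, "http://www.arduino.cc/", "1317249309")
again = export_with_symlink(root, "http://www.arduino.cc/", "1317249309")
link_path = os.path.join(root, "1317249309")
```

Running the function twice with the same input leaves the layout unchanged. Note that in this sketch a second URL with the same timestamp would keep its own hash directory but would not get a timestamp link, since the first link is left in place; creating symlinks may also require extra privileges on Windows.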