Uniquely identify URLs by hash of url instead of archive timestamp

My Pinboard export contains several bookmarks with identical timestamps (presumably from imports from Delicious years ago).

The first time I run archive.py, I end up with several archive directories named like 1317249309, 1317249309.0, 1317249309.1, …. These directory names correspond properly with entries in index.json as expected.

If I run archive.py a second time with the same input, it appears to rewrite index.json, assigning different numerical suffixes to the 1317249309 timestamp. The entries in index.json no longer correspond with the contents of those archive directories on disk.
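The issue does not show the deduplication code in archive.py, so the following is only a minimal sketch of how order-dependent suffixing could produce this symptom. The function `assign_suffixes` and its numbering rule are hypothetical, chosen to mimic the directory names above (bare timestamp first, then `.0`, `.1`, and so on). If the links are processed in a different order on the second run, the same URL ends up with a different suffix:

```python
from collections import defaultdict


def assign_suffixes(links):
    """Hypothetical dedup rule: the first link seen for a timestamp keeps the
    bare timestamp; later links with the same timestamp get .0, .1, ..."""
    seen = defaultdict(int)
    result = {}
    for url, ts in links:
        count = seen[ts]
        result[url] = ts if count == 0 else f"{ts}.{count - 1}"
        seen[ts] += 1
    return result


links = [
    ("http://www.arduino.cc/", "1317249309"),
    ("http://mbed.org/", "1317249309"),
]

# Same links, different processing order: the URL -> directory mapping changes.
first_run = assign_suffixes(links)
second_run = assign_suffixes(list(reversed(links)))
```

Any identifier that depends on iteration order has this weakness, which is why the title proposes keying on a hash of the URL instead.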

You can reproduce this with the following JSON file (pinboard.json):

[{"href":"http:\/\/www.flickr.com\/groups\/photoshopsupport\/discuss\/72157600201629413\/","description":"Flickr: Discussing Index Of Topics: Compliments of LifeLive~ in Photoshop Support Group","extended":"","meta":"c9aa62c0eaa3c35a587903100870df43","hash":"8dd9951810c0eae6af67651341af5110","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography photoshop retouching"},
{"href":"http:\/\/allinthehead.com\/retro\/345\/whats-in-your-utility-belt","description":"What's In Your Utility Belt? \u2014 All in the head","extended":"","meta":"746e69822f36f2e78c16fc789a7545b5","hash":"ac4d0527bca6c7d6741fee117f45f631","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"php"},
{"href":"http:\/\/www.tyndellphotographic.com\/plasticwallet.html","description":"Plastic Wallet Boxes for Wallet sized photos","extended":"","meta":"c133eb53f29d97c35c3f31768ff7ce45","hash":"60bbf228c559518b818ed7d0ff997a69","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"photography supply"},
{"href":"http:\/\/www.arduino.cc\/","description":"Arduino - HomePage","extended":"","meta":"a80835b5f374965f5f8a5990da6cf2be","hash":"78532ff2155cd9feeac11aba18739bdc","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"arduino elecdiy"},
{"href":"http:\/\/mbed.org\/","description":"Rapid Prototyping for Microcontrollers | mbed","extended":"","meta":"644e8e0c9ae522eb1ca025c2af604f7d","hash":"fd2d014879e63a9aca6c18eb11e19b02","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy"},
{"href":"http:\/\/www.tasankokaiku.com\/jarse\/?p=268","description":"Jarse \u00bb Blog Archive \u00bb Kohtauskone","extended":"","meta":"8483f7b4d0423ddd0930142c55c909e3","hash":"e971d3670f0fe1b2638c343e458f88bd","time":"2011-09-28T18:35:09Z","shared":"yes","toread":"no","tags":"elecdiy arduino dmx512"}]

Run the following commands:

./archive.py ~/path/to/pinboard.json
# contents on disk match up with contents of index.json

./archive.py ~/path/to/pinboard.json
# timestamp suffixes in index.json have been changed and no longer match content on disk

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

2 reactions
pirate commented, Aug 30, 2018

@aurelg fyi you might be interested in following this issue

1 reaction
pirate commented, Apr 16, 2019

Oh I’m already halfway through the migration process away from timestamps, I forgot to update this issue 😃

Most of these problems go away as we start to use Django more heavily, as the export folder structure can be changed dramatically now that we have a SQL database as the single source of truth with safe migrations.

In v0.4.0 I’ve already added hashes, and in a subsequent version they will become the primary unique key.
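As a rough illustration of why a hash makes a stable key, here is a minimal sketch. The comment does not say which hash function or key length is used, so SHA-256 truncated to 16 hex characters and the helper name `url_key` are assumptions made for this example only:

```python
import hashlib


def url_key(url: str, length: int = 16) -> str:
    """Derive a key from the URL alone (assumed scheme: truncated SHA-256).
    The result does not depend on timestamps or on processing order."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]


urls = ["http://www.arduino.cc/", "http://mbed.org/"]

# Processing the URLs in either order yields the same URL -> key mapping.
keys_forward = {u: url_key(u) for u in urls}
keys_reversed = {u: url_key(u) for u in reversed(urls)}
```

Because the key is a pure function of the URL, re-running the import with the same input cannot reassign it, unlike a suffix handed out in processing order.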

The archive will be served by Django, with static folder exports becoming optional. This allows us to provide both timestamp and hash-based URLs via Django, and the static export format can be selected by specifying a flag like:

archivebox export --folders=timestamp
# or
archivebox export --folders=hash

I might even add an option to do both with symlinks as discussed above, but for now I think letting the user decide is the simplest solution. Once we hear feedback from users on the new >v0.4.0 system we can decide how to proceed with export formatting.
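The symlink idea could look roughly like the sketch below. The folder names `hash` and `timestamp`, the function `export_with_symlinks`, and the sample keys are all hypothetical; the comment does not describe an actual layout. Each snapshot lives in a hash-named directory, and a timestamp-named symlink points at it:

```python
import tempfile
from pathlib import Path


def export_with_symlinks(root: Path, entries):
    """entries: iterable of (hash_key, timestamp_name) pairs.
    Real data goes under root/hash/<key>; root/timestamp/<name> is a symlink."""
    by_hash = root / "hash"
    by_time = root / "timestamp"
    by_hash.mkdir(parents=True, exist_ok=True)
    by_time.mkdir(parents=True, exist_ok=True)
    for key, ts_name in entries:
        (by_hash / key).mkdir(exist_ok=True)
        link = by_time / ts_name
        if not link.is_symlink():
            # Relative target so the whole export folder can be moved intact.
            link.symlink_to(Path("..") / "hash" / key, target_is_directory=True)


root = Path(tempfile.mkdtemp())
export_with_symlinks(root, [("aaaa", "1317249309"), ("bbbb", "1317249309.0")])
```

With this arrangement the hash directory is the canonical location, so a timestamp collision only affects the name of a link and never the archived content.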
