Add a way to delete an entry from the index and archive
Occasionally I want to remove a URL from my archive. Currently this is a manual process of finding the entry in `index.json`, pulling out the timestamp, deleting the relevant lines, doing the same for `index.html`, and finally running `rm -r output/archive/$timestamp`.
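For reference, a minimal Python sketch of that manual process, assuming the flat `{"links": [...]}` layout of `index.json` with `url` and `timestamp` fields per entry (an assumption, not a documented contract); `index.html` would still need to be edited or regenerated separately:

```python
#!/usr/bin/env python3
"""Sketch: remove one URL from index.json and delete its snapshot directory."""
import json
import shutil
import sys
from pathlib import Path

OUTPUT_DIR = Path("output")               # archive root (assumption)
INDEX_JSON = OUTPUT_DIR / "index.json"

def remove_url(url: str) -> None:
    index = json.loads(INDEX_JSON.read_text())
    links = index.get("links", [])
    kept = [link for link in links if link.get("url") != url]
    removed = [link for link in links if link.get("url") == url]
    if not removed:
        sys.exit(f"{url} not found in {INDEX_JSON}")
    # Rewrite the JSON index without the matching entries
    index["links"] = kept
    INDEX_JSON.write_text(json.dumps(index, indent=4))
    # Then delete each snapshot dir: the rm -r output/archive/$timestamp step
    for link in removed:
        snapshot_dir = OUTPUT_DIR / "archive" / str(link["timestamp"])
        if snapshot_dir.is_dir():
            shutil.rmtree(snapshot_dir)

if __name__ == "__main__":
    remove_url(sys.argv[1])
```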
It would be nice if there was some slightly more automated way of doing this. Ideally I think this would be done with a final step after archiving, where the script would try to match each directory name in `output/` with a timestamp in `index.json`. If a match isn’t found, the user is prompted with something like:

    1536723384 not found in bookmark index. Delete output directory? (y/n)

This may be a behavior that is only enabled by an optional config option.
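A rough sketch of what that proposed cleanup pass could look like, under the same `index.json` assumptions as above; the `PROMPT_BEFORE_DELETE` constant is a hypothetical stand-in for the optional config option:

```python
#!/usr/bin/env python3
"""Sketch: prompt for output/archive/ directories with no matching index entry."""
import json
import shutil
from pathlib import Path

OUTPUT_DIR = Path("output")       # archive root (assumption)
PROMPT_BEFORE_DELETE = True       # stand-in for the optional config option

def cleanup_orphans() -> None:
    index = json.loads((OUTPUT_DIR / "index.json").read_text())
    known = {str(link["timestamp"]) for link in index.get("links", [])}
    for snapshot_dir in sorted((OUTPUT_DIR / "archive").iterdir()):
        # Skip files and any directory whose name matches an indexed timestamp
        if not snapshot_dir.is_dir() or snapshot_dir.name in known:
            continue
        if PROMPT_BEFORE_DELETE:
            answer = input(f"{snapshot_dir.name} not found in bookmark index. "
                           "Delete output directory? (y/n) ")
            if answer.strip().lower() != "y":
                continue
        shutil.rmtree(snapshot_dir)

if __name__ == "__main__":
    cleanup_orphans()
```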
Issue Analytics

- Created: 5 years ago
- Reactions: 6
- Comments: 10 (10 by maintainers)
Top GitHub Comments
I’ve recently imported my complete Pinboard archive; there were a lot of bookmarks with dead links in it. (The script crashed a few times with “Too many open files” errors, so I had to rerun it a couple of times.)

My idea is to run this script once a day with a fresh dump from my Pinboard export (I’ve written a little Go program which dumps the whole list from Pinboard). But with those 1063 links with errors, it will take hours (even with small timeouts), and it is totally useless to retry those links.

Because those 1063 dead links will always be in the exported list, the archiver will always retry downloading them. It would be nice if there were a flag or environment variable to skip links which previously failed to download. A “cleanup” flag would be even better, but skipping those links would be sufficient for my use case.
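In the absence of such a flag, one way to approximate the skip behavior is to pre-filter the fresh export before handing it to the archiver. A minimal sketch, assuming a Pinboard-style JSON export (a list of objects with an `href` key) and a hand-maintained `dead_urls.txt` of known-dead URLs, one per line; all file names here are hypothetical:

```python
#!/usr/bin/env python3
"""Sketch: drop known-dead URLs from a Pinboard export before archiving."""
import json
from pathlib import Path

def filter_export(export_path: str, dead_list_path: str, out_path: str) -> None:
    dead = set(Path(dead_list_path).read_text().split())   # one URL per line
    bookmarks = json.loads(Path(export_path).read_text())
    # Keep only bookmarks whose URL is not on the dead list
    fresh = [b for b in bookmarks if b.get("href") not in dead]
    Path(out_path).write_text(json.dumps(fresh, indent=2))
    print(f"kept {len(fresh)} of {len(bookmarks)} bookmarks")

if __name__ == "__main__":
    filter_export("pinboard_export.json", "dead_urls.txt", "filtered_export.json")
```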
The new `django` version has both the ability to remove snapshots from the archive, and a separate `archivebox update` command independent from `archivebox add`, so that you can control when to retry previously failed links.

Adding a `MAX_URL_ATTEMPTS` option will be tracked in this separate issue: https://github.com/pirate/ArchiveBox/issues/109