Long URLs break when attempting to read/write them as filesystem paths
Describe the bug
In multiple places in the docs (Quickstart, for example) it mentions the use of `./archive some-file.txt` as a means of ingesting a file with a list of URLs in it. There are also examples of using `./bin/archivebox-export-browser-history --firefox`, which generates a JSON file that users should be able to feed into `./archive` as well.
With the latest release it seems that `archivebox add` would have taken over this behavior, but either it isn't supposed to and this functionality was removed, or there's a bug somewhere. The existence of the parsers in the `archivebox add` code path leads me to believe this is a bug and that `archivebox add` should handle these cases.
When running `archivebox init` there is also a message at the end that states:

To add new links, you can run: `archivebox add ~/some/path/or/url/to/list_of_links.txt`
Steps to reproduce
- Create a file `test_urls.txt` which only has the contents `https://example.org`
- Run `archivebox add test_urls.txt`
- Get back an error instead of archiving `https://example.org`
The same issue happens when passing the output JSON file from running `./bin/export-browser-history.sh --firefox`.
Screenshots or log output
```
# archivebox add test_urls.txt
[i] [2020-08-09 19:50:30] ArchiveBox v0.4.11: archivebox add test_urls.txt
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-09 23:50:39] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597017039-import.txt
0.0% (0/240sec)[X] Error while loading link! [1597017039.360603] test_urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index
[*] [2020-08-09 23:50:40] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html
```
It also happens when trying the input redirection route, except there is no `[X] Error while loading link!` line:
```
# archivebox add < firefox_history_urls.txt
[i] [2020-08-09 23:00:29] ArchiveBox v0.4.11: archivebox add < /dev/stdin
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-10 03:00:30] Adding 43291 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597028430-import.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index
[*] [2020-08-10 03:00:34] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html
```
Software versions
- OS: macOS 10.15.6
- ArchiveBox version: 87ba82a
- Python version: 3.7.8
Top GitHub Comments
That background definitely helps me understand this better and helps me see that it’s not just this one isolated problem.
That sounds reasonable. One long URL spoiling the whole batch that’s being parsed is the main issue at hand here, so I think as long as that’s resolved then it’s case closed on this one.
Thanks for working through this with me, I appreciate all the help. It’s not easy maintaining OSS, but my interactions with this project have been nothing but pleasant 😄
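To make the "one long URL spoils the whole batch" failure concrete, here is a minimal sketch of a line-oriented import loop with per-line fault isolation, so a single un-stat-able line falls through to URL handling instead of aborting the run. The `parse_urls` helper is illustrative only, not ArchiveBox's actual parser:

```python
from pathlib import Path

def parse_urls(lines):
    """Yield URLs from an import file, isolating per-line failures so
    one bad line cannot abort the whole batch (illustrative sketch)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            # The kind of check that blows up on over-long lines:
            # deciding whether the line names a local file to ingest.
            if Path(line).exists():
                continue  # a real parser would recurse into the file here
        except OSError:
            pass  # line too long to be a valid path: treat it as a candidate URL
        if line.startswith(("http://", "https://")):
            yield line
```

With a guard like this, a single over-long URL simply lands in the URL branch instead of killing the parse of every other line in the file.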
I would, but I'd like to avoid doing so out of privacy concerns if possible. However, I was able to find the URL which is causing the issue. The problem seems to be that it's just really long. In this instance the URL is 1092 characters, which causes `OSError: [Errno 63] File name too long` to be thrown when executing `if Path(line).exists()` in the generic_txt parser. The error is eventually swallowed by `_parse`'s exception handling, so it's not seen elsewhere. The good news is that this is reproducible with any sufficiently long and valid URL in a txt file.
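For anyone wanting to see the underlying error outside of ArchiveBox, here is a minimal repro sketch, assuming the Python 3.7 from this report (its pathlib only swallows ENOENT/ENOTDIR/EBADF/ELOOP from `stat()`, so ENAMETOOLONG propagates out of `exists()`; newer Python versions may suppress it):

```python
from pathlib import Path

# Build a syntactically valid URL far longer than PATH_MAX (1024 bytes
# on macOS, 4096 on most Linux systems), mimicking the 1092-character
# URL from this report.
long_url = "https://example.org/?" + "a" * 5000

try:
    Path(long_url).exists()
except OSError as err:
    print(err)  # [Errno 63] File name too long on macOS (Errno 36 on Linux)
```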