Long URLs break when attempting to read/write them as filesystem paths

Describe the bug

In multiple places in the docs (the Quickstart, for example), ./archive some-file.txt is mentioned as a way to ingest a file containing a list of URLs. There are also examples of using ./bin/archivebox-export-browser-history --firefox, which generates a JSON file that users should be able to feed into ./archive as well.

With the latest release, it seems that archivebox add was meant to take over this behavior. Either it isn’t supposed to and the functionality was removed, or there’s a bug somewhere. The existence of the parsers in the archivebox add code path leads me to believe this is a bug and that archivebox add should handle these cases.

When running archivebox init there is also a message at the end that states:

To add new links, you can run: archivebox add ~/some/path/or/url/to/list_of_links.txt

Steps to reproduce

1. Create a file test_urls.txt whose only contents are https://example.org
2. Run archivebox add test_urls.txt
3. Get back an error instead of archiving https://example.org

The same issue happens when passing the output JSON file from running ./bin/export-browser-history.sh --firefox or

Screenshots or log output

# archivebox add test_urls.txt 
[i] [2020-08-09 19:50:30] ArchiveBox v0.4.11: archivebox add test_urls.txt
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-09 23:50:39] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597017039-import.txt
                                                                           0.0% (0/240sec)[X] Error while loading link! [1597017039.360603] test_urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                    
    > Found 0 new URLs not already in index                                                                         

[*] [2020-08-09 23:50:40] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3                                                        
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json                                                           
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html

It also happens when trying the input redirection route, except there is no [X] Error while loading link!:

archivebox add < firefox_history_urls.txt 
[i] [2020-08-09 23:00:29] ArchiveBox v0.4.11: archivebox add < /dev/stdin
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-10 03:00:30] Adding 43291 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597028430-import.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2020-08-10 03:00:34] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html

Software versions

OS: macOS 10.15.6
ArchiveBox version: 87ba82a
Python version: 3.7.8

Issue Analytics

Created 3 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

mpeteuil commented, Aug 11, 2020

That background definitely helps me understand this better and helps me see that it’s not just this one isolated problem.

> What we can do for now is catch the exception and skip all attempts to read URL fragment paths, and save long URLs to the index normally up until some ridiculous limit like 65,000 characters. This will still result in broken wget clones, but all the other outputs and the index should work with the long URLs.

That sounds reasonable. One long URL spoiling the whole batch that’s being parsed is the main issue at hand here, so I think as long as that’s resolved then it’s case closed on this one.
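
For illustration, the approach quoted above might look roughly like the following. This is a minimal sketch and not ArchiveBox's actual parser: the function name collect_urls, the constant MAX_URL_LENGTH, and the choice to set local paths aside are all assumptions made for the example. The point is only that the path check is guarded per line, so one overlong URL no longer discards the rest of the batch.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Hypothetical cap, taken from the "65,000 characters" figure quoted above.
MAX_URL_LENGTH = 65_000


def collect_urls(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (urls, skipped) without letting one bad line abort the batch."""
    urls: List[str] = []
    skipped: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if len(line) > MAX_URL_LENGTH:
            skipped.append(line)
            continue
        try:
            # May raise OSError ("File name too long") when the line is a long URL.
            is_local_path = Path(line).exists()
        except (OSError, ValueError):
            is_local_path = False
        if is_local_path:
            # A real importer would handle local files here; this sketch sets them aside.
            skipped.append(line)
        elif line.startswith(("http://", "https://")):
            urls.append(line)
        else:
            # No scheme (for example "example.com"), so it is not treated as a URL.
            skipped.append(line)
    return urls, skipped
```

Called with a list of lines that mixes short URLs, a URL of a few thousand characters, and a scheme-less entry, the function keeps both URLs and reports the scheme-less entry as skipped instead of failing the whole import.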

Thanks for working through this with me, I appreciate all the help. It’s not easy maintaining OSS, but my interactions with this project have been nothing but pleasant 😄

mpeteuil commented, Aug 11, 2020

> Can you post a snippet of the file you’re trying to import? (note the URLs must have a scheme, https://example.com √, example.com X)

I would, but I’d like to avoid doing so out of privacy concerns if possible. However, I was able to find the URL that is causing the issue. The problem seems to be that it’s just really long. In this instance the URL is 1092 characters, which causes OSError: [Errno 63] File name too long to be thrown when executing if Path(line).exists() in the generic_txt parser. The error is eventually swallowed by _parse’s exception handling, so it’s not seen elsewhere.

The good news is that this is reproducible with any sufficiently long and valid URL in a txt file.
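
A standalone way to check this outside ArchiveBox could look like the sketch below. It builds a synthetic 1092-character URL (the example.org address and the helper names build_long_url and probe_path_check are made up for the example) and reports whether Path(...).exists() raises. Whether it raises, and with which errno number, is assumed to depend on the operating system and Python version, so the script prints what it observes and does not assume a result.

```python
import errno
from pathlib import Path
from typing import Optional


def build_long_url(length: int = 1092) -> str:
    """Build a synthetic URL of exactly `length` characters."""
    base = "https://example.org/?q="
    return base + "a" * (length - len(base))


def probe_path_check(text: str) -> Optional[OSError]:
    """Run the same kind of check as `Path(line).exists()` and return any OSError."""
    try:
        Path(text).exists()
    except OSError as exc:
        return exc
    return None


if __name__ == "__main__":
    url = build_long_url()
    error = probe_path_check(url)
    if error is None:
        print(f"{len(url)}-character URL: no exception on this platform/Python version")
    else:
        # Look up the symbolic name because the numeric errno differs between platforms.
        name = errno.errorcode.get(error.errno, "unknown")
        print(f"{len(url)}-character URL: OSError {name} ({error.strerror})")
```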