Long URLs break when attempting to read/write them as filesystem paths

Describe the bug

In multiple places in the docs (the Quickstart, for example), ./archive some-file.txt is mentioned as a way to ingest a file containing a list of URLs. There are also examples of using ./bin/archivebox-export-browser-history --firefox, which generates a JSON file that users should be able to feed into ./archive as well.

With the latest release it seems that archivebox add should have taken over this behavior. Either it isn’t supposed to and the functionality was removed, or there’s a bug somewhere. The existence of the parsers in the archivebox add code path leads me to believe this is a bug and that archivebox add should handle these cases.

When running archivebox init there is also a message at the end that states:

To add new links, you can run: archivebox add ~/some/path/or/url/to/list_of_links.txt

Steps to reproduce

  1. Create a file test_urls.txt which only has the contents https://example.org
  2. Run archivebox add test_urls.txt
  3. Get back an error instead of archiving https://example.org

The same issue happens when passing the output JSON file from running ./bin/export-browser-history.sh --firefox or

Screenshots or log output

# archivebox add test_urls.txt 
[i] [2020-08-09 19:50:30] ArchiveBox v0.4.11: archivebox add test_urls.txt
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-09 23:50:39] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597017039-import.txt
                                                                           0.0% (0/240sec)[X] Error while loading link! [1597017039.360603] test_urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                    
    > Found 0 new URLs not already in index                                                                         

[*] [2020-08-09 23:50:40] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3                                                        
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json                                                           
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html 

It also happens when trying the input redirection route, except there is no [X] Error while loading link!:

archivebox add < firefox_history_urls.txt 
[i] [2020-08-09 23:00:29] ArchiveBox v0.4.11: archivebox add < /dev/stdin
    > /Users/mpeteuil/projects/ArchiveBox/data

[+] [2020-08-10 03:00:30] Adding 43291 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597028430-import.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2020-08-10 03:00:34] Writing 0 links to main index...
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.sqlite3
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.json
    √ /Users/mpeteuil/projects/ArchiveBox/data/index.html

Software versions

  • OS: macOS 10.15.6
  • ArchiveBox version: 87ba82a
  • Python version: 3.7.8

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
mpeteuil commented, Aug 11, 2020

That background definitely helps me understand this better and helps me see that it’s not just this one isolated problem.

> What we can do for now is catch the exception and skip all attempts to read URL fragment paths, and save long URLs to the index normally up until some ridiculous limit like 65,000 characters. This will still result in broken wget clones, but all the other outputs and the index should work with the long URLs.

That sounds reasonable. One long URL spoiling the whole batch that’s being parsed is the main issue at hand here, so I think as long as that’s resolved then it’s case closed on this one.

Thanks for working through this with me, I appreciate all the help. It’s not easy maintaining OSS, but my interactions with this project have been nothing but pleasant 😄
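
The workaround described above could look roughly like the sketch below. It is an illustration only, not ArchiveBox's actual parser code: the names is_existing_path, parse_urls, URL_REGEX, and MAX_URL_LENGTH are hypothetical, the regex is a deliberately simple stand-in, and keeping an existing local path as an entry is an assumption. The point is that the filesystem probe is wrapped, so an over-long line counts as "not a path" instead of aborting the whole batch.

```python
import re
from pathlib import Path

# Assumption: cap taken from the "ridiculous limit like 65,000 characters" suggested above.
MAX_URL_LENGTH = 65_000

# Assumption: simplified URL pattern for illustration, not ArchiveBox's real regex.
URL_REGEX = re.compile(r"https?://[^\s<>\"']+")


def is_existing_path(text):
    """Return True only if `text` names an existing filesystem path.

    Any OSError (for example "File name too long") or ValueError raised
    by the probe is treated as "not a path" rather than propagated.
    """
    try:
        return Path(text).exists()
    except (OSError, ValueError):
        return False


def parse_urls(lines):
    """Collect entries from an iterable of text lines.

    Hypothetical behavior: a line that is an existing local path is kept
    as-is; otherwise every URL-looking substring up to MAX_URL_LENGTH
    characters is kept. One bad line never stops the rest of the batch.
    """
    entries = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if is_existing_path(line):
            entries.append(line)
            continue
        for url in URL_REGEX.findall(line):
            if len(url) <= MAX_URL_LENGTH:
                entries.append(url)
    return entries
```

With this shape, a 1092-character URL in the input is collected like any other, and only URLs above the cap are dropped.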

1 reaction
mpeteuil commented, Aug 11, 2020

> Can you post a snippet of the file you’re trying to import? (note the URLs must have a scheme, https://example.com √, example.com X)

I would, but I’d like to avoid doing so out of privacy concerns if possible. However, I was able to find the URL which is causing the issue. The problem seems to be that it’s just really long. In this instance the URL is 1092 characters long, which causes OSError: [Errno 63] File name too long to be thrown when if Path(line).exists() executes in the generic_txt parser. The error is eventually swallowed by _parse’s exception handling, so it’s not seen elsewhere.

The good news is that this is reproducible with any sufficiently long and valid URL in a txt file.
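
A quick way to build such an input file is sketched below. make_long_url and write_repro_file are hypothetical helper names, and the https://example.org/?q= prefix and the long_url_test.txt filename are arbitrary choices. 1092 is simply the length reported above. Whether a given length triggers the error likely depends on the platform's path-length limits, so treat the default as a starting point.

```python
from pathlib import Path


def make_long_url(length=1092, base="https://example.org/?q="):
    """Build a syntactically valid URL that is exactly `length` characters long."""
    if length < len(base):
        raise ValueError("length must be at least len(base)")
    # Pad the query value with a harmless filler character.
    return base + "a" * (length - len(base))


def write_repro_file(path="long_url_test.txt", length=1092):
    """Write a two-line URL list: one short URL, then one very long URL.

    Returns the long URL so callers can check what was written.
    """
    long_url = make_long_url(length)
    Path(path).write_text("https://example.org\n" + long_url + "\n", encoding="utf-8")
    return long_url
```

After calling write_repro_file(), running archivebox add long_url_test.txt against the affected version should show whether both URLs are parsed or the whole batch fails.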
