Bug: Fails to parse list of URLs txt file

Describe the bug

I can’t seem to get archivebox to add any URLs from a simple txt file containing a newline-separated list of URLs. Based on the error message, it fails to parse the file. I may be doing something wrong.

Steps to reproduce

  1. Create a txt file with some URLs, e.g. /tmp/urls.txt containing:
https://www.example.com/
https://example.com/
  2. Run archivebox add /tmp/urls.txt
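
A minimal way to create the file from step 1, assuming a POSIX shell and the /tmp/urls.txt path used in step 2 (the two URLs are the ones listed above):

```shell
# Write the two example URLs, one per line, to the path passed to archivebox in step 2.
printf '%s\n' 'https://www.example.com/' 'https://example.com/' > /tmp/urls.txt

# Print the line count as a quick check that the file is newline-separated.
wc -l < /tmp/urls.txt
```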

Screenshots or log output

Here’s the output I get:

ross@xx> archivebox add /tmp/urls.txt
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
    > /tmp/archivebox

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            

[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650470713-import.txt
 0.0% (0/240sec)[X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                                                                                                                                           
    > Found 0 new URLs not already in index

[*] [2022-04-20 16:05:13] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/ross/.local/bin/archivebox                                            
 √  PYTHON_BINARY         v3.10.4         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py   
 √  CURL_BINARY           v7.82.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 X  SINGLEFILE_BINARY     ?               invalid   single-file                                                                 
 X  READABILITY_BINARY    ?               invalid   readability-extractor                                                       
 X  MERCURY_BINARY        ?               invalid   mercury-parser                                                              
 √  GIT_BINARY            v2.35.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /home/ross/.local/bin/youtube-dl                                            
 √  CHROME_BINARY         v100.0.4896.88  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ross/.local/lib/python3.10/site-packages/archivebox                   
 √  TEMPLATES_DIR         3 files         valid     /home/ross/.local/lib/python3.10/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /tmp/archivebox                                                             
 √  SOURCES_DIR           3 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           0 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3                                                             

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
  

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

pirate commented, Apr 21, 2022

Ah, sorry, I forgot that I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!

I’ll reopen and merge your original PR https://github.com/ArchiveBox/ArchiveBox/pull/967. For future reference, stdin redirection is indeed necessary; alternatively, passing --depth=1 /path/to/file.txt also works.
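
The stdin workaround described above can be sketched as a small shell helper. This is only an illustration, not an official ArchiveBox snippet: `add_url_list` is a hypothetical name, and the sketch assumes it is called from inside an initialized ArchiveBox data directory (such as /tmp/archivebox in the report).

```shell
# add_url_list FILE
# Hypothetical helper: feeds a newline-separated URL list to archivebox on
# stdin instead of passing the file path as an argument.
add_url_list() {
    list_file="$1"

    # Fail early if the list is missing or unreadable.
    if [ ! -r "$list_file" ]; then
        echo "cannot read URL list: $list_file" >&2
        return 2
    fi

    # Fail early if archivebox is not installed or not on PATH.
    if ! command -v archivebox >/dev/null 2>&1; then
        echo "archivebox not found on PATH" >&2
        return 127
    fi

    # Workaround 1 from the comment above: redirect the file into stdin.
    archivebox add < "$list_file"

    # Workaround 2 from the same comment, left commented out as an alternative:
    # archivebox add --depth=1 "$list_file"
}
```

With the file from the report, this would be called as `cd /tmp/archivebox && add_url_list /tmp/urls.txt`.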

rossvor commented, Apr 20, 2022

I’ve also tried this on a fresh Docker image-based installation, and it fails similarly:

sudo docker run -v $PWD:/data -v /tmp/ff:/ff -it archivebox/archivebox add /ff/urls.txt
[i] [2022-04-20 21:32:03] ArchiveBox v0.6.2: archivebox add /ff/urls.txt
    > /data

[+] [2022-04-20 21:32:03] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650490323-import.txt
 0.0% (0/240sec)[X] Error while loading link! [1650490324.056402] /ff/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                               
    > Found 0 new URLs not already in index

[*] [2022-04-20 21:32:04] Writing 0 links to main index...
    √ ./index.sqlite3      

/tmp/ff/urls.txt being the same simple file:

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/