Bug: Fails to parse list of URLs txt file

Describe the bug

I can’t get archivebox to add any URLs from a simple txt file containing a newline-separated list of URLs. Based on the error message, it fails to parse the file. I may be doing something wrong.

Steps to reproduce

Create a txt file (here /tmp/urls.txt) with some URLs, e.g.:

https://www.example.com/
https://example.com/

Run archivebox add /tmp/urls.txt

Screenshots or log output

Here’s the output I get:

ross@xx> archivebox add /tmp/urls.txt
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
    > /tmp/archivebox

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            

[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650470713-import.txt
 0.0% (0/240sec)[X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 16:05:13] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/ross/.local/bin/archivebox                                            
 √  PYTHON_BINARY         v3.10.4         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py   
 √  CURL_BINARY           v7.82.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 X  SINGLEFILE_BINARY     ?               invalid   single-file                                                                 
 X  READABILITY_BINARY    ?               invalid   readability-extractor                                                       
 X  MERCURY_BINARY        ?               invalid   mercury-parser                                                              
 √  GIT_BINARY            v2.35.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /home/ross/.local/bin/youtube-dl                                            
 √  CHROME_BINARY         v100.0.4896.88  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ross/.local/lib/python3.10/site-packages/archivebox                   
 √  TEMPLATES_DIR         3 files         valid     /home/ross/.local/lib/python3.10/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /tmp/archivebox                                                             
 √  SOURCES_DIR           3 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           0 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3                                                             

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False

Top GitHub Comments

pirate commented, Apr 21, 2022

Ah, sorry, I forgot that I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!

I’ll reopen and merge your original PR, https://github.com/ArchiveBox/ArchiveBox/pull/967. For future reference, stdin redirection is indeed necessary; alternatively, passing --depth=1 /path/to/file.txt also works.
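
To picture why the bare path yields "Parsed 0 URLs" while redirecting the file into stdin works, here is a minimal Python sketch. It assumes that at crawl depth 0 the argument is parsed as literal text, which is what the "Adding 1 links" and "Error while loading link! ... /tmp/urls.txt" log lines suggest. The regex and the names `extract_urls`, `urls_from_argument`, and `urls_from_file` are hypothetical illustrations, not ArchiveBox's actual parser.

```python
import os
import re
import tempfile
from pathlib import Path

# Hypothetical, simplified URL matcher (an assumption for illustration only).
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text: str) -> list[str]:
    """Return every http(s) URL found in a block of text."""
    return URL_PATTERN.findall(text)


def urls_from_argument(arg: str) -> list[str]:
    """Parse the argument itself as input text (assumed depth=0 behaviour)."""
    return extract_urls(arg)


def urls_from_file(path: str) -> list[str]:
    """Read the file and parse its contents, as stdin redirection would supply them."""
    return extract_urls(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Write a small newline-separated list, like the one in the report.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("https://www.example.com/\nhttps://example.com/\n")
        list_path = handle.name

    print(urls_from_argument(list_path))  # the path string contains no URL
    print(urls_from_file(list_path))      # the file contents contain both URLs
    os.unlink(list_path)
```

Under that assumption, the path string never matches a URL pattern, whereas the file contents do, which is consistent with the two workarounds mentioned above.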

rossvor commented, Apr 20, 2022

I’ve also tried this on a fresh Docker-image-based installation, and it fails similarly:

sudo docker run -v $PWD:/data -v /tmp/ff:/ff -it archivebox/archivebox add /ff/urls.txt
[i] [2022-04-20 21:32:03] ArchiveBox v0.6.2: archivebox add /ff/urls.txt
    > /data

[+] [2022-04-20 21:32:03] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650490323-import.txt
 0.0% (0/240sec)[X] Error while loading link! [1650490324.056402] /ff/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                               
    > Found 0 new URLs not already in index

[*] [2022-04-20 21:32:04] Writing 0 links to main index...
    √ ./index.sqlite3

Here /tmp/ff/urls.txt is a similarly simple file:

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/
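
To rule out a problem with the list itself, a quick check like the following can confirm that every non-blank line is an absolute http(s) URL. This is only a sketch: `invalid_lines` is a hypothetical helper, and its notion of "valid" (an http or https scheme plus a host) is an assumption, not ArchiveBox's own validation.

```python
from urllib.parse import urlsplit


def invalid_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for each non-blank line that is not an absolute http(s) URL."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are skipped, not reported
        parts = urlsplit(stripped)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append((number, stripped))
    return problems


# The list from the report, inlined so the example is self-contained.
SAMPLE = """\
https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/
"""

print(invalid_lines(SAMPLE))  # prints []
```

An empty result means the file's contents are fine, which points at how the file is passed to archivebox add rather than at the file itself.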