Extend WARC file with all requests made via all archive methods
Right now the FETCH_WARC option only creates a simple WARC of the HTML file with wget; it doesn't save the requests made dynamically by headless Chrome after JS executes.
We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.
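As a rough sketch of what routing headless Chrome through a recording proxy could look like, the helper below only builds the command line. The binary name `chromium-browser`, the proxy address `http://localhost:8000`, and the function name `chrome_proxy_command` are assumptions for illustration, not ArchiveBox's actual code; adjust them to wherever warcprox is listening.

```python
def chrome_proxy_command(url, proxy="http://localhost:8000", binary="chromium-browser"):
    """Build a headless Chrome command that sends all traffic through a proxy.

    `proxy` and `binary` are assumed values; change them to match your setup.
    """
    return [
        binary,
        "--headless",
        # Send every request through the recording proxy.
        f"--proxy-server={proxy}",
        # The proxy re-signs HTTPS traffic with its own CA, so Chrome would
        # otherwise reject the certificates.
        "--ignore-certificate-errors",
        "--dump-dom",
        url,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(chrome_proxy_command("https://example.com")))
```

The resulting list can be passed to `subprocess.run` once the proxy is up. Ignoring certificate errors is only acceptable here because the traffic is being intercepted deliberately.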
In the ideal scenario, the WARC should include:
- √ base html for the page
- √ all assets like images, styles, fonts, js
- all dynamically requested assets after JS executes in chrome (e.g. images, ajax requests, etc)
- any media files requested
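One crude way to check a WARC against the checklist above is to list the URLs it recorded. The sketch below scans for `WARC-Target-URI` header lines using only the standard library; the function name `list_target_uris` is hypothetical, and a line-based scan can in principle be fooled by payload bytes that look like a header, so a real WARC parser would be more robust.

```python
import gzip


def list_target_uris(warc_path):
    """Return the WARC-Target-URI values found in a .warc or .warc.gz file."""
    opener = gzip.open if str(warc_path).endswith(".gz") else open
    uris = []
    with opener(warc_path, "rb") as f:
        for line in f:
            # Header names are case-insensitive, so compare in lowercase.
            if line.lower().startswith(b"warc-target-uri:"):
                value = line.split(b":", 1)[1].strip()
                # Some writers wrap the URI in angle brackets.
                value = value.strip(b"<>")
                uris.append(value.decode("utf-8", "replace"))
    return uris
```

If dynamically loaded images or ajax endpoints are missing from the returned list, the WARC only covers what wget fetched.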
I think we can record the wget WARC first, then use warcat to merge it with a warcprox-created WARC containing all the headless Chrome requests.
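As a simpler alternative to warcat (not what the issue proposes), two WARC files can often be merged by plain byte concatenation, because a WARC file is just a sequence of records and a `.warc.gz` written with one gzip member per record stays valid when members are appended. The sketch below assumes all inputs use the same compression (all `.warc` or all per-record `.warc.gz`); the function name `merge_warcs` and the file names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def merge_warcs(sources, destination):
    """Concatenate WARC files byte-for-byte into `destination`.

    Assumes every source uses the same compression scheme; mixing plain
    and gzipped inputs would produce an unreadable file.
    """
    destination = Path(destination)
    with destination.open("wb") as out:
        for src in sources:
            with Path(src).open("rb") as f:
                shutil.copyfileobj(f, out)
    return destination
```

For example, `merge_warcs(["wget.warc.gz", "chrome.warc.gz"], "combined.warc.gz")` would write the wget records followed by the proxy-recorded ones. This does no deduplication, so a URL fetched by both tools appears twice.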
Issue Analytics
- State:
- Created 5 years ago
- Comments: 8 (5 by maintainers)
Top GitHub Comments
There is a way to do this already right now:
archivebox config CHROME_ARGS
Correct me if I am wrong, but I don’t think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate https://github.com/ArchiveBox/ArchiveBox/blob/49faec8f6dfc15075203ad332abfea0940f4e7b7/archivebox/util.py#L219-L263