Extend WARC file with all requests made via all archive methods

Right now the FETCH_WARC option only creates a simple WARC of the page's HTML file with wget; it doesn't save the requests made dynamically by headless Chrome after JS executes.

We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.

In the ideal scenario, the WARC should include:

  • √ base html for the page
  • √ all assets like images, styles, fonts, js
  • all dynamically requested assets after JS executes in chrome (e.g. images, ajax requests, etc)
  • any media files requested
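
One way to check which of these items a given WARC already covers is to list the target URI of every record in it. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes well-formed records with a Content-Length header and no folded (multi-line) header values. It is not part of ArchiveBox.

```python
import gzip


def iter_warc_headers(stream):
    """Yield a dict of WARC header fields for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body; its size is given by Content-Length.
        stream.read(int(headers["Content-Length"]))
        yield headers


def list_target_uris(path):
    """Return the WARC-Target-URI of every record that has one."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return [
            h["WARC-Target-URI"]
            for h in iter_warc_headers(stream)
            if "WARC-Target-URI" in h
        ]
```

Running list_target_uris on the wget-created WARC and comparing the result with the requests seen in the browser's network panel would show which dynamically loaded assets are still missing.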

I think we can record the wget WARC first, then use warcat to merge it with a warcprox-created WARC containing all the headless Chrome requests.
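
As a rough illustration of the merge step: a WARC file is a plain sequence of records, and a .warc.gz written with one gzip member per record stays valid when files are appended to each other. So a simple alternative to warcat is byte-level concatenation. The sketch below assumes all inputs are the same kind (all uncompressed, or all record-compressed .warc.gz); the function name and file names are hypothetical, and it does not deduplicate records or rewrite warcinfo records.

```python
import shutil


def concat_warcs(sources, destination):
    """Append each source WARC to the destination file, in order."""
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as inp:
                # Stream in chunks so large archives never sit fully in memory.
                shutil.copyfileobj(inp, out)
    return destination
```

For example, concat_warcs(["wget.warc.gz", "chrome.warc.gz"], "merged.warc.gz") would produce a single file holding the records from both captures.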

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

pirate commented, May 18, 2022 (1 reaction)

There is a way to do this already right now:

  1. Uncomment the example pywb proxy server in the docker-compose file
  2. Route Chrome, or any other dependency you want recorded, through that proxy using its CLI flag (set via archivebox config CHROME_ARGS)
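
To make step 2 concrete, here is a hedged sketch of the kind of command line that sends headless Chrome through a recording proxy. The --proxy-server, --headless, --dump-dom and --ignore-certificate-errors switches are standard Chrome flags; the binary name "chromium", the proxy address http://localhost:8080, and the helper function itself are assumptions for illustration, not ArchiveBox's actual configuration mechanism.

```python
def chrome_proxy_command(url, proxy_url="http://localhost:8080", binary="chromium"):
    """Build an argv list that loads `url` in headless Chrome via a proxy."""
    return [
        binary,
        "--headless",
        "--proxy-server=" + proxy_url,
        # A recording proxy re-signs HTTPS traffic, so Chrome must accept its certificate.
        "--ignore-certificate-errors",
        "--dump-dom",
        url,
    ]
```

The resulting list can be passed to subprocess.run; every request Chrome makes while rendering the page then passes through the proxy, which is what lets it be written to a WARC.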

goelayu commented, May 19, 2022 (0 reactions)

Correct me if I am wrong, but I don’t think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate https://github.com/ArchiveBox/ArchiveBox/blob/49faec8f6dfc15075203ad332abfea0940f4e7b7/archivebox/util.py#L219-L263
