Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!)
Describe the bug
Hi there! There’s an XSS vulnerability when you open your index.html if you saved a page with a title containing an XSS vector.
Steps to reproduce
- Save this page, for example: [Twitter of @garethheyes](https://twitter.com/garethheyes/status/1126526480614416395)
- Open your index.html
- Get XSS’d by sir @garethheyes
Source code:
<a href="archive/1557816881/twitter.com/garethheyes/status/1126526480614416395.html" title="\u2028\u2029 op Twitter: "Another way to use throw without a semi-colon:
<script>{onerror=alert}throw 1</script>"">
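The injection works because the page title lands in the `title="..."` attribute unescaped, so the embedded `"` closes the attribute and the `<script>` tag is parsed as markup. A minimal sketch of escaping the title before it is written into the index, assuming a hypothetical `render_index_link` helper (not ArchiveBox's actual template code):

```python
from html import escape


def render_index_link(href: str, title: str) -> str:
    """Hypothetical helper: build one index link with attribute values escaped."""
    # quote=True also escapes " and ', so the title cannot close the attribute early.
    safe_href = escape(href, quote=True)
    safe_title = escape(title, quote=True)
    return '<a href="{}" title="{}">{}</a>'.format(safe_href, safe_title, safe_title)


if __name__ == "__main__":
    malicious = 'op Twitter: "Another way: <script>{onerror=alert}throw 1</script>"'
    print(render_index_link("archive/1557816881/index.html", malicious))
```

Escaping only addresses the index page itself; it does nothing about JS inside the archived snapshots, which is what the comments further down discuss.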
Software versions
- OS: ArchLinux
- ArchiveBox version: 903.59da482-1
- Python version: python3.7
- Chrome version: Chromium 74.0.3729.131 Arch Linux
Issue Analytics
- State:
- Created 4 years ago
- Reactions:2
- Comments:8 (5 by maintainers)
Top GitHub Comments
I talked about the ArchiveBox scenario with a couple of experts, and we came up with a better option than `<iframe sandbox>`: `Content-Security-Policy: sandbox`, which instructs the browser to treat the load as its own unique origin. This is much more robust and convenient than detecting iframe loads.
We also went through the list of security headers to pick the ones that would protect ArchiveBox pages from Spectre, too. They should involve no maintenance.
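As a rough sketch of the `Content-Security-Policy: sandbox` idea, the server could attach the header only to responses for archived content. The `/archive/` prefix and the `headers_for` function are assumptions made for illustration. Note that a bare `sandbox` directive also disables script execution; the `allow-scripts` token re-enables scripts while still keeping the page in a unique origin.

```python
from typing import Dict

ARCHIVE_PREFIX = "/archive/"  # assumed URL prefix under which snapshots are served


def headers_for(path: str, allow_scripts: bool = False) -> Dict[str, str]:
    """Hypothetical helper: extra response headers for a given request path."""
    headers: Dict[str, str] = {}
    if path.startswith(ARCHIVE_PREFIX):
        # "sandbox" alone applies every restriction, including blocking scripts.
        # "sandbox allow-scripts" lets archived JS run, but in an opaque origin.
        policy = "sandbox allow-scripts" if allow_scripts else "sandbox"
        headers["Content-Security-Policy"] = policy
    return headers


if __name__ == "__main__":
    print(headers_for("/archive/1557816881/index.html"))
    print(headers_for("/archive/1557816881/index.html", allow_scripts=True))
    print(headers_for("/admin/"))
```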
On top of that, it would still be a good idea to have the admin API on a different origin (a different subdomain is enough), and make its cookie `SameSite=Strict`. This should stop any cross-contamination between archived pages, but it won’t stop them from detecting other archived pages. That might be possible, but it will require more complex server logic.
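For illustration, here is one way to produce a `SameSite=Strict` session cookie with Python's standard library (the `samesite` attribute needs Python 3.8 or later). The cookie name `sessionid` and the `admin_session_cookie` function are assumptions, not ArchiveBox's actual code; in a Django-based admin the equivalent would normally be a settings change rather than hand-built headers.

```python
from http.cookies import SimpleCookie


def admin_session_cookie(session_id: str) -> str:
    """Hypothetical helper: value for a Set-Cookie header on the admin origin."""
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    morsel = cookie["sessionid"]
    morsel["samesite"] = "Strict"  # never sent on cross-site requests
    morsel["httponly"] = True      # not readable from page JS
    morsel["secure"] = True        # only sent over HTTPS
    morsel["path"] = "/"
    return morsel.OutputString()


if __name__ == "__main__":
    print("Set-Cookie: " + admin_session_cookie("abc123"))
```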
Idea (h/t to @FiloSottile for the encouragement), similar to how Wikimedia and many other services do it:
- `archive/<timestamp>/index.html` indexes, archived content with live JS, etc. that could be dangerous

These can be mapped to separate domains/ports by the user (subdomains may be dangerous; full domains are likely required), but this will require adding some new config options to tune what port/domain the admin and dirty content listen on, e.g.:
HTTP_DIRTY_LISTEN=https://demousercontent.archivebox.io
HTTP_ADMIN_LISTEN=https://demo.archivebox.io
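A small sketch of how such options might be validated at startup, so that the admin UI and the "dirty" archived content can never end up on the same origin. The option names come from the proposal above; the `origin` and `check_listen_config` helpers are hypothetical.

```python
from typing import Mapping, Optional, Tuple
from urllib.parse import urlsplit

Origin = Tuple[str, Optional[str], Optional[int]]
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Origin:
    """Reduce a URL to its (scheme, host, port) origin tuple."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def check_listen_config(env: Mapping[str, str]) -> Tuple[Origin, Origin]:
    """Hypothetical startup check: refuse to serve admin and archived content together."""
    admin = origin(env["HTTP_ADMIN_LISTEN"])
    dirty = origin(env["HTTP_DIRTY_LISTEN"])
    if admin == dirty:
        raise ValueError("admin and archived content must be served from different origins")
    return admin, dirty


if __name__ == "__main__":
    print(check_listen_config({
        "HTTP_DIRTY_LISTEN": "https://demousercontent.archivebox.io",
        "HTTP_ADMIN_LISTEN": "https://demo.archivebox.io",
    }))
```

This only checks that the two origins differ; whether sibling subdomains are separate enough is the open question raised above, and the sketch does not settle it.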
This would close a pretty crucial security hole where archived content can interfere with the execution of extractors (and potentially run arbitrary shell scripts by chaining together a series of injection attacks).
Semi-Related, using sandbox iframes for replay: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Sec-Fetch-Mode
Extractor methods that replay JS:
Proposed behavior:
config option to enable bypassing sandboxing:
DANGER_ALLOW_BYPASSING_SANDBOX=True/False