Update s3://commoncrawl/ access scheme
Hi @fhamborg,
next week the way data on s3://commoncrawl/ is accessed will change:
- https://commoncrawl.s3.amazonaws.com/ should be replaced by https://data.commoncrawl.org/
- WARC file listings via aws s3 ls require authentication
  - for AWS users: remove --no-sign-request
  - otherwise WARC file listings for the news crawl are provided on https://data.commoncrawl.org/crawl-data/CC-NEWS/year/month/warc.paths.gz, e.g. https://data.commoncrawl.org/crawl-data/CC-NEWS/2022/04/warc.paths.gz
For further information, please see:
- https://commoncrawl.org/2022/03/introducing-cloudfront-access-to-common-crawl-data/
- https://commoncrawl.org/access-the-data/
- https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html
If required I might be able to contribute a patch.
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:9 (6 by maintainers)
Top GitHub Comments
Is there a typo in the access key stored in ~/.aws/credentials?
The following shell script generates a list of all CC-NEWS WARC files without the need to own an AWS account:
But for news-please this should be rewritten in pure Python, possibly using a dual approach (boto3 for AWS users). I’ll try to provide a PR.
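A pure-Python listing without an AWS account might look roughly like the sketch below. It relies only on the URL pattern given in the issue (crawl-data/CC-NEWS/year/month/warc.paths.gz) and assumes the file is gzip-compressed text with one path per line, relative to https://data.commoncrawl.org/. The function names are hypothetical and not part of news-please.

```python
import gzip
import urllib.request

# New HTTPS endpoint named in the issue.
BASE_URL = "https://data.commoncrawl.org/"


def warc_paths_url(year, month):
    """Build the URL of the monthly CC-NEWS listing (pattern taken from the issue)."""
    return f"{BASE_URL}crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def parse_warc_paths(gz_bytes, base_url=BASE_URL):
    """Decompress a warc.paths.gz payload and return absolute WARC URLs.

    Assumption: each non-empty line is a path relative to base_url.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


def list_cc_news_warcs(year, month):
    """Download the listing for one month; needs network access but no AWS account."""
    with urllib.request.urlopen(warc_paths_url(year, month)) as response:
        return parse_warc_paths(response.read())
```

Calling list_cc_news_warcs(2022, 4) would fetch the April 2022 listing mentioned above. Keeping the parsing separate from the download means a boto3-based fetch for AWS users could reuse parse_warc_paths unchanged.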
BTW, commoncrawl_crawler.py has not been updated yet, so the code doesn’t work as is. I think lines 163 and 166 need to be edited to either use wget with the https URL, or drop --no-sign-request and rely on an authenticated AWS setup.
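For the HTTPS option, the prefix change itself is mechanical. The helper below is only an illustration of that mapping and not the actual content of those lines: to_new_scheme is a hypothetical name, and treating s3://commoncrawl/ paths as having the same layout under the new endpoint is an assumption based on the replacement rule in the issue.

```python
OLD_HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"
S3_PREFIX = "s3://commoncrawl/"
NEW_HTTP_PREFIX = "https://data.commoncrawl.org/"


def to_new_scheme(url):
    """Map an old-style HTTP or s3:// location onto the new HTTPS endpoint."""
    for old_prefix in (OLD_HTTP_PREFIX, S3_PREFIX):
        if url.startswith(old_prefix):
            # Keep the path after the prefix unchanged (assumed identical layout).
            return NEW_HTTP_PREFIX + url[len(old_prefix):]
    # Already on the new scheme, or not a Common Crawl location: leave untouched.
    return url
```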