Update s3://commoncrawl/ access scheme

Hi @fhamborg,

Next week, the way data on s3://commoncrawl/ is accessed will change:

  • https://commoncrawl.s3.amazonaws.com/ should be replaced by https://data.commoncrawl.org/
  • WARC file listings via aws s3 ls require authentication
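
The first point amounts to a host swap in any stored download URL. A minimal Python sketch of that rewrite follows; the function name `migrate_url` and the constants are hypothetical and not taken from news-please:

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts named in the issue; the constant names are illustrative only.
OLD_HOST = "commoncrawl.s3.amazonaws.com"
NEW_HOST = "data.commoncrawl.org"


def migrate_url(url: str) -> str:
    """Return url with the legacy S3 host replaced by the new host.

    URLs pointing at any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc.lower() != OLD_HOST:
        return url
    # Keep path, query, and fragment; force https and swap the host.
    return urlunsplit(("https", NEW_HOST, parts.path, parts.query, parts.fragment))
```

Only the host changes; the path below it (for example crawl-data/CC-NEWS/...) is carried over as-is.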

For further information, please see:

If required I might be able to contribute a patch.

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
sebastian-nagel commented, Apr 15, 2022

Is there a typo in the access key stored in ~/.aws/credentials?

The following shell script generates a list of all CC-NEWS WARC files without the need to own an AWS account:

#!/bin/bash

for year in $(seq 2016 $(date +%Y)); do
    for month in $(seq 1 12); do
        yearmonth=$year/$(printf "%02d" $month)
        if [[ $yearmonth < 2016/08 ]]; then
            : # news crawl started August 2016
        elif [[ $yearmonth > $(date +%Y/%m) ]]; then
            : # after current month
        else
            curl https://data.commoncrawl.org/crawl-data/CC-NEWS/$yearmonth/warc.paths.gz \
                 | gzip -dc
        fi
    done
done

But for news-please this should be rewritten in pure Python, possibly with a dual approach (boto3 for AWS users). I’ll try to provide a PR.
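
As a rough idea of what the pure-Python variant could look like, here is a standard-library-only sketch of the same listing logic as the shell script above. It is not the PR mentioned here, and the function names (`month_range`, `parse_paths`, `list_warc_paths`) are made up for illustration. Months are compared as (year, month) tuples, so the result does not depend on how the path separator sorts in a string comparison.

```python
import datetime
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"
FIRST_MONTH = (2016, 8)  # news crawl started August 2016


def month_range(today=None):
    """Yield (year, month) from August 2016 up to and including the current month."""
    today = today or datetime.date.today()
    year, month = FIRST_MONTH
    while (year, month) <= (today.year, today.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def parse_paths(gz_bytes):
    """Decompress a warc.paths.gz payload and return its non-empty lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line for line in text.splitlines() if line]


def list_warc_paths(today=None):
    """Yield every CC-NEWS WARC path by fetching one listing per month over HTTPS."""
    for year, month in month_range(today):
        url = f"{BASE_URL}crawl-data/CC-NEWS/{year}/{month:02d}/warc.paths.gz"
        with urllib.request.urlopen(url) as response:
            yield from parse_paths(response.read())
```

Calling `for path in list_warc_paths(): print(path)` would print one path per line, like the shell script does. The sketch has no retry or error handling, so a listing that is missing for some month would raise `urllib.error.HTTPError`.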

1 reaction
joemkwon commented, Apr 11, 2022

BTW the commoncrawl_crawler.py code is not updated, so the code doesn’t work as is yet. I think lines 163 and 166 need to be edited, either to use wget with https or to remove no-sign-request and rely on an authenticated AWS setup.
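
To make the two options concrete, here is a small sketch that builds either command line for a given WARC path. It is not the actual content of commoncrawl_crawler.py; `download_command` and its `use_aws` parameter are hypothetical, and the AWS branch assumes credentials are already configured for the aws CLI.

```python
def download_command(warc_path, use_aws=False):
    """Build the argument list for fetching one WARC file.

    use_aws=False: plain HTTPS download with wget, no AWS account needed.
    use_aws=True:  aws CLI copy without --no-sign-request, which assumes
                   an authenticated AWS setup (e.g. ~/.aws/credentials).
    """
    path = warc_path.lstrip("/")
    if use_aws:
        return ["aws", "s3", "cp", f"s3://commoncrawl/{path}", "."]
    return ["wget", f"https://data.commoncrawl.org/{path}"]
```

The returned list can be passed to `subprocess.run(...)`. Building the command as a list rather than a single string avoids shell quoting issues with unusual file names.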


