Update s3://commoncrawl/ access scheme

Hi @fhamborg,

Next week, the way data on s3://commoncrawl/ is accessed will change:

- https://commoncrawl.s3.amazonaws.com/ should be replaced by https://data.commoncrawl.org/
- WARC file listings via aws s3 ls require authentication:
  - for AWS users: remove --no-sign-request
  - otherwise: WARC file listings for the news crawl are provided at https://data.commoncrawl.org/crawl-data/CC-NEWS/year/month/warc.paths.gz, e.g. https://data.commoncrawl.org/crawl-data/CC-NEWS/2022/04/warc.paths.gz
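
The URL changes above can be sketched in Python. This is only an illustration: the function names `migrate_url` and `news_listing_url` are hypothetical and not part of news-please or any Common Crawl tooling, and the s3:// mapping assumes that bucket keys map one-to-one onto paths under the new host.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "commoncrawl.s3.amazonaws.com"
NEW_HOST = "data.commoncrawl.org"


def migrate_url(url):
    """Rewrite an old-style Common Crawl URL (or s3:// URI) to the new HTTPS host.

    Hypothetical helper; URLs that point elsewhere are returned unchanged.
    """
    parts = urlsplit(url)
    is_old_https = parts.netloc == OLD_HOST
    is_s3_uri = parts.scheme == "s3" and parts.netloc == "commoncrawl"
    if is_old_https or is_s3_uri:
        return urlunsplit(("https", NEW_HOST, parts.path, parts.query, parts.fragment))
    return url


def news_listing_url(year, month):
    """Build the URL of the CC-NEWS WARC listing for one month."""
    return f"https://{NEW_HOST}/crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"
```

For example, `news_listing_url(2022, 4)` yields the April 2022 listing URL quoted above.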

For further information, please see:

If required I might be able to contribute a patch.

Top GitHub Comments

sebastian-nagel commented, Apr 15, 2022

Is there a typo in the access key stored in ~/.aws/credentials?

The following shell script generates a list of all CC-NEWS WARC files without the need to own an AWS account:

#!/bin/bash

for year in $(seq 2016 $(date +%Y)); do
    for month in $(seq 1 12); do
        yearmonth=$year/$(printf "%02d" $month)
        # compare as YYYY/MM strings; both sides must use the same separator
        if [[ $yearmonth < 2016/08 ]]; then
            : # news crawl started August 2016
        elif [[ $yearmonth > $(date +%Y/%m) ]]; then
            : # after current month
        else
            curl https://data.commoncrawl.org/crawl-data/CC-NEWS/$yearmonth/warc.paths.gz \
                 | gzip -dc
        fi
    done
done

But for news-please this should be rewritten in pure Python, possibly using a dual approach (boto3 for AWS users, HTTPS for everyone else). I’ll try to provide a PR.
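
A minimal pure-Python sketch of the same listing logic, using only the standard library and covering only the HTTPS path (no boto3). The names `iter_months`, `http_get` and `list_warc_paths` are hypothetical and are not taken from news-please. The `fetch` parameter is an assumption added so that the download step can be swapped out, for example in tests.

```python
import datetime
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/crawl-data/CC-NEWS"
FIRST_MONTH = (2016, 8)  # news crawl started August 2016


def iter_months(today=None):
    """Yield (year, month) from August 2016 up to and including the current month."""
    today = today or datetime.date.today()
    year, month = FIRST_MONTH
    while (year, month) <= (today.year, today.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def http_get(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_warc_paths(today=None, fetch=http_get):
    """Yield every WARC path found in the monthly warc.paths.gz listings."""
    for year, month in iter_months(today):
        url = f"{BASE_URL}/{year:04d}/{month:02d}/warc.paths.gz"
        listing = gzip.decompress(fetch(url)).decode("utf-8")
        yield from listing.splitlines()
```

Calling `list(list_warc_paths())` would download one small listing per month, so a real implementation would probably add error handling for months whose listing is not available yet.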

joemkwon commented, Apr 11, 2022

BTW the commoncrawl_crawler.py code has not been updated yet, so it doesn’t work as is. I think lines 163 and 166 need to be edited to either use wget with https, or to remove --no-sign-request and rely on an authenticated AWS setup.
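
The two options can be illustrated with a small sketch that builds the command line for either case. It does not reproduce the actual lines of commoncrawl_crawler.py; the function name `listing_command` and its parameters are hypothetical, and the per-month S3 prefix is assumed to mirror the path of the warc.paths.gz files.

```python
def listing_command(authenticated, year, month):
    """Return an argv list that lists CC-NEWS WARC files for one month.

    Hypothetical helper: with an authenticated AWS setup it uses `aws s3 ls`
    without --no-sign-request, otherwise it fetches the listing over HTTPS.
    """
    prefix = f"crawl-data/CC-NEWS/{year:04d}/{month:02d}/"
    if authenticated:
        # Requires valid AWS credentials, e.g. in ~/.aws/credentials.
        return ["aws", "s3", "ls", "--recursive", f"s3://commoncrawl/{prefix}"]
    # Writes the gzip-compressed listing to stdout; it still has to be decompressed.
    return ["wget", "-q", "-O", "-", f"https://data.commoncrawl.org/{prefix}warc.paths.gz"]
```

The returned list could be passed to `subprocess.run`. Note that the two commands produce differently formatted output, so the calling code would need to parse each one separately.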