Dataset Information:

ClueWeb22 is the newest in the Lemur Project’s ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies. This new dataset is being developed by the Lemur Project with significant assistance and support from Microsoft Corporation.

The ClueWeb22 dataset has several novel characteristics compared with earlier ClueWeb datasets.

  • It is much larger.
  • Documents are of higher quality.
  • Documents are provided in several formats (HTML, clean text, screen shots).
  • Document page analyses are provided that reveal where on a page text was displayed, and what was near it.
  • The dataset includes a large set of crowdsourced queries and shallow relevance assessments (a pseudo search log).

Authors: Arnold Overwijk, Chenyan Xiong (@xiongchenyan), Jamie Callan (@jamiecallan), Cameron VandenBerg, Xiao Lucy Liu

Links to Resources:

Dataset ID(s) & supported entities:

  • clueweb22/a: 2B docs, queries, qrels, scoreddocs?
  • clueweb22/b: 200M docs, queries?, qrels?, scoreddocs?
  • clueweb22/l: 10B docs, queries?, qrels?, scoreddocs?

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/clueweb22.py)
  • Tests (in tests/integration/clueweb22.py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/clueweb22.yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json). Manual download required.
    • Download instructions added
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

The dataset is planned to be used for shared tasks in the near future. I also personally think it is of very high value to have this in ir_datasets.

Open Questions

  • Where can the topic tag mentioned in the paper be obtained?
  • Is VDOM-Paragraph the same as VDOM-Passage in the WARC headers?
  • What does the ? mean in the inlink format anchor type description?

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
jamiecallan commented, Oct 11, 2022

Sean is correct.  Each warc.gz file is compressed by record and has a companion offset file.  To get the HTML of a specific document, open the appropriate .warc.offset file (can be determined from the ClueWeb docid), fseek to find the byte offsets of the start/end of the document (also determined from the ClueWeb docid), open the .warc.gz file, fseek to the start of the document, read the bytes, and uncompress them.

We can provide data samples if you need them.

We are trying to apply this or a similar architecture to all other types of data in the dataset, so that everything can be accessed quickly given a ClueWeb docid.
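The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual ClueWeb22 reader: the offset file layout (one zero-padded 10-digit byte offset per line), the helper names `write_sample` and `read_record`, and the idea that the record index is already known from the docid are all assumptions made for the example.

```python
import gzip

# Assumed layout: every line of the offset file is a zero-padded
# 10-digit byte offset followed by "\n", so line i starts at byte i * 11.
OFFSET_WIDTH = 10


def write_sample(records, warc_path, offset_path):
    """Write each record as its own gzip member and log the member boundaries."""
    offsets = [0]
    with open(warc_path, "wb") as warc:
        for record in records:
            warc.write(gzip.compress(record))
            offsets.append(warc.tell())
    with open(offset_path, "w", encoding="ascii", newline="\n") as off:
        for offset in offsets:
            off.write(f"{offset:0{OFFSET_WIDTH}d}\n")


def read_record(index, warc_path, offset_path):
    """Return the uncompressed bytes of record `index` without scanning the file."""
    line_len = OFFSET_WIDTH + 1
    with open(offset_path, "rb") as off:
        off.seek(index * line_len)          # jump straight to the start offset
        start = int(off.read(line_len))
        end = int(off.read(line_len))       # next line marks the record's end
    with open(warc_path, "rb") as warc:
        warc.seek(start)
        return gzip.decompress(warc.read(end - start))
```

Because every record is a separate gzip member, only the bytes of the requested record are read and decompressed, which is why no saved zlib state or checkpoint file is needed.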

Best,

Jamie

On 10/5/2022 7:09 AM, Sean MacAvaney wrote:

Excellent, thanks @heinrichreimer!

A while back I requested that they include offset files to facilitate random lookups, and it looks like it made it into the final spec! This will make adding the datasets much easier, since we won’t need to save zlib states and release our own checkpoint files.


0 reactions
heinrichreimer commented, Nov 29, 2022

As the categories are subsets of the larger ones, I’ve now also added “views” that can, for example, be used to just parse the plain text from the B category. The keys would be clueweb22/b/as-l, clueweb22/b/as-a, clueweb22/b/as-l/en, clueweb22/b/as-a/en and so on. To not clutter the list of dataset IDs too much, we could also just skip the language-specific versions for the “views”.
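To make the naming scheme concrete, here is a small sketch that enumerates such view IDs. It is not taken from ir_datasets: the function `view_ids` is hypothetical, and the nesting order (B within A within L) is an assumption inferred from the example keys above.

```python
from itertools import combinations

# Assumed nesting, smallest to largest: B is a subset of A, which is a subset of L.
CATEGORIES = ["b", "a", "l"]


def view_ids(languages=(), prefix="clueweb22"):
    """List view IDs that expose a smaller category in a larger category's format."""
    ids = []
    for small, large in combinations(CATEGORIES, 2):
        base = f"{prefix}/{small}/as-{large}"
        ids.append(base)
        # Optional language-specific variants, e.g. ".../as-l/en".
        ids.extend(f"{base}/{lang}" for lang in languages)
    return ids
```

Calling `view_ids()` with no languages gives the shorter list that skips the language-specific versions, while `view_ids(languages=("en",))` adds one `/en` entry per view.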
