Dataset Information:
The HC4 collection, accepted at ECIR 2022.
Links to Resources:
https://github.com/hltcoe/HC4/tree/main/resouces/hc4
Dataset ID(s) & supported entities:
- Dataset ID: hc4/{language id: zh, fa, ru}/{train, dev, test}
- Each dataset will have {Chinese, Farsi, Russian} documents, English queries (title/description/narrative), an English report associated with each topic, and qrels.
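The ID template above expands to nine dataset IDs. A minimal sketch of that expansion follows; the helper name hc4_dataset_ids is hypothetical and is not part of ir_datasets, and the sketch only builds the ID strings without loading any data.

```python
from itertools import product

# Language and split codes taken from the ID template in this issue.
LANGUAGES = ["zh", "fa", "ru"]
SPLITS = ["train", "dev", "test"]


def hc4_dataset_ids():
    """Expand hc4/{language id}/{split} into concrete ID strings.

    Hypothetical helper for illustration; not an ir_datasets function.
    """
    return [f"hc4/{lang}/{split}" for lang, split in product(LANGUAGES, SPLITS)]


dataset_ids = hc4_dataset_ids()
for dataset_id in dataset_ids:
    print(dataset_id)
```

Once the dataset is registered, each of these strings would presumably be what a user passes to ir_datasets.load.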
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- Dataset definition (in ir_datasets/datasets/[topid].py)
- Tests (in tests/integration/[topid].py)
- Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
- Documentation (in ir_datasets/etc/[topid].yaml)
  - Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- Downloadable content (in ir_datasets/etc/downloads.json)
  - Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
  - Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.
Additional comments/concerns/ideas/etc.
The document ids, qrels, and topics will be distributed through a public GitHub repository. Users need to download the actual documents from Common Crawl. A script for downloading and validating the documents will be provided along with the doc ids.
The structure will be very similar to the future NeuCLIR collection. Whether these two collections will be distributed through the same repository is TBD.
Top GitHub Comments
Yeah, that’s why I was saying in this edge case, the user would have to take the intersection of qids in hc4/fa/train and hc4/zh/train, leaving just Topic 3. They’d also have to filter the qrels, true.

Thanks, Sean. This makes sense. The top-level should be just a placeholder. At this point, I don’t think combining all three languages makes sense. Even though some topics span languages (i.e. same title and description), the narratives are different, so they should be considered different queries.
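The edge case discussed above, intersecting the qids of two language splits and then filtering the qrels, can be sketched as follows. This is only an illustration under stated assumptions: the qid sets and qrel tuples are made-up toy data rather than real HC4 judgments, and shared_qids and filter_qrels are hypothetical helpers, not ir_datasets functions.

```python
# Toy data (hypothetical): qids per split, and qrels as
# (query_id, doc_id, relevance) tuples. Not real HC4 judgments.
fa_train_qids = {"1", "3", "5"}
zh_train_qids = {"2", "3", "4"}

fa_train_qrels = [
    ("1", "fa-doc-a", 1),
    ("3", "fa-doc-b", 3),
    ("5", "fa-doc-c", 0),
]
zh_train_qrels = [
    ("2", "zh-doc-a", 1),
    ("3", "zh-doc-b", 0),
    ("3", "zh-doc-c", 3),
]


def shared_qids(*qid_collections):
    """Return the qids present in every given collection."""
    return set.intersection(*(set(qids) for qids in qid_collections))


def filter_qrels(qrels, keep_qids):
    """Keep only the judgments whose query id is in keep_qids."""
    return [qrel for qrel in qrels if qrel[0] in keep_qids]


common = shared_qids(fa_train_qids, zh_train_qids)
fa_filtered = filter_qrels(fa_train_qrels, common)
zh_filtered = filter_qrels(zh_train_qrels, common)

print(sorted(common))
print(fa_filtered)
print(zh_filtered)
```

With real data, the qid sets would presumably come from iterating each split's queries, and the same filter would be applied to each split's qrels before any cross-language comparison.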
Reports range from 1 to 5 paragraphs. Conceptually, they are written by the analysts prior to the search to reflect some background of the information need.