Cannot Download wmt21 en2zh test data
Here is my mtdata.recipes.wmt22-constrained.yaml config:
- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
  - Statmt-newstest_enzh-2021-eng-zho
  train:
When I download the test set with the following command,
mtdata get-recipe -ri wmt22-zhen-t -o .
it raises an error. Here is the error log:
2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?
It seems that the 2021 en2zh test set has multiple reference sentences per source sentence, so the assert statement raises the error above.
The code that causes this issue is at sgm.py line 79:
srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
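To illustrate why the counts differ, here is a minimal sketch (not mtdata's actual code or fix) that reads a small multi-reference file and groups references by segment id instead of assuming one reference per source. The sample XML layout (`doc`, `src`, `ref` with a `translator` attribute, `seg` with an `id`) is my assumption about the WMT21 format, and `read_multi_ref` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file in the assumed WMT21 XML layout:
# one <src> and one or more <ref> elements per <doc>.
SAMPLE = """<dataset id="demo">
  <doc id="doc1" origlang="en">
    <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Bye.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">你好。</seg><seg id="2">再见。</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">您好。</seg></p></ref>
  </doc>
</dataset>"""


def seg_text(seg):
    """Return the full text of a <seg> element, including nested markup."""
    return "".join(seg.itertext()).strip()


def read_multi_ref(root):
    """Yield (source, [references]) per source segment, matched on seg id within each doc."""
    for doc in root.iter("doc"):
        # Collect every reference in this doc, keyed by segment id.
        refs_by_id = {}
        for ref in doc.iter("ref"):
            for seg in ref.iter("seg"):
                refs_by_id.setdefault(seg.get("id"), []).append(seg_text(seg))
        for src in doc.iter("src"):
            for seg in src.iter("seg"):
                yield seg_text(seg), refs_by_id.get(seg.get("id"), [])


root = ET.fromstring(SAMPLE)

# Flat counts, analogous to the check that fails in the issue.
n_src = sum(1 for src in root.iter("src") for _ in src.iter("seg"))
n_ref = sum(1 for ref in root.iter("ref") for _ in ref.iter("seg"))
print(f"src segs: {n_src}, ref segs: {n_ref}")

# Grouped view: one entry per source segment, with all of its references.
pairs = list(read_multi_ref(root))
for source, references in pairs:
    print(source, references)
```

With grouping, the number of pairs always equals the number of source segments, so a downstream consumer can pick the first reference or keep all of them.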

Just wanted to note that this affects more than just enzh. German-English is also affected when using the default scripts provided by WMT.
Thanks for the update! It might be a good idea to add a note on the main WMT page, since it links to this tool as the way to download the WMT data.