Cannot Download wmt21 en2zh test data

Here is my mtdata.recipes.wmt22-constrained.yaml config:


- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
    - Statmt-newstest_enzh-2021-eng-zho
  train:

When I download the test set using the following command,

mtdata get-recipe -ri wmt22-zhen-t -o .

it raises an error. Here is the error log:

2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?

It seems that the 2021 en2zh test set has multiple ref sentences per src sentence, so the number of ref segs (2847) exceeds the number of src segs (1845) and the assert statement raises the error above.

The code that causes this issue is at sgm.py line 79:

srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
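
To illustrate why the count check fails and one way it could be avoided, here is a minimal sketch. It is not the mtdata implementation. It assumes the WMT21 XML groups segments under doc elements, that each ref element carries a translator attribute, and that seg elements carry an id shared between src and ref. The SAMPLE document and the read_segs helper are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of a WMT21-style test file (structure is an assumption):
# one <src> and two <ref> elements per doc, and translator B covers only seg 1.
SAMPLE = """<dataset>
  <collection>
    <doc id="doc1">
      <src lang="en"><p><seg id="1">src-1</seg><seg id="2">src-2</seg></p></src>
      <ref lang="zh" translator="A"><p><seg id="1">ref-A-1</seg><seg id="2">ref-A-2</seg></p></ref>
      <ref lang="zh" translator="B"><p><seg id="1">ref-B-1</seg></p></ref>
    </doc>
  </collection>
</dataset>"""


def count_segs(root, tag):
    """Count every <seg> nested under any element named `tag`."""
    return sum(1 for parent in root.iter(tag) for _ in parent.iter("seg"))


def read_segs(root, ref_translator=None):
    """Pair each src seg with its refs, doc by doc, matching on the seg id.

    If ref_translator is given, only refs from that translator are kept.
    Returns a list of (src_text, [ref_text, ...]) tuples in document order.
    """
    pairs = []
    for doc in root.iter("doc"):
        # Collect source segments of this doc, keyed by seg id.
        srcs = {}
        for src in doc.findall("src"):
            for seg in src.iter("seg"):
                srcs[seg.get("id")] = (seg.text or "").strip()
        # Collect reference segments, possibly several per seg id.
        refs = defaultdict(list)
        for ref in doc.findall("ref"):
            if ref_translator is not None and ref.get("translator") != ref_translator:
                continue
            for seg in ref.iter("seg"):
                refs[seg.get("id")].append((seg.text or "").strip())
        for seg_id, text in srcs.items():
            pairs.append((text, refs.get(seg_id, [])))
    return pairs


root = ET.fromstring(SAMPLE)
n_src, n_ref = count_segs(root, "src"), count_segs(root, "ref")
all_pairs = read_segs(root)                        # every available ref per src
only_a = read_segs(root, ref_translator="A")       # one translator only
```

In this sample the flat counts differ (2 src segs, 3 ref segs), which is the same kind of mismatch the assert reports, while pairing by seg id within each doc still gives one entry per source segment.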

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
khayrallah commented, Jun 27, 2022

Just wanted to note that this affects more than just enzh. German-English is also affected when using the default scripts provided by WMT.

1 reaction
khayrallah commented, Jul 4, 2022

Thanks for the update! It might be a good idea to add a note on the main WMT page, since it links to mtdata as the way to download the WMT data.
