Here is my mtdata.recipes.wmt22-constrained.yaml config:
- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
  - Statmt-newstest_enzh-2021-eng-zho
  train:
When I download the test set using the following command,
mtdata get-recipe -ri wmt22-zhen-t -o .
it raises an error. Here is the error log:
2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?
It seems that the 2021 en-zh test set has more than one reference sentence for at least some source sentences (1845 source segments vs. 2847 reference segments), so the assert statement fails with the error above.

The code that causes this issue is in sgm.py, line 79:
srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
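
One possible direction for a fix, as a rough sketch only and not mtdata's actual code: instead of asserting that the flat lists of src and ref segments have equal length, group the reference segments by segment id within each document and attach them to the matching source segment. The sample XML below is a hypothetical miniature of the test file (the `doc`/`src`/`ref`/`seg` layout and the `id` attribute are my assumptions about its structure), and `read_segs` is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of the test file; the real layout may differ.
SAMPLE = """<dataset>
  <doc id="d1">
    <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Bye.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">你好。</seg><seg id="2">再见。</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">您好。</seg></p></ref>
  </doc>
</dataset>"""


def read_segs(root):
    """Yield (doc_id, seg_id, source_text, [reference_texts]) per source segment."""
    for doc in root.iter("doc"):
        # Collect every reference segment of this document, keyed by segment id.
        refs = defaultdict(list)
        for ref in doc.iter("ref"):
            for seg in ref.iter("seg"):
                refs[seg.get("id")].append("".join(seg.itertext()).strip())
        # Pair each source segment with all references sharing its id (possibly none).
        for src in doc.iter("src"):
            for seg in src.iter("seg"):
                sid = seg.get("id")
                text = "".join(seg.itertext()).strip()
                yield doc.get("id"), sid, text, refs.get(sid, [])


rows = list(read_segs(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

With this kind of alignment, a source segment with two references and one with a single reference can live in the same file without tripping a length check. How the extra references should then be exposed (first reference only, or one output file per reference) is a separate design decision.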