Cannot Download wmt21 en2zh test data #116

@Pzzzzz5142

Here is my mtdata.recipes.wmt22-constrained.yaml config:

- id: wmt22-zhen-t
  langs: zho-eng
  desc: WMT 22 General MT
  url: https://www.statmt.org/wmt22/translation-task.html
  dev:
  test:
    - Statmt-newstest_enzh-2021-eng-zho
  train:

When I download the test set using the following command,

mtdata get-recipe -ri wmt22-zhen-t -o .

it raises an error. Here is the error log:

2022-06-07 15:19:36 data.add_parts_sequential:329 ERROR:: Unable to add Statmt-newstest_enzh-2021-eng-zho: /Users/pzzzzz/.mtdata/data.statmt.org/1df0/c1646dcf67bf017db12b47b5c987/wmt21tests.tgz-extracted/test/newstest2021.en-zh.xml has unequal number of segs: 1845 == 2847?

It seems that the 2021 en2zh test set has multiple reference sentences per source sentence (1845 source segments vs. 2847 reference segments), so the assert statement shown below fails and produces the error above.

The code that causes this issue is at sgm.py line 79:

srcs = list(xpath_all(tree.getroot(), xpath=".//src//seg"))
tgts = list(xpath_all(tree.getroot(), xpath=".//ref//seg"))
assert len(srcs) == len(tgts), f'{data} has unequal number of segs: {len(srcs)} == {len(tgts)}?'
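
One possible way around the assertion is to pair segments per document and per reference, instead of comparing flat counts. The sketch below is not mtdata's code. It assumes that each `<doc>` holds one `<src>` and one or more `<ref>` elements, that each `<ref>` carries a `translator` attribute, and that `<seg>` elements carry an `id` shared between source and reference. The `SAMPLE` document and the `pair_segments` helper are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of a WMT21-style test file; the real layout may differ.
SAMPLE = """<dataset>
  <doc id="d1">
    <src lang="en"><p><seg id="1">Hello.</seg><seg id="2">Bye.</seg></p></src>
    <ref lang="zh" translator="A"><p><seg id="1">你好。</seg><seg id="2">再见。</seg></p></ref>
    <ref lang="zh" translator="B"><p><seg id="1">您好。</seg></p></ref>
  </doc>
</dataset>"""


def pair_segments(root, translator=None):
    """Return (source, reference) pairs using a single reference per document.

    If `translator` is None, the alphabetically first translator found in
    each document is used. Source segments without a matching reference
    segment id are skipped rather than raising an error.
    """
    pairs = []
    for doc in root.iter("doc"):
        # Collect source segments by id, in document order.
        srcs = {}
        for src in doc.findall("src"):
            for seg in src.iter("seg"):
                srcs[seg.get("id")] = (seg.text or "").strip()

        # Collect reference segments, grouped by translator, then by id.
        refs_by_translator = defaultdict(dict)
        for ref in doc.findall("ref"):
            key = ref.get("translator", "")
            for seg in ref.iter("seg"):
                refs_by_translator[key][seg.get("id")] = (seg.text or "").strip()

        if not refs_by_translator:
            continue
        chosen = translator if translator is not None else sorted(refs_by_translator)[0]
        refs = refs_by_translator.get(chosen, {})

        # Align by segment id, so extra references never break the pairing.
        for seg_id, src_text in srcs.items():
            if seg_id in refs:
                pairs.append((src_text, refs[seg_id]))
    return pairs


root = ET.fromstring(SAMPLE)
pairs = pair_segments(root)                      # uses translator "A"
pairs_b = pair_segments(root, translator="B")    # partial reference
```

With this kind of alignment, the total number of reference segments no longer has to equal the number of source segments. Whether to pick one reference or emit all of them is a design choice left open here.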
