Breaking Changes:
- Drop Python 3.9 support; minimum is now Python 3.10
- Replace `enlighten` progress bars with `rich`; logging now uses `RichHandler`
- Dependency versions updated: `rich~=14.0` replaces `enlighten`; `portalocker~=3.2`, `pybtex~=0.25`, `ruamel.yaml~=0.18`
- Removed `setuptools` workaround for pybtex on Python 3.12+
New Features:
- `mtdata-map` CLI added for applying subprocess-based transformations to parallel data; registered as a console script
- Subprocess-based decompression for `.xz` (via `xz -T0`) and `.bz2` (via `pbzip2`/`lbzip2`/`bzip2`) for faster I/O
- Generalized `SubprocessCompressor` base class in `pigz.py`; `pigz`, `xz_subprocess`, and `bzip2_subprocess` are now subclasses
- HuggingFace index loader generalized to support arbitrary datasets (no longer hard-coded to `google/wmt24pp` only)
- Write buffering (1 MiB) for subprocess compressors
Data Updates:
- OPUS index updated: ~60k new entries (159k → 219k)
- Added WMT26 constrained recipes
- Added new datasets for WMT26
- Added English-Bhojpuri parallel and monolingual corpora (BHLTR); Fixes #174
Improvements:
- Progress bars rewritten with `rich`: multi-task support, spinner, rate columns, coordinated logging
- Singleton `_Sentinel` pattern in `SubprocMapper` preserves identity across pickling
- `SubprocMapper`: improved control message handling and queue draining assertions
- `mtdata-map`: do not modify input lines (no longer replaces `\t` in data)
- Improved log message readability across `mtdata-map` and the data pipeline
- Muted third-party loggers (`httpx`, `datasets`, `huggingface_hub`, `fsspec`, `urllib3`) to WARNING level
- CI: add Python 3.14; test on ubuntu-22.04
- Add preliminary support for huggingface datasets; currently wmt24++ is the only supported dataset
- Update setup.py -> pyproject.toml; hf datasets is optional dependency
- Add `mtdata index` subcommand; deprecate `mtdata --reindex <cmd>`
- Add a field named `meta` of type dictionary to the Entry class; stores arbitrary key-value pairs which may be useful for downloading and parsing datasets
- Support for document ID (currently, one among the many meta fields) in `.meta.jsonl.gz`
- OPUS index updated
- `mtdata score` subcommand added; supports QE scoring via pymarian
- minor fixes
- Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
- `mtdata cache` added; improves concurrency by supporting multiple recipes
- Added WMT general test 2022 and 2023
- `mtdata-bcp47`: `-p/--pipe` to map codes from stdin to stdout
- `mtdata-bcp47`: `--script {suppress-default,suppress-all,express}`
- Uses `pigz` to read and write gzip files by default when pigz is in PATH; `export USE_PIGZ=0` to disable
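The `pigz` integration amounts to swapping the reader behind a common interface. A rough self-contained sketch of the idea, not mtdata's actual code (`open_gz_for_read` is a hypothetical helper name):

```python
import gzip
import os
import shutil
import subprocess

def open_gz_for_read(path):
    """Return a binary file-like object yielding the decompressed stream.

    Use `pigz -dc` when pigz is on PATH and USE_PIGZ is not "0";
    otherwise fall back to the stdlib gzip module.
    """
    if os.environ.get("USE_PIGZ", "1") != "0" and shutil.which("pigz"):
        proc = subprocess.Popen(["pigz", "-dc", path], stdout=subprocess.PIPE)
        return proc.stdout
    return gzip.open(path, "rb")
```

Decompression then runs in a separate process, so reading and downstream parsing can overlap across cores.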
- Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: Travis -> github actions
- Update ELRC datasets #138. Thanks @AlexUmnov
- Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
- Add Flores200 dev and devtests #145. Thanks @ZenBel
- Add support for `mtdata echo <ID>`
- Dataset entries only store BibTeX keys and not full citation text
- Creates index cache as a JSON Lines file (WIP towards dataset statistics)
- Simplified index loading
- Simplified compression format handlers. Added support for opening `.bz2` files without creating temp files.
- All resources are moved to the `mtdata/resource` dir; any new additions to that dir are automatically included in the Python package (fail-proof against future issues like #137)
New and exciting features:
- Support for adding new datasets at runtime (`mtdata*.py` from run dir). Note: you have to reindex by calling `mtdata -ri list`
- Monolingual datasets support in progress (currently testing)
- Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual
- `mtdata list` is updated: `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
Skipped 0.3.9 because the changes are significant
- CLI arg `--log-level` with default set to `WARNING`
- Progress bar can be disabled from CLI: `--no-pbar`; default is enabled (`--pbar`)
- `mtdata stats --quick` does HTTP HEAD and shows content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
- `python -m mtdata.scripts.recipe_stats` to read stats from output directory
- Security fix with tar extract | Thanks @TrellixVulnTeam
- Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
- Opus and ELRC datasets update | Thanks @ZenBel
- Update ELRC data, including EU acts, which are used for WMT22 (thanks @kpu)
- fixes and additions for wmt22
- Fixed KECL-JParaCrawl
- added Paracrawl bonus for ukr-eng
- added Yandex rus-eng corpus
- added Yakut sah-eng
- update recipe for wmt22 constrained eval
- Parallel download support: `-j/--n-jobs` argument (default `4`)
- Add histogram to web search interface (Thanks, @sgowdaks)
- Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets are added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
- jpn-eng: add paracrawl v3, and wmt19 TED
- backtranslation datasets for en2ru and ru2en
- Option to set `MTDATA_RECIPES` dir (default is `$PWD`). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
- WMT22 recipes added
- JW300 is disabled #77
- Automatically create references.bib file based on datasets selected
- ELRC datasets updated
- Added docs, separate copy for each version (github pages)
- Dataset search via web interface. Support for regex match
- Added two new datasets Masakane fon-fra
- Improved TMX files BCP47 lang ID matching: compatibility instead of exact match
- Bug fix: XML reading inside tar: ElementTree complains about TarPath
- `mtdata list` has `-g/--groups` and `-ng/--not-groups` as include/exclude filters on group name (#91)
- `mtdata list` has `-id/--id` flag to print only dataset IDs (#91)
- Add WMT21 tests (#90)
- add ccaligned datasets wmt21 (#89)
- add ParIce datasets (#88)
- add wmt21 en-ha (#87)
- add wmt21 wikititles v3 (#86)
- Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) (#84)
- Add support for two URLs for a single dataset (i.e. without zip/tar files)
- Fix: buggy matching of languages `y1==y1`
- Fix: `get` command: ensure train/dev/test datasets are indeed compatible with languages specified in `--langs` args
- Fix: recipes.yml is missing in the pip installed package
- Add Project Anuvaad: 196 datasets belonging to Indian languages
- Add CLI: `mtdata get` has `--fail / --no-fail` arguments to tell whether or not to crash upon errors
- Add support for recipes; list-recipe get-recipe subcommands added
- Add support for viewing stats of a dataset: words, chars, segs
- FIX url for UN dev and test sets (source was updated so we updated too)
- Multilingual experiment support; ISO 639-3 code `mul` implies multilingual; e.g. `mul-eng` or `eng-mul`
- `--dev` accepts multiple datasets and merges them (useful for multilingual experiments)
- Tar files are extracted before read (performance improvements)
- setup.py: version and descriptions accessed via regex
Big Changes: BCP-47, data compression
- BCP-47: (Language, Script, Region)
  - Our implementation is strictly not BCP-47. We differ on the following:
    - We use ISO 639-3 codes (i.e. three letters) for all languages, whereas BCP-47 uses two letters for some (e.g. `en`) and three letters for many.
    - We use `_` (underscore) to join language, script, and region, whereas BCP-47 uses `-` (hyphen).
- Dataset IDs (aka `did` in short) are standardized: `<group>-<name>-<version>-<lang1>-<lang2>`; `<group>` can have mixed case, `<name>` has to be lowercase
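A quick sketch of what the standardized IDs look like when taken apart; `parse_did` is a hypothetical helper for illustration, not part of mtdata's API (the example ID appears elsewhere in this changelog):

```python
def parse_did(did: str):
    """Split a standardized dataset ID into (group, name, version, lang1, lang2).

    A simplified sketch, not mtdata's actual parser: it assumes none of the
    five fields contains a hyphen itself.
    """
    group, name, version, lang1, lang2 = did.split("-")
    assert name == name.lower(), "<name> must be lowercase"
    return group, name, version, lang1, lang2

print(parse_did("Statmt-commoncrawl-wmt19-fra-deu"))
# → ('Statmt', 'commoncrawl', 'wmt19', 'fra', 'deu')
```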
- CLI interface now accepts `did`s
- `mtdata get --dev <did>` now accepts a single dataset ID; creates `dev.{xxx,yyy}` links at the root of out dir
- `mtdata get --test <did1> ... <did3>` creates `test{1..4}.{xxx,yyy}` links at the root of out dir
- `--compress` option to store compressed datasets under output dir
- `zip` and `tar` files are no longer extracted; we read directly from compressed files without extracting them
- `._lock` files are removed after download job is done
- Add JESC, jpn paracrawl, news commentary 15 and 16
- Force unicode encoding; make it work on Windows (Issue #71)
- JW300 -> JW300_v1 (tokenized); Added JW300_v1c (raw) (Issue #70)
- Add all Wikititle datasets from lingual tool (Issue #63)
- Progress bar: `enlighten` is used
- `wget` is replaced with `requests`; a User-Agent header along with the mtdata version is sent in HTTP request headers
- Paracrawl v9 added
- OPUS index updated (crawled on 20210522)
- new:
- CCAlignedV1
- EiTBParCC_v1
- EuroPat_v2
- MultiCCAligned_v1.1
- NewsCommentary_v14
- WikiMatrix_v1
- tico19_v20201028
- updates (replaces old with new):
- Multilingual TMX parsing, add ECDC and EAC -- #39 -- by @kpu
- Removed Global Voices -- now available via OPUS -- #41
- Move all BibTeX to a separate file -- #42
- Add ELRC-Share datasets #43 -- by @kpu
- Fix line count mismatch in some XML formats #45
- Parse BCP47 codes by removing everything after first hyphen #48 -- by @kpu
- Add Khresmoi datasets #53 -- by @kpu
- Optimize index loading by using cache
  - Added `-re | --reindex` CLI flag to force update of index cache #54
  - Removed `--cache` CLI argument. Use `export MTDATA=/path/to/cache-dir` instead (which was already supported)
- Add: `DCEP` corpus, 253 language pairs #58 -- by @kpu
- Add: WMT 21 dev sets: eng-hau eng-isl isl-eng hau-eng #36
- New datasets
- New features
- `mtdata -b` for short outputs and crash on error input
- Fixes and improvements:
- Paracrawl v7 and v7.1 -- 29 new datasets
- Fix swapping issue with TMX format (TILDE corpus); add a testcase for TMX entry
- Add mtdata-iso shell command
- Add "mtdata report" sub command to summarize datasets by language and names
- Add OPUS 100 corpus
- Add all pairs of neulab_tedtalksv1 - train,test,dev -- 4,455 of them
- Add support for cleaning noise. Entry.is_noise(seg1, seg2)
- some basic noise is removed by default from training
- Add `__slots__` to Entry class (takes less memory and faster attribute lookup)
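The `__slots__` note above is a standard CPython optimization; a self-contained illustration (the `EntryPlain`/`EntrySlotted` classes below are toy stand-ins, not mtdata's Entry):

```python
class EntryPlain:
    """Baseline: each instance carries a per-instance __dict__."""
    def __init__(self, did, url):
        self.did = did
        self.url = url

class EntrySlotted:
    """With __slots__: attribute storage is fixed, no __dict__ is allocated."""
    __slots__ = ("did", "url")
    def __init__(self, did, url):
        self.did = did
        self.url = url

plain = EntryPlain("some-id", "https://example.org/data.tgz")
slim = EntrySlotted("some-id", "https://example.org/data.tgz")

# The slotted instance has no per-instance dict, which is where the
# memory savings and faster attribute lookup come from.
assert hasattr(plain, "__dict__") and not hasattr(slim, "__dict__")
```

The trade-off: slotted instances cannot gain arbitrary new attributes at runtime, which is usually fine for fixed-schema record classes like a dataset index entry.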
- Add all pairs of Wikimatrix -- 1,617 of them
- Add support for specifying `cols` of `.tsv` file
- Add all Europarl v7 sets
- Remove hin-eng `dict` from JoshuaIndianParallelCorpus
- Remove Wikimatrix1 from statmt -- they are moved to a separate file
- File locking using portalocker to deal with race conditions when multiple `mtdata get` are invoked in parallel
- Remove language name from local file name -- the same tar file can be used by many languages, and they don't need copies
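The locking described above can be pictured with a minimal sketch. This uses the stdlib `fcntl` module on POSIX as a stand-in for portalocker's cross-platform lock; the lock file name is illustrative:

```python
import fcntl
import os
import tempfile

# Exclusive advisory lock on a side-car lock file, so concurrent runs
# don't download/extract the same shared archive twice.
lock_path = os.path.join(tempfile.gettempdir(), "download.tgz._lock")

with open(lock_path, "w") as lock_file:
    fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until no other process holds it
    try:
        # ... exactly one process downloads/extracts the shared file here ...
        pass
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
```

portalocker provides the same semantics on Windows as well, which is why mtdata uses it instead of `fcntl` directly.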
- CLI to print version name
- Added KFTT Japanese-English set
- IITB hin-eng datasets
- Fix issue with dataset counting
- Pypi release bug fix: select all nested packages
- add UnitedNations test set
- Add JW300 Corpus
- All languages are internally mapped to 3-letter ISO codes
- 53,000 entries from OPUS are indexed