---
annotations_creators:
- no-annotation
language_creators:
- expert-generated
- machine-generated
languages:
- en
licenses:
- ms-pl
multilinguality:
- monolingual
- translation
size_categories:
- 1K<n<10K
source_datasets:
- extended|other-newstest2017
task_categories:
- conditional-text-generation
task_ids:
- machine-translation
paperswithcode_id: null
pretty_name: MsrZhenTranslationParity
---

# Dataset Card for msr_zhen_translation_parity

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Translator Human Parity Data](https://msropendata.com/datasets/93f9aa87-9491-45ac-81c1-6498b6be0d0b)

- **Repository:**
- **Paper:** [Achieving Human Parity on Automatic Chinese to English News Translation](https://www.microsoft.com/en-us/research/publication/achieving-human-parity-on-automatic-chinese-to-english-news-translation/)

- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

> Human evaluation results and translation output for the Translator Human Parity Data release,
> as described in https://blogs.microsoft.com/ai/machine-translation-news-test-set-human-parity/
>
> The Translator Human Parity Data release contains all human evaluation results and translations
> related to our paper "Achieving Human Parity on Automatic Chinese to English News Translation",
> published on March 14, 2018. We have released this data to
>
> 1) allow external validation of our claim of having achieved human parity
> 2) to foster future research by releasing two additional human references
>   for the Reference-WMT test set.

The dataset includes:

1) two new references for the Chinese-English language pair of WMT17,
   one based on human translation from scratch (Reference-HT),
   the other based on human post-editing (Reference-PE);

2) human parity translations generated by our research systems Combo-4, Combo-5, and Combo-6,
   as well as translation output from the online machine translation service Online-A-1710,
   collected on October 16, 2017.

The data package provided with the study also includes all data points collected in the
human evaluation campaigns, but these are not parsed or exposed as features of this dataset.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

This dataset contains six additional English translations for the Chinese-English language pair of WMT17.

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

As mentioned in the summary, this dataset provides six additional English translations for the
Chinese-English language pair of WMT17.

Data fields are named exactly as in the associated paper for easier cross-referencing.

- `Reference-HT`: human translation from scratch.
- `Reference-PE`: human post-editing.
- `Combo-4`, `Combo-5`, `Combo-6`: three translations by research systems.
- `Online-A-1710`: a translation from an anonymous online machine translation service.

All data fields of a record are translations for the same Chinese source sentence.
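
The field layout described above can be sketched in Python. This is a minimal illustration, assuming each record is a plain mapping from the six field names to English strings. The helper `split_record` and the placeholder record are hypothetical and are not part of the dataset or of any library.

```python
from typing import Dict, List

# Field names as listed in this card (they follow the paper's naming).
HUMAN_REFERENCES: List[str] = ["Reference-HT", "Reference-PE"]
SYSTEM_OUTPUTS: List[str] = ["Combo-4", "Combo-5", "Combo-6", "Online-A-1710"]
ALL_FIELDS: List[str] = HUMAN_REFERENCES + SYSTEM_OUTPUTS


def split_record(record: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group one record's translations into human references and system outputs.

    Hypothetical helper for illustration only.
    """
    missing = [name for name in ALL_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return {
        "references": {name: record[name] for name in HUMAN_REFERENCES},
        "systems": {name: record[name] for name in SYSTEM_OUTPUTS},
    }


# Hypothetical record: placeholder strings, not real dataset content.
example_record = {name: f"<English translation from {name}>" for name in ALL_FIELDS}

grouped = split_record(example_record)
print(sorted(grouped["references"]))
print(sorted(grouped["systems"]))
```

With the Hugging Face `datasets` library, records of this shape would presumably be obtained through `load_dataset("msr_zhen_translation_parity")`. That identifier is an assumption based on this card's title and is not exercised in the sketch above.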

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

Citation information is available on the paper's page: [Achieving Human Parity on Automatic Chinese to English News Translation](https://www.microsoft.com/en-us/research/publication/achieving-human-parity-on-automatic-chinese-to-english-news-translation/)

### Contributions

Thanks to [@leoxzhao](https://github.com/leoxzhao) for adding this dataset.
