Metadata-Version: 2.1
Name: summ-eval
Version: 0.60
Summary: Toolkit for summarization evaluation
Home-page: https://github.com/Yale-LILY/SummEval
Author: Alex Fabbri, Wojciech Kryściński
Author-email: alexander.fabbri@yale.edu, wojciech.kryscinski@salesforce.com
License: MIT
Platform: UNKNOWN
Description-Content-Type: text/markdown

# Summarization Repository 
Authors: [Alex Fabbri*](http://alex-fabbri.github.io/), [Wojciech Kryściński*](https://twitter.com/iam_wkr), [Bryan McCann](https://bmccann.github.io/), [Caiming Xiong](http://cmxiong.com/), [Richard Socher](https://www.socher.org/), and [Dragomir Radev](http://www.cs.yale.edu/homes/radev/)<br/>

This project is a collaboration work between [Yale LILY Lab](https://yale-lily.github.io/) and [Salesforce Research](https://einstein.ai/). <br/><br/>

<p align="center">
<img src="https://raw.githubusercontent.com/Yale-LILY/SummEval/master/assets/logo-lily.png" height="100" alt="LILY Logo" style="padding-right:160">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://raw.githubusercontent.com/Yale-LILY/SummEval/master/assets/logo-salesforce.svg" height="100" alt="Salesforce Logo"> 
</p>

<sub><sup>\* - Equal contributions from authors</sup></sub>

## Table of Contents

1. [Updates](#updates)
2. [Data](#data)
3. [Evaluation Toolkit](#evaluation-toolkit)
4. [Citation](#citation)
5. [Get Involved](#get-involved)

## Updates
_04/19/2020_ - Updated the [human annotation file](https://drive.google.com/file/d/1d2Iaz3jNraURP1i7CfTqPIj8REZMJ3tS/view?usp=sharing) to include all models from paper and metric scores.<br/>
_04/19/2020_ - SummEval is now pip-installable! Check out the [pypi page](https://pypi.org/project/summ-eval/).<br/>
_04/09/2020_ - Please see [this comment](https://github.com/Yale-LILY/SummEval/issues/13#issuecomment-812918298) with code for computing system-level metric correlations!  <br/>
_11/12/2020_ - Added the reference-less BLANC and SUPERT metrics! <br/>
_7/16/2020_ - Initial commit! :) 

## Data
As part of this release, we share summaries generated by recent summarization model trained on the CNN/DailyMail dataset [here](#model-outputs).</br>
We also share human annotations, collected from both crowdsource workers and experts [here](#human-annotations).

Both datasets are shared WITHOUT the source articles that were used to generate the summaries. <br/>
To recreate the full dataset please follow the instructions listed [here](#data-preparation). 

### Model Outputs

|Model|Paper|Outputs|Type|
|-|-|-|-|
|M0|_Lead-3 Baseline_|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M0.tar.gz)|Extractive|
|M1|[Neural Document Summarization by Jointly Learning to Score and Select Sentences](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16838/16118)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M1.tar.gz)|Extractive|
|M2|[BANDITSUM: Extractive Summarization as a Contextual Bandit](http://aclweb.org/anthology/P18-1061)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M2.tar.gz)|Extractive|
|M3|[Neural Latent Extractive Document Summarization](http://aclweb.org/anthology/D18-1088)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M3.tar.gz)|Extractive|
|M4|[Ranking Sentences for Extractive Summarization with Reinforcement Learning](https://www.aclweb.org/anthology/N18-1158/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M4.tar.gz)|Extractive|
|M5|[Learning to Extract Coherent Summary via Deep Reinforcement Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16838/16118)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M5.tar.gz)|Extractive|
|M6|[Neural Extractive Text Summarization with Syntactic Compression](https://www.aclweb.org/anthology/D19-1324/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M6.tar.gz)|Extractive|
|M7|[STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings](https://www.aclweb.org/anthology/P19-2034/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M7.tar.gz)|Extractive|
|M8|[Get To The Point: Summarization with Pointer-Generator Networks](http://aclweb.org/anthology/P17-1099)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M8.tar.gz)|Abstractive|
|M9|[Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](https://www.aclweb.org/anthology/P18-1063)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M9.tar.gz)|Abstractive|
|M10|[Bottom-Up Abstractive Summarization](https://www.aclweb.org/anthology/D18-1443/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M10.tar.gz)|Abstractive|
|M11|[Improving Abstraction in Text Summarization](http://aclweb.org/anthology/D18-1207)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M11.tar.gz)|Abstractive|
|M12|[A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss](http://aclweb.org/anthology/P18-1013)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M12.tar.gz)|Abstractive|
|M13|[Multi-Reward Reinforced Summarization with Saliency and Entailment](http://aclweb.org/anthology/N18-2102)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M13.tar.gz)|Abstractive|
|M14|[Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation](http://aclweb.org/anthology/P18-1064)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M14.tar.gz)|Abstractive|
|M15|[Closed-Book Training to Improve Summarization Encoder Memory](http://aclweb.org/anthology/D18-1440)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M15.tar.gz)|Abstractive|
|M16|[An Entity-Driven Framework for Abstractive Summarization](https://www.aclweb.org/anthology/D19-1323/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M16.tar.gz)|Abstractive|
|M17|[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M17.tar.gz)|Abstractive|
|M18|[Better Rewards Yield Better Summaries: Learning to Summarise Without References](https://www.aclweb.org/anthology/D19-1307)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M18.tar.gz)|Abstractive|
|M19|[Text Summarization with Pretrained Encoders](https://www.aclweb.org/anthology/D19-1387)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M19.tar.gz)|Abstractive|
|M20|[Fine-Tuning GPT-2 from Human Preferences](https://openai.com/blog/fine-tuning-gpt-2/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M20.tar.gz)|Abstractive|
|M21|[Unified Language Model Pre-training for Natural Language Understanding and Generation](https://arxiv.org/abs/1905.03197)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M21.tar.gz)|Abstractive|
|M22|[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://www.aclweb.org/anthology/2020.acl-main.703)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M22.tar.gz)|Abstractive|
|M23|[PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M23.tar.gz)|Abstractive|

**IMPORTANT:** 

All model outputs were obtained from the original authors of the models and shared with their consent.<br/>
When using any of the model outputs, please also _cite the original paper_.


### Human annotations

Human annotations of model generated summaries can be found [here](https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl).

The annotations include summaries generated by 16 models from 100 source news articles (1600 examples in total). <br/>
Each of the summaries was annotated by 5 indepedent crowdsource workers and 3 independent experts (8 annotations in total). <br/>
Summaries were evaluated across 4 dimensions: _coherence_, _consistency_, _fluency_, _relevance_. <br/>
Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsources reference summaries.

### Data preparation

Both model generated outputs and human annotated data require pairing with the original CNN/DailyMail articles.

To recreate the datasets follow the instructions:
1. Download CNN Stories and Daily Mail Stories from https://cs.nyu.edu/~kcho/DMQA/
2. Create a `cnndm` directory and unpack downloaded files into the directory
3. Download and unpack model outputs or human annotations.
4. Run the `pair_data.py` script to pair the data with original articles

Example call for _model outputs_:

`python3 data_processing/pair_data.py --model_outputs <file-with-data-annotations> --story_files <dir-with-stories>`

Example call for _human annotations_:

`python3 data_processing/pair_data.py --data_annotations <file-with-data-annotations> --story_files <dir-with-stories>`


## Evaluation Toolkit

We provide a toolkit for summarization evaluation to unify metrics and promote robust comparison of summarization systems. The toolkit contains popular and recent metrics for summarization as well as several machine translation metrics.

### Metrics ###
Below are the metrics included in the tookit, followed by the associated paper and code used within the toolkit:
|Metric|Paper|Code|
|-|-|-|
|ROUGE|[ROUGE: A Package for Automatic Evaluation of Summaries](https://www.aclweb.org/anthology/W04-1013.pdf)|[Link](https://github.com/bheinzerling/pyrouge/tree/master/pyrouge)|
|ROUGE-we|[Better Summarization Evaluation with Word Embeddings for ROUGE](https://www.aclweb.org/anthology/D15-1222.pdf)|[Link](https://github.com/UKPLab/emnlp-ws-2017-s3/blob/master/S3/ROUGE.py#L152)|
|MoverScore|[MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance](https://www.aclweb.org/anthology/D19-1053.pdf)|[Link](https://github.com/AIPHES/emnlp19-moverscore/)|
|BertScore|[BertScore: Evaluating Text Generation with BERT](https://arxiv.org/pdf/1904.09675.pdf)|[Link](https://github.com/Tiiiger/bert_score)|
|Sentence Mover's Similarity|[Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts](https://www.aclweb.org/anthology/P19-1264.pdf)|[Link](https://github.com/eaclark07/sms)|
|SummaQA|[Answers Unite! Unsupervised Metrics for Reinforced Summarization Models](https://www.aclweb.org/anthology/D19-1320.pdf)|[Link](https://github.com/recitalAI/summa-qa)|
|BLANC|[Fill in the BLANC: Human-free quality estimation of document summaries](https://arxiv.org/pdf/2002.09836.pdf)|[Link](https://github.com/PrimerAI/blanc)|
|SUPERT|[SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization](https://www.aclweb.org/anthology/2020.acl-main.124.pdf)|[Link](https://github.com/yg211/acl20-ref-free-eval)|
|METEOR|[METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments ](https://www.aclweb.org/anthology/W05-0909.pdf)|[Link](https://github.com/Maluuba/nlg-eval/tree/master/nlgeval/pycocoevalcap/meteor)|
|S<sup>3</sup>|[Learning to Score System Summaries for Better Content Selection Evaluation](https://www.aclweb.org/anthology/W17-4510/)|[Link](https://github.com/UKPLab/emnlp-ws-2017-s3)|
|Misc. statistics<br/>(extractiveness, novel n-grams, repetition, length)|[Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies](https://www.aclweb.org/anthology/N18-1065/)| [Link](https://github.com/lil-lab/newsroom)|
|Syntactic Evaluation|[Automatic Analysis of Syntactic Complexity in Second Language writing](https://www.benjamins.com/catalog/ijcl.15.4.02lu)|[Link](http://www.personal.psu.edu/xxl13/downloads/L2SCA-2016-06-30.tgz)|
|CIDer|[CIDEr: Consensus-based Image Description Evaluation](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf)|[Link](https://github.com/Maluuba/nlg-eval/tree/master/nlgeval/pycocoevalcap/cider)|
|CHRF|[CHRF++: words helping character n-grams](https://www.statmt.org/wmt17/pdf/WMT70.pdf)|[Link](https://github.com/m-popovic/chrF)|
|BLEU|[BLEU: a Method for Automatic Evaluation of Machine Translation](https://www.aclweb.org/anthology/P02-1040.pdf)|[Link](https://github.com/mjpost/sacreBLEU)|


#### SETUP ####

You can install summ_eval via pip:
```bash
pip install summ-eval
```

You can also install summ_eval from source:

```
git clone https://github.com/Yale-LILY/SummEval.git
cd evaluation
pip install -e .
```

You can test your installation (assuming you're in the `./summ_eval` folder) and get familiar with the library through `tests/`

```
python -m unittest discover
```


### Command-line interface
We provide a command-line interface `calc-scores` which makes use of [gin config](https://github.com/google/gin-config) files to set metric parameters. 

##### Examples
Run ROUGE on given source and target files and write to `rouge.jsonl`, analogous to [files2rouge](https://github.com/pltrdy/files2rouge). 
```
calc-scores --config-file=examples/basic.config --metrics "rouge" --summ-file summ_eval/1.summ --ref-file summ_eval/1.ref --output-file rouge.jsonl --eos " . " --aggregate True
```

**NOTE**: if you're seeing slow-ish startup time, try commenting out the metrics you're not using in the config; otherwise this will load all modules. 


Run ROUGE and BertScore on a jsonl file which contains "reference" and "summary" keys and write to `output.jsonl`. 
```
calc-scores --config-file=examples/basic.config --metrics "rouge, bert_score" --jsonl-file data.jsonl --output-file rouge_bertscore.jsonl
```

For a full list of options, please run:
```
calc-scores --help
```


### For use in scripts
If you want to use the evaluation metrics as part of other scripts, we have you covered!

```
from summ_eval.rouge_metric import RougeMetric
rouge = RougeMetric()
```

#### Evaluate on a batch
```
summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]

rouge_dict = rouge.evaluate_batch(summaries, references)
```

#### Evaluate on a single example
```
rouge_dict = rouge.evaluate_example(summaries[0], references[0])
```


#### Evaluate with multiple references
Currently the command-line tool does not use multiple references for simplicity. Each metric has a `supports_multi_ref` property to tell you if it supports multiple references. 

```
print(rouge.supports_multi_ref) # True
multi_references = [["This is ref 1 for summ 1", "This is ref 2 for summ 1"], ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]
rouge_dict = rouge.evaluate_batch(summaries, multi_references)
```





## Citation

```
@article{fabbri2020summeval,
  title={SummEval: Re-evaluating Summarization Evaluation},
  author={Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
  journal={arXiv preprint arXiv:2007.12626},
  year={2020}
}
```

### Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. 
We welcome PRs!


