Metadata-Version: 2.1
Name: summ_eval
Version: 0.80
Summary: Toolkit for summarization evaluation
Home-page: https://github.com/Yale-LILY/SummEval
Author: Alex Fabbri, Wojciech Kryściński
Author-email: alexander.fabbri@yale.edu, wojciech.kryscinski@salesforce.com
License: MIT
Platform: UNKNOWN
Description-Content-Type: text/markdown

# Summarization Repository
Authors: [Alex Fabbri*](http://alex-fabbri.github.io/), [Wojciech Kryściński*](https://twitter.com/iam_wkr), [Bryan McCann](https://bmccann.github.io/), [Caiming Xiong](http://cmxiong.com/), [Richard Socher](https://www.socher.org/), and [Dragomir Radev](http://www.cs.yale.edu/homes/radev/)<br/>

This project is a collaboration between [Yale LILY Lab](https://yale-lily.github.io/) and [Salesforce Research](https://einstein.ai/). <br/><br/>

<p align="center">
<img src="https://raw.githubusercontent.com/Yale-LILY/SummEval/master/assets/logo-lily.png" height="100" alt="LILY Logo" style="padding-right:160">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://raw.githubusercontent.com/Yale-LILY/SummEval/master/assets/logo-salesforce.svg" height="100" alt="Salesforce Logo">
</p>

<sub><sup>\* - Equal contributions from authors</sup></sub>

## Table of Contents

1. [Updates](#updates)
2. [Data](#data)
3. [Evaluation Toolkit](#evaluation-toolkit)
4. [Citation](#citation)
5. [Get Involved](#get-involved)

## Updates
_04/19/2020_ - Updated the [human annotation file](https://drive.google.com/file/d/1d2Iaz3jNraURP1i7CfTqPIj8REZMJ3tS/view?usp=sharing) to include all models from the paper and metric scores.<br/>
_04/19/2020_ - SummEval is now pip-installable! Check out the [pypi page](https://pypi.org/project/summ-eval/).<br/>
_04/09/2020_ - Please see [this comment](https://github.com/Yale-LILY/SummEval/issues/13#issuecomment-812918298) with code for computing system-level metric correlations!  <br/>
_11/12/2020_ - Added the reference-less BLANC and SUPERT metrics! <br/>
_7/16/2020_ - Initial commit! :)

## Data
As part of this release, we share summaries generated by recent summarization models trained on the CNN/DailyMail dataset [here](#model-outputs).<br/>
We also share human annotations, collected from both crowdsource workers and experts [here](#human-annotations).

Both datasets are shared WITHOUT the source articles that were used to generate the summaries. <br/>
To recreate the full dataset please follow the instructions listed [here](#data-preparation).

### Model Outputs

|Model|Paper|Outputs|Type|
|-|-|-|-|
|M0|_Lead-3 Baseline_|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M0.tar.gz)|Extractive|
|M1|[Neural Document Summarization by Jointly Learning to Score and Select Sentences](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16838/16118)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M1.tar.gz)|Extractive|
|M2|[BANDITSUM: Extractive Summarization as a Contextual Bandit](http://aclweb.org/anthology/P18-1061)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M2.tar.gz)|Extractive|
|M3|[Neural Latent Extractive Document Summarization](http://aclweb.org/anthology/D18-1088)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M3.tar.gz)|Extractive|
|M4|[Ranking Sentences for Extractive Summarization with Reinforcement Learning](https://www.aclweb.org/anthology/N18-1158/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M4.tar.gz)|Extractive|
|M5|[Learning to Extract Coherent Summary via Deep Reinforcement Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16838/16118)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M5.tar.gz)|Extractive|
|M6|[Neural Extractive Text Summarization with Syntactic Compression](https://www.aclweb.org/anthology/D19-1324/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M6.tar.gz)|Extractive|
|M7|[STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings](https://www.aclweb.org/anthology/P19-2034/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M7.tar.gz)|Extractive|
|M8|[Get To The Point: Summarization with Pointer-Generator Networks](http://aclweb.org/anthology/P17-1099)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M8.tar.gz)|Abstractive|
|M9|[Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting](https://www.aclweb.org/anthology/P18-1063)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M9.tar.gz)|Abstractive|
|M10|[Bottom-Up Abstractive Summarization](https://www.aclweb.org/anthology/D18-1443/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M10.tar.gz)|Abstractive|
|M11|[Improving Abstraction in Text Summarization](http://aclweb.org/anthology/D18-1207)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M11.tar.gz)|Abstractive|
|M12|[A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss](http://aclweb.org/anthology/P18-1013)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M12.tar.gz)|Abstractive|
|M13|[Multi-Reward Reinforced Summarization with Saliency and Entailment](http://aclweb.org/anthology/N18-2102)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M13.tar.gz)|Abstractive|
|M14|[Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation](http://aclweb.org/anthology/P18-1064)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M14.tar.gz)|Abstractive|
|M15|[Closed-Book Training to Improve Summarization Encoder Memory](http://aclweb.org/anthology/D18-1440)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M15.tar.gz)|Abstractive|
|M16|[An Entity-Driven Framework for Abstractive Summarization](https://www.aclweb.org/anthology/D19-1323/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M16.tar.gz)|Abstractive|
|M17|[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M17.tar.gz)|Abstractive|
|M18|[Better Rewards Yield Better Summaries: Learning to Summarise Without References](https://www.aclweb.org/anthology/D19-1307)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M18.tar.gz)|Abstractive|
|M19|[Text Summarization with Pretrained Encoders](https://www.aclweb.org/anthology/D19-1387)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M19.tar.gz)|Abstractive|
|M20|[Fine-Tuning GPT-2 from Human Preferences](https://openai.com/blog/fine-tuning-gpt-2/)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M20.tar.gz)|Abstractive|
|M21|[Unified Language Model Pre-training for Natural Language Understanding and Generation](https://arxiv.org/abs/1905.03197)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M21.tar.gz)|Abstractive|
|M22|[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://www.aclweb.org/anthology/2020.acl-main.703)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M22.tar.gz)|Abstractive|
|M23|[PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/pdf/1912.08777.pdf)|[Link](https://storage.googleapis.com/sfr-summarization-repo-research/M23.tar.gz)|Abstractive|

**IMPORTANT:**

All model outputs were obtained from the original authors of the models and shared with their consent.<br/>
When using any of the model outputs, please also _cite the original paper_.


### Human annotations

Human annotations of model-generated summaries can be found [here](https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl).

The annotations include summaries generated by 16 models from 100 source news articles (1600 examples in total). <br/>
Each of the summaries was annotated by 5 independent crowdsource workers and 3 independent experts (8 annotations in total). <br/>
Summaries were evaluated across 4 dimensions: _coherence_, _consistency_, _fluency_, _relevance_. <br/>
Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsourced reference summaries.

### Data preparation

Both model-generated outputs and human-annotated data require pairing with the original CNN/DailyMail articles.

To recreate the datasets, follow these instructions:
1. Download CNN Stories and Daily Mail Stories from https://cs.nyu.edu/~kcho/DMQA/
2. Create a `cnndm` directory and unpack the downloaded files into it
3. Download and unpack model outputs or human annotations.
4. Run the `pair_data.py` script to pair the data with the original articles

Example call for _model outputs_:

`python3 data_processing/pair_data.py --model_outputs <file-with-model-outputs> --story_files <dir-with-stories>`

Example call for _human annotations_:

`python3 data_processing/pair_data.py --data_annotations <file-with-data-annotations> --story_files <dir-with-stories>`


## Evaluation Toolkit

We provide a toolkit for summarization evaluation to unify metrics and promote robust comparison of summarization systems. The toolkit contains popular and recent metrics for summarization as well as several machine translation metrics.

### Metrics ###
Below are the metrics included in the toolkit, followed by the associated paper and the code used within the toolkit:

|Metric|Paper|Code|
|-|-|-|
|ROUGE|[ROUGE: A Package for Automatic Evaluation of Summaries](https://www.aclweb.org/anthology/W04-1013.pdf)|[Link](https://github.com/bheinzerling/pyrouge/tree/master/pyrouge)|
|ROUGE-we|[Better Summarization Evaluation with Word Embeddings for ROUGE](https://www.aclweb.org/anthology/D15-1222.pdf)|[Link](https://github.com/UKPLab/emnlp-ws-2017-s3/blob/master/S3/ROUGE.py#L152)|
|MoverScore|[MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance](https://www.aclweb.org/anthology/D19-1053.pdf)|[Link](https://github.com/AIPHES/emnlp19-moverscore/)|
|BertScore|[BertScore: Evaluating Text Generation with BERT](https://arxiv.org/pdf/1904.09675.pdf)|[Link](https://github.com/Tiiiger/bert_score)|
|Sentence Mover's Similarity|[Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts](https://www.aclweb.org/anthology/P19-1264.pdf)|[Link](https://github.com/eaclark07/sms)|
|SummaQA|[Answers Unite! Unsupervised Metrics for Reinforced Summarization Models](https://www.aclweb.org/anthology/D19-1320.pdf)|[Link](https://github.com/recitalAI/summa-qa)|
|BLANC|[Fill in the BLANC: Human-free quality estimation of document summaries](https://arxiv.org/pdf/2002.09836.pdf)|[Link](https://github.com/PrimerAI/blanc)|
|SUPERT|[SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization](https://www.aclweb.org/anthology/2020.acl-main.124.pdf)|[Link](https://github.com/yg211/acl20-ref-free-eval)|
|METEOR|[METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments](https://www.aclweb.org/anthology/W05-0909.pdf)|[Link](https://github.com/Maluuba/nlg-eval/tree/master/nlgeval/pycocoevalcap/meteor)|
|S<sup>3</sup>|[Learning to Score System Summaries for Better Content Selection Evaluation](https://www.aclweb.org/anthology/W17-4510/)|[Link](https://github.com/UKPLab/emnlp-ws-2017-s3)|
|Misc. statistics<br/>(extractiveness, novel n-grams, repetition, length)|[Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies](https://www.aclweb.org/anthology/N18-1065/)|[Link](https://github.com/lil-lab/newsroom)|
|Syntactic Evaluation|[Automatic Analysis of Syntactic Complexity in Second Language writing](https://www.benjamins.com/catalog/ijcl.15.4.02lu)|[Link](http://www.personal.psu.edu/xxl13/downloads/L2SCA-2016-06-30.tgz)|
|CIDer|[CIDEr: Consensus-based Image Description Evaluation](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf)|[Link](https://github.com/Maluuba/nlg-eval/tree/master/nlgeval/pycocoevalcap/cider)|
|CHRF|[CHRF++: words helping character n-grams](https://www.statmt.org/wmt17/pdf/WMT70.pdf)|[Link](https://github.com/m-popovic/chrF)|
|BLEU|[BLEU: a Method for Automatic Evaluation of Machine Translation](https://www.aclweb.org/anthology/P02-1040.pdf)|[Link](https://github.com/mjpost/sacreBLEU)|


#### SETUP ####

You can install summ_eval via pip:
```bash
pip install summ-eval
```

You can also install summ_eval from source:

```bash
git clone https://github.com/Yale-LILY/SummEval.git
cd SummEval/evaluation
pip install -e .
```

You can test your installation (assuming you're in the `./summ_eval` folder) and get familiar with the library through `tests/`:

```bash
python -m unittest discover
```


### Command-line interface
We provide a command-line interface `calc-scores` which makes use of [gin config](https://github.com/google/gin-config) files to set metric parameters.

##### Examples
Run ROUGE on given summary and reference files and write to `rouge.jsonl`, analogous to [files2rouge](https://github.com/pltrdy/files2rouge).
```bash
calc-scores --config-file=examples/basic.config --metrics "rouge" --summ-file summ_eval/1.summ --ref-file summ_eval/1.ref --output-file rouge.jsonl --eos " . " --aggregate True
```

**NOTE**: if startup is slow, try commenting out the metrics you're not using in the config; otherwise all metric modules are loaded.


Run ROUGE and BertScore on a `.jsonl` file which contains `reference` and `decoded` (i.e., system output) keys and write to `rouge_bertscore.jsonl`.
```bash
calc-scores --config-file=examples/basic.config --metrics "rouge, bert_score" --jsonl-file data.jsonl --output-file rouge_bertscore.jsonl
```

For a full list of options, please run:
```bash
calc-scores --help
```


### For use in scripts
If you want to use the evaluation metrics as part of other scripts, we have you covered!

```python
from summ_eval.rouge_metric import RougeMetric
rouge = RougeMetric()
```

#### Evaluate on a batch
```python
summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]

rouge_dict = rouge.evaluate_batch(summaries, references)
```

#### Evaluate on a single example
```python
rouge_dict = rouge.evaluate_example(summaries[0], references[0])
```


#### Evaluate with multiple references
For simplicity, the command-line tool does not currently use multiple references. Each metric has a `supports_multi_ref` property to tell you if it supports multiple references.

```python
print(rouge.supports_multi_ref) # True
multi_references = [["This is ref 1 for summ 1", "This is ref 2 for summ 1"], ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]
rouge_dict = rouge.evaluate_batch(summaries, multi_references)
```


## Citation

```
@article{fabbri2020summeval,
  title={SummEval: Re-evaluating Summarization Evaluation},
  author={Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
  journal={arXiv preprint arXiv:2007.12626},
  year={2020}
}
```

### Get Involved

Please create a GitHub issue if you have any questions, suggestions, requests or bug reports.
We welcome PRs!
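
As a companion to the command-line examples above, here is a minimal sketch of how a `.jsonl` input with `reference` and `decoded` keys might be prepared using only the Python standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not part of summ_eval; only the two key names come from the description above.

```python
import json


def write_jsonl(summaries, references, path):
    """Write one JSON object per line with `decoded` and `reference` keys.

    Hypothetical helper, not part of summ_eval.
    """
    if len(summaries) != len(references):
        raise ValueError("summaries and references must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for summary, reference in zip(summaries, references):
            record = {"decoded": summary, "reference": reference}
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `write_jsonl(summaries, references, "data.jsonl")` would produce a file that could then be passed to `calc-scores` through `--jsonl-file data.jsonl`, assuming the tool expects exactly one record per line.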

