Metadata-Version: 2.1
Name: skweak
Version: 0.2.8
Summary: Software toolkit for weak supervision in NLP
Home-page: https://github.com/NorskRegnesentral/skweak
Author: Pierre Lison
Author-email: plison@nr.no
License: LICENSE.txt
Description: # skweak: Weak supervision for NLP
        
        Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is an expensive and time-consuming process. 
        
        `skweak` (pronounced `/ski:k/`) is a Python-based software toolkit that provides a concrete solution to this problem using _weak supervision_. `skweak` is built around a very simple idea: instead of annotating texts by hand, we define a set of _labelling functions_ to automatically label our documents, and then _aggregate_ their results to obtain a labelled version of our corpus. 
        
        The labelling functions may take a variety of forms, such as domain-specific heuristics (like pattern-matching rules), gazetteers (based on dictionaries of possible entries), machine learning models, or even annotations from crowd-workers. The results of those labelling functions are then aggregated using a statistical model that automatically estimates the relative accuracy (and confusion matrices) of each labelling function by comparing their predictions with one another.
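        As a toy illustration of the aggregation idea (not `skweak`'s actual model, which is HMM-based and also learns accuracies and confusion matrices), one can picture each labelling function casting one vote per token, with conflicts resolved by majority vote:
        
        ```python
        from collections import Counter
        
        # Votes from three hypothetical labelling functions over five tokens;
        # None means the function abstained on that token.
        lf_votes = [
            ["PERSON", "PERSON", None,    None,    "DATE"],  # gazetteer
            ["PERSON", None,     None,    "MONEY", None  ],  # heuristic
            [None,     "PERSON", "MONEY", "MONEY", "DATE"],  # ML model
        ]
        
        def majority_vote(votes_per_token):
            """Pick the most frequent non-abstaining label for each token."""
            labels = []
            for token_votes in zip(*votes_per_token):
                counts = Counter(v for v in token_votes if v is not None)
                labels.append(counts.most_common(1)[0][0] if counts else "O")
            return labels
        
        print(majority_vote(lf_votes))
        # ['PERSON', 'PERSON', 'MONEY', 'MONEY', 'DATE']
        ```
        
        Unlike this simple vote, the statistical aggregation model weighs each labelling function by its estimated reliability, so one accurate function can outvote several noisy ones.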
        
        `skweak` can be applied to both sequence labelling and text classification, and comes with a complete API that makes it possible to create, apply and aggregate labelling functions with just a few lines of code. Give it a try!
        
        
        https://user-images.githubusercontent.com/11574012/114999146-e0995300-9ea1-11eb-8288-2bb54dc043e7.mp4
        
        For more details on `skweak`, see our paper (...). TODO
        
        **Documentation & API**: See the [Wiki](https://github.com/NorskRegnesentral/skweak/wiki) for details on how to use `skweak`. 
        
        ## Requirements
        
        The following Python packages must be installed:
        - `spacy` >= 3.0.0
        - `hmmlearn` >= 0.2.4
        - `pandas` >= 0.23
        - `numpy` >= 1.18
        
        `skweak` is developed for Python 3.6 or later, and has not been tested with Python 2.x.
        
        ## Install
        
        TODO: make `skweak` into a full-blown package that can be installed through `pip install`. 
        
        ## Basic Overview
        
        <br>
        <p align="center">
           <img alt="Overview of skweak" src="https://raw.githubusercontent.com/NorskRegnesentral/skweak/main/data/skweak_procedure.png"/>
        </p><br>
        
        Weak supervision with `skweak` goes through the following steps:
        - **Start**: First, you need raw (unlabelled) data from your text domain. `skweak` is built on top of [spaCy](http://www.spacy.io) and operates on spaCy `Doc` objects, so you first need to convert your documents to `Doc` objects with `spacy`.
        - **Step 1**: Then, we need to define a range of labelling functions that will take those documents and annotate spans with labels. Those labelling functions can come from heuristics, gazetteers, machine learning models, or even noisy annotations from crowd-workers. See the [documentation](https://github.com/NorskRegnesentral/skweak/wiki) for more details. 
        - **Step 2**: Once the labelling functions have been applied to your corpus, you need to _aggregate_ their results in order to obtain a single annotation layer (instead of the multiple, possibly conflicting annotations from the labelling functions). This is done in `skweak` using a generative model that automatically estimates the relative accuracy and possible confusions of each labelling function. 
        - **Step 3**: Finally, based on those aggregated labels, we can train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to estimate any kind of machine learning model. You are free to use whichever model/framework you prefer. 
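        For Step 3, the aggregated spans typically need to be converted into whatever format your training framework expects. As a minimal sketch (plain Python, independent of `skweak`; the `(start, end, label)` token-span format is assumed here for illustration), here is how spans can be mapped to a token-level BIO tag sequence:
        
        ```python
        def spans_to_bio(n_tokens, spans):
            """Convert (start, end, label) token spans into a BIO tag sequence."""
            tags = ["O"] * n_tokens
            for start, end, label in spans:
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
            return tags
        
        # "Donald Trump paid $ 750" -> a PERSON span and a MONEY span
        print(spans_to_bio(5, [(0, 2, "PERSON"), (3, 5, "MONEY")]))
        # ['B-PERSON', 'I-PERSON', 'O', 'B-MONEY', 'I-MONEY']
        ```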
        
        ## Quickstart
        
        Here is a minimal example with three labelling functions (LFs) applied on a single document:
        
        ```python
        import spacy, re
        from skweak import heuristics, gazetteers, aggregation, utils
        
        # LF 1: heuristic to detect occurrences of MONEY entities
        def money_detector(doc):
            for tok in doc[1:]:
                if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
                    yield tok.i-1, tok.i+1, "MONEY"
        lf1 = heuristics.FunctionAnnotator("money", money_detector)
        
        # LF 2: detection of years with a regex
        lf2 = heuristics.TokenConstraintAnnotator("years", lambda tok: re.match(r"(19|20)\d{2}$", tok.text), "DATE")
        
        # LF 3: a gazetteer with a few names
        NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
        trie = gazetteers.Trie(NAMES)
        lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":trie})
        
        # We create a corpus (here with a single text)
        nlp = spacy.load("en_core_web_sm")
        doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")
        
        # apply the labelling functions
        doc = lf3(lf2(lf1(doc)))
        
        # and aggregate them
        hmm = aggregation.HMM("hmm", ["PERSON", "DATE", "MONEY"])
        hmm.fit_and_aggregate([doc])
        
        # we can then visualise the final result (in Jupyter)
        utils.display_entities(doc, "hmm")
        ```
        
        Obviously, to get the most out of `skweak`, you will need more labelling functions, and a larger corpus including as many documents as possible from your domain. 
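        Additional labelling functions follow the same generator pattern as `money_detector` above: a function over a document that yields `(start, end, label)` tuples. For instance, a hypothetical weekday heuristic, written here in plain Python over a token list purely for illustration:
        
        ```python
        def weekday_detector(tokens):
            """Yield (start, end, label) spans for weekday names (a toy heuristic)."""
            weekdays = {"Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday"}
            for i, tok in enumerate(tokens):
                if tok in weekdays:
                    yield i, i + 1, "DATE"
        
        tokens = "The meeting is on Friday at noon".split()
        print(list(weekday_detector(tokens)))
        # [(4, 5, 'DATE')]
        ```
        
        In `skweak`, the same logic would operate on a spaCy `Doc` and be wrapped with `heuristics.FunctionAnnotator`, as in the quickstart above.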
        
        ## Documentation
        
        See the [Wiki](https://github.com/NorskRegnesentral/skweak/wiki). 
        
        
        ## License
        
        `skweak` is released under an MIT License. 
        
        The MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve
        the copyright and license notices (see file [License](https://github.com/NorskRegnesentral/skweak/blob/main/LICENSE.txt)). Licensed works, modifications, and larger works may be distributed under different terms and without source code.
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.6
Description-Content-Type: text/markdown
