Metadata-Version: 2.1
Name: prodmodel
Version: 0.1.5
Summary: Build data science pipelines and models
Home-page: https://github.com/prodmodel/prodmodel
Author: Gergely Svigruha
Author-email: gergely.svigruha@prodmodel.com
License: UNKNOWN
Description: # Prodmodel
        
        Prodmodel is a [build system](https://en.wikipedia.org/wiki/List_of_build_automation_software) for data science pipelines.
        Users, testers, and contributors are welcome!
        
        <p align="center">
          <a href="https://pypi.org/project/prodmodel">
            <img src="https://img.shields.io/pypi/v/prodmodel.svg"></img></a>
          <a href="https://pypi.org/project/prodmodel" alt="Downloads">
            <img src="https://img.shields.io/pypi/dd/prodmodel.svg" /></a>
          <a href="https://github.com/prodmodel/prodmodel/graphs/contributors" alt="Contributors">
            <img src="https://img.shields.io/github/contributors/prodmodel/prodmodel.svg" /></a>
          <a href="https://github.com/prodmodel/prodmodel/pulse" alt="Activity">
            <img src="https://img.shields.io/github/commit-activity/m/prodmodel/prodmodel.svg" /></a>
          <a href="https://github.com/prodmodel/prodmodel/issues" alt="Issues">
            <img src="https://img.shields.io/github/issues/prodmodel/prodmodel.svg" /></a>
          <a href="https://github.com/prodmodel/prodmodel/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aclosed" alt="Closed issues">
            <img src="https://img.shields.io/github/issues-closed/prodmodel/prodmodel.svg" /></a>
          <a href="https://github.com/prodmodel/prodmodel/pulls" alt="Pulls">
            <img src="https://img.shields.io/github/issues-pr/prodmodel/prodmodel.svg" /></a>
        </p>
        
        <h3 align="center">
          <a href="#motivation">Motivation</a>
          <span> · </span>
          <a href="#concepts">Concepts</a>
          <span> · </span>
          <a href="#installation">Installation</a>
          <span> · </span>
          <a href="#usage">Usage</a>
          <span> · </span>
          <a href="#contributing">Contributing</a>
          <span> · </span>
          <a href="#contact">Contact</a>
          <span> · </span>
          <a href="#licence">Licence</a>
        </h3>
        
        ## Motivation
        
         * Performance. Nothing is rerun unnecessarily: outputs are cached, and switching between multiple versions is easy. Prodmodel can
           **figure out if a particular partial code path has already been executed using a particular piece of data** and just use the cached output.
         * Easy debugging. Every single dependency, whether code or data, is version controlled and tracked.
         * Deploy to production. Models are more than just a file. Prodmodel makes sure that the correct versions of the label encoders,
           feature transformation code, data, and model files are all packaged together.
        
        ## Concepts
        
        A build system is a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of `rules` (transformations), `inputs` and `targets`.
        In Prodmodel `inputs` can be
         * data,
         * Python code,
         * and configuration.
        
        A `rule` transforms any of the above into an output (which other rules can in turn depend on). Rules therefore need to be
        re-executed (and their outputs re-created) if any of their dependencies change. Prodmodel keeps track of all of these dependencies.
        
        The outputs of the `rules` are `targets`. Every `target` corresponds to an output (e.g. a model or a dataset). These outputs
        are cached and version controlled.
        
        Prodmodel therefore ensures
         * correctness, by executing all code (e.g. feature transformation, model building, tests) that can potentially be affected by a change, and
         * performance, by executing only the necessary code, saving time compared to rerunning the whole pipeline.
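        
        The idea of re-executing a rule only when one of its dependencies changes can be illustrated with a content hash. The following is a minimal sketch, not Prodmodel's actual implementation: `fingerprint`, `run_rule` and the in-memory `_cache` are hypothetical names chosen for illustration.
        
        ```python
        import hashlib
        import json
        from typing import Any, Callable, Dict
        
        # Hypothetical in-memory cache, used here only for illustration.
        _cache: Dict[str, Any] = {}
        
        
        def fingerprint(code: str, data: bytes, config: Dict[str, Any]) -> str:
            """Hash every input of a rule: code, data, and configuration."""
            parts = [
                code.encode("utf-8"),
                data,
                # sort_keys makes the hash independent of dict insertion order.
                json.dumps(config, sort_keys=True).encode("utf-8"),
            ]
            digest = hashlib.sha256()
            for part in parts:
                # Hash each part separately so part boundaries stay unambiguous.
                digest.update(hashlib.sha256(part).digest())
            return digest.hexdigest()
        
        
        def run_rule(
            fn: Callable[[bytes, Dict[str, Any]], Any],
            code: str,
            data: bytes,
            config: Dict[str, Any],
        ) -> Any:
            """Execute fn only if this exact combination of inputs is new."""
            key = fingerprint(code, data, config)
            if key not in _cache:
                _cache[key] = fn(data, config)
            return _cache[key]
        ```
        
        Calling `run_rule` twice with identical code, data, and config runs `fn` once; changing any of the three yields a new fingerprint and triggers a re-execution.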
        
        ### Rules
        
        Every rule is a statically typed function, where the inputs are targets, data, or configs. The execution of
        a rule outputs some data (e.g. a different feature set or a model), which can be used in other rules.
        
        In order to use Prodmodel your code has to be structured as functions which the rules can call into.
        
        ### Targets
        
        Targets are created by rule functions. Targets can be executed to generate output files. `IterableDataTarget` is a special target
        which can be used as an iterable of `dicts` to make iterating over datasets easier. Regular `DataTargets` can represent any
        Python object.
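        
        As an illustration of the kind of typed function a rule could call into, here is a minimal sketch that consumes an iterable of `dicts`, the shape an `IterableDataTarget` is described as providing. The function name `scale_feature` and its `column` parameter are hypothetical and not part of Prodmodel.
        
        ```python
        from typing import Dict, Iterable, List
        
        
        def scale_feature(rows: Iterable[Dict[str, float]], column: str) -> List[Dict[str, float]]:
            """Return copies of the rows with `column` min-max scaled to [0, 1]."""
            materialized = list(rows)  # the iterable may only be consumable once
            if not materialized:
                return []
            values = [row[column] for row in materialized]
            low, high = min(values), max(values)
            span = (high - low) or 1.0  # avoid division by zero for constant columns
            return [{**row, column: (row[column] - low) / span} for row in materialized]
        ```
        
        Because the function only relies on plain Python iterables and dicts, it can be unit tested without running a build.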
        
        ## Installation
        
        Prodmodel requires Python 3.6 or later. Use [pip](https://pip.pypa.io/en/stable/) to install prodmodel.
        
        ```bash
        pip install prodmodel --user
        ```
        
        ## Usage
        
        Create a `build.py` file in your data science folder. The build file contains references to your inputs and the build rules you can execute.
        
        ```python
        import rules
        
        csv_data = rules.data_source(file='data.csv', type='csv', dtypes={...})
        
        my_model = rules.transform(objects={'data': csv_data}, file='kmeans.py', fn='compute_kmeans')
        ```
        
        Now you can build your model by running `prodmodel my_model` from the directory of `build.py`,
        or `prodmodel <path_to_my_directory>:my_model` from any directory.
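        
        The build file above refers to a function `compute_kmeans` in `kmeans.py`. A minimal sketch of what such a file might contain follows. It assumes, without confirmation from the API documentation, that the function receives the object registered under the key `data` as an argument of the same name and that the rows behave like dicts. The `column` and `iterations` parameters and the column name `value` are hypothetical.
        
        ```python
        from typing import Dict, Iterable, List
        
        
        def compute_kmeans(
            data: Iterable[Dict[str, float]],
            column: str = "value",  # hypothetical column name
            iterations: int = 10,
        ) -> List[float]:
            """Cluster one numeric column into two groups; return sorted centroids."""
            values = [float(row[column]) for row in data]
            if not values:
                raise ValueError("no rows to cluster")
            # Deterministic start: the smallest and the largest value.
            centroids = [min(values), max(values)]
            for _ in range(iterations):
                clusters: List[List[float]] = [[], []]
                for v in values:
                    nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
                    clusters[nearest].append(v)
                # Keep the previous centroid when a cluster ends up empty.
                centroids = [
                    sum(c) / len(c) if c else centroids[i]
                    for i, c in enumerate(clusters)
                ]
            return sorted(centroids)
        ```
        
        This is a simplified one-dimensional, two-cluster variant of Lloyd's algorithm written in pure Python to keep the sketch self-contained; a real project would more likely call a library implementation.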
        
        Prodmodel creates a `.prodmodel` directory under the home directory of the user to store log and config files.
        
        ### Documentation
        
        Check out the complete [example project](https://github.com/prodmodel/prodmodel/tree/master/example) for more examples.
        
        The complete list of build rules can be found [here](https://github.com/prodmodel/prodmodel/blob/master/doc/api_doc.md).
        
        Prodmodel searches for a config file under `<user home dir>/.prodmodel/config`. The config file can be created manually
        based on this [template](https://github.com/prodmodel/prodmodel/blob/master/doc/config).
        
        ### Arguments
        
         * `--force_external`: Some data sources are remote (e.g. an SQL server), therefore tracking changes is not always feasible.
           This argument gives the user manual control over when to reload these data sources.
         * `--cache_data`: Cache local data files if changed. This can be useful for debugging / reproducibility by making sure every
           data source used for a specific build is saved.
        
        ## Contributing
        Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
        
        ## Contact
        Feel free to email me at gergely.svigruha@prodmodel.com if you have any questions, need help, or would like to contribute to the code.
        
        ## Licence
        [Apache 2.0](https://github.com/prodmodel/prodmodel/blob/master/LICENCE)
        
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.6
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 3 - Alpha
Description-Content-Type: text/markdown
