Metadata-Version: 2.1
Name: pycspro
Version: 1.0.6
Summary: A Python library for parsing CSPro dictionaries and cases.
Home-page: https://github.com/amestsantim/pycspro
Author: Nahom Tamerat
Author-email: nahomt@amestsantim.com
License: UNKNOWN
Description: # pyCSPro
        Python library for parsing CSPro dictionaries and cases.
        
        ## Install
        pip:
        ```bash
        pip install pycspro
        ```
        
        git:
        ```bash
        git clone https://github.com/amestsantim/pycspro
        cd pycspro
        python setup.py install
        ```
        
        ## Usage
        The library is simple to use and has only two classes: DictionaryParser and CaseParser.
        There is also a Medium article that explains, in considerable detail, how it works. You can find it here:
        
        [Working With CSPro Data Using Python (Pandas)](https://medium.com/@nahomt/working-with-cspro-data-using-python-pandas-9a6161b84ffa?sk=5e19e932f9090a21432c716aac0e7401)
        
        ### DictionaryParser
        This class receives the raw text of a dictionary and parses it into a Python dictionary, which we can then manipulate to accomplish various tasks.
        
        #### parse()
        This method accepts the contents of a dictionary file (.dcf) and returns a dictionary object, which is basically a nested Python dictionary.
        ```python
        import json
        
        from pycspro import DictionaryParser
        
        # Read the raw text of the CSPro dictionary file
        with open('CensusDictionary.dcf', 'r') as dictionary_file:
            raw_dictionary = dictionary_file.read()
        
        dictionary_parser = DictionaryParser(raw_dictionary)
        parsed_dictionary = dictionary_parser.parse()
        print(json.dumps(parsed_dictionary, indent=4))
        ```
        
        ```json
        {
            "Dictionary": {
                "Name": "CEN2000",
                "Label": "Popstan Census",
                "Note": "",
                "Version": "CSPro 7.2",
                "RecordTypeStart": 1,
                "RecordTypeLen": 1,
                "Positions": "Relative",
                "ZeroFill": true,
                "DecimalChar": false,
                "Languages": [],
                "Relation": [],
                "Level": {
                    "Name": "QUEST",
                    "Label": "Questionnaire",
                    "Note": "",
                    "IdItems": [
                        {
                            "Name": "PROVINCE",
                            "Label": "Province",
                            "Note": "",
                            "Len": 2,
                            "ItemType": "Item",
                            "DataType": "Numeric",
                            "Occurrences": 1,
                            "Decimal": 0,
                            "DecimalChar": false,
                            "ZeroFill": true,
                            "OccurrenceLabel": [],
                            "Start": 2,
                            "ValueSets": [
                                {
                                    "Name": "PROV_VS1",
                                    "Label": "Province",
                                    "Note": "",
                                    "Value": [
                                        "1;Artesia",
                                        "2;Copal",
                                        "3;Dari",
                                        "4;Eris",
                                        "5;Girda",
                                        "6;Hali",
                                        "7;Kerac",
                                        "8;Lacuna",
                                        "9;Laya",
                                        "10;Lira",
                                        "11;Matanga",
                                        "12;Patan",
                                        "13;Rift",
                                        "14;Terra",
                                        "15;Tumar"
                                    ]
                                }
                            ]
                        },
                        ...
        ```
        
        #### get_column_labels()
        This method accepts a record name and returns a dictionary where keys are the item names and the values are the item labels for all the items within the given record.
        ```python
        {
            'H01_TYPE': 'Type of housing',
            'H02_WALL': 'Wall type',
            'H03_ROOF': 'Roof type',
            'H04_FLOOR': 'Floor type',
            'H05_ROOMS': 'Number of rooms',
            'H06_TENURE': 'Tenure'
        }
        ```
        This is useful for replacing the column names in a DataFrame with their labels.
        ```python
        housing = dfs['HOUSING']
        # rename() returns a new DataFrame, so assign the result
        housing = housing.rename(columns=dictionary_parser.get_column_labels('HOUSING'))
        ```
        
        #### get_value_labels()
        This method accepts a record name and returns a dictionary where the keys are the item names and the values are themselves dictionaries. The key-value pairs of each inner dictionary are the possible values of that item and their respective labels.
        ```python
        {
            'P02_REL': {1: 'Head', 2: 'Spouse', 3: 'Child', 4: 'Parent', 5: 'Other', 6: 'Nonrelative', 9: 'Not Reported'},
            'P03_SEX': {1: 'Male', 2: 'Female'}
        }
        ```
        
        This can be used to replace values with their more meaningful labels in a DataFrame.
        ```python
        person = dfs['PERSON']
        # replace() returns a new DataFrame, so assign the result
        person = person.replace(dictionary_parser.get_value_labels('PERSON'))
        ```
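        
        To make the mapping concrete, here is a minimal pure-Python sketch of what replacing values with labels amounts to. It works on a table stored as a dictionary of column lists and on labels in the shape shown above for get_value_labels(). The helper name apply_value_labels and the sample rows are hypothetical and only for illustration; they are not part of the library, and the sketch assumes the parsed values are integers.
        
        ```python
        def apply_value_labels(table, value_labels):
            """Return a copy of table with coded values replaced by their labels.
        
            table: dict mapping column names to lists of values.
            value_labels: dict mapping column names to {value: label} dicts.
            """
            labelled = {}
            for column, values in table.items():
                labels = value_labels.get(column, {})
                # Values without a label (and columns without labels) stay unchanged
                labelled[column] = [labels.get(value, value) for value in values]
            return labelled
        
        
        # Hypothetical sample data, for illustration only
        value_labels = {'P03_SEX': {1: 'Male', 2: 'Female'}}
        person_table = {'P03_SEX': [1, 2, 2], 'P04_AGE': [34, 29, 5]}
        
        labelled_person = apply_value_labels(person_table, value_labels)
        print(labelled_person)
        ```
        
        Columns that have no entry in the labels dictionary, such as P04_AGE here, pass through untouched.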
        
        ### CaseParser
        The CaseParser class is responsible for cutting up raw CSPro cases into tables by using a parsed dictionary (the output of DictionaryParser). It produces a nested dictionary where each record is yet another dictionary. The resulting format is well suited for conversion into a Pandas DataFrame by using the from_dict method of the pandas DataFrame class.
        
        During instantiation, you can also pass in a cutting_mask to the CaseParser class to specify only the columns (items) you are interested in. This can be useful when there are a large number of items in a record.
        ```python
        cutting_mask = {
            'QUEST': ['PROVINCE', 'DISTRICT'],
            'PERSON': ['P03_SEX', 'P04_AGE', 'P11_LITERACY', 'P15_OCC'],
            'HOUSING': ['H01_TYPE', 'H05_ROOMS', 'H07_RENT', 'H08_TOILET', 'H13_PERSONS']
        }
        case_parser = CaseParser(parsed_dictionary, cutting_mask)
        ```
        
        #### parse()
        The parse method receives a list of cases and returns a nested dictionary of records.
        ```python
        import pandas as pd
        from pycspro import CaseParser
        
        case_parser = CaseParser(parsed_dictionary)
        parsed_cases = case_parser.parse(cases)  # where cases is a list of CSPro cases
        
        # parsed_cases is a Python dictionary where the keys are the record names
        # and each value is a dictionary mapping column names to lists of column values
        dfs = {
            table_name: pd.DataFrame.from_dict(table)
            for table_name, table in parsed_cases.items()
        }
        ```
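        
        The column-oriented shape described above can also be inspected without pandas. The following is a small sketch, assuming a hand-written stand-in for the output of parse(); the sample values and the helper name iter_rows are hypothetical and not part of the library.
        
        ```python
        # Hypothetical stand-in for the output of CaseParser.parse():
        # record name -> {column name -> list of column values}
        parsed_cases = {
            'PERSON': {'P03_SEX': [1, 2], 'P04_AGE': [34, 29]},
            'HOUSING': {'H05_ROOMS': [3]},
        }
        
        
        def iter_rows(table):
            """Yield one dict per row from a column-oriented table."""
            columns = list(table)
            # zip(*lists) walks all column lists in parallel, one row at a time
            for values in zip(*(table[column] for column in columns)):
                yield dict(zip(columns, values))
        
        
        person_rows = list(iter_rows(parsed_cases['PERSON']))
        print(person_rows)
        ```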
        
        ## Live Demo
        [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/amestsantim/pycspro-example/master?filepath=Using%20pyCSPro.ipynb)
        
        There is a Jupyter Notebook on Binder (great project!) that you can play with in a live environment, right in your browser, to see how easy this library is to use. Please take it for a spin!
        
        ## Syntax Checking
        This library uses a finite state machine to check the syntax of dictionaries. However, in the current version, some simplifying assumptions were made.
        These are:
        
        - Dictionaries are assumed to have only a single Level.
        - SubItems are not considered and will cause an error if present.
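        
        For readers unfamiliar with the technique, here is a toy sketch of how a finite state machine can check the order of section headers in a dictionary file. It is not the library's implementation: the transition table, the function name check_section_order and the section names it accepts are simplified assumptions made for illustration.
        
        ```python
        import re
        
        # Hypothetical, simplified transition table:
        # current state -> section headers that may come next
        ALLOWED = {
            'START': {'Dictionary'},
            'Dictionary': {'Level'},
            'Level': {'IdItems'},
            'IdItems': {'Item'},
            'Item': {'Item', 'ValueSet', 'Record'},
            'ValueSet': {'Item', 'ValueSet', 'Record'},
            'Record': {'Item'},
        }
        
        SECTION = re.compile(r'^\[(\w+)\]$')
        
        
        def check_section_order(text):
            """Raise SyntaxError on an unexpected section, else return the final state."""
            state = 'START'
            for number, line in enumerate(text.splitlines(), start=1):
                match = SECTION.match(line.strip())
                if match is None:
                    continue  # attribute lines such as Name=... do not change state
                section = match.group(1)
                if section not in ALLOWED[state]:
                    raise SyntaxError(f'line {number}: [{section}] cannot follow {state}')
                state = section
            return state
        
        
        sample = '[Dictionary]\nName=CEN2000\n[Level]\n[IdItems]\n[Item]\n[Record]\n[Item]'
        final_state = check_section_order(sample)
        print(final_state)
        ```
        
        In this toy table a second [Level] section has no valid transition, which mirrors the single-Level assumption listed above.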
        
        ## Performance
        When reading and loading cases directly from CSWeb's MySQL database, pass about 50,000 cases at a time to the CaseParser and convert the result into a Pandas DataFrame. In the next iteration, pass in another 50,000 cases, convert the result to a DataFrame, and append it to the previous DataFrame. That way, you can grow your DataFrame to a large size without consuming a lot of memory.
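        
        The batching described above can be sketched as follows. The helper name batches is hypothetical, and the commented usage assumes a case_parser and a list named cases as in the earlier examples, plus a PERSON record; only the batching helper itself runs here.
        
        ```python
        def batches(cases, batch_size=50_000):
            """Yield successive slices of cases with at most batch_size items each."""
            for start in range(0, len(cases), batch_size):
                yield cases[start:start + batch_size]
        
        
        # Usage with the library (not executed here; names are assumptions):
        # frames = []
        # for batch in batches(cases):
        #     parsed = case_parser.parse(batch)
        #     frames.append(pd.DataFrame.from_dict(parsed['PERSON']))
        # person = pd.concat(frames, ignore_index=True)
        
        # Stand-in list of 120,000 placeholder cases to show the batch sizes
        sizes = [len(batch) for batch in batches(list(range(120_000)))]
        print(sizes)  # [50000, 50000, 20000]
        ```
        
        Collecting the per-batch DataFrames in a list and concatenating once at the end is one way to do the appending; concatenating inside the loop also works but copies the accumulated data on every iteration.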
        
        ## To Dos
        Some edge cases, such as SPECIAL values, might not have been handled. If you run into such edge cases, please submit an issue (or even better, a pull request) and hopefully we will have them ironed out soon enough!
        
        ## License
        MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
