Metadata-Version: 2.1
Name: contextpro
Version: 2.0.0
Summary: Python library for concurrent text preprocessing
Home-page: https://gitlab.com/elzawie/contextpro
License: MIT
Keywords: concurrent-preprocessing,nlp
Author: Łukasz Zawieska
Author-email: zawieskal@yahoo.com
Requires-Python: >=3.6.1,<4.0.0
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: contractions (>=0.0.48,<0.0.49)
Requires-Dist: nltk (>=3.5,<4.0)
Requires-Dist: pandas (>=1.1.5,<2.0.0)
Requires-Dist: scipy (>=1.5.4,<2.0.0)
Requires-Dist: spacy (>=3.0.5,<4.0.0)
Requires-Dist: textblob (>=0.15.3,<0.16.0)
Requires-Dist: toml (>=0.10.2,<0.11.0)
Requires-Dist: wheel (>=0.36.2,<0.37.0)
Requires-Dist: word2number (>=1.1,<2.0)
Project-URL: Repository, https://gitlab.com/elzawie/contextpro
Description-Content-Type: text/markdown

# contextpro

[![pipeline status](https://gitlab.com/elzawie/contextpro/badges/master/pipeline.svg)](https://gitlab.com/elzawie/contextpro/-/commits/master)
[![coverage report](https://gitlab.com/elzawie/contextpro/badges/master/coverage.svg)](https://gitlab.com/elzawie/contextpro/-/commits/master)
[![License](https://img.shields.io/badge/license-MIT-blue)](https://gitlab.com/elzawie/contextpro/-/blob/master/LICENSE)


contextpro is a Python library for concurrent text preprocessing. It wraps functions from well-known NLP packages, including NLTK, spaCy and TextBlob, and runs them across multiple workers.

- **Documentation:** https://contextpro.readthedocs.io/en/latest/
- **Source code:** https://gitlab.com/elzawie/contextpro

## Installation

Windows / macOS / Linux:

- Installation with pip

    ```
    pip install contextpro
    python -m spacy download en_core_web_sm
    ```

- Installation with poetry

    ```
    poetry add contextpro
    python -m spacy download en_core_web_sm
    ```
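
To confirm that both the package and the spaCy model landed in the active environment, a check along these lines may help. It is a minimal sketch that uses only the standard library. The helper name `is_installed` is hypothetical and not part of contextpro, and the check assumes the model was installed as a regular Python package named `en_core_web_sm`, which is how `spacy download` normally installs it.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    # find_spec returns None when a top-level module cannot be found.
    return find_spec(module_name) is not None


for name in ("contextpro", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either name is reported as missing, rerun the matching install command from the virtual environment in which you intend to use the package.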

## Configuration

- Before using the package, run the following code in your virtual environment to download the required NLTK resources:

    ```python
    import nltk

    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")
    ```

## Usage examples

```python
from contextpro.normalization import batch_lowercase_text

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

result = batch_lowercase_text(
    corpus,
    num_workers=2
)

print(result)

[
    "my name is dr. jekyll.",
    "his name is mr. hyde",
    "this guy's name is edward scissorhands",
    "and this is tom parker"
]
```
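
The `batch_*` functions in these examples share one calling pattern: a list of documents goes in together with `num_workers`, and a list of the same length comes back. The sketch below illustrates that pattern with the standard library only. It is not contextpro's implementation: the helper `batch_apply` is hypothetical, and a thread pool is used purely to keep the example short (CPU-bound work would normally call for a process pool).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def batch_apply(
    func: Callable[[str], str], corpus: List[str], num_workers: int = 2
) -> List[str]:
    """Apply func to every document using a pool of num_workers threads."""
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map yields results in the same order as the input.
        return list(executor.map(func, corpus))


corpus = ["My name is Dr. Jekyll.", "His name is Mr. Hyde"]
print(batch_apply(str.lower, corpus, num_workers=2))
# ['my name is dr. jekyll.', 'his name is mr. hyde']
```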

```python
from contextpro.normalization import batch_remove_non_ascii_characters

corpus = [
    "https://sitebulb.com/Folder/øê.html?大学",
    "J\xf6reskog bi\xdfchen Z\xfcrcher",
    "This is a \xA9 but not a \xAE",
    "fractions \xBC, \xBD, \xBE"
]

result = batch_remove_non_ascii_characters(
    corpus,
    num_workers=2
)

print(result)

[
    "https://sitebulb.com/Folder/.html?",
    "Jreskog bichen Zrcher",
    "This is a  but not a ",
    "fractions , , "
]
```
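
For a single string, the same result can be reproduced with the standard library by encoding to ASCII and ignoring anything that does not fit. This is only a sketch of the idea; `remove_non_ascii` is a hypothetical helper, and contextpro may implement the step differently.

```python
def remove_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


print(remove_non_ascii("J\xf6reskog bi\xdfchen Z\xfcrcher"))
# Jreskog bichen Zrcher
print(remove_non_ascii("https://sitebulb.com/Folder/øê.html?大学"))
# https://sitebulb.com/Folder/.html?
```

Note that characters are dropped rather than transliterated, so `ö` disappears instead of becoming `o`.
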

```python
from contextpro.normalization import batch_replace_contractions

corpus = [
    "I don't want to be rude, but you shouldn't do this",
    "Do you think he'll pass his driving test?",
    "I'll see you next week",
    "I'm going for a walk"
]

result = batch_replace_contractions(
    corpus,
    num_workers=2
)

print(result)

[
    "I do not want to be rude, but you should not do this",
    "Do you think he will pass his driving test?",
    "I will see you next week",
    "I am going for a walk",
]
```

```python
from contextpro.normalization import batch_remove_stopwords

corpus = [
    ['My', 'name', 'is', 'Dr', 'Jekyll'],
    ['His', 'name', 'is', 'Mr', 'Hyde'],
    ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
    ['And', 'this', 'is', 'Tom', 'Parker']
]

result = batch_remove_stopwords(
    corpus,
    num_workers=2
)

print(result)

[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]
```
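
In the output above, capitalised words such as `'My'`, `'His'` and `'And'` survive while lowercase `'is'` and `'this'` are removed, which suggests that tokens are compared case-sensitively against a lowercase stopword list. The sketch below reproduces that behaviour with a tiny hand-picked stopword set and shows how lowercasing first changes the result. Both the set and the helper `filter_stopwords` are assumptions made for illustration, not contextpro's own list or code.

```python
from typing import List, Set

# Tiny illustrative stopword set; not the list contextpro uses.
STOPWORDS: Set[str] = {"my", "his", "is", "this", "and", "s"}


def filter_stopwords(tokens: List[str], stopwords: Set[str] = STOPWORDS) -> List[str]:
    """Keep tokens that are not in the stopword set (case-sensitive match)."""
    return [token for token in tokens if token not in stopwords]


tokens = ["And", "this", "is", "Tom", "Parker"]
print(filter_stopwords(tokens))
# ['And', 'Tom', 'Parker']
print(filter_stopwords([token.lower() for token in tokens]))
# ['tom', 'parker']
```

If capitalised stopwords should go as well, lowercasing the text before tokenization and stopword removal is one way to get there.
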

```python
from contextpro.normalization import batch_lemmatize

corpus = [
    ["I", "like", "driving", "a", "car"],
    ["I", "am", "going", "for", "a", "walk"],
    ["What", "are", "you", "doing"],
    ["Where", "are", "you", "coming", "from"]
]

result = batch_lemmatize(
    corpus,
    num_workers=2,
    pos="v"
)

print(result)

[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]
```

```python
from contextpro.normalization import batch_convert_numerals_to_numbers

corpus = [
    "A bunch of five",
    "A picture is worth a thousand words",
    "A stitch in time saves nine",
    "Back to square one",
    "Behind the eight ball",
    "Between two stools",
]

result = batch_convert_numerals_to_numbers(
    corpus,
    num_workers=2
)

print(result)

[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]
```

```python
from contextpro.feature_extraction import ConcurrentCountVectorizer

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

cvv = ConcurrentCountVectorizer(
    lowercase=True,
    remove_stopwords=True,
    ngram_range=(1, 1),
    num_workers=2
)

transformed = cvv.fit_transform(corpus)

print(cvv.get_feature_names())

[
    'dr', 'edward', 'guy', 'hyde', 'jekyll', 'mr',
    'name', 'parker', 'scissorhands', 'tom'
]

print(transformed.toarray())

[
    [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
]
```
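
As the output above shows, each row of the matrix corresponds to one document and each column to one entry of the feature-name list. To read a row, pair it with the feature names. The sketch below does this with the printed values copied in as plain lists, so it runs without contextpro; `features_in_row` is a hypothetical helper.

```python
from typing import Dict, List

feature_names = [
    "dr", "edward", "guy", "hyde", "jekyll", "mr",
    "name", "parker", "scissorhands", "tom",
]
counts = [
    [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
]


def features_in_row(names: List[str], row: List[int]) -> Dict[str, int]:
    """Map each feature that occurs in a document to its count."""
    return {name: count for name, count in zip(names, row) if count}


for row in counts:
    print(features_in_row(feature_names, row))
# {'dr': 1, 'jekyll': 1, 'name': 1}
# {'hyde': 1, 'mr': 1, 'name': 1}
# {'edward': 1, 'guy': 1, 'name': 1, 'scissorhands': 1}
# {'parker': 1, 'tom': 1}
```
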

```python
from contextpro.statistics import batch_calculate_corpus_statistics

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

statistics = batch_calculate_corpus_statistics(
    corpus,
    lowercase=False,
    remove_stopwords=False,
    num_workers=2,
)

print(statistics)

#     characters  tokens  punctuation_characters  digits  whitespace_characters  \
# 0          22       5                       2       0                      4
# 1          20       5                       1       0                      4
# 2          38       7                       1       0                      5
# 3          22       5                       0       0                      4
#
#         ascii_characters  sentiment_score  subjectivity_score
# 0                22              0.0                 0.0
# 1                20              0.0                 0.0
# 2                38              0.0                 0.0
# 3                22              0.0                 0.0
```
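
The character-level columns can be cross-checked by hand. The sketch below recomputes them for one document with the standard library. The column definitions used here (for example, punctuation meaning membership in `string.punctuation`) are assumptions that happen to match the table above, and `character_statistics` is a hypothetical helper. The token and sentiment columns are left out because they depend on the tokenizer and sentiment model in use.

```python
import string
from typing import Dict


def character_statistics(text: str) -> Dict[str, int]:
    """Count a few character classes in a single document."""
    return {
        "characters": len(text),
        "punctuation_characters": sum(ch in string.punctuation for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
        "whitespace_characters": sum(ch.isspace() for ch in text),
        # ord(ch) < 128 is used instead of str.isascii to stay compatible with Python 3.6.
        "ascii_characters": sum(ord(ch) < 128 for ch in text),
    }


print(character_statistics("My name is Dr. Jekyll."))
# {'characters': 22, 'punctuation_characters': 2, 'digits': 0, 'whitespace_characters': 4, 'ascii_characters': 22}
```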

## Release History

* 0.1.0
    * First release

## Meta

Łukasz Zawieska – zawieskal@yahoo.com

<a href="https://gitlab.com/elzawie/">GitLab account</a>

<a href="https://github.com/elzawie/">GitHub account</a>

Distributed under the MIT license. See <a href="https://gitlab.com/elzawie/contextpro/-/blob/master/LICENSE">LICENSE</a> for more information.

