Metadata-Version: 2.4
Name: vision-parse
Version: 0.1.8
Summary: Parse PDF documents into markdown formatted content using Vision LLMs
Project-URL: Homepage, https://github.com/iamarunbrahma/vision-parse
Project-URL: Repository, https://github.com/iamarunbrahma/vision-parse.git
Author-email: Arun Brahma <mithubrahma@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: llm,markdown,ocr,pdf,vision
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Requires-Dist: jinja2>=3.0.0
Requires-Dist: nest-asyncio>=1.6.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: ollama>=0.4.4
Requires-Dist: opencv-python>=4.10.0.84
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf>=1.22.0
Requires-Dist: tenacity>=9.0.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: all
Requires-Dist: google-generativeai==0.8.3; extra == 'all'
Requires-Dist: openai==1.58.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: black>=24.4.1; extra == 'dev'
Requires-Dist: black[jupyter]>=24.8.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.5; extra == 'dev'
Requires-Dist: pytest>=8.3.4; extra == 'dev'
Requires-Dist: ruff>=0.8.3; extra == 'dev'
Provides-Extra: gemini
Requires-Dist: google-generativeai==0.8.3; extra == 'gemini'
Provides-Extra: openai
Requires-Dist: openai==1.58.0; extra == 'openai'
Description-Content-Type: text/markdown

# Vision Parse

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Author: Arun Brahma](https://img.shields.io/badge/Author-Arun%20Brahma-purple)](https://github.com/iamarunbrahma)
[![PyPI version](https://img.shields.io/pypi/v/vision-parse.svg)](https://pypi.org/project/vision-parse/)

> 🚀 Parse PDF documents into beautifully formatted markdown content using state-of-the-art Vision Language Models - all with just a few lines of code!

## 🎯 Introduction

Vision Parse uses Vision Language Models to convert PDF documents into markdown:

- 📝 **Smart Content Extraction**: Intelligently identifies and extracts text and tables with high precision
- 🎨 **Content Formatting**: Preserves document hierarchy, styling, and indentation for markdown formatted content
- 🤖 **Multi-LLM Support**: Supports multiple Vision LLM providers, e.g. OpenAI, Llama, and Gemini, so you can pick a model for accuracy or speed
- 🔄 **PDF Document Support**: Handles multi-page PDF documents by converting each page into a base64-encoded image
- 📁 **Local Model Hosting**: Supports local model hosting using Ollama for secure document processing and for offline use
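
The page-to-image step in the list above can be pictured roughly as follows. This is a minimal sketch, not necessarily how Vision Parse implements it: it assumes PyMuPDF (already a dependency) does the rendering, and the helper names `to_base64_png` and `render_pages_as_base64` as well as the 150 DPI default are illustrative.

```python
import base64
from typing import Iterator


def to_base64_png(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as an ASCII base64 string."""
    return base64.b64encode(png_bytes).decode("ascii")


def render_pages_as_base64(pdf_path: str, dpi: int = 150) -> Iterator[str]:
    """Yield one base64-encoded PNG per PDF page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the encoder above works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            pixmap = page.get_pixmap(dpi=dpi)  # rasterize the page
            yield to_base64_png(pixmap.tobytes("png"))
```

Each yielded string can then be sent to a vision model as an image payload.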


## 🚀 Getting Started

### Prerequisites

- 🐍 Python >= 3.9
- 🖥️ Ollama (if you want to use local models)
- 🤖 API key for OpenAI or Google Gemini (only if you want to use those providers)

### Installation

**Install the core package using pip (Recommended):**

```bash
pip install vision-parse
```

**Install the additional dependencies for OpenAI or Gemini:**

```bash
# For OpenAI support
pip install 'vision-parse[openai]'
```

```bash
# For Gemini support
pip install 'vision-parse[gemini]'
```

```bash
# To install all the additional dependencies
pip install 'vision-parse[all]'
```

**Install the package from source:**

```bash
pip install 'vision-parse[all] @ git+https://github.com/iamarunbrahma/vision-parse.git'
```

### Setting up Ollama (Optional)
See [examples/ollama_setup.md](examples/ollama_setup.md) for how to set up Ollama locally.

## ⌛️ Usage

### Basic Example Usage

```python
from vision_parse import VisionParser

# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b", # For local models, you don't need to provide the api key
    temperature=0.4,
    top_p=0.5,
    image_mode="url", # Image mode can be "url", "base64" or None
    detailed_extraction=False, # Set to True for more detailed extraction
    enable_concurrency=False, # Set to True for parallel processing
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf" # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
```
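
If you would rather keep the result than print it, the pages can be written to a single markdown file. The sketch below assumes `convert_pdf` returns one markdown string per page, as the loop above suggests; `save_markdown` and the `---` page separator are illustrative choices, not part of the Vision Parse API.

```python
from pathlib import Path
from typing import Sequence


def save_markdown(pages: Sequence[str], out_path: str) -> Path:
    """Join per-page markdown with a separator and write it to disk."""
    separator = "\n\n---\n\n"  # horizontal rule between pages (arbitrary choice)
    target = Path(out_path)
    target.write_text(separator.join(pages), encoding="utf-8")
    return target
```

For example, `save_markdown(markdown_pages, "document.md")` writes every page to `document.md` in the current directory.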

### Customize Ollama Configuration for Parallel Processing

```python
from vision_parse import VisionParser

# Initialize parser with Ollama configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.6,
    num_ctx=4096,
    image_mode="base64",
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_NUM_PARALLEL": "4",
        "OLLAMA_REQUEST_TIMEOUT": "240.0",
    },
    enable_concurrency=True,
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)
```

### OpenAI or Gemini Model Usage

```python
from vision_parse import VisionParser

# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key", # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=True, # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Alternatively, initialize parser with Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key", # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=True, # Set to True for more detailed extraction
    enable_concurrency=True,
)
```
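
Hardcoding keys in source files is easy to leak, so one option is to read them from the environment. This is a small sketch under assumptions: `require_api_key` is a hypothetical helper, and the variable names `OPENAI_API_KEY` and `GEMINI_API_KEY` are conventions chosen here, not names that Vision Parse is documented to read on its own.

```python
import os


def require_api_key(env_var: str) -> str:
    """Return the key stored in `env_var`, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

You could then pass `api_key=require_api_key("OPENAI_API_KEY")` (or `"GEMINI_API_KEY"`) to `VisionParser` instead of a string literal.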

## ✅ Supported Models

This package supports the following Vision LLMs:

- OpenAI: `gpt-4o`, `gpt-4o-mini`
- Google Gemini: `gemini-1.5-flash`, `gemini-2.0-flash-exp`, `gemini-1.5-pro`
- Meta Llama and LLaVA via Ollama: `llava:13b`, `llava:34b`, `llama3.2-vision:11b`, `llama3.2-vision:70b`
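
To check a model name before constructing a parser, the list above can be mirrored in a small lookup table. This is only a sketch: `SUPPORTED_MODELS` and `provider_for` are hypothetical names that copy this README's list, not the package's internal registry, so they may drift from what a given release accepts.

```python
# Mirrors the README list above; not read from the vision_parse package itself.
SUPPORTED_MODELS = {
    "openai": ["gpt-4o", "gpt-4o-mini"],
    "gemini": ["gemini-1.5-flash", "gemini-2.0-flash-exp", "gemini-1.5-pro"],
    "ollama": [
        "llava:13b",
        "llava:34b",
        "llama3.2-vision:11b",
        "llama3.2-vision:70b",
    ],
}


def provider_for(model_name: str) -> str:
    """Return the provider key for a listed model, or raise ValueError."""
    for provider, models in SUPPORTED_MODELS.items():
        if model_name in models:
            return provider
    raise ValueError(f"{model_name!r} is not in the documented model list")
```

A result of `"ollama"` indicates a local model, which per the usage examples needs no API key.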

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
