Metadata-Version: 2.4
Name: codebase-indexer
Version: 1.0.0
Summary: A command-line tool for indexing and querying large codebases using AI
Home-page: https://github.com/RajwardhanShinde/Code-Indexer
Author: Rajwardhan Shinde
Author-email: rajshinde55553@example.com
Keywords: code,indexing,search,AI,embeddings,RAG,retrieval,Claude,OpenAI,Pinecone
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain
Requires-Dist: langchain-community
Requires-Dist: langchain-openai
Requires-Dist: langchain-anthropic
Requires-Dist: langchain-pinecone
Requires-Dist: openai
Requires-Dist: pinecone
Requires-Dist: anthropic
Requires-Dist: python-dotenv
Requires-Dist: tiktoken
Requires-Dist: tqdm
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Large Codebase Indexer

A command-line tool for indexing large codebases and enabling AI-powered queries.

## Overview

This tool allows you to index any codebase and query it using natural language. It leverages:

- OpenAI's embedding model for code semantics
- Pinecone vector database for efficient storage and retrieval
- Claude LLM for high-quality responses
- LangChain framework for integrating all components

## Installation

### Option 1: Install from PyPI (recommended)

```bash
# Install the package
pip install codebase-indexer

# Configure your API keys interactively
codebase-indexer configure
```

### Option 2: Install from source

1. Clone this repository:
   ```bash
   git clone https://github.com/RajwardhanShinde/Code-Indexer.git indexer
   cd indexer
   ```

2. Install the package in development mode:
   ```bash
   # Create a virtual environment (recommended)
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   
   # Install in development mode
   pip install -e .
   ```

3. Create a `.env` file with your API keys:
   ```bash
   cp .env.example .env
   # Edit .env with your actual API keys
   ```

### Option 3: Quick setup (using the install script)

```bash
./install.sh
```

### API Keys

This tool requires API keys for:
- OpenAI (for embeddings)
- Anthropic (for Claude LLM)
- Pinecone (for vector storage)

You can get these keys by signing up at:
- [OpenAI](https://platform.openai.com/)
- [Anthropic](https://www.anthropic.com/)
- [Pinecone](https://www.pinecone.io/)

#### Setting up API keys

You can configure API keys using the interactive CLI:

```bash
# Interactive configuration
codebase-indexer configure

# Non-interactive configuration
codebase-indexer configure --openai=your-openai-key --anthropic=your-anthropic-key --pinecone=your-pinecone-key
```

The configuration command will:
1. Create a `.env` file if it doesn't exist
2. Prompt for missing API keys (or use the ones provided via command-line arguments)
3. Allow you to select the Claude model to use
4. Validate that all required keys are set
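
The validation step can be pictured with a small standard-library sketch. It is not the tool's actual implementation: `parse_env` and `missing_keys` are hypothetical helper names, and the key names are assumed from the `.env` example in the Troubleshooting section.

```python
# Key names assumed from the .env example shown under Troubleshooting.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "PINECONE_API_KEY")


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in REQUIRED_KEYS order."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(parse_env(open(".env").read()))` returns an empty list when every required key has a value; any names it does return are the ones a prompt would still need to ask for.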

## Usage

### Indexing a Codebase

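Index a codebase to create vector embeddings and store them in Pinecone. The examples under Usage run `src/main.py` from a source checkout; the Troubleshooting section shows the same `index` subcommand invoked through the installed `codebase-indexer` command.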

```bash
python src/main.py index --path /path/to/your/codebase --index-name your-index-name
```

Options:
- `--path`: Path to the codebase directory (required)
- `--index-name`: Name of the Pinecone index (default: "codebase-index")
- `--namespace`: Namespace within the index for this codebase (default: directory name)
- `--chunk-size`: Size of code chunks (default: 500)
- `--chunk-overlap`: Overlap between chunks (default: 50)
- `--extensions`: Comma-separated list of file extensions to index (e.g., py,js,java)
- `--batch-size`: Batch size for indexing (default: 100)
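
To illustrate how `--chunk-size` and `--chunk-overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which this README does not state, and `chunk_text` is a hypothetical helper rather than the splitter the tool actually uses.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, each window starts 450 characters after the previous one, so the last 50 characters of a chunk reappear at the start of the next. The overlap helps keep a definition that straddles a boundary visible in at least one chunk.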

### Listing Files in a Codebase

List all files in a codebase or show file count by extension:

```bash
python src/main.py list --path /path/to/your/codebase --extensions py,js,java
python src/main.py list --path /path/to/your/codebase --count
```
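
For readers curious what a per-extension count involves, the following sketch reproduces the idea with the standard library. `count_by_extension` is a hypothetical name, and the tool's own traversal may differ, for example in which directories it ignores.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root, extensions=None):
    """Count files under root by extension, optionally limited to e.g. ["py", "js"]."""
    wanted = {ext.lower().lstrip(".") for ext in extensions} if extensions else None
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower().lstrip(".")
        if wanted is None or ext in wanted:
            # Files without a suffix are grouped under a placeholder label.
            counts[ext or "(none)"] += 1
    return counts
```

`count_by_extension("/path/to/your/codebase", ["py", "js", "java"])` returns a `Counter` keyed by extension, and `most_common()` on the result gives the counts in descending order.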

### Analyzing a Codebase

Analyze a codebase to extract project metadata:

```bash
python src/main.py analyze --path /path/to/your/codebase
```

### Scanning a Codebase

Scan a codebase to get information about files and languages:

```bash
python src/main.py scan --path /path/to/your/codebase
```

### Managing Indexes

List all Pinecone indexes:

```bash
python src/main.py list-indexes
```

Get statistics about an index:

```bash
python src/main.py stats --index-name your-index-name
```

Delete an index or namespace:

```bash
python src/main.py delete --index-name your-index-name
python src/main.py delete --index-name your-index-name --namespace your-namespace
```

### Querying the Indexed Codebase

Query the indexed codebase using natural language:

```bash
python src/main.py query --query "What does the authenticate_user function do?" --index-name your-index-name --namespace your-namespace
```

Options:
- `--query`: The query string (required)
- `--index-name`: Name of the Pinecone index (default: "codebase-index")
- `--namespace`: Namespace to query in Pinecone
- `--limit`: Maximum number of results to return (default: 5)
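
The role of `--limit` can be sketched as a top-k similarity search over stored vectors. In the real tool this search happens inside Pinecone; the code below is only an in-memory illustration that assumes cosine similarity, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, limit=5):
    """Return up to `limit` (chunk_id, score) pairs, highest score first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in indexed.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Here `indexed` stands in for one namespace: a mapping from chunk identifiers to embedding vectors. A larger `limit` supplies more context for the answer at the cost of a longer prompt.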

### Chat with Conversation History

Have a conversation with the codebase, maintaining context between questions:

```bash
python src/main.py chat --query "How does the file loading work?" --index-name your-index-name
python src/main.py chat --query "What parameters does it accept?" --index-name your-index-name
```
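
One common way to maintain context between questions is to fold recent turns into the next prompt. The sketch below shows that idea only; this README does not describe how the tool stores history between invocations, so `ChatHistory` and its prompt format are assumptions made for illustration.

```python
from collections import deque


class ChatHistory:
    """Keep the most recent question/answer turns and fold them into the next prompt."""

    def __init__(self, max_turns=5):
        # A bounded deque drops the oldest turn once max_turns is exceeded.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def build_prompt(self, question):
        lines = []
        for past_question, past_answer in self.turns:
            lines.append(f"User: {past_question}")
            lines.append(f"Assistant: {past_answer}")
        lines.append(f"User: {question}")
        return "\n".join(lines)
```

With the earlier turn about file loading in the history, a follow-up such as "What parameters does it accept?" reaches the model together with the context that "it" refers to.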

### Find Related Code

Get code snippets related to a query without generating an answer:

```bash
python src/main.py related --query "error handling" --index-name your-index-name --limit 10
```

### Testing the CLI (without API keys)

You can use the test CLI script for commands that don't require API keys:

```bash
python src/test_cli.py list --path /path/to/your/codebase --count
python src/test_cli.py analyze --path /path/to/your/codebase
python src/test_cli.py file --path /path/to/your/codebase/some_file.py
```

## Command-Line Interface

The tool provides the following commands:

- `configure`: Set up API keys and select the Claude model
- `index`: Index a codebase
- `list`: List files in a codebase
- `analyze`: Analyze a codebase and extract project metadata
- `scan`: Scan a codebase and output file statistics
- `list-indexes`: List all Pinecone indexes
- `stats`: Show statistics about an index
- `delete`: Delete an index or namespace
- `query`: Query the indexed codebase
- `chat`: Chat with the codebase using conversation history
- `related`: Get code snippets related to a query

Use `--help` with any command to see available options:

```bash
python src/main.py --help
python src/main.py index --help
```

## Development Status

This project is being developed in phases:

1. ✅ Environment Setup
2. ✅ Command-Line Tool Framework
3. ✅ Codebase Indexing
4. ✅ RAG System and Agent Development
5. 🔜 Testing, Refinement, and Deployment

## Project Structure

```
indexer/
├── docs/
│   └── ADR.md           # Architecture Decision Record
├── src/
│   ├── agents/          # RAG agent implementation
│   ├── indexers/        # Code indexing functionality
│   ├── models/          # OpenAI and Claude model wrappers
│   ├── utils/           # Utility functions
│   ├── main.py          # Main CLI entry point
│   └── test_cli.py      # Test CLI (no API keys required)
├── .env.example         # Example environment variables
├── README.md            # This file
├── requirements.txt     # Python dependencies
└── setup.py             # Package installation
```

## Current Features

### Milestone 1: Environment Setup
- ✅ Virtual environment and dependency management
- ✅ Configuration and API key handling
- ✅ Logging setup
- ✅ Basic project structure

### Milestone 2: Command-Line Tool Framework
- ✅ Argument parsing and command handling
- ✅ Directory traversal for any codebase path
- ✅ File filtering by extension
- ✅ Project metadata extraction
- ✅ Code analysis for supported languages
- ✅ Test CLI for verification without API keys

### Milestone 3: Codebase Indexing
- ✅ Loading and chunking code files
- ✅ Generating embeddings with OpenAI
- ✅ Storing embeddings in Pinecone DB
- ✅ Namespace support for multiple codebases
- ✅ Index management (create, delete, stats)
- ✅ Batch processing for large codebases

### Milestone 4: RAG System and Agent Development
- ✅ Semantic code retrieval via embeddings
- ✅ Natural language querying of code
- ✅ Conversational interface with memory
- ✅ Code-specific prompt engineering
- ✅ Finding related code snippets
- ✅ Integration with Claude LLM for high-quality responses

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Troubleshooting

### Common Issues

1. **API key errors**: Make sure your `.env` file contains a valid value for each of the three API keys:

   ```bash
   OPENAI_API_KEY=your-openai-api-key
   ANTHROPIC_API_KEY=your-anthropic-api-key
   PINECONE_API_KEY=your-pinecone-api-key
   ```

2. **Package not found errors**: If Python reports missing packages, reinstall the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. **Model not found**: If the configured Claude model is not available to your account, update the model name in `src/utils/config.py`:

   ```python
   # Try using a different model name if the current one isn't accessible
   LLM_MODEL = "claude-3-haiku-20240307"  # or another available model
   ```

4. **Rate limiting**: If you hit API rate limits, try reducing the batch size when indexing:

   ```bash
   codebase-indexer index --path /path/to/codebase --batch-size 50
   ```
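
For the rate-limiting case above, a smaller batch size means each request carries fewer chunks. A minimal sketch of the batching itself, where `make_batches` is a hypothetical helper and not the tool's own code:

```python
def make_batches(items, batch_size=100):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Halving the batch size from 100 to 50 doubles the number of requests but halves the payload of each one, which can help when a provider limits the amount of data sent per request or per minute.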

For more help, please open an issue on GitHub.
