Inference Optimizer - MCP Server

An MCP (Model Context Protocol) server that provides prompt compression tools for Claude Code and other MCP-compatible clients. Reduce LLM token usage by 50-80% using Microsoft's LLMlingua technology.

⚠️ EXPERIMENTAL PROJECT

This is an experimental project created for educational purposes. It is not production-ready and is not intended for production use. Use it at your own risk, for exploration and experimentation only.

Features

  • MCP Server: Integrates directly with Claude Code and other MCP clients
  • Prompt Compression: Reduce token usage by 50-80% using LLMlingua-2
  • Two Compression Tools:
    • compress_text: Compress single prompts or documents
    • compress_multiple_texts: Compress multiple context pieces together
  • Flexible Configuration: Environment-based configuration
  • Device Support: CPU, CUDA (NVIDIA), and MPS (Apple Silicon)
  • Intelligent Compression: Preserve important context while removing redundancy

Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd inference-optimizer

# Create virtual environment with uv
uv venv

# Activate the environment
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate  # Windows

# Install dependencies
uv pip install -e .

Test the Server

# Run the MCP server directly (stdio mode)
uv run python -m src.mcp_server

# The server will wait for MCP protocol messages on stdin
# Press Ctrl+C to exit

# Or test that it loads correctly
uv run python -c "from src.mcp_server import mcp; print('✓ Server ready')"

Claude Code Setup Guide

This section walks you through setting up the Inference Optimizer MCP server with Claude Code.

Prerequisites

  • Claude Code installed and running
  • Inference Optimizer cloned and dependencies installed (see Quick Start)
  • Python 3.10+ with uv package manager

Add the MCP Server

Navigate to the inference-optimizer directory and add the MCP server using the claude mcp add command:

For Apple Silicon (M1/M2/M3):

cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=mps -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server

For NVIDIA GPU:

cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=cuda -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server

For CPU (works everywhere):

cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=cpu -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server

Note: Run this command from inside the inference-optimizer directory; the server's working directory is set to wherever the command is run.

Restart Claude Code

After adding the server, completely restart Claude Code:

  1. Quit Claude Code (don't just close the window)
  2. Start Claude Code again
  3. Wait for it to fully load (5-10 seconds for model loading)

Verify Installation

In a new Claude Code conversation, ask:

Can you show me what tools you have available?

Claude should list compress_text and compress_multiple_texts among the available tools.

Test Compression

Try compressing some text:

Please compress this text for me:

[paste a long paragraph or document here]

Claude should automatically call the compress_text tool and show you the results.

Configuration

Environment Variables

Create a .env file or set environment variables:

# Copy example configuration
cp .env.example .env

# Edit configuration
nano .env

Available Options

Variable                    Description                       Default
MODEL_NAME                  HuggingFace model name            microsoft/llmlingua-2-xlm-roberta-large-meetingbank
DEVICE_MAP                  Device to run on                  cpu
USE_LLMLINGUA2              Use LLMlingua-2 (more accurate)   true
LOG_LEVEL                   Logging level                     INFO
DEFAULT_COMPRESSION_RATE    Default compression rate          0.5
DEFAULT_RANK_METHOD         Default ranking algorithm         longllmlingua

DEVICE_MAP options: cpu, cuda (NVIDIA GPU), mps (Apple Silicon).

Device Configuration

For Apple Silicon (M1/M2/M3):

DEVICE_MAP=mps

For NVIDIA GPU:

DEVICE_MAP=cuda

For CPU (default, works everywhere):

DEVICE_MAP=cpu
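
Putting these together, a complete .env would look like the sketch below (all values are the documented defaults; adjust DEVICE_MAP as shown above for your hardware):

MODEL_NAME=microsoft/llmlingua-2-xlm-roberta-large-meetingbank
DEVICE_MAP=cpu
USE_LLMLINGUA2=true
LOG_LEVEL=INFO
DEFAULT_COMPRESSION_RATE=0.5
DEFAULT_RANK_METHOD=longllmlingua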

Advanced Configuration

Custom Model

To use a different LLMlingua model in your Claude Code settings:

"env": {
  "MODEL_NAME": "microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
  "DEVICE_MAP": "mps"
}

Adjust Default Compression

"env": {
  "DEFAULT_COMPRESSION_RATE": "0.7",
  "DEFAULT_RANK_METHOD": "llmlingua",
  "DEVICE_MAP": "mps"
}

Minimal Logging

For less verbose output:

"env": {
  "LOG_LEVEL": "WARNING",
  "DEVICE_MAP": "mps"
}

Debug Mode

For troubleshooting:

"env": {
  "LOG_LEVEL": "DEBUG",
  "DEVICE_MAP": "mps"
}

Tools Reference

compress_text

Compress a single text, prompt, or document.

Parameters:

  • text (required): The text to compress
  • rate (optional, default 0.5): Compression rate (0.1-0.9)
    • Lower values = more aggressive compression
    • 0.5 = 50% compression, 0.3 = 70% compression
  • instruction (optional): Instruction to preserve
  • question (optional): Question to guide what to preserve
  • rank_method (optional): longllmlingua or llmlingua

Returns:

  • compressed_text: The compressed result
  • original_tokens: Token count before compression
  • compressed_tokens: Token count after compression
  • compression_ratio: How much was compressed (e.g., "2.0x")
  • token_saving: Tokens saved (e.g., "500 tokens saved")
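
The server builds on Microsoft's LLMlingua, so a roughly equivalent direct call against the llmlingua Python package looks like the sketch below. This is an illustration only, not the project's actual compression_service code; the output keys follow the llmlingua package and presumably map onto the field names listed above.

from llmlingua import PromptCompressor

# Loading the LLMlingua-2 model is the slow step (roughly 5-10 seconds).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",
)

long_text = " ".join(["Neural networks learn layered representations from data."] * 200)

# rate=0.5 keeps roughly half of the tokens.
result = compressor.compress_prompt(long_text, rate=0.5)

print(result["compressed_prompt"])   # corresponds to compressed_text
print(result["origin_tokens"])       # corresponds to original_tokens
print(result["compressed_tokens"])   # corresponds to compressed_tokens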

compress_multiple_texts

Compress multiple text segments together while preserving relationships.

Parameters:

  • texts (required): List of text strings
  • rate (optional, default 0.5): Compression rate
  • instruction (optional): Instruction to preserve
  • question (optional): Question to guide compression
  • rank_method (optional): Ranking algorithm

Returns: Same format as compress_text

Examples

Example 1: Compress a Long Document

You: Here's a 5000 word article about machine learning.
Can you compress it to focus on the key points about neural networks?

[paste article]

Claude: I'll compress this with focus on neural networks.
[calls compress_text with question="What are the key points about neural networks?"]

Result: Compressed from 5000 tokens to 1500 tokens (3.3x compression)

Example 2: Compress Multiple Files

You: I have 3 files related to authentication. Compress them together:

File 1: [auth.py code]
File 2: [token.py code]
File 3: [middleware.py code]

Claude: I'll compress these related files together.
[calls compress_multiple_texts with the three files]

Result: Combined 8000 tokens compressed to 2400 tokens

Example 3: Cost Savings

You: Before sending this to the API, compress it aggressively (70% compression)

Claude: [calls compress_text with rate=0.3]

Original: 2000 tokens → Compressed: 600 tokens
Savings: 1400 tokens saved
At $0.03/1k tokens: Saved $0.042 per request

Example 4: Compress with Specific Rate

You: Compress this with 30% compression (rate 0.7):

[paste text]

Claude: [calls compress_text with rate=0.7]

Example 5: Large Codebase Context

You: Before we work with this large codebase, compress these files first:

[paste multiple files]

Claude: [calls compress_multiple_texts]

Compression Parameters

Rate

The rate parameter controls compression aggressiveness:

  • 0.9: Light compression (10% reduction) - safest, minimal information loss
  • 0.7: Moderate compression (30% reduction) - balanced
  • 0.5: Default (50% reduction) - good balance
  • 0.3: Aggressive (70% reduction) - maximum savings, some information loss
  • 0.1: Extreme (90% reduction) - highest savings, significant information loss
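
A quick way to reason about these numbers: rate is approximately the fraction of tokens kept, so the expected output size is simply original_tokens * rate, as this small sketch shows:

# rate ~= fraction of tokens kept after compression
original_tokens = 2000
for rate in (0.9, 0.7, 0.5, 0.3, 0.1):
    kept = round(original_tokens * rate)
    print(f"rate={rate}: keep ~{kept} tokens, remove ~{original_tokens - kept}")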

Rank Method

  • longllmlingua: Better for long contexts (default)
  • llmlingua: Faster, good for shorter prompts

Instruction & Question

Use these to guide compression:

  • instruction: Preserved context (e.g., "You are a helpful assistant")
  • question: Guides what content to keep (e.g., "What are the main security concerns?")
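
A sketch of how these parameters map onto an LLMlingua call, assuming the server passes them straight through to the llmlingua package (whose compress_prompt accepts the same names):

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",
)

report = " ".join(["The audit found issues with session handling and logging."] * 50)

result = compressor.compress_prompt(
    report,
    rate=0.5,
    instruction="You are a helpful assistant",        # preserved context
    question="What are the main security concerns?",  # guides what to keep
)
print(result["compressed_prompt"])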

Performance

Typical compression times (CPU):

Original Tokens   Compressed Tokens   Rate   Time
2000              1000                0.5    ~2s
4000              1200                0.3    ~4s
8000              2400                0.3    ~8s

With GPU/MPS: 2-3x faster
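
Timings vary a lot by hardware, so it is worth measuring on your own machine. A minimal timing sketch, assuming the llmlingua package installed by this project's dependencies:

import time
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",  # try "cuda" or "mps" to compare
)

text = " ".join(["The meeting covered budget, hiring, and roadmap items."] * 300)

start = time.perf_counter()
compressor.compress_prompt(text, rate=0.5)
print(f"compressed in {time.perf_counter() - start:.1f}s")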

Development

Project Structure

inference-optimizer/
├── src/
│   ├── __init__.py
│   ├── mcp_server.py           # MCP server with tool definitions
│   ├── compression_service.py  # Core compression logic
│   └── cli.py                  # Command-line interface (optional)
├── config.py                   # Configuration management
├── main.py                     # Entry point
├── pyproject.toml              # Project metadata and dependencies
├── .env.example                # Example configuration
├── CLAUDE.md                   # AI assistant context
└── README.md                   # This file

Running Tests

# Install dev dependencies
uv pip install -e ".[dev]"

# Format code
black src/

# Lint code
ruff check src/

Local Testing

Test the MCP server locally:

# Direct execution
uv run python -m src.mcp_server

# Or via main.py
uv run python main.py

For interactive testing, you can use an MCP client or integrate with Claude Code.

Troubleshooting

Tools Don't Appear in Claude Code

Problem: The compression tools don't show up in Claude's available tools

Solutions:

  1. Check that settings.json is valid JSON (no trailing commas, proper quotes)
  2. Verify the cwd path is absolute and correct
  3. Check Claude Code logs for error messages
  4. Ensure dependencies are installed: uv pip install -e .
  5. Restart Claude Code completely (quit and reopen)

Module Not Found Error

Problem: "Module not found" or "No module named 'src'"

Solutions:

  1. Make sure the cwd path in settings.json points to the project root
  2. Verify installation: uv pip install -e .
  3. Check that src/mcp_server.py exists in the project directory

Model Loading Issues

Problem: Model fails to download or load

Solutions:

  • Ensure internet connection (first download requires internet)
  • Check disk space (model is ~2GB)
  • Clear HuggingFace cache: rm -rf ~/.cache/huggingface/
  • Check logs for specific error messages

Device Issues

Problem: CUDA/MPS not working

Solutions:

  • Check device availability:
    import torch
    print(torch.cuda.is_available())  # CUDA
    print(torch.backends.mps.is_available())  # MPS
  • Fall back to CPU: "DEVICE_MAP": "cpu" in settings, or detect the device automatically (see the sketch below)
  • Verify drivers are installed (CUDA requires NVIDIA drivers)
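
If you want the choice made automatically, a small helper along these lines works (a hypothetical snippet; the project itself reads DEVICE_MAP from the environment instead):

import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())  # use this value for DEVICE_MAP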

Server is Slow to Start

Problem: First compression takes a long time

Solutions:

  1. First startup takes 5-10 seconds to load the model
  2. Subsequent tool calls are much faster
  3. Consider using GPU/MPS for better performance
  4. This is expected behavior - the model needs to load into memory

MCP Server Not Showing in Claude Code

Problem: Server doesn't appear to be running

Solutions:

  • Verify settings.json syntax is correct (valid JSON)
  • Use absolute path in cwd field
  • Check logs: look for MCP server startup messages
  • Restart Claude Code completely
  • Ensure dependencies are installed: uv pip install -e .

Performance Issues

Problem: Compression is too slow

Solutions:

  • Enable GPU: DEVICE_MAP=cuda or DEVICE_MAP=mps
  • Reduce compression rate for faster processing
  • Use rank_method=llmlingua for shorter texts
  • Check system resources (RAM usage)

Compression Quality Issues

Problem: Important information is lost

Solutions:

  • Increase rate (e.g., 0.3 → 0.5 or 0.7)
  • Use question parameter to guide preservation
  • Use instruction to preserve system context
  • Try different rank_method

Getting Logs

To see detailed logs from the MCP server:

  1. Set "LOG_LEVEL": "DEBUG" in your Claude Code configuration
  2. Restart Claude Code
  3. Check Claude Code's console/logs for MCP server output
  4. Look for messages about model initialization and tool calls

Need More Help?

If you encounter issues not covered here:

  1. Check the CLAUDE.md file for technical details
  2. Open an issue on GitHub with:
    • Your settings.json configuration (remove sensitive data)
    • Error messages from Claude Code logs
    • Your operating system and Python version
    • Output from uv run python -c "from src.mcp_server import mcp; print('✓ Server ready')"

Use Cases

These are theoretical use cases for educational exploration:

  1. Cost Reduction: Experiment with reducing API costs by 50-80%
  2. Context Window Optimization: Fit more information in limited token windows
  3. RAG System Prototypes: Compress retrieved documents before LLM processing
  4. Conversation History: Compress multi-turn conversation history
  5. Batch Document Processing: Compress collections of related documents
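
For the RAG prototype case (use case 3), the idea is to compress retrieved documents, guided by the user's question, before they are placed in the prompt. A sketch using the llmlingua package directly (illustrative only; inside Claude Code you would simply ask Claude to use compress_multiple_texts with a question):

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",
)

# Pretend these came back from a retrieval step.
retrieved_docs = [
    "Doc 1: password hashing uses bcrypt with a per-user salt ...",
    "Doc 2: sessions are stored server-side and expire after 30 minutes ...",
    "Doc 3: the middleware rejects requests without a valid token ...",
]

result = compressor.compress_prompt(
    retrieved_docs,
    rate=0.3,  # aggressive: keep roughly 30% of the tokens
    question="How is authentication handled?",
)
print(result["compressed_prompt"])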

Limitations

This is an experimental, educational project:

  • Not production-ready
  • No warranty or guarantees
  • Security not audited
  • Limited testing coverage
  • Compression is lossy - test with your use case
  • Cold start: initial model load takes 5-10 seconds
  • Memory intensive: ~2-4GB RAM required

Acknowledgments

Built with:

  • Microsoft's LLMlingua / LLMlingua-2 for prompt compression
  • The Model Context Protocol (MCP) for integration with Claude Code

License

MIT License - See LICENSE file for details

Contributing

Contributions welcome! See issues for current tasks or submit pull requests.

For questions or issues, please use GitHub Issues.
