Inference Optimizer - MCP Server

An MCP (Model Context Protocol) server that provides prompt compression tools for Claude Code and other MCP-compatible clients. Reduce LLM token usage by 50-80% using Microsoft's LLMlingua technology.

⚠️ EXPERIMENTAL PROJECT

This is an experimental project created for educational purposes. It is not production-ready and is not intended for production use. Use it at your own risk, for exploration and experimentation only.

Features

  • MCP Server: Integrates directly with Claude Code and other MCP clients
  • Prompt Compression: Reduce token usage by 50-80% using LLMlingua-2
  • Two Compression Tools:
    • compress_text: Compress single prompts or documents
    • compress_multiple_texts: Compress multiple context pieces together
  • Flexible Configuration: Environment-based configuration
  • Device Support: CPU, CUDA (NVIDIA), and MPS (Apple Silicon)
  • Intelligent Compression: Preserve important context while removing redundancy

Quick Start

Installation

# Clone the repository
git clone <repository-url>
cd inference-optimizer

# Create virtual environment with uv
uv venv

# Activate the environment
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate  # Windows

# Install dependencies
uv pip install -e .

Test the Server

# Run the MCP server directly (stdio mode)
uv run python -m src.mcp_server

# The server will wait for MCP protocol messages on stdin
# Press Ctrl+C to exit

# Or test that it loads correctly
uv run python -c "from src.mcp_server import mcp; print('✓ Server ready')"

Claude Code Setup Guide

This section walks you through setting up the Inference Optimizer MCP server with Claude Code.

Prerequisites

  • Claude Code installed and running
  • Inference Optimizer cloned and dependencies installed (see Quick Start)
  • Python 3.10+ with uv package manager

Add the MCP Server

Navigate to the inference-optimizer directory and add the MCP server using the claude mcp add command:

For Apple Silicon (M1/M2/M3):

cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=mps -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server

For NVIDIA GPU:

cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=cuda -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server

For CPU (works everywhere):

cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=cpu -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server

Note: Run this command from inside the inference-optimizer directory; the server's working directory is set to wherever the command is run.

Restart Claude Code

After adding the server, completely restart Claude Code:

  1. Quit Claude Code (don't just close the window)
  2. Start Claude Code again
  3. Wait for it to fully load (5-10 seconds for model loading)

Verify Installation

In a new Claude Code conversation, ask:

Can you show me what tools you have available?

Claude should list compress_text and compress_multiple_texts among the available tools.

Test Compression

Try compressing some text:

Please compress this text for me:

[paste a long paragraph or document here]

Claude should automatically call the compress_text tool and show you the results.

Configuration

Environment Variables

Create a .env file or set environment variables:

# Copy example configuration
cp .env.example .env

# Edit configuration
nano .env

Available Options

Variable                    Description                       Default
MODEL_NAME                  HuggingFace model name            microsoft/llmlingua-2-xlm-roberta-large-meetingbank
DEVICE_MAP                  Device to run on                  cpu
USE_LLMLINGUA2              Use LLMlingua-2 (more accurate)   true
LOG_LEVEL                   Logging level                     INFO
DEFAULT_COMPRESSION_RATE    Default compression rate          0.5
DEFAULT_RANK_METHOD         Default ranking algorithm         longllmlingua

DEVICE_MAP options: cpu, cuda (NVIDIA GPU), mps (Apple Silicon).

Device Configuration

For Apple Silicon (M1/M2/M3):

DEVICE_MAP=mps

For NVIDIA GPU:

DEVICE_MAP=cuda

For CPU (default, works everywhere):

DEVICE_MAP=cpu
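
Putting these together, a complete .env would look like the sketch below (all values are the documented defaults; adjust DEVICE_MAP as shown above for your hardware):

MODEL_NAME=microsoft/llmlingua-2-xlm-roberta-large-meetingbank
DEVICE_MAP=cpu
USE_LLMLINGUA2=true
LOG_LEVEL=INFO
DEFAULT_COMPRESSION_RATE=0.5
DEFAULT_RANK_METHOD=longllmlingua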

Advanced Configuration

Custom Model

To use a different LLMlingua model in your Claude Code settings:

"env": {
  "MODEL_NAME": "microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
  "DEVICE_MAP": "mps"
}

Adjust Default Compression

"env": {
  "DEFAULT_COMPRESSION_RATE": "0.7",
  "DEFAULT_RANK_METHOD": "llmlingua",
  "DEVICE_MAP": "mps"
}

Minimal Logging

For less verbose output:

"env": {
  "LOG_LEVEL": "WARNING",
  "DEVICE_MAP": "mps"
}

Debug Mode

For troubleshooting:

"env": {
  "LOG_LEVEL": "DEBUG",
  "DEVICE_MAP": "mps"
}

Tools Reference

compress_text

Compress a single text, prompt, or document.

Parameters:

  • text (required): The text to compress
  • rate (optional, default 0.5): Compression rate (0.1-0.9)
    • Lower values = more aggressive compression
    • 0.5 = 50% compression, 0.3 = 70% compression
  • instruction (optional): Instruction to preserve
  • question (optional): Question to guide what to preserve
  • rank_method (optional): longllmlingua or llmlingua

Returns:

  • compressed_text: The compressed result
  • original_tokens: Token count before compression
  • compressed_tokens: Token count after compression
  • compression_ratio: How much was compressed (e.g., "2.0x")
  • token_saving: Tokens saved (e.g., "500 tokens saved")
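
The server builds on Microsoft's LLMlingua, so a roughly equivalent direct call against the llmlingua Python package looks like the sketch below. This is an illustration only, not the project's actual compression_service code; the output keys follow the llmlingua package and presumably map onto the field names listed above.

from llmlingua import PromptCompressor

# Loading the LLMlingua-2 model is the slow step (roughly 5-10 seconds).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",
)

long_text = " ".join(["Neural networks learn layered representations from data."] * 200)

# rate=0.5 keeps roughly half of the tokens.
result = compressor.compress_prompt(long_text, rate=0.5)

print(result["compressed_prompt"])   # corresponds to compressed_text
print(result["origin_tokens"])       # corresponds to original_tokens
print(result["compressed_tokens"])   # corresponds to compressed_tokens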

compress_multiple_texts

Compress multiple text segments together while preserving relationships.

Parameters:

  • texts (required): List of text strings
  • rate (optional, default 0.5): Compression rate
  • instruction (optional): Instruction to preserve
  • question (optional): Question to guide compression
  • rank_method (optional): Ranking algorithm

Returns: Same format as compress_text

Examples

Example 1: Compress a Long Document

You: Here's a 5000 word article about machine learning.
Can you compress it to focus on the key points about neural networks?

[paste article]

Claude: I'll compress this with focus on neural networks.
[calls compress_text with question="What are the key points about neural networks?"]

Result: Compressed from 5000 tokens to 1500 tokens (3.3x compression)

Example 2: Compress Multiple Files

You: I have 3 files related to authentication. Compress them together:

File 1: [auth.py code]
File 2: [token.py code]
File 3: [middleware.py code]

Claude: I'll compress these related files together.
[calls compress_multiple_texts with the three files]

Result: Combined 8000 tokens compressed to 2400 tokens

Example 3: Cost Savings

You: Before sending this to the API, compress it aggressively (70% compression)

Claude: [calls compress_text with rate=0.3]

Original: 2000 tokens → Compressed: 600 tokens
Savings: 1400 tokens saved
At $0.03/1k tokens: Saved $0.042 per request

Example 4: Compress with Specific Rate

You: Compress this with 30% compression (rate 0.7):

[paste text]

Claude: [calls compress_text with rate=0.7]

Example 5: Large Codebase Context

You: Before we work with this large codebase, compress these files first:

[paste multiple files]

Claude: [calls compress_multiple_texts]

Compression Parameters

Rate

The rate parameter controls compression aggressiveness:

  • 0.9: Light compression (10% reduction) - safest, minimal information loss
  • 0.7: Moderate compression (30% reduction) - balanced
  • 0.5: Default (50% reduction) - good balance
  • 0.3: Aggressive (70% reduction) - maximum savings, some information loss
  • 0.1: Extreme (90% reduction) - highest savings, significant information loss
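
A quick way to reason about these numbers: rate is approximately the fraction of tokens kept, so the expected output size is simply original_tokens * rate, as this small sketch shows:

# rate ~= fraction of tokens kept after compression
original_tokens = 2000
for rate in (0.9, 0.7, 0.5, 0.3, 0.1):
    kept = round(original_tokens * rate)
    print(f"rate={rate}: keep ~{kept} tokens, remove ~{original_tokens - kept}")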

Rank Method

  • longllmlingua: Better for long contexts (default)
  • llmlingua: Faster, good for shorter prompts

Instruction & Question

Use these to guide compression:

  • instruction: Preserved context (e.g., "You are a helpful assistant")
  • question: Guides what content to keep (e.g., "What are the main security concerns?")
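
A sketch of how these parameters map onto an LLMlingua call, assuming the server passes them straight through to the llmlingua package (whose compress_prompt accepts the same names):

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",
)

report = " ".join(["The audit found issues with session handling and logging."] * 50)

result = compressor.compress_prompt(
    report,
    rate=0.5,
    instruction="You are a helpful assistant",        # preserved context
    question="What are the main security concerns?",  # guides what to keep
)
print(result["compressed_prompt"])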

Performance

Typical compression times (CPU):

Original Tokens   Compressed Tokens   Rate   Time
2000              1000                0.5    ~2s
4000              1200                0.3    ~4s
8000              2400                0.3    ~8s

With GPU/MPS: 2-3x faster
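
Timings vary a lot by hardware, so it is worth measuring on your own machine. A minimal timing sketch, assuming the llmlingua package installed by this project's dependencies:

import time
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",  # try "cuda" or "mps" to compare
)

text = " ".join(["The meeting covered budget, hiring, and roadmap items."] * 300)

start = time.perf_counter()
compressor.compress_prompt(text, rate=0.5)
print(f"compressed in {time.perf_counter() - start:.1f}s")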

Development

Project Structure

inference-optimizer/
├── src/
│   ├── __init__.py
│   ├── mcp_server.py           # MCP server with tool definitions
│   ├── compression_service.py  # Core compression logic
│   └── cli.py                  # Command-line interface (optional)
├── config.py                   # Configuration management
├── main.py                     # Entry point
├── pyproject.toml              # Project metadata and dependencies
├── .env.example                # Example configuration
├── CLAUDE.md                   # AI assistant context
└── README.md                   # This file

Running Tests

# Install dev dependencies
uv pip install -e ".[dev]"

# Format code
black src/

# Lint code
ruff check src/

Local Testing

Test the MCP server locally:

# Direct execution
uv run python -m src.mcp_server

# Or via main.py
uv run python main.py

For interactive testing, you can use an MCP client or integrate with Claude Code.

Troubleshooting

Tools Don't Appear in Claude Code

Problem: The compression tools don't show up in Claude's available tools

Solutions:

  1. Check that settings.json is valid JSON (no trailing commas, proper quotes)
  2. Verify the cwd path is absolute and correct
  3. Check Claude Code logs for error messages
  4. Ensure dependencies are installed: uv pip install -e .
  5. Restart Claude Code completely (quit and reopen)

Module Not Found Error

Problem: "Module not found" or "No module named 'src'"

Solutions:

  1. Make sure the cwd path in settings.json points to the project root
  2. Verify installation: uv pip install -e .
  3. Check that src/mcp_server.py exists in the project directory

Model Loading Issues

Problem: Model fails to download or load

Solutions:

  • Ensure internet connection (first download requires internet)
  • Check disk space (model is ~2GB)
  • Clear HuggingFace cache: rm -rf ~/.cache/huggingface/
  • Check logs for specific error messages

Device Issues

Problem: CUDA/MPS not working

Solutions:

  • Check device availability:
    import torch
    print(torch.cuda.is_available())  # CUDA
    print(torch.backends.mps.is_available())  # MPS
  • Fall back to CPU: "DEVICE_MAP": "cpu" in settings, or detect the device automatically (see the sketch below)
  • Verify drivers are installed (CUDA requires NVIDIA drivers)
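
If you want the choice made automatically, a small helper along these lines works (a hypothetical snippet; the project itself reads DEVICE_MAP from the environment instead):

import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple's MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())  # use this value for DEVICE_MAP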

Server is Slow to Start

Problem: First compression takes a long time

Solutions:

  1. First startup takes 5-10 seconds to load the model
  2. Subsequent tool calls are much faster
  3. Consider using GPU/MPS for better performance
  4. This is expected behavior - the model needs to load into memory

MCP Server Not Showing in Claude Code

Problem: Server doesn't appear to be running

Solutions:

  • Verify settings.json syntax is correct (valid JSON)
  • Use absolute path in cwd field
  • Check logs: look for MCP server startup messages
  • Restart Claude Code completely
  • Ensure dependencies are installed: uv pip install -e .

Performance Issues

Problem: Compression is too slow

Solutions:

  • Enable GPU: DEVICE_MAP=cuda or DEVICE_MAP=mps
  • Reduce compression rate for faster processing
  • Use rank_method=llmlingua for shorter texts
  • Check system resources (RAM usage)

Compression Quality Issues

Problem: Important information is lost

Solutions:

  • Increase rate (e.g., 0.3 → 0.5 or 0.7)
  • Use question parameter to guide preservation
  • Use instruction to preserve system context
  • Try different rank_method

Getting Logs

To see detailed logs from the MCP server:

  1. Set "LOG_LEVEL": "DEBUG" in your Claude Code configuration
  2. Restart Claude Code
  3. Check Claude Code's console/logs for MCP server output
  4. Look for messages about model initialization and tool calls

Need More Help?

If you encounter issues not covered here:

  1. Check the CLAUDE.md file for technical details
  2. Open an issue on GitHub with:
    • Your settings.json configuration (remove sensitive data)
    • Error messages from Claude Code logs
    • Your operating system and Python version
    • Output from uv run python -c "from src.mcp_server import mcp; print('✓ Server ready')"

Use Cases

These are theoretical use cases for educational exploration:

  1. Cost Reduction: Experiment with reducing API costs by 50-80%
  2. Context Window Optimization: Fit more information in limited token windows
  3. RAG System Prototypes: Compress retrieved documents before LLM processing
  4. Conversation History: Compress multi-turn conversation history
  5. Batch Document Processing: Compress collections of related documents
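
For the RAG prototype case (use case 3), the idea is to compress retrieved documents, guided by the user's question, before they are placed in the prompt. A sketch using the llmlingua package directly (illustrative only; inside Claude Code you would simply ask Claude to use compress_multiple_texts with a question):

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",
)

# Pretend these came back from a retrieval step.
retrieved_docs = [
    "Doc 1: password hashing uses bcrypt with a per-user salt ...",
    "Doc 2: sessions are stored server-side and expire after 30 minutes ...",
    "Doc 3: the middleware rejects requests without a valid token ...",
]

result = compressor.compress_prompt(
    retrieved_docs,
    rate=0.3,  # aggressive: keep roughly 30% of the tokens
    question="How is authentication handled?",
)
print(result["compressed_prompt"])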

Limitations

This is an experimental, educational project:

  • Not production-ready
  • No warranty or guarantees
  • Security not audited
  • Limited testing coverage
  • Compression is lossy - test with your use case
  • Cold start: initial model load takes 5-10 seconds
  • Memory intensive: ~2-4GB RAM required

Acknowledgments

Built with:

  • Microsoft's LLMlingua / LLMlingua-2 for prompt compression
  • The Model Context Protocol (MCP) for integration with Claude Code

License

MIT License - See LICENSE file for details

Contributing

Contributions welcome! See issues for current tasks or submit pull requests.

For questions or issues, please use GitHub Issues.
