An MCP (Model Context Protocol) server that provides prompt compression tools for Claude Code and other MCP-compatible clients. Reduce LLM token usage by 50-80% using Microsoft's LLMlingua technology.
⚠️ EXPERIMENTAL PROJECT
This is an experimental project created for educational and learning purposes. It is not intended for production use and should not be considered production-ready or professional-grade software. Use at your own risk for educational exploration and experimentation only.
- Features
- Quick Start
- Claude Code Setup Guide
- Configuration
- Tools Reference
- Examples
- Development
- Troubleshooting
- MCP Server: Integrates directly with Claude Code and other MCP clients
- Prompt Compression: Reduce token usage by 50-80% using LLMlingua-2
- Two Compression Tools:
- compress_text: Compress single prompts or documents
- compress_multiple_texts: Compress multiple context pieces together
 
- Flexible Configuration: Environment-based configuration
- Device Support: CPU, CUDA (NVIDIA), and MPS (Apple Silicon)
- Intelligent Compression: Preserve important context while removing redundancy
# Clone the repository
git clone <repository-url>
cd inference-optimizer
# Create virtual environment with uv
uv venv
# Activate the environment
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate  # Windows
# Install dependencies
uv pip install -e .
# Run the MCP server directly (stdio mode)
uv run python -m src.mcp_server
# The server will wait for MCP protocol messages on stdin
# Press Ctrl+C to exit
# Or test that it loads correctly
uv run python -c "from src.mcp_server import mcp; print('✓ Server ready')"
This section walks you through setting up the Inference Optimizer MCP server with Claude Code.
- Claude Code installed and running
- Inference Optimizer cloned and dependencies installed (see Quick Start)
- Python 3.10+ with the `uv` package manager
Navigate to the inference-optimizer directory and add the MCP server using the claude mcp add command:
For Apple Silicon (M1/M2/M3):
cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=mps -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server
For NVIDIA GPU:
cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=cuda -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server
For CPU (works everywhere):
cd /path/to/inference-optimizer
claude mcp add inference-optimizer -e DEVICE_MAP=cpu -e LOG_LEVEL=INFO -- uv run python -m src.mcp_server
Note: Make sure you run this command from inside the inference-optimizer directory, as the working directory will be set to wherever you run the command from.
After adding the server, completely restart Claude Code:
- Quit Claude Code (don't just close the window)
- Start Claude Code again
- Wait for it to fully load (5-10 seconds for model loading)
In a new Claude Code conversation, ask:
Can you show me what tools you have available?
Claude should list compress_text and compress_multiple_texts among the available tools.
Try compressing some text:
Please compress this text for me:
[paste a long paragraph or document here]
Claude should automatically call the compress_text tool and show you the results.
Create a .env file or set environment variables:
# Copy example configuration
cp .env.example .env
# Edit configuration
nano .env
| Variable | Description | Default |
|---|---|---|
| MODEL_NAME | HuggingFace model name | microsoft/llmlingua-2-xlm-roberta-large-meetingbank | 
| DEVICE_MAP | Device to run on: `cpu`, `cuda` (NVIDIA GPU), or `mps` (Apple Silicon) | cpu |
| USE_LLMLINGUA2 | Use LLMlingua-2 (more accurate) | true | 
| LOG_LEVEL | Logging level | INFO | 
| DEFAULT_COMPRESSION_RATE | Default compression rate | 0.5 | 
| DEFAULT_RANK_METHOD | Default ranking algorithm | longllmlingua | 
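For orientation, here is a minimal sketch of how these variables might be read, assuming a simple `os.getenv`-based loader in `config.py`; the actual field names and structure in the project may differ:

```python
# Hedged sketch of environment-based configuration; names are illustrative.
import os
from dataclasses import dataclass


@dataclass
class Settings:
    model_name: str = os.getenv(
        "MODEL_NAME", "microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
    )
    device_map: str = os.getenv("DEVICE_MAP", "cpu")
    use_llmlingua2: bool = os.getenv("USE_LLMLINGUA2", "true").lower() == "true"
    log_level: str = os.getenv("LOG_LEVEL", "INFO")
    default_compression_rate: float = float(os.getenv("DEFAULT_COMPRESSION_RATE", "0.5"))
    default_rank_method: str = os.getenv("DEFAULT_RANK_METHOD", "longllmlingua")


settings = Settings()
```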
For Apple Silicon (M1/M2/M3):
DEVICE_MAP=mps
For NVIDIA GPU:
DEVICE_MAP=cuda
For CPU (default, works everywhere):
DEVICE_MAP=cpu
To use a different LLMlingua model in your Claude Code settings:
"env": {
  "MODEL_NAME": "microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
  "DEVICE_MAP": "mps"
}"env": {
  "DEFAULT_COMPRESSION_RATE": "0.7",
  "DEFAULT_RANK_METHOD": "llmlingua",
  "DEVICE_MAP": "mps"
}
For less verbose output:
"env": {
  "LOG_LEVEL": "WARNING",
  "DEVICE_MAP": "mps"
}
For troubleshooting:
"env": {
  "LOG_LEVEL": "DEBUG",
  "DEVICE_MAP": "mps"
}
compress_text: Compress a single text, prompt, or document.
Parameters:
- `text` (required): The text to compress
- `rate` (optional, default 0.5): Compression rate (0.1-0.9)
  - Lower values = more aggressive compression
  - 0.5 = 50% compression, 0.3 = 70% compression
- `instruction` (optional): Instruction to preserve
- `question` (optional): Question to guide what to preserve
- `rank_method` (optional): `longllmlingua` or `llmlingua`
Returns:
- compressed_text: The compressed result
- original_tokens: Token count before compression
- compressed_tokens: Token count after compression
- compression_ratio: How much was compressed (e.g., "2.0x")
- token_saving: Tokens saved (e.g., "500 tokens saved")
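For context, these parameters and return fields map closely onto LLMLingua's `PromptCompressor.compress_prompt`. Below is a hedged sketch using the public `llmlingua` package; the server's own wrapper in `src/compression_service.py` may be organized differently:

```python
# Illustrative sketch only -- assumes the public llmlingua API; the actual
# wrapper in src/compression_service.py may be structured differently.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map="cpu",  # or "cuda" / "mps"
)

result = compressor.compress_prompt(
    "<long text to compress>",
    rate=0.5,  # keep roughly half of the tokens
    question="What are the key points about neural networks?",
)

print(result["compressed_prompt"])                            # compressed text
print(result["origin_tokens"], result["compressed_tokens"])   # token counts
```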
compress_multiple_texts: Compress multiple text segments together while preserving relationships.
Parameters:
- `texts` (required): List of text strings
- `rate` (optional, default 0.5): Compression rate
- `instruction` (optional): Instruction to preserve
- `question` (optional): Question to guide compression
- `rank_method` (optional): Ranking algorithm
Returns: Same format as compress_text
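This multi-text tool presumably builds on the same underlying call, since llmlingua's `compress_prompt` also accepts a list of context segments. Again a hedged sketch, not the actual server code:

```python
# Hedged sketch: a list of context segments goes through the same API.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
result = compressor.compress_prompt(
    ["<auth.py source>", "<token.py source>", "<middleware.py source>"],
    rate=0.3,
    question="How is request authentication enforced?",
)
print(result["compressed_prompt"])
```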
You: Here's a 5000 word article about machine learning.
Can you compress it to focus on the key points about neural networks?
[paste article]
Claude: I'll compress this with focus on neural networks.
[calls compress_text with question="What are the key points about neural networks?"]
Result: Compressed from 5000 tokens to 1500 tokens (3.3x compression)
You: I have 3 files related to authentication. Compress them together:
File 1: [auth.py code]
File 2: [token.py code]
File 3: [middleware.py code]
Claude: I'll compress these related files together.
[calls compress_multiple_texts with the three files]
Result: Combined 8000 tokens compressed to 2400 tokens
You: Before sending this to the API, compress it aggressively (70% compression)
Claude: [calls compress_text with rate=0.3]
Original: 2000 tokens → Compressed: 600 tokens
Savings: 1400 tokens saved
At $0.03/1k tokens: Saved $0.042 per request
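The savings figure above is simple arithmetic, tokens saved times the per-token price (the $0.03/1k rate is only illustrative):

```python
# Worked arithmetic for the example above; the price is illustrative.
tokens_saved = 2000 - 600
price_per_1k_tokens = 0.03
print(f"${tokens_saved / 1000 * price_per_1k_tokens:.3f} saved per request")  # $0.042
```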
You: Compress this with 30% compression (rate 0.7):
[paste text]
Claude: [calls compress_text with rate=0.7]
You: Before we work with this large codebase, compress these files first:
[paste multiple files]
Claude: [calls compress_multiple_texts]
The rate parameter controls compression aggressiveness:
- 0.9: Light compression (10% reduction) - safest, minimal information loss
- 0.7: Moderate compression (30% reduction) - balanced
- 0.5: Default (50% reduction) - good balance
- 0.3: Aggressive (70% reduction) - maximum savings, some information loss
- 0.1: Extreme (90% reduction) - highest savings, significant information loss
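Put differently, the rate is approximately the fraction of tokens kept, so the reduction is roughly 1 - rate:

```python
# rate ~ fraction of tokens kept; reduction ~ 1 - rate (approximate, not exact).
for rate in (0.9, 0.7, 0.5, 0.3, 0.1):
    print(f"rate={rate}: keep ~{rate:.0%} of tokens, reduce by ~{1 - rate:.0%}")
```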
- longllmlingua: Better for long contexts (default)
- llmlingua: Faster, good for shorter prompts
Use these to guide compression:
- instruction: Preserved context (e.g., "You are a helpful assistant")
- question: Guides what content to keep (e.g., "What are the main security concerns?")
Typical compression times (CPU):
| Original Tokens | Compressed Tokens | Rate | Time | 
|---|---|---|---|
| 2000 | 1000 | 0.5 | ~2s | 
| 4000 | 1200 | 0.3 | ~4s | 
| 8000 | 2400 | 0.3 | ~8s | 
With GPU/MPS: 2-3x faster
inference-optimizer/
├── src/
│   ├── __init__.py
│   ├── mcp_server.py           # MCP server with tool definitions
│   ├── compression_service.py  # Core compression logic
│   └── cli.py                  # Command-line interface (optional)
├── config.py                   # Configuration management
├── main.py                     # Entry point
├── pyproject.toml              # Project metadata and dependencies
├── .env.example               # Example configuration
├── CLAUDE.md                  # AI assistant context
└── README.md                  # This file
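For orientation, src/mcp_server.py presumably registers the two tools via FastMCP decorators. The sketch below is illustrative only; the signatures and names are assumptions, not the actual source:

```python
# Hedged sketch of how an MCP tool might be registered with FastMCP;
# the real src/mcp_server.py may differ in names and structure.
from fastmcp import FastMCP

mcp = FastMCP("inference-optimizer")


@mcp.tool()
def compress_text(
    text: str,
    rate: float = 0.5,
    instruction: str = "",
    question: str = "",
    rank_method: str = "longllmlingua",
) -> dict:
    """Compress a single text with LLMlingua and return token statistics."""
    # The real implementation would delegate to compression_service here.
    ...


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```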
# Install dev dependencies
uv pip install -e ".[dev]"
# Format code
black src/
# Lint code
ruff check src/
Test the MCP server locally:
# Direct execution
uv run python -m src.mcp_server
# Or via main.py
uv run python main.py
For interactive testing, you can use an MCP client or integrate with Claude Code.
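One option is a small client script that drives the server over stdio. This is a hedged sketch using the official MCP Python SDK client (assumes the `mcp` package is installed; adjust the command to match your environment):

```python
# Hedged sketch: exercise the server over stdio with the MCP Python SDK client.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main():
    params = StdioServerParameters(
        command="uv", args=["run", "python", "-m", "src.mcp_server"]
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("tools:", [t.name for t in tools.tools])
            result = await session.call_tool(
                "compress_text", {"text": "some long text " * 200, "rate": 0.5}
            )
            print(result)


asyncio.run(main())
```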
Problem: The compression tools don't show up in Claude's available tools
Solutions:
- Check that settings.json is valid JSON (no trailing commas, proper quotes)
- Verify the cwd path is absolute and correct
- Check Claude Code logs for error messages
- Ensure dependencies are installed: uv pip install -e .
- Restart Claude Code completely (quit and reopen)
Problem: "Module not found" or "No module named 'src'"
Solutions:
- Make sure the cwd path in settings.json points to the project root
- Verify installation: uv pip install -e .
- Check that src/mcp_server.py exists in the project directory
Problem: Model fails to download or load
Solutions:
- Ensure internet connection (first download requires internet)
- Check disk space (model is ~2GB)
- Clear HuggingFace cache: rm -rf ~/.cache/huggingface/
- Check logs for specific error messages
Problem: CUDA/MPS not working
Solutions:
- Check device availability:
import torch
print(torch.cuda.is_available())          # CUDA
print(torch.backends.mps.is_available())  # MPS
- Fall back to CPU: set "DEVICE_MAP": "cpu" in settings
- Verify drivers are installed (CUDA requires NVIDIA drivers)
Problem: First compression takes a long time
Solutions:
- First startup takes 5-10 seconds to load the model
- Subsequent tool calls are much faster
- Consider using GPU/MPS for better performance
- This is expected behavior - the model needs to load into memory
Problem: Server doesn't appear to be running
Solutions:
- Verify settings.json syntax is correct (valid JSON)
- Use an absolute path in the cwd field
- Check logs: look for MCP server startup messages
- Restart Claude Code completely
- Ensure dependencies are installed: uv pip install -e .
Problem: Compression is too slow
Solutions:
- Enable GPU: DEVICE_MAP=cuda or DEVICE_MAP=mps
- Reduce compression rate for faster processing
- Use rank_method=llmlingua for shorter texts
- Check system resources (RAM usage)
Problem: Important information is lost
Solutions:
- Increase rate (e.g., 0.3 → 0.5 or 0.7)
- Use the question parameter to guide preservation
- Use instruction to preserve system context
- Try a different rank_method
To see detailed logs from the MCP server:
- Set "LOG_LEVEL": "DEBUG"in your Claude Code configuration
- Restart Claude Code
- Check Claude Code's console/logs for MCP server output
- Look for messages about model initialization and tool calls
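As a rough illustration, LOG_LEVEL presumably ends up in a standard Python logging setup along these lines (a sketch, not the actual code):

```python
# Hedged sketch of how LOG_LEVEL might be wired into Python logging.
import logging
import os

logging.basicConfig(level=os.getenv("LOG_LEVEL", "INFO"))
logging.getLogger("inference-optimizer").debug("model initialization / tool call details")
```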
If you encounter issues not covered here:
- Check the CLAUDE.md file for technical details
- Open an issue on GitHub with:
- Your settings.json configuration (remove sensitive data)
- Error messages from Claude Code logs
- Your operating system and Python version
- Output from uv run python -c "from src.mcp_server import mcp; print('✓ Server ready')"
 
These are theoretical use cases for educational exploration:
- Cost Reduction: Experiment with reducing API costs by 50-80%
- Context Window Optimization: Fit more information in limited token windows
- RAG System Prototypes: Compress retrieved documents before LLM processing
- Conversation History: Compress multi-turn conversation history
- Batch Document Processing: Compress collections of related documents
This is an experimental, educational project:
- Not production-ready
- No warranty or guarantees
- Security not audited
- Limited testing coverage
- Compression is lossy - test with your use case
- Cold start: initial model load takes 5-10 seconds
- Memory intensive: ~2-4GB RAM required
Built with:
- LLMlingua by Microsoft Research
- FastMCP for MCP server framework
- Model Context Protocol by Anthropic
MIT License - See LICENSE file for details
Contributions welcome! See issues for current tasks or submit pull requests.
For questions or issues, please use GitHub Issues.