This API provides text generation and classification capabilities using Meta-Llama-3-8B-Instruct, a pre-trained large language model. The API supports text chunking for large inputs, streaming responses, and flexible configuration options.
- Text Generation: Generate text based on conversational messages using chat templates
- Text Classification: Classify text into predefined categories using prompt engineering
- Text Chunking: Process large texts by splitting them into manageable chunks with overlap
- Streaming Support: Real-time streaming responses for chunked processing
- Swagger Documentation: Interactive API documentation via Swagger UI
- Flexible Configuration: Customizable parameters for temperature, sampling, token limits, and chunking
- Python 3.x
- CUDA-capable GPU (recommended) or CPU
- Hugging Face account and token (for gated models like Meta-Llama-3)
- Install the required dependencies:
```bash
pip install -r requirements.txt
```

- Set up environment variables (optional):

```bash
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"  # Default model
export HUGGINGFACE_HUB_TOKEN="your_token_here"           # Required for gated models
export HOST="0.0.0.0"                                    # Default host
export PORT="5050"                                       # Default port
```

To run the API, execute the following command:

```bash
CUDA_VISIBLE_DEVICES=1 python llm-chat-llama3.py
```

The API will start on the configured host and port (default: 0.0.0.0:5050).
The `/generate_response` endpoint generates text from input messages, with optional chunking and streaming support.
Request Body:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "Your system prompt here"
    },
    {
      "role": "user",
      "content": "Your user message here"
    }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_new_tokens": 256,
  "max_seq_len": 1024,
  "max_gen_len": 512,
  "enable_chunking": false,
  "chunk_size": 1000,
  "chunk_overlap": 100,
  "enable_streaming": false,
  "aggregation_strategy": "concatenate"
}
```

Parameters:
- `messages` (required): Array of message objects with `role` and `content` fields
- `temperature` (optional, default: 0.7): Sampling temperature for text generation
- `top_p` (optional, default: 0.9): Nucleus sampling parameter
- `max_new_tokens` (optional, default: 256): Maximum number of new tokens to generate
- `max_seq_len` (optional, default: 1024): Maximum sequence length
- `max_gen_len` (optional, default: 512): Maximum generation length
- `enable_chunking` (optional, default: false): Enable text chunking for large inputs
- `chunk_size` (optional, default: 1000): Size of each text chunk in characters
- `chunk_overlap` (optional, default: 100): Overlap between chunks in characters
- `enable_streaming` (optional, default: false): Enable streaming responses (requires chunking)
- `aggregation_strategy` (optional, default: "concatenate"): Strategy for aggregating chunk results ("concatenate" or "summarize")
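A minimal sketch of calling this endpoint with Python `requests`, assuming the server is running on the default host and port; the field names follow the schema above:

```python
import requests

# Assumes the API is reachable locally on the default port (5050).
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain nucleus sampling in one sentence."},
    ],
    "temperature": 0.7,
    "max_new_tokens": 128,
}

resp = requests.post("http://localhost:5050/generate_response", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["generated_text"])
```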
Response:
```json
{
  "generated_text": "Generated text response",
  "chunk_count": 1,
  "processing_time": 2.34
}
```

Generate text with real-time streaming responses. This endpoint automatically enables chunking and streaming.
Request Body:
Same as `/generate_response`, but `enable_chunking` and `enable_streaming` are automatically set to `true`.
Response: Server-Sent Events (SSE) stream with the following event types:
- `start`: Processing started
- `chunk`: Individual chunk result with chunk number
- `final`: Final aggregated result
- `end`: Processing completed
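A sketch of consuming the stream with Python `requests`. The streaming endpoint's path is not documented here, so the URL below is a placeholder (check the Swagger UI for the real path); the sketch simply prints the raw SSE frames:

```python
import requests

# Placeholder path for illustration only; see the Swagger UI for the actual route.
url = "http://localhost:5050/generate_stream"

payload = {
    "messages": [
        {"role": "system", "content": "Summarize the following text:"},
        {"role": "user", "content": "Very long text here..."},
    ],
}

with requests.post(url, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    # SSE frames arrive as "event: ..." / "data: ..." lines separated by blank lines.
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)
```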
The API includes intelligent text chunking (sketched after this list) that:
- Splits text at sentence boundaries when possible
- Falls back to word boundaries if sentence boundaries aren't found
- Maintains configurable overlap between chunks to preserve context
- Only activates when the text length exceeds `chunk_size` and `enable_chunking` is `true`
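The exact splitting code lives in the server; the following is a minimal sketch of this strategy (sentence-boundary splits with a character overlap), not the API's literal implementation:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    if len(text) <= chunk_size:
        return [text]  # chunking only activates past chunk_size

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer the last sentence boundary inside the window...
            cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
            if cut == -1:
                cut = window.rfind(" ")  # ...falling back to a word boundary
            if cut > 0:
                end = start + cut + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - chunk_overlap, start + 1)  # overlap preserves context
    return chunks
```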
Example classification request:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "Utilize prompt engineering to classify the given text accurately into one of the following predefined categories:\n Environment\n Soziales\n Governance\n Keine Armut\n Kein Hunger\n E-Umweltschutz\n E-Klimaschutz\n E-Erneuerbare Energie\n E-Emissionsreduktion\n E-Ressourceneffizienz\n S-Arbeitssicherheit\n S-Gesundheitsschutz\n S-Arbeitsbedingungen\nLimit your response to the identified class, nothing else. Optimize for increased accuracy."
    },
    {
      "role": "user",
      "content": "Am Wochenende fand ein tolles soziales Event statt..."
    }
  ]
}
```

Example chunked processing request:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "Summarize the following text:"
    },
    {
      "role": "user",
      "content": "Very long text here..."
    }
  ],
  "enable_chunking": true,
  "chunk_size": 2000,
  "chunk_overlap": 200,
  "aggregation_strategy": "concatenate"
}
```

The Swagger UI for this API is accessible at the following URL:
https://llm-chat.skynet.coypu.org/swagger
You can also access it locally at http://localhost:5050/swagger when running the API.
| Variable | Description | Default |
|---|---|---|
| `MODEL_NAME` | Hugging Face model identifier | `meta-llama/Meta-Llama-3-8B-Instruct` |
| `HUGGINGFACE_HUB_TOKEN` or `HF_TOKEN` | Hugging Face authentication token | None (required for gated models) |
| `HOST` | Server host address | `0.0.0.0` |
| `PORT` | Server port number | `5050` |
| `CUDA_VISIBLE_DEVICES` | GPU device selection | None (uses all available GPUs) |
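As a sketch, the server might resolve these variables roughly like this (the actual startup code in llm-chat-llama3.py may differ):

```python
import os

# Defaults match the table above; actual variable handling in the server may differ.
MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
HF_TOKEN = os.environ.get("HUGGINGFACE_HUB_TOKEN") or os.environ.get("HF_TOKEN")
HOST = os.environ.get("HOST", "0.0.0.0")
PORT = int(os.environ.get("PORT", "5050"))
```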
The API logs all requests and responses to:
- Console output (stdout)
- Log file: `api.log`
Log format: `%(asctime)s - %(levelname)s - %(message)s`
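A logging configuration along these lines would produce the output described above; this is a sketch, not necessarily the server's exact setup:

```python
import logging
import sys

# Mirrors the documented format and destinations (stdout + api.log).
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),  # console output
        logging.FileHandler("api.log"),     # log file
    ],
)
logging.info("request received")
```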
The API uses the following default model configuration (a loading sketch follows the list):
- Model: Meta-Llama-3-8B-Instruct
- Precision: bfloat16
- Device: GPU (CUDA) if available, otherwise CPU
- Pipeline: Hugging Face Transformers text-generation pipeline
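Loading a model with this configuration via the Transformers pipeline API typically looks like the sketch below; the server's actual initialization may differ in detail:

```python
import torch
from transformers import pipeline

# bfloat16 precision and GPU-if-available device selection, as documented above.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device=0 if torch.cuda.is_available() else -1,
)
```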
- The endpoint was renamed from `/generate_text` to `/generate_response` for compatibility with BERTrend:weak_signal summarization
- Chunking is only applied when the text length exceeds the specified `chunk_size`
- Streaming requires chunking to be enabled
- The API uses Waitress as the production WSGI server
- Swagger JSON is automatically generated and saved to `static/swagger.json`
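Serving a Flask app with Waitress typically looks like the sketch below; the `app` object here is a stand-in for the API's actual application object:

```python
from flask import Flask
from waitress import serve

app = Flask(__name__)  # stand-in for the API's Flask app

# Waitress replaces Flask's development server for production use.
serve(app, host="0.0.0.0", port=5050)
```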
