This API provides text generation and classification capabilities using Meta-Llama-3-8B-Instruct, a pre-trained large language model. The API supports text chunking for large inputs, streaming responses, and flexible configuration options.
- Text Generation: Generate text based on conversational messages using chat templates
- Text Classification: Classify text into predefined categories using prompt engineering
- Text Chunking: Process large texts by splitting them into manageable chunks with overlap
- Streaming Support: Real-time streaming responses for chunked processing
- Swagger Documentation: Interactive API documentation via Swagger UI
- Flexible Configuration: Customizable parameters for temperature, sampling, token limits, and chunking
- Python 3.x
- CUDA-capable GPU (recommended) or CPU
- Hugging Face account and token (for gated models like Meta-Llama-3)
- Install the required dependencies:
```bash
pip install -r requirements.txt
```

- Set up environment variables (optional):

```bash
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"  # Default model
export HUGGINGFACE_HUB_TOKEN="your_token_here"           # Required for gated models
export HOST="0.0.0.0"                                    # Default host
export PORT="5050"                                       # Default port
```

To run the API, execute the following command:

```bash
CUDA_VISIBLE_DEVICES=1 python llm-chat-llama3.py
```

The API will start on the configured host and port (default: 0.0.0.0:5050).
The `/generate_response` endpoint generates text from input messages, with optional chunking and streaming support.
Request Body:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "Your system prompt here"
    },
    {
      "role": "user",
      "content": "Your user message here"
    }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_new_tokens": 256,
  "max_seq_len": 1024,
  "max_gen_len": 512,
  "enable_chunking": false,
  "chunk_size": 1000,
  "chunk_overlap": 100,
  "enable_streaming": false,
  "aggregation_strategy": "concatenate"
}
```

Parameters:
- `messages` (required): Array of message objects with `role` and `content` fields
- `temperature` (optional, default: 0.7): Sampling temperature for text generation
- `top_p` (optional, default: 0.9): Nucleus sampling parameter
- `max_new_tokens` (optional, default: 256): Maximum number of new tokens to generate
- `max_seq_len` (optional, default: 1024): Maximum sequence length
- `max_gen_len` (optional, default: 512): Maximum generation length
- `enable_chunking` (optional, default: false): Enable text chunking for large inputs
- `chunk_size` (optional, default: 1000): Size of each text chunk in characters
- `chunk_overlap` (optional, default: 100): Overlap between chunks in characters
- `enable_streaming` (optional, default: false): Enable streaming responses (requires chunking)
- `aggregation_strategy` (optional, default: "concatenate"): Strategy for aggregating chunk results ("concatenate" or "summarize")
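A minimal sketch of calling this endpoint with Python `requests`, assuming the server is running on the default host and port; the field names follow the schema above:

```python
import requests

# Assumes the API is reachable locally on the default port (5050).
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain nucleus sampling in one sentence."},
    ],
    "temperature": 0.7,
    "max_new_tokens": 128,
}

resp = requests.post("http://localhost:5050/generate_response", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["generated_text"])
```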
Response:
```json
{
  "generated_text": "Generated text response",
  "chunk_count": 1,
  "processing_time": 2.34
}
```

Generate text with real-time streaming responses. This endpoint automatically enables chunking and streaming.
Request Body:
Same as `/generate_response`, but `enable_chunking` and `enable_streaming` are automatically set to `true`.
Response: Server-Sent Events (SSE) stream with the following event types:
- `start`: Processing started
- `chunk`: Individual chunk result with chunk number
- `final`: Final aggregated result
- `end`: Processing completed
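A sketch of consuming the stream with Python `requests`. The streaming endpoint's path is not documented here, so the URL below is a placeholder (check the Swagger UI for the real path); the sketch simply prints the raw SSE frames:

```python
import requests

# Placeholder path for illustration only; see the Swagger UI for the actual route.
url = "http://localhost:5050/generate_stream"

payload = {
    "messages": [
        {"role": "system", "content": "Summarize the following text:"},
        {"role": "user", "content": "Very long text here..."},
    ],
}

with requests.post(url, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    # SSE frames arrive as "event: ..." / "data: ..." lines separated by blank lines.
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)
```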
The API includes intelligent text chunking (sketched after this list) that:
- Splits text at sentence boundaries when possible
- Falls back to word boundaries if sentence boundaries aren't found
- Maintains configurable overlap between chunks to preserve context
- Only activates when the text length exceeds `chunk_size` and `enable_chunking` is `true`
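The exact splitting code lives in the server; the following is a minimal sketch of this strategy (sentence-boundary splits with a character overlap), not the API's literal implementation:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    if len(text) <= chunk_size:
        return [text]  # chunking only activates past chunk_size

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer the last sentence boundary inside the window...
            cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
            if cut == -1:
                cut = window.rfind(" ")  # ...falling back to a word boundary
            if cut > 0:
                end = start + cut + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(end - chunk_overlap, start + 1)  # overlap preserves context
    return chunks
```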
Example classification request:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "Utilize prompt engineering to classify the given text accurately into one of the following predefined categories:\n Environment\n Soziales\n Governance\n Keine Armut\n Kein Hunger\n E-Umweltschutz\n E-Klimaschutz\n E-Erneuerbare Energie\n E-Emissionsreduktion\n E-Ressourceneffizienz\n S-Arbeitssicherheit\n S-Gesundheitsschutz\n S-Arbeitsbedingungen\nLimit your response to the identified class, nothing else. Optimize for increased accuracy."
    },
    {
      "role": "user",
      "content": "Am Wochenende fand ein tolles soziales Event statt..."
    }
  ]
}
```

Example chunked processing request:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "Summarize the following text:"
    },
    {
      "role": "user",
      "content": "Very long text here..."
    }
  ],
  "enable_chunking": true,
  "chunk_size": 2000,
  "chunk_overlap": 200,
  "aggregation_strategy": "concatenate"
}
```

The Swagger UI for this API is accessible at the following URL:
https://llm-chat.skynet.coypu.org/swagger
You can also access it locally at http://localhost:5050/swagger when running the API.
| Variable | Description | Default |
|---|---|---|
| `MODEL_NAME` | Hugging Face model identifier | `meta-llama/Meta-Llama-3-8B-Instruct` |
| `HUGGINGFACE_HUB_TOKEN` or `HF_TOKEN` | Hugging Face authentication token | None (required for gated models) |
| `HOST` | Server host address | `0.0.0.0` |
| `PORT` | Server port number | `5050` |
| `CUDA_VISIBLE_DEVICES` | GPU device selection | None (uses all available GPUs) |
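As a sketch, the server might resolve these variables roughly like this (the actual startup code in llm-chat-llama3.py may differ):

```python
import os

# Defaults match the table above; actual variable handling in the server may differ.
MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")
HF_TOKEN = os.environ.get("HUGGINGFACE_HUB_TOKEN") or os.environ.get("HF_TOKEN")
HOST = os.environ.get("HOST", "0.0.0.0")
PORT = int(os.environ.get("PORT", "5050"))
```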
The API logs all requests and responses to:
- Console output (stdout)
- Log file: `api.log`
Log format: `%(asctime)s - %(levelname)s - %(message)s`
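A logging configuration along these lines would produce the output described above; this is a sketch, not necessarily the server's exact setup:

```python
import logging
import sys

# Mirrors the documented format and destinations (stdout + api.log).
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),  # console output
        logging.FileHandler("api.log"),     # log file
    ],
)
logging.info("request received")
```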
The API uses the following default model configuration (a loading sketch follows the list):
- Model: Meta-Llama-3-8B-Instruct
- Precision: bfloat16
- Device: GPU (CUDA) if available, otherwise CPU
- Pipeline: Hugging Face Transformers text-generation pipeline
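Loading a model with this configuration via the Transformers pipeline API typically looks like the sketch below; the server's actual initialization may differ in detail:

```python
import torch
from transformers import pipeline

# bfloat16 precision and GPU-if-available device selection, as documented above.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device=0 if torch.cuda.is_available() else -1,
)
```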
- The endpoint was renamed from `/generate_text` to `/generate_response` for compatibility with BERTrend:weak_signal summarization
- Chunking is only applied when the text length exceeds the specified `chunk_size`
- Streaming requires chunking to be enabled
- The API uses Waitress as the production WSGI server
- Swagger JSON is automatically generated and saved to `static/swagger.json`
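Serving a Flask app with Waitress typically looks like the sketch below; the `app` object here is a stand-in for the API's actual application object:

```python
from flask import Flask
from waitress import serve

app = Flask(__name__)  # stand-in for the API's Flask app

# Waitress replaces Flask's development server for production use.
serve(app, host="0.0.0.0", port=5050)
```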
