
Text Generation and Classification API

This API provides text generation and classification capabilities using Meta-Llama-3-8B-Instruct, an instruction-tuned large language model. The API supports text chunking for large inputs, streaming responses, and flexible configuration options.

Features

  • Text Generation: Generate text based on conversational messages using chat templates
  • Text Classification: Classify text into predefined categories using prompt engineering
  • Text Chunking: Process large texts by splitting them into manageable chunks with overlap
  • Streaming Support: Real-time streaming responses for chunked processing
  • Swagger Documentation: Interactive API documentation via Swagger UI
  • Flexible Configuration: Customizable parameters for temperature, sampling, token limits, and chunking

Prerequisites

  • Python 3.x
  • CUDA-capable GPU (recommended) or CPU
  • Hugging Face account and token (for gated models like Meta-Llama-3)

Installation

  1. Install the required dependencies:
pip install -r requirements.txt
  2. Set up environment variables (optional):
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"  # Default model
export HUGGINGFACE_HUB_TOKEN="your_token_here"  # Required for gated models
export HOST="0.0.0.0"  # Default host
export PORT="5050"  # Default port

Running the API

To run the API, execute the following command (the CUDA_VISIBLE_DEVICES prefix selects which GPU to use and can be omitted or changed):

CUDA_VISIBLE_DEVICES=1 python llm-chat-llama3.py

The API will start on the configured host and port (default: 0.0.0.0:5050).

API Endpoints

1. /generate_response (POST)

Generate text based on input messages with optional chunking and streaming support.

Request Body:

{
  "messages": [
    {
      "role": "system",
      "content": "Your system prompt here"
    },
    {
      "role": "user",
      "content": "Your user message here"
    }
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_new_tokens": 256,
  "max_seq_len": 1024,
  "max_gen_len": 512,
  "enable_chunking": false,
  "chunk_size": 1000,
  "chunk_overlap": 100,
  "enable_streaming": false,
  "aggregation_strategy": "concatenate"
}

Parameters:

  • messages (required): Array of message objects with role and content fields
  • temperature (optional, default: 0.7): Sampling temperature for text generation
  • top_p (optional, default: 0.9): Nucleus sampling parameter
  • max_new_tokens (optional, default: 256): Maximum number of new tokens to generate
  • max_seq_len (optional, default: 1024): Maximum sequence length
  • max_gen_len (optional, default: 512): Maximum generation length
  • enable_chunking (optional, default: false): Enable text chunking for large inputs
  • chunk_size (optional, default: 1000): Size of each text chunk in characters
  • chunk_overlap (optional, default: 100): Overlap between chunks in characters
  • enable_streaming (optional, default: false): Enable streaming responses (requires chunking)
  • aggregation_strategy (optional, default: "concatenate"): Strategy for aggregating chunk results ("concatenate" or "summarize")

Response:

{
  "generated_text": "Generated text response",
  "chunk_count": 1,
  "processing_time": 2.34
}
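
For reference, here is a minimal Python client for this endpoint (a sketch assuming the default local host and port; the payload values are illustrative):

import requests

# Assumes the API is running locally on the default port
BASE_URL = "http://localhost:5050"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain text chunking in one sentence."}
    ],
    "temperature": 0.7,
    "max_new_tokens": 256
}

resp = requests.post(f"{BASE_URL}/generate_response", json=payload)
resp.raise_for_status()
result = resp.json()
print(result["generated_text"])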

2. /generate_response_stream (POST)

Generate text with real-time streaming responses. This endpoint automatically enables chunking and streaming.

Request Body: Same as /generate_response, but enable_chunking and enable_streaming are automatically set to true.

Response: Server-Sent Events (SSE) stream with the following event types:

  • start: Processing started
  • chunk: Individual chunk result with chunk number
  • final: Final aggregated result
  • end: Processing completed
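
A minimal consumer sketch for this stream, assuming a standard SSE wire format with "data:" lines carrying JSON payloads (the exact field layout may differ):

import json
import requests

payload = {
    "messages": [
        {"role": "system", "content": "Summarize the following text:"},
        {"role": "user", "content": "Very long text here..."}
    ]
}

# stream=True keeps the connection open so events arrive as they are produced
with requests.post("http://localhost:5050/generate_response_stream",
                   json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            event = json.loads(line[len("data: "):])  # start/chunk/final/end events
            print(event)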

Text Chunking

The API includes intelligent text chunking that:

  • Splits text at sentence boundaries when possible
  • Falls back to word boundaries if sentence boundaries aren't found
  • Maintains configurable overlap between chunks to preserve context
  • Only activates when text length exceeds chunk_size and enable_chunking is true
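
The sketch below illustrates this strategy in simplified form (illustrative only, not the API's actual implementation):

def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into overlapping chunks, preferring sentence boundaries."""
    if len(text) <= chunk_size:
        return [text]  # chunking only activates for long inputs
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer the last sentence boundary inside the window...
            boundary = text.rfind(". ", start, end)
            if boundary == -1:
                # ...otherwise fall back to the last word boundary
                boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary + 1
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step forward while keeping chunk_overlap characters of context
        start = max(end - chunk_overlap, start + 1)
    return chunks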

Example Use Cases

Text Classification

{
  "messages": [
    {
      "role": "system",
      "content": "Utilize prompt engineering to classify the given text accurately into one of the following predefined categories:\n    Environment\n    Soziales\n    Governance\n    Keine Armut\n    Kein Hunger\n    E-Umweltschutz\n    E-Klimaschutz\n    E-Erneuerbare Energie\n    E-Emissionsreduktion\n    E-Ressourceneffizienz\n    S-Arbeitssicherheit\n    S-Gesundheitsschutz\n    S-Arbeitsbedingungen\nLimit your response to the identified class, nothing else. Optimize for increased accuracy."
    },
    {
      "role": "user",
      "content": "Am Wochenende fand ein tolles soziales Event statt..."
    }
  ]
}
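
Sent to /generate_response, a request like this should return only the category label. A brief usage sketch (the expected label is model-dependent):

import requests

payload = {
    "messages": [
        # Abbreviated; use the full system prompt shown above
        {"role": "system", "content": "Utilize prompt engineering to classify the given text..."},
        {"role": "user", "content": "Am Wochenende fand ein tolles soziales Event statt..."}
    ]
}

resp = requests.post("http://localhost:5050/generate_response", json=payload)
print(resp.json()["generated_text"])  # e.g. "Soziales" (model-dependent)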

Large Text Processing with Chunking

{
  "messages": [
    {
      "role": "system",
      "content": "Summarize the following text:"
    },
    {
      "role": "user",
      "content": "Very long text here..."
    }
  ],
  "enable_chunking": true,
  "chunk_size": 2000,
  "chunk_overlap": 200,
  "aggregation_strategy": "concatenate"
}

Accessing the Swagger UI

The Swagger UI for this API is accessible at the following URL:

https://llm-chat.skynet.coypu.org/swagger

You can also access it locally at http://localhost:5050/swagger when running the API.


Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MODEL_NAME | Hugging Face model identifier | meta-llama/Meta-Llama-3-8B-Instruct |
| HUGGINGFACE_HUB_TOKEN or HF_TOKEN | Hugging Face authentication token | None (required for gated models) |
| HOST | Server host address | 0.0.0.0 |
| PORT | Server port number | 5050 |
| CUDA_VISIBLE_DEVICES | GPU device selection | None (uses all available GPUs) |

Logging

The API logs all requests and responses to:

  • Console output (stdout)
  • Log file: api.log

Log format: %(asctime)s - %(levelname)s - %(message)s
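
This corresponds to a standard library setup along these lines (a sketch; the handlers in llm-chat-llama3.py may be configured differently):

import logging
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),  # console output
        logging.FileHandler("api.log"),     # log file
    ],
)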

Model Configuration

The API uses the following default model configuration:

  • Model: Meta-Llama-3-8B-Instruct
  • Precision: bfloat16
  • Device: GPU (CUDA) if available, otherwise CPU
  • Pipeline: Hugging Face Transformers text-generation pipeline
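
Model loading presumably follows the standard Transformers pattern (a sketch under those assumptions; the actual script may differ):

import os
import torch
from transformers import pipeline

model_name = os.environ.get("MODEL_NAME", "meta-llama/Meta-Llama-3-8B-Instruct")

generator = pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.bfloat16,                     # bfloat16 precision
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
    token=os.environ.get("HUGGINGFACE_HUB_TOKEN"),  # required for gated models
)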

Notes

  • The endpoint was renamed from /generate_text to /generate_response for compatibility with BERTrend's weak_signal summarization
  • Chunking is only applied when text length exceeds the specified chunk_size
  • Streaming requires chunking to be enabled
  • The API uses Waitress as the production WSGI server
  • Swagger JSON is automatically generated and saved to static/swagger.json
