Commit 58df0f8

[ver]: v1.0.0 of text_auto_classification contains base HTTP service of multi-class classification task
1 parent 4e07279 commit 58df0f8

42 files changed: 1412 additions and 7 deletions

.gitignore

Lines changed: 9 additions & 6 deletions
@@ -127,6 +127,7 @@ venv/
 ENV/
 env.bak/
 venv.bak/
+local_env/
 
 # Spyder project settings
 .spyderproject
@@ -152,9 +153,11 @@ dmypy.json
 # Cython debug symbols
 cython_debug/
 
-# PyCharm
-# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-# and can be added to the global gitignore or merged into this file. For a more nuclear
-# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+# VS Code IDE files
+.vscode/
+
+# Others
+my_configs/
+temp_data/
+output_dir/
+test.py

Changelog

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+## [1.0.0] - 2024-04-19
+
+_First release._
+
+### Added
+
+- **Breaking:** Base functionality for HTTP service
+- **Breaking:** Fine-tuning of multi-class classification task
+- **Breaking:** Action example for running pipeline in SuperAnnotate infrastructure
+
+
+## [0.0.1] - 2024-03-12
+
+_Init._

Dockerfile

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04
+
+# Set utility env variables
+ENV PATH=/text_auto_classification_private/miniconda/bin:$PATH
+
+# Set paths as env variables
+ARG DEFAULT_SERVICE_CONFIG
+ARG DEFAULT_TRAINING_CONFIG
+
+ENV DEFAULT_SERVICE_CONFIG=${DEFAULT_SERVICE_CONFIG}
+ENV DEFAULT_TRAINING_CONFIG=${DEFAULT_TRAINING_CONFIG}
+
+# Install some basic utilities
+RUN apt-get update && apt-get install -y \
+    curl \
+    ca-certificates \
+    sudo \
+    git \
+    bzip2 \
+    build-essential \
+    libgl1 \
+    libglib2.0-0 \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set workdir
+WORKDIR /text_auto_classification_private
+
+# Install Miniconda and Python
+RUN curl -sLo /text_auto_classification_private/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py311_24.1.2-0-Linux-x86_64.sh \
+    && chmod +x /text_auto_classification_private/miniconda.sh \
+    && /text_auto_classification_private/miniconda.sh -b -p /text_auto_classification_private/miniconda \
+    && rm /text_auto_classification_private/miniconda.sh \
+    && conda install -y python==3.11 \
+    && pip3 install nvitop
+
+# Install python requirements
+COPY text_auto_classification/requirements.txt .
+RUN pip3 install -r requirements.txt --no-cache
+
+# Copy code to container
+COPY text_auto_classification/ text_auto_classification/
+COPY etc/ etc/
+COPY version.txt .
+
+EXPOSE 8080
+CMD uvicorn --host 0.0.0.0 --port 8080 text_auto_classification.fastapi_app:app

README.md

Lines changed: 150 additions & 1 deletion
@@ -1 +1,150 @@
-# text_auto_classification
+# SuperAnnotate Text Auto Classification #
+
+[![Version](https://img.shields.io/badge/version-1.0.0-green.svg)]() [![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/) [![CUDA 12.2](https://img.shields.io/badge/CUDA-12.2-green.svg)](https://developer.nvidia.com/cuda-12-2-0-download-archive)
+
+This repository contains an HTTP service for automatic text classification, designed to be integrated into pipelines on the SuperAnnotate platform.
+
+To integrate this HTTP service into your pipeline on the SuperAnnotate platform, follow these steps:
+
+- Create and set up a text project on the SuperAnnotate platform.
+- Deploy this HTTP service to a globally accessible location.
+- Configure a pipeline on the SuperAnnotate platform to link this service to your project.
+
+\
+<img src="pics/Main_readme_schemas.png" alt="Main schemas" width="500"/>
+
+\
+The project automates the training of a text classification model and the tagging of data on the SuperAnnotate platform. \
+Here's a high-level overview of the process:
+
+1. **Annotate Data:** Annotate approximately 100 items per class using the SuperAnnotate platform.
+2. **Model Fine-Tuning:** Fine-tune the text classification model using the annotated data.
+3. **Prediction:** Use the fine-tuned model to predict labels for other items in your dataset.
+
+## How it works ##
+
+The project automates the training of a text classification model and the tagging of data on the SuperAnnotate platform. Everything happens in 3 main stages:
+
+### 1. Loading and preparing data ###
+
+- Annotations with file names are loaded from the specified project (and optional folders) on the platform.
+- Document texts are also loaded through the selected integration.
+- All this data is combined into a dataset and goes through standard processing: empty texts, duplicates, and extremely short or long texts (fewer than 10 or more than 2000 words) are removed. Texts are also preprocessed by converting them to lowercase and removing unnecessary spaces and line breaks. You can change the text preprocessing function (`text_auto_classification/utils/data/data_processing.py`) at any time to suit your needs.
+
+### 2. Model training ###
+
+- The labeled data is split into training and validation sets to evaluate the quality of the model and monitor the learning process.
+- Next, the hyperparameters are initialized; they can be customized through the training config file.
+- Fine-tuning of the model specified in the config then begins. The model, training arguments, and trainer are all defined with standard HuggingFace abstractions.
+- The model's output layer is sized according to the number of classes in the training data.
+
+### 3. Prediction ###
+
+- Items downloaded from the platform that do not yet have labels are set aside during data preparation as a separate set for prediction.
+- At this stage, the texts of these items are run through the model to obtain predictions.
+- These predictions are then uploaded to the platform.
+
+To configure the pipeline and the service from the platform side, read this [**Tutorial**](tutorial.md).
+
+## How to run service ##
+
+### API Service Configuration ###
+
+You can deploy the service wherever it is convenient; one basic option is an EC2 instance. Learn about instance creation and setup [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html).
+
+***NOTES***:
+
+- To verify that everything is functioning correctly, try calling the healthcheck endpoint.
+- Hardware requirements depend largely on your arguments and the base model being used, but an NVIDIA GPU is recommended. For the default service configuration, the following instance is advisable: [**g3s.xlarge**](https://instances.vantage.sh/aws/ec2/g3s.xlarge).
+- Also, ensure that the port on which your service is deployed (8080 by default) is open to the global network. Refer to this [**tutorial**](https://stackoverflow.com/questions/5004159/opening-port-80-ec2-amazon-web-services/10454688#10454688) for guidance on opening a port on an EC2 instance.
+
+### Pre-requirements ###
+
+To get started with the project, prepare all the necessary configuration files. By default, they are located in `etc/configs`. There are 3 configs:
+
+1. **SA_config.ini**:
+    - This is the configuration file for the SuperAnnotate SDK. It contains your platform token and is needed for authorization in SAClient. You can read more [here](https://doc.superannotate.com/docs/python-sdk#with-arguments).
+
+2. **service_config.json**:
+    - This file contains the basic settings for the service as a whole. It has the following fields:
+        - `SA_CONFIG_PATH`: The path to the first config (SA_config.ini).
+        - `SA_PROJECT_NAME`, `SA_FOLDERS`: The name and optionally the folders of the project on the platform with which to work.
+        - `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`: AWS keys, more details [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html).
+        - `AWS_URL_FOR_DATA_DOWNLOADS`, `AWS_URL_TO_MODEL_UPLOAD`: S3 URLs to the location of original documents and the place to save model checkpoints, respectively. More details [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-bucket-intro.html).
+
+***NOTE***: Please ensure that the folder structure under `AWS_URL_FOR_DATA_DOWNLOADS` matches the folders in `SA_FOLDERS`.
+
+3. **train_config.json**:
+    - This config contains the basic arguments necessary for training:
+        - `pretrain_model`: The name of the pre-trained model on HuggingFace (a BERT-like model is recommended).
+        - `validation_ratio`: A value from 0 to 1, representing the proportion of data that will be used to validate the model.
+        - `max_length`: The maximum length of texts for the tokenizer; 512 by default, which is the limit for BERT-like models.
+        - The remaining keys correspond to the arguments of the HuggingFace `TrainingArguments` class. More details [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).
+
+Once all the configs are set up as described, you can start the service. There are 2 options:
+
+### As Python file ###
+
+- Install Python version 3.11. More details [here](https://www.python.org/downloads/)
+- Install Nvidia drivers and CUDA toolkit using, for example, these instructions: [**Nvidia drivers**](https://ubuntu.com/server/docs/nvidia-drivers-installation) and [**CUDA toolkit**](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local)
+- Install dependencies: `pip install -r ./text_auto_classification/requirements.txt`
+- Set the Python path variable: `export PYTHONPATH="."`
+- Run the API: `uvicorn --host 0.0.0.0 --port 8080 text_auto_classification.fastapi_app:app`
+
+### As Docker container ###
+
+- Initialize environment variables:
+    - Path to the general configuration file `DEFAULT_SERVICE_CONFIG`: `export DEFAULT_SERVICE_CONFIG=etc/configs/service_config.json`
+    - Path to the configuration file with parameters for training `DEFAULT_TRAINING_CONFIG`: `export DEFAULT_TRAINING_CONFIG=etc/configs/train_config.json`
+- Install Docker, Nvidia drivers, CUDA toolkit and NVIDIA Container Toolkit using, for example, these instructions: [**Docker**](https://docs.docker.com/engine/install/ubuntu/); [**Nvidia drivers**](https://ubuntu.com/server/docs/nvidia-drivers-installation); [**CUDA toolkit**](https://developer.nvidia.com/cuda-12-2-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local); [**NVIDIA Container Toolkit**](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
+- Build the docker image: `sudo docker build -t text_auto_classification --build-arg DEFAULT_SERVICE_CONFIG=$DEFAULT_SERVICE_CONFIG --build-arg DEFAULT_TRAINING_CONFIG=$DEFAULT_TRAINING_CONFIG .`
+- Run a container: `sudo docker run --gpus all -p 8080:8080 -d text_auto_classification`
+
+## Endpoints ##
+
+The following endpoints are available in the Text Auto Classification service:
+
+- **GET /healthcheck**:
+    - **Summary**: Ping
+    - **Description**: Alive method
+    - **Input Type**: None
+    - **Output Type**: JSON
+    - **Output Values**:
+        - `{"healthy": true}`
+    - **Status Codes**:
+        - `200`: Successful Response
+
+- **POST /train_predict**:
+    - **Summary**: Train Predict
+    - **Description**: Train model on annotated data from SA project and auto annotate other data
+    - **Input Type**: None
+    - **Output Type**: JSON
+    - **Output Values**:
+        - `{"status": "Pipeline successfully started"}`
+        - `{"status": "Pipeline is already started"}`
+    - **Status Codes**:
+        - `200`: Pipeline successfully started
+        - `429`: Pipeline is already started
+
+- **GET /status**:
+    - **Summary**: Status
+    - **Description**: Method for status tracking
+    - **Input Type**: None
+    - **Output Type**: JSON
+    - **Output Values**:
+        - `{"status": "Not started"}`
+        - `{"status": "Downloading data"}`
+        - `{"status": "Model training"}`
+        - `{"status": "Predicting other items"}`
+        - `{"status": "Completed"}`
+        - `{"status": "Failed"}`
+    - **Status Codes**:
+        - `200`: Successful Response
+
+## Room for Improvements ##
+
+There are several areas where the project can be further improved:
+
+- **Implement support for multi-label classification**: Currently, the project focuses on single-label classification. Adding support for multi-label classification would enhance its versatility and applicability in various use cases.
+
+- **Add logic for working with long texts, such as auto chunking**: Handling long texts efficiently is crucial for many natural language processing tasks. Implementing logic to handle long texts, such as automatic chunking, would improve the project's performance and scalability when dealing with lengthy documents.
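
The data preparation rules described in the README's "Loading and preparing data" stage can be sketched roughly as follows. This is a minimal illustration, not the contents of `text_auto_classification/utils/data/data_processing.py`: the function names `preprocess_text` and `prepare_texts` are hypothetical, and the order of the steps (clean first, then filter and deduplicate) is an assumption.

```python
import re

MIN_WORDS = 10    # README: texts with fewer than 10 words are removed
MAX_WORDS = 2000  # README: texts with more than 2000 words are removed


def preprocess_text(text: str) -> str:
    """Lowercase the text and collapse runs of spaces and line breaks."""
    return re.sub(r"\s+", " ", text).strip().lower()


def prepare_texts(texts: list[str]) -> list[str]:
    """Clean texts, then drop empty, too short/long, and duplicate entries.

    Hypothetical helper: the real pipeline may apply these steps in a
    different order or on a different data structure.
    """
    seen: set[str] = set()
    prepared: list[str] = []
    for raw in texts:
        cleaned = preprocess_text(raw)
        n_words = len(cleaned.split())
        if n_words < MIN_WORDS or n_words > MAX_WORDS:
            continue  # also covers empty strings (0 words)
        if cleaned in seen:
            continue  # duplicate after normalization
        seen.add(cleaned)
        prepared.append(cleaned)
    return prepared
```

Deduplicating after normalization means two texts that differ only in case or whitespace count as the same item, which may or may not match the service's actual behavior.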

etc/action_code.py

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+import json
+import os
+import urllib.parse
+from datetime import datetime
+from time import sleep, time
+
+import requests
+from superannotate import SAClient
+
+SA_TOKEN = os.environ["SA_TOKEN"]
+URL = os.environ["URL"]
+# Minimum number of completed items per class required to start Auto-Classification
+# Defaults to 100; you can change it, but a lower limit may lead to unstable results
+COUNT_ITEMS_PER_CLASS = 100
+
+sa = SAClient(token=SA_TOKEN)
+
+
+def read_status(resp):
+    return json.loads(resp.content.decode()).get("status")
+
+
+def check_enough_data(project_name, threshold):
+    project_metadata = sa.get_project_metadata(
+        project=project_name,
+        include_annotation_classes=True
+    )
+
+    classes = [cl["name"] for cl in project_metadata["classes"] if cl["type"] == "tag"]
+
+    enough_data_flag = True
+    for cl in classes:
+        cl_items = sa.query(
+            project=project_name,
+            query=f"metadata(status =Completed) AND instance(className = {cl})"
+        )
+
+        if len(cl_items) < threshold:
+            print(f"Amount of completed items is too small for *{cl}*. {len(cl_items)}/{threshold}")
+            enough_data_flag = False
+
+    return enough_data_flag
+
+
+def handler(event, context):
+    # Get project name
+    project_name = sa.get_project_by_id(context['after']['project_id'])['name']
+
+    # Don't run the service if any class has fewer than COUNT_ITEMS_PER_CLASS completed items
+    if not check_enough_data(project_name, COUNT_ITEMS_PER_CLASS):
+        return False
+
+    # Call service
+    started = start_train_predict()
+    if not started:
+        return False
+
+    # Monitor the service and wait for the pipeline to finish
+    while True:
+        resp = requests.get(urllib.parse.urljoin(URL, "text-auto-classification/status"))
+
+        print(f"Status: {read_status(resp)}, waiting")
+        # Create datetime object from current timestamp
+        dt = datetime.fromtimestamp(int(time()))
+        # Format datetime as "YYYY-MM-DD hh:mm:ss"
+        formatted_datetime = dt.strftime("%Y-%m-%d %H:%M:%S")
+        print(formatted_datetime)
+
+        if resp.status_code == 200 and read_status(resp) == "Completed":
+            return True
+        if (resp.status_code == 200 and read_status(resp) == "Failed") or resp.status_code != 200:
+            print(resp.status_code)
+            print(read_status(resp))
+            return False
+
+        sleep(60)
+
+
+def start_train_predict():
+    resp = requests.post(urllib.parse.urljoin(URL, "text-auto-classification/train_predict"))
+
+    if resp.status_code == 200:
+        return True
+    else:
+        print(resp.status_code)
+        print(read_status(resp))
+        return False
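
The monitoring loop in `handler` polls the status endpoint once a minute with no upper bound, so a stalled pipeline would keep the action waiting indefinitely. One possible variant with a poll budget is sketched below. It is an illustration only: `wait_for_pipeline`, `max_polls`, and the injected `get_status` and `sleep_fn` callables are hypothetical names, and only the `"Completed"` and `"Failed"` status strings come from the service.

```python
from time import sleep
from typing import Callable, Optional

TERMINAL_OK = "Completed"
TERMINAL_FAILED = "Failed"


def wait_for_pipeline(
    get_status: Callable[[], Optional[str]],
    poll_interval: float = 60.0,
    max_polls: int = 600,
    sleep_fn: Callable[[float], None] = sleep,
) -> bool:
    """Poll until the pipeline completes, fails, or the poll budget runs out.

    get_status is expected to return the service's "status" string, or None
    when the request failed or the response body could not be parsed.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == TERMINAL_OK:
            return True
        if status is None or status == TERMINAL_FAILED:
            return False
        sleep_fn(poll_interval)
    return False  # gave up: the pipeline did not finish within the budget
```

Passing the status reader in as a callable keeps the loop free of network code, so it can be exercised with canned status sequences before being wired to `requests.get`.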

etc/configs/SA_config.ini

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[DEFAULT]
+SA_TOKEN = <token>

etc/configs/service_config.json

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+{
+    "SA_CONFIG_PATH": "etc/configs/SA_config.ini",
+    "SA_PROJECT_NAME": "Project Name",
+    "SA_FOLDERS": ["Folder1", "Folder2"],
+    "AWS_ACCESS_KEY_ID": "AWS ACCESS KEY",
+    "AWS_SECRET_ACCESS_KEY": "AWS SECRET ACCESS KEY",
+    "AWS_URL_FOR_DATA_DOWNLOADS": "S3 URL",
+    "AWS_URL_TO_MODEL_UPLOAD": "S3 URL"
+}
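
Since every value in `service_config.json` ships as a placeholder, it can help to fail early when a field is missing or empty. The sketch below shows one way to do that. `parse_service_config` and `REQUIRED_KEYS` are hypothetical names, and treating `SA_FOLDERS` as optional with an empty-list default is an assumption based on the README calling the folders optional; the service's own loader may behave differently.

```python
import json

# Keys taken from etc/configs/service_config.json; SA_FOLDERS is handled
# separately because the README describes the folders as optional.
REQUIRED_KEYS = (
    "SA_CONFIG_PATH",
    "SA_PROJECT_NAME",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_URL_FOR_DATA_DOWNLOADS",
    "AWS_URL_TO_MODEL_UPLOAD",
)


def parse_service_config(raw: str) -> dict:
    """Parse the service config JSON and check that required fields are set."""
    config = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"Missing service config keys: {', '.join(missing)}")
    # Assumption: no folders means "work with the project root".
    config.setdefault("SA_FOLDERS", [])
    return config
```

In practice the raw string would come from the file named by `DEFAULT_SERVICE_CONFIG`, for example `parse_service_config(open(path).read())`.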

etc/configs/train_config.json

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+{
+    "pretrain_model": "FacebookAI/xlm-roberta-base",
+    "validation_ratio": 0.15,
+    "max_length": 512,
+    "optim": "adamw_torch",
+    "learning_rate": 3e-5,
+    "lr_scheduler_type": "cosine_with_restarts",
+    "warmup_ratio": 0.2,
+    "per_device_train_batch_size": 4,
+    "per_device_eval_batch_size": 4,
+    "gradient_accumulation_steps": 2,
+    "num_train_epochs": 10,
+    "weight_decay": 0.01
+}
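
According to the README, `pretrain_model`, `validation_ratio`, and `max_length` are service-specific keys, while the remaining keys in `train_config.json` are passed on as HuggingFace `TrainingArguments`. A rough sketch of that split is shown below. The helper names `split_train_config` and `effective_batch_size` are hypothetical, and how the service actually consumes the config is not shown in this commit excerpt.

```python
CUSTOM_KEYS = ("pretrain_model", "validation_ratio", "max_length")


def split_train_config(config: dict) -> tuple[dict, dict]:
    """Separate service-specific keys from TrainingArguments keyword arguments."""
    custom = {key: config[key] for key in CUSTOM_KEYS}
    training_kwargs = {k: v for k, v in config.items() if k not in CUSTOM_KEYS}
    return custom, training_kwargs


def effective_batch_size(training_kwargs: dict, n_devices: int = 1) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return (
        training_kwargs.get("per_device_train_batch_size", 8)
        * training_kwargs.get("gradient_accumulation_steps", 1)
        * n_devices
    )

# The second dict could then be unpacked into the HuggingFace class, e.g.
# TrainingArguments(output_dir="output_dir", **training_kwargs), where the
# output_dir value is a placeholder chosen for this sketch.
```

With the default config on a single GPU, a per-device batch of 4 and 2 gradient accumulation steps give an effective batch size of 8 per optimizer step.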

pics/Main_readme_schemas.png

280 KB

pics/examples/example_1.png
