This is an example of a barebones Embeddings API implementation. The API output mirrors that of the OpenAI Embeddings API. Built on FastAPI, Pydantic, and Sentence Transformers, this project is a learning exercise as much as a starting point for developing custom embedding API interfaces.
The API code is purposely contained in a single main.py file to keep it flexible, and I have included TODO comment tags in places that deserve further development in a production implementation.
Before running the API, it is recommended to download the model checkpoints that will be used to generate embeddings, so the pretrained weights are not re-downloaded after server restarts, reloads, etc.
This example API supports serving multiple model options, which can be specified in the models.txt file (a barebones implementation of a model repo). The example comes with two pretrained models from the Sentence-Transformers library specified, but more can be added by simply adding checkpoint names. More checkpoint options are available here.
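Assuming the models.txt format is one checkpoint name per line (consistent with "adding checkpoint names" above), the file shipped with this example would look like:

```
all-MiniLM-L12-v2
all-mpnet-base-v2
```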
Once the models.txt file is ready, run the download.py script to save the model weights and configurations in the artifacts directory.
The API is built with FastAPI, so start the server as follows:

```shell
$ uvicorn main:app --reload
```

> **Tip:** FastAPI provides Swagger documentation out of the box. It can be reached at `localhost:port/docs#`.

> **Note:** The `--reload` flag allows editing the source file and updating the running app; it is not necessary otherwise.
The `/models` endpoint provides information on the available models (see the specified models above), as well as metadata for each model, such as dimension size.
```python
import requests
from pprint import pprint

response = requests.get(
    url="http://localhost:8000/models",
    headers={"Content-Type": "application/json"},
)
pprint(response.json())
```

```shell
$ curl -X 'GET' \
  'http://localhost:8000/models' \
  -H 'accept: application/json'
```

```json
{
  "data": {
    "models": {
      "all-MiniLM-L12-v2": {"dim": 384},
      "all-mpnet-base-v2": {"dim": 768}
    }
  },
  "generated": "2023-12-02 @ 20:39:22",
  "id": "76662a90-55da-47b9-8072-310ed4d090b8"
}
```

The core feature of the API is to generate and return embedding representations of input text sequences. The example below is for a single text sequence, but the API can handle an array of text sequences as input. The request requires the user to select an available model (see the section above).
The API provides four fields of data in the response:

- `model`: name of the model selected in the request
- `generated`: generic date/time stamp
- `id`: generic UUID for record inference
- `data`: contains the returned embedding contents

For each embedding returned, there are three fields of data:

- `embedding`: vector representation of the input text sequence
- `index`: index number for the embedding; relevant if multiple text sequences are provided as input
- `num_tokens`: number of tokens derived from the input text sequence
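For reference, the response fields described above could be expressed as Pydantic models along these lines; the class names here are illustrative and not necessarily those used in main.py:

```python
from pydantic import BaseModel


class EmbeddingRecord(BaseModel):
    """One embedding in the response `data` list."""
    embedding: list[float]  # vector representation of the input text sequence
    index: int              # position of the input in the request batch
    num_tokens: int         # token count derived from the input sequence


class EmbeddingsResponse(BaseModel):
    """Top-level response envelope, mirroring the fields described above."""
    data: list[EmbeddingRecord]
    generated: str  # generic date/time stamp
    id: str         # generic UUID for record inference
    model: str      # name of the model selected in the request
```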
```python
import requests
from pprint import pprint

checkpoint = "all-mpnet-base-v2"
text = "Can we mimic the Embeddings API output format from OpenAI? I dunno, but we can try."
headers = {"Content-Type": "application/json"}

response = requests.post(
    url="http://localhost:8000/embeddings",
    json={"model": checkpoint, "input": text},
    headers=headers,
)
pprint(response.json())
```

```shell
$ curl -X 'POST' \
  'http://localhost:8000/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "all-mpnet-base-v2",
  "input": "Can we mimic the Embeddings API output format from OpenAI? I dunno, but we can try."
}'
```

```json
{
  "data": [
    {
      "index": 0,
      "embedding": [
        -0.03878581151366234,
        0.03181726858019829,
        -0.020964443683624268,
        ...
      ],
      "num_tokens": 27
    }
  ],
  "generated": "2023-12-02 @ 20:50:04",
  "id": "83e70cf8-e8d6-4783-a928-b45d9507635e",
  "model": "all-mpnet-base-v2"
}
```

The API can accept a list of input text sequences, and the response format does not change; the embeddings are returned as a list either way.
```python
import requests

checkpoint = "all-mpnet-base-v2"
text = [
    "Can we mimic the Embeddings API output format from OpenAI?",
    "I dunno, but we can try.",
]
headers = {"Content-Type": "application/json"}

response = requests.post(
    url="http://localhost:8000/embeddings",
    json={"model": checkpoint, "input": text},
    headers=headers,
)
if response.status_code == 200:
    content = response.json()
    print(f"** returned {len(content['data'])} embeddings...")
```

```
>>> ** returned 2 embeddings...
```

In the current implementation, the API returns an error if an invalid (i.e., unspecified or unavailable) model is selected in the request.
```python
import json

import requests
from pprint import pprint

embeddings_url = "http://localhost:8000/embeddings"
headers = {"Content-Type": "application/json"}

model = "i-ll-take-the-finest-model-you-ve-got!"
text = "..."

response = requests.post(
    url=embeddings_url,
    json={"model": model, "input": text},
    headers=headers,
)
error_msg = json.loads(response.content)
pprint(error_msg)
```

```json
{
  "detail": [
    {
      "ctx": {"error": {}},
      "input": "i-ll-take-the-finest-model-you-ve-got!",
      "loc": ["body", "model"],
      "msg": "Assertion failed, Model `i-ll-take-the-finest-model-you-ve-got!` has not been specified and is not available.",
      "type": "assertion_error",
      "url": "https://errors.pydantic.dev/2.5/v/assertion_error"
    }
  ]
}
```

Be sure to check out the many TODO comments in the main.py file to see how this example could be further expanded. Also, submit a pull request to add more TODO's.
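The `assertion_error` shape above suggests the request body is validated with a Pydantic validator. A minimal sketch of such a validator might look like the following; the class name and the `AVAILABLE_MODELS` registry are illustrative, not taken from main.py:

```python
from typing import Union

from pydantic import BaseModel, field_validator

# Hypothetical registry of available models; in this project it would be
# populated from models.txt.
AVAILABLE_MODELS = {"all-MiniLM-L12-v2", "all-mpnet-base-v2"}


class EmbeddingsRequest(BaseModel):
    model: str
    input: Union[str, list[str]]

    @field_validator("model")
    @classmethod
    def model_is_available(cls, v: str) -> str:
        # A failed assertion surfaces in the response as a Pydantic
        # `assertion_error`, matching the error payload shown above.
        assert v in AVAILABLE_MODELS, (
            f"Model `{v}` has not been specified and is not available."
        )
        return v
```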