| # | file name | description |
|---|---|---|
| 1 | download_model.py | Download an LLM model and convert it into an OpenVINO IR model |
| 2 | inference.py | Run an LLM model with OpenVINO. One of the simplest LLM inference examples using OpenVINO and the optimum-intel library. |
| 3 | inference-stream.py | Run an LLM model with OpenVINO and optimum-intel. Display the answer in streaming mode (word by word). A minimal sketch of this pattern is shown after this table. |
| 4 | inference-stream-openvino-only.py | Run an LLM model with OpenVINO only. This program doesn't require any DL frameworks such as TensorFlow or PyTorch. It doesn't even use the 'optimum-intel' library or Hugging Face tokenizers; it uses a simple (and dumb) tokenizer that I wrote instead. If you see only garbage text from the program, try switching to the HF tokenizer (uncomment AutoTokenizer and comment out SimpleTokenizer). |
| 5 | inference-stream-openvino-only-greedy.py | Same as program #4 but uses 'greedy decoding' instead of sampling. This program generates the same output text on every run because it always picks the highest-probability token ID from the predictions (= greedy decoding). A small sketch contrasting greedy decoding with sampling is shown after this table. |
| 6 | inference-stream-openvino-only-stateless.py | Same as program #4 but supports STATELESS models (which do not use internal state variables to keep the KV-cache values inside the model) instead of stateful models. |
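
For reference, here is a minimal sketch of the optimum-intel inference pattern used by programs #2 and #3. The model directory and the prompt are placeholders (not taken from the repository's scripts); point `model_dir` at an IR model you exported.

```python
from transformers import AutoTokenizer, TextStreamer
from optimum.intel.openvino import OVModelForCausalLM

# Path to an OpenVINO IR model exported by download_model.py or optimum-cli (placeholder).
model_dir = 'TinyLlama-1.1B-Chat-v1.0/INT4'

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device='CPU')

prompt = 'Explain what OpenVINO is, briefly.'
inputs = tokenizer(prompt, return_tensors='pt')

# TextStreamer prints the generated text word by word as it is produced (streaming mode).
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```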
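The difference between sampling (program #4) and greedy decoding (program #5) comes down to how the next token ID is picked from the model's output logits. The helper below is an illustrative sketch of that distinction, not code taken from those programs.

```python
import numpy as np

def pick_next_token(logits: np.ndarray, greedy: bool = True, temperature: float = 1.0) -> int:
    """Pick the next token ID from the logits of the last position."""
    if greedy:
        # Greedy decoding: always take the highest-probability token,
        # so the generated text is the same on every run.
        return int(np.argmax(logits))
    # Sampling: draw a token according to the softmax probabilities,
    # so the generated text can differ from run to run.
    logits = logits / temperature
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```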
- Preparation
Note: Converting an LLM model requires a large amount of memory (>=32GB).
```sh
python -m venv venv
venv\Scripts\activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
```

- Download an LLM model and generate OpenVINO IR models
```sh
python download_model.py
```
Hint: You can also use the optimum-cli tool to download models from the Hugging Face hub. You need to install the optimum-intel Python package to export the model for OpenVINO.
Hint: You can generate a stateless model by adding the --disable-stateful option (an example is shown after the commands below).
```sh
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4_asym_g64 TinyLlama-1.1B-Chat-v1.0/INT4
optimum-cli export openvino -m intel/neural-chat-7b-v3 --weight-format int4_asym_g64 neural-chat-7b-v3/INT4
```
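For example, a stateless export of the same TinyLlama model (for use with program #6) could look like the following; the output directory name is only an illustration:

```sh
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4_asym_g64 --disable-stateful TinyLlama-1.1B-Chat-v1.0/INT4-stateless
```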
- Run inference
```sh
python inference.py
# or
python inference-stream.py
```
The following web sites are also informative and helpful for optimum-intel users.
- Test environment
  - Windows 11
  - OpenVINO 2023.3.0 LTS
