| # | file name | description |
|---|---|---|
| 1 | download_model.py | Download an LLM model and convert it into an OpenVINO IR model |
| 2 | inference.py | Run an LLM model with OpenVINO. One of the simplest LLM inference examples using OpenVINO and the optimum-intel library. |
| 3 | inference-stream.py | Run an LLM model with OpenVINO and optimum-intel. Display the answer in streaming mode (word by word). A minimal sketch of this pattern is shown after this table. |
| 4 | inference-stream-openvino-only.py | Run an LLM model with OpenVINO only. This program doesn't require any DL frameworks such as TensorFlow or PyTorch. It doesn't even use the 'optimum-intel' library or Hugging Face tokenizers; it uses a simple (and dumb) tokenizer that I wrote instead. If you see only garbage text from the program, try switching to the HF tokenizer (uncomment AutoTokenizer and comment out SimpleTokenizer). |
| 5 | inference-stream-openvino-only-greedy.py | Same as program #4 but uses 'greedy decoding' instead of sampling. This program generates the same output text on every run because it always picks the highest-probability token ID from the predictions (= greedy decoding). A small sketch contrasting greedy decoding with sampling is shown after this table. |
| 6 | inference-stream-openvino-only-stateless.py | Same as program #4 but supports STATELESS models (which do not use internal state variables to keep the KV-cache values inside the model) instead of stateful models. |
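
For reference, here is a minimal sketch of the optimum-intel inference pattern used by programs #2 and #3. The model directory and the prompt are placeholders (not taken from the repository's scripts); point `model_dir` at an IR model you exported.

```python
from transformers import AutoTokenizer, TextStreamer
from optimum.intel.openvino import OVModelForCausalLM

# Path to an OpenVINO IR model exported by download_model.py or optimum-cli (placeholder).
model_dir = 'TinyLlama-1.1B-Chat-v1.0/INT4'

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir, device='CPU')

prompt = 'Explain what OpenVINO is, briefly.'
inputs = tokenizer(prompt, return_tensors='pt')

# TextStreamer prints the generated text word by word as it is produced (streaming mode).
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```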
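The difference between sampling (program #4) and greedy decoding (program #5) comes down to how the next token ID is picked from the model's output logits. The helper below is an illustrative sketch of that distinction, not code taken from those programs.

```python
import numpy as np

def pick_next_token(logits: np.ndarray, greedy: bool = True, temperature: float = 1.0) -> int:
    """Pick the next token ID from the logits of the last position."""
    if greedy:
        # Greedy decoding: always take the highest-probability token,
        # so the generated text is the same on every run.
        return int(np.argmax(logits))
    # Sampling: draw a token according to the softmax probabilities,
    # so the generated text can differ from run to run.
    logits = logits / temperature
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```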
- Preparation
Note: Converting an LLM model requires a large amount of memory (>=32GB).
```sh
python -m venv venv
venv\Scripts\activate
python -m pip install -U pip
pip install -U setuptools wheel
pip install -r requirements.txt
```

- Download an LLM model and generate OpenVINO IR models
```sh
python download_model.py
```
Hint: You can also use the optimum-cli tool to download models from the Hugging Face hub. You need to install the optimum-intel Python package to export the model for OpenVINO.
Hint: You can generate a stateless model by adding the --disable-stateful option (an example is shown after the commands below).
```sh
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4_asym_g64 TinyLlama-1.1B-Chat-v1.0/INT4
optimum-cli export openvino -m intel/neural-chat-7b-v3 --weight-format int4_asym_g64 neural-chat-7b-v3/INT4
```
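For example, a stateless export of the same TinyLlama model (for use with program #6) could look like the following; the output directory name is only an illustration:

```sh
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4_asym_g64 --disable-stateful TinyLlama-1.1B-Chat-v1.0/INT4-stateless
```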
- Run inference
```sh
python inference.py
# or
python inference-stream.py
```
The following web sites are also informative and helpful for optimum-intel users.
- Test environment
  - Windows 11
  - OpenVINO 2023.3.0 LTS
