NNPA is disabled by default. To enable it:

```bash
cmake -S . -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=OpenBLAS \
    -DGGML_NNPA=ON

cmake --build build --config Release -j $(nproc)
```
All models need to be converted to Big-Endian. You can achieve this in three cases:

1. **Use pre-converted models verified for use**

   You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
   These models have already been converted from `safetensors` to `GGUF` Big-Endian, and their respective tokenizers have been verified to run correctly on IBM z15 and later systems.
2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
   The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.

   Ensure that you have installed the required packages in advance:
   ```bash
   pip3 install -r requirements.txt
   ```
   Convert the `safetensors` model to `GGUF`:
   ```bash
   python3 convert_hf_to_gguf.py \
       --outfile model-name-be.f16.gguf \
       --outtype f16 \
       ./model-directory/
   ```
3. **Convert an existing GGUF Little-Endian model to Big-Endian**
   The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
   ```bash
   python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
   ```
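The conversion step above byte-swaps every multi-byte field in the file. As a minimal sketch of what the endianness difference looks like at the byte level (illustrative only, not the `gguf-py` implementation):

```python
import struct

# The same 32-bit integer in little-endian ("<") and big-endian (">") byte
# order. s390x is a big-endian architecture, so GGUF files consumed there
# must store multi-byte values most-significant byte first.
value = 3  # e.g. a small header field
little = struct.pack("<I", value)
big = struct.pack(">I", value)

print(little.hex())  # 03000000
print(big.hex())     # 00000003

# Reading little-endian bytes as big-endian garbles the value, which is why
# an unconverted model fails on s390x.
print(struct.unpack(">I", little)[0])  # 50331648, not 3
```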
### 1. SIMD Acceleration

Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag.
### 2. NNPA Vector Intrinsics Acceleration
Only available in IBM z16 or later systems with the `-DGGML_NNPA=ON` compile flag (turned off by default). No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
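Whether the running machine actually exposes NNPA can be checked from the kernel's reported CPU features (a sketch; assumes a Linux on IBM Z system where `/proc/cpuinfo` lists `nnpa` among the facilities):

```shell
# Report whether the 'nnpa' facility is advertised by the kernel.
if grep -qw nnpa /proc/cpuinfo 2>/dev/null; then
    echo "NNPA available"
else
    echo "NNPA not available"
fi
```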
### 3. zDNN Accelerator
_Only available in IBM z16 / LinuxONE 4 or later systems. No support is currently available._
### 4. Spyre Accelerator
_Only available in IBM z17 / LinuxONE 5 or later systems. No support is currently available._
## Performance Tuning
IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongly recommended to use BLAS.
Answer: Please ensure that your GCC compiler is at least version 15.1.0, and that `binutils` is updated to the latest version. If this does not fix the problem, kindly open an issue.
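One way to verify the minimum is met is a `sort -V` comparison (a sketch with an illustrative hard-coded version string; substitute the output of `gcc -dumpfullversion` on your system):

```shell
minimum="15.1.0"
current="15.2.0"  # illustrative; use: current=$(gcc -dumpfullversion)

# sort -V orders version strings numerically; if the minimum sorts first,
# the current version satisfies it.
if [ "$(printf '%s\n' "$minimum" "$current" | sort -V | head -n1)" = "$minimum" ]; then
    echo "GCC is new enough"
else
    echo "GCC is too old"
fi
```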
4. Failing to install the `sentencepiece` package using GCC 15+
Answer: The `sentencepiece` team is aware of this, as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).
As a temporary workaround, please run the installation command with the following environment variables.
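A commonly cited form of this workaround (an assumption here, not necessarily the exact variables the original listed) forces `<cstdint>` to be included while the package compiles:

```shell
# Assumed workaround: force-include <cstdint> so sentencepiece's C++ sources
# build under GCC 15's stricter standard headers.
CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
```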
Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.
## Getting Help on IBM Z & LinuxONE
1. **Bugs, Feature Requests**
- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself
Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on July 25, 2025.