
Commit c7f3169

Authored by Aaron Teo <aaron.teo1@ibm.com>
ggml-cpu : disable GGML_NNPA by default due to instability (#14880)
* docs: update s390x document for sentencepiece
  (cherry picked from commit e086c5e)

* docs: update huggingface links + reword
  (cherry picked from commit 8410b08)

* ggml-cpu: disable ggml-nnpa compile flag by default (fixes #14877)
  (cherry picked from commit 412f4c7)

* docs: update s390x build docs to reflect nnpa disable
  (cherry picked from commit c1eeae1)

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
1 parent 793c0d7 commit c7f3169


3 files changed, +40 -9 lines changed


docs/build-s390x.md

Lines changed: 38 additions & 8 deletions
@@ -42,14 +42,14 @@ cmake --build build --config Release -j $(nproc)
 cmake --build build --config Release -j $(nproc)
 ```
 
-- By default, NNPA is enabled when available. To disable it (not recommended):
+- By default, NNPA is disabled. To enable it:
 
 ```bash
 cmake -S . -B build \
   -DCMAKE_BUILD_TYPE=Release \
   -DGGML_BLAS=ON \
   -DGGML_BLAS_VENDOR=OpenBLAS \
-  -DGGML_NNPA=OFF
+  -DGGML_NNPA=ON
 
 cmake --build build --config Release -j $(nproc)
 ```
@@ -84,16 +84,24 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
 
 ![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)
 
-You can find popular models pre-converted and verified at [s390x Ready Models](https://huggingface.co/collections/taronaeo/s390x-ready-models-672765393af438d0ccb72a08).
+You can find popular models pre-converted and verified at [s390x Verified Models](https://huggingface.co/collections/taronaeo/s390x-verified-models-672765393af438d0ccb72a08) or [s390x Runnable Models](https://huggingface.co/collections/taronaeo/s390x-runnable-models-686e951824198df12416017e).
 
-These models have already been converted from `safetensors` to `GGUF Big-Endian` and their respective tokenizers verified to run correctly on IBM z15 and later system.
+These models have already been converted from `safetensors` to `GGUF` Big-Endian and their respective tokenizers verified to run correctly on IBM z15 and later systems.
 
 2. **Convert safetensors model to GGUF Big-Endian directly (recommended)**
 
 ![File Type - safetensors](https://img.shields.io/badge/File_Type-safetensors-da1e28)
 
 The model you are trying to convert must be in `safetensors` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct)). Make sure you have downloaded the model repository for this case.
 
+Ensure that you have installed the required packages in advance:
+
+```bash
+pip3 install -r requirements.txt
+```
+
+Convert the `safetensors` model to `GGUF`:
+
 ```bash
 python3 convert_hf_to_gguf.py \
   --outfile model-name-be.f16.gguf \
@@ -116,7 +124,7 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
 
 ![File Type - gguf](https://img.shields.io/badge/File_Type-gguf-fff)
 
-The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
+The model you are trying to convert must be in `gguf` file format (for example [IBM Granite 3.3 2B GGUF](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct-GGUF)). Make sure you have downloaded the model file for this case.
 
 ```bash
 python3 gguf-py/gguf/scripts/gguf_convert_endian.py model-name.f16.gguf BIG
@@ -141,15 +149,15 @@ Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by
 
 ### 2. NNPA Vector Intrinsics Acceleration
 
-Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned on when available) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
+Only available in IBM z16 or later systems with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp on older systems, such as IBM z15/arch13. On such systems, the APIs can still run but will use a scalar implementation.
 
 ### 3. zDNN Accelerator
 
-_Only available in IBM z16 or later system. No direction at the moment._
+_Only available in IBM z16 / LinuxONE 4 or later systems. No support currently available._
 
 ### 4. Spyre Accelerator
 
-_No direction at the moment._
+_Only available with IBM z17 / LinuxONE 5 or later systems. No support currently available._
 
 ## Performance Tuning
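The per-generation availability described in the docs section above can be summarized as a small lookup. This is an illustrative sketch only: the table and helper are hypothetical, not part of llama.cpp.

```python
# Hypothetical summary of the accelerator support described in the docs above;
# neither this dict nor the helper exists in llama.cpp itself.
SIMD_SUPPORT = {
    "z15": {"VXE"},          # VXE: -DGGML_VXE=ON (on by default)
    "z16": {"VXE", "NNPA"},  # NNPA: requires -DGGML_NNPA=ON (off by default)
    "z17": {"VXE", "NNPA"},  # zDNN/Spyre hardware exists, but no llama.cpp support yet
}

def accelerators(machine_level: str) -> set:
    """Return the acceleration paths llama.cpp can use; empty set means scalar fallback."""
    return SIMD_SUPPORT.get(machine_level, set())

print(sorted(accelerators("z16")))  # ['NNPA', 'VXE']
print(sorted(accelerators("z14")))  # [] (scalar implementation only)
```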
@@ -189,6 +197,26 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
 
 Answer: Please ensure that your GCC compiler is of minimum GCC 15.1.0 version, and have `binutils` updated to the latest version. If this does not fix the problem, kindly open an issue.
 
+4. Failing to install the `sentencepiece` package using GCC 15+
+
+Answer: The `sentencepiece` team is aware of this, as seen in [this issue](https://github.com/google/sentencepiece/issues/1108).
+
+As a temporary workaround, run the installation command with the following environment variable:
+
+```bash
+export CXXFLAGS="-include cstdint"
+```
+
+For example,
+
+```bash
+CXXFLAGS="-include cstdint" pip3 install -r requirements.txt
+```
+
+5. `-DGGML_NNPA=ON` generates gibberish output
+
+Answer: We are aware of this, as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.
+
 ## Getting Help on IBM Z & LinuxONE
 
 1. **Bugs, Feature Requests**
@@ -244,3 +272,5 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
 - ✅ - acceleration available
 - 🚫 - acceleration unavailable, will still run using scalar implementation
 - ❓ - acceleration unknown, please contribute if you can test it yourself
+
+Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on July 25, 2025.
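As a quick sanity check after the Big-Endian conversion steps documented above, a GGUF file's byte order can be guessed from its header. This is a minimal sketch under two assumptions (it is not the official reader logic): the magic bytes are the literal characters `GGUF` in both byte orders, and the uint32 version field at offset 4 holds a small value (currently 3).

```python
import struct

def gguf_byte_order(header: bytes) -> str:
    """Guess whether a GGUF header is little- or big-endian.

    Heuristic: the version field (uint32 at offset 4) is a small number,
    so read it little-endian and check whether the value is plausible.
    """
    if header[:4] != b"GGUF":
        raise ValueError("not a GGUF header")
    (version_le,) = struct.unpack_from("<I", header, 4)
    return "little" if version_le < 1000 else "big"

# Synthetic 8-byte headers for illustration (magic + version 3)
print(gguf_byte_order(b"GGUF" + struct.pack("<I", 3)))  # little
print(gguf_byte_order(b"GGUF" + struct.pack(">I", 3)))  # big
```

With a real file, only the first 8 bytes need to be read, e.g. `gguf_byte_order(open(path, "rb").read(8))`.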

ggml/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
 option(GGML_RV_ZFH "ggml: enable riscv zfh" OFF)
 option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
 option(GGML_VXE "ggml: enable vxe" ON)
-option(GGML_NNPA "ggml: enable nnpa" ON)
+option(GGML_NNPA "ggml: enable nnpa" OFF) # temp disabled by default, see: https://github.com/ggml-org/llama.cpp/issues/14877
 
 option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
 set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")

ggml/src/ggml-cpu/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -458,6 +458,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
         list(APPEND ARCH_FLAGS -march=z16)
     elseif (${S390X_M} MATCHES "9175|9176")
         # NOTE: Only available from GCC 15.1.0 onwards. Any z17 machine with compile issues must first verify their GCC version.
+        # binutils must also be updated to the latest for the -march=z17 flag to work. Otherwise, use -march=arch15.
         message(STATUS "z17 target")
         list(APPEND ARCH_FLAGS -march=z17)
     else()
