ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors #15379


Merged
merged 2 commits into from
Aug 18, 2025

Conversation

compilade
Collaborator

This is a follow-up to #11773, with the fix from #11773 (comment) (which should fix #11764 and #12439).

make_qp_quants doesn't guard against division by zero when its suml2 is zero while calculating the resulting scale, so I've added a check to avoid producing NaN scales that way. That function is used for Q2_K, Q4_K, Q5_K, and IQ2_XXS.

In the future, I'm hoping to replace make_qp_quants with a saner quantization function from #12557. But until that's ready, might as well prevent some known existing errors.

Somewhat relatedly, I've been studying the i-quants implementation, and in IQ1_S and IQ1_M, -FLT_MIN being used as the initial best_score seems wrong.

float best_score = -FLT_MIN, scale = max;
int besti1 = -1, besti2 = -1, best_shift = 0;
for (int i1 = 0; i1 <= block_size; ++i1) {
    for (int i2 = i1; i2 <= block_size; ++i2) {
        float sumqx = (sumx[i1] - sumx[0])*x_p[0] + (sumx[i2] - sumx[i1])*x_p[1] + (sumx[block_size] - sumx[i2])*x_p[2];
        float sumq2 = (sumw[i1] - sumw[0])*x_p[0]*x_p[0] + (sumw[i2] - sumw[i1])*x_p[1]*x_p[1] + (sumw[block_size] - sumw[i2])*x_p[2]*x_p[2];
        if (sumq2 > 0 && sumqx*sumqx > best_score*sumq2) {
            scale = sumqx/sumq2; best_score = scale*sumqx;
            besti1 = i1; besti2 = i2; best_shift = 1;
        }
        sumqx = (sumx[i1] - sumx[0])*x_m[0] + (sumx[i2] - sumx[i1])*x_m[1] + (sumx[block_size] - sumx[i2])*x_m[2];
        sumq2 = (sumw[i1] - sumw[0])*x_m[0]*x_m[0] + (sumw[i2] - sumw[i1])*x_m[1]*x_m[1] + (sumw[block_size] - sumw[i2])*x_m[2]*x_m[2];
        if (sumq2 > 0 && sumqx*sumqx > best_score*sumq2) {
            scale = sumqx/sumq2; best_score = scale*sumqx;
            besti1 = i1; besti2 = i2; best_shift = -1;
        }
    }
}
GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0);
GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0);

-FLT_MIN is -1.1754944e-38 (the negative normal value closest to zero).

The assertion can fail in the situation where sumq2 never exceeds 5.9604645e-08 while sumqx*sumqx rounds to zero (e.g. when sumqx is on the order of 1e-24): then -FLT_MIN * 5.9604645e-08 also underflows to zero, so the branch setting a new best_score is never taken.

I'm pretty sure the intention was for the initial value of best_score to be -FLT_MAX (i.e. -3.4028235e+38), which doesn't have the problem of rounding to zero (the product can overflow to -INFINITY, but that's also a reasonable starting value, since it would still be smaller than any sumqx*sumqx).

This should hopefully fix variants of #14788 and #6067.

cc @nicoboss, @bartowski1182, @schmorp since you've previously reported issues linked to this.



@compilade compilade added bugfix fixes an issue or bug Tensor Encoding Scheme https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes labels Aug 17, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Aug 17, 2025
@CISC
Collaborator

CISC commented Aug 17, 2025

> make_qp_quants doesn't check for divisions by zero with its suml2 when calculating the resulting scale, and so I've added a check to avoid producing NAN scales that way. That function is used for Q2_K, Q4_K, Q5_K, and IQ2_XXS.

See #7825 (which also proposes the same fix for make_q3_quants). :)

@CISC
Collaborator

CISC commented Aug 17, 2025

> Somewhat relatedly, I've been studying the i-quants implementation, and in IQ1_S and IQ1_M, -FLT_MIN being used as the initial best_score seems wrong.

This change was made in #7955

@compilade
Collaborator Author

compilade commented Aug 17, 2025

> See #7825 (which also proposes the same fix for make_q3_quants). :)

@CISC Should that PR be merged as well, or should I integrate the one-line change in make_q3_quants here?

Ah, so the -FLT_MIN problem with IQ1_S and IQ1_M comes from #7955, which attempted to fix that problem, but didn't pick the right constant. (EDIT: I see you've commented this reference at the same time as I typed this :)

@CISC
Collaborator

CISC commented Aug 17, 2025

> See #7825 (which also proposes the same fix for make_q3_quants). :)
>
> @CISC Should that PR be merged as well, or should I integrate the one-line change in make_q3_quants here?

Might as well integrate it here, and I'll close that one.

@CISC
Collaborator

CISC commented Aug 17, 2025

Also, should maybe suml2 be allowed to be negative, like here?

float scale = suml2 ? sumlx/suml2 : 0.0f;

@compilade
Collaborator Author

> Also, should maybe suml2 be allowed to be negative, like here?

I don't think so. suml2 can only be negative when the imatrix weights are negative, which should not happen because it breaks other assumptions.

It's a sum of squares, which should always be positive.

@CISC
Collaborator

CISC commented Aug 17, 2025

> Also, should maybe suml2 be allowed to be negative, like here?
>
> I don't think so. suml2 can only be negative when the imatrix weights are negative, which should not happen because it breaks other assumptions.
>
> It's a sum of squares, which should always be positive.

Well, that's what I meant: do we assert this somewhere? If so, why check whether it's larger than 0?

I suppose this should also check for signbit:

// check imatrix for nans or infs
for (const auto & kv : *imatrix_data) {
    for (float f : kv.second) {
        if (!std::isfinite(f)) {
            throw std::runtime_error(format("imatrix contains non-finite value %f\n", f));
        }
    }
}

@compilade
Collaborator Author

compilade commented Aug 18, 2025

> [imatrix weights should never be negative] because it breaks other assumptions

> Well, that's what I meant, do we assert this somewhere? If so, why check if it's larger than 0.

Personally, I think using suml2 > 0.0f makes the intention of the constraints in the code clearer, even if in practice it's the same as suml2 != 0.0f.

> I suppose this should also check for signbit:
>
> // check imatrix for nans or infs
> for (const auto & kv : *imatrix_data) {
>     for (float f : kv.second) {
>         if (!std::isfinite(f)) {
>             throw std::runtime_error(format("imatrix contains non-finite value %f\n", f));
>         }
>     }
> }

Yes, this is a good idea. But I consider it out of scope for this PR, which is about handling problematic edge cases in the quantization functions (edge cases which have happened in the wild), not imatrix validation, even if the two are related. Still, it's a good idea, and I don't think any valid imatrix should/could have negative values.

Arguably, even 0 (or near-0) imatrix values should probably also be avoided, because they produce garbage and/or numerical instability (due to rounding towards 0 when squared).
That's kind of the goal of #15060 (at least with the initial implementation of adding a weighted 1 to the mean), except it doesn't handle all imatrix formats (though it could, to some extent).

@CISC CISC requested a review from ggerganov August 18, 2025 06:43
@CISC CISC merged commit f44f793 into master Aug 18, 2025
47 checks passed
@ikawrakow
Contributor

> Somewhat relatedly, I've been studying the i-quants implementation, and in IQ1_S and IQ1_M, -FLT_MIN being used as the initial best_score seems wrong.

> I'm pretty sure the intention was for the initial value of best_scores to be -FLT_MAX (aka -3.4028235e+38), which doesn't have the problem of rounding to zero (it can overflow to -INFINITY, but that's also a reasonable starting value, since it would also be smaller than any sumqx*sumqx).

Well, you are wrong. The correct initial value for best_score is zero, as initially written, and then broken with the "fix" in #7955 (and I know all too well such "fixes").

Here is some math, which will hopefully help to stop the "fixing".

We want to minimize $F$

$$F = \sum w_i (x_i - d q_i)^2\quad\quad\quad(1)$$

where $w_i$ are the importances, $x_i$ the original model weights, $q_i$ the quantized weights, and $d$ is the block scale. Given a set of $q_i$, the scale that minimizes $F$ is obtained by solving $\partial F/\partial d = 0$, which gives

$$d = \frac{\sum w_i x_i q_i}{\sum w_i q_i^2}$$

(assuming the denominator is not zero; if it is, there is no solution as any value of $d$ is as good as any other).

With that, we can now expand (1), plug in the optimum value of $d$, and get

$$F = \sum w_i x_i^2 - \frac{\left(\sum w_i x_i q_i\right)^2}{\sum w_i q_i^2}$$

The first term on the right-hand side is a constant. Hence, to minimize $F$, we need to pick the $q_i$ in such a way as to maximize the second term, which we call the "score" of the solution. Obviously any score that is less than zero (how would that even be possible?) would make $F$ increase compared to the situation where we do not quantize at all and simply set $d \equiv 0, q_i \equiv 0$. Hence, the best initial score we want is 0, not -FLT_MIN or -FLT_MAX, and we go from there. The edge cases we need to pay attention to are

  1. $\sum w_i |x_i| \equiv 0$ (or, if you prefer, $\sum w_i |x_i| < \varepsilon$ with some suitable $\varepsilon$)
  2. $\sum w_i q_i^2 \equiv 0$ (or, if you prefer, $\sum w_i q_i^2 < \varepsilon$ with some suitable $\varepsilon$)

1 can happen if all $x_i$ are zero (or very small), all $w_i$ are zero (or very small), or we got unlucky and it happens that $w_i \neq 0$ for $x_i = 0$, and vice versa.

2 can happen if all $w_i$ are zero (or very small), all $q_i$ are zero, or we got unlucky and $w_i \neq 0$ for $q_i = 0$, and vice versa.


Successfully merging this pull request may close these issues.

Misc. bug: Quantizing Olmo models with imatrix failing on some sizes