ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors #15379


Merged
merged 2 commits into from
Aug 18, 2025

Conversation

compilade
Collaborator

This is a follow-up to #11773, with the fix from #11773 (comment) (which should fix #11764 and #12439).

make_qp_quants doesn't guard against division by zero when its suml2 is zero while calculating the resulting scale, so I've added a check to avoid producing NaN scales that way. That function is used for Q2_K, Q4_K, Q5_K, and IQ2_XXS.

In the future, I'm hoping to replace make_qp_quants with a saner quantization function from #12557. But until that's ready, might as well prevent some known existing errors.

Somewhat relatedly, I've been studying the i-quants implementation, and in IQ1_S and IQ1_M, -FLT_MIN being used as the initial best_score seems wrong.

float best_score = -FLT_MIN, scale = max;
int besti1 = -1, besti2 = -1, best_shift = 0;
for (int i1 = 0; i1 <= block_size; ++i1) {
    for (int i2 = i1; i2 <= block_size; ++i2) {
        float sumqx = (sumx[i1] - sumx[0])*x_p[0] + (sumx[i2] - sumx[i1])*x_p[1] + (sumx[block_size] - sumx[i2])*x_p[2];
        float sumq2 = (sumw[i1] - sumw[0])*x_p[0]*x_p[0] + (sumw[i2] - sumw[i1])*x_p[1]*x_p[1] + (sumw[block_size] - sumw[i2])*x_p[2]*x_p[2];
        if (sumq2 > 0 && sumqx*sumqx > best_score*sumq2) {
            scale = sumqx/sumq2; best_score = scale*sumqx;
            besti1 = i1; besti2 = i2; best_shift = 1;
        }
        sumqx = (sumx[i1] - sumx[0])*x_m[0] + (sumx[i2] - sumx[i1])*x_m[1] + (sumx[block_size] - sumx[i2])*x_m[2];
        sumq2 = (sumw[i1] - sumw[0])*x_m[0]*x_m[0] + (sumw[i2] - sumw[i1])*x_m[1]*x_m[1] + (sumw[block_size] - sumw[i2])*x_m[2]*x_m[2];
        if (sumq2 > 0 && sumqx*sumqx > best_score*sumq2) {
            scale = sumqx/sumq2; best_score = scale*sumqx;
            besti1 = i1; besti2 = i2; best_shift = -1;
        }
    }
}
GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0);
GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0);

-FLT_MIN is -1.1754944e-38 (the negative normal value closest to zero).

The assertion can fail in the situation where sumq2 never exceeds 5.9604645e-08 while sumqx*sumqx rounds to zero (e.g. when sumqx is on the order of 1e-24): then -FLT_MIN * 5.9604645e-08 also underflows to zero, so the branch setting a new best_score is never taken.

I'm pretty sure the intention was for the initial value of best_score to be -FLT_MAX (i.e. -3.4028235e+38), which doesn't have the problem of rounding to zero (the product can overflow to -INFINITY, but that's also a reasonable starting value, since it would still be smaller than any sumqx*sumqx).

This should hopefully fix variants of #14788 and #6067.

cc @nicoboss, @bartowski1182, @schmorp since you've previously reported issues linked to this.



@compilade compilade added bugfix fixes an issue or bug Tensor Encoding Scheme https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes labels Aug 17, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Aug 17, 2025
@CISC
Collaborator

CISC commented Aug 17, 2025

> make_qp_quants doesn't check for divisions by zero with its suml2 when calculating the resulting scale, and so I've added a check to avoid producing NAN scales that way. That function is used for Q2_K, Q4_K, Q5_K, and IQ2_XXS.

See #7825 (which also proposes the same fix for make_q3_quants). :)

@CISC
Collaborator

CISC commented Aug 17, 2025

> Somewhat relatedly, I've been studying the i-quants implementation, and in IQ1_S and IQ1_M, -FLT_MIN being used as the initial best_score seems wrong.

This change was made in #7955

@compilade
Collaborator Author

compilade commented Aug 17, 2025

> See #7825 (which also proposes the same fix for make_q3_quants). :)

@CISC Should that PR be merged as well, or should I integrate the one-line change in make_q3_quants here?

Ah, so the -FLT_MIN problem with IQ1_S and IQ1_M comes from #7955, which attempted to fix that problem, but didn't pick the right constant. (EDIT: I see you've commented this reference at the same time as I typed this :)

@CISC
Collaborator

CISC commented Aug 17, 2025

> See #7825 (which also proposes the same fix for make_q3_quants). :)
>
> @CISC Should that PR be merged as well, or should I integrate the one-line change in make_q3_quants here?

Might as well integrate it here, and I'll close that one.

@CISC
Collaborator

CISC commented Aug 17, 2025

Also, should maybe suml2 be allowed to be negative, like here?

float scale = suml2 ? sumlx/suml2 : 0.0f;

@compilade
Collaborator Author

> Also, should maybe suml2 be allowed to be negative, like here?

I don't think so. suml2 can only be negative when the imatrix weights are negative, which should not happen because it breaks other assumptions.

It's a sum of squares, which should always be positive.

@CISC
Collaborator

CISC commented Aug 17, 2025

> Also, should maybe suml2 be allowed to be negative, like here?
>
> I don't think so. suml2 can only be negative when the imatrix weights are negative, which should not happen because it breaks other assumptions.
>
> It's a sum of squares, which should always be positive.

Well, that's what I meant: do we assert this somewhere? If so, why check whether it's larger than 0?

I suppose this should also check for signbit:

// check imatrix for nans or infs
for (const auto & kv : *imatrix_data) {
    for (float f : kv.second) {
        if (!std::isfinite(f)) {
            throw std::runtime_error(format("imatrix contains non-finite value %f\n", f));
        }
    }
}

@compilade
Collaborator Author

compilade commented Aug 18, 2025

> [imatrix weights should never be negative] because it breaks other assumptions

> Well, that's what I meant, do we assert this somewhere? If so, why check if it's larger than 0.

Personally, I think using suml2 > 0.0f makes the intention of the constraints in the code clearer, even if in practice it's the same as suml2 != 0.0f.

> I suppose this should also check for signbit:
>
> // check imatrix for nans or infs
> for (const auto & kv : *imatrix_data) {
>     for (float f : kv.second) {
>         if (!std::isfinite(f)) {
>             throw std::runtime_error(format("imatrix contains non-finite value %f\n", f));
>         }
>     }
> }

Yes, this is a good idea. But I consider it out of scope for this PR, which is about handling problematic edge cases in the quantization functions (edge cases which have happened in the wild), not imatrix validation, even if the two are related. Still, it's a good idea, and I don't think any valid imatrix should/could have negative values.

Arguably, even 0 (or near-0) imatrix values should probably also be avoided, because they produce garbage and/or numerical instability (due to rounding towards 0 when squared).
That's kind of the goal of #15060 (at least with the initial implementation of adding a weighted 1 to the mean), except it doesn't handle all imatrix formats (though it could, to some extent).

@CISC CISC requested a review from ggerganov August 18, 2025 06:43
@CISC CISC merged commit f44f793 into master Aug 18, 2025
47 checks passed
@ikawrakow
Contributor

> Somewhat relatedly, I've been studying the i-quants implementation, and in IQ1_S and IQ1_M, -FLT_MIN being used as the initial best_score seems wrong.

> I'm pretty sure the intention was for the initial value of best_scores to be -FLT_MAX (aka -3.4028235e+38), which doesn't have the problem of rounding to zero (it can overflow to -INFINITY, but that's also a reasonable starting value, since it would also be smaller than any sumqx*sumqx).

Well, you are wrong. The correct initial value for best_score is zero, as initially written, and then broken with the "fix" in #7955 (and I know all too well such "fixes").

Here is some math, which will hopefully help to stop the "fixing".

We want to minimize $F$

$$F = \sum w_i (x_i - d q_i)^2\quad\quad\quad(1)$$

where $w_i$ are the importances, $x_i$ the original model weights, $q_i$ the quantized weights, and $d$ is the block scale. Given a set of $q_i$, the scale that minimizes $F$ is obtained by solving $\partial F/\partial d = 0$, which gives

$$d = \frac{\sum w_i x_i q_i}{\sum w_i q_i^2}$$

(assuming the denominator is not zero; if it is, there is no solution as any value of $d$ is as good as any other).

With that, we can now expand (1), plug in the optimum value of $d$, and get

$$F = \sum w_i x_i^2 - \frac{\left(\sum w_i x_i q_i\right)^2}{\sum w_i q_i^2}$$

The first term on the right-hand side is a constant. Hence, to minimize $F$, we need to pick the $q_i$ in such a way as to maximize the second term, which we call the "score" of the solution. Obviously any score that is less than zero (how would that even be possible?) would make $F$ increase compared to the situation where we do not quantize at all and simply set $d \equiv 0, q_i \equiv 0$. Hence, the best initial score we want is 0, not -FLT_MIN or -FLT_MAX, and we go from there. The edge cases we need to pay attention to are

  1. $\sum w_i |x_i| \equiv 0$ (or, if you prefer, $\sum w_i |x_i| < \varepsilon$ with some suitable $\varepsilon$)
  2. $\sum w_i q_i^2 \equiv 0$ (or, if you prefer, $\sum w_i q_i^2 < \varepsilon$ with some suitable $\varepsilon$)

1 can happen if all $x_i$ are zero (or very small), all $w_i$ are zero (or very small), or we got unlucky and it happens that $w_i \neq 0$ for $x_i = 0$, and vice versa.

2 can happen if all $w_i$ are zero (or very small), all $q_i$ are zero, or we got unlucky and $w_i \neq 0$ for $q_i = 0$, and vice versa.


Successfully merging this pull request may close these issues.

Misc. bug: Quantizing Olmo models with imatrix failing on some sizes