
Conversation

@timholy (Contributor) commented Jan 28, 2024

secant2 calls `update`, and `update` might switch to bisection in the U3 step (as named in the initial HZ publication). If we did bisect, the line-search "are we making enough progress?" check (step L2) is nonsensical (we might have shrunk the interval over multiple iterations of bisection, but that is not an indication that the secant model was "working").

Consequently, this reports back whether bisection was engaged in `update`, and if so skips any kind of convergence assessment and does another iteration.

Fixes #173. Sadly there isn't a very portable test.

@pkofod (Member) commented Jan 29, 2024

Thanks @timholy, really appreciate it. I'll try to find time to review it later today.

@mateuszbaran (Contributor):

Thank you! This really turned out to be more involved than I initially expected.

@mateuszbaran (Contributor):

Bump?

@pkofod (Member) commented Feb 12, 2024

Okay, this looks good to me. @mateuszbaran, were you able to check whether this solved your original problem?

@mateuszbaran (Contributor):

I've just checked. We've solved the original problem by making L-BFGS in Manopt more robust to line searches returning stepsize 0, so it's harder to reproduce the true original problem, but I can still generate problems for which the sub-optimal alpha is selected in the "It's so flat, secant didn't do anything useful, time to quit" branch.

@mateuszbaran (Contributor):

This script can easily generate cases where such a choice is made:

using Revise
using Manopt
using Manifolds
using LineSearches
using LinearAlgebra

norm_inf(M::AbstractManifold, p, X) = norm(X, Inf)

function f_rosenbrock(x)
    result = 0.0
    for i in 1:2:length(x)
        result += (1.0 - x[i])^2 + 100.0 * (x[i + 1] - x[i]^2)^2
    end
    return result
end
function f_rosenbrock(::AbstractManifold, x)
    return f_rosenbrock(x)
end

function g_rosenbrock!(storage, x)
    for i in 1:2:length(x)
        storage[i] = -2.0 * (1.0 - x[i]) - 400.0 * (x[i + 1] - x[i]^2) * x[i]
        storage[i + 1] = 200.0 * (x[i + 1] - x[i]^2)
    end
    return storage
end

function g_rosenbrock!(M::AbstractManifold, storage, x)
    g_rosenbrock!(storage, x)
    riemannian_gradient!(M, storage, x, storage)
    return storage
end

function test_case_manopt()
    for i in 2:2:1000
        N = 5000+i
        mem_len = rand(4:10)
        println("i = $i, mem_len=$mem_len")
        M = Manifolds.Sphere(N - 1; parameter=:field)
        ls_hz = LineSearches.HagerZhang()

        x0 = zeros(N)
        x0[1] = 1
        manopt_sc = StopWhenGradientNormLess(1e-6; norm=norm_inf) | StopAfterIteration(1000)

        quasi_Newton(
            M,
            f_rosenbrock,
            g_rosenbrock!,
            x0;
            stepsize=Manopt.LineSearchesStepsize(ls_hz),
            #stepsize=HagerZhangLinesearch(M),
            evaluation=InplaceEvaluation(),
            vector_transport_method=ParallelTransport(),
            return_state=true,
            memory_size=mem_len,
            stopping_criterion=manopt_sc,
        )
    end
end
test_case_manopt()

You can then add an `if values[ib] < values[iA] || values[iB] < values[iA]` check in HZ to see whether the chosen alpha is indeed suboptimal.
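
For reference, a hypothetical helper expressing that check (the index names mirror the bookkeeping inside the LineSearches.jl Hager-Zhang code discussed later in this thread; this is not code from the package):

# Returns true when the value at the chosen index iA is worse than another
# sampled value, i.e. the alpha about to be returned is suboptimal.
function suboptimal_choice(values, ib, iA, iB)
    return values[ib] < values[iA] || values[iB] < values[iA]
end

Calling something like this just before the search returns, and warning when it is true, makes the suboptimal choices easy to spot.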

@pkofod (Member) commented Feb 12, 2024

I've just checked. We've solved the original problem by making L-BFGS in Manopt more robust to line searches returning stepsize 0, so it's harder to reproduce the true original problem, but I can still generate problems for which the sub-optimal alpha is selected in the "It's so flat, secant didn't do anything useful, time to quit" branch.

Have you decided to reset the approximation in that case?

@mateuszbaran (Contributor):

Currently it just goes in the direction of the negative gradient, but the memory isn't purged. Maybe it would work better with the memory purged; I just didn't check how to do that safely in Manopt, and it works OK without it.

@pkofod (Member) commented Feb 12, 2024

Currently it just goes in the direction of the negative gradient, but the memory isn't purged. Maybe it would work better with the memory purged; I just didn't check how to do that safely in Manopt, and it works OK without it.

Okay. I had thought to change Optim to reset in this case, which is why I asked :)

@mateuszbaran (Contributor):

I had a chat about it with Roland Herzog some time ago and he suggested using truncated CG in such cases, but restarting is also a valid option.

@pkofod (Member) commented Feb 14, 2024

@timholy do you have any thoughts on the above comments by @mateuszbaran ?

@mateuszbaran (Contributor):

I've checked the paper and it doesn't have the termination condition corresponding to "It's so flat, secant didn't do anything useful, time to quit" that is present in the LineSearches.jl implementation. I don't see what the justification for it is. Maybe it should be modified?

@mateuszbaran (Contributor):

Bump?

@pkofod (Member) commented Mar 23, 2024

I've checked the paper and it doesn't have the termination condition corresponding to "It's so flat, secant didn't do anything useful, time to quit" that is present in the LineSearches.jl implementation. I don't see what the justification for it is. Maybe it should be modified?

I think Tim may have added that. I suppose you're saying that maybe it's failing here partially because the branch doesn't belong?

@mateuszbaran (Contributor):

There is clearly a difference between how it's commented and what it actually does. It's possible that the termination condition actually catches some corner case and is a good idea in some situations, but currently it also catches things it shouldn't (including in this PR). I don't know what corner case it was supposed to handle, so I don't know how to fix it properly.

Anyway, there is always that simple change I've done that works well enough for me and shouldn't really harm any use case.

There is already at least one person (other than me and Ronny) who had issues because my fixed HZ linesearch is what works best for his problem, but he would like to use it in Pluto, which doesn't like non-registered packages. So it would be nice to have a good HZ linesearch in a registered package.

@timholy (Contributor, Author) commented Mar 25, 2024

Sorry I haven't had time for "general"/"charitable" development lately. My busy schedule is likely to persist another couple of weeks at least. If anyone wants to take this effort over, by all means go for it.

A couple of high-level thoughts:

  • the Julia implementation of the HZ linesearch should be in this package, and it should be good. It's important to fix this and release a new version.
  • this episode is a reminder that in some ways, the test suite is more valuable than the implementation. My extensions to the published algorithm were almost certainly motivated by real-world cases, but those are lost to the mists of time. My inability to reconstruct those motivations is at risk of paralyzing any decision here. If push comes to shove, it might be better to revert to the published algorithm and force me to deal with the consequences.
  • Likewise, the issues uncovered by @mateuszbaran depend on a complicated software stack that may be more than one wants to depend on for a test suite: updates to a large ecosystem have a way of making such things fragile. So we're at risk of losing yet more important test cases if we don't do something.
  • To me, it seems that the most important step is to develop a testing layer where we capture the outcome of the one-dimensional problem, as that removes the dependence on the "real" problem and allows us to create small tests without a large dependency stack. The idea would be to grab the record of the phi and dphi values at the alpha values chosen by the algorithm, and use interpolation to match both the values and slopes at those alpha values. (IIRC Lagrange polynomial interpolation can be readily generalized to match both values and derivatives.) Then we could have a test suite that consists of vectors of alpha, phi, and dphi values for all the "bad" cases anyone reports; a sketch of such a recording layer follows at the end of this comment.
  • Failing to do this now, while we have real-world failing test cases, just perpetuates our inability to keep this package at the forefront of Julia implementations of line search methods, which would be a giant waste.

Thus if someone wants to help move this forward, perhaps the best thing to do is focus on developing that test harness.
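
For concreteness, a minimal sketch of such a recording layer (hypothetical names, not part of this PR): wrap the 1-D objective so that every (alpha, phi, dphi) evaluation is captured; the captured vectors can then be turned into a value-and-slope-matching interpolant that reproduces the line-search behavior without the original problem.

struct RecordingPhiDPhi{F}
    phidphi::F                # callable alpha -> (phi(alpha), dphi(alpha))
    alphas::Vector{Float64}
    values::Vector{Float64}
    slopes::Vector{Float64}
end
RecordingPhiDPhi(phidphi) = RecordingPhiDPhi(phidphi, Float64[], Float64[], Float64[])

function (r::RecordingPhiDPhi)(alpha)
    val, slope = r.phidphi(alpha)
    push!(r.alphas, alpha)
    push!(r.values, val)
    push!(r.slopes, slope)
    return val, slope
end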

@mateuszbaran (Contributor):

Thank you, your observations are very on-point. To push this forward I can prepare a simplified test case, either this week or the next one. Anyway, I'm not a maintainer of this package so I can't decide how to fix this issue.

@mateuszbaran (Contributor) commented Apr 8, 2024

I have made a self-contained example based on cubic interpolation. It somehow doesn't reproduce the original problem -- I will look for the right test values but that's all I could do in the time I've found so far.

struct LineSearchTestCase
    alphas::Vector{Float64}
    values::Vector{Float64}
    slopes::Vector{Float64}
end


function prepare_test_case(; alphas, values, slopes)
    perm = sortperm(alphas)
    alphas = alphas[perm]
    push!(alphas, alphas[end]+1)
    values = values[perm]
    push!(values, values[end])
    slopes = slopes[perm]
    push!(slopes, 0.0)
    return LineSearchTestCase(alphas, values, slopes)
end

tc1 = prepare_test_case(;
    alphas = [0.0, 1.0, 5.0, 3.541670844449739],
    values = [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876],
    slopes = [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057]
)

function tc_to_f(tc)
    function f(x)
        i = findfirst(u -> u > x, tc.alphas) - 1
        xk = tc.alphas[i]
        xkp1 = tc.alphas[i + 1]
        dx = xkp1 - xk
        t = (x - xk) / dx
        h00t = 2t^3 - 3t^2 + 1
        h10t = t * (1 - t)^2
        h01t = t^2 * (3 - 2t)
        h11t = t^2 * (t - 1)
        val =
            h00t * tc.values[i] +
            h10t * dx * tc.slopes[i] +
            h01t * tc.values[i + 1] +
            h11t * dx * tc.slopes[i + 1]

        return val
    end
end

function tc_to_fdf(tc)
    function fdf(x)
        i = findfirst(u -> u > x, tc.alphas) - 1
        xk = tc.alphas[i]
        xkp1 = tc.alphas[i + 1]
        dx = xkp1 - xk
        t = (x - xk) / dx
        h00t = 2t^3 - 3t^2 + 1
        h10t = t * (1 - t)^2
        h01t = t^2 * (3 - 2t)
        h11t = t^2 * (t - 1)
        val =
            h00t * tc.values[i] +
            h10t * dx * tc.slopes[i] +
            h01t * tc.values[i + 1] +
            h11t * dx * tc.slopes[i + 1]

        h00tp = 6t^2 - 6t
        h10tp = 3t^2 - 4t + 1
        h01tp = -6t^2 + 6 * t
        h11tp = 3t^2 - 2t
        slope =
            (
                h00tp * tc.values[i] +
                h10tp * dx * tc.slopes[i] +
                h01tp * tc.values[i + 1] +
                h11tp * dx * tc.slopes[i + 1]
            ) / dx
        println(x, " ", val, " ", slope)
        return val, slope
    end
end

using LineSearches

function test_tc(tc)
    hz = HagerZhang()
    f = tc_to_f(tc)
    fdf = tc_to_fdf(tc)
    hz(f, fdf, 1.0, fdf(0.0)...)
end

test_tc(tc1)

EDIT: updated testing code; the best value encountered during the search is 2891.4462095232184, but HZ returns the point at which the objective equals 3000.9760725116876.
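
For a quick check of that claim (a sketch assuming the definitions above; the HagerZhang callable returns the pair (alpha, phi(alpha))):

alpha_star, phi_star = test_tc(tc1)
@show alpha_star phi_star        # alpha ≈ 3.5417, phi ≈ 3000.976
@show minimum(tc1.values)        # best sampled value: 2891.446...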

@mateuszbaran (Contributor):

I've updated my example; now there is a clear, fully self-contained case of a suboptimal returned value.

@pkofod (Member) commented Apr 12, 2024

Thank you so much Mateusz! I hope to have a little more time next week than in the past months :)

@pkofod (Member) commented Apr 18, 2024

Not forgotten! I will use your test-case to look at this next week.

@pkofod (Member) commented Apr 30, 2024

I've updated my example, now there is a clear, fully self-contained case of sub-optimal returned value.

You have been patient :) I had some deadlines that were quite important. I fixed some Optim.jl stuff yesterday evening, and I will try to look at this when I'm off work.

@kbarros commented May 14, 2024

As per #175, I encountered another issue where the flatness check causes early termination at a point that is not close to stationary, which does not seem to be resolved by this PR.

The problematic sequence of ϕ and dϕ data is:

cs = [0, 0.2, 0.1, 0.055223623837016025]
ϕs = [3.042968312396456, 3.117411287254072, -3.503584823341612, 0.5244246783084083]
dϕs = [-832.4270136930788, -505.33622492291033, 674.947830358913, 738.3388472434362]

This bug took a while for me to diagnose (typical users will not suspect a problem in LineSearches). Maybe as a band-aid we can add a check inside that flatness check -- if the norm of the gradient does not seem small, print a warning for the user that there is a known bug in this part of LineSearches?
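
A hypothetical sketch of such a band-aid, using the slope dphi (the quantity available inside the line search) as a proxy for the gradient size; the name and threshold are illustrative, not the package's internals:

# Warn if the "flatness" branch is about to end the search even though the
# slope at the candidate alpha is clearly nonzero.
function warn_if_not_actually_flat(dphi_c::T; tol=sqrt(eps(T))) where {T<:AbstractFloat}
    if abs(dphi_c) > tol
        @warn "HagerZhang flatness check triggered, but the slope is not small; this may be a known issue" dphi_c
    end
    return nothing
end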

My original function is oscillatory, but like @timholy says above, one can just as well create a polynomial giving the same data:

The idea would be to grab the record of the phi and dphi values at the alpha values chosen by the algorithm, and use interpolation to match both the values and slopes at those alpha values. (IIRC Lagrange polynomial interpolation can be readily generalized to match both values and derivatives.)

I created the package HermiteInterpolation for exactly this purpose (submitted for registration); it can be used like this:

using HermiteInterpolation: fit
f = fit(cs, ϕs, dϕs)
@assert f.(cs) ≈ ϕs

using DynamicPolynomials
@polyvar c

println(f(c))
# 3.042968312396456 - 832.4270136930788*c - 132807.15591801773*c^2 + 7.915421661743959e6*c^3 - 1.570284840040962e8*c^4 + 1.4221708747294645e9*c^5 - 5.970065591205604e9*c^6 + 9.405512899903618e9*c^7

println(differentiate(f(c), c))
# -832.4270136930788 - 265614.31183603546*c + 2.3746264985231876e7*c^2 - 6.281139360163848e8*c^3 + 7.110854373647323e9*c^4 - 3.582039354723362e10*c^5 + 6.5838590299325325e10*c^6

Incidentally, as a temporary workaround for my optimization problem, I came up with this sequence:

    method = Optim.LBFGS(; linesearch=Optim.LineSearches.BackTracking(order=2))
    res0 = Optim.optimize(f, g!, collect(reinterpret(Float64, params)), method, options)
    res = Optim.optimize(f, g!, Optim.minimizer(res0), Optim.ConjugateGradient(), options)

LBFGS with the BackTracking line search seems to converge robustly, and once it's near the local minimum, ConjugateGradient can effectively fine-tune the minimizer.

@kbarros commented May 18, 2024

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

# The minimizer is x0=[0, 2πn/100], with f(x0) = 1. Any integer n is fine.
function f(x)
    return (x[1]^2 + 1) * (2 - cos(100*x[2]))
end

using Optim

function test_converges(method)
    for i in 1:100
        r = randn(2)
        res = optimize(f, r, method)
        if Optim.converged(res) && minimum(res) > f([0,0]) + 1e-8
            println("""
                Incorrectly reported convergence after $(res.iterations) iterations
                Reached x = $(Optim.minimizer(res)) with f(x) = $(minimum(res))
                """)
        end
    end
end

# Works successfully, no printed output
test_converges(LBFGS(; linesearch=Optim.LineSearches.BackTracking(order=2)))

# Prints ~10 failures to converge (in 100 tries). Frequently fails after the
# first line search.
test_converges(ConjugateGradient())

@kbarros commented Aug 20, 2024

Gentle ping -- any hope for tightening up the HZ "flatness check"? The randomized optimization problem above should work pretty robustly as a unit test.

@timholy (Contributor, Author) commented Aug 29, 2024

I'll get to this in another week or two.

timholy added a commit that referenced this pull request Oct 22, 2024
This may cause problems of its own, but for the time being it's better
than the status quo.

Fixes #173
Closes #174
Fixes #175
@pkofod (Member) commented Mar 26, 2025

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

I think this may highlight the same issue as in #174 (comment)

Here is an optimization problem that highlights a fundamental problem with the current flatness check:

With my own simple branch made for checking this I am able to run

julia> test_converges(ConjugateGradient(; linesearch=Optim.LineSearches.HagerZhang(check_flatness=false)))

julia> test_converges(ConjugateGradient(; linesearch=Optim.LineSearches.HagerZhang(check_flatness=true)))
Incorrectly reported convergence after 1 iterations
Reached x = [0.16013274827601717, 0.35097409719940265] with f(x) = 2.9310451880340986

Incorrectly reported convergence after 1 iterations
Reached x = [0.0507085888328239, 0.1084483256815114] with f(x) = 2.1557003490556643

Incorrectly reported convergence after 2 iterations
Reached x = [0.7211228462970422, -22.3471576683635] with f(x) = 3.8050518641095663

Incorrectly reported convergence after 1 iterations
Reached x = [-1.2589697390692955, -1.0403076610453794] with f(x) = 7.5909348717323635

Incorrectly reported convergence after 1 iterations
Reached x = [0.08409600277401064, 0.10725179410260108] with f(x) = 2.2831453152517023

Incorrectly reported convergence after 2 iterations
Reached x = [0.9010705241444945, 0.7330566636590027] with f(x) = 4.526934892232607

Incorrectly reported convergence after 2 iterations
Reached x = [-0.27240989049521974, 11.156588470976192] with f(x) = 3.1405412374583666

Incorrectly reported convergence after 2 iterations
Reached x = [0.4717492439353791, -15.742921266415511] with f(x) = 3.5917484672484252

Incorrectly reported convergence after 1 iterations
Reached x = [-1.4143709712675014, 0.3441070259740454] with f(x) = 8.969056311635578

Incorrectly reported convergence after 1 iterations
Reached x = [0.29458492668827274, 1.1477725568576247] with f(x) = 2.291918955272849

Incorrectly reported convergence after 1 iterations
Reached x = [-0.19214192623076573, -0.14209344321079873] with f(x) = 2.1486141359671405

Incorrectly reported convergence after 4 iterations
Reached x = [-0.21665091042117182, 43.27319699842477] with f(x) = 2.3266172361393616

Incorrectly reported convergence after 1 iterations
Reached x = [-0.6789087745560297, 0.5133014956972761] with f(x) = 2.213588359372836

and running @mateuszbaran's problem here #174 (comment), I get

julia> res, res_cache = test_tc(tc1, true)
0.0 3003.592409634743 -22332.321416890798
1.0 2962.0378569864743 -20423.214551925797
5.0 2891.4462095232184 11718.185026267562
3.541670844449739 3000.9760725116876 -22286.821227217057
((3.541670844449739, 3000.9760725116876), LineSearchCache{Float64}([0.0, 1.0, 5.0, 3.541670844449739], [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876], [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057]))

julia> @test_broken minimum(res_cache.values) == res[2]

Test Broken
  Expression: minimum(res_cache.values) == res[2]

julia> res2, res_cache2 = test_tc(tc1, false)
0.0 3003.592409634743 -22332.321416890798
1.0 2962.0378569864743 -20423.214551925797
5.0 2891.4462095232184 11718.185026267562
3.541670844449739 3000.9760725116876 -22286.821227217057
4.49745720537289 -2137.605778336924 7059.04612357025
((4.49745720537289, -2137.605778336924), LineSearchCache{Float64}([0.0, 1.0, 5.0, 3.541670844449739, 4.49745720537289], [3003.592409634743, 2962.0378569864743, 2891.4462095232184, 3000.9760725116876, -2137.605778336924], [-22332.321416890798, -20423.214551925797, 11718.185026267562, -22286.821227217057, 7059.04612357025]))

julia> @test minimum(res_cache2.values) == res2[2]
Test Passed

with this plot of the problematic case

[plot of the interpolated test objective over alpha, with the sampled points marked]

So based on the plot and the order of alphas above, we see that we first check 1, which has a downward slope, then 5, which has an upward slope. We additionally come up with 3.54..., which has a downward slope, and then we get a bracket [3.54..., 5.0] that must contain a minimum. But nothing here suggests that the function should be flat between 3.54... and 5.0. The flatness check ends up testing the unfortunate condition

nextfloat(3000.9760725116876) >= 2891.4462095232184

in other words, A is 3.54... and B = 5.0, so B > A as the assertion expects, but why would

nextfloat(values[ia]) >= values[ib] && nextfloat(values[iA]) >= values[iB]

then be an appropriate test for flatness when nextfloat(values[iA]) >= values[iB] is satisfied simply because values[iA] > values[iB]? It seems to expect that values[iA] == values[iB] or values[iA] < values[iB], or am I misunderstanding something in this flatness check?

@timholy (Contributor, Author) commented Apr 9, 2025

Yes, I think it's making some assumptions about what those indices mean. It's basically expecting that it narrowed the interval and is still within epsilon of flat. We could probably add some additional sanity checks and maybe it would happen less? But your red dots, if they correspond to ia, ib, iA, and iB, indicate that it's incorrect about the meaning of those indices.

@pkofod (Member) commented Apr 9, 2025

I think (ia, ib, iA, iB) = (0, 1, 3.54..., 5.0) above but I have to go back and print :)

@pkofod (Member) commented Apr 10, 2025

I tried modifying the "flatness test" to check closeness rather than inequality and I got

julia> test_tc(tc1)
0.0 3003.592409634743 -22332.321416890798
1.0 2962.0378569864743 -20423.214551925797
(ia, ib) = (1, 2)
5.0 2891.4462095232184 11718.185026267562
3.541670844449739 3000.9760725116876 -22286.821227217057
4.49745720537289 -2137.605778336924 7059.04612357025
(4.49745720537289, -2137.605778336924)

which seems to be a much better point for @mateuszbaran's problem!
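
For reference, a hypothetical form of "closeness rather than inequality" (illustrative only; the index names mirror the flatness-check snippet quoted above, and this is not necessarily the actual branch):

# Treat the bracket as flat only when the sampled values nearly coincide,
# rather than relying on the one-sided nextfloat(...) >= ... comparisons.
function bracket_is_flat(values, ia, ib, iA, iB; rtol=sqrt(eps(eltype(values))))
    return isapprox(values[ia], values[ib]; rtol=rtol) &&
           isapprox(values[iA], values[iB]; rtol=rtol)
end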

@kbarros commented May 22, 2025

The experiment in #182 suggests that simply disabling the flatness check would be an improvement in behavior. Is there a blocker for merging? Perhaps some tests can be included from #180.

@pkofod (Member) commented Jun 8, 2025

The experiment in #182 suggests that simply disabling the flatness check would be an improvement in behavior. Is there a blocker for merging? Perhaps some tests can be included from #180.

I was a bit busy for the past month. I will apply the change with an escape hatch to get the old behavior and we can improve from there.

@kbarros commented Jun 8, 2025

Hi @pkofod, great news that you're pushing this forward! This change may break certain tests in Sunny.jl. If you're not doing a major version bump of this package and Optim.jl, would you mind coordinating here on the package releases so that I can ensure a smooth transition? Another possibility is to make the escape hatch go the other direction -- have the changes be "opt-in" for some transition period?

@pkofod (Member) commented Jun 8, 2025

Hi @pkofod, great news that you're pushing this forward! This change may break certain tests in Sunny.jl. If you're not doing a major version bump of this package and Optim.jl, would you mind coordinating here on the package releases so that I can ensure a smooth transition? Another possibility is to make the escape hatch go the other direction -- have the changes be "opt-in" for some transition period?

We could introduce it as a "non-change", but the change is made to fix a "bug" or "buggy" behavior, so technically I think it should be fine. The current behavior seems to have done more harm than good, so I am comfortable doing it in a feature release as a fix plus a feature to return to the old behavior. Semver says you need to tag a breaking change if users have to change their code in the new version, but not if that's because a bug was fixed.

I can tag it first with the feature but without changing the default, and then change the default later, but it shouldn't really matter, I think. You should be able to put bounds in your Project.toml to avoid getting the new version.

Can you open an issue on your repo to describe what the issue is exactly? I will be happy to wait with the tag until we can resolve this.

Successfully merging this pull request may close the issue: Suboptimal choice of alpha in Hager-Zhang.