503 from (probably) failed allocation after upgrade from 6.x to 7.7 #4408

@XANi

Description

Since updating from 6.x to 7.7.0 (Debian testing), we have been getting a spew of 503s that starts days after the cache has filled up.

default.vcl.txt

/usr/sbin/varnishd -j unix,user=vcache -F -a 0.0.0.0:80 -T localhost:6082 -f /etc/varnish/default.vcl -l 81920k -p thread_pool_min=80 -p thread_pool_max=4000 -p thread_pools=2 -p default_ttl=1209600 -p nuke_limit=1000 -p max_restarts=4 -S /etc/varnish/secret -s default=malloc,8569M -s mm=malloc,17139M

Logs look like this:

*   << Request  >> 583226464 
-   Begin          req 583226327 rxreq
-   Timestamp      Start: 1761649198.288374 0.000000 0.000000
-   Timestamp      Req: 1761649198.288374 0.000000 0.000000
-   VCL_use        boot
-   ReqStart       10.0.0.1 64710 a0
-   ReqMethod      GET
-   ReqURL         /video/abcd.mp4
-   ReqProtocol    HTTP/1.1
-   ReqHeader      host: example.com
-   ReqHeader      sec-ch-ua-platform: ...
-   ReqHeader      accept-encoding: ...
-   ReqHeader      user-agent: ...
-   ReqHeader      accept: */*
-   ReqHeader      range: bytes=44325-15139796
-   ReqHeader      if-range: "063fccff745ae4b743f8053ba65c0f7d-2"
-   ReqHeader      priority: i
-   ReqHeader      x-forwarded-for: 1.2.3.4
-   ReqUnset       x-forwarded-for: 1.2.3.4
-   ReqHeader      X-Forwarded-For: 1.2.3.4, 10.0.1.1
-   ReqHeader      Via: 1.1 cache3 (Varnish/7.7)
-   VCL_call       RECV
-   ReqUnset       accept-encoding: identity;q=1, *;q=0
-   ReqURL         /video/abcd.mp4
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       MISS
-   VCL_return     fetch
-   Link           bereq 583226465 fetch
-   Timestamp      Fetch: 1761649198.310064 0.021689 0.021689
-   RespProtocol   HTTP/1.1
-   RespStatus     503
-   RespReason     Service Unavailable
-   RespHeader     Date: Tue, 28 Oct 2025 10:59:58 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-Varnish: 583226464
-   VCL_call       SYNTH
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     Retry-After: 5
-   VCL_return     deliver
-   Timestamp      Process: 1761649198.310122 0.021748 0.000058
-   RespHeader     Content-Length: 283
-   Storage        malloc Transient
-   Filters        
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1761649198.310186 0.021811 0.000063
-   ReqAcct        711 0 711 213 283 496
-   End       
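
The transaction above is only the client side; its `Link` record points at the backend fetch (bereq), which is where the reason for the 503 would normally be logged. As a rough sketch for anyone triaging similar logs, a small helper (the name `bereq_vxids` is made up for illustration) can pull those backend VXIDs out of saved `varnishlog` output:

```shell
# Hypothetical helper: print the VXID of every backend request linked
# from the varnishlog text supplied on stdin.
# Matches records of the form: "-   Link           bereq <vxid> fetch"
bereq_vxids() {
    awk '$2 == "Link" && $3 == "bereq" { print $4 }'
}

# Example (assumes the log was saved to a file named 503.log):
#   bereq_vxids < 503.log
```

On a live instance, `varnishlog -d -g request -q 'RespStatus == 503'` should print each failing request grouped with its linked backend transaction, assuming those records are still in the shared memory log.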

When requests started failing, I noticed both a rise in allocation failures in the mm pool (which handles big objects) and a drop in LRU nukes:

Image

So far this only happens for big objects, and only after a few days of running with the pool full. We haven't run it long enough to see whether it also happens on the other pool.
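
For reference, the same signals can be read without a graphing stack. This is a minimal sketch that filters `varnishstat -1` output down to the counters most likely behind the graph: it assumes the plotted series were `SMA.mm.c_fail` and `MAIN.n_lru_nuked`, and adds `SMA.mm.g_space` and `MAIN.n_lru_limited` (nuke_limit reached) as further context. The function name `storage_pressure` is made up for illustration; `mm` is the storage name from the `-s mm=malloc,17139M` argument above.

```shell
# Hypothetical helper: keep only the allocation-failure and LRU counters
# from `varnishstat -1` text on stdin, printing "name value" per line.
storage_pressure() {
    awk '$1 == "SMA.mm.c_fail" ||
         $1 == "SMA.mm.g_space" ||
         $1 == "MAIN.n_lru_nuked" ||
         $1 == "MAIN.n_lru_limited" { print $1, $2 }'
}

# Live usage (assumes varnishstat can reach the running varnishd):
#   varnishstat -1 | storage_pressure
```

Sampling this periodically should show whether `c_fail` climbs while `n_lru_nuked` flattens, which is the pattern described above.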

Steps to Reproduce (for bugs)

No idea. We run it as a cache in front of S3, serving files ranging from small CSS to video.

Varnish Cache version

7.7.0 (Debian testing)

Operating system

No response
