-
Notifications
You must be signed in to change notification settings - Fork 15.5k
Description
Summary
Clang keeps a large fixed-size local array in the stack frame of a recursive function alive across recursive calls, even though the array is only used after both recursive calls return and its address does not escape into them.
This leads to stack usage of roughly depth * sizeof(array) instead of something close to sizeof(array) + O(depth) for a classic mergesort implementation, and seems to be a missed optimization in Clang/LLVM's lifetime analysis / stack allocation.
Environment
- Clang version: 21.1.0
- Platform: x86_64 (as used by Compiler Explorer / godbolt.org)
- Command line used on Compiler Explorer:
clang++ -std=c++23 -O3 -S mgsort.cpp -o mgsort.s(On Compiler Explorer this corresponds to choosing Clang and passing -std=c++23 -O3.)
- Reproducer link (Compiler Explorer): https://godbolt.org/z/9bG6jGqjj
Test case
#include <algorithm>
#include <cstring>
constexpr int N = 100000 + 1;
int a[N];
void mgsort(int l, int r) {
if (l == r) {
return ;
}
int mid = (l + r) >> 1;
mgsort(l, mid);
mgsort(mid + 1, r);
int b[N];
std::merge(a + l, a + mid + 1, a + mid + 1, a + r + 1, b + l);
std::memcpy(a + l, b + l, sizeof(int) * (r - l + 1));
}This is intended to be standard C++ (no VLA).
Actual behavior
For non‑leaf calls, Clang allocates the full int b[N] array in the prologue of mgsort before making any recursive calls and keeps this stack space reserved until the function returns.
On x86_64 with -O3, the generated assembly looks like (simplified):
mgsort(int, int):
mov eax, esi
sub eax, edi
je .LBB0_14 # base case: l == r
push rbp
push r15
push r14
push r13
push r12
push rbx
sub rsp, 400024 # reserve ~400k of stack for b[N]
... compute mid, save l / r ...
call mgsort # recursive call on left half
...
call mgsort # recursive call on right half
... later, compute the address of b + l on the stack ...
lea r14, [rsp + 4*rax]
add r14, 16 # r14 = b + l (b is based at rsp+16)
... merge + memcpy ...
add rsp, 400024
pop rbx
pop r12
pop r13
pop r14
pop r15
pop rbp
.LBB0_14:
retSo every recursive frame reserves a full b[N] array on its stack frame, and this space is live across the recursive calls. With N = 100000 + 1, sizeof(b) is about 400 kB; the recursion depth is O(log N) (around 17 here), so the total stack consumed by this single variable is roughly 17 * 400kB ≈ 6–7 MB.
However, b is only accessed after both recursive calls have returned, and there is no way for a recursive call to access its caller's b.
Expected behavior
Since the first use of b is strictly after both recursive calls, and no pointer to b is passed into those calls (nor stored anywhere that
they could access), it should be legal for the compiler to shrink the allocation lifetime of b so that its storage does not overlap the recursive calls.
Conceptually, the compiler could:
- move the actual stack allocation for
bfrom the function prologue to just before thestd::merge/memcpycode, and - free it immediately after
memcpyreturns.
With such an optimization, at run time there would only be a single active instance of b on the stack at any time, reducing stack usage from approximately depth * sizeof(b) to about sizeof(b) + O(depth).
This appears to be a missed optimization in Clang/LLVM's handling of large fixed-size local arrays in recursive functions.
Additional notes
-
The issue reproduces for me on Compiler Explorer (see the link above) with
-std=c++23 -O3. It also appears with other high optimization levels (e.g.-O3vs-Ofast) where stack allocation is similar. -
A related GCC bug report with the same test case is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123082
In that report, GCC developers classify this as a middle-end missed-optimization and suggest a workaround using a GNU C++ VLA extension. The intention here is similar: to see whether LLVM/Clang could perform a comparable lifetime-shrinking optimization for the standard C++ fixed-array case, without relying on extensions.