ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2448.00 MiB on device 0: cudaMalloc failed: out of memory

You ran out of memory on the GPU. If you want allocations to spill over into system memory, you need to set the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable, for example:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server -m /home/user/Documents/ik_llama.cpp/models/GLM-4.6-smol-IQ2_KS-00001-of-00003.gguf  --alias GLM-4.6-IQ2_KS  --ctx-size 32768  --n-gpu-layers 99  -ot exps=CPU  -fa 1 -ub 4096 -b 4096  --threads 8 --host 127.0.0.1  --port 8080 -cram -1
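If you would rather not prefix every invocation, the variable can also be exported once for the shell session before starting the server. A minimal sketch, reusing the same binary and model path as the command above (adjust the remaining flags to your setup):

# Enable CUDA unified memory for all llama-server runs in this shell
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
./build/bin/llama-server -m /home/user/Documents/ik_llama.cpp/models/GLM-4.6-smol-IQ2_KS-00001-of-00003.gguf --ctx-size 32768 --n-gpu-layers 99 -ot exps=CPU

Note that unified memory lets the allocation succeed by paging to system RAM, so generation will be slower than a fully on-GPU run.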
