Is it possible to mix GPU & CPU RPC #17442

d-shehu · 2025-11-23T06:20:47Z

d-shehu
Nov 23, 2025

I had an earlier bug report I closed on this. My testing suggests it's impossible to mix CPU & GPU RPC backends. I have 2 machines each with a GPU.

I've tried many combinations of parameters in an effort to run llama3.1 405b and qwen3-coder 480b. I have enough memory to load both if using CPU + GPU but it crashes immediately.

If someone can clear this up I'd appreciate it. I've seen reddit posts showing it's possible. Thanks.

rgerganov · 2025-11-24T14:04:29Z

rgerganov
Nov 24, 2025
Collaborator

I am not able to reproduce this. For example I can run gpt-oss-20b with one RPC server using CUDA backend and another one using CPU backend.

I have enough memory to load both if using CPU + GPU but it crashes immediately.

What is crashing -- llama-cli or rpc-server? Please provide logs and more details about your environment.

0 replies

d-shehu · 2025-11-24T20:02:56Z

d-shehu
Nov 24, 2025
Author

CLI / server - both seg fault. I've tried different arguments per this guide. With llama-cli, compiled with CPU or CUDA.

Smaller & larger models: gemma3 27b
Changing --n-gpu-layers between 40-63 to balance memory
Tried offloading MOE layers. For qwen3 only this works:
```
-ot ".ffn_(up)_exps.=CPU" 
```
Same results with/without --mlock and --no-mmap
Tried -fa auto and also explicitly on/off.

If I mix backends CPU/GPU it always segment faults.

llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:823: Remote RPC server crashed or returned malformed response
/llama.cpp/build/cuda/bin/libggml-base.so.0(+0x1840b)[0x740e1a94640b]
/llama.cpp/build/cuda/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)[0x740e1a94686f]
/llama.cpp/build/cuda/bin/libggml-base.so.0(ggml_abort+0x152)[0x740e1a946a42]
/llama.cpp/build/cuda/bin/libggml-rpc.so.0(+0xfad6)[0x740e1a856ad6]
/llama.cpp/build/cuda/bin/libggml-base.so.0(ggml_backend_sched_graph_compute_async+0x81f)[0x740e1a962e5f]
/llama.cpp/build/cuda/bin/libllama.so.0(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x740e1a6a3581]
/llama.cpp/build/cuda/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x11d)[0x740e1a6a392d]
/llama.cpp/build/cuda/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x2ef)[0x740e1a6aa2af]
/llama.cpp/build/cuda/bin/libllama.so.0(llama_decode+0x10)[0x740e1a6ab150]
./llama-cli(+0x47ecd)[0x5a5661914ecd]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x740e1a0edd90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x740e1a0ede40]
./llama-cli(+0x4c2b5)[0x5a56619192b5]

If I use iGPU I can get it to use most of my system memory. It responds but halfway through locks up the machine.

With CPU only, it runs but it's glacial slow. If I stick to GPU rpc backends, I can run all smaller models fine include gpt-oss 120.

Server

./llama-cli --model <path>/models--unsloth--Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ4_XS/Qwen3-Coder-480B-A35B-Instruct-IQ4_XS-00001-of-00006.gguf \
        --ctx-size 4096 
        -t 16 
        --n-gpu-layers 55
        --rpc <addr>:<port>,<addr>:<port>
        --no-mmap 
        -fa auto 
        --no-warmup

RPC Nodes

Container 1

LLAMA_CACHE="$CACHE_DIR" ./llama.cpp/build/cpu/bin/rpc-server -c -t 12 -H 0.0.0.0 -p 26000

Container 2

LLAMA_CACHE="$CACHE_DIR" ./llama.cpp/build/cuda/bin/rpc-server -c -t 12 -H 0.0.0.0 -p 26001

0 replies

d-shehu · 2025-11-25T17:33:32Z

d-shehu
Nov 25, 2025
Author

If anyone is curious, here's how I got QWEN 3 480 4 bit to run at a usable speed. Obviously, it runs on CPU. One machine hosts llama-cli using nvidia GPU. The 2nd machine runs 2 rpc-nodes, one for the Radeon and the other for the iGPU to expose the 2nd machine's CPU and memory.

CPU utilization maxes on both machines while evaluating the prompt. Interestingly, with tensors cached, network speed as measured by bmon is at ~37-40 Mib / sec while evaluating the prompt. I'd expected it to be saturated as the network is otherwise idle. It's much less during loading at 1-10 Mib / sec.

llama_perf_sampler_print:    sampling time =     240.96 ms /  2446 runs   (    0.10 ms per token, 10150.98 tokens per second)
llama_perf_context_print:        load time =  392300.98 ms
llama_perf_context_print: prompt eval time =    3174.50 ms /    18 tokens (  176.36 ms per token,     5.67 tokens per second)
llama_perf_context_print:        eval time = 1119711.32 ms /  2427 runs   (  461.36 ms per token,     2.17 tokens per second)
llama_perf_context_print:       total time = 1551032.82 ms /  2445 tokens

Machines

- 7950x (16 core), 128GB RAM, Nvidia 3090 RTX
  llama-cli: CPU + Cuda
- 9900x, 128GB RAM, Radeon R9700 Pro
  rpc-server 1- RDNA iGPU
  rpc-server 2 - Radeon

Network: 2.5Gb

Info

Llama.cpp: revision 7054 (becc481)
Model: unsloth--Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ4_XS
Ubuntu 24.04 LTS
Drivers:
-- 570.195.03 CUDA Version: 12.8
-- amdgpu v6.16.6 (Vulkan RADV)
Test prompt: "Write me a C++ program to build an AVL tree"
iGPU limits overriden: amdgpu.gttsize=110592 amdttm.page_pool_size=27648000

llama-cli

./llama-cli \
                    --model <path> \
                    --ctx-size 4096 \
                    -t $numProcs \
                    --n-gpu-layers 52 \
                    --rpc <server2>:<port1>,<server2>:<port2> \
                    -dev CUDA0,RPC0,RPC1 \
                    --tensor-split 20,28,105 \
                    -ot ".ffn_(up)_exps.=CPU" \
                    --no-mmap \
                    -fa auto \
                    --no-warmup

rpc-server

LLAMA_CACHE="$CACHE_DIR" ./llama.cpp/build/vulkan/bin/rpc-server -c -t 12 -H 0.0.0.0 -p 26000

Memory Allocation

load_tensors: loading model tensors, this can take a while... (mmap = false)                                                                                                                                                                                                   
load_tensors: offloading 52 repeating layers to GPU                                                                                                                                                                                                                            
load_tensors: offloaded 52/63 layers to GPU                                                                                                                                                                                                                                    
load_tensors:          CPU model buffer size =   500.77 MiB                                                                                                                                                                                                                    
load_tensors:        CUDA0 model buffer size = 18988.27 MiB                                                                                                                                                                                                                    
load_tensors:    CUDA_Host model buffer size = 106906.41 MiB                                                                                                                                                                                                                   
load_tensors: RPC0[192.168.2.209:26000:26000] model buffer size = 94941.36 MiB                                                                                                                                                                                                 
load_tensors: RPC0[192.168.2.209:26002:26002] model buffer size = 27126.10 MiB

0 replies

d-shehu · 2025-11-25T23:08:16Z

d-shehu
Nov 25, 2025
Author

This release works again b7157.

However, when I try to list RPC nodes from my Mac (Metal) it doesn't work. Ubuntu to Ubuntu works fine.

/llama-server --list-devices --rpc <address>:<port>

Output: segmentation faul

I hope the other metal backend fix is merged soon.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Is it possible to mix GPU & CPU RPC #17442

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 4 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Is it possible to mix GPU & CPU RPC #17442

Uh oh!

Uh oh!

d-shehu Nov 23, 2025

Replies: 4 comments

Uh oh!

rgerganov Nov 24, 2025 Collaborator

Uh oh!

Uh oh!

d-shehu Nov 24, 2025 Author

Uh oh!

Uh oh!

d-shehu Nov 25, 2025 Author

Uh oh!

Uh oh!

d-shehu Nov 25, 2025 Author

d-shehu
Nov 23, 2025

rgerganov
Nov 24, 2025
Collaborator

d-shehu
Nov 24, 2025
Author

d-shehu
Nov 25, 2025
Author

d-shehu
Nov 25, 2025
Author