Replies: 4 comments
-
I am not able to reproduce this. For example, I can run gpt-oss-20b with one RPC server using the CUDA backend and another one using the CPU backend.
What is crashing exactly?
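For reference, a minimal sketch of that kind of two-backend setup. The `rpc-server` tool, the `--rpc` flag, and `-ngl` come from llama.cpp's RPC docs; the build flags, addresses, and ports below are assumptions for illustration:

```
# Machine A: rpc-server from a CUDA build (cmake -DGGML_RPC=ON -DGGML_CUDA=ON)
./rpc-server --host 0.0.0.0 --port 50052

# Machine B: rpc-server from a CPU-only build (cmake -DGGML_RPC=ON)
./rpc-server --host 0.0.0.0 --port 50052

# Client machine: offload layers across both workers
./llama-cli -m gpt-oss-20b.gguf -ngl 99 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -p "Hello"
```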
-
CLI / server: both segfault. I've tried different arguments per this guide, with llama-cli compiled for CPU or CUDA.
If I mix CPU and GPU backends it always segfaults. If I use the iGPU I can get it to use most of my system memory; it responds, but halfway through it locks up the machine. With CPU only it runs, but it's glacially slow. If I stick to GPU RPC backends, I can run all the smaller models fine, including gpt-oss 120b.
(Collapsed log sections: Server, RPC Nodes, Container 1, Container 2)
-
If anyone is curious, here's how I got Qwen3 480B 4-bit to run at a usable speed. Obviously, it runs on the CPU. One machine hosts llama-cli using an NVIDIA GPU. The second machine runs 2 RPC nodes, one for the Radeon and the other for the iGPU, to expose the second machine's CPU and memory. CPU utilization maxes out on both machines while evaluating the prompt. Interestingly, with tensors cached, network throughput as measured by bmon is ~37-40 MiB/s while evaluating the prompt; I'd expected it to be saturated, since the network is otherwise idle. It's much lower during loading, at 1-10 MiB/s.
(Collapsed detail sections: Machines, Network: 2.5Gb, Info, llama-cli, rpc-server, Memory Allocation)
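A sketch of what that topology might look like as commands. The addresses, ports, model filename, and the choice of Vulkan for the Radeon/iGPU workers are assumptions, as is per-instance device selection (recent rpc-server builds accept a device argument; check `rpc-server --help` on your build):

```
# Machine B (Radeon + iGPU): two rpc-server instances on separate ports,
# e.g. from a Vulkan build; the device names here are hypothetical
./rpc-server --host 0.0.0.0 --port 50052 --device Vulkan0
./rpc-server --host 0.0.0.0 --port 50053 --device Vulkan1

# Machine A (NVIDIA): llama-cli from a CUDA build, using both remote backends
./llama-cli -m qwen3-480b-q4.gguf -ngl 99 \
  --rpc 192.168.1.20:50052,192.168.1.20:50053 \
  -p "..."
```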
-
This release works again: b7157. However, when I try to list RPC nodes from my Mac (Metal), it doesn't work; Ubuntu to Ubuntu works fine.
Output: segmentation fault. I hope the other Metal backend fix is merged soon.
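If "list RPC nodes" here means llama.cpp's device listing, a minimal reproduction might look like this (the `--list-devices` flag exists in recent llama.cpp builds; the address is hypothetical):

```
# From the Mac: register the remote RPC backend, then enumerate devices
./llama-cli --rpc 192.168.1.30:50052 --list-devices
```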
Beta Was this translation helpful? Give feedback.
-
I had an earlier bug report on this that I closed. My testing suggests it's impossible to mix CPU and GPU RPC backends. I have 2 machines, each with a GPU.
I've tried many combinations of parameters in an effort to run llama3.1 405b and qwen3-coder 480b. I have enough memory to load both if using CPU + GPU, but it crashes immediately.
If someone can clear this up I'd appreciate it; I've seen reddit posts showing it's possible. Thanks.
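For concreteness, here's the kind of mixed invocation being described. The `--rpc`, `-ngl`, and `-ts` (tensor split) flags are real llama.cpp options; the addresses, model file, and split ratios are made up for illustration:

```
# One CUDA-backed and one CPU-backed rpc-server plus the local GPU;
# -ts splits the model across the three devices (ratios are illustrative)
./llama-cli -m llama-3.1-405b-q4.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99 -ts 1,1,1 \
  -p "Hello"
```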