Description
I have deployed the google/gemma-3-27b-it model on 4 H100 GPUs, and it only works with a 23k context length. When I increased it to the 128k context window the model supports, I ended up with the following errors.
I even tried a 64k context window, and it still ran into CUDA out-of-memory issues.
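For reference, here is a launcher invocation consistent with the Args dump in the log below. This is a reconstruction from the logged settings, not the exact command used (the deployment appears to be on Kubernetes judging by the pod-style hostname); the image tag and volume paths are assumptions:

docker run --gpus all --shm-size 1g -p 8000:8000 \
  -v /huggingface:/huggingface \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id google/gemma-3-27b-it \
  --sharded true \
  --num-shard 4 \
  --max-input-tokens 32000 \
  --max-total-tokens 64000 \
  --max-batch-prefill-tokens 32000 \
  --port 8000 \
  --huggingface-hub-cache /huggingface/hub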
2025-03-13T08:36:37.262517Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.85.0
Commit sha: 411a282
Docker label: sha-411a282
nvidia-smi:
Thu Mar 13 08:36:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08 Driver Version: 550.127.08 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:45:00.0 Off | 0 |
| N/A 29C P0 70W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:4E:00.0 Off | 0 |
| N/A 29C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000001:1B:00.0 Off | 0 |
| N/A 31C P0 71W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000001:24:00.0 Off | 0 |
| N/A 28C P0 73W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
hpu-smi:
N/A
2025-03-13T08:36:37.262563Z INFO text_generation_launcher: Args {
model_id: "google/gemma-3-27b-it",
revision: None,
validation_workers: 2,
sharded: Some(
true,
),
num_shard: Some(
4,
),
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: Some(
32000,
),
max_input_length: None,
max_total_tokens: Some(
64000,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
32000,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "gemma-3-27b-it-5d7964566c-xnkck",
port: 8000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/huggingface/hub",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: true,
max_client_batch_size: 1,
lora_adapters: None,
usage_stats: Off,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2025-03-13T08:36:40.043396Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching False
2025-03-13T08:36:40.043429Z INFO text_generation_launcher: Sharding model on 4 processes
2025-03-13T08:36:40.043433Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-13T08:36:40.043785Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-3-27b-it
2025-03-13T08:36:43.498233Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-13T08:36:44.060714Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-3-27b-it
2025-03-13T08:36:44.061471Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-13T08:36:44.590395Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2025-03-13T08:36:45.196166Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2025-03-13T08:36:45.867258Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2025-03-13T08:36:47.973482Z INFO text_generation_launcher: Using prefix caching = False
2025-03-13T08:36:47.973534Z INFO text_generation_launcher: Using Attention = flashinfer
2025-03-13T08:36:54.083888Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:36:54.609747Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:36:55.216572Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:36:55.888966Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:04.091352Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:04.617169Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:05.224253Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:05.896938Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:14.098533Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:14.624769Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:15.231953Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:15.904796Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:24.105963Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:24.632677Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:25.239656Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:25.912803Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:34.113333Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:34.641461Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:35.247092Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:35.920604Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:44.120842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:44.649364Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:45.254347Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:45.928487Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:54.128489Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:54.657147Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:55.261709Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:55.936555Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:04.135901Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:38:04.664958Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:38:05.269205Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:38:05.944561Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:14.143354Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:38:14.672706Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:38:15.276730Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:38:15.952321Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:18.500055Z INFO text_generation_launcher: Using prefill chunking = False
2025-03-13T08:38:19.085091Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2025-03-13T08:38:19.176301Z INFO shard-manager: text_generation_launcher: Shard ready in 94.574638951s rank=1
2025-03-13T08:38:21.300395Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2025-03-13T08:38:21.301426Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-03-13T08:38:21.301937Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2025-03-13T08:38:21.348798Z INFO shard-manager: text_generation_launcher: Shard ready in 97.272539231s rank=0
2025-03-13T08:38:21.356498Z INFO shard-manager: text_generation_launcher: Shard ready in 95.475191243s rank=3
2025-03-13T08:38:21.385097Z INFO shard-manager: text_generation_launcher: Shard ready in 96.176034962s rank=2
2025-03-13T08:38:22.958763Z INFO text_generation_launcher: Starting Webserver
2025-03-13T08:38:23.126019Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-03-13T08:38:23.330948Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-03-13T08:38:25.345859Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup
_, _batch, _ = self.generate_token(batch)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
logits, speculative_logits = self.model.forward(
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
hidden_states = self.text_model.model(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
hidden_states, residual = layer(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
attn_output = self.self_attn(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward
attn_output = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 3 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342359 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in
sys.exit(app())
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call
return get_command(self)(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call
return self.main(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.349736Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup
_, _batch, _ = self.generate_token(batch)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
logits, speculative_logits = self.model.forward(
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
hidden_states = self.text_model.model(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
hidden_states, residual = layer(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
attn_output = self.self_attn(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward
attn_output = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 1 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342101 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in
sys.exit(app())
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call
return get_command(self)(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call
return self.main(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.350178Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup
_, _batch, _ = self.generate_token(batch)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
logits, speculative_logits = self.model.forward(
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
hidden_states = self.text_model.model(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
hidden_states, residual = layer(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
attn_output = self.self_attn(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward
attn_output = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 2 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342216 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in
sys.exit(app())
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call
return get_command(self)(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call
return self.main(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.350698Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup
_, _batch, _ = self.generate_token(batch)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
logits, speculative_logits = self.model.forward(
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
hidden_states = self.text_model.model(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
hidden_states, residual = layer(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
attn_output = self.self_attn(
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward
attn_output = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 0 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342032 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in
sys.exit(app())
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in call
return get_command(self)(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in call
return self.main(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
self.model.warmup(batch, max_input_tokens, max_total_tokens)
File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup
raise RuntimeError(
RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.358791Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.370414Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.381723Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
2025-03-13T08:38:25.392642Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens
Error: Backend(Warmup(Generation("Not enough memory to handle 32000 prefill tokens. You need to decrease--max-batch-prefill-tokens")))
2025-03-13T08:38:25.403245Z ERROR text_generation_launcher: Webserver Crashed
2025-03-13T08:38:25.403260Z INFO text_generation_launcher: Shutting down shards
2025-03-13T08:38:25.452182Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2025-03-13T08:38:25.452239Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2025-03-13T08:38:25.459966Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2025-03-13T08:38:25.462190Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2025-03-13T08:38:25.481703Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2025-03-13T08:38:25.481742Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2025-03-13T08:38:25.488581Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2025-03-13T08:38:25.488620Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2025-03-13T08:38:25.862773Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2025-03-13T08:38:27.053688Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
2025-03-13T08:38:27.290200Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
2025-03-13T08:38:27.583555Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: WebserverFailed
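Following the two suggestions in the errors above (decrease --max-batch-prefill-tokens, and set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation), a retry would look like the sketch below. This is an untested starting point rather than a confirmed fix, and the 8192 prefill budget is an assumption to be tuned; with Docker, the environment variable would be passed via -e instead of export:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
text-generation-launcher \
  --model-id google/gemma-3-27b-it \
  --sharded true \
  --num-shard 4 \
  --max-input-tokens 32000 \
  --max-total-tokens 64000 \
  --max-batch-prefill-tokens 8192 \
  --port 8000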