
google/gemma-3-27b-it context length issue #3105

@nskpro-cmd

Description

I have deployed the google/gemma-3-27b-it model on 4 H100 GPUs, but it only works with about a 23k context length. When I increased the context window to 128k, which the model supports, I ended up with the following errors.

I even tried a 64k context window, and it still ran into CUDA out-of-memory issues.
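For reference, the 64k attempt used the configuration captured in the Args dump below. Reconstructed as a docker invocation, it looks roughly like this (the image tag, port mapping, and cache mount are placeholders for my actual Kubernetes deployment):

    docker run --gpus all --shm-size 1g -p 8000:8000 \
        -v /huggingface/hub:/huggingface/hub \
        ghcr.io/huggingface/text-generation-inference:latest \
        --model-id google/gemma-3-27b-it \
        --sharded true \
        --num-shard 4 \
        --port 8000 \
        --max-input-tokens 32000 \
        --max-total-tokens 64000 \
        --max-batch-prefill-tokens 32000

The full logs follow.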

2025-03-13T08:36:37.262517Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.85.0
Commit sha: 411a282
Docker label: sha-411a282
nvidia-smi:
Thu Mar 13 08:36:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:45:00.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000001:1B:00.0 Off |                    0 |
| N/A   31C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000001:24:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
hpu-smi:
N/A

2025-03-13T08:36:37.262563Z INFO text_generation_launcher: Args {
    model_id: "google/gemma-3-27b-it",
    revision: None,
    validation_workers: 2,
    sharded: Some(
        true,
    ),
    num_shard: Some(
        4,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        32000,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        64000,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        32000,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "gemma-3-27b-it-5d7964566c-xnkck",
    port: 8000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/huggingface/hub",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 1,
    lora_adapters: None,
    usage_stats: Off,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
}
2025-03-13T08:36:40.043396Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching False
2025-03-13T08:36:40.043429Z INFO text_generation_launcher: Sharding model on 4 processes
2025-03-13T08:36:40.043433Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-13T08:36:40.043785Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-3-27b-it
2025-03-13T08:36:43.498233Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-13T08:36:44.060714Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-3-27b-it
2025-03-13T08:36:44.061471Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-13T08:36:44.590395Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2025-03-13T08:36:45.196166Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2025-03-13T08:36:45.867258Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2025-03-13T08:36:47.973482Z INFO text_generation_launcher: Using prefix caching = False
2025-03-13T08:36:47.973534Z INFO text_generation_launcher: Using Attention = flashinfer
2025-03-13T08:36:54.083888Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:36:54.609747Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:36:55.216572Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:36:55.888966Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:04.091352Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:04.617169Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:05.224253Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:05.896938Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:14.098533Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:14.624769Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:15.231953Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:15.904796Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:24.105963Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:24.632677Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:25.239656Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:25.912803Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:34.113333Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:34.641461Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:35.247092Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:35.920604Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:44.120842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:44.649364Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:45.254347Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:45.928487Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:54.128489Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:54.657147Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:55.261709Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:55.936555Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:04.135901Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:38:04.664958Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:38:05.269205Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:38:05.944561Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:14.143354Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:38:14.672706Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:38:15.276730Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:38:15.952321Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:18.500055Z INFO text_generation_launcher: Using prefill chunking = False
2025-03-13T08:38:19.085091Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2025-03-13T08:38:19.176301Z INFO shard-manager: text_generation_launcher: Shard ready in 94.574638951s rank=1
2025-03-13T08:38:21.300395Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2025-03-13T08:38:21.301426Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-03-13T08:38:21.301937Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2025-03-13T08:38:21.348798Z INFO shard-manager: text_generation_launcher: Shard ready in 97.272539231s rank=0
2025-03-13T08:38:21.356498Z INFO shard-manager: text_generation_launcher: Shard ready in 95.475191243s rank=3
2025-03-13T08:38:21.385097Z INFO shard-manager: text_generation_launcher: Shard ready in 96.176034962s rank=2
2025-03-13T08:38:22.958763Z INFO text_generation_launcher: Starting Webserver
2025-03-13T08:38:23.126019Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-03-13T08:38:23.330948Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-03-13T08:38:25.345859Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup
    _, _batch, _ = self.generate_token(batch)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
    hidden_states = self.text_model.model(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
    hidden_states, residual = layer(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
    attn_output = self.self_attn(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward
    attn_output = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 3 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342359 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/.venv/bin/text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
    self.model.warmup(batch, max_input_tokens, max_total_tokens)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.349736Z ERROR text_generation_launcher: Method Warmup encountered an error.
2025-03-13T08:38:25.350178Z ERROR text_generation_launcher: Method Warmup encountered an error.
2025-03-13T08:38:25.350698Z ERROR text_generation_launcher: Method Warmup encountered an error.
(The remaining three shards fail with identical tracebacks, differing only in device and PID: GPU 1 / process 3342101, GPU 2 / process 3342216, and GPU 0 / process 3342032. Each tries to allocate 30.52 GiB in F.scaled_dot_product_attention and ends in the same RuntimeError.)
2025-03-13T08:38:25.358791Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.370414Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.381723Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.392642Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Backend(Warmup(Generation("Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens")))
2025-03-13T08:38:25.403245Z ERROR text_generation_launcher: Webserver Crashed
2025-03-13T08:38:25.403260Z INFO text_generation_launcher: Shutting down shards
2025-03-13T08:38:25.452182Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2025-03-13T08:38:25.452239Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2025-03-13T08:38:25.459966Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2025-03-13T08:38:25.462190Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2025-03-13T08:38:25.481703Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2025-03-13T08:38:25.481742Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2025-03-13T08:38:25.488581Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2025-03-13T08:38:25.488620Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2025-03-13T08:38:25.862773Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2025-03-13T08:38:27.053688Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
2025-03-13T08:38:27.290200Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
2025-03-13T08:38:27.583555Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: WebserverFailed
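A back-of-envelope check: the failing 30.52 GiB allocation matches the full quadratic attention-score matrix for a single shard, assuming Gemma-3-27B's 32 query heads are split across the 4 GPUs (8 heads per GPU) and the scores are materialized in fp32:

    8 heads x 32000 x 32000 tokens x 4 bytes = 32.77 GB = 30.52 GiB

So the warmup prefill appears to be going through the eager F.scaled_dot_product_attention path (visible in the traceback) rather than the flashinfer kernel, making attention memory grow quadratically with the prefill size. If that is right, lowering --max-batch-prefill-tokens (e.g. to 8192) shrinks this allocation by roughly (32000/8192)^2, about 15x, though with prefill chunking disabled (as in this log) it also caps the practical input length.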
