
google/gemma-3-27b-it context length issue #3105

@nskpro-cmd

Description

I have deployed the google/gemma-3-27b-it model on 4 H100 GPUs, but it only works with about a 23k context length. When I increased the context window to 128k, which the model supports, I ended up with the following errors.

I even tried a 64k context window, and it still ran into CUDA out-of-memory issues.
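For reference, the 64k attempt used the configuration captured in the Args dump below. Reconstructed as a docker invocation, it looks roughly like this (the image tag, port mapping, and cache mount are placeholders for my actual Kubernetes deployment):

    docker run --gpus all --shm-size 1g -p 8000:8000 \
        -v /huggingface/hub:/huggingface/hub \
        ghcr.io/huggingface/text-generation-inference:latest \
        --model-id google/gemma-3-27b-it \
        --sharded true \
        --num-shard 4 \
        --port 8000 \
        --max-input-tokens 32000 \
        --max-total-tokens 64000 \
        --max-batch-prefill-tokens 32000

The full logs follow.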

2025-03-13T08:36:37.262517Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.85.0
Commit sha: 411a282
Docker label: sha-411a282
nvidia-smi:
Thu Mar 13 08:36:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:45:00.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000001:1B:00.0 Off |                    0 |
| N/A   31C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000001:24:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
xpu-smi:
N/A
hpu-smi:
N/A

2025-03-13T08:36:37.262563Z INFO text_generation_launcher: Args {
    model_id: "google/gemma-3-27b-it",
    revision: None,
    validation_workers: 2,
    sharded: Some(
        true,
    ),
    num_shard: Some(
        4,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        32000,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        64000,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        32000,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "gemma-3-27b-it-5d7964566c-xnkck",
    port: 8000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/huggingface/hub",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 1,
    lora_adapters: None,
    usage_stats: Off,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
}
2025-03-13T08:36:40.043396Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching False
2025-03-13T08:36:40.043429Z INFO text_generation_launcher: Sharding model on 4 processes
2025-03-13T08:36:40.043433Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-13T08:36:40.043785Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-3-27b-it
2025-03-13T08:36:43.498233Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-13T08:36:44.060714Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-3-27b-it
2025-03-13T08:36:44.061471Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-13T08:36:44.590395Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2025-03-13T08:36:45.196166Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2025-03-13T08:36:45.867258Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2025-03-13T08:36:47.973482Z INFO text_generation_launcher: Using prefix caching = False
2025-03-13T08:36:47.973534Z INFO text_generation_launcher: Using Attention = flashinfer
2025-03-13T08:36:54.083888Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:36:54.609747Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:36:55.216572Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:36:55.888966Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:04.091352Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:04.617169Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:05.224253Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:05.896938Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:14.098533Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:14.624769Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:15.231953Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:15.904796Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:24.105963Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:24.632677Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:25.239656Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:25.912803Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:34.113333Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:34.641461Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:35.247092Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:35.920604Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:44.120842Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:44.649364Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:45.254347Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:45.928487Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:37:54.128489Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:37:54.657147Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:37:55.261709Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:37:55.936555Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:04.135901Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:38:04.664958Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:38:05.269205Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:38:05.944561Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:14.143354Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-03-13T08:38:14.672706Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2025-03-13T08:38:15.276730Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2025-03-13T08:38:15.952321Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2025-03-13T08:38:18.500055Z INFO text_generation_launcher: Using prefill chunking = False
2025-03-13T08:38:19.085091Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2025-03-13T08:38:19.176301Z INFO shard-manager: text_generation_launcher: Shard ready in 94.574638951s rank=1
2025-03-13T08:38:21.300395Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2025-03-13T08:38:21.301426Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-03-13T08:38:21.301937Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2025-03-13T08:38:21.348798Z INFO shard-manager: text_generation_launcher: Shard ready in 97.272539231s rank=0
2025-03-13T08:38:21.356498Z INFO shard-manager: text_generation_launcher: Shard ready in 95.475191243s rank=3
2025-03-13T08:38:21.385097Z INFO shard-manager: text_generation_launcher: Shard ready in 96.176034962s rank=2
2025-03-13T08:38:22.958763Z INFO text_generation_launcher: Starting Webserver
2025-03-13T08:38:23.126019Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-03-13T08:38:23.330948Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-03-13T08:38:25.345859Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1585, in warmup
    _, _batch, _ = self.generate_token(batch)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1971, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 482, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 888, in forward
    hidden_states = self.text_model.model(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 547, in forward
    hidden_states, residual = layer(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 449, in forward
    attn_output = self.self_attn(
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/server/text_generation_server/models/custom_modeling/flash_gemma3_modeling.py", line 296, in forward
    attn_output = F.scaled_dot_product_attention(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.52 GiB. GPU 3 has a total capacity of 79.10 GiB of which 14.37 GiB is free. Process 3342359 has 64.72 GiB memory in use. 79.10 GiB allowed; Of the allocated memory 62.19 GiB is allocated by PyTorch, and 1.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/.venv/bin/text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/src/.venv/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/usr/src/server/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/src/.venv/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/server/text_generation_server/server.py", line 144, in Warmup
    self.model.warmup(batch, max_input_tokens, max_total_tokens)
  File "/usr/src/server/text_generation_server/models/flash_causal_lm.py", line 1587, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.349736Z ERROR text_generation_launcher: Method Warmup encountered an error.
2025-03-13T08:38:25.350178Z ERROR text_generation_launcher: Method Warmup encountered an error.
2025-03-13T08:38:25.350698Z ERROR text_generation_launcher: Method Warmup encountered an error.
(The remaining three shards fail with identical tracebacks, differing only in device and PID: GPU 1 / process 3342101, GPU 2 / process 3342216, and GPU 0 / process 3342032. Each tries to allocate 30.52 GiB in F.scaled_dot_product_attention and ends in the same RuntimeError.)
2025-03-13T08:38:25.358791Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.370414Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.381723Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
2025-03-13T08:38:25.392642Z ERROR warmup{max_input_length=Some(32000) max_prefill_tokens=32000 max_total_tokens=Some(64000) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Backend(Warmup(Generation("Not enough memory to handle 32000 prefill tokens. You need to decrease --max-batch-prefill-tokens")))
2025-03-13T08:38:25.403245Z ERROR text_generation_launcher: Webserver Crashed
2025-03-13T08:38:25.403260Z INFO text_generation_launcher: Shutting down shards
2025-03-13T08:38:25.452182Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2025-03-13T08:38:25.452239Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2025-03-13T08:38:25.459966Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2025-03-13T08:38:25.462190Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2025-03-13T08:38:25.481703Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2025-03-13T08:38:25.481742Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2025-03-13T08:38:25.488581Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2025-03-13T08:38:25.488620Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2025-03-13T08:38:25.862773Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2025-03-13T08:38:27.053688Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
2025-03-13T08:38:27.290200Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
2025-03-13T08:38:27.583555Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: WebserverFailed
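A back-of-envelope check: the failing 30.52 GiB allocation matches the full quadratic attention-score matrix for a single shard, assuming Gemma-3-27B's 32 query heads are split across the 4 GPUs (8 heads per GPU) and the scores are materialized in fp32:

    8 heads x 32000 x 32000 tokens x 4 bytes = 32.77 GB = 30.52 GiB

So the warmup prefill appears to be going through the eager F.scaled_dot_product_attention path (visible in the traceback) rather than the flashinfer kernel, making attention memory grow quadratically with the prefill size. If that is right, lowering --max-batch-prefill-tokens (e.g. to 8192) shrinks this allocation by roughly (32000/8192)^2, about 15x, though with prefill chunking disabled (as in this log) it also caps the practical input length.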
