Implicit caching produces inconsistent cache hits when system prompt changes #1880

@conradlee

Description

When using implicit caching with Gemini 2.5 Flash, requests whose system prompts share a long common prefix (>9k tokens) but have different suffixes produce inconsistent cache hits. The cache hit rate appears random (~40-60%) rather than consistently hitting after the first call establishes the cache.

According to the context caching documentation, implicit caching should cache common prefixes and provide cost savings when subsequent requests share that prefix.

Environment

  • Package: google-genai 1.51.0
  • API: Vertex AI mode
  • Model: gemini-2.5-flash
  • Region: europe-west8
  • Python: 3.12

Reproduction

import time
from google import genai
from google.genai.types import GenerateContentConfig

PROJECT_ID = "your-project-id"
LOCATION = "europe-west8"
MODEL_ID = "gemini-2.5-flash"

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Create a long shared prefix (~10k tokens)
shared_prefix = "You are an expert assistant. " + (
    "Knowledge base entry: software architecture patterns, "
    "design principles, and best practices for building scalable systems. "
) * 500

run_id = int(time.time())

for i in range(12):
    # Each call has the same long prefix but a unique suffix
    unique_suffix = f"Session identifier: {run_id}-{i:03d}"
    system_instruction = shared_prefix + unique_suffix

    response = client.models.generate_content(
        model=MODEL_ID,
        contents="Reply with OK.",
        config=GenerateContentConfig(system_instruction=system_instruction),
    )

    input_tokens = response.usage_metadata.prompt_token_count
    cached_tokens = response.usage_metadata.cached_content_token_count or 0
    cache_pct = (cached_tokens / input_tokens * 100) if cached_tokens else 0

    print(f"Call {i + 1:2d}: {cached_tokens:>5}/{input_tokens} cached ({cache_pct:>5.1f}%)")

Actual Output

Call  1:     0/9528 tokens cached (  0.0%)
Call  2:     0/9528 tokens cached (  0.0%)
Call  3:     0/9528 tokens cached (  0.0%)
Call  4:  9188/9528 tokens cached ( 96.4%)
Call  5:  9188/9528 tokens cached ( 96.4%)
Call  6:  9188/9528 tokens cached ( 96.4%)
Call  7:     0/9528 tokens cached (  0.0%)
Call  8:     0/9528 tokens cached (  0.0%)
Call  9:  9188/9528 tokens cached ( 96.4%)
Call 10:  9188/9528 tokens cached ( 96.4%)
Call 11:     0/9528 tokens cached (  0.0%)
Call 12:     0/9528 tokens cached (  0.0%)

Cache hit rate: 5/12 (42%)

Expected Behavior

After the first call establishes the cache, subsequent calls sharing the same ~9k token prefix should consistently hit the cache (calls 2-12 should all show ~96% cached tokens).

Observations

  1. Inconsistent hits: Cache hits appear random, not sequential
  2. Clustering: Hits sometimes cluster (calls 4-6, 9-10) then disappear
  3. First call variance: Sometimes 2-3 calls miss before any cache hit
  4. No timing correlation: Adding delays between calls (0.5s, 1s) did not improve reliability

Questions

  1. Is this expected behavior for implicit caching with varying suffixes?
  2. Is there a recommended pattern for achieving consistent cache hits with shared prefixes?
  3. Should we use explicit caching instead for this use case? (See the sketch after this list.)
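
Regarding question 3, a minimal sketch of the explicit-caching variant (reusing client, MODEL_ID, shared_prefix, and run_id from the reproduction; the "3600s" TTL is an arbitrary example value, and this is untested against the same backend):

from google.genai import types

# Create an explicit cache holding only the shared prefix.
cache = client.caches.create(
    model=MODEL_ID,
    config=types.CreateCachedContentConfig(
        system_instruction=shared_prefix,
        ttl="3600s",  # example TTL; tune to the workload
    ),
)

# Reference the cache by name on each request; the varying suffix moves
# into the user contents so the cached prefix stays byte-identical.
response = client.models.generate_content(
    model=MODEL_ID,
    contents=f"Session identifier: {run_id}-000\n\nReply with OK.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)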

Additional Context

  • Exact-match requests (identical system instruction + content) cache reliably on the second call
  • The issue only manifests when the suffix differs between calls (a workaround sketch follows this list)
  • Token count (~9.5k) is well above the minimum input size for implicit caching (1,024 tokens for Gemini 2.5 Flash; 2,048 for 2.5 Pro)
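
The workaround referenced above: keep the system instruction byte-identical across calls and move the per-call suffix into the user contents, so the implicit cache always sees the same prefix (a sketch under the reproduction's variables; untested at scale):

for i in range(12):
    # The system instruction no longer varies, so the implicit-cache
    # prefix is identical on every call.
    unique_suffix = f"Session identifier: {run_id}-{i:03d}"

    response = client.models.generate_content(
        model=MODEL_ID,
        contents=f"{unique_suffix}\n\nReply with OK.",
        config=GenerateContentConfig(system_instruction=shared_prefix),
    )
    cached = response.usage_metadata.cached_content_token_count or 0
    print(f"Call {i + 1:2d}: {cached} tokens cached")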

Labels

priority: p2 · status: awaiting user response · status: stale · type: bug
