Implicit caching produces inconsistent cache hits when system prompt changes #1880

@conradlee

Description

When using implicit caching with Gemini 2.5 Flash, requests whose system prompts share a long common prefix (>9k tokens) but have different suffixes produce inconsistent cache hits. The cache hit rate appears random (~40-60%) rather than consistently hitting after the first call establishes the cache.

According to the context caching documentation, implicit caching should cache common prefixes and provide cost savings when subsequent requests share that prefix.

Environment

  • Package: google-genai 1.51.0
  • API: Vertex AI mode
  • Model: gemini-2.5-flash
  • Region: europe-west8
  • Python: 3.12

Reproduction

import time
from google import genai
from google.genai.types import GenerateContentConfig

PROJECT_ID = "your-project-id"
LOCATION = "europe-west8"
MODEL_ID = "gemini-2.5-flash"

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Create a long shared prefix (~10k tokens)
shared_prefix = "You are an expert assistant. " + (
    "Knowledge base entry: software architecture patterns, "
    "design principles, and best practices for building scalable systems. "
) * 500

run_id = int(time.time())

for i in range(12):
    # Each call has the same long prefix but a unique suffix
    unique_suffix = f"Session identifier: {run_id}-{i:03d}"
    system_instruction = shared_prefix + unique_suffix

    response = client.models.generate_content(
        model=MODEL_ID,
        contents="Reply with OK.",
        config=GenerateContentConfig(system_instruction=system_instruction),
    )

    input_tokens = response.usage_metadata.prompt_token_count
    cached_tokens = response.usage_metadata.cached_content_token_count or 0
    cache_pct = (cached_tokens / input_tokens * 100) if cached_tokens else 0

    print(f"Call {i + 1:2d}: {cached_tokens:>5}/{input_tokens} cached ({cache_pct:>5.1f}%)")

Actual Output

Call  1:     0/9528 tokens cached (  0.0%)
Call  2:     0/9528 tokens cached (  0.0%)
Call  3:     0/9528 tokens cached (  0.0%)
Call  4:  9188/9528 tokens cached ( 96.4%)
Call  5:  9188/9528 tokens cached ( 96.4%)
Call  6:  9188/9528 tokens cached ( 96.4%)
Call  7:     0/9528 tokens cached (  0.0%)
Call  8:     0/9528 tokens cached (  0.0%)
Call  9:  9188/9528 tokens cached ( 96.4%)
Call 10:  9188/9528 tokens cached ( 96.4%)
Call 11:     0/9528 tokens cached (  0.0%)
Call 12:     0/9528 tokens cached (  0.0%)

Cache hit rate: 5/12 (42%)

Expected Behavior

After the first call establishes the cache, subsequent calls sharing the same ~9k token prefix should consistently hit the cache (calls 2-12 should all show ~96% cached tokens).

Observations

  1. Inconsistent hits: Cache hits appear random, not sequential
  2. Clustering: Hits sometimes cluster (calls 4-6, 9-10) then disappear
  3. First call variance: Sometimes 2-3 calls miss before any cache hit
  4. No timing correlation: Adding delays between calls (0.5s, 1s) did not improve reliability

Questions

  1. Is this expected behavior for implicit caching with varying suffixes?
  2. Is there a recommended pattern for achieving consistent cache hits with shared prefixes?
  3. Should we use explicit caching instead for this use case? (See the sketch after this list.)
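
Regarding question 3, a minimal sketch of the explicit-caching variant (reusing client, MODEL_ID, shared_prefix, and run_id from the reproduction; the "3600s" TTL is an arbitrary example value, and this is untested against the same backend):

from google.genai import types

# Create an explicit cache holding only the shared prefix.
cache = client.caches.create(
    model=MODEL_ID,
    config=types.CreateCachedContentConfig(
        system_instruction=shared_prefix,
        ttl="3600s",  # example TTL; tune to the workload
    ),
)

# Reference the cache by name on each request; the varying suffix moves
# into the user contents so the cached prefix stays byte-identical.
response = client.models.generate_content(
    model=MODEL_ID,
    contents=f"Session identifier: {run_id}-000\n\nReply with OK.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata.cached_content_token_count)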

Additional Context

  • Exact-match requests (identical system instruction + content) cache reliably on the second call
  • The issue only manifests when the suffix differs between calls (a workaround sketch follows this list)
  • Token count (~9.5k) is well above the minimum input size for implicit caching (1,024 tokens for Gemini 2.5 Flash; 2,048 for 2.5 Pro)
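
The workaround referenced above: keep the system instruction byte-identical across calls and move the per-call suffix into the user contents, so the implicit cache always sees the same prefix (a sketch under the reproduction's variables; untested at scale):

for i in range(12):
    # The system instruction no longer varies, so the implicit-cache
    # prefix is identical on every call.
    unique_suffix = f"Session identifier: {run_id}-{i:03d}"

    response = client.models.generate_content(
        model=MODEL_ID,
        contents=f"{unique_suffix}\n\nReply with OK.",
        config=GenerateContentConfig(system_instruction=shared_prefix),
    )
    cached = response.usage_metadata.cached_content_token_count or 0
    print(f"Call {i + 1:2d}: {cached} tokens cached")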

Labels

priority: p2 · status: awaiting user response · status: stale · type: bug
