# Add anthropic_cache_messages model setting and automatically strip cache points over the limit (#3442)
**Open** · Wh1isper wants to merge 15 commits into `pydantic:main` from `Wh1isper:feat-anthropic-cache-all` · +786 −29
## Commits (15)

- `b4862a0` feat: add cache all and limit cache point in AnthropicModel (Wh1isper)
- `9bf3f6e` fix ci issues (Wh1isper)
- `0f0dd76` use BetaTextBlockParam and add nocover (Wh1isper)
- `8bf1d94` use anthropic_cache_messages (Wh1isper)
- `240f71c` fix ci (Wh1isper)
- `ae63b13` update docstring for _limit_cache_points (Wh1isper)
- `6cceb59` fix doc example issue (Wh1isper)
- `264ad1e` Merge branch 'main' into feat-anthropic-cache-all (Wh1isper)
- `0aa82ad` fix doc check (Wh1isper)
- `779bd40` update docs and add real case (Wh1isper)
- `bf0dc84` test via real api key (Wh1isper)
- `7f317f0` fix docs ruff issues (Wh1isper)
- `a8f8eaf` Merge branch 'main' into feat-anthropic-cache-all (DouweM)
- `63500d7` Update docs/models/anthropic.md (Wh1isper)
- `03dfa19` use run_async in docs (Wh1isper)
## Changes to `docs/models/anthropic.md`

@@ -80,25 +80,45 @@ agent = Agent(model)

## Prompt Caching

Anthropic supports [prompt caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) to reduce costs by caching parts of your prompts. Pydantic AI provides four ways to use prompt caching:

1. **Cache User Messages with [`CachePoint`][pydantic_ai.messages.CachePoint]**: Insert a `CachePoint` marker in your user messages to cache everything before it
2. **Cache System Instructions**: Set [`AnthropicModelSettings.anthropic_cache_instructions`][pydantic_ai.models.anthropic.AnthropicModelSettings.anthropic_cache_instructions] to `True` (uses a 5m TTL by default) or specify `'5m'` / `'1h'` directly
3. **Cache Tool Definitions**: Set [`AnthropicModelSettings.anthropic_cache_tool_definitions`][pydantic_ai.models.anthropic.AnthropicModelSettings.anthropic_cache_tool_definitions] to `True` (uses a 5m TTL by default) or specify `'5m'` / `'1h'` directly
4. **Cache Last Message (Convenience)**: Set [`AnthropicModelSettings.anthropic_cache_messages`][pydantic_ai.models.anthropic.AnthropicModelSettings.anthropic_cache_messages] to `True` to automatically cache the last user message

You can combine multiple strategies for maximum savings:

```python {test="skip"}
from pydantic_ai import Agent, CachePoint, RunContext
from pydantic_ai.models.anthropic import AnthropicModelSettings

# Example 1: Use anthropic_cache_messages for automatic last-message caching
agent = Agent(
    'anthropic:claude-sonnet-4-5',
    system_prompt='You are a helpful assistant.',
    model_settings=AnthropicModelSettings(
        anthropic_cache_messages=True,  # Automatically caches the last message
    ),
)

async def main():
    # The last message is automatically cached - no need for a manual CachePoint
    result1 = await agent.run('What is the capital of France?')

    # Subsequent calls with a similar conversation benefit from the cache
    result2 = await agent.run('What is the capital of Germany?')
    print(f'Cache write: {result1.usage().cache_write_tokens}')
    print(f'Cache read: {result2.usage().cache_read_tokens}')

# Example 2: Combine with other cache settings for comprehensive caching
agent = Agent(
    'anthropic:claude-sonnet-4-5',
    system_prompt='Detailed instructions...',
    model_settings=AnthropicModelSettings(
        # Use True for the default 5m TTL, or specify '5m' / '1h' directly
        anthropic_cache_instructions=True,  # Cache system instructions
        anthropic_cache_tool_definitions='1h',  # Cache tool definitions with 1h TTL
        anthropic_cache_messages=True,  # Also cache the last message
    ),
)
```

@@ -107,22 +127,25 @@ def search_docs(ctx: RunContext, query: str) -> str:

```python {test="skip"}
def search_docs(ctx: RunContext, query: str) -> str:
    """Search documentation."""
    return f'Results for {query}'

async def main():  # noqa: F811
    # All three cache points are used: instructions, tools, and last message
    result = await agent.run('Search for Python best practices')
    print(result.output)

# Example 3: Fine-grained control with manual CachePoint markers
agent = Agent(
    'anthropic:claude-sonnet-4-5',
    system_prompt='Instructions...',
)

async def main():  # noqa: F811
    # Manually control cache points for specific content blocks
    result = await agent.run([
        'Long context from documentation...',
        CachePoint(),  # Cache everything up to this point
        'First question'
    ])
    print(result.output)
```

Access cache usage statistics via `result.usage()`:

@@ -139,9 +162,98 @@ agent = Agent(

```python {test="skip"}
    ),
)

async def main():  # noqa: F811
    result = await agent.run('Your question')
    usage = result.usage()
    print(f'Cache write tokens: {usage.cache_write_tokens}')
    print(f'Cache read tokens: {usage.cache_read_tokens}')
```

### Cache Point Limits

Anthropic enforces a maximum of 4 cache points per request. Pydantic AI automatically manages this limit so your requests always comply without errors.

#### How Cache Points Are Allocated

Cache points can be placed in three locations:

1. **System Prompt**: Via the `anthropic_cache_instructions` setting (adds a cache point to the last system prompt block)
2. **Tool Definitions**: Via the `anthropic_cache_tool_definitions` setting (adds a cache point to the last tool definition)
3. **Messages**: Via `CachePoint` markers or the `anthropic_cache_messages` setting (adds cache points to message content)

Each setting uses **at most 1 cache point**, but you can combine them:

```python {test="skip"}
from pydantic_ai import Agent, CachePoint
from pydantic_ai.models.anthropic import AnthropicModelSettings

# Example: Using all 3 cache point sources
agent = Agent(
    'anthropic:claude-sonnet-4-5',
    system_prompt='Detailed instructions...',
    model_settings=AnthropicModelSettings(
        anthropic_cache_instructions=True,  # 1 cache point
        anthropic_cache_tool_definitions=True,  # 1 cache point
        anthropic_cache_messages=True,  # 1 cache point
    ),
)

@agent.tool_plain
def my_tool() -> str:
    return 'result'

async def main():  # noqa: F811
    # This uses 3 cache points (instructions + tools + last message),
    # so you can add 1 more CachePoint marker before hitting the limit
    result = await agent.run([
        'Context', CachePoint(),  # 4th cache point - OK
        'Question'
    ])
    print(result.output)
    usage = result.usage()
    print(f'Cache write tokens: {usage.cache_write_tokens}')
    print(f'Cache read tokens: {usage.cache_read_tokens}')
```

#### Automatic Cache Point Limiting

When the cache points from all sources (settings + `CachePoint` markers) exceed 4, Pydantic AI automatically removes the excess cache points from **older message content**, keeping the most recent ones:

```python {test="skip"}
from pydantic_ai import Agent, CachePoint
from pydantic_ai.models.anthropic import AnthropicModelSettings

agent = Agent(
    'anthropic:claude-sonnet-4-5',
    system_prompt='Instructions...',
    model_settings=AnthropicModelSettings(
        anthropic_cache_instructions=True,  # 1 cache point
        anthropic_cache_tool_definitions=True,  # 1 cache point
    ),
)

@agent.tool_plain
def search() -> str:
    return 'data'

async def main():  # noqa: F811
    # Already using 2 cache points (instructions + tools),
    # so only 2 more CachePoint markers fit within the 4-point limit
    result = await agent.run([
        'Context 1', CachePoint(),  # Oldest - will be removed
        'Context 2', CachePoint(),  # Will be kept (3rd point)
        'Context 3', CachePoint(),  # Will be kept (4th point)
        'Question'
    ])
    # Final cache points: instructions + tools + Context 2 + Context 3 = 4
    print(result.output)
    usage = result.usage()
    print(f'Cache write tokens: {usage.cache_write_tokens}')
    print(f'Cache read tokens: {usage.cache_read_tokens}')
```

**Key Points**:

- System and tool cache points are **always preserved**
- The cache point created by `anthropic_cache_messages` is **always preserved** (it is the newest message cache point)
- Additional `CachePoint` markers in messages are removed from oldest to newest when the limit is exceeded
- This ensures critical caching (instructions/tools) is maintained while still benefiting from message-level caching
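The stripping behavior described above can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not pydantic-ai's actual `_limit_cache_points` code: the function name, signature, and the `reserved` parameter are hypothetical, and only the keep-the-newest-markers policy is taken from the docs.

```python
# Hypothetical sketch of the cache-point limiting policy (not pydantic-ai's
# real implementation): given the positions of CachePoint markers in message
# content (oldest first) and the number of cache points already reserved by
# settings (instructions/tools/last-message), keep only the newest markers
# that fit within Anthropic's 4-cache-point limit.
MAX_CACHE_POINTS = 4


def limit_cache_points(marker_indices: list[int], reserved: int) -> list[int]:
    """Return the marker indices that survive the 4-cache-point limit."""
    budget = MAX_CACHE_POINTS - reserved
    if budget <= 0:
        # Settings alone consume the whole budget; drop every marker.
        return []
    # Drop from the oldest end, keeping the most recent markers.
    return marker_indices[-budget:]
```

For the example above (instructions + tools reserved, three markers), `limit_cache_points([0, 2, 5], reserved=2)` returns `[2, 5]`: the oldest marker is dropped and the two newest survive, matching the documented behavior.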