Conversation

@bcdurak bcdurak commented Oct 30, 2025

This is a large PR, so I would advise you to read the whole description before starting the review.

What do we capture now?

Previously, ZenML added a new handler to the root logger. This handler captured all the logs that go through the root logger and stored them in the artifact store. Additionally, we wrapped the built-in print function so that printed messages were stored as well. However, this approach missed a couple of sources, such as messages from loggers that do not propagate to the root logger and anything written to stdout/stderr outside of log messages and print statements.

Now, we do the following:

  • stdout and stderr are now wrapped. We keep references to the original stdout and stderr.
  • Everything that goes through the new wrapped stdout/stderr still goes through the original stdout/stderr.
  • Additionally, every message is also directed to a classmethod called LoggingContext.emit(...) (explained in the following section).
  • We still add a handler to the root logger. With this, the root logger ends up having two handlers: the console_handler and the zenml_handler.
  • While the console_handler formats and writes messages to the console, the zenml_handler is responsible for routing all incoming log messages to LoggingContext.emit(...) as well.
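The stdout/stderr wrapping described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual ZenML implementation; the real wrapper forwards to LoggingContext.emit(...) rather than a plain callback:

```python
import io
import sys

class _TeeStream(io.TextIOBase):
    """Hypothetical sketch: forwards writes to the original stream while also
    routing each message to a logging callback (e.g. LoggingContext.emit)."""

    def __init__(self, original, emit):
        self.original = original  # the unwrapped stdout/stderr is kept around
        self._emit = emit         # stand-in for LoggingContext.emit(...)

    def write(self, message):
        n = self.original.write(message)  # everything still reaches the console
        if message.strip():               # skip bare newlines
            self._emit(message)           # ...and is also routed to the log store
        return n

    def flush(self):
        self.original.flush()

# Usage sketch: capture everything printed while the wrapper is installed.
captured = []
original_stdout = sys.stdout
sys.stdout = _TeeStream(original_stdout, captured.append)
try:
    print("hello from the pipeline")
finally:
    sys.stdout = original_stdout  # always restore the original stream
```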

The new LoggingContext class

We have a new LoggingContext class now that replaces the old PipelineLogsContext. It's still a context manager, but operates a bit differently.

When you __init__ this class, it stores a reference to the log store within your active stack. Every time the __enter__ method gets called, it checks a context variable called active_logging_context; if one is set, it stores it and replaces the context variable with itself. Similarly, when __exit__ gets called, it removes itself from the context variable and puts the old value back.

One of the most critical parts is that a LogsResponse is now required to initiate a LoggingContext. So, when we ultimately call the emit(...) classmethod, it passes the message and the active logging context (along with the corresponding LogsResponse) to the emit(...) method of the log store.
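The enter/exit bookkeeping described above can be sketched with contextvars. This is a simplified illustration; the real class also carries the log store reference and the LogsResponse rather than a plain name:

```python
from contextvars import ContextVar

# Hypothetical sketch of the save/replace/restore dance described above.
active_logging_context: ContextVar = ContextVar("active_logging_context", default=None)

class LoggingContext:
    def __init__(self, name):
        self.name = name      # stands in for the log store + LogsResponse
        self._previous = None

    def __enter__(self):
        self._previous = active_logging_context.get()  # remember the old value
        active_logging_context.set(self)               # install ourselves
        return self

    def __exit__(self, *exc):
        active_logging_context.set(self._previous)     # put the old value back
        return False

    @classmethod
    def emit(cls, message):
        ctx = active_logging_context.get()
        if ctx is not None:
            return (ctx.name, message)  # would call the log store's emit(...)
        return None

# Nested contexts restore each other correctly on exit.
with LoggingContext("outer"):
    with LoggingContext("inner"):
        routed_inner = LoggingContext.emit("hi")
    routed_outer = LoggingContext.emit("hi again")
```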

The new LogStore component

We have a new type of stack component called a LogStore. It handles log collection and retrieval. Different implementations can plug into this interface to provide different storage backends without changing how logs are captured or accessed.

This PR also introduces three layers of implementation:

Layer 1: BaseLogStore

This layer introduces the main abstraction for the new stack component. Main abstract methods include:

  • emit(...): receives log records and sends them to a specific backend
  • fetch(...): retrieves stored logs for the dashboard and API based on time filters and limits
  • finalize(...): finalizes the stream of logs associated with a specific log response
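The interface above can be sketched as an abstract base class. Signatures here are assumptions for illustration; the real interface takes richer record and response types:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the three abstract methods named above.
class BaseLogStore(ABC):
    @abstractmethod
    def emit(self, logs_response_id, record):
        """Receive a log record and send it to the backend."""

    @abstractmethod
    def fetch(self, logs_response_id, limit=None):
        """Retrieve stored logs, optionally limited."""

    @abstractmethod
    def finalize(self, logs_response_id):
        """Finalize the stream of logs for a given logs response."""

class InMemoryLogStore(BaseLogStore):
    """Toy backend showing how an implementation plugs into the interface."""

    def __init__(self):
        self._logs = {}
        self.finalized = set()

    def emit(self, logs_response_id, record):
        self._logs.setdefault(logs_response_id, []).append(record)

    def fetch(self, logs_response_id, limit=None):
        records = self._logs.get(logs_response_id, [])
        return records[:limit] if limit else records

    def finalize(self, logs_response_id):
        self.finalized.add(logs_response_id)

store = InMemoryLogStore()
store.emit("run-1", "step started")
store.emit("run-1", "step finished")
store.finalize("run-1")
```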

Layer 2: OtelLogStore

This is an intermediate abstraction built on the base log store that implements the core OTEL infrastructure:

  • emit(...): Activates the log store if not yet activated, translates log record objects into OTEL format with ZenML-specific attributes (e.g., zenml.log_id, zenml.log_uri, zenml.log_store_id), and emits them through the OTEL logger
  • activate(...): Sets up the OpenTelemetry pipeline including the LoggerProvider, BatchLogRecordProcessor, and LoggingHandler
  • deactivate(...): Flushes pending logs and shuts down the processor and its background thread

Moreover, it introduces configuration options for the OTEL-standardized logs, including: service_name, service_version, max_queue_size, schedule_delay_millis, and max_export_batch_size.

The following abstract methods are exposed and must be implemented by subclasses:

  • get_exporter(...): Returns the specific LogExporter instance for the backend
  • fetch(...): Backend-specific log retrieval (since each backend has different query mechanisms)
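The lazy activation and attribute translation described above can be sketched as follows. This is a simplified stand-in: the real implementation builds an OpenTelemetry pipeline (LoggerProvider, BatchLogRecordProcessor, LoggingHandler) instead of appending to a list, and the class name here is hypothetical:

```python
# Hypothetical sketch; only the attribute names (zenml.log_id, zenml.log_uri,
# zenml.log_store_id) come from the PR description.
class OtelLikeLogStore:
    def __init__(self):
        self.activated = False
        self.emitted = []

    def activate(self):
        # The real code would set up the OTEL provider, batch processor,
        # and logging handler here.
        self.activated = True

    def emit(self, log_id, log_uri, log_store_id, message):
        if not self.activated:  # activate lazily on the first emit
            self.activate()
        # Translate the record into OTEL-style attributes plus a body.
        self.emitted.append({
            "zenml.log_id": log_id,
            "zenml.log_uri": log_uri,
            "zenml.log_store_id": log_store_id,
            "body": message,
        })

otel_store = OtelLikeLogStore()
otel_store.emit("log-123", "s3://bucket/logs/log-123", "store-1", "hello")
```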

Layer 3: Concrete Implementations

ArtifactLogStore

The artifact log store writes logs directly to the artifact store, providing a zero-configuration logging solution that works out of the box:

  • Uses a custom ArtifactLogExporter that writes LogEntry objects to the artifact store (compatible with our previous approach)
  • Handles both mutable filesystems (single file with append) and immutable filesystems like GCS (directory with timestamped files that get merged on finalization)
  • Automatically chunks large messages (>5KB) to prevent storage issues with UTF-8 boundary handling
  • Implements fetch(...) by streaming log files line-by-line from the artifact store
  • Can be created automatically from an existing artifact store via from_artifact_store(...) class method
  • Supports log finalization via an END_OF_STREAM_MESSAGE marker that triggers file merging on immutable filesystems and version removal on others
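The UTF-8-safe chunking mentioned above can be sketched like this. Function name and exact cut-point logic are assumptions; only the >5KB threshold and the UTF-8 boundary handling come from the description:

```python
# Hypothetical sketch: split a large message into chunks of at most max_bytes
# encoded bytes, never cutting inside a multi-byte UTF-8 character.
def chunk_message(message: str, max_bytes: int = 5 * 1024):
    data = message.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Back up while the cut lands on a UTF-8 continuation byte (0b10xxxxxx),
        # so every chunk decodes cleanly on its own.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks
```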

DatadogLogStore

The Datadog log store exports logs to Datadog's HTTP intake API using OTLP:

  • Uses an OTLPLogExporter configured with Datadog's OTLP endpoint
  • Requires an api_key for log ingestion and an application_key for log retrieval
  • Implements fetch(...) using Datadog's Logs Search API (/api/v2/logs/events/search) with query filtering by service and zenml.log_id
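A fetch(...) implementation along these lines might build a search request body like the following. The endpoint path and the query filters (service, zenml.log_id) come from the description; the exact payload shape is an assumption modeled on Datadog's Logs Search API:

```python
# Hypothetical sketch of the request body sent to
# /api/v2/logs/events/search; time range and sort are illustrative defaults.
def build_search_payload(service, log_id, limit=1000):
    return {
        "filter": {
            "query": f"service:{service} @zenml.log_id:{log_id}",
            "from": "now-1d",
            "to": "now",
        },
        "page": {"limit": limit},
        "sort": "timestamp",
    }

payload = build_search_payload("zenml", "log-123")
```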

Interaction of the stack with log stores

Similar to the image builders, if you don't have a log store within your active stack, an ArtifactLogStore flavor will be used instead. Since our default approach requires the opentelemetry-sdk, it has been added to the pyproject.toml as an additional dependency of the base package.
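The fallback behavior can be sketched as follows. Class and function names here are illustrative stand-ins, except from_artifact_store(...), which the description names:

```python
# Hypothetical sketch: when the active stack has no log store configured,
# derive one from the artifact store via the from_artifact_store(...) classmethod.
class ArtifactLogStore:
    def __init__(self, artifact_store):
        self.artifact_store = artifact_store

    @classmethod
    def from_artifact_store(cls, artifact_store):
        return cls(artifact_store)

def resolve_log_store(stack):
    if stack.get("log_store") is not None:
        return stack["log_store"]  # use the explicitly configured log store
    return ArtifactLogStore.from_artifact_store(stack["artifact_store"])

default_store = resolve_log_store({"log_store": None, "artifact_store": "local"})
```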

Various other changes

  • There was a context variable called redirected, which defaulted to False and was never used afterwards. It has now been removed.
  • We changed the way we prepare the logs URI for the artifact log store.

Notes

  • Recheck the flush/deactivate solution.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@bcdurak bcdurak changed the base branch from main to develop October 30, 2025 10:03
@github-actions github-actions bot added internal To filter out internal PRs and issues enhancement New feature or request labels Oct 30, 2025
@bcdurak bcdurak added the release-notes Release notes will be attached and used publicly for this PR. label Dec 3, 2025
@bcdurak bcdurak linked an issue Dec 4, 2025 that may be closed by this pull request
@bcdurak bcdurak merged commit 3989d92 into develop Dec 6, 2025
109 of 119 checks passed
@bcdurak bcdurak deleted the feature/log-store branch December 6, 2025 12:50

Labels

enhancement New feature or request internal To filter out internal PRs and issues release-notes Release notes will be attached and used publicly for this PR. run-slow-ci Tag that is used to trigger the slow-ci


Development

Successfully merging this pull request may close these issues.

Improve the logging experience

3 participants