clean up

Liudmila Molkova · Liudmila Molkova · commit 7515be0c866a · 2024-12-27T09:42:11.000-08:00
diff --git a/oteps/4333-recording-exceptions-on-logs.md b/oteps/4333-recording-exceptions-on-logs.md
@@ -12,11 +12,11 @@ Exceptions recorded on logs have the following advantages over span events:
 - they can have different severity levels to reflect how critical the exception is
 - they are already reported natively by many frameworks and libraries
 
-Recording exception on logs is essential for troubleshooting. But regardless of how they are recorded, they could be noisy:
+Recording exceptions is essential for troubleshooting. Regardless of how exceptions are recorded, they could be noisy:
 - distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading -
   individual occurrence of transient errors are not necessarily indicative of a problem.
 - exception stack traces can be huge. Corresponding attribute value can frequently reach several KBs resulting in high costs
-  associated with ingesting and storing such logs. It's also common to log exceptions multiple times while they bubble up
+  associated with ingesting and storing them. It's also common to log exceptions multiple times while they bubble up
   leading to duplication and aggravating the verbosity problem.
 
 In this OTEP, we'll provide guidance around recording exceptions that minimizes duplication, allows to reduce noise with configuration and
@@ -29,37 +29,29 @@ starting point, but they are encouraged to adjust it to their needs.
 
 This guidance boils down to the following:
 
-- we should record full exception details including stack traces only for unhandled exceptions (by default).
-- we should log error details and context when the error happens. These records should not include
-  exception stack traces unless this exception is unhandled.
-- we should avoid logging the same error multiple times as it propagates up through the stack.
-- we should log errors with appropriate severity ranging from `Trace` to `Fatal`.
+Instrumentations should record exception information (along with other context) on the log record and
+use appropriate severity - only unhandled exceptions should be recorded as `Error` or higher. They
+should strive to report each exception once.
 
-> [!NOTE]
->
-> Based on this guidance non-native instrumentations should record exceptions in top-level instrumentations only (#2 in [Details](#details))
-
-> [!Important]
->
-> OTel should provide APIs like `setException` when creating log record that will record only necessary information depending
-> on the configuration and log severity. See [API changes](#api-changes) for the details.
+Instrumentations should provide the whole exception instance to the OTel API (instead of individual attributes)
+and the OTel SDK should, based on user configuration, decide which information to record. As a default,
+this OTEP proposes to record exception stack traces on logs with `Error` or higher severity.
 
 ### Details
 
-1. Exceptions should be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
+1. Exceptions should be recorded on [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
    or [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md)
 
 2. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record logs
-   for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/attributes-registry/exception.md) flag set to `true`.
+   for unhandled exceptions with `Error` severity.
 
-    <!-- TODO: do we need an `exception.unhandled` attribute instead of `exception.escaped`? -->
    Some runtimes and frameworks provide global exception handler that can be used to record exception logs. Priority should be given to the instrumentation point where the operation context is available.
 
-3. It's recommended to record exception stack traces only for unhandled exceptions in cases outlined in #2 above.
+3. Native instrumentations should record a log describing an error and the context it happened in
+   when this error is detected (or where the most context is available).
 
-4. Native instrumentations should record log describing an error and the context it happened in
-   when this error is detected. Corresponding log record should not contain exception stack
-   traces (if an exception was thrown/caught) unless such exceptions usually remain unhandled.
+4. It's not recommended to record the same error multiple times as it propagates through the call stack or
+   to attach the same exception instance to multiple log records.
 
 5. An error should be logged with appropriate severity depending on the available context.
 
@@ -68,15 +60,25 @@ This guidance boils down to the following:
    - Unhandled exceptions that don't result in application shutdown should be recorded with severity `Error`
    - Errors that result in application shutdown should be recorded with severity `Fatal`
 
-6. Instrumentations should not log errors or exceptions that are handled or
-   are propagated as is, except ones handled in global exception handlers (see #2 below)
+6. When recording exceptions on logs, user applications and instrumentations are encouraged to put additional attributes
+   to describe the context that the exception was thrown in.
+   They are also encouraged to define their own error events and enrich them with exception details.
 
-   If a new exception is created based on the original one or a new details about the error become available,
-   instrumentation may record another error (without stack trace)
+7. The OTel SDK should record stack traces for exceptions logged with severity `Error` or higher and should allow users to
+   change the threshold.
 
-7. When recording exception on logs, user applications and instrumentations are encouraged to put additional attributes
-   to describe the context that the exception was thrown in.
-   They are also encouraged to define their own error events and enrich them with `exception.*` attributes.
+   See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that
+   records stack trace conditionally.
+
+
+> [!NOTE]
+>
+> Based on this guidance non-native instrumentations should record exceptions in top-level instrumentations only (#2 in [Details](#details))
+
+> [!Important]
+>
+> OTel should provide an API like `setException` when creating a log record that records only the necessary information depending
+> on the configuration and log severity. See [API changes](#api-changes) for the details.
 
 ## API changes
 
@@ -85,8 +87,8 @@ Library may write logs providing exception instance through a log bridge and not
 
 It also maybe desirable by some vendors/apps to record all the exception details.
 
-OTel Logs API should provide additional methods that enrich log record with exception details such as
-`setException(exception)` (`setUnhandledException`, etc), similar to [RecordException](../specification/trace/api.md?plain=1#L682)
+The OTel Logs API should provide methods that enrich log records with exception details, such as
+`setException(exception)`, similar to the [RecordException](../specification/trace/api.md?plain=1#L682)
 method on span.
 
 OTel SDK should implement such methods and set exception attributes based on configuration
@@ -108,27 +110,22 @@ try {
     // we don't record exception here, but may record a log record without exception info
     logger.logRecordBuilder()
         .addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
-        .severityNumber(Severity.INFO)
         // let's assume it's expected that some content can disappear
-        .addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
-        .addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
-         // ideally we should provide the following method for convenience, optimization,
-         // and to support different behavior behind config options
-         //.addException(ex)
+        .severityNumber(Severity.INFO)
+        // by default the SDK will only populate `exception.type` and `exception.message`
+        // since severity is `INFO`, but that should not be the instrumentation library's
+        // concern
+        .setException(ex)
         .emit();
 
     return response(HttpStatus.NOT_FOUND);
-} catch (NotAuthorizedException ex) {
-    // let's assume it's really unexpected - service lost access to the underlying storage
-    // since we're returning error response without an exception, we
+} catch (ForbiddenException ex) {
     logger.logRecordBuilder()
+        // let's assume it's really unexpected for this application - service does not have access to the underlying storage.
         .severityNumber(Severity.ERROR)
         .addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
-         // ideally we should provide the following method for convenience, optimization,
-         // and to support different behavior behind config options
-         //.addException(ex)
-        .addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
-        .addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
+        // by default SDK will record stack trace for this exception since the severity is ERROR
+        .setException(ex)
         .emit();
 
     return response(HttpStatus.INTERNAL_SERVER_ERROR);
@@ -152,16 +149,13 @@ public class StorageClient {
          }
 
          logger.logRecordBuilder()
-            // we may set different levels depending on the status code, but in general
-            // we expect caller to handle the error, so this is at most warning
+            // In general we don't know for certain whether it's an error - we expect the caller
+            // to handle the exception and decide. So this is a warning (at most).
+            // If it remains unhandled, it'd be logged by the global handler.
             .setSeverity(Severity.WARN)
             .addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
             .addAttribute(AttributeKey.stringKey("http.response.status_code"), response.statusCode())
-            // ideally we should provide the following method for convenience, optimization,
-            // and to support different behavior behind config options
-            //.addException(ex)
-            .addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
-            .addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
+            .setException(ex)
             .emit();
 
         if (response.statusCode() == 404) {
@@ -183,13 +177,12 @@ public class Connection {
     public long send(ByteBuffer content) {
         try {
             return socketChannel.write(content);
-        } catch (Throwable ex) {
+        } catch (SocketException ex) {
             logger.logRecordBuilder()
-              // we'll retry it, so it's Info or lower
+              // we retry it, so it's Info or lower
               .setSeverity(Severity.INFO)
               .addAttribute("connection.id", this.getId())
-              .addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
-              .addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
+              .setException(ex)
               .emit();
 
             throw ex;
@@ -198,9 +191,38 @@ public class Connection {
 }
 ```
 
-#### HTTP server instrumentation/global exception handler
+#### Messaging processor instrumentation
+
+In this example, application code provides a callback to the messaging processor to
+execute for each message.
+
+```java
+MessagingProcessorClient processorClient = new MessagingClientBuilder()
+  .endpoint(endpoint)
+  .queueName(queueName)
+  .processor()
+  .processMessage(messageContext -> processMessage(messageContext))
+  .buildProcessorClient();
+
+processorClient.start();
+```
+
+The `MessagingProcessorClient` implementation should catch exceptions thrown by the `processMessage` callback and log them similarly to the following:
 
-TODO
+```java
+MessageContext context = retrieveNext();
+try {
+  processMessage.accept(context);
+} catch (Throwable t) {
+  // this native instrumentation may use OTel log API or another logging library
+  // such as SLF4J
+  logger.atError()
+    .addKeyValuePair("messaging.message.id", context.getMessageId())
+    ...
+    .setException(t)
+    .log();
+}
+```
 
 ## Trade-offs and mitigations
 
@@ -210,9 +232,9 @@ TODO
    - OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion,
      but that's not enough - instrumentations will have to change their behavior to report exception logs
      with appropriate severity (or stop reporting them).
-   - TODO: document opt-in mechanism similar to `OTEL_SEMCONV_STABILITY_OPT_IN`
+   - We should provide an opt-in mechanism for existing instrumentations to switch to logs.
 
-1. Recording exceptions as log-based events would result in UX degradation for users
+2. Recording exceptions as log-based events would result in UX degradation for users
    leveraging trace-only backends such as Jaeger.
 
    **Mitigation:**
@@ -223,15 +245,13 @@ TODO
 
 Alternatives:
 
-1. Deduplicate exception info on logs. We can mark exception instances as logged
-   (augment exception instance or keep a small cache of recently logged exceptions).
+1. Deduplicate exception info by marking exception instances as logged.
    This can potentially mitigate the problem for existing application when it logs exceptions extensively.
    We should still provide optimal guidance for the greenfield applications and libraries.
 
-2. Log full exception info only when exception is thrown for the first time
-   (including new exceptions wrapping original ones). This results in at-most-once
-   logging, but even this is known to be problematic since absolute majority of exceptions
-   are handled.
+2. Log full exception info only when an exception is thrown for the first time.
+   This results in at-most-once logging, but even this is known to be problematic since the absolute
+   majority of exceptions are handled.
    It also relies on the assumption that most libraries will follow this guidance.
 
 ## Open questions
@@ -240,16 +260,6 @@ TBD
 
 ## Future possibilities
 
-1. OpenTelemetry should provide configuration options and APIs allowing (but not limited) to:
-
-   - Record unhandled exceptions only (the default documented in this guidance)
-   - Record exception info based on the log severity
-   - Record exception logs, but omit the stack trace based on (at least) the log level.
-     See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that records stack trace conditionally.
-   - Record all available exceptions with all the details
-
-   It should be possible to optimize instrumentation and avoid collecting exception information
-   (such as stack trace) when the corresponding exception log is not going to be recorded.
-
-2. Exception stack traces can be recorded in structured form instead of their
-   string representation. It may be easier to process and consume them in this form.
+Exception stack traces can be recorded in structured form instead of their
+string representation. It may be easier to process and consume them in this form.
+This is out of scope for this OTEP.
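
The default proposed in this change (always record `exception.type` and `exception.message`, and add the stack trace only at `Error` severity or higher, with a user-configurable threshold) could be sketched roughly as follows. This is only an illustration of the decision logic: the `ExceptionAttributeSketch` class and its `exceptionAttributes` method are hypothetical and are not part of any OpenTelemetry API or SDK. The severity constants follow the severity numbers of the OpenTelemetry log data model.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not an OpenTelemetry API.
final class ExceptionAttributeSketch {
    // Severity numbers as defined by the OpenTelemetry log data model.
    static final int INFO = 9;
    static final int WARN = 13;
    static final int ERROR = 17;

    /**
     * Builds the exception attributes for a log record.
     * Type and message are always recorded; the stack trace is recorded
     * only when severityNumber is at or above stackTraceThreshold.
     */
    static Map<String, String> exceptionAttributes(
            Throwable ex, int severityNumber, int stackTraceThreshold) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("exception.type", ex.getClass().getName());
        if (ex.getMessage() != null) {
            attributes.put("exception.message", ex.getMessage());
        }
        if (severityNumber >= stackTraceThreshold) {
            // Rendering the stack trace is the expensive part, so it is
            // skipped entirely below the threshold.
            StringWriter trace = new StringWriter();
            PrintWriter writer = new PrintWriter(trace);
            ex.printStackTrace(writer);
            writer.flush();
            attributes.put("exception.stacktrace", trace.toString());
        }
        return attributes;
    }

    public static void main(String[] args) {
        Throwable ex = new IllegalStateException("storage unavailable");
        // Below the threshold: type and message only.
        System.out.println(exceptionAttributes(ex, INFO, ERROR).keySet());
        // At the threshold: the stack trace is added as well.
        System.out.println(exceptionAttributes(ex, ERROR, ERROR).keySet());
    }
}
```

Because the instrumentation passes the whole exception instance, a decision like this can stay inside the SDK, and lowering the threshold (for example to `WARN`) would not require any instrumentation changes.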