Skip to content

Commit 7515be0

Browse files
author
Liudmila Molkova
committed
clean up
1 parent 308a3bc commit 7515be0

File tree

1 file changed

+89
-79
lines changed

1 file changed

+89
-79
lines changed

oteps/4333-recording-exceptions-on-logs.md

Lines changed: 89 additions & 79 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,11 @@ Exceptions recorded on logs have the following advantages over span events:
1212
- they can have different severity levels to reflect how critical the exception is
1313
- they are already reported natively by many frameworks and libraries
1414

15-
Recording exception on logs is essential for troubleshooting. But regardless of how they are recorded, they could be noisy:
15+
Recording exceptions is essential for troubleshooting. Regardless of how exceptions are recorded, they could be noisy:
1616
- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading -
1717
individual occurrence of transient errors are not necessarily indicative of a problem.
1818
- exception stack traces can be huge. Corresponding attribute value can frequently reach several KBs resulting in high costs
19-
associated with ingesting and storing such logs. It's also common to log exceptions multiple times while they bubble up
19+
associated with ingesting and storing them. It's also common to log exceptions multiple times while they bubble up
2020
leading to duplication and aggravating the verbosity problem.
2121

2222
In this OTEP, we'll provide guidance around recording exceptions that minimizes duplication, allows to reduce noise with configuration and
@@ -29,37 +29,29 @@ starting point, but they are encouraged to adjust it to their needs.
2929

3030
This guidance boils down to the following:
3131

32-
- we should record full exception details including stack traces only for unhandled exceptions (by default).
33-
- we should log error details and context when the error happens. These records should not include
34-
exception stack traces unless this exception is unhandled.
35-
- we should avoid logging the same error multiple times as it propagates up through the stack.
36-
- we should log errors with appropriate severity ranging from `Trace` to `Fatal`.
32+
Instrumentations should record exception information (along with other context) on the log record and
33+
use appropriate severity - only unhandled exceptions should be recorded as `Error` or higher. They
34+
should strive to report each exception once.
3735

38-
> [!NOTE]
39-
>
40-
> Based on this guidance non-native instrumentations should record exceptions in top-level instrumentations only (#2 in [Details](#details))
41-
42-
> [!Important]
43-
>
44-
> OTel should provide APIs like `setException` when creating log record that will record only necessary information depending
45-
> on the configuration and log severity. See [API changes](#api-changes) for the details.
36+
Instrumentation should provide the whole exception instance to the OTel (instead of individual attributes)
37+
and the OTel SDK should, based on user configuration, decide which information to record. As a default,
38+
this OTEP proposes to record exception stack traces on log with `Error` or higher severity.
4639

4740
### Details
4841

49-
1. Exceptions should be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
42+
1. Exceptions should be recorded on [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
5043
or [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md)
5144

5245
2. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record logs
53-
for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/attributes-registry/exception.md) flag set to `true`.
46+
for unhandled exceptions with `Error` severity.
5447

55-
<!-- TODO: do we need an `exception.unhandled` attribute instead of `exception.escaped`? -->
5648
Some runtimes and frameworks provide global exception handler that can be used to record exception logs. Priority should be given to the instrumentation point where the operation context is available.
5749

58-
3. It's recommended to record exception stack traces only for unhandled exceptions in cases outlined in #2 above.
50+
3. Native instrumentations should record log describing an error and the context it happened in
51+
when this error is detected (or where the most context is available).
5952

60-
4. Native instrumentations should record log describing an error and the context it happened in
61-
when this error is detected. Corresponding log record should not contain exception stack
62-
traces (if an exception was thrown/caught) unless such exceptions usually remain unhandled.
53+
4. It's not recommended to record the same error as it propagates through the stack trace or
54+
attach the same instance of exception to multiple log records.
6355

6456
5. An error should be logged with appropriate severity depending on the available context.
6557

@@ -68,15 +60,25 @@ This guidance boils down to the following:
6860
- Unhandled exceptions that don't result in application shutdown should be recorded with severity `Error`
6961
- Errors that result in application shutdown should be recorded with severity `Fatal`
7062

71-
6. Instrumentations should not log errors or exceptions that are handled or
72-
are propagated as is, except ones handled in global exception handlers (see #2 below)
63+
6. When recording exception on logs, user applications and instrumentations are encouraged to put additional attributes
64+
to describe the context that the exception was thrown in.
65+
They are also encouraged to define their own error events and enrich them with exception details.
7366

74-
If a new exception is created based on the original one or a new details about the error become available,
75-
instrumentation may record another error (without stack trace)
67+
7. OTel SDK should record stack traces on exceptions with severity `Error` or higher and should allow users to
68+
change the threshold.
7669

77-
7. When recording exception on logs, user applications and instrumentations are encouraged to put additional attributes
78-
to describe the context that the exception was thrown in.
79-
They are also encouraged to define their own error events and enrich them with `exception.*` attributes.
70+
See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that
71+
records stack trace conditionally.
72+
73+
74+
> [!NOTE]
75+
>
76+
> Based on this guidance non-native instrumentations should record exceptions in top-level instrumentations only (#2 in [Details](#details))
77+
78+
> [!Important]
79+
>
80+
> OTel should provide API like `setException` when creating log record that will record only necessary information depending
81+
> on the configuration and log severity. See [API changes](#api-changes) for the details.
8082
8183
## API changes
8284

@@ -85,8 +87,8 @@ Library may write logs providing exception instance through a log bridge and not
8587

8688
It also maybe desirable by some vendors/apps to record all the exception details.
8789

88-
OTel Logs API should provide additional methods that enrich log record with exception details such as
89-
`setException(exception)` (`setUnhandledException`, etc), similar to [RecordException](../specification/trace/api.md?plain=1#L682)
90+
OTel Logs API should provide methods that enrich log record with exception details such as
91+
`setException(exception)` similar to [RecordException](../specification/trace/api.md?plain=1#L682)
9092
method on span.
9193

9294
OTel SDK should implement such methods and set exception attributes based on configuration
@@ -108,27 +110,22 @@ try {
108110
// we don't record exception here, but may record a log record without exception info
109111
logger.logRecordBuilder()
110112
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
111-
.severityNumber(Severity.INFO)
112113
// let's assume it's expected that some content can disappear
113-
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
114-
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
115-
// ideally we should provide the following method for convenience, optimization,
116-
// and to support different behavior behind config options
117-
//.addException(ex)
114+
.severityNumber(Severity.INFO)
115+
// by default SDK will only populate `exception.type` and `exception.message`
116+
// since severity is `INFO`, but it should not be instrumentation library
117+
// concern
118+
.setException(ex)
118119
.emit();
119120

120121
return response(HttpStatus.NOT_FOUND);
121-
} catch (NotAuthorizedException ex) {
122-
// let's assume it's really unexpected - service lost access to the underlying storage
123-
// since we're returning error response without an exception, we
122+
} catch (ForbiddenException ex) {
124123
logger.logRecordBuilder()
124+
// let's assume it's really unexpected for this application - service does not have access to the underlying storage.
125125
.severityNumber(Severity.ERROR)
126126
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
127-
// ideally we should provide the following method for convenience, optimization,
128-
// and to support different behavior behind config options
129-
//.addException(ex)
130-
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
131-
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
127+
// by default SDK will record stack trace for this exception since the severity is ERROR
128+
.setException(ex)
132129
.emit();
133130

134131
return response(HttpStatus.INTERNAL_SERVER_ERROR);
@@ -152,16 +149,13 @@ public class StorageClient {
152149
}
153150

154151
logger.logRecordBuilder()
155-
// we may set different levels depending on the status code, but in general
156-
// we expect caller to handle the error, so this is at most warning
152+
// In general we don't know if it's certainly an error - we expect caller
153+
// to handle the exception and decide. So this is warning (at most).
154+
// If it remains unhandled, it'd be logged by the global handler.
157155
.setSeverity(Severity.WARN)
158156
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
159157
.addAttribute(AttributeKey.stringKey("http.response.status_code"), response.statusCode())
160-
// ideally we should provide the following method for convenience, optimization,
161-
// and to support different behavior behind config options
162-
//.addException(ex)
163-
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
164-
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
158+
.addException(ex)
165159
.emit();
166160

167161
if (response.statusCode() == 404) {
@@ -183,13 +177,12 @@ public class Connection {
183177
public long send(ByteBuffer content) {
184178
try {
185179
return socketChannel.write(content);
186-
} catch (Throwable ex) {
180+
} catch (SocketException ex) {
187181
logger.logRecordBuilder()
188-
// we'll retry it, so it's Info or lower
182+
// we retry it, so it's Info or lower
189183
.setSeverity(Severity.INFO)
190184
.addAttribute("connection.id", this.getId())
191-
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
192-
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
185+
.addException(ex)
193186
.emit();
194187

195188
throw ex;
@@ -198,9 +191,38 @@ public class Connection {
198191
}
199192
```
200193

201-
#### HTTP server instrumentation/global exception handler
194+
#### Messaging processor instrumentation
195+
196+
In this example, application code provides and callback to the messaging processor to
197+
execute for each message.
198+
199+
```java
200+
MessagingProcessorClient processorClient = new MessagingClientBuilder()
201+
.endpoint(endpoint)
202+
.queueName(queueName)
203+
.processor()
204+
.processMessage(messageContext -> processMessage(messageContext))
205+
.buildProcessorClient();
206+
207+
processorClient.start();
208+
```
209+
210+
The `MessagingProcessorClient` implementation should catch exceptions thrown by the `processMessage` callback and log them similarly to
202211

203-
TODO
212+
```java
213+
MessageContext context = retrieveNext();
214+
try {
215+
processMessage.accept(context)
216+
} catch (Throwable t) {
217+
// this native instrumentation may use OTel log API or another logging library
218+
// such as SLF4J
219+
logger.atError()
220+
.addKeyValuePair("messaging.message.id", context.getMessageId())
221+
...
222+
.setException(t)
223+
.log()
224+
}
225+
```
204226

205227
## Trade-offs and mitigations
206228

@@ -210,9 +232,9 @@ TODO
210232
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion,
211233
but that's not enough - instrumentations will have to change their behavior to report exception logs
212234
with appropriate severity (or stop reporting them).
213-
- TODO: document opt-in mechanism similar to `OTEL_SEMCONV_STABILITY_OPT_IN`
235+
- We should provide opt-in mechanism for existing instrumentations to switch to logs.
214236

215-
1. Recording exceptions as log-based events would result in UX degradation for users
237+
2. Recording exceptions as log-based events would result in UX degradation for users
216238
leveraging trace-only backends such as Jaeger.
217239

218240
**Mitigation:**
@@ -223,15 +245,13 @@ TODO
223245

224246
Alternatives:
225247

226-
1. Deduplicate exception info on logs. We can mark exception instances as logged
227-
(augment exception instance or keep a small cache of recently logged exceptions).
248+
1. Deduplicate exception info by marking exception instances as logged.
228249
This can potentially mitigate the problem for existing application when it logs exceptions extensively.
229250
We should still provide optimal guidance for the greenfield applications and libraries.
230251

231-
2. Log full exception info only when exception is thrown for the first time
232-
(including new exceptions wrapping original ones). This results in at-most-once
233-
logging, but even this is known to be problematic since absolute majority of exceptions
234-
are handled.
252+
2. Log full exception info only when exception is thrown for the first time.
253+
This results in at-most-once logging, but even this is known to be problematic since absolute
254+
majority of exceptions are handled.
235255
It also relies on the assumption that most libraries will follow this guidance.
236256

237257
## Open questions
@@ -240,16 +260,6 @@ TBD
240260

241261
## Future possibilities
242262

243-
1. OpenTelemetry should provide configuration options and APIs allowing (but not limited) to:
244-
245-
- Record unhandled exceptions only (the default documented in this guidance)
246-
- Record exception info based on the log severity
247-
- Record exception logs, but omit the stack trace based on (at least) the log level.
248-
See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that records stack trace conditionally.
249-
- Record all available exceptions with all the details
250-
251-
It should be possible to optimize instrumentation and avoid collecting exception information
252-
(such as stack trace) when the corresponding exception log is not going to be recorded.
253-
254-
2. Exception stack traces can be recorded in structured form instead of their
255-
string representation. It may be easier to process and consume them in this form.
263+
Exception stack traces can be recorded in structured form instead of their
264+
string representation. It may be easier to process and consume them in this form.
265+
This is out of scope of this OTEP.

0 commit comments

Comments
 (0)