From 17ae31a19ed9784a5cfd14a4f69936f4a83fdac2 Mon Sep 17 00:00:00 2001 From: Sreekanth Vadigi Date: Mon, 27 Oct 2025 12:56:51 +0530 Subject: [PATCH 1/6] telemetry lld Signed-off-by: Sreekanth Vadigi --- .../Drivers/Databricks/Telemetry/prompts.txt | 11 + .../telemetry-integration-lld-design.md | 3565 +++++++++++++++++ .../Telemetry/telemetry-lld-summary.md | 280 ++ 3 files changed, 3856 insertions(+) create mode 100644 csharp/src/Drivers/Databricks/Telemetry/prompts.txt create mode 100644 csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md create mode 100644 csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md diff --git a/csharp/src/Drivers/Databricks/Telemetry/prompts.txt b/csharp/src/Drivers/Databricks/Telemetry/prompts.txt new file mode 100644 index 0000000000..db1e2fcd5d --- /dev/null +++ b/csharp/src/Drivers/Databricks/Telemetry/prompts.txt @@ -0,0 +1,11 @@ +1. "can you understand the content present in this google doc: {telemetry-design-doc-url}" + +2. "can you use google mcp" + +4. "can you check the databricks jdbc repo and understand how it is currently implemented" + +5. "now, lets go through the arrow adbc driver for databricks present at {project-location}/arrow-adbc/csharp/src/Drivers/Databricks, and understand its flow" + +6. "i want to create a low level design doc for adding telemetry to the databricks adbc driver. based on the context you have can you create one for me. make it a detailed one. example design doc: {github-url}/statement-execution-api-design.md ultrathink" + +7. "does all of the changes in the lld document come under the folder {project-location}/arrow-adbc/csharp/src/Drivers/Databricks or outside as well ? ultrathink" \ No newline at end of file diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md new file mode 100644 index 0000000000..48be782635 --- /dev/null +++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md @@ -0,0 +1,3565 @@ +# Databricks ADBC Driver: Client Telemetry Integration + +## Executive Summary + +This document outlines the design for integrating client-side telemetry into the Databricks ADBC driver for C#. The telemetry system will collect operational metrics, performance data, and error information from the driver to enable proactive monitoring, usage analytics, and faster issue resolution. + +**Key Objectives:** +- Enable comprehensive observability of driver operations without impacting performance +- Collect usage insights (CloudFetch vs inline, driver configurations, error patterns) +- Track adoption of new features and configurations +- Provide proactive error monitoring to identify issues before customer reports +- Maintain compatibility with existing OpenTelemetry/Activity-based tracing + +**Design Principles:** +- **Non-blocking**: Telemetry operations must never block driver functionality +- **Privacy-first**: No PII or query data collected; schema curated for data residency compliance +- **Opt-out capable**: Users can disable telemetry via configuration +- **Server-controlled**: Feature flag support for server-side enable/disable +- **Backward compatible**: No breaking changes to existing driver API +- **OpenTelemetry aligned**: Leverage existing Activity infrastructure where possible + +--- + +## Table of Contents + +1. [Background & Motivation](#1-background--motivation) +2. [Requirements](#2-requirements) +3. 
[Architecture Overview](#3-architecture-overview) +4. [Telemetry Components](#4-telemetry-components) +5. [Data Schema](#5-data-schema) +6. [Collection Points](#6-collection-points) +7. [Export Mechanism](#7-export-mechanism) +8. [Configuration](#8-configuration) +9. [Privacy & Data Residency](#9-privacy--data-residency) +10. [Error Handling](#10-error-handling) +11. [Testing Strategy](#11-testing-strategy) +12. [Migration & Rollout](#12-migration--rollout) +13. [Alternatives Considered](#13-alternatives-considered) +14. [Open Questions](#14-open-questions) +15. [References](#15-references) + +--- + +## 1. Background & Motivation + +### 1.1 Current State + +The Databricks ADBC driver currently implements: +- **Activity-based tracing** via `ActivityTrace` and `ActivitySource` +- **W3C Trace Context propagation** for distributed tracing +- **Local file exporter** for debugging traces + +However, this approach has limitations: +- **No centralized aggregation**: Traces are local-only unless connected to external APM +- **Limited usage insights**: No visibility into driver configuration patterns +- **Reactive debugging**: Relies on customer-reported issues with trace files +- **No feature adoption metrics**: Cannot track usage of CloudFetch, Direct Results, etc. + +### 1.2 JDBC Driver Precedent + +The Databricks JDBC driver successfully implemented client telemetry with: +- **Comprehensive metrics**: Operation latency, chunk downloads, error rates +- **Configuration tracking**: Driver settings, auth types, proxy usage +- **Server-side control**: Feature flag to enable/disable telemetry +- **Centralized storage**: Data flows to `main.eng_lumberjack.prod_frontend_log_sql_driver_log` +- **Privacy compliance**: No PII, curated schema, Lumberjack data residency + +### 1.3 Key Gaps to Address + +1. **Proactive Monitoring**: Identify errors before customer escalation +2. **Usage Analytics**: Understand driver configuration patterns across customer base +3. **Feature Adoption**: Track uptake of CloudFetch, Direct Results, OAuth flows +4. **Performance Insights**: Client-side latency vs server-side metrics +5. **Error Patterns**: Common configuration mistakes, auth failures, network issues + +--- + +## 2. Requirements + +### 2.1 Functional Requirements + +| ID | Requirement | Priority | +|:---|:---|:---:| +| FR-1 | Collect driver configuration metadata (auth type, CloudFetch settings, etc.) 
| P0 | +| FR-2 | Track operation latency (connection open, statement execution, result fetching) | P0 | +| FR-3 | Record error events with error codes and context | P0 | +| FR-4 | Capture CloudFetch metrics (chunk downloads, retries, compression status) | P0 | +| FR-5 | Track result format usage (inline vs CloudFetch) | P1 | +| FR-6 | Support server-side feature flag to enable/disable telemetry | P0 | +| FR-7 | Provide client-side opt-out mechanism | P1 | +| FR-8 | Batch telemetry events to reduce network overhead | P0 | +| FR-9 | Export telemetry to Databricks telemetry service | P0 | +| FR-10 | Support both authenticated and unauthenticated telemetry endpoints | P0 | + +### 2.2 Non-Functional Requirements + +| ID | Requirement | Target | Priority | +|:---|:---|:---:|:---:| +| NFR-1 | Telemetry overhead < 1% of operation latency | < 1% | P0 | +| NFR-2 | Memory overhead < 10MB per connection | < 10MB | P0 | +| NFR-3 | Zero impact on driver operation if telemetry fails | 0 failures | P0 | +| NFR-4 | Telemetry export success rate | > 95% | P1 | +| NFR-5 | Batch flush latency | < 5s | P1 | +| NFR-6 | Support workspace-level disable | 100% | P0 | +| NFR-7 | No PII or query data collected | 0 PII | P0 | +| NFR-8 | Compatible with existing Activity tracing | 100% | P0 | + +### 2.3 Out of Scope + +- Distributed tracing (already covered by Activity/OpenTelemetry) +- Query result data collection +- Real-time alerting (server-side responsibility) +- Custom telemetry endpoints (only Databricks service) + +--- + +## 3. Architecture Overview + +### 3.1 High-Level Design + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ ADBC Driver Operations │ +│ (Connection, Statement Execution, Result Fetching) │ +└─────────────────────────────────────────────────────────────────┘ + │ + │ Emit Events + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ TelemetryCollector │ +│ - Per-connection singleton │ +│ - Aggregates events by statement ID │ +│ - Non-blocking event ingestion │ +└─────────────────────────────────────────────────────────────────┘ + │ + │ Batch Events + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ TelemetryExporter │ +│ - Background export worker │ +│ - Periodic flush (configurable interval) │ +│ - Size-based flush (batch threshold) │ +│ - Connection close flush │ +└─────────────────────────────────────────────────────────────────┘ + │ + │ HTTP POST + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Databricks Telemetry Service │ +│ Endpoints: │ +│ - /telemetry-ext (authenticated) │ +│ - /telemetry-unauth (unauthenticated - connection errors) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Lumberjack Pipeline │ +│ Table: main.eng_lumberjack.prod_frontend_log_sql_driver_log │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### 3.2 Component Interaction Flow + +```mermaid +sequenceDiagram + participant App as Application + participant Conn as DatabricksConnection + participant Stmt as DatabricksStatement + participant TC as TelemetryCollector + participant TE as TelemetryExporter + participant TS as Telemetry Service + + App->>Conn: OpenAsync() + Conn->>TC: Initialize(config) + TC->>TE: Start background worker + Conn->>TC: RecordConnectionOpen(latency, config) + + App->>Stmt: ExecuteQueryAsync() + Stmt->>TC: RecordStatementExecution(statementId, latency) + + loop 
CloudFetch Downloads + Stmt->>TC: RecordChunkDownload(chunkIndex, latency, size) + end + + Stmt->>TC: RecordStatementComplete(statementId) + + alt Batch size reached + TC->>TE: Flush batch + TE->>TS: POST /telemetry-ext + end + + App->>Conn: CloseAsync() + Conn->>TC: Flush all pending + TC->>TE: Force flush + TE->>TS: POST /telemetry-ext + TE->>TE: Stop worker +``` + +### 3.3 Integration with Existing Components + +The telemetry system will integrate with existing driver components: + +1. **DatabricksConnection**: + - Initialize telemetry collector on open + - Record connection configuration + - Flush telemetry on close + - Handle feature flag from server + +2. **DatabricksStatement**: + - Record statement execution metrics + - Track result format (inline vs CloudFetch) + - Capture operation latency + +3. **CloudFetchDownloader**: + - Record chunk download latency + - Track retry attempts + - Report compression status + +4. **Activity Infrastructure**: + - Leverage existing Activity context for correlation + - Add telemetry as Activity events for unified observability + - Maintain W3C trace context propagation + +--- + +## 4. Telemetry Components + +### 4.1 TelemetryCollector + +**Purpose**: Aggregate and buffer telemetry events per connection. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryCollector` + +**Responsibilities**: +- Accept telemetry events from driver operations +- Aggregate events by statement ID +- Buffer events for batching +- Provide non-blocking event ingestion +- Trigger flush on batch size or time threshold + +**Interface**: +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// + /// Collects and aggregates telemetry events for a connection. + /// Thread-safe and non-blocking. + /// + internal sealed class TelemetryCollector : IDisposable + { + // Constructor + public TelemetryCollector( + DatabricksConnection connection, + ITelemetryExporter exporter, + TelemetryConfiguration config); + + // Event recording methods + public void RecordConnectionOpen( + TimeSpan latency, + DriverConfiguration driverConfig); + + public void RecordStatementExecute( + string statementId, + TimeSpan latency, + ExecutionResultFormat resultFormat); + + public void RecordChunkDownload( + string statementId, + int chunkIndex, + TimeSpan latency, + long bytesDownloaded, + bool compressed); + + public void RecordOperationStatus( + string statementId, + int pollCount, + TimeSpan totalLatency); + + public void RecordStatementComplete(string statementId); + + public void RecordError( + string errorCode, + string errorMessage, + string? statementId = null, + int? chunkIndex = null); + + // Flush methods + public Task FlushAsync(CancellationToken cancellationToken = default); + + public Task FlushAllPendingAsync(); + + // IDisposable + public void Dispose(); + } +} +``` + +**Implementation Details**: + +```csharp +internal sealed class TelemetryCollector : IDisposable +{ + private readonly DatabricksConnection _connection; + private readonly ITelemetryExporter _exporter; + private readonly TelemetryConfiguration _config; + private readonly ConcurrentDictionary _statementData; + private readonly ConcurrentQueue _eventQueue; + private readonly Timer _flushTimer; + private readonly SemaphoreSlim _flushLock; + private long _lastFlushTime; + private int _eventCount; + private bool _disposed; + + public TelemetryCollector( + DatabricksConnection connection, + ITelemetryExporter exporter, + TelemetryConfiguration config) + { + _connection = connection ?? 
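            // Fail fast on null dependencies: the collector is created once
            // per connection, so broken wiring here is a driver bug rather
            // than a recoverable runtime condition.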
throw new ArgumentNullException(nameof(connection)); + _exporter = exporter ?? throw new ArgumentNullException(nameof(exporter)); + _config = config ?? throw new ArgumentNullException(nameof(config)); + + _statementData = new ConcurrentDictionary(); + _eventQueue = new ConcurrentQueue(); + _flushLock = new SemaphoreSlim(1, 1); + _lastFlushTime = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); + + // Start periodic flush timer + if (_config.FlushIntervalMilliseconds > 0) + { + _flushTimer = new Timer( + OnTimerFlush, + null, + _config.FlushIntervalMilliseconds, + _config.FlushIntervalMilliseconds); + } + } + + public void RecordConnectionOpen(TimeSpan latency, DriverConfiguration driverConfig) + { + if (!_config.Enabled) return; + + var telemetryEvent = new TelemetryEvent + { + EventType = TelemetryEventType.ConnectionOpen, + Timestamp = DateTimeOffset.UtcNow, + OperationLatencyMs = (long)latency.TotalMilliseconds, + DriverConfig = driverConfig, + SessionId = _connection.SessionId, + WorkspaceId = _connection.WorkspaceId + }; + + EnqueueEvent(telemetryEvent); + } + + public void RecordStatementExecute( + string statementId, + TimeSpan latency, + ExecutionResultFormat resultFormat) + { + if (!_config.Enabled || string.IsNullOrEmpty(statementId)) return; + + var stmtData = _statementData.GetOrAdd( + statementId, + _ => new StatementTelemetryData { StatementId = statementId }); + + stmtData.ExecutionLatencyMs = (long)latency.TotalMilliseconds; + stmtData.ResultFormat = resultFormat; + stmtData.Timestamp = DateTimeOffset.UtcNow; + } + + public void RecordChunkDownload( + string statementId, + int chunkIndex, + TimeSpan latency, + long bytesDownloaded, + bool compressed) + { + if (!_config.Enabled || string.IsNullOrEmpty(statementId)) return; + + var stmtData = _statementData.GetOrAdd( + statementId, + _ => new StatementTelemetryData { StatementId = statementId }); + + stmtData.ChunkDownloads.Add(new ChunkDownloadData + { + ChunkIndex = chunkIndex, + LatencyMs = (long)latency.TotalMilliseconds, + BytesDownloaded = bytesDownloaded, + Compressed = compressed + }); + + stmtData.TotalChunks = Math.Max(stmtData.TotalChunks, chunkIndex + 1); + } + + public void RecordStatementComplete(string statementId) + { + if (!_config.Enabled || string.IsNullOrEmpty(statementId)) return; + + if (_statementData.TryRemove(statementId, out var stmtData)) + { + // Convert statement data to telemetry event + var telemetryEvent = CreateStatementEvent(stmtData); + EnqueueEvent(telemetryEvent); + } + } + + public void RecordError( + string errorCode, + string errorMessage, + string? statementId = null, + int? 
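        // Populated only for CloudFetch download failures; stays null for
        // connection- and statement-level errors (see Section 6.4.1).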
chunkIndex = null) + { + if (!_config.Enabled) return; + + var telemetryEvent = new TelemetryEvent + { + EventType = TelemetryEventType.Error, + Timestamp = DateTimeOffset.UtcNow, + ErrorCode = errorCode, + ErrorMessage = errorMessage, + StatementId = statementId, + ChunkIndex = chunkIndex, + SessionId = _connection.SessionId, + WorkspaceId = _connection.WorkspaceId + }; + + EnqueueEvent(telemetryEvent); + } + + private void EnqueueEvent(TelemetryEvent telemetryEvent) + { + _eventQueue.Enqueue(telemetryEvent); + var count = Interlocked.Increment(ref _eventCount); + + // Trigger flush if batch size reached + if (count >= _config.BatchSize) + { + _ = Task.Run(() => FlushAsync(CancellationToken.None)); + } + } + + public async Task FlushAsync(CancellationToken cancellationToken = default) + { + if (_eventCount == 0) return; + + await _flushLock.WaitAsync(cancellationToken); + try + { + var events = DequeueEvents(); + if (events.Count > 0) + { + await _exporter.ExportAsync(events, cancellationToken); + _lastFlushTime = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); + } + } + catch (Exception ex) + { + // Log but don't throw - telemetry must not break driver + Debug.WriteLine($"Telemetry flush failed: {ex.Message}"); + } + finally + { + _flushLock.Release(); + } + } + + public async Task FlushAllPendingAsync() + { + // Export all pending statement data + foreach (var kvp in _statementData) + { + if (_statementData.TryRemove(kvp.Key, out var stmtData)) + { + var telemetryEvent = CreateStatementEvent(stmtData); + EnqueueEvent(telemetryEvent); + } + } + + // Flush event queue + await FlushAsync(CancellationToken.None); + } + + private List DequeueEvents() + { + var events = new List(_config.BatchSize); + while (_eventQueue.TryDequeue(out var telemetryEvent) && events.Count < _config.BatchSize) + { + events.Add(telemetryEvent); + Interlocked.Decrement(ref _eventCount); + } + return events; + } + + private void OnTimerFlush(object? state) + { + var now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); + if (now - _lastFlushTime >= _config.FlushIntervalMilliseconds && _eventCount > 0) + { + _ = Task.Run(() => FlushAsync(CancellationToken.None)); + } + } + + public void Dispose() + { + if (_disposed) return; + _disposed = true; + + _flushTimer?.Dispose(); + + // Flush all pending data synchronously on dispose + FlushAllPendingAsync().GetAwaiter().GetResult(); + + _flushLock?.Dispose(); + } +} +``` + +### 4.2 TelemetryExporter + +**Purpose**: Export telemetry events to Databricks telemetry service. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryExporter` + +**Responsibilities**: +- Serialize telemetry events to JSON +- Send HTTP POST requests to telemetry endpoints +- Handle authentication (OAuth tokens) +- Implement retry logic for transient failures +- Support circuit breaker pattern + +**Interface**: +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// + /// Exports telemetry events to Databricks telemetry service. 
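    /// Implementations must never throw; export failures are logged and the
    /// batch is dropped so that telemetry can never break driver operations.
    /// A minimal usage sketch (types as proposed in this design, not yet part
    /// of the driver):
    /// <code>
    /// ITelemetryExporter exporter = new TelemetryExporter(httpClient, connection, config);
    /// await exporter.ExportAsync(events, cancellationToken);
    /// </code>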
+ /// + internal interface ITelemetryExporter + { + Task ExportAsync( + IReadOnlyList events, + CancellationToken cancellationToken = default); + } + + internal sealed class TelemetryExporter : ITelemetryExporter + { + public TelemetryExporter( + HttpClient httpClient, + DatabricksConnection connection, + TelemetryConfiguration config); + + public Task ExportAsync( + IReadOnlyList events, + CancellationToken cancellationToken = default); + } +} +``` + +**Implementation Details**: + +```csharp +internal sealed class TelemetryExporter : ITelemetryExporter +{ + private readonly HttpClient _httpClient; + private readonly DatabricksConnection _connection; + private readonly TelemetryConfiguration _config; + private readonly JsonSerializerOptions _jsonOptions; + private readonly CircuitBreaker? _circuitBreaker; + + private const string AuthenticatedPath = "/telemetry-ext"; + private const string UnauthenticatedPath = "/telemetry-unauth"; + + public TelemetryExporter( + HttpClient httpClient, + DatabricksConnection connection, + TelemetryConfiguration config) + { + _httpClient = httpClient ?? throw new ArgumentNullException(nameof(httpClient)); + _connection = connection ?? throw new ArgumentNullException(nameof(connection)); + _config = config ?? throw new ArgumentNullException(nameof(config)); + + _jsonOptions = new JsonSerializerOptions + { + PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower, + DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull, + WriteIndented = false + }; + + if (_config.CircuitBreakerEnabled) + { + _circuitBreaker = new CircuitBreaker( + _config.CircuitBreakerThreshold, + _config.CircuitBreakerTimeout); + } + } + + public async Task ExportAsync( + IReadOnlyList events, + CancellationToken cancellationToken = default) + { + if (events == null || events.Count == 0) return; + + try + { + // Check circuit breaker + if (_circuitBreaker != null && _circuitBreaker.IsOpen) + { + Debug.WriteLine("Telemetry circuit breaker is open, dropping events"); + return; + } + + // Determine endpoint based on authentication status + var isAuthenticated = _connection.IsAuthenticated; + var path = isAuthenticated ? 
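            // Pre-authentication failures (for example OAuth errors during
            // connection open) can only reach the unauthenticated endpoint;
            // all other traffic uses /telemetry-ext with the connection's
            // auth headers.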
AuthenticatedPath : UnauthenticatedPath; + var uri = new Uri(_connection.Host, path); + + // Create request payload + var request = CreateTelemetryRequest(events); + var json = JsonSerializer.Serialize(request, _jsonOptions); + var content = new StringContent(json, Encoding.UTF8, "application/json"); + + // Create HTTP request + using var httpRequest = new HttpRequestMessage(HttpMethod.Post, uri) + { + Content = content + }; + + // Add authentication headers if authenticated + if (isAuthenticated) + { + await AddAuthenticationHeadersAsync(httpRequest, cancellationToken); + } + + // Send request with retry + var response = await SendWithRetryAsync(httpRequest, cancellationToken); + + // Handle response + if (response.IsSuccessStatusCode) + { + _circuitBreaker?.RecordSuccess(); + + // Parse response for partial failures + var responseContent = await response.Content.ReadAsStringAsync(cancellationToken); + var telemetryResponse = JsonSerializer.Deserialize( + responseContent, + _jsonOptions); + + if (telemetryResponse?.Errors?.Count > 0) + { + Debug.WriteLine( + $"Telemetry partial failure: {telemetryResponse.Errors.Count} errors"); + } + } + else + { + _circuitBreaker?.RecordFailure(); + Debug.WriteLine( + $"Telemetry export failed: {response.StatusCode} - {response.ReasonPhrase}"); + } + } + catch (Exception ex) + { + _circuitBreaker?.RecordFailure(); + Debug.WriteLine($"Telemetry export exception: {ex.Message}"); + // Don't rethrow - telemetry must not break driver operations + } + } + + private TelemetryRequest CreateTelemetryRequest(IReadOnlyList events) + { + var protoLogs = events.Select(e => new TelemetryFrontendLog + { + WorkspaceId = e.WorkspaceId, + FrontendLogEventId = Guid.NewGuid().ToString(), + Context = new FrontendLogContext + { + ClientContext = new TelemetryClientContext + { + TimestampMillis = e.Timestamp.ToUnixTimeMilliseconds(), + UserAgent = _connection.UserAgent + } + }, + Entry = new FrontendLogEntry + { + SqlDriverLog = CreateSqlDriverLog(e) + } + }).ToList(); + + return new TelemetryRequest + { + ProtoLogs = protoLogs + }; + } + + private SqlDriverLog CreateSqlDriverLog(TelemetryEvent e) + { + var log = new SqlDriverLog + { + SessionId = e.SessionId, + SqlStatementId = e.StatementId, + OperationLatencyMs = e.OperationLatencyMs, + SystemConfiguration = e.DriverConfig != null + ? CreateSystemConfiguration(e.DriverConfig) + : null, + DriverConnectionParams = e.DriverConfig != null + ? CreateConnectionParameters(e.DriverConfig) + : null + }; + + // Add SQL operation data if present + if (e.SqlOperationData != null) + { + log.SqlOperation = new SqlExecutionEvent + { + ExecutionResult = e.SqlOperationData.ResultFormat.ToString(), + RetryCount = e.SqlOperationData.RetryCount, + ChunkDetails = e.SqlOperationData.ChunkDownloads?.Count > 0 + ? 
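                    // CreateChunkDetails (helper assumed by this design, body
                    // not shown) folds the per-chunk samples into one summary:
                    // total chunks, total download latency, average chunk size
                    // and compression flag, keeping the payload size bounded.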
CreateChunkDetails(e.SqlOperationData.ChunkDownloads) + : null + }; + } + + // Add error info if present + if (!string.IsNullOrEmpty(e.ErrorCode)) + { + log.ErrorInfo = new DriverErrorInfo + { + ErrorName = e.ErrorCode, + StackTrace = e.ErrorMessage + }; + } + + return log; + } + + private async Task SendWithRetryAsync( + HttpRequestMessage request, + CancellationToken cancellationToken) + { + var retryCount = 0; + var maxRetries = _config.MaxRetries; + + while (true) + { + try + { + var response = await _httpClient.SendAsync( + request, + HttpCompletionOption.ResponseHeadersRead, + cancellationToken); + + // Don't retry on client errors (4xx) + if ((int)response.StatusCode < 500) + { + return response; + } + + // Retry on server errors (5xx) if retries remaining + if (retryCount >= maxRetries) + { + return response; + } + } + catch (HttpRequestException) when (retryCount < maxRetries) + { + // Retry on network errors + } + catch (TaskCanceledException) when (!cancellationToken.IsCancellationRequested && retryCount < maxRetries) + { + // Retry on timeout (not user cancellation) + } + + retryCount++; + var delay = TimeSpan.FromMilliseconds(_config.RetryDelayMs * Math.Pow(2, retryCount - 1)); + await Task.Delay(delay, cancellationToken); + } + } + + private async Task AddAuthenticationHeadersAsync( + HttpRequestMessage request, + CancellationToken cancellationToken) + { + // Use connection's authentication mechanism + var authHeaders = await _connection.GetAuthenticationHeadersAsync(cancellationToken); + foreach (var header in authHeaders) + { + request.Headers.TryAddWithoutValidation(header.Key, header.Value); + } + } +} +``` + +### 4.3 CircuitBreaker + +**Purpose**: Prevent telemetry storms when service is unavailable. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.CircuitBreaker` + +**Implementation**: +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// + /// Circuit breaker to prevent telemetry storms. + /// + internal sealed class CircuitBreaker + { + private readonly int _failureThreshold; + private readonly TimeSpan _timeout; + private int _failureCount; + private DateTime _lastFailureTime; + private CircuitState _state; + private readonly object _lock = new object(); + + private enum CircuitState + { + Closed, // Normal operation + Open, // Blocking requests + HalfOpen // Testing if service recovered + } + + public CircuitBreaker(int failureThreshold, TimeSpan timeout) + { + _failureThreshold = failureThreshold; + _timeout = timeout; + _state = CircuitState.Closed; + } + + public bool IsOpen + { + get + { + lock (_lock) + { + // Auto-transition from Open to HalfOpen after timeout + if (_state == CircuitState.Open) + { + if (DateTime.UtcNow - _lastFailureTime > _timeout) + { + _state = CircuitState.HalfOpen; + return false; + } + return true; + } + return false; + } + } + } + + public void RecordSuccess() + { + lock (_lock) + { + _failureCount = 0; + _state = CircuitState.Closed; + } + } + + public void RecordFailure() + { + lock (_lock) + { + _failureCount++; + _lastFailureTime = DateTime.UtcNow; + + if (_failureCount >= _failureThreshold) + { + _state = CircuitState.Open; + } + } + } + } +} +``` + +### 4.4 TelemetryConfiguration + +**Purpose**: Centralize all telemetry configuration. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryConfiguration` + +**Implementation**: +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// + /// Configuration for telemetry collection and export. 
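    /// All settings have conservative defaults and can be overridden per
    /// connection (Section 8.1). A minimal override sketch:
    /// <code>
    /// var config = TelemetryConfiguration.FromProperties(properties);
    /// config.BatchSize = 100; // larger batches, fewer export round-trips
    /// </code>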
+ /// + public sealed class TelemetryConfiguration + { + // Enable/disable flags + public bool Enabled { get; set; } = true; + public bool ForceEnable { get; set; } = false; // Bypass feature flag + + // Batch configuration + public int BatchSize { get; set; } = 50; + public int FlushIntervalMilliseconds { get; set; } = 30000; // 30 seconds + + // Retry configuration + public int MaxRetries { get; set; } = 3; + public int RetryDelayMs { get; set; } = 500; + + // Circuit breaker configuration + public bool CircuitBreakerEnabled { get; set; } = true; + public int CircuitBreakerThreshold { get; set; } = 5; + public TimeSpan CircuitBreakerTimeout { get; set; } = TimeSpan.FromMinutes(1); + + // Log level filtering + public TelemetryLogLevel LogLevel { get; set; } = TelemetryLogLevel.Info; + + // Feature flag name + public const string FeatureFlagName = + "databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc"; + + // Create from connection properties + public static TelemetryConfiguration FromProperties( + IReadOnlyDictionary properties) + { + var config = new TelemetryConfiguration(); + + if (properties.TryGetValue(DatabricksParameters.TelemetryEnabled, out var enabled)) + { + config.Enabled = bool.Parse(enabled); + } + + if (properties.TryGetValue(DatabricksParameters.TelemetryBatchSize, out var batchSize)) + { + config.BatchSize = int.Parse(batchSize); + } + + if (properties.TryGetValue(DatabricksParameters.TelemetryFlushIntervalMs, out var flushInterval)) + { + config.FlushIntervalMilliseconds = int.Parse(flushInterval); + } + + return config; + } + } + + public enum TelemetryLogLevel + { + Off = 0, + Error = 1, + Warn = 2, + Info = 3, + Debug = 4, + Trace = 5 + } +} +``` + +--- + +## 5. Data Schema + +### 5.1 Telemetry Event Model + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.Models +{ + /// + /// Base telemetry event. + /// + internal sealed class TelemetryEvent + { + public TelemetryEventType EventType { get; set; } + public DateTimeOffset Timestamp { get; set; } + public long? WorkspaceId { get; set; } + public string? SessionId { get; set; } + public string? StatementId { get; set; } + public long? OperationLatencyMs { get; set; } + + // Driver configuration (connection events only) + public DriverConfiguration? DriverConfig { get; set; } + + // SQL operation data (statement events only) + public SqlOperationData? SqlOperationData { get; set; } + + // Error information (error events only) + public string? ErrorCode { get; set; } + public string? ErrorMessage { get; set; } + public int? ChunkIndex { get; set; } + } + + public enum TelemetryEventType + { + ConnectionOpen, + StatementExecution, + Error + } +} +``` + +### 5.2 Driver Configuration Model + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.Models +{ + /// + /// Driver configuration snapshot (collected once per connection). + /// + internal sealed class DriverConfiguration + { + // System information + public string? DriverName { get; set; } = "Databricks.ADBC.CSharp"; + public string? DriverVersion { get; set; } + public string? OsName { get; set; } + public string? OsVersion { get; set; } + public string? RuntimeVersion { get; set; } + public string? ProcessName { get; set; } + + // Connection configuration + public string? AuthType { get; set; } + public string? HostUrl { get; set; } + public string? 
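        // HostUrl above is reduced to scheme + host before export (see
        // SanitizeUrl in Section 9.2); credentials and query parameters
        // never leave the client.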
HttpPath { get; set; } + + // Feature flags + public bool CloudFetchEnabled { get; set; } + public bool Lz4DecompressionEnabled { get; set; } + public bool DirectResultsEnabled { get; set; } + public bool TracePropagationEnabled { get; set; } + public bool MultipleCatalogSupport { get; set; } + public bool PrimaryKeyForeignKeyEnabled { get; set; } + + // CloudFetch configuration + public long MaxBytesPerFile { get; set; } + public long MaxBytesPerFetchRequest { get; set; } + public int MaxParallelDownloads { get; set; } + public int PrefetchCount { get; set; } + public int MemoryBufferSizeMb { get; set; } + + // Proxy configuration + public bool UseProxy { get; set; } + public string? ProxyHost { get; set; } + public int? ProxyPort { get; set; } + + // Statement configuration + public long BatchSize { get; set; } + public int PollTimeMs { get; set; } + + // Direct results limits + public long DirectResultMaxBytes { get; set; } + public long DirectResultMaxRows { get; set; } + } +} +``` + +### 5.3 SQL Operation Data Model + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.Models +{ + /// + /// SQL operation metrics. + /// + internal sealed class SqlOperationData + { + public string? StatementId { get; set; } + public ExecutionResultFormat ResultFormat { get; set; } + public long ExecutionLatencyMs { get; set; } + public int RetryCount { get; set; } + public int PollCount { get; set; } + public long TotalPollLatencyMs { get; set; } + + // CloudFetch specific + public List? ChunkDownloads { get; set; } + public int TotalChunks { get; set; } + } + + public enum ExecutionResultFormat + { + Unknown = 0, + InlineArrow = 1, + InlineJson = 2, + ExternalLinks = 3 // CloudFetch + } + + internal sealed class ChunkDownloadData + { + public int ChunkIndex { get; set; } + public long LatencyMs { get; set; } + public long BytesDownloaded { get; set; } + public bool Compressed { get; set; } + } +} +``` + +### 5.4 Server Payload Schema + +The exported JSON payload matches JDBC format for consistency: + +```json +{ + "proto_logs": [ + { + "workspace_id": 1234567890, + "frontend_log_event_id": "550e8400-e29b-41d4-a716-446655440000", + "context": { + "client_context": { + "timestamp_millis": 1698765432000, + "user_agent": "Databricks-ADBC-CSharp/1.0.0" + } + }, + "entry": { + "sql_driver_log": { + "session_id": "01234567-89ab-cdef-0123-456789abcdef", + "sql_statement_id": "01234567-89ab-cdef-0123-456789abcdef", + "operation_latency_ms": 1234, + "system_configuration": { + "driver_name": "Databricks.ADBC.CSharp", + "driver_version": "1.0.0", + "os_name": "Windows", + "os_version": "10.0.19042", + "runtime_version": ".NET 8.0.0", + "process_name": "PowerBI.Desktop" + }, + "driver_connection_params": { + "auth_type": "oauth_client_credentials", + "cloudfetch_enabled": true, + "lz4_decompression_enabled": true, + "direct_results_enabled": true, + "max_bytes_per_file": 20971520, + "max_parallel_downloads": 3, + "batch_size": 2000000 + }, + "sql_operation": { + "execution_result": "EXTERNAL_LINKS", + "retry_count": 0, + "chunk_details": { + "total_chunks": 10, + "chunks_downloaded": 10, + "total_download_latency_ms": 5432, + "avg_chunk_size_bytes": 15728640, + "compressed": true + } + }, + "error_info": null + } + } + } + ] +} +``` + +--- + +## 6. 
Collection Points + +### 6.1 Connection Lifecycle Events + +#### 6.1.1 Connection Open + +**Location**: `DatabricksConnection.OpenAsync()` + +**What to Collect**: +- Connection open latency +- Driver configuration snapshot +- Session ID +- Workspace ID + +**Implementation**: +```csharp +public override async Task OpenAsync(CancellationToken cancellationToken = default) +{ + var sw = Stopwatch.StartNew(); + + try + { + await base.OpenAsync(cancellationToken); + + // Initialize telemetry after successful connection + InitializeTelemetry(); + + sw.Stop(); + + // Record connection open event + _telemetryCollector?.RecordConnectionOpen( + sw.Elapsed, + CreateDriverConfiguration()); + } + catch (Exception) + { + sw.Stop(); + // Error will be recorded by exception handler + throw; + } +} + +private DriverConfiguration CreateDriverConfiguration() +{ + return new DriverConfiguration + { + DriverName = "Databricks.ADBC.CSharp", + DriverVersion = GetType().Assembly.GetName().Version?.ToString(), + OsName = Environment.OSVersion.Platform.ToString(), + OsVersion = Environment.OSVersion.Version.ToString(), + RuntimeVersion = Environment.Version.ToString(), + ProcessName = Process.GetCurrentProcess().ProcessName, + + AuthType = DetermineAuthType(), + HostUrl = Host?.Host, + HttpPath = HttpPath, + + CloudFetchEnabled = UseCloudFetch, + Lz4DecompressionEnabled = CanDecompressLz4, + DirectResultsEnabled = _enableDirectResults, + TracePropagationEnabled = _tracePropagationEnabled, + MultipleCatalogSupport = _enableMultipleCatalogSupport, + PrimaryKeyForeignKeyEnabled = _enablePKFK, + + MaxBytesPerFile = _maxBytesPerFile, + MaxBytesPerFetchRequest = _maxBytesPerFetchRequest, + MaxParallelDownloads = GetIntProperty( + DatabricksParameters.CloudFetchParallelDownloads, + 3), + PrefetchCount = GetIntProperty( + DatabricksParameters.CloudFetchPrefetchCount, + 2), + MemoryBufferSizeMb = GetIntProperty( + DatabricksParameters.CloudFetchMemoryBufferSizeMb, + 200), + + UseProxy = Properties.ContainsKey(ApacheParameters.ProxyHost), + ProxyHost = Properties.TryGetValue(ApacheParameters.ProxyHost, out var host) + ? host + : null, + ProxyPort = Properties.TryGetValue(ApacheParameters.ProxyPort, out var port) + ? int.Parse(port) + : (int?)null, + + BatchSize = DatabricksStatement.DatabricksBatchSizeDefault, + PollTimeMs = GetIntProperty( + ApacheParameters.PollTimeMilliseconds, + DatabricksConstants.DefaultAsyncExecPollIntervalMs), + + DirectResultMaxBytes = _directResultMaxBytes, + DirectResultMaxRows = _directResultMaxRows + }; +} +``` + +#### 6.1.2 Connection Close + +**Location**: `DatabricksConnection.Dispose()` + +**What to Do**: +- Flush all pending telemetry +- Dispose telemetry collector + +**Implementation**: +```csharp +public override void Dispose() +{ + try + { + // Flush telemetry before closing connection + _telemetryCollector?.FlushAllPendingAsync().GetAwaiter().GetResult(); + } + catch (Exception ex) + { + Debug.WriteLine($"Error flushing telemetry on connection close: {ex.Message}"); + } + finally + { + _telemetryCollector?.Dispose(); + _telemetryCollector = null; + + base.Dispose(); + } +} +``` + +### 6.2 Statement Execution Events + +#### 6.2.1 Statement Execute + +**Location**: `DatabricksStatement.ExecuteQueryAsync()` + +**What to Collect**: +- Statement execution latency +- Result format (inline vs CloudFetch) +- Statement ID + +**Implementation**: +```csharp +protected override async Task ExecuteQueryAsync( + string? 
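    // The query text itself is never recorded; telemetry captures only
    // latency, the statement id and the result format (see Section 9.1).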
sqlQuery, + CancellationToken cancellationToken = default) +{ + var sw = Stopwatch.StartNew(); + string? statementId = null; + + try + { + var result = await base.ExecuteQueryAsync(sqlQuery, cancellationToken); + + sw.Stop(); + statementId = result.StatementHandle?.ToSQLExecStatementId(); + + // Determine result format + var resultFormat = DetermineResultFormat(result); + + // Record statement execution + Connection.TelemetryCollector?.RecordStatementExecute( + statementId ?? Guid.NewGuid().ToString(), + sw.Elapsed, + resultFormat); + + return result; + } + catch (Exception ex) + { + sw.Stop(); + + // Record error + Connection.TelemetryCollector?.RecordError( + DetermineErrorCode(ex), + ex.Message, + statementId); + + throw; + } +} + +private ExecutionResultFormat DetermineResultFormat(QueryResult result) +{ + if (result.DirectResult != null) + { + return ExecutionResultFormat.InlineArrow; + } + else if (result.ResultLinks != null && result.ResultLinks.Count > 0) + { + return ExecutionResultFormat.ExternalLinks; + } + else + { + return ExecutionResultFormat.Unknown; + } +} +``` + +#### 6.2.2 Statement Close + +**Location**: `DatabricksStatement.Dispose()` + +**What to Do**: +- Mark statement as complete in telemetry + +**Implementation**: +```csharp +public override void Dispose() +{ + try + { + // Mark statement complete (triggers export of aggregated metrics) + if (!string.IsNullOrEmpty(_statementId)) + { + Connection.TelemetryCollector?.RecordStatementComplete(_statementId); + } + } + finally + { + base.Dispose(); + } +} +``` + +### 6.3 CloudFetch Events + +#### 6.3.1 Chunk Download + +**Location**: `CloudFetchDownloader.DownloadFileAsync()` + +**What to Collect**: +- Download latency per chunk +- Bytes downloaded +- Compression status +- Retry attempts + +**Implementation**: +```csharp +private async Task DownloadFileAsync( + IDownloadResult downloadResult, + CancellationToken cancellationToken) +{ + var sw = Stopwatch.StartNew(); + var retryCount = 0; + + while (retryCount <= _maxRetries) + { + try + { + using var response = await _httpClient.GetAsync( + downloadResult.Url, + HttpCompletionOption.ResponseHeadersRead, + cancellationToken); + + response.EnsureSuccessStatusCode(); + + var contentLength = response.Content.Headers.ContentLength ?? 
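                // Content-Length may be absent (for example with chunked
                // transfer encoding); fall back to 0 so both the memory
                // reservation and the telemetry sample degrade gracefully.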
0; + var stream = await response.Content.ReadAsStreamAsync(cancellationToken); + + // Decompress if needed + if (_isLz4Compressed) + { + stream = LZ4Stream.Decode(stream); + } + + // Copy to memory buffer + await _memoryManager.ReserveAsync(contentLength, cancellationToken); + var memoryStream = new MemoryStream(); + await stream.CopyToAsync(memoryStream, cancellationToken); + + sw.Stop(); + + // Record successful download + _statement.Connection.TelemetryCollector?.RecordChunkDownload( + _statement.StatementId, + downloadResult.ChunkIndex, + sw.Elapsed, + contentLength, + _isLz4Compressed); + + downloadResult.SetData(memoryStream); + return; + } + catch (Exception ex) + { + retryCount++; + + if (retryCount > _maxRetries) + { + sw.Stop(); + + // Record download error + _statement.Connection.TelemetryCollector?.RecordError( + "CHUNK_DOWNLOAD_ERROR", + ex.Message, + _statement.StatementId, + downloadResult.ChunkIndex); + + downloadResult.SetError(ex); + throw; + } + + await Task.Delay(_retryDelayMs * retryCount, cancellationToken); + } + } +} +``` + +#### 6.3.2 Operation Status Polling + +**Location**: `DatabricksOperationStatusPoller.PollForCompletionAsync()` + +**What to Collect**: +- Number of polls +- Total polling latency + +**Implementation**: +```csharp +public async Task PollForCompletionAsync( + TOperationHandle operationHandle, + CancellationToken cancellationToken = default) +{ + var sw = Stopwatch.StartNew(); + var pollCount = 0; + + try + { + TGetOperationStatusResp? statusResp = null; + + while (!cancellationToken.IsCancellationRequested) + { + statusResp = await GetOperationStatusAsync(operationHandle, cancellationToken); + pollCount++; + + if (IsComplete(statusResp.OperationState)) + { + break; + } + + await Task.Delay(_pollIntervalMs, cancellationToken); + } + + sw.Stop(); + + // Record polling metrics + _connection.TelemetryCollector?.RecordOperationStatus( + operationHandle.OperationId?.Guid.ToString() ?? 
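            // Fall back to an empty statement id when the operation handle
            // carries no GUID, so the polling sample is still recorded.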
string.Empty, + pollCount, + sw.Elapsed); + + return statusResp!; + } + catch (Exception) + { + sw.Stop(); + throw; + } +} +``` + +### 6.4 Error Events + +#### 6.4.1 Exception Handler Integration + +**Location**: Throughout driver code + +**What to Collect**: +- Error code/type +- Error message (sanitized) +- Statement ID (if available) +- Chunk index (for download errors) + +**Implementation Pattern**: +```csharp +try +{ + // Driver operation +} +catch (DatabricksException ex) +{ + Connection.TelemetryCollector?.RecordError( + ex.ErrorCode, + SanitizeErrorMessage(ex.Message), + statementId, + chunkIndex); + + throw; +} +catch (AdbcException ex) +{ + Connection.TelemetryCollector?.RecordError( + ex.Status.ToString(), + SanitizeErrorMessage(ex.Message), + statementId); + + throw; +} +catch (Exception ex) +{ + Connection.TelemetryCollector?.RecordError( + "UNKNOWN_ERROR", + SanitizeErrorMessage(ex.Message), + statementId); + + throw; +} + +private static string SanitizeErrorMessage(string message) +{ + // Remove potential PII from error messages + // - Remove connection strings + // - Remove auth tokens + // - Remove file paths containing usernames + // - Keep only first 500 characters + + var sanitized = message; + + // Remove anything that looks like a connection string + sanitized = Regex.Replace( + sanitized, + @"token=[^;]+", + "token=***", + RegexOptions.IgnoreCase); + + // Remove Bearer tokens + sanitized = Regex.Replace( + sanitized, + @"Bearer\s+[A-Za-z0-9\-._~+/]+=*", + "Bearer ***", + RegexOptions.IgnoreCase); + + // Truncate to 500 characters + if (sanitized.Length > 500) + { + sanitized = sanitized.Substring(0, 500) + "..."; + } + + return sanitized; +} +``` + +--- + +## 7. Export Mechanism + +### 7.1 Export Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Driver Operations │ +│ (Emit events to TelemetryCollector) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ TelemetryCollector │ +│ - Buffer events in ConcurrentQueue │ +│ - Aggregate statement metrics in ConcurrentDictionary │ +│ - Track batch size and time since last flush │ +└─────────────────────────────────────────────────────────────────┘ + │ + ┌─────────────┼─────────────┐ + │ │ │ + ▼ ▼ ▼ + Batch Size Time Based Connection Close + Threshold Periodic Flush + Reached Flush + │ │ │ + └─────────────┼─────────────┘ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ TelemetryExporter │ +│ 1. Check circuit breaker state │ +│ 2. Serialize events to JSON │ +│ 3. Create HTTP POST request │ +│ 4. Add authentication headers (if authenticated) │ +│ 5. Send with retry logic │ +│ 6. 
Update circuit breaker on success/failure │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ HTTP POST +┌─────────────────────────────────────────────────────────────────┐ +│ Databricks Telemetry Service │ +│ Endpoints: │ +│ - POST /telemetry-ext (authenticated) │ +│ Auth: OAuth token from connection │ +│ - POST /telemetry-unauth (unauthenticated) │ +│ For pre-authentication errors only │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Lumberjack Pipeline │ +│ - Regional Logfood │ +│ - Central Logfood │ +│ - Table: main.eng_lumberjack.prod_frontend_log_sql_driver_log │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### 7.2 Export Triggers + +#### 7.2.1 Batch Size Threshold + +```csharp +private void EnqueueEvent(TelemetryEvent telemetryEvent) +{ + _eventQueue.Enqueue(telemetryEvent); + var count = Interlocked.Increment(ref _eventCount); + + // Trigger flush if batch size reached + if (count >= _config.BatchSize) + { + _ = Task.Run(() => FlushAsync(CancellationToken.None)); + } +} +``` + +**Default**: 50 events per batch +**Rationale**: Balance between export frequency and network overhead + +#### 7.2.2 Time-Based Periodic Flush + +```csharp +private void OnTimerFlush(object? state) +{ + var now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); + if (now - _lastFlushTime >= _config.FlushIntervalMilliseconds && _eventCount > 0) + { + _ = Task.Run(() => FlushAsync(CancellationToken.None)); + } +} +``` + +**Default**: 30 seconds +**Rationale**: Ensure events are exported even with low event rate + +#### 7.2.3 Connection Close Flush + +```csharp +public void Dispose() +{ + if (_disposed) return; + _disposed = true; + + _flushTimer?.Dispose(); + + // Flush all pending data synchronously on dispose + FlushAllPendingAsync().GetAwaiter().GetResult(); + + _flushLock?.Dispose(); +} +``` + +**Behavior**: Synchronous flush to ensure no data loss on connection close + +### 7.3 Retry Strategy + +**Exponential Backoff with Jitter**: + +```csharp +private async Task SendWithRetryAsync( + HttpRequestMessage request, + CancellationToken cancellationToken) +{ + var retryCount = 0; + var maxRetries = _config.MaxRetries; + var random = new Random(); + + while (true) + { + try + { + var response = await _httpClient.SendAsync( + request, + HttpCompletionOption.ResponseHeadersRead, + cancellationToken); + + // Don't retry on client errors (4xx) + if ((int)response.StatusCode < 500) + { + return response; + } + + // Retry on server errors (5xx) if retries remaining + if (retryCount >= maxRetries) + { + return response; + } + } + catch (HttpRequestException) when (retryCount < maxRetries) + { + // Retry on network errors + } + catch (TaskCanceledException) when (!cancellationToken.IsCancellationRequested && retryCount < maxRetries) + { + // Retry on timeout (not user cancellation) + } + + retryCount++; + + // Exponential backoff with jitter + var baseDelay = _config.RetryDelayMs * Math.Pow(2, retryCount - 1); + var jitter = random.Next(0, (int)(baseDelay * 0.1)); // 10% jitter + var delay = TimeSpan.FromMilliseconds(baseDelay + jitter); + + await Task.Delay(delay, cancellationToken); + } +} +``` + +**Parameters**: +- Base delay: 500ms +- Max retries: 3 +- Exponential multiplier: 2 +- Jitter: 10% of base delay + +**Retry Conditions**: +- ✅ 5xx server errors +- ✅ Network errors (HttpRequestException) +- ✅ Timeouts (TaskCanceledException, not user cancellation) 
+- ❌ 4xx client errors (don't retry) +- ❌ User cancellation + +### 7.4 Circuit Breaker + +**Purpose**: Prevent telemetry storms when service is degraded + +**State Transitions**: + +``` + Closed ──────────────────┐ + │ │ + │ Failure threshold │ Success + │ reached │ + ▼ │ + Open ◄────┐ │ + │ │ │ + │ │ Failure │ + │ │ during │ + │ │ half-open │ + │ │ │ + │ Timeout │ + │ expired │ + ▼ │ │ + HalfOpen ──┴──────────────┘ +``` + +**Configuration**: +- Failure threshold: 5 consecutive failures +- Timeout: 60 seconds +- State check: On every export attempt + +**Behavior**: +- **Closed**: Normal operation, all exports attempted +- **Open**: Drop all events, no export attempts +- **HalfOpen**: Allow one export to test if service recovered + +--- + +## 8. Configuration + +### 8.1 Connection Parameters + +Add new ADBC connection parameters in `DatabricksParameters.cs`: + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks +{ + public static partial class DatabricksParameters + { + // Telemetry enable/disable + public const string TelemetryEnabled = "adbc.databricks.telemetry.enabled"; + + // Force enable (bypass feature flag) + public const string TelemetryForceEnable = "adbc.databricks.telemetry.force_enable"; + + // Batch configuration + public const string TelemetryBatchSize = "adbc.databricks.telemetry.batch_size"; + public const string TelemetryFlushIntervalMs = "adbc.databricks.telemetry.flush_interval_ms"; + + // Retry configuration + public const string TelemetryMaxRetries = "adbc.databricks.telemetry.max_retries"; + public const string TelemetryRetryDelayMs = "adbc.databricks.telemetry.retry_delay_ms"; + + // Circuit breaker configuration + public const string TelemetryCircuitBreakerEnabled = "adbc.databricks.telemetry.circuit_breaker.enabled"; + public const string TelemetryCircuitBreakerThreshold = "adbc.databricks.telemetry.circuit_breaker.threshold"; + public const string TelemetryCircuitBreakerTimeoutSec = "adbc.databricks.telemetry.circuit_breaker.timeout_sec"; + + // Log level filtering + public const string TelemetryLogLevel = "adbc.databricks.telemetry.log_level"; + } +} +``` + +### 8.2 Default Values + +| Parameter | Default | Description | +|:---|:---|:---| +| `adbc.databricks.telemetry.enabled` | `true` | Enable/disable telemetry collection | +| `adbc.databricks.telemetry.force_enable` | `false` | Bypass server-side feature flag | +| `adbc.databricks.telemetry.batch_size` | `50` | Number of events per batch | +| `adbc.databricks.telemetry.flush_interval_ms` | `30000` | Flush interval in milliseconds | +| `adbc.databricks.telemetry.max_retries` | `3` | Maximum retry attempts | +| `adbc.databricks.telemetry.retry_delay_ms` | `500` | Base retry delay in milliseconds | +| `adbc.databricks.telemetry.circuit_breaker.enabled` | `true` | Enable circuit breaker | +| `adbc.databricks.telemetry.circuit_breaker.threshold` | `5` | Failure threshold | +| `adbc.databricks.telemetry.circuit_breaker.timeout_sec` | `60` | Open state timeout in seconds | +| `adbc.databricks.telemetry.log_level` | `Info` | Minimum log level (Off/Error/Warn/Info/Debug/Trace) | + +### 8.3 Example Configuration + +#### JSON Configuration File + +```json +{ + "adbc.connection.host": "https://my-workspace.databricks.com", + "adbc.connection.auth_type": "oauth", + "adbc.databricks.oauth.client_id": "my-client-id", + "adbc.databricks.oauth.client_secret": "my-secret", + + "adbc.databricks.telemetry.enabled": "true", + "adbc.databricks.telemetry.batch_size": "100", + "adbc.databricks.telemetry.flush_interval_ms": 
"60000", + "adbc.databricks.telemetry.log_level": "Info" +} +``` + +#### Programmatic Configuration + +```csharp +var properties = new Dictionary +{ + [DatabricksParameters.HostName] = "https://my-workspace.databricks.com", + [DatabricksParameters.AuthType] = "oauth", + [DatabricksParameters.OAuthClientId] = "my-client-id", + [DatabricksParameters.OAuthClientSecret] = "my-secret", + + [DatabricksParameters.TelemetryEnabled] = "true", + [DatabricksParameters.TelemetryBatchSize] = "100", + [DatabricksParameters.TelemetryFlushIntervalMs] = "60000", + [DatabricksParameters.TelemetryLogLevel] = "Info" +}; + +using var driver = new DatabricksDriver(); +using var database = driver.Open(properties); +using var connection = database.Connect(); +``` + +#### Disable Telemetry + +```csharp +var properties = new Dictionary +{ + // ... other properties ... + [DatabricksParameters.TelemetryEnabled] = "false" +}; +``` + +### 8.4 Server-Side Feature Flag + +**Feature Flag Name**: `databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc` + +**Checking Logic**: + +```csharp +private async Task IsTelemetryEnabledByServerAsync(CancellationToken cancellationToken) +{ + // Check client-side force enable first + if (_config.ForceEnable) + { + return true; + } + + try + { + // Query server for feature flag + // This happens during ApplyServerSidePropertiesAsync() + var query = $"SELECT * FROM databricks_client_config WHERE key = '{TelemetryConfiguration.FeatureFlagName}'"; + + using var statement = Connection.CreateStatement(); + using var reader = await statement.ExecuteQueryAsync(query, cancellationToken); + + if (await reader.ReadAsync(cancellationToken)) + { + var value = reader.GetString(1); // value column + return bool.TryParse(value, out var enabled) && enabled; + } + } + catch (Exception ex) + { + Debug.WriteLine($"Failed to check telemetry feature flag: {ex.Message}"); + // Default to enabled if check fails + return true; + } + + // Default to enabled + return true; +} +``` + +**Integration in Connection**: + +```csharp +internal async Task ApplyServerSidePropertiesAsync(CancellationToken cancellationToken = default) +{ + await base.ApplyServerSidePropertiesAsync(cancellationToken); + + // Check telemetry feature flag + if (_telemetryConfig != null && _telemetryConfig.Enabled) + { + var serverEnabled = await IsTelemetryEnabledByServerAsync(cancellationToken); + if (!serverEnabled) + { + _telemetryConfig.Enabled = false; + _telemetryCollector?.Dispose(); + _telemetryCollector = null; + } + } +} +``` + +--- + +## 9. 
Privacy & Data Residency + +### 9.1 Privacy Principles + +**No PII Collection**: +- ❌ Query text +- ❌ Query results +- ❌ Table names +- ❌ Column names +- ❌ User identifiers (beyond workspace/session IDs) +- ❌ IP addresses +- ❌ File paths with usernames +- ❌ Authentication credentials + +**What We Collect**: +- ✅ Operation latency metrics +- ✅ Driver configuration settings +- ✅ Error codes and sanitized messages +- ✅ Result format (inline vs CloudFetch) +- ✅ System information (OS, runtime version) +- ✅ Session and statement IDs (UUIDs) + +### 9.2 Data Sanitization + +**Error Message Sanitization**: + +```csharp +private static string SanitizeErrorMessage(string message) +{ + // Remove connection strings + message = Regex.Replace( + message, + @"token=[^;]+", + "token=***", + RegexOptions.IgnoreCase); + + // Remove Bearer tokens + message = Regex.Replace( + message, + @"Bearer\s+[A-Za-z0-9\-._~+/]+=*", + "Bearer ***", + RegexOptions.IgnoreCase); + + // Remove client secrets + message = Regex.Replace( + message, + @"client_secret=[^&\s]+", + "client_secret=***", + RegexOptions.IgnoreCase); + + // Remove basic auth + message = Regex.Replace( + message, + @"Basic\s+[A-Za-z0-9+/]+=*", + "Basic ***", + RegexOptions.IgnoreCase); + + // Remove file paths with usernames (Windows/Unix) + message = Regex.Replace( + message, + @"C:\\Users\\[^\\]+", + "C:\\Users\\***", + RegexOptions.IgnoreCase); + + message = Regex.Replace( + message, + @"/home/[^/]+", + "/home/***"); + + message = Regex.Replace( + message, + @"/Users/[^/]+", + "/Users/***"); + + // Truncate to 500 characters + if (message.Length > 500) + { + message = message.Substring(0, 500) + "..."; + } + + return message; +} +``` + +**Configuration Sanitization**: + +```csharp +private DriverConfiguration CreateDriverConfiguration() +{ + var config = new DriverConfiguration + { + // ... populate config ... + + // Sanitize sensitive fields + HostUrl = SanitizeUrl(_connection.Host?.Host), + ProxyHost = SanitizeUrl(_connection.ProxyHost) + }; + + return config; +} + +private static string? SanitizeUrl(string? 
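    // Keeps only scheme and host; user info, port, path and query string are
    // all dropped, and anything unparsable collapses to "***".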
url) +{ + if (string.IsNullOrEmpty(url)) return url; + + try + { + var uri = new Uri(url); + // Return only host and scheme, no credentials or query params + return $"{uri.Scheme}://{uri.Host}"; + } + catch + { + return "***"; + } +} +``` + +### 9.3 Data Residency Compliance + +**Lumberjack Integration**: + +The Databricks telemetry service integrates with Lumberjack, which handles: +- **Data residency**: Logs stored in region-appropriate storage +- **Encryption**: At-rest and in-transit encryption +- **Retention**: Automated retention policies +- **Compliance**: GDPR, CCPA, HIPAA compliance + +**Regional Processing**: + +``` +┌────────────────────────────────────────────────────────────┐ +│ US-based Client │ +└────────────────────────────────────────────────────────────┘ + │ + ▼ POST /telemetry-ext +┌────────────────────────────────────────────────────────────┐ +│ US Control Plane │ +│ - Telemetry Service │ +└────────────────────────────────────────────────────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────┐ +│ US Regional Logfood │ +│ (US-based storage) │ +└────────────────────────────────────────────────────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────┐ +│ Central Logfood │ +│ (Global aggregation) │ +└────────────────────────────────────────────────────────────┘ +``` + +**No Cross-Region Data Transfer**: +- Telemetry sent to workspace's control plane region +- Processed and stored within that region +- Central aggregation respects data residency rules + +### 9.4 Opt-Out Mechanisms + +**Client-Side Opt-Out**: + +```csharp +// Disable via connection properties +properties[DatabricksParameters.TelemetryEnabled] = "false"; + +// Or via JSON config +{ + "adbc.databricks.telemetry.enabled": "false" +} +``` + +**Server-Side Opt-Out**: + +```sql +-- Workspace administrator can disable +SET databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc = false; +``` + +**Environment Variable Opt-Out**: + +```bash +# Set environment variable to disable globally +export DATABRICKS_TELEMETRY_ENABLED=false +``` + +**Implementation**: + +```csharp +private static bool IsTelemetryEnabled(IReadOnlyDictionary properties) +{ + // Check environment variable first + var envVar = Environment.GetEnvironmentVariable("DATABRICKS_TELEMETRY_ENABLED"); + if (!string.IsNullOrEmpty(envVar) && bool.TryParse(envVar, out var envEnabled)) + { + return envEnabled; + } + + // Check connection properties + if (properties.TryGetValue(DatabricksParameters.TelemetryEnabled, out var propValue)) + { + return bool.TryParse(propValue, out var propEnabled) && propEnabled; + } + + // Default to enabled + return true; +} +``` + +--- + +## 10. Error Handling + +### 10.1 Principles + +1. **Never Block Driver Operations**: Telemetry failures must not impact driver functionality +2. **Fail Silently**: Log errors but don't throw exceptions +3. **Degrade Gracefully**: Circuit breaker prevents cascading failures +4. 
 Error Scenarios
+
+#### 10.2.1 Telemetry Service Unavailable
+
+**Scenario**: Telemetry endpoint returns 503 Service Unavailable
+
+**Handling**:
+```csharp
+try
+{
+    var response = await _httpClient.SendAsync(request, cancellationToken);
+
+    if (response.StatusCode == HttpStatusCode.ServiceUnavailable)
+    {
+        _circuitBreaker?.RecordFailure();
+        Debug.WriteLine("Telemetry service unavailable, will retry");
+        return;
+    }
+}
+catch (HttpRequestException ex)
+{
+    _circuitBreaker?.RecordFailure();
+    Debug.WriteLine($"Telemetry HTTP error: {ex.Message}");
+    // Don't throw - fail silently
+}
+```
+
+**Result**: Circuit breaker opens after threshold, drops subsequent events until service recovers
+
+#### 10.2.2 Network Timeout
+
+**Scenario**: HTTP request times out
+
+**Handling**:
+```csharp
+try
+{
+    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
+    cts.CancelAfter(TimeSpan.FromSeconds(10)); // 10 second timeout
+
+    var response = await _httpClient.SendAsync(request, cts.Token);
+}
+catch (TaskCanceledException) when (!cancellationToken.IsCancellationRequested)
+{
+    // Timeout, not user cancellation
+    Debug.WriteLine("Telemetry request timeout, will retry");
+    // Retry logic handles this
+}
+```
+
+**Result**: Retry with exponential backoff, eventually give up if persistent
+
+#### 10.2.3 Serialization Error
+
+**Scenario**: JSON serialization fails for telemetry event
+
+**Handling**:
+```csharp
+try
+{
+    var json = JsonSerializer.Serialize(request, _jsonOptions);
+}
+catch (JsonException ex)
+{
+    Debug.WriteLine($"Telemetry serialization error: {ex.Message}");
+    // Skip this batch, don't crash
+    return;
+}
+```
+
+**Result**: Drop problematic events, continue with next batch
+
+#### 10.2.4 Out of Memory
+
+**Scenario**: Too many telemetry events buffered in memory
+
+**Handling**:
+```csharp
+private void EnqueueEvent(TelemetryEvent telemetryEvent)
+{
+    // Check queue size limit
+    if (_eventQueue.Count >= _config.MaxQueueSize)
+    {
+        Debug.WriteLine("Telemetry queue full, dropping oldest event");
+        _eventQueue.TryDequeue(out _); // Drop oldest
+    }
+
+    _eventQueue.Enqueue(telemetryEvent);
+}
+```
+
+**Configuration**: `MaxQueueSize = 1000` (default)
+
+**Result**: FIFO queue with bounded size, drops oldest events when full
+
+#### 10.2.5 Partial Failure Response
+
+**Scenario**: Server accepts some events but rejects others
+
+**Handling**:
+```csharp
+var telemetryResponse = JsonSerializer.Deserialize<TelemetryResponse>(
+    responseContent,
+    _jsonOptions);
+
+if (telemetryResponse?.Errors?.Count > 0)
+{
+    Debug.WriteLine(
+        $"Telemetry partial failure: {telemetryResponse.NumProtoSuccess} succeeded, " +
+        $"{telemetryResponse.Errors.Count} failed");
+
+    // Log details about failures
+    foreach (var error in telemetryResponse.Errors)
+    {
+        Debug.WriteLine($"  - Event {error.Index}: {error.Message}");
+    }
+
+    // Don't retry individual events - too complex
+    // Accept partial success
+}
+```
+
+**Result**: Accept partial success, log details for debugging
+
+### 10.3 Error Logging
+
+**Debug Output**:
+```csharp
+// Use Debug.WriteLine for telemetry errors (not visible in production)
+Debug.WriteLine($"Telemetry error: {ex.Message}");
+```
+
+**Activity Integration**:
+```csharp
+try
+{
+    await ExportAsync(events, cancellationToken);
+}
+catch (Exception ex)
+{
+    // Add telemetry error as Activity event (if tracing enabled)
+    Activity.Current?.AddEvent(new ActivityEvent(
+        "telemetry.export.failed",
+        tags: 
new ActivityTagsCollection
+        {
+            { "error.type", ex.GetType().Name },
+            { "error.message", ex.Message },
+            { "event.count", events.Count }
+        }));
+
+    Debug.WriteLine($"Telemetry export failed: {ex.Message}");
+}
+```
+
+**Result**: Telemetry errors captured in traces (if enabled) but don't affect driver
+
+---
+
+## 11. Testing Strategy
+
+### 11.1 Unit Tests
+
+#### 11.1.1 TelemetryCollector Tests
+
+**File**: `TelemetryCollectorTests.cs`
+
+```csharp
+[TestClass]
+public class TelemetryCollectorTests
+{
+    private Mock<ITelemetryExporter> _mockExporter;
+    private Mock<DatabricksConnection> _mockConnection;
+    private TelemetryConfiguration _config;
+    private TelemetryCollector _collector;
+
+    [TestInitialize]
+    public void Setup()
+    {
+        _mockExporter = new Mock<ITelemetryExporter>();
+        _mockConnection = new Mock<DatabricksConnection>();
+        _config = new TelemetryConfiguration
+        {
+            Enabled = true,
+            BatchSize = 10,
+            FlushIntervalMilliseconds = 1000
+        };
+
+        _collector = new TelemetryCollector(
+            _mockConnection.Object,
+            _mockExporter.Object,
+            _config);
+    }
+
+    [TestMethod]
+    public void RecordConnectionOpen_AddsEventToQueue()
+    {
+        // Arrange
+        var latency = TimeSpan.FromMilliseconds(100);
+        var driverConfig = new DriverConfiguration();
+
+        // Act
+        _collector.RecordConnectionOpen(latency, driverConfig);
+
+        // Assert
+        // Verify event was queued (internal queue is private, so check via flush)
+        _collector.FlushAsync().Wait();
+        _mockExporter.Verify(
+            e => e.ExportAsync(
+                It.Is<IReadOnlyList<TelemetryEvent>>(list => list.Count == 1),
+                It.IsAny<CancellationToken>()),
+            Times.Once);
+    }
+
+    [TestMethod]
+    public void RecordStatementExecute_AggregatesMetrics()
+    {
+        // Arrange
+        var statementId = Guid.NewGuid().ToString();
+        var latency = TimeSpan.FromMilliseconds(200);
+        var resultFormat = ExecutionResultFormat.ExternalLinks;
+
+        // Act
+        _collector.RecordStatementExecute(statementId, latency, resultFormat);
+        _collector.RecordStatementComplete(statementId);
+
+        // Assert
+        _collector.FlushAsync().Wait();
+        _mockExporter.Verify(
+            e => e.ExportAsync(
+                It.Is<IReadOnlyList<TelemetryEvent>>(list =>
+                    list.Count == 1 &&
+                    list[0].SqlOperationData.ExecutionLatencyMs == 200),
+                It.IsAny<CancellationToken>()),
+            Times.Once);
+    }
+
+    [TestMethod]
+    public async Task FlushAsync_TriggeredOnBatchSizeThreshold()
+    {
+        // Arrange - BatchSize is 10
+        var driverConfig = new DriverConfiguration();
+
+        // Act - Add 10 events
+        for (int i = 0; i < 10; i++)
+        {
+            _collector.RecordConnectionOpen(TimeSpan.FromMilliseconds(i), driverConfig);
+        }
+
+        // Wait for async flush to complete
+        await Task.Delay(100);
+
+        // Assert
+        _mockExporter.Verify(
+            e => e.ExportAsync(
+                It.Is<IReadOnlyList<TelemetryEvent>>(list => list.Count == 10),
+                It.IsAny<CancellationToken>()),
+            Times.Once);
+    }
+
+    [TestMethod]
+    public async Task FlushAsync_TriggeredOnTimeInterval()
+    {
+        // Arrange - FlushIntervalMilliseconds is 1000
+        var driverConfig = new DriverConfiguration();
+        _collector.RecordConnectionOpen(TimeSpan.FromMilliseconds(100), driverConfig);
+
+        // Act - Wait for timer to trigger flush
+        await Task.Delay(1500);
+
+        // Assert
+        _mockExporter.Verify(
+            e => e.ExportAsync(
+                It.IsAny<IReadOnlyList<TelemetryEvent>>(),
+                It.IsAny<CancellationToken>()),
+            Times.AtLeastOnce);
+    }
+
+    [TestMethod]
+    public void Dispose_FlushesAllPendingEvents()
+    {
+        // Arrange
+        var driverConfig = new DriverConfiguration();
+        _collector.RecordConnectionOpen(TimeSpan.FromMilliseconds(100), driverConfig);
+
+        // Act
+        _collector.Dispose();
+
+        // Assert
+        _mockExporter.Verify(
+            e => e.ExportAsync(
+                It.Is<IReadOnlyList<TelemetryEvent>>(list => list.Count > 0),
+                It.IsAny<CancellationToken>()),
+            Times.Once);
+    }
+
+    [TestMethod]
+    public void RecordError_CreatesErrorEvent()
+    {
+        // Arrange
+        var errorCode = "CONNECTION_ERROR";
+        var errorMessage = "Failed to 
connect"; + var statementId = Guid.NewGuid().ToString(); + + // Act + _collector.RecordError(errorCode, errorMessage, statementId); + + // Assert + _collector.FlushAsync().Wait(); + _mockExporter.Verify( + e => e.ExportAsync( + It.Is>(list => + list.Count == 1 && + list[0].ErrorCode == errorCode), + It.IsAny()), + Times.Once); + } +} +``` + +#### 11.1.2 TelemetryExporter Tests + +**File**: `TelemetryExporterTests.cs` + +```csharp +[TestClass] +public class TelemetryExporterTests +{ + private Mock _mockHttpHandler; + private HttpClient _httpClient; + private Mock _mockConnection; + private TelemetryConfiguration _config; + private TelemetryExporter _exporter; + + [TestInitialize] + public void Setup() + { + _mockHttpHandler = new Mock(); + _httpClient = new HttpClient(_mockHttpHandler.Object); + _mockConnection = new Mock(); + _mockConnection.Setup(c => c.Host).Returns(new Uri("https://test.databricks.com")); + _mockConnection.Setup(c => c.IsAuthenticated).Returns(true); + + _config = new TelemetryConfiguration + { + Enabled = true, + MaxRetries = 3, + RetryDelayMs = 100, + CircuitBreakerEnabled = true + }; + + _exporter = new TelemetryExporter(_httpClient, _mockConnection.Object, _config); + } + + [TestMethod] + public async Task ExportAsync_SendsEventsToCorrectEndpoint() + { + // Arrange + var events = new List + { + new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen } + }; + + _mockHttpHandler + .Protected() + .Setup>( + "SendAsync", + ItExpr.IsAny(), + ItExpr.IsAny()) + .ReturnsAsync(new HttpResponseMessage + { + StatusCode = HttpStatusCode.OK, + Content = new StringContent("{\"num_proto_success\": 1, \"errors\": []}") + }); + + // Act + await _exporter.ExportAsync(events); + + // Assert + _mockHttpHandler.Protected().Verify( + "SendAsync", + Times.Once(), + ItExpr.Is(req => + req.Method == HttpMethod.Post && + req.RequestUri.AbsolutePath == "/telemetry-ext"), + ItExpr.IsAny()); + } + + [TestMethod] + public async Task ExportAsync_RetriesOnServerError() + { + // Arrange + var events = new List + { + new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen } + }; + + var callCount = 0; + _mockHttpHandler + .Protected() + .Setup>( + "SendAsync", + ItExpr.IsAny(), + ItExpr.IsAny()) + .ReturnsAsync(() => + { + callCount++; + if (callCount < 3) + { + return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable); + } + return new HttpResponseMessage + { + StatusCode = HttpStatusCode.OK, + Content = new StringContent("{\"num_proto_success\": 1, \"errors\": []}") + }; + }); + + // Act + await _exporter.ExportAsync(events); + + // Assert + Assert.AreEqual(3, callCount, "Should retry twice before succeeding"); + } + + [TestMethod] + public async Task ExportAsync_DoesNotRetryOnClientError() + { + // Arrange + var events = new List + { + new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen } + }; + + _mockHttpHandler + .Protected() + .Setup>( + "SendAsync", + ItExpr.IsAny(), + ItExpr.IsAny()) + .ReturnsAsync(new HttpResponseMessage(HttpStatusCode.BadRequest)); + + // Act + await _exporter.ExportAsync(events); + + // Assert - Should only try once + _mockHttpHandler.Protected().Verify( + "SendAsync", + Times.Once(), + ItExpr.IsAny(), + ItExpr.IsAny()); + } + + [TestMethod] + public async Task ExportAsync_DoesNotThrowOnFailure() + { + // Arrange + var events = new List + { + new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen } + }; + + _mockHttpHandler + .Protected() + .Setup>( + "SendAsync", + ItExpr.IsAny(), + ItExpr.IsAny()) + 
.ThrowsAsync(new HttpRequestException("Network error"));
+
+        // Act & Assert - Should not throw
+        await _exporter.ExportAsync(events);
+    }
+}
+```
+
+#### 11.1.3 CircuitBreaker Tests
+
+**File**: `CircuitBreakerTests.cs`
+
+```csharp
+[TestClass]
+public class CircuitBreakerTests
+{
+    [TestMethod]
+    public void IsOpen_ReturnsFalseInitially()
+    {
+        // Arrange
+        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromSeconds(60));
+
+        // Assert
+        Assert.IsFalse(cb.IsOpen);
+    }
+
+    [TestMethod]
+    public void IsOpen_ReturnsTrueAfterThresholdFailures()
+    {
+        // Arrange
+        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromSeconds(60));
+
+        // Act
+        cb.RecordFailure();
+        cb.RecordFailure();
+        cb.RecordFailure();
+
+        // Assert
+        Assert.IsTrue(cb.IsOpen);
+    }
+
+    [TestMethod]
+    public void IsOpen_TransitionsToHalfOpenAfterTimeout()
+    {
+        // Arrange
+        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromMilliseconds(100));
+
+        // Act
+        cb.RecordFailure();
+        cb.RecordFailure();
+        cb.RecordFailure();
+        Assert.IsTrue(cb.IsOpen);
+
+        // Wait for timeout
+        Thread.Sleep(150);
+
+        // Assert
+        Assert.IsFalse(cb.IsOpen); // Transitions to HalfOpen, returns false
+    }
+
+    [TestMethod]
+    public void RecordSuccess_ResetsCircuitBreaker()
+    {
+        // Arrange
+        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromSeconds(60));
+
+        // Act
+        cb.RecordFailure();
+        cb.RecordFailure();
+        cb.RecordSuccess(); // Reset
+        cb.RecordFailure();
+
+        // Assert - Should still be closed (only 1 failure after reset)
+        Assert.IsFalse(cb.IsOpen);
+    }
+}
+```
+
+### 11.2 Integration Tests
+
+#### 11.2.1 End-to-End Telemetry Flow
+
+**File**: `TelemetryIntegrationTests.cs`
+
+```csharp
+[TestClass]
+public class TelemetryIntegrationTests
+{
+    private const string TestConnectionString = "..."; // Real Databricks workspace
+
+    [TestMethod]
+    [TestCategory("Integration")]
+    public async Task ConnectionOpen_SendsTelemetry()
+    {
+        // Arrange
+        var properties = new Dictionary<string, string>
+        {
+            // ... connection properties ...
+            [DatabricksParameters.TelemetryEnabled] = "true",
+            [DatabricksParameters.TelemetryBatchSize] = "1", // Immediate flush
+        };
+
+        // Act
+        using var driver = new DatabricksDriver();
+        using var database = driver.Open(properties);
+        using var connection = (DatabricksConnection)database.Connect();
+
+        // Give telemetry time to export
+        await Task.Delay(1000);
+
+        // Assert - Check that telemetry was sent (via logs or server-side query)
+        // This requires access to telemetry table or mock endpoint
+    }
+
+    [TestMethod]
+    [TestCategory("Integration")]
+    public async Task StatementExecution_SendsTelemetry()
+    {
+        // Arrange
+        var properties = new Dictionary<string, string>
+        {
+            // ... connection properties ...
+            [DatabricksParameters.TelemetryEnabled] = "true",
+        };
+
+        // Act
+        using var driver = new DatabricksDriver();
+        using var database = driver.Open(properties);
+        using var connection = database.Connect();
+        using var statement = connection.CreateStatement();
+
+        var reader = await statement.ExecuteQueryAsync("SELECT 1 AS test");
+        await reader.ReadAsync();
+
+        // Close to flush telemetry
+        connection.Dispose();
+
+        // Assert - Verify telemetry sent
+    }
+
+    [TestMethod]
+    [TestCategory("Integration")]
+    public async Task CloudFetchDownload_SendsTelemetry()
+    {
+        // Arrange - Query that returns CloudFetch results
+        var properties = new Dictionary<string, string>
+        {
+            // ... connection properties ...
+            [DatabricksParameters.TelemetryEnabled] =
 "true",
+            [DatabricksParameters.CloudFetchEnabled] = "true",
+        };
+
+        // Act
+        using var driver = new DatabricksDriver();
+        using var database = driver.Open(properties);
+        using var connection = database.Connect();
+        using var statement = connection.CreateStatement();
+
+        // Query large result set to trigger CloudFetch
+        var reader = await statement.ExecuteQueryAsync("SELECT * FROM large_table LIMIT 1000000");
+
+        while (await reader.ReadAsync())
+        {
+            // Consume results
+        }
+
+        connection.Dispose();
+
+        // Assert - Verify chunk download telemetry sent
+    }
+}
+```
+
+### 11.3 Performance Tests
+
+#### 11.3.1 Telemetry Overhead
+
+**File**: `TelemetryPerformanceTests.cs`
+
+```csharp
+[TestClass]
+public class TelemetryPerformanceTests
+{
+    [TestMethod]
+    [TestCategory("Performance")]
+    public async Task TelemetryOverhead_LessThan1Percent()
+    {
+        // Arrange
+        var propertiesWithTelemetry = new Dictionary<string, string>
+        {
+            // ... connection properties ...
+            [DatabricksParameters.TelemetryEnabled] = "true",
+        };
+
+        var propertiesWithoutTelemetry = new Dictionary<string, string>
+        {
+            // ... connection properties ...
+            [DatabricksParameters.TelemetryEnabled] = "false",
+        };
+
+        const int iterations = 100;
+
+        // Act - Measure with telemetry
+        var swWithTelemetry = Stopwatch.StartNew();
+        for (int i = 0; i < iterations; i++)
+        {
+            using var driver = new DatabricksDriver();
+            using var database = driver.Open(propertiesWithTelemetry);
+            using var connection = database.Connect();
+            using var statement = connection.CreateStatement();
+            var reader = await statement.ExecuteQueryAsync("SELECT 1");
+            await reader.ReadAsync();
+        }
+        swWithTelemetry.Stop();
+
+        // Act - Measure without telemetry
+        var swWithoutTelemetry = Stopwatch.StartNew();
+        for (int i = 0; i < iterations; i++)
+        {
+            using var driver = new DatabricksDriver();
+            using var database = driver.Open(propertiesWithoutTelemetry);
+            using var connection = database.Connect();
+            using var statement = connection.CreateStatement();
+            var reader = await statement.ExecuteQueryAsync("SELECT 1");
+            await reader.ReadAsync();
+        }
+        swWithoutTelemetry.Stop();
+
+        // Assert - Overhead should be < 1%
+        var overhead = (double)(swWithTelemetry.ElapsedMilliseconds - swWithoutTelemetry.ElapsedMilliseconds)
+            / swWithoutTelemetry.ElapsedMilliseconds;
+
+        Console.WriteLine($"Telemetry overhead: {overhead:P2}");
+        Assert.IsTrue(overhead < 0.01, $"Overhead {overhead:P2} exceeds 1% threshold");
+    }
+
+    [TestMethod]
+    [TestCategory("Performance")]
+    public void MemoryUsage_LessThan10MB()
+    {
+        // Arrange
+        var properties = new Dictionary<string, string>
+        {
+            // ... connection properties ...
+            [DatabricksParameters.TelemetryEnabled] =
 "true",
+            [DatabricksParameters.TelemetryBatchSize] = "10000", // Large batch to accumulate memory
+        };
+
+        var initialMemory = GC.GetTotalMemory(forceFullCollection: true);
+
+        // Act - Generate lots of telemetry events
+        using var driver = new DatabricksDriver();
+        using var database = driver.Open(properties);
+        using var connection = (DatabricksConnection)database.Connect();
+
+        for (int i = 0; i < 1000; i++)
+        {
+            connection.TelemetryCollector?.RecordError(
+                "TEST_ERROR",
+                "Test error message",
+                Guid.NewGuid().ToString());
+        }
+
+        var finalMemory = GC.GetTotalMemory(forceFullCollection: false);
+        var memoryUsed = (finalMemory - initialMemory) / (1024 * 1024); // MB
+
+        // Assert
+        Console.WriteLine($"Memory used: {memoryUsed}MB");
+        Assert.IsTrue(memoryUsed < 10, $"Memory usage {memoryUsed}MB exceeds 10MB threshold");
+    }
+}
+```
+
+### 11.4 Mock Endpoint Testing
+
+**File**: `MockTelemetryEndpointTests.cs`
+
+```csharp
+[TestClass]
+public class MockTelemetryEndpointTests
+{
+    private TestServer _testServer;
+    private HttpClient _httpClient;
+
+    [TestInitialize]
+    public void Setup()
+    {
+        // Create in-memory test server
+        _testServer = new TestServer(new WebHostBuilder()
+            .ConfigureServices(services => { })
+            .Configure(app =>
+            {
+                app.Run(async context =>
+                {
+                    if (context.Request.Path == "/telemetry-ext")
+                    {
+                        // Mock telemetry endpoint
+                        var body = await new StreamReader(context.Request.Body).ReadToEndAsync();
+
+                        // Validate request structure
+                        var request = JsonSerializer.Deserialize<TelemetryRequest>(body);
+
+                        // Return success response
+                        context.Response.StatusCode = 200;
+                        await context.Response.WriteAsJsonAsync(new TelemetryResponse
+                        {
+                            NumProtoSuccess = request?.ProtoLogs?.Count ?? 0,
+                            Errors = new()
+                        });
+                    }
+                });
+            }));
+
+        _httpClient = _testServer.CreateClient();
+    }
+
+    [TestCleanup]
+    public void Cleanup()
+    {
+        _httpClient?.Dispose();
+        _testServer?.Dispose();
+    }
+
+    [TestMethod]
+    public async Task ExportAsync_SendsCorrectPayload()
+    {
+        // Arrange
+        var mockConnection = new Mock<DatabricksConnection>();
+        mockConnection.Setup(c => c.Host).Returns(new Uri(_testServer.BaseAddress, "/"));
+        mockConnection.Setup(c => c.IsAuthenticated).Returns(true);
+
+        var config = new TelemetryConfiguration { Enabled = true };
+        var exporter = new TelemetryExporter(_httpClient, mockConnection.Object, config);
+
+        var events = new List<TelemetryEvent>
+        {
+            new TelemetryEvent
+            {
+                EventType = TelemetryEventType.ConnectionOpen,
+                WorkspaceId = 123456,
+                SessionId = Guid.NewGuid().ToString(),
+                OperationLatencyMs = 100,
+                DriverConfig = new DriverConfiguration
+                {
+                    DriverName = "Test",
+                    DriverVersion = "1.0.0"
+                }
+            }
+        };
+
+        // Act
+        await exporter.ExportAsync(events);
+
+        // Assert - Test server validated request structure
+    }
+}
+```
+
+---
+
+## 12. 
Migration & Rollout + +### 12.1 Rollout Phases + +#### Phase 1: Development & Testing (Weeks 1-3) + +**Goals**: +- Implement core telemetry components +- Add unit tests (80% coverage target) +- Test with mock endpoints + +**Deliverables**: +- `TelemetryCollector` implementation +- `TelemetryExporter` implementation +- `CircuitBreaker` implementation +- Unit test suite +- Mock endpoint tests + +**Success Criteria**: +- All unit tests pass +- Code coverage > 80% +- Performance overhead < 1% +- Memory usage < 10MB + +#### Phase 2: Internal Dogfooding (Weeks 4-5) + +**Goals**: +- Deploy to internal staging environment +- Test with real Databricks workspaces +- Validate telemetry data in Lumberjack + +**Configuration**: +```json +{ + "adbc.databricks.telemetry.enabled": "true", + "adbc.databricks.telemetry.force_enable": "true" +} +``` + +**Monitoring**: +- Query Lumberjack table for telemetry data +- Validate schema correctness +- Check for any data quality issues + +**Success Criteria**: +- Telemetry data visible in Lumberjack +- No driver functionality issues +- No performance regressions + +#### Phase 3: Opt-In Beta (Weeks 6-8) + +**Goals**: +- Release to select beta customers +- Gather feedback on telemetry value +- Monitor telemetry service load + +**Configuration**: +- Default: `telemetry.enabled = false` +- Beta customers opt-in via config + +**Monitoring**: +- Track opt-in rate +- Monitor telemetry service QPS +- Watch for any issues + +**Success Criteria**: +- 10+ beta customers opted in +- No critical issues reported +- Positive feedback on value + +#### Phase 4: Default On with Feature Flag (Weeks 9-12) + +**Goals**: +- Enable telemetry by default for new connections +- Gradual rollout via server-side feature flag + +**Configuration**: +- Client-side: `telemetry.enabled = true` (default) +- Server-side: Feature flag controls actual enablement + +**Rollout Schedule**: +- Week 9: 10% of workspaces +- Week 10: 25% of workspaces +- Week 11: 50% of workspaces +- Week 12: 100% of workspaces + +**Monitoring**: +- Track telemetry service QPS growth +- Monitor circuit breaker activation rate +- Watch for any performance impact + +**Success Criteria**: +- Telemetry service handles load +- < 0.1% customer issues +- Valuable insights being derived + +#### Phase 5: General Availability (Week 13+) + +**Goals**: +- Telemetry enabled for all workspaces +- Documentation published +- Monitoring dashboards created + +**Configuration**: +- Default: Enabled +- Opt-out available via config + +**Success Criteria**: +- 100% rollout complete +- Usage analytics dashboard live +- Error monitoring alerts configured + +### 12.2 Rollback Plan + +#### Trigger Conditions + +**Immediate Rollback**: +- Telemetry causing driver crashes +- Performance degradation > 5% +- Data privacy violation detected + +**Gradual Rollback**: +- Telemetry service overloaded (> 1000 QPS sustained) +- Circuit breaker open rate > 10% +- Customer complaints > 5/week + +#### Rollback Procedures + +**Server-Side Rollback** (Preferred): +```sql +-- Disable via feature flag (affects all clients immediately) +UPDATE databricks_client_config +SET value = 'false' +WHERE key = 'databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc'; +``` + +**Client-Side Rollback**: +```json +{ + "adbc.databricks.telemetry.enabled": "false" +} +``` + +**Code Rollback**: +- Revert telemetry changes via Git +- Deploy previous driver version +- Communicate to customers + +### 12.3 Compatibility Matrix + +| Driver Version | Telemetry Support | 
Server Feature Flag Required | +|:---|:---:|:---:| +| < 1.0.0 | ❌ No | N/A | +| 1.0.0 - 1.0.5 | ⚠️ Beta (opt-in) | No | +| 1.1.0+ | ✅ GA (default on) | Yes | + +**Backward Compatibility**: +- Older driver versions continue to work (no telemetry) +- Newer driver versions work with older servers (feature flag defaults to enabled) +- No breaking changes to ADBC API + +### 12.4 Documentation Plan + +#### User-Facing Documentation + +**Location**: `csharp/src/Drivers/Databricks/readme.md` + +**Sections to Add**: +1. Telemetry Overview +2. Configuration Options +3. Privacy & Data Collection +4. Opt-Out Instructions + +**Example**: +```markdown +## Telemetry + +The Databricks ADBC driver collects anonymous usage telemetry to help improve +the driver. Telemetry is enabled by default but can be disabled. + +### What Data is Collected + +- Driver configuration (CloudFetch settings, batch size, etc.) +- Operation latency metrics +- Error codes and sanitized error messages +- Result format usage (inline vs CloudFetch) + +### What Data is NOT Collected + +- Query text or results +- Table or column names +- Authentication credentials +- Personal identifiable information + +### Disabling Telemetry + +To disable telemetry, set the following connection property: + +```json +{ + "adbc.databricks.telemetry.enabled": "false" +} +``` + +Or via environment variable: + +```bash +export DATABRICKS_TELEMETRY_ENABLED=false +``` +``` + +#### Internal Documentation + +**Location**: Internal Confluence/Wiki + +**Sections**: +1. Architecture Overview +2. Data Schema +3. Lumberjack Table Access +4. Dashboard Links +5. Troubleshooting Guide + +--- + +## 13. Alternatives Considered + +### 13.1 OpenTelemetry Metrics Export + +**Approach**: Use OpenTelemetry metrics SDK instead of custom telemetry client. + +**Pros**: +- Industry standard +- Rich ecosystem (Prometheus, Grafana, etc.) +- Automatic instrumentation +- Built-in retry and batching + +**Cons**: +- ❌ Requires external OTLP endpoint (not Databricks service) +- ❌ More complex configuration for users +- ❌ Harder to enforce server-side control (feature flags) +- ❌ Different schema from JDBC driver +- ❌ Additional dependency (OpenTelemetry SDK) + +**Decision**: Not chosen. Custom approach gives better control and consistency with JDBC. + +### 13.2 Activity Events Only + +**Approach**: Extend existing Activity/trace infrastructure to include telemetry events. + +**Pros**: +- Reuses existing infrastructure +- No new components needed +- Unified observability model + +**Cons**: +- ❌ Activities are trace-focused, not metrics-focused +- ❌ No built-in aggregation (traces are per-operation) +- ❌ Requires external trace backend (Jaeger, Zipkin, etc.) +- ❌ Not centralized in Databricks +- ❌ No server-side control + +**Decision**: Not chosen. Activities complement telemetry but don't replace it. + +### 13.3 Log-Based Telemetry + +**Approach**: Emit structured logs that get aggregated by log shipper. + +**Pros**: +- Simple implementation +- Leverages existing logging infrastructure +- Easy to debug locally + +**Cons**: +- ❌ Relies on customer log infrastructure +- ❌ No guarantee logs reach Databricks +- ❌ Hard to enforce server-side control +- ❌ Inconsistent across deployments +- ❌ Performance overhead of log serialization + +**Decision**: Not chosen. Not reliable enough for production telemetry. + +### 13.4 Synchronous Telemetry Export + +**Approach**: Export telemetry synchronously with each operation. 
+ +**Pros**: +- Simpler implementation (no batching) +- Guaranteed delivery (or failure) +- No background threads + +**Cons**: +- ❌ **Blocking**: Would impact driver operation latency +- ❌ High network overhead (one request per event) +- ❌ Poor performance +- ❌ Violates non-blocking requirement + +**Decision**: Not chosen. Must be asynchronous and batched. + +### 13.5 No Telemetry + +**Approach**: Don't implement telemetry, rely on customer-reported issues. + +**Pros**: +- No implementation effort +- No privacy concerns +- Simpler driver code + +**Cons**: +- ❌ **Reactive debugging only**: Wait for customer reports +- ❌ No usage insights +- ❌ Can't track feature adoption +- ❌ Harder to identify systemic issues +- ❌ Slower issue resolution + +**Decision**: Not chosen. Telemetry provides too much value. + +--- + +## 14. Open Questions + +### 14.1 Schema Evolution + +**Question**: How do we handle schema changes over time? + +**Options**: +1. **Versioned schema**: Add `schema_version` field to payload +2. **Backward compatible additions**: Only add optional fields +3. **Server-side schema validation**: Reject unknown fields + +**Recommendation**: Option 2 (backward compatible additions) + versioning + +**Action**: Define schema versioning strategy before GA + +### 14.2 Sampling + +**Question**: Should we implement sampling for high-volume workspaces? + +**Context**: Some workspaces may execute thousands of queries per second + +**Options**: +1. **No sampling**: Collect all events +2. **Client-side sampling**: Sample events before export +3. **Server-side sampling**: Server accepts all, samples during processing + +**Recommendation**: Start with no sampling, add client-side sampling if needed + +**Action**: Monitor telemetry service QPS during rollout, add sampling if > 1000 QPS sustained + +### 14.3 Custom Metrics + +**Question**: Should we allow users to add custom telemetry fields? + +**Use Case**: Enterprise customers may want to tag telemetry with internal identifiers + +**Options**: +1. **No custom fields**: Fixed schema only +2. **Tagged fields**: Allow key-value tags +3. **Extensible schema**: Allow arbitrary JSON in `metadata` field + +**Recommendation**: Option 1 for MVP, revisit based on feedback + +**Action**: Gather feedback during beta phase + +### 14.4 Real-Time Alerting + +**Question**: Should telemetry trigger real-time alerts? + +**Use Case**: Alert on-call when error rate spikes + +**Options**: +1. **No real-time alerting**: Batch processing only +2. **Server-side alerting**: Telemetry service triggers alerts +3. **Client-side alerting**: Driver triggers alerts (not recommended) + +**Recommendation**: Option 2 (server-side alerting) as future enhancement + +**Action**: Design alerting as follow-up project + +### 14.5 PII Detection + +**Question**: How do we detect and prevent PII in telemetry? + +**Current Approach**: Manual sanitization in code + +**Options**: +1. **Manual sanitization**: Regex-based (current) +2. **Automated PII detection**: ML-based PII scanner +3. **Server-side PII scrubbing**: Lumberjack scrubs PII + +**Recommendation**: Option 1 for MVP, Option 3 as enhancement + +**Action**: Audit sanitization logic, add comprehensive tests + +--- + +## 15. References + +### 15.1 Internal Documents + +- [JDBC Telemetry Design Doc](https://docs.google.com/document/d/1Ww9sWPqt-ZpGDgtRPqnIhTVyGaeFp-3wPa-xElYfnbw/edit) +- [Lumberjack Data Residency](https://databricks.atlassian.net/wiki/spaces/ENG/pages/...) 
+- [Telemetry Service API](https://github.com/databricks/universe/tree/master/...)
+
+### 15.2 External Standards
+
+- [OpenTelemetry Specification](https://opentelemetry.io/docs/specs/otel/)
+- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
+- [GDPR Compliance](https://gdpr.eu/)
+- [CCPA Compliance](https://oag.ca.gov/privacy/ccpa)
+
+### 15.3 Code References
+
+- JDBC Telemetry Implementation: `databricks-jdbc/src/main/java/com/databricks/jdbc/telemetry/`
+- ADBC Activity Infrastructure: `arrow-adbc/csharp/src/Apache.Arrow.Adbc/Tracing/`
+- Databricks ADBC Driver: `arrow-adbc/csharp/src/Drivers/Databricks/`
+
+### 15.4 Related Projects
+
+- [OpenTelemetry .NET SDK](https://github.com/open-telemetry/opentelemetry-dotnet)
+- [Polly (Resilience Library)](https://github.com/App-vNext/Polly)
+- [Apache Arrow ADBC](https://arrow.apache.org/adbc/)
+
+---
+
+## Appendix A: Example Code
+
+### A.1 Full Integration Example
+
+```csharp
+using Apache.Arrow.Adbc.Drivers.Databricks;
+
+// Configure telemetry
+var properties = new Dictionary<string, string>
+{
+    [DatabricksParameters.HostName] = "https://my-workspace.databricks.com",
+    [DatabricksParameters.AuthType] = "oauth",
+    [DatabricksParameters.OAuthClientId] = "my-client-id",
+    [DatabricksParameters.OAuthClientSecret] = "my-secret",
+
+    // Telemetry configuration
+    [DatabricksParameters.TelemetryEnabled] = "true",
+    [DatabricksParameters.TelemetryBatchSize] = "50",
+    [DatabricksParameters.TelemetryFlushIntervalMs] = "30000",
+    [DatabricksParameters.TelemetryLogLevel] = "Info"
+};
+
+// Create connection
+using var driver = new DatabricksDriver();
+using var database = driver.Open(properties);
+using var connection = database.Connect();
+
+// Telemetry automatically collects:
+// - Connection open latency
+// - Driver configuration
+
+// Execute query
+using var statement = connection.CreateStatement();
+var reader = await statement.ExecuteQueryAsync("SELECT * FROM my_table LIMIT 1000000");
+
+// Telemetry automatically collects:
+// - Statement execution latency
+// - Result format (inline vs CloudFetch)
+// - Chunk download metrics (if CloudFetch)
+
+while (await reader.ReadAsync())
+{
+    // Process results
+}
+
+// Close connection
+connection.Dispose();
+
+// Telemetry automatically:
+// - Flushes all pending events
+// - Exports to Databricks service
+```
+
+### A.2 Error Handling Example
+
+```csharp
+try
+{
+    using var connection = database.Connect();
+    using var statement = connection.CreateStatement();
+    var reader = await statement.ExecuteQueryAsync("INVALID SQL");
+}
+catch (AdbcException ex)
+{
+    // Telemetry automatically records error:
+    // - Error code: ex.Status
+    // - Error message: sanitized version
+    // - Statement ID (if available)
+
+    Console.WriteLine($"Query failed: {ex.Message}");
+}
+```
+
+---
+
+## Appendix B: Configuration Reference
+
+### B.1 All Telemetry Parameters
+
+| Parameter | Type | Default | Description |
+|:---|:---:|:---:|:---|
+| `adbc.databricks.telemetry.enabled` | bool | `true` | Enable/disable telemetry |
+| `adbc.databricks.telemetry.force_enable` | bool | `false` | Bypass feature flag |
+| `adbc.databricks.telemetry.batch_size` | int | `50` | Events per batch |
+| `adbc.databricks.telemetry.flush_interval_ms` | int | `30000` | Flush interval (ms) |
+| `adbc.databricks.telemetry.max_retries` | int | `3` | Max retry attempts |
+| `adbc.databricks.telemetry.retry_delay_ms` | int | `500` | Base retry delay (ms) |
+| `adbc.databricks.telemetry.circuit_breaker.enabled` | bool | `true` | Enable circuit breaker |
+| `adbc.databricks.telemetry.circuit_breaker.threshold` | int | `5` | Failure threshold |
+| `adbc.databricks.telemetry.circuit_breaker.timeout_sec` | int | `60` | Open timeout (sec) |
+| `adbc.databricks.telemetry.log_level` | enum | `Info` | Log level
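 filter |
+
+For reference, the telemetry-related entries of a connection-properties dictionary might look as follows. This is a sketch with illustrative values, not tuning recommendations; only the parameters listed in the table above are used:
+
+```csharp
+var properties = new Dictionary<string, string>
+{
+    // ... host and authentication properties ...
+
+    ["adbc.databricks.telemetry.enabled"] = "true",
+    ["adbc.databricks.telemetry.batch_size"] = "50",
+    ["adbc.databricks.telemetry.flush_interval_ms"] = "30000",
+    ["adbc.databricks.telemetry.max_retries"] = "3",
+    ["adbc.databricks.telemetry.retry_delay_ms"] = "500",
+    ["adbc.databricks.telemetry.circuit_breaker.enabled"] = "true",
+    ["adbc.databricks.telemetry.circuit_breaker.threshold"] = "5",
+    ["adbc.databricks.telemetry.circuit_breaker.timeout_sec"] = "60",
+    ["adbc.databricks.telemetry.log_level"] = "Info"
+};
+```
+
+---
+
+## Appendix C: Telemetry Events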
 Catalog
+
+### C.1 Connection Events
+
+| Event | Fields | When Emitted |
+|:---|:---|:---|
+| `ConnectionOpen` | latency, driver_config | Connection opened successfully |
+| `ConnectionError` | error_code, error_message | Connection failed to open |
+
+### C.2 Statement Events
+
+| Event | Fields | When Emitted |
+|:---|:---|:---|
+| `StatementExecute` | latency, result_format | Statement executed successfully |
+| `StatementComplete` | aggregated_metrics | Statement closed |
+| `StatementError` | error_code, error_message, statement_id | Statement execution failed |
+
+### C.3 CloudFetch Events
+
+| Event | Fields | When Emitted |
+|:---|:---|:---|
+| `ChunkDownload` | chunk_index, latency, bytes, compressed | Chunk downloaded successfully |
+| `ChunkDownloadError` | chunk_index, error_code, error_message | Chunk download failed |
+| `OperationStatus` | poll_count, total_latency | Polling completed |
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2025-10-26
+**Authors**: Design Team
+**Reviewers**: Engineering, Product, Security, Privacy
diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md
new file mode 100644
index 0000000000..1db34a1e23
--- /dev/null
+++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md
@@ -0,0 +1,280 @@
+# Analysis: File Locations in Telemetry LLD
+
+Based on my analysis of the design document, **ALL changes are contained within the Databricks driver folder** (`/Users/sreekanth.vadigi/Desktop/projects/arrow-adbc/csharp/src/Drivers/Databricks`). Here's the complete breakdown:
+
+## ✅ New Files to Create (All in Databricks folder)
+
+### 1. Telemetry Core Components
+
+```
+/Databricks/Telemetry/
+├── TelemetryCollector.cs          (New - event aggregation)
+├── TelemetryExporter.cs           (New - HTTP export)
+├── ITelemetryExporter.cs          (New - interface)
+├── CircuitBreaker.cs              (New - resilience)
+├── TelemetryConfiguration.cs      (New - config)
+└── Models/
+    ├── TelemetryEvent.cs          (New - event model)
+    ├── TelemetryRequest.cs        (New - request payload)
+    ├── TelemetryResponse.cs       (New - response payload)
+    ├── TelemetryFrontendLog.cs    (New - log wrapper)
+    ├── FrontendLogContext.cs      (New - context)
+    ├── FrontendLogEntry.cs        (New - entry)
+    ├── SqlDriverLog.cs            (New - driver log)
+    ├── DriverConfiguration.cs     (New - config snapshot)
+    ├── SqlOperationData.cs        (New - SQL metrics)
+    ├── ChunkDownloadData.cs       (New - chunk metrics)
+    ├── DriverErrorInfo.cs         (New - error info)
+    ├── TelemetryClientContext.cs  (New - client context)
+    └── StatementTelemetryData.cs  (New - aggregated data)
+```
+
+### 2. Test Files
+
+```
+/Databricks.Tests/Telemetry/
+├── TelemetryCollectorTests.cs     (New - unit tests)
+├── TelemetryExporterTests.cs      (New - unit tests)
+├── CircuitBreakerTests.cs         (New - unit tests)
+├── TelemetryIntegrationTests.cs   (New - integration tests)
+├── TelemetryPerformanceTests.cs   (New - perf tests)
+└── MockTelemetryEndpointTests.cs  (New - mock tests)
+```
+
+## ✅ Existing Files to Modify (All in Databricks folder)
+
+### 1. 
DatabricksParameters.cs + +**Location:** `/Databricks/DatabricksParameters.cs` +**Changes:** Add telemetry configuration constants + +```csharp +public const string TelemetryEnabled = "adbc.databricks.telemetry.enabled"; +public const string TelemetryBatchSize = "adbc.databricks.telemetry.batch_size"; +public const string TelemetryFlushIntervalMs = "adbc.databricks.telemetry.flush_interval_ms"; +// ... 7 more parameters +``` + +### 2. DatabricksConnection.cs + +**Location:** `/Databricks/DatabricksConnection.cs` +**Changes:** +- Add TelemetryCollector field +- Initialize telemetry in `OpenAsync()` +- Record connection configuration +- Flush telemetry in `Dispose()` +- Check server-side feature flag in `ApplyServerSidePropertiesAsync()` + +```csharp +private TelemetryCollector? _telemetryCollector; +private TelemetryConfiguration? _telemetryConfig; + +public override async Task OpenAsync(CancellationToken cancellationToken = default) +{ + // ... existing code ... + InitializeTelemetry(); + _telemetryCollector?.RecordConnectionOpen(latency, driverConfig); +} + +public override void Dispose() +{ + _telemetryCollector?.FlushAllPendingAsync().Wait(); + _telemetryCollector?.Dispose(); + base.Dispose(); +} +``` + +### 3. DatabricksStatement.cs + +**Location:** `/Databricks/DatabricksStatement.cs` +**Changes:** +- Record statement execution metrics +- Track result format +- Mark statement complete on dispose + +```csharp +protected override async Task ExecuteQueryAsync(...) +{ + var sw = Stopwatch.StartNew(); + // ... execute ... + Connection.TelemetryCollector?.RecordStatementExecute( + statementId, sw.Elapsed, resultFormat); +} + +public override void Dispose() +{ + Connection.TelemetryCollector?.RecordStatementComplete(_statementId); + base.Dispose(); +} +``` + +### 4. CloudFetchDownloader.cs + +**Location:** `/Databricks/Reader/CloudFetch/CloudFetchDownloader.cs` +**Changes:** +- Record chunk download latency +- Track retry attempts +- Report download errors + +```csharp +private async Task DownloadFileAsync(IDownloadResult downloadResult, ...) +{ + var sw = Stopwatch.StartNew(); + // ... download ... + _statement.Connection.TelemetryCollector?.RecordChunkDownload( + statementId, chunkIndex, sw.Elapsed, bytesDownloaded, compressed); +} +``` + +### 5. DatabricksOperationStatusPoller.cs + +**Location:** `/Databricks/Reader/DatabricksOperationStatusPoller.cs` +**Changes:** +- Record polling metrics + +```csharp +public async Task PollForCompletionAsync(...) +{ + var pollCount = 0; + var sw = Stopwatch.StartNew(); + // ... polling loop ... + _connection.TelemetryCollector?.RecordOperationStatus( + operationId, pollCount, sw.Elapsed); +} +``` + +### 6. Exception Handlers (Multiple Files) + +**Locations:** Throughout `/Databricks/` (wherever exceptions are caught) +**Changes:** Add telemetry error recording + +```csharp +catch (Exception ex) +{ + Connection.TelemetryCollector?.RecordError( + errorCode, SanitizeErrorMessage(ex.Message), statementId); + throw; +} +``` + +### 7. readme.md + +**Location:** `/Databricks/readme.md` +**Changes:** Add telemetry documentation section + +```markdown +## Telemetry + +The Databricks ADBC driver collects anonymous usage telemetry... 
+ +### What Data is Collected +### What Data is NOT Collected +### Disabling Telemetry +``` + +## ❌ NO Changes Outside Databricks Folder + +The design does **NOT** require any changes to: + +- ❌ Base ADBC library (`Apache.Arrow.Adbc/`) +- ❌ Apache Spark/Hive2 drivers (`Drivers/Apache/`) +- ❌ ADBC interfaces (`AdbcConnection`, `AdbcStatement`, etc.) +- ❌ Activity/Tracing infrastructure (already exists, just reuse) +- ❌ Other ADBC drivers (BigQuery, Snowflake, etc.) + +## 📦 External Dependencies + +The design **reuses existing infrastructure**: + +### Already Available (No Changes Needed): + +**Activity/Tracing** (`Apache.Arrow.Adbc.Tracing/`) +- `ActivityTrace` - Already exists +- `IActivityTracer` - Already exists +- Used for correlation, not modified + +**HTTP Client** +- `HttpClient` - .NET standard library +- Already used by driver + +**JSON Serialization** +- `System.Text.Json` - .NET standard library +- Already used by driver + +**Testing Infrastructure** +- MSTest/xUnit - Standard testing frameworks +- Already used by driver tests + +## 📁 Complete File Tree + +``` +arrow-adbc/csharp/src/Drivers/Databricks/ +│ +├── Telemetry/ ← NEW FOLDER +│ ├── TelemetryCollector.cs ← NEW +│ ├── TelemetryExporter.cs ← NEW +│ ├── ITelemetryExporter.cs ← NEW +│ ├── CircuitBreaker.cs ← NEW +│ ├── TelemetryConfiguration.cs ← NEW +│ └── Models/ ← NEW FOLDER +│ ├── TelemetryEvent.cs ← NEW +│ ├── TelemetryRequest.cs ← NEW +│ ├── TelemetryResponse.cs ← NEW +│ ├── TelemetryFrontendLog.cs ← NEW +│ ├── FrontendLogContext.cs ← NEW +│ ├── FrontendLogEntry.cs ← NEW +│ ├── SqlDriverLog.cs ← NEW +│ ├── DriverConfiguration.cs ← NEW +│ ├── SqlOperationData.cs ← NEW +│ ├── ChunkDownloadData.cs ← NEW +│ ├── DriverErrorInfo.cs ← NEW +│ ├── TelemetryClientContext.cs ← NEW +│ └── StatementTelemetryData.cs ← NEW +│ +├── DatabricksParameters.cs ← MODIFY (add constants) +├── DatabricksConnection.cs ← MODIFY (add telemetry) +├── DatabricksStatement.cs ← MODIFY (add telemetry) +├── readme.md ← MODIFY (add docs) +│ +├── Reader/ +│ ├── DatabricksOperationStatusPoller.cs ← MODIFY (add telemetry) +│ └── CloudFetch/ +│ └── CloudFetchDownloader.cs ← MODIFY (add telemetry) +│ +└── [Other existing files remain unchanged] + +arrow-adbc/csharp/test/Drivers/Databricks.Tests/ +│ +└── Telemetry/ ← NEW FOLDER + ├── TelemetryCollectorTests.cs ← NEW + ├── TelemetryExporterTests.cs ← NEW + ├── CircuitBreakerTests.cs ← NEW + ├── TelemetryIntegrationTests.cs ← NEW + ├── TelemetryPerformanceTests.cs ← NEW + └── MockTelemetryEndpointTests.cs ← NEW +``` + +## Summary + +✅ **All changes are self-contained within the Databricks driver folder** + +**New Files:** ~27 new files (all under `/Databricks/`) +- 14 core implementation files +- 6 test files +- 7+ model classes + +**Modified Files:** ~6-8 existing files (all under `/Databricks/`) +- DatabricksParameters.cs +- DatabricksConnection.cs +- DatabricksStatement.cs +- CloudFetchDownloader.cs +- DatabricksOperationStatusPoller.cs +- readme.md +- Exception handlers (scattered) + +**External Dependencies:** Zero new dependencies outside the folder +- Reuses existing Activity/Tracing infrastructure +- Uses standard .NET libraries (HttpClient, System.Text.Json) +- No changes to base ADBC library + +**This is a clean, modular implementation that doesn't require any changes to the ADBC standard or other drivers!** 🎯 From 26895e58d32643af6e1f4de095f350c2c06496d5 Mon Sep 17 00:00:00 2001 From: Jade Wang Date: Mon, 27 Oct 2025 14:31:40 -0700 Subject: [PATCH 2/6] Add activity based design --- 
.../telemetry-activity-based-design.md | 917 ++++++++++++++++++ 1 file changed, 917 insertions(+) create mode 100644 csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md new file mode 100644 index 0000000000..2afa7630f6 --- /dev/null +++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md @@ -0,0 +1,917 @@ +# Databricks ADBC Driver: Activity-Based Telemetry Design + +## Executive Summary + +This document outlines an **Activity-based telemetry design** that leverages the existing Activity/ActivitySource infrastructure in the Databricks ADBC driver. Instead of creating a parallel telemetry system, we extend the current tracing infrastructure to collect metrics and export them to Databricks telemetry service. + +**Key Objectives:** +- Reuse existing Activity instrumentation points +- Add metrics collection without duplicating code +- Export aggregated metrics to Databricks service +- Maintain server-side feature flag control +- Preserve backward compatibility with OpenTelemetry + +**Design Principles:** +- **Build on existing infrastructure**: Leverage ActivitySource/ActivityListener +- **Single instrumentation point**: Don't duplicate tracing and metrics +- **Non-blocking**: All operations async and non-blocking +- **Privacy-first**: No PII or query data collected +- **Server-controlled**: Feature flag support for enable/disable + +**Key Difference from Original Design:** +- ❌ **OLD**: Separate TelemetryCollector + TelemetryExporter alongside Activity +- ✅ **NEW**: Activity-based with custom ActivityListener + aggregation + +--- + +## Table of Contents + +1. [Background & Motivation](#1-background--motivation) +2. [Architecture Overview](#2-architecture-overview) +3. [Core Components](#3-core-components) +4. [Data Collection](#4-data-collection) +5. [Export Mechanism](#5-export-mechanism) +6. [Configuration](#6-configuration) +7. [Privacy & Compliance](#7-privacy--compliance) +8. [Error Handling](#8-error-handling) +9. [Testing Strategy](#9-testing-strategy) +10. [Migration & Rollout](#10-migration--rollout) +11. [Comparison with Separate Telemetry System](#11-comparison-with-separate-telemetry-system) + +--- + +## 1. Background & Motivation + +### 1.1 Current State + +The Databricks ADBC driver already has: +- ✅ **ActivitySource**: `DatabricksAdbcActivitySource` +- ✅ **Activity instrumentation**: Connection, statement execution, result fetching +- ✅ **W3C Trace Context**: Distributed tracing support +- ✅ **ActivityTrace utility**: Helper for creating activities + +### 1.2 The Problem + +The original design proposed creating a separate telemetry system alongside Activity infrastructure: +- ❌ Duplicate instrumentation in driver code +- ❌ Two data models (Activity vs TelemetryEvent) +- ❌ Two export mechanisms +- ❌ Maintenance burden + +### 1.3 The Solution + +**Extend Activity infrastructure** instead of creating parallel system: +- ✅ Single instrumentation point (Activity) +- ✅ Custom ActivityListener for metrics aggregation +- ✅ Export aggregated data to Databricks service +- ✅ Reuse Activity context, correlation, and timing + +--- + +## 2. 
Architecture Overview + +### 2.1 High-Level Architecture + +```mermaid +graph TB + A[Driver Operations] -->|Activity.Start/Stop| B[ActivitySource] + B -->|Activity Events| C[DatabricksActivityListener] + C -->|Aggregate Metrics| D[MetricsAggregator] + D -->|Batch & Buffer| E[DatabricksTelemetryExporter] + E -->|HTTP POST| F[Databricks Service] + F --> G[Lumberjack] + + H[Feature Flag Service] -.->|Enable/Disable| C + + style C fill:#e1f5fe + style D fill:#e1f5fe + style E fill:#e1f5fe +``` + +**Key Components:** +1. **ActivitySource** (existing): Emits activities for all operations +2. **DatabricksActivityListener** (new): Listens to activities, extracts metrics +3. **MetricsAggregator** (new): Aggregates by statement, batches events +4. **DatabricksTelemetryExporter** (new): Exports to Databricks service + +### 2.2 Activity Flow + +```mermaid +sequenceDiagram + participant App as Application + participant Driver as DatabricksConnection + participant AS as ActivitySource + participant AL as ActivityListener + participant MA as MetricsAggregator + participant Ex as TelemetryExporter + participant Service as Databricks Service + + App->>Driver: ExecuteQueryAsync() + Driver->>AS: StartActivity("ExecuteQuery") + AS->>AL: ActivityStarted(activity) + + Driver->>Driver: Execute operation + Driver->>AS: activity.SetTag("result_format", "CloudFetch") + Driver->>AS: activity.AddEvent("ChunkDownload", tags) + + AS->>AL: ActivityStopped(activity) + AL->>MA: ProcessActivity(activity) + MA->>MA: Aggregate by statement_id + + alt Batch threshold reached + MA->>Ex: Flush(batch) + Ex->>Service: POST /telemetry-ext + end +``` + +### 2.3 Comparison with Existing Activity Usage + +**Before (Tracing Only)**: +```csharp +using var activity = ActivityTrace.Start("ExecuteQuery"); +try { + // operation + activity?.SetTag("success", true); +} catch { + activity?.SetTag("error", true); +} +``` + +**After (Tracing + Metrics)**: +```csharp +using var activity = ActivityTrace.Start("ExecuteQuery"); +try { + // operation + activity?.SetTag("result_format", resultFormat); // ← Picked up by listener + activity?.SetTag("chunk_count", chunkCount); // ← Picked up by listener + activity?.SetTag("success", true); +} catch { + activity?.SetTag("error", errorCode); +} +// Listener automatically aggregates metrics from activity +``` + +**No duplicate instrumentation - same code path!** + +--- + +## 3. Core Components + +### 3.1 DatabricksActivityListener + +**Purpose**: Listen to Activity events and extract metrics for Databricks telemetry. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.DatabricksActivityListener` + +#### Interface + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// + /// Custom ActivityListener that aggregates metrics from Activity events + /// and exports them to Databricks telemetry service. 
+    /// </summary>
+    public sealed class DatabricksActivityListener : IDisposable
+    {
+        public DatabricksActivityListener(
+            DatabricksConnection connection,
+            ITelemetryExporter exporter,
+            TelemetryConfiguration config);
+
+        // Start listening to activities
+        public void Start();
+
+        // Stop listening and flush pending metrics
+        public Task StopAsync();
+
+        public void Dispose();
+    }
+}
+```
+
+#### Activity Listener Configuration
+
+```csharp
+// Internal setup
+private ActivityListener CreateListener()
+{
+    return new ActivityListener
+    {
+        ShouldListenTo = source =>
+            source.Name == "Databricks.Adbc.Driver",
+
+        ActivityStarted = OnActivityStarted,
+        ActivityStopped = OnActivityStopped,
+
+        Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
+            _config.Enabled ? ActivitySamplingResult.AllDataAndRecorded
+                            : ActivitySamplingResult.None
+    };
+}
+```
+
+#### Contracts
+
+**Activity Filtering**:
+- Only listen to `"Databricks.Adbc.Driver"` ActivitySource
+- Respects feature flag via `Sample` callback
+
+**Metric Extraction**:
+- Extract metrics from Activity tags
+- Aggregate by `statement_id` tag
+- Aggregate by `session_id` tag
+
+**Non-Blocking**:
+- All processing async
+- Never blocks Activity completion
+- Failures logged but don't propagate
+
+---
+
+### 3.2 MetricsAggregator
+
+**Purpose**: Aggregate Activity data into metrics suitable for Databricks telemetry.
+
+**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.MetricsAggregator`
+
+#### Interface
+
+```csharp
+namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
+{
+    /// <summary>
+    /// Aggregates metrics from activities by statement and session.
+    /// </summary>
+    internal sealed class MetricsAggregator : IDisposable
+    {
+        public MetricsAggregator(
+            ITelemetryExporter exporter,
+            TelemetryConfiguration config);
+
+        // Process completed activity
+        public void ProcessActivity(Activity activity);
+
+        // Mark statement complete and emit aggregated metrics
+        public void CompleteStatement(string statementId);
+
+        // Flush all pending metrics
+        public Task FlushAsync(CancellationToken ct = default);
+
+        public void Dispose();
+    }
+}
+```
+
+#### Aggregation Logic
+
+```mermaid
+flowchart TD
+    A[Activity Stopped] --> B{Activity.OperationName}
+    B -->|Connection.Open| C[Emit Connection Event]
+    B -->|Statement.Execute| D[Aggregate by statement_id]
+    B -->|CloudFetch.Download| D
+    B -->|Statement.Complete| E[Emit Statement Event]
+
+    D --> F{Batch Size Reached?}
+    F -->|Yes| G[Flush to Exporter]
+    F -->|No| H[Continue Buffering]
+```
+
+#### Contracts
+
+**Statement Aggregation**:
+- Activities with same `statement_id` tag aggregated together
+- Aggregation includes: execution latency, chunk downloads, poll count
+- Emitted when statement marked complete
+
+**Connection-Level Events**:
+- Connection.Open emitted immediately
+- Driver configuration collected once per connection
+
+**Error Handling**:
+- Activity errors (tags with `error.type`) captured
+- Never throws exceptions
+
+A minimal sketch of this aggregation flow
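 follows. It is illustrative only: it assumes the tag names from Section 4, the `TelemetryMetric` shape from Section 5.2, and hypothetical `EmitConnectionMetric` and `_buffer` members for the pieces not shown here.
+
+```csharp
+private readonly ConcurrentDictionary<string, TelemetryMetric> _pending = new();
+
+public void ProcessActivity(Activity activity)
+{
+    // Connection-level activities are emitted immediately
+    if (activity.OperationName == "Connection.Open")
+    {
+        EmitConnectionMetric(activity); // hypothetical helper
+        return;
+    }
+
+    // Statement-scoped activities are aggregated by statement ID
+    if (activity.GetTagItem("db.statement") is not string statementId)
+    {
+        return; // Not statement-scoped; nothing to aggregate
+    }
+
+    var metric = _pending.GetOrAdd(statementId, id => new TelemetryMetric
+    {
+        MetricType = "statement",
+        StatementId = id,
+        Timestamp = activity.StartTimeUtc
+    });
+
+    switch (activity.OperationName)
+    {
+        case "Statement.Execute":
+            metric.ExecutionLatencyMs = (long)activity.Duration.TotalMilliseconds;
+            metric.ResultFormat = activity.GetTagItem("result.format") as string;
+            break;
+        case "CloudFetch.Download":
+            metric.ChunkCount++; // one aggregated count per statement
+            break;
+    }
+}
+
+public void CompleteStatement(string statementId)
+{
+    // Emit the aggregated metric once the statement is done
+    if (_pending.TryRemove(statementId, out var metric))
+    {
+        _buffer.Add(metric); // flushed by batch size or timer (hypothetical buffer)
+    }
+}
+```
+
+---
+
+### 3.3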
+ /// + Task ExportAsync( + IReadOnlyList metrics, + CancellationToken ct = default); + } + + internal sealed class DatabricksTelemetryExporter : ITelemetryExporter + { + public DatabricksTelemetryExporter( + HttpClient httpClient, + DatabricksConnection connection, + TelemetryConfiguration config); + + public Task ExportAsync( + IReadOnlyList metrics, + CancellationToken ct = default); + } +} +``` + +**Same implementation as original design**: Circuit breaker, retry logic, endpoints. + +--- + +## 4. Data Collection + +### 4.1 Activity Tags for Metrics + +**Standard Activity Tags** (already exist): +- `operation.name`: e.g., "Connection.Open", "Statement.Execute" +- `db.operation`: SQL operation type +- `db.statement`: Statement ID (not query text) +- `server.address`: Databricks workspace host + +**New Tags for Metrics** (add to existing activities): +- `result.format`: "inline" | "cloudfetch" +- `result.chunk_count`: Number of CloudFetch chunks +- `result.bytes_downloaded`: Total bytes downloaded +- `result.compression_enabled`: true/false +- `poll.count`: Number of status poll requests +- `poll.latency_ms`: Total polling latency + +**Driver Configuration Tags** (Connection.Open activity): +- `driver.version`: Driver version string +- `driver.os`: Operating system +- `driver.runtime`: .NET runtime version +- `feature.cloudfetch`: CloudFetch enabled? +- `feature.lz4`: LZ4 decompression enabled? +- `feature.direct_results`: Direct results enabled? + +### 4.2 Activity Events for Fine-Grained Data + +Use `Activity.AddEvent()` for per-chunk metrics: + +```csharp +activity?.AddEvent(new ActivityEvent("CloudFetch.ChunkDownloaded", + tags: new ActivityTagsCollection + { + { "chunk.index", chunkIndex }, + { "chunk.latency_ms", latency.TotalMilliseconds }, + { "chunk.bytes", bytesDownloaded }, + { "chunk.compressed", compressed } + })); +``` + +### 4.3 Collection Points + +```mermaid +graph LR + A[Connection.OpenAsync] -->|Activity + Tags| B[Listener] + C[Statement.ExecuteAsync] -->|Activity + Tags| B + D[CloudFetch.Download] -->|Activity.AddEvent| B + E[Statement.GetResults] -->|Activity + Tags| B + + B --> F[MetricsAggregator] +``` + +**Key Point**: No new instrumentation code! Just add tags to existing activities. + +--- + +## 5. 
Export Mechanism + +### 5.1 Export Flow + +```mermaid +flowchart TD + A[Activity Stopped] --> B[ActivityListener] + B --> C[MetricsAggregator] + C -->|Buffer & Aggregate| D{Flush Trigger?} + + D -->|Batch Size| E[Create TelemetryMetric] + D -->|Time Interval| E + D -->|Connection Close| E + + E --> F[TelemetryExporter] + F -->|Check Circuit Breaker| G{Circuit Open?} + G -->|Yes| H[Drop Events] + G -->|No| I[Serialize to JSON] + + I --> J{Authenticated?} + J -->|Yes| K[POST /telemetry-ext] + J -->|No| L[POST /telemetry-unauth] + + K --> M[Databricks Service] + L --> M + M --> N[Lumberjack] +``` + +### 5.2 Data Model + +**TelemetryMetric** (aggregated from multiple activities): + +```csharp +public sealed class TelemetryMetric +{ + // Common fields + public string MetricType { get; set; } // "connection", "statement", "error" + public DateTimeOffset Timestamp { get; set; } + public long WorkspaceId { get; set; } + public string SessionId { get; set; } + + // Statement metrics (aggregated from activities) + public string StatementId { get; set; } + public long ExecutionLatencyMs { get; set; } + public string ResultFormat { get; set; } + public int ChunkCount { get; set; } + public long TotalBytesDownloaded { get; set; } + public int PollCount { get; set; } + + // Driver config (from connection activity) + public DriverConfiguration DriverConfig { get; set; } +} +``` + +**Derived from Activity**: +- `Timestamp`: `activity.StartTimeUtc` +- `ExecutionLatencyMs`: `activity.Duration.TotalMilliseconds` +- `StatementId`: `activity.GetTagItem("db.statement")` +- `ResultFormat`: `activity.GetTagItem("result.format")` + +### 5.3 Batching Strategy + +Same as original design: +- **Batch size**: Default 100 metrics +- **Flush interval**: Default 5 seconds +- **Force flush**: On connection close + +--- + +## 6. Configuration + +### 6.1 Configuration Model + +```csharp +public sealed class TelemetryConfiguration +{ + // Enable/disable + public bool Enabled { get; set; } = true; + + // Batching + public int BatchSize { get; set; } = 100; + public int FlushIntervalMs { get; set; } = 5000; + + // Export + public int MaxRetries { get; set; } = 3; + public int RetryDelayMs { get; set; } = 100; + + // Circuit breaker + public bool CircuitBreakerEnabled { get; set; } = true; + public int CircuitBreakerThreshold { get; set; } = 5; + public TimeSpan CircuitBreakerTimeout { get; set; } = TimeSpan.FromMinutes(1); + + // Feature flag + public const string FeatureFlagName = + "databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc"; +} +``` + +### 6.2 Initialization + +```csharp +// In DatabricksConnection.OpenAsync() +if (_telemetryConfig.Enabled && serverFeatureFlag.Enabled) +{ + _activityListener = new DatabricksActivityListener( + connection: this, + exporter: new DatabricksTelemetryExporter(_httpClient, this, _telemetryConfig), + config: _telemetryConfig); + + _activityListener.Start(); +} +``` + +### 6.3 Feature Flag Integration + +```mermaid +flowchart TD + A[Connection Opens] --> B{Client Config Enabled?} + B -->|No| C[Telemetry Disabled] + B -->|Yes| D{Server Feature Flag?} + D -->|No| C + D -->|Yes| E[Start ActivityListener] + E --> F[Collect & Export Metrics] +``` + +**Priority Order**: +1. Server feature flag (highest) +2. Client connection string +3. Environment variable +4. Default value + +--- + +## 7. 
Privacy & Compliance

### 7.1 Data Privacy

**Never Collected from Activities**:
- ❌ SQL query text (only statement ID)
- ❌ Query results or data values
- ❌ Table/column names from queries
- ❌ User identities (only workspace ID)

**Always Collected**:
- ✅ Operation latency (from `Activity.Duration`)
- ✅ Error codes (from `activity.GetTagItem("error.type")`)
- ✅ Feature flags (boolean settings)
- ✅ Statement IDs (UUIDs)

### 7.2 Activity Tag Filtering

The listener filters which tags to export:

```csharp
private static readonly HashSet<string> AllowedTags = new()
{
    "result.format",
    "result.chunk_count",
    "result.bytes_downloaded",
    "poll.count",
    "error.type",
    "feature.cloudfetch",
    // ... safe tags only
};

private void ProcessActivity(Activity activity)
{
    var metrics = new TelemetryMetric();

    foreach (var tag in activity.Tags)
    {
        if (AllowedTags.Contains(tag.Key))
        {
            // Export this tag
            metrics.AddTag(tag.Key, tag.Value);
        }
        // Sensitive tags silently dropped
    }
}
```

### 7.3 Compliance

Same as original design:
- **GDPR**: No personal data
- **CCPA**: No personal information
- **SOC 2**: Encrypted in transit
- **Data Residency**: Regional control plane

---

## 8. Error Handling

### 8.1 Error Handling Principles

Same as original design:
1. Never block driver operations
2. Fail silently (log only)
3. Circuit breaker for service failures
4. No retry storms

### 8.2 Activity Listener Error Handling

```csharp
private void OnActivityStopped(Activity activity)
{
    try
    {
        _aggregator.ProcessActivity(activity);
    }
    catch (Exception ex)
    {
        // Log but never throw - must not impact driver
        Debug.WriteLine($"Telemetry processing error: {ex.Message}");
    }
}
```

### 8.3 Failure Modes

| Failure | Behavior |
|---------|----------|
| Listener throws | Caught, logged, activity continues |
| Aggregator throws | Caught, logged, skip this activity |
| Exporter fails | Circuit breaker, retry with backoff |
| Circuit breaker open | Drop metrics immediately |
| Out of memory | Disable listener, stop collecting |
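The "circuit breaker open" behavior in the table above can be driven by a small state machine keyed on consecutive export failures. A minimal sketch, assuming the threshold and timeout come from `TelemetryConfiguration` and that calls are serialized by the exporter (a production version would need thread safety; the class name is hypothetical):

```csharp
internal sealed class ExportCircuitBreaker
{
    private readonly int _threshold;     // consecutive failures before opening
    private readonly TimeSpan _timeout;  // how long to keep dropping metrics
    private int _consecutiveFailures;
    private DateTimeOffset _openedAt;

    public ExportCircuitBreaker(int threshold, TimeSpan timeout)
    {
        _threshold = threshold;
        _timeout = timeout;
    }

    // While open, the exporter drops metrics without attempting HTTP calls.
    public bool IsOpen =>
        _consecutiveFailures >= _threshold &&
        DateTimeOffset.UtcNow - _openedAt < _timeout;

    public void RecordSuccess() => _consecutiveFailures = 0;

    public void RecordFailure()
    {
        _consecutiveFailures++;
        if (_consecutiveFailures >= _threshold)
        {
            _openedAt = DateTimeOffset.UtcNow; // open (or re-open) the circuit
        }
    }
}
```

After the timeout elapses, `IsOpen` returns false and a single export attempt is allowed through; a success resets the counter, while a failure re-opens the circuit.

---

## 9. 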
Testing Strategy + +### 9.1 Unit Tests + +**DatabricksActivityListener Tests**: +- `Listener_FiltersCorrectActivitySource` +- `Listener_ExtractsTagsFromActivity` +- `Listener_HandlesActivityWithoutTags` +- `Listener_DoesNotThrowOnError` +- `Listener_RespectsFeatureFlag` + +**MetricsAggregator Tests**: +- `Aggregator_CombinesActivitiesByStatementId` +- `Aggregator_EmitsOnStatementComplete` +- `Aggregator_HandlesConnectionActivity` +- `Aggregator_FlushesOnBatchSize` +- `Aggregator_FlushesOnTimeInterval` + +**TelemetryExporter Tests**: +- Same as original design (endpoints, retry, circuit breaker) + +### 9.2 Integration Tests + +**End-to-End with Activity**: +- `ActivityBased_ConnectionOpen_ExportedSuccessfully` +- `ActivityBased_StatementWithChunks_AggregatedCorrectly` +- `ActivityBased_ErrorActivity_CapturedInMetrics` +- `ActivityBased_FeatureFlagDisabled_NoExport` + +**Compatibility Tests**: +- `ActivityBased_CoexistsWithOpenTelemetry` +- `ActivityBased_CorrelationIdPreserved` +- `ActivityBased_ParentChildSpansWork` + +### 9.3 Performance Tests + +**Overhead Measurement**: +- `ActivityListener_Overhead_LessThan1Percent` +- `MetricExtraction_Completes_UnderOneMicrosecond` + +Compare: +- Baseline: Activity with no listener +- With listener but disabled: Should be ~0% overhead +- With listener enabled: Should be < 1% overhead + +### 9.4 Test Coverage Goals + +| Component | Unit Test Coverage | Integration Test Coverage | +|-----------|-------------------|---------------------------| +| DatabricksActivityListener | > 90% | > 80% | +| MetricsAggregator | > 90% | > 80% | +| TelemetryExporter | > 90% | > 80% | +| Activity Tag Filtering | 100% | N/A | + +--- + +## 10. Migration & Rollout + +### 10.1 Rollout Phases + +#### Phase 1: Implementation (Weeks 1-2) + +**Goals**: +- Implement ActivityListener, MetricsAggregator, Exporter +- Add necessary tags to existing activities +- Unit tests with 90%+ coverage + +**Key Activities**: +- Identify which activities need additional tags +- Implement listener with filtering logic +- Implement aggregator with batching +- Add feature flag integration + +#### Phase 2: Internal Testing (Week 3) + +**Goals**: +- Deploy to internal Databricks environments +- Validate metrics in Lumberjack +- Performance testing + +**Success Criteria**: +- < 1% performance overhead +- Metrics appear in Lumberjack table +- No driver failures due to telemetry + +#### Phase 3: Beta Rollout (Weeks 4-5) + +**Goals**: +- Enable for 10% of workspaces via feature flag +- Monitor error rates and performance +- Collect feedback + +#### Phase 4: Full Rollout (Week 6) + +**Goals**: +- Enable for 100% of workspaces +- Monitor at scale + +### 10.2 Backward Compatibility + +**Guarantees**: +- ✅ Existing Activity-based tracing continues to work +- ✅ OpenTelemetry exporters still receive activities +- ✅ W3C Trace Context propagation unchanged +- ✅ No breaking API changes + +**Migration Path**: +- No migration needed for applications +- Listener is transparent to existing code +- Only adds tags to existing activities + +--- + +## 11. 
Comparison with Separate Telemetry System + +### 11.1 Side-by-Side Comparison + +| Aspect | **Separate Telemetry** (Original) | **Activity-Based** (This Design) | +|--------|----------------------------------|----------------------------------| +| **Instrumentation** | Duplicate: Activity + TelemetryCollector.Record*() | Single: Activity only | +| **Data Model** | Two: Activity + TelemetryEvent | One: Activity tags | +| **Correlation** | Manual correlation between systems | Built-in via Activity context | +| **Code Changes** | New instrumentation points | Add tags to existing activities | +| **Maintenance** | Two systems to maintain | One system | +| **Complexity** | Higher | Lower | +| **Performance Overhead** | Activity + Telemetry overhead | Activity + Listener overhead | +| **OpenTelemetry Compat** | Parallel systems | Seamless integration | + +### 11.2 Code Comparison + +**Separate Telemetry Approach**: +```csharp +// Instrumentation point +using var activity = ActivityTrace.Start("ExecuteQuery"); // For tracing +var sw = Stopwatch.StartNew(); // For telemetry + +try +{ + var result = await ExecuteAsync(); + + activity?.SetTag("success", true); // For tracing + _telemetryCollector?.RecordStatementExecute( // For telemetry + statementId, sw.Elapsed, resultFormat); +} +catch (Exception ex) +{ + activity?.SetTag("error", true); // For tracing + _telemetryCollector?.RecordError( // For telemetry + ex.GetType().Name, ex.Message, statementId); +} +``` + +**Activity-Based Approach**: +```csharp +// Single instrumentation point +using var activity = ActivityTrace.Start("ExecuteQuery"); + +try +{ + var result = await ExecuteAsync(); + + // Tags automatically picked up by listener for metrics + activity?.SetTag("result.format", resultFormat); + activity?.SetTag("statement.id", statementId); + activity?.SetTag("success", true); +} +catch (Exception ex) +{ + activity?.SetTag("error.type", ex.GetType().Name); +} +// Listener automatically extracts metrics from activity +``` + +### 11.3 Pros and Cons + +**Activity-Based Approach Pros**: +- ✅ **Less Code**: No duplicate instrumentation +- ✅ **Single Source of Truth**: Activity is the only data model +- ✅ **Better Correlation**: Activity context automatically propagates +- ✅ **Standards-Based**: Activity is the .NET standard for instrumentation +- ✅ **Easier Maintenance**: One system instead of two +- ✅ **OpenTelemetry Ready**: Works with any OTEL exporter + +**Activity-Based Approach Cons**: +- ⚠️ **Activity Dependency**: Coupled to Activity API (but it's standard .NET) +- ⚠️ **Tag Limits**: Activities have tag size limits (but adequate for metrics) +- ⚠️ **Learning Curve**: Team needs to understand Activity API (but simpler than two systems) + +**Separate Telemetry Approach Pros**: +- ✅ **Independent**: Not coupled to Activity +- ✅ **JDBC Parity**: Matches JDBC driver design + +**Separate Telemetry Approach Cons**: +- ❌ **Duplicate Code**: Two instrumentation points +- ❌ **Two Data Models**: Activity + TelemetryEvent +- ❌ **Harder to Correlate**: Manual correlation needed +- ❌ **More Maintenance**: Two systems to maintain +- ❌ **More Complexity**: Understanding both systems + +--- + +## 12. 
Implementation Checklist + +### Phase 1: Core Implementation +- [ ] Create `DatabricksActivityListener` class +- [ ] Create `MetricsAggregator` class +- [ ] Create `DatabricksTelemetryExporter` class (reuse from original design) +- [ ] Add necessary tags to existing activities +- [ ] Implement activity tag filtering (allowlist) +- [ ] Add feature flag integration + +### Phase 2: Integration +- [ ] Initialize listener in `DatabricksConnection.OpenAsync()` +- [ ] Stop listener in `DatabricksConnection.CloseAsync()` +- [ ] Add configuration parsing from connection string +- [ ] Add server feature flag check + +### Phase 3: Testing +- [ ] Unit tests for ActivityListener +- [ ] Unit tests for MetricsAggregator +- [ ] Integration tests with real activities +- [ ] Performance tests (overhead measurement) +- [ ] Compatibility tests with OpenTelemetry + +### Phase 4: Documentation +- [ ] Update Activity instrumentation docs +- [ ] Document new activity tags +- [ ] Update configuration guide +- [ ] Add troubleshooting guide + +--- + +## 13. Open Questions + +### 13.1 Activity Tag Naming Conventions + +**Question**: Should we use OpenTelemetry semantic conventions for tag names? + +**Recommendation**: Yes, use OTEL conventions where applicable: +- `db.statement.id` instead of `statement.id` +- `http.response.body.size` instead of `bytes_downloaded` +- `error.type` instead of `error_code` + +This ensures compatibility with OTEL ecosystem. + +### 13.2 Statement Completion Detection + +**Question**: How do we know when a statement is complete for aggregation? + +**Options**: +1. **Activity completion**: When statement activity stops (recommended) +2. **Explicit marker**: Call `CompleteStatement(id)` explicitly +3. **Timeout-based**: Emit after N seconds of inactivity + +**Recommendation**: Use activity completion - cleaner and automatic. + +### 13.3 Performance Impact on Existing Activity Users + +**Question**: Will adding tags impact applications that already use Activity for tracing? + +**Answer**: Minimal impact: +- Tags are cheap (< 1μs to set) +- Listener is optional (only activated when telemetry enabled) +- Activity overhead already exists + +--- + +## 14. References + +### 14.1 Related Documentation + +- [.NET Activity API](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing) +- [OpenTelemetry .NET](https://opentelemetry.io/docs/languages/net/) +- [ActivityListener Documentation](https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.activitylistener) + +### 14.2 Existing Code References + +- `ActivityTrace.cs`: Existing Activity helper +- `DatabricksAdbcActivitySource`: Existing ActivitySource +- Connection/Statement activities: Already instrumented + +--- + +## Summary + +This **Activity-based design** provides a cleaner, simpler approach to telemetry by: + +1. **Leveraging existing infrastructure** instead of building parallel systems +2. **Single instrumentation point** via Activity +3. **Standard .NET patterns** (Activity/ActivityListener) +4. **Less code to maintain** (no duplicate instrumentation) +5. **Better compatibility** with OpenTelemetry and APM tools + +**Recommendation**: Use this Activity-based approach unless there's a compelling reason to maintain separate systems. 
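As a concluding illustration, the sketch below shows the statement-level roll-up this recommendation implies: child activities (for example CloudFetch downloads) accumulate into one `TelemetryMetric` keyed by `statement.id`, and the metric is emitted when the statement activity itself stops. Tag names follow the conventions used above, but the buffer class and the `chunk.bytes` tag are assumptions for illustration, not a final API.

```csharp
using System.Collections.Concurrent;
using System.Diagnostics;

internal sealed class StatementAggregationBuffer
{
    private readonly ConcurrentDictionary<string, TelemetryMetric> _pending = new();

    // Called from ActivityStopped; returns a metric only when the statement is done.
    // Roll-up arithmetic is kept simple here and is not synchronized.
    public TelemetryMetric? Accumulate(Activity activity)
    {
        if (activity.GetTagItem("statement.id") is not string statementId)
            return null;

        var metric = _pending.GetOrAdd(statementId, id => new TelemetryMetric
        {
            MetricType = "statement",
            StatementId = id,
            Timestamp = activity.StartTimeUtc
        });

        if (activity.OperationName.StartsWith("CloudFetch."))
        {
            // Child download activity: fold its numbers into the statement metric.
            metric.ChunkCount++;
            if (activity.GetTagItem("chunk.bytes") is long bytes)
                metric.TotalBytesDownloaded += bytes;
            return null; // statement still in flight
        }

        // The statement activity itself: finalize and release the aggregate.
        metric.ExecutionLatencyMs = (long)activity.Duration.TotalMilliseconds;
        metric.ResultFormat = activity.GetTagItem("result.format") as string ?? "inline";
        _pending.TryRemove(statementId, out _);
        return metric; // caller enqueues this for batched export
    }
}
```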
From 3f6599f89da70895f59fe6a578205b6c6be3b9c8 Mon Sep 17 00:00:00 2001 From: Jade Wang Date: Mon, 27 Oct 2025 14:59:32 -0700 Subject: [PATCH 3/6] Update telemetry-activity-based-design.md --- .../telemetry-activity-based-design.md | 241 +++++++++--------- 1 file changed, 121 insertions(+), 120 deletions(-) diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md index 2afa7630f6..28cf957ec7 100644 --- a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md +++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md @@ -18,10 +18,6 @@ This document outlines an **Activity-based telemetry design** that leverages the - **Privacy-first**: No PII or query data collected - **Server-controlled**: Feature flag support for enable/disable -**Key Difference from Original Design:** -- ❌ **OLD**: Separate TelemetryCollector + TelemetryExporter alongside Activity -- ✅ **NEW**: Activity-based with custom ActivityListener + aggregation - --- ## Table of Contents @@ -36,7 +32,7 @@ This document outlines an **Activity-based telemetry design** that leverages the 8. [Error Handling](#8-error-handling) 9. [Testing Strategy](#9-testing-strategy) 10. [Migration & Rollout](#10-migration--rollout) -11. [Comparison with Separate Telemetry System](#11-comparison-with-separate-telemetry-system) +11. [Alternatives Considered](#11-alternatives-considered) --- @@ -50,21 +46,23 @@ The Databricks ADBC driver already has: - ✅ **W3C Trace Context**: Distributed tracing support - ✅ **ActivityTrace utility**: Helper for creating activities -### 1.2 The Problem +### 1.2 Design Opportunity -The original design proposed creating a separate telemetry system alongside Activity infrastructure: -- ❌ Duplicate instrumentation in driver code -- ❌ Two data models (Activity vs TelemetryEvent) -- ❌ Two export mechanisms -- ❌ Maintenance burden +The driver already has comprehensive Activity instrumentation for distributed tracing. 
This presents an opportunity to: +- Leverage existing Activity infrastructure for both tracing and metrics +- Avoid duplicate instrumentation points in the driver code +- Use a single data model (Activity) for both observability concerns +- Maintain automatic correlation between traces and metrics +- Reduce overall system complexity and maintenance burden -### 1.3 The Solution +### 1.3 The Approach -**Extend Activity infrastructure** instead of creating parallel system: +**Extend Activity infrastructure** with metrics collection: - ✅ Single instrumentation point (Activity) - ✅ Custom ActivityListener for metrics aggregation - ✅ Export aggregated data to Databricks service - ✅ Reuse Activity context, correlation, and timing +- ✅ Seamless integration with OpenTelemetry ecosystem --- @@ -124,35 +122,6 @@ sequenceDiagram end ``` -### 2.3 Comparison with Existing Activity Usage - -**Before (Tracing Only)**: -```csharp -using var activity = ActivityTrace.Start("ExecuteQuery"); -try { - // operation - activity?.SetTag("success", true); -} catch { - activity?.SetTag("error", true); -} -``` - -**After (Tracing + Metrics)**: -```csharp -using var activity = ActivityTrace.Start("ExecuteQuery"); -try { - // operation - activity?.SetTag("result_format", resultFormat); // ← Picked up by listener - activity?.SetTag("chunk_count", chunkCount); // ← Picked up by listener - activity?.SetTag("success", true); -} catch { - activity?.SetTag("error", errorCode); -} -// Listener automatically aggregates metrics from activity -``` - -**No duplicate instrumentation - same code path!** - --- ## 3. Core Components @@ -734,91 +703,123 @@ Compare: --- -## 11. Comparison with Separate Telemetry System +## 11. Alternatives Considered -### 11.1 Side-by-Side Comparison +### 11.1 Alternative 1: Separate Telemetry System -| Aspect | **Separate Telemetry** (Original) | **Activity-Based** (This Design) | -|--------|----------------------------------|----------------------------------| -| **Instrumentation** | Duplicate: Activity + TelemetryCollector.Record*() | Single: Activity only | -| **Data Model** | Two: Activity + TelemetryEvent | One: Activity tags | -| **Correlation** | Manual correlation between systems | Built-in via Activity context | -| **Code Changes** | New instrumentation points | Add tags to existing activities | -| **Maintenance** | Two systems to maintain | One system | -| **Complexity** | Higher | Lower | -| **Performance Overhead** | Activity + Telemetry overhead | Activity + Listener overhead | -| **OpenTelemetry Compat** | Parallel systems | Seamless integration | +**Description**: Create a dedicated telemetry collection system parallel to Activity infrastructure, with explicit TelemetryCollector and TelemetryExporter classes. 
-### 11.2 Code Comparison +**Approach**: +- Add `TelemetryCollector.RecordXXX()` calls at each driver operation +- Maintain separate `TelemetryEvent` data model +- Export via dedicated `TelemetryExporter` +- Manual correlation with distributed traces -**Separate Telemetry Approach**: -```csharp -// Instrumentation point -using var activity = ActivityTrace.Start("ExecuteQuery"); // For tracing -var sw = Stopwatch.StartNew(); // For telemetry +**Pros**: +- Independent from Activity API +- Direct control over data collection +- Matches JDBC driver design pattern -try -{ - var result = await ExecuteAsync(); +**Cons**: +- Duplicate instrumentation at every operation point +- Two parallel data models (Activity + TelemetryEvent) +- Manual correlation between traces and metrics required +- Higher maintenance burden (two systems) +- Increased code complexity - activity?.SetTag("success", true); // For tracing - _telemetryCollector?.RecordStatementExecute( // For telemetry - statementId, sw.Elapsed, resultFormat); -} -catch (Exception ex) -{ - activity?.SetTag("error", true); // For tracing - _telemetryCollector?.RecordError( // For telemetry - ex.GetType().Name, ex.Message, statementId); -} -``` +**Why Not Chosen**: The driver already has comprehensive Activity instrumentation. Creating a parallel system would duplicate this effort and increase maintenance complexity without providing significant benefits. -**Activity-Based Approach**: -```csharp -// Single instrumentation point -using var activity = ActivityTrace.Start("ExecuteQuery"); +--- -try -{ - var result = await ExecuteAsync(); +### 11.2 Alternative 2: OpenTelemetry Metrics API Directly - // Tags automatically picked up by listener for metrics - activity?.SetTag("result.format", resultFormat); - activity?.SetTag("statement.id", statementId); - activity?.SetTag("success", true); -} -catch (Exception ex) -{ - activity?.SetTag("error.type", ex.GetType().Name); -} -// Listener automatically extracts metrics from activity -``` +**Description**: Use OpenTelemetry's Metrics API (`Meter` and `Counter`/`Histogram`) directly in driver code. + +**Approach**: +- Create `Meter` instance for the driver +- Add `Counter.Add()` and `Histogram.Record()` calls at each operation +- Export via OpenTelemetry SDK to Databricks backend + +**Pros**: +- Industry standard metrics API +- Built-in aggregation and export +- Native OTEL ecosystem support + +**Cons**: +- Still requires separate instrumentation alongside Activity +- Introduces new dependency (OpenTelemetry.Api.Metrics) +- Metrics and traces remain separate systems +- Manual correlation still needed +- Databricks export requires custom OTLP exporter + +**Why Not Chosen**: This still creates duplicate instrumentation points. The Activity-based approach allows us to derive metrics from existing Activity data, avoiding code duplication. + +--- + +### 11.3 Alternative 3: Log-Based Metrics + +**Description**: Write structured logs at key operations and extract metrics from logs. + +**Approach**: +- Use `ILogger` to log structured events +- Include metric-relevant fields (latency, result format, etc.) 
+- Backend log processor extracts metrics from log entries + +**Pros**: +- Simple implementation (just logging) +- No new infrastructure needed +- Flexible data collection + +**Cons**: +- High log volume (every operation logged) +- Backend processing complexity +- Delayed metrics (log ingestion lag) +- No built-in aggregation +- Difficult to correlate with distributed traces +- Privacy concerns (logs may contain sensitive data) + +**Why Not Chosen**: Log-based metrics are inefficient and lack the structure needed for real-time aggregation. They also complicate privacy compliance. + +--- + +### 11.4 Why Activity-Based Approach Was Chosen + +The Activity-based design was selected because it: + +**1. Leverages Existing Infrastructure** +- Driver already has comprehensive Activity instrumentation +- No new instrumentation points needed +- Reuses Activity's built-in timing and correlation + +**2. Single Source of Truth** +- Activity serves as the data model for both traces and metrics +- Automatic correlation between distributed traces and telemetry metrics +- Consistent data across all observability signals -### 11.3 Pros and Cons +**3. Minimal Code Changes** +- Only requires adding tags to existing activities +- No duplicate instrumentation code +- Lower maintenance burden -**Activity-Based Approach Pros**: -- ✅ **Less Code**: No duplicate instrumentation -- ✅ **Single Source of Truth**: Activity is the only data model -- ✅ **Better Correlation**: Activity context automatically propagates -- ✅ **Standards-Based**: Activity is the .NET standard for instrumentation -- ✅ **Easier Maintenance**: One system instead of two -- ✅ **OpenTelemetry Ready**: Works with any OTEL exporter +**4. Standards-Based** +- Activity is .NET's standard distributed tracing API +- Works seamlessly with OpenTelemetry ecosystem +- Compatible with existing APM tools -**Activity-Based Approach Cons**: -- ⚠️ **Activity Dependency**: Coupled to Activity API (but it's standard .NET) -- ⚠️ **Tag Limits**: Activities have tag size limits (but adequate for metrics) -- ⚠️ **Learning Curve**: Team needs to understand Activity API (but simpler than two systems) +**5. Performance Efficient** +- ActivityListener has minimal overhead +- No duplicate timing or data collection +- Non-blocking by design -**Separate Telemetry Approach Pros**: -- ✅ **Independent**: Not coupled to Activity -- ✅ **JDBC Parity**: Matches JDBC driver design +**6. Simplicity** +- Easier to understand (one system vs two) +- Easier to test (single instrumentation path) +- Easier to maintain (single codebase) -**Separate Telemetry Approach Cons**: -- ❌ **Duplicate Code**: Two instrumentation points -- ❌ **Two Data Models**: Activity + TelemetryEvent -- ❌ **Harder to Correlate**: Manual correlation needed -- ❌ **More Maintenance**: Two systems to maintain -- ❌ **More Complexity**: Understanding both systems +**Trade-offs Accepted**: +- Coupling to Activity API (acceptable - it's .NET standard) +- Activity tag size limits (adequate for our metrics needs) +- Requires understanding Activity API (but provides better developer experience overall) --- @@ -906,12 +907,12 @@ This ensures compatibility with OTEL ecosystem. ## Summary -This **Activity-based design** provides a cleaner, simpler approach to telemetry by: +This **Activity-based telemetry design** provides an efficient approach to collecting driver metrics by: -1. **Leveraging existing infrastructure** instead of building parallel systems -2. **Single instrumentation point** via Activity -3. 
**Standard .NET patterns** (Activity/ActivityListener) -4. **Less code to maintain** (no duplicate instrumentation) -5. **Better compatibility** with OpenTelemetry and APM tools +1. **Leveraging existing infrastructure**: Extends the driver's comprehensive Activity instrumentation +2. **Single instrumentation point**: Uses Activity as the unified data model for both tracing and metrics +3. **Standard .NET patterns**: Built on Activity/ActivityListener APIs that are platform standards +4. **Minimal code changes**: Only requires adding tags to existing activities +5. **Seamless integration**: Works natively with OpenTelemetry and APM tools -**Recommendation**: Use this Activity-based approach unless there's a compelling reason to maintain separate systems. +This design enables the Databricks ADBC driver to collect valuable usage metrics while maintaining code simplicity, high performance, and full compatibility with the .NET observability ecosystem. From 9d53b5e7420a5fbb1891b460fea21fed7b175246 Mon Sep 17 00:00:00 2001 From: Jade Wang Date: Tue, 28 Oct 2025 14:52:48 -0700 Subject: [PATCH 4/6] address comments --- .../Drivers/Databricks/Telemetry/prompts.txt | 11 - .../telemetry-activity-based-design.md | 527 ++- .../telemetry-integration-lld-design.md | 3565 ----------------- .../Telemetry/telemetry-lld-summary.md | 280 -- 4 files changed, 416 insertions(+), 3967 deletions(-) delete mode 100644 csharp/src/Drivers/Databricks/Telemetry/prompts.txt delete mode 100644 csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md delete mode 100644 csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md diff --git a/csharp/src/Drivers/Databricks/Telemetry/prompts.txt b/csharp/src/Drivers/Databricks/Telemetry/prompts.txt deleted file mode 100644 index db1e2fcd5d..0000000000 --- a/csharp/src/Drivers/Databricks/Telemetry/prompts.txt +++ /dev/null @@ -1,11 +0,0 @@ -1. "can you understand the content present in this google doc: {telemetry-design-doc-url}" - -2. "can you use google mcp" - -4. "can you check the databricks jdbc repo and understand how it is currently implemented" - -5. "now, lets go through the arrow adbc driver for databricks present at {project-location}/arrow-adbc/csharp/src/Drivers/Databricks, and understand its flow" - -6. "i want to create a low level design doc for adding telemetry to the databricks adbc driver. based on the context you have can you create one for me. make it a detailed one. example design doc: {github-url}/statement-execution-api-design.md ultrathink" - -7. "does all of the changes in the lld document come under the folder {project-location}/arrow-adbc/csharp/src/Drivers/Databricks or outside as well ? ultrathink" \ No newline at end of file diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md index 28cf957ec7..4d2647b9a7 100644 --- a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md +++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md @@ -31,8 +31,10 @@ This document outlines an **Activity-based telemetry design** that leverages the 7. [Privacy & Compliance](#7-privacy--compliance) 8. [Error Handling](#8-error-handling) 9. [Testing Strategy](#9-testing-strategy) -10. [Migration & Rollout](#10-migration--rollout) -11. [Alternatives Considered](#11-alternatives-considered) +10. [Alternatives Considered](#10-alternatives-considered) +11. 
[Implementation Checklist](#11-implementation-checklist) +12. [Open Questions](#12-open-questions) +13. [References](#13-references) --- @@ -236,17 +238,30 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry ```mermaid flowchart TD - A[Activity Stopped] --> B{Activity.OperationName} - B -->|Connection.Open| C[Emit Connection Event] - B -->|Statement.Execute| D[Aggregate by statement_id] - B -->|CloudFetch.Download| D - B -->|Statement.Complete| E[Emit Statement Event] - - D --> F{Batch Size Reached?} - F -->|Yes| G[Flush to Exporter] - F -->|No| H[Continue Buffering] + A[Activity Stopped] --> B{Determine EventType} + B -->|Connection.Open*| C[Map to ConnectionOpen] + B -->|Statement.*| D[Map to StatementExecution] + B -->|error.type tag present| E[Map to Error] + + C --> F[Emit Connection Event Immediately] + D --> G[Aggregate by statement_id] + E --> H[Emit Error Event Immediately] + + G --> I{Statement Complete?} + I -->|Yes| J[Emit Aggregated Statement Event] + I -->|No| K[Continue Buffering] + + J --> L{Batch Size Reached?} + L -->|Yes| M[Flush Batch to Exporter] + L -->|No| K ``` +**Key Behaviors:** +- **Connection events**: Emitted immediately (no aggregation needed) +- **Statement events**: Aggregated by `statement_id` until statement completes +- **Error events**: Emitted immediately +- **Child activities** (CloudFetch.Download, etc.): Metrics rolled up to parent statement activity + #### Contracts **Statement Aggregation**: @@ -305,13 +320,341 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry ## 4. Data Collection -### 4.1 Activity Tags for Metrics +### 4.1 Tag Definition System + +To ensure maintainability and explicit control over what data is collected and exported, all Activity tags are defined in a centralized tag definition system. + +#### Tag Definition Structure + +**Location**: `Telemetry/TagDefinitions/` + +``` +Telemetry/ +└── TagDefinitions/ + ├── TelemetryTag.cs # Tag metadata and annotations + ├── TelemetryEvent.cs # Event definitions with associated tags + ├── ConnectionOpenEvent.cs # Connection event tag definitions + ├── StatementExecutionEvent.cs # Statement event tag definitions + └── ErrorEvent.cs # Error event tag definitions +``` + +#### TelemetryTag Annotation + +**File**: `TagDefinitions/TelemetryTag.cs` + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TagDefinitions +{ + /// + /// Defines export scope for telemetry tags. + /// + [Flags] + internal enum TagExportScope + { + None = 0, + ExportLocal = 1, // Export to local diagnostics (file listener, etc.) + ExportDatabricks = 2, // Export to Databricks telemetry service + ExportAll = ExportLocal | ExportDatabricks + } + + /// + /// Attribute to annotate Activity tag definitions. + /// + [AttributeUsage(AttributeTargets.Field, AllowMultiple = false)] + internal sealed class TelemetryTagAttribute : Attribute + { + public string TagName { get; } + public TagExportScope ExportScope { get; set; } + public string? Description { get; set; } + public bool Required { get; set; } + + public TelemetryTagAttribute(string tagName) + { + TagName = tagName; + ExportScope = TagExportScope.ExportAll; + } + } +} +``` + +#### Event Tag Definitions + +**File**: `TagDefinitions/ConnectionOpenEvent.cs` + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TagDefinitions +{ + /// + /// Tag definitions for Connection.Open events. 
+    /// </summary>
+    internal static class ConnectionOpenEvent
+    {
+        public const string EventName = "Connection.Open";
+
+        // Standard tags
+        [TelemetryTag("workspace.id",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Databricks workspace ID",
+            Required = true)]
+        public const string WorkspaceId = "workspace.id";
+
+        [TelemetryTag("session.id",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Connection session ID",
+            Required = true)]
+        public const string SessionId = "session.id";
+
+        // Driver configuration tags
+        [TelemetryTag("driver.version",
+            ExportScope = TagExportScope.ExportAll,
+            Description = "ADBC driver version")]
+        public const string DriverVersion = "driver.version";
+
+        [TelemetryTag("driver.os",
+            ExportScope = TagExportScope.ExportAll,
+            Description = "Operating system")]
+        public const string DriverOS = "driver.os";
+
+        [TelemetryTag("driver.runtime",
+            ExportScope = TagExportScope.ExportAll,
+            Description = ".NET runtime version")]
+        public const string DriverRuntime = "driver.runtime";
+
+        // Feature flags
+        [TelemetryTag("feature.cloudfetch",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "CloudFetch enabled")]
+        public const string FeatureCloudFetch = "feature.cloudfetch";
+
+        [TelemetryTag("feature.lz4",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "LZ4 compression enabled")]
+        public const string FeatureLz4 = "feature.lz4";
+
+        // Sensitive tags - NOT exported to Databricks
+        [TelemetryTag("server.address",
+            ExportScope = TagExportScope.ExportLocal,
+            Description = "Workspace host (local diagnostics only)")]
+        public const string ServerAddress = "server.address";
+
+        /// <summary>
+        /// Get all tags that should be exported to Databricks.
+        /// </summary>
+        public static IReadOnlySet<string> GetDatabricksExportTags()
+        {
+            return new HashSet<string>
+            {
+                WorkspaceId,
+                SessionId,
+                DriverVersion,
+                DriverOS,
+                DriverRuntime,
+                FeatureCloudFetch,
+                FeatureLz4
+            };
+        }
+    }
+}
+```
+
+**File**: `TagDefinitions/StatementExecutionEvent.cs`
+
+```csharp
+namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TagDefinitions
+{
+    /// <summary>
+    /// Tag definitions for Statement execution events. 
+    /// </summary>
+    internal static class StatementExecutionEvent
+    {
+        public const string EventName = "Statement.Execute";
+
+        // Statement identification
+        [TelemetryTag("statement.id",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Statement execution ID",
+            Required = true)]
+        public const string StatementId = "statement.id";
+
+        [TelemetryTag("session.id",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Connection session ID",
+            Required = true)]
+        public const string SessionId = "session.id";
+
+        // Result format tags
+        [TelemetryTag("result.format",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Result format: inline, cloudfetch")]
+        public const string ResultFormat = "result.format";
+
+        [TelemetryTag("result.chunk_count",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Number of CloudFetch chunks")]
+        public const string ResultChunkCount = "result.chunk_count";
+
+        [TelemetryTag("result.bytes_downloaded",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Total bytes downloaded")]
+        public const string ResultBytesDownloaded = "result.bytes_downloaded";
+
+        [TelemetryTag("result.compression_enabled",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Compression enabled for results")]
+        public const string ResultCompressionEnabled = "result.compression_enabled";
+
+        // Polling metrics
+        [TelemetryTag("poll.count",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Number of status poll requests")]
+        public const string PollCount = "poll.count";
+
+        [TelemetryTag("poll.latency_ms",
+            ExportScope = TagExportScope.ExportDatabricks,
+            Description = "Total polling latency")]
+        public const string PollLatencyMs = "poll.latency_ms";
+
+        // Sensitive tags - NOT exported to Databricks
+        [TelemetryTag("db.statement",
+            ExportScope = TagExportScope.ExportLocal,
+            Description = "SQL query text (local diagnostics only)")]
+        public const string DbStatement = "db.statement";
+
+        /// <summary>
+        /// Get all tags that should be exported to Databricks.
+        /// </summary>
+        public static IReadOnlySet<string> GetDatabricksExportTags()
+        {
+            return new HashSet<string>
+            {
+                StatementId,
+                SessionId,
+                ResultFormat,
+                ResultChunkCount,
+                ResultBytesDownloaded,
+                ResultCompressionEnabled,
+                PollCount,
+                PollLatencyMs
+            };
+        }
+    }
+}
+```
+
+#### Tag Registry
+
+**File**: `TagDefinitions/TelemetryTagRegistry.cs`
+
+```csharp
+namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TagDefinitions
+{
+    /// <summary>
+    /// Central registry for all telemetry tags and events.
+    /// </summary>
+    internal static class TelemetryTagRegistry
+    {
+        /// <summary>
+        /// Get all tags allowed for Databricks export by event type.
+        /// </summary>
+        public static IReadOnlySet<string> GetDatabricksExportTags(TelemetryEventType eventType)
+        {
+            return eventType switch
+            {
+                TelemetryEventType.ConnectionOpen => ConnectionOpenEvent.GetDatabricksExportTags(),
+                TelemetryEventType.StatementExecution => StatementExecutionEvent.GetDatabricksExportTags(),
+                TelemetryEventType.Error => ErrorEvent.GetDatabricksExportTags(),
+                _ => new HashSet<string>()
+            };
+        }
+
+        /// <summary>
+        /// Check if a tag should be exported to Databricks for a given event type. 
+        /// </summary>
+        public static bool ShouldExportToDatabricks(TelemetryEventType eventType, string tagName)
+        {
+            var allowedTags = GetDatabricksExportTags(eventType);
+            return allowedTags.Contains(tagName);
+        }
+    }
+}
+```
+
+#### Usage in Activity Tag Filtering
+
+The `MetricsAggregator` uses the tag registry for filtering:
+
+```csharp
+private TelemetryMetric ProcessActivity(Activity activity)
+{
+    var eventType = DetermineEventType(activity);
+    var metric = new TelemetryMetric
+    {
+        EventType = eventType,
+        Timestamp = activity.StartTimeUtc
+    };
+
+    // Filter tags using the registry
+    foreach (var tag in activity.Tags)
+    {
+        if (TelemetryTagRegistry.ShouldExportToDatabricks(eventType, tag.Key))
+        {
+            // Export this tag
+            SetMetricProperty(metric, tag.Key, tag.Value);
+        }
+        // Tags not in registry are silently dropped
+    }
+
+    return metric;
+}
+```
+
+#### Benefits
+
+1. **Centralized Control**: All tags defined in one place
+2. **Explicit Export Scope**: Clear annotation of what goes where
+3. **Type Safety**: Constants prevent typos
+4. **Self-Documenting**: Descriptions embedded in code
+5. **Easy Auditing**: Simple to review what data is exported
+6. **Future-Proof**: New tags just require adding to definition files
+
+### 4.2 Activity Tags by Event Type
+
+#### Activity Operation Name to MetricType Mapping
 
-**Standard Activity Tags** (already exist):
-- `operation.name`: e.g., "Connection.Open", "Statement.Execute"
-- `db.operation`: SQL operation type
-- `db.statement`: Statement ID (not query text)
-- `server.address`: Databricks workspace host
+The `ActivityListener` maps Activity operation names to the Databricks `TelemetryEventType` enum:
+
+| Activity Operation Name | TelemetryEventType | Notes |
+|------------------------|-------------------|-------|
+| `Connection.Open` | `ConnectionOpen` | Emitted immediately when connection opens |
+| `Connection.OpenAsync` | `ConnectionOpen` | Same as above |
+| `Statement.Execute` | `StatementExecution` | Main statement execution activity |
+| `Statement.ExecuteQuery` | `StatementExecution` | Query execution variant |
+| `Statement.ExecuteUpdate` | `StatementExecution` | Update execution variant |
+| `CloudFetch.Download` | _(aggregated into parent)_ | Child activity, metrics rolled up to statement |
+| `CloudFetch.ChunkDownload` | _(aggregated into parent)_ | Child activity, metrics rolled up to statement |
+| `Results.Fetch` | _(aggregated into parent)_ | Child activity, metrics rolled up to statement |
+| _(any activity with `error.type` tag)_ | `Error` | Error events based on tag presence |
+
+**Mapping Logic** (in `MetricsAggregator`):
+```csharp
+private TelemetryEventType DetermineEventType(Activity activity)
+{
+    // Check for errors first
+    if (activity.GetTagItem("error.type") != null)
+        return TelemetryEventType.Error;
+
+    // Map based on operation name
+    var operationName = activity.OperationName;
+    if (operationName.StartsWith("Connection."))
+        return TelemetryEventType.ConnectionOpen;
+
+    if (operationName.StartsWith("Statement."))
+        return TelemetryEventType.StatementExecution;
+
+    // Default for unknown operations
+    return TelemetryEventType.StatementExecution;
+}
+```
 
 **New Tags for Metrics** (add to existing activities):
 - `result.format`: "inline" | "cloudfetch"
@@ -511,36 +854,48 @@ flowchart TD
 
 ### 7.2 Activity Tag Filtering
 
-The listener filters which tags to export:
+The listener filters tags using the centralized tag definition system:
 
 ```csharp
-private static readonly HashSet<string> AllowedTags = new()
-{
-    "result.format",
-    
"result.chunk_count", - "result.bytes_downloaded", - "poll.count", - "error.type", - "feature.cloudfetch", - // ... safe tags only -}; - -private void ProcessActivity(Activity activity) +private TelemetryMetric ProcessActivity(Activity activity) { - var metrics = new TelemetryMetric(); + var eventType = DetermineEventType(activity); + var metric = new TelemetryMetric { EventType = eventType }; foreach (var tag in activity.Tags) { - if (AllowedTags.Contains(tag.Key)) + // Use tag registry to determine if tag should be exported + if (TelemetryTagRegistry.ShouldExportToDatabricks(eventType, tag.Key)) { // Export this tag - metrics.AddTag(tag.Key, tag.Value); + SetMetricProperty(metric, tag.Key, tag.Value); } - // Sensitive tags silently dropped + // Tags not in registry are silently dropped for Databricks export + // But may still be exported to local diagnostics if marked ExportLocal } + + return metric; } ``` +**Tag Export Examples:** + +| Tag Name | ExportLocal | ExportDatabricks | Reason | +|----------|-------------|------------------|--------| +| `statement.id` | ✅ | ✅ | Safe UUID, needed for correlation | +| `result.format` | ✅ | ✅ | Safe enum value | +| `result.chunk_count` | ✅ | ✅ | Numeric metric | +| `driver.version` | ✅ | ✅ | Safe version string | +| `server.address` | ✅ | ❌ | May contain PII (workspace host) | +| `db.statement` | ✅ | ❌ | SQL query text (sensitive) | +| `user.name` | ❌ | ❌ | Personal information | + +This approach ensures: +- **Compile-time safety**: Tag names are constants +- **Explicit control**: Each tag's export scope is clearly defined +- **Easy auditing**: Single file to review for compliance +- **Future-proof**: New tags must be added to definitions (prevents accidental leaks) + ### 7.3 Compliance Same as original design: @@ -646,66 +1001,9 @@ Compare: --- -## 10. Migration & Rollout - -### 10.1 Rollout Phases - -#### Phase 1: Implementation (Weeks 1-2) - -**Goals**: -- Implement ActivityListener, MetricsAggregator, Exporter -- Add necessary tags to existing activities -- Unit tests with 90%+ coverage - -**Key Activities**: -- Identify which activities need additional tags -- Implement listener with filtering logic -- Implement aggregator with batching -- Add feature flag integration - -#### Phase 2: Internal Testing (Week 3) - -**Goals**: -- Deploy to internal Databricks environments -- Validate metrics in Lumberjack -- Performance testing - -**Success Criteria**: -- < 1% performance overhead -- Metrics appear in Lumberjack table -- No driver failures due to telemetry - -#### Phase 3: Beta Rollout (Weeks 4-5) - -**Goals**: -- Enable for 10% of workspaces via feature flag -- Monitor error rates and performance -- Collect feedback - -#### Phase 4: Full Rollout (Week 6) - -**Goals**: -- Enable for 100% of workspaces -- Monitor at scale - -### 10.2 Backward Compatibility - -**Guarantees**: -- ✅ Existing Activity-based tracing continues to work -- ✅ OpenTelemetry exporters still receive activities -- ✅ W3C Trace Context propagation unchanged -- ✅ No breaking API changes - -**Migration Path**: -- No migration needed for applications -- Listener is transparent to existing code -- Only adds tags to existing activities - ---- - -## 11. Alternatives Considered +## 10. Alternatives Considered -### 11.1 Alternative 1: Separate Telemetry System +### 10.1 Alternative 1: Separate Telemetry System **Description**: Create a dedicated telemetry collection system parallel to Activity infrastructure, with explicit TelemetryCollector and TelemetryExporter classes. 
@@ -731,7 +1029,7 @@ Compare: --- -### 11.2 Alternative 2: OpenTelemetry Metrics API Directly +### 10.2 Alternative 2: OpenTelemetry Metrics API Directly **Description**: Use OpenTelemetry's Metrics API (`Meter` and `Counter`/`Histogram`) directly in driver code. @@ -756,7 +1054,7 @@ Compare: --- -### 11.3 Alternative 3: Log-Based Metrics +### 10.3 Alternative 3: Log-Based Metrics **Description**: Write structured logs at key operations and extract metrics from logs. @@ -782,7 +1080,7 @@ Compare: --- -### 11.4 Why Activity-Based Approach Was Chosen +### 10.4 Why Activity-Based Approach Was Chosen The Activity-based design was selected because it: @@ -823,30 +1121,37 @@ The Activity-based design was selected because it: --- -## 12. Implementation Checklist +## 11. Implementation Checklist + +### Phase 1: Tag Definition System +- [ ] Create `TagDefinitions/TelemetryTag.cs` (attribute and enums) +- [ ] Create `TagDefinitions/ConnectionOpenEvent.cs` (connection tag definitions) +- [ ] Create `TagDefinitions/StatementExecutionEvent.cs` (statement tag definitions) +- [ ] Create `TagDefinitions/ErrorEvent.cs` (error tag definitions) +- [ ] Create `TagDefinitions/TelemetryTagRegistry.cs` (central registry) +- [ ] Add unit tests for tag registry -### Phase 1: Core Implementation +### Phase 2: Core Implementation - [ ] Create `DatabricksActivityListener` class -- [ ] Create `MetricsAggregator` class +- [ ] Create `MetricsAggregator` class (using tag registry for filtering) - [ ] Create `DatabricksTelemetryExporter` class (reuse from original design) -- [ ] Add necessary tags to existing activities -- [ ] Implement activity tag filtering (allowlist) +- [ ] Add necessary tags to existing activities (using defined constants) - [ ] Add feature flag integration -### Phase 2: Integration +### Phase 3: Integration - [ ] Initialize listener in `DatabricksConnection.OpenAsync()` - [ ] Stop listener in `DatabricksConnection.CloseAsync()` - [ ] Add configuration parsing from connection string - [ ] Add server feature flag check -### Phase 3: Testing +### Phase 4: Testing - [ ] Unit tests for ActivityListener - [ ] Unit tests for MetricsAggregator - [ ] Integration tests with real activities - [ ] Performance tests (overhead measurement) - [ ] Compatibility tests with OpenTelemetry -### Phase 4: Documentation +### Phase 5: Documentation - [ ] Update Activity instrumentation docs - [ ] Document new activity tags - [ ] Update configuration guide @@ -854,9 +1159,9 @@ The Activity-based design was selected because it: --- -## 13. Open Questions +## 12. Open Questions -### 13.1 Activity Tag Naming Conventions +### 12.1 Activity Tag Naming Conventions **Question**: Should we use OpenTelemetry semantic conventions for tag names? @@ -867,7 +1172,7 @@ The Activity-based design was selected because it: This ensures compatibility with OTEL ecosystem. -### 13.2 Statement Completion Detection +### 12.2 Statement Completion Detection **Question**: How do we know when a statement is complete for aggregation? @@ -878,7 +1183,7 @@ This ensures compatibility with OTEL ecosystem. **Recommendation**: Use activity completion - cleaner and automatic. -### 13.3 Performance Impact on Existing Activity Users +### 12.3 Performance Impact on Existing Activity Users **Question**: Will adding tags impact applications that already use Activity for tracing? @@ -889,15 +1194,15 @@ This ensures compatibility with OTEL ecosystem. --- -## 14. References +## 13. 
References -### 14.1 Related Documentation +### 13.1 Related Documentation - [.NET Activity API](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing) - [OpenTelemetry .NET](https://opentelemetry.io/docs/languages/net/) - [ActivityListener Documentation](https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.activitylistener) -### 14.2 Existing Code References +### 13.2 Existing Code References - `ActivityTrace.cs`: Existing Activity helper - `DatabricksAdbcActivitySource`: Existing ActivitySource diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md deleted file mode 100644 index 48be782635..0000000000 --- a/csharp/src/Drivers/Databricks/Telemetry/telemetry-integration-lld-design.md +++ /dev/null @@ -1,3565 +0,0 @@ -# Databricks ADBC Driver: Client Telemetry Integration - -## Executive Summary - -This document outlines the design for integrating client-side telemetry into the Databricks ADBC driver for C#. The telemetry system will collect operational metrics, performance data, and error information from the driver to enable proactive monitoring, usage analytics, and faster issue resolution. - -**Key Objectives:** -- Enable comprehensive observability of driver operations without impacting performance -- Collect usage insights (CloudFetch vs inline, driver configurations, error patterns) -- Track adoption of new features and configurations -- Provide proactive error monitoring to identify issues before customer reports -- Maintain compatibility with existing OpenTelemetry/Activity-based tracing - -**Design Principles:** -- **Non-blocking**: Telemetry operations must never block driver functionality -- **Privacy-first**: No PII or query data collected; schema curated for data residency compliance -- **Opt-out capable**: Users can disable telemetry via configuration -- **Server-controlled**: Feature flag support for server-side enable/disable -- **Backward compatible**: No breaking changes to existing driver API -- **OpenTelemetry aligned**: Leverage existing Activity infrastructure where possible - ---- - -## Table of Contents - -1. [Background & Motivation](#1-background--motivation) -2. [Requirements](#2-requirements) -3. [Architecture Overview](#3-architecture-overview) -4. [Telemetry Components](#4-telemetry-components) -5. [Data Schema](#5-data-schema) -6. [Collection Points](#6-collection-points) -7. [Export Mechanism](#7-export-mechanism) -8. [Configuration](#8-configuration) -9. [Privacy & Data Residency](#9-privacy--data-residency) -10. [Error Handling](#10-error-handling) -11. [Testing Strategy](#11-testing-strategy) -12. [Migration & Rollout](#12-migration--rollout) -13. [Alternatives Considered](#13-alternatives-considered) -14. [Open Questions](#14-open-questions) -15. [References](#15-references) - ---- - -## 1. 
Background & Motivation - -### 1.1 Current State - -The Databricks ADBC driver currently implements: -- **Activity-based tracing** via `ActivityTrace` and `ActivitySource` -- **W3C Trace Context propagation** for distributed tracing -- **Local file exporter** for debugging traces - -However, this approach has limitations: -- **No centralized aggregation**: Traces are local-only unless connected to external APM -- **Limited usage insights**: No visibility into driver configuration patterns -- **Reactive debugging**: Relies on customer-reported issues with trace files -- **No feature adoption metrics**: Cannot track usage of CloudFetch, Direct Results, etc. - -### 1.2 JDBC Driver Precedent - -The Databricks JDBC driver successfully implemented client telemetry with: -- **Comprehensive metrics**: Operation latency, chunk downloads, error rates -- **Configuration tracking**: Driver settings, auth types, proxy usage -- **Server-side control**: Feature flag to enable/disable telemetry -- **Centralized storage**: Data flows to `main.eng_lumberjack.prod_frontend_log_sql_driver_log` -- **Privacy compliance**: No PII, curated schema, Lumberjack data residency - -### 1.3 Key Gaps to Address - -1. **Proactive Monitoring**: Identify errors before customer escalation -2. **Usage Analytics**: Understand driver configuration patterns across customer base -3. **Feature Adoption**: Track uptake of CloudFetch, Direct Results, OAuth flows -4. **Performance Insights**: Client-side latency vs server-side metrics -5. **Error Patterns**: Common configuration mistakes, auth failures, network issues - ---- - -## 2. Requirements - -### 2.1 Functional Requirements - -| ID | Requirement | Priority | -|:---|:---|:---:| -| FR-1 | Collect driver configuration metadata (auth type, CloudFetch settings, etc.) | P0 | -| FR-2 | Track operation latency (connection open, statement execution, result fetching) | P0 | -| FR-3 | Record error events with error codes and context | P0 | -| FR-4 | Capture CloudFetch metrics (chunk downloads, retries, compression status) | P0 | -| FR-5 | Track result format usage (inline vs CloudFetch) | P1 | -| FR-6 | Support server-side feature flag to enable/disable telemetry | P0 | -| FR-7 | Provide client-side opt-out mechanism | P1 | -| FR-8 | Batch telemetry events to reduce network overhead | P0 | -| FR-9 | Export telemetry to Databricks telemetry service | P0 | -| FR-10 | Support both authenticated and unauthenticated telemetry endpoints | P0 | - -### 2.2 Non-Functional Requirements - -| ID | Requirement | Target | Priority | -|:---|:---|:---:|:---:| -| NFR-1 | Telemetry overhead < 1% of operation latency | < 1% | P0 | -| NFR-2 | Memory overhead < 10MB per connection | < 10MB | P0 | -| NFR-3 | Zero impact on driver operation if telemetry fails | 0 failures | P0 | -| NFR-4 | Telemetry export success rate | > 95% | P1 | -| NFR-5 | Batch flush latency | < 5s | P1 | -| NFR-6 | Support workspace-level disable | 100% | P0 | -| NFR-7 | No PII or query data collected | 0 PII | P0 | -| NFR-8 | Compatible with existing Activity tracing | 100% | P0 | - -### 2.3 Out of Scope - -- Distributed tracing (already covered by Activity/OpenTelemetry) -- Query result data collection -- Real-time alerting (server-side responsibility) -- Custom telemetry endpoints (only Databricks service) - ---- - -## 3. 
Architecture Overview - -### 3.1 High-Level Design - -``` -┌─────────────────────────────────────────────────────────────────┐ -│ ADBC Driver Operations │ -│ (Connection, Statement Execution, Result Fetching) │ -└─────────────────────────────────────────────────────────────────┘ - │ - │ Emit Events - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ TelemetryCollector │ -│ - Per-connection singleton │ -│ - Aggregates events by statement ID │ -│ - Non-blocking event ingestion │ -└─────────────────────────────────────────────────────────────────┘ - │ - │ Batch Events - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ TelemetryExporter │ -│ - Background export worker │ -│ - Periodic flush (configurable interval) │ -│ - Size-based flush (batch threshold) │ -│ - Connection close flush │ -└─────────────────────────────────────────────────────────────────┘ - │ - │ HTTP POST - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ Databricks Telemetry Service │ -│ Endpoints: │ -│ - /telemetry-ext (authenticated) │ -│ - /telemetry-unauth (unauthenticated - connection errors) │ -└─────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ Lumberjack Pipeline │ -│ Table: main.eng_lumberjack.prod_frontend_log_sql_driver_log │ -└─────────────────────────────────────────────────────────────────┘ -``` - -### 3.2 Component Interaction Flow - -```mermaid -sequenceDiagram - participant App as Application - participant Conn as DatabricksConnection - participant Stmt as DatabricksStatement - participant TC as TelemetryCollector - participant TE as TelemetryExporter - participant TS as Telemetry Service - - App->>Conn: OpenAsync() - Conn->>TC: Initialize(config) - TC->>TE: Start background worker - Conn->>TC: RecordConnectionOpen(latency, config) - - App->>Stmt: ExecuteQueryAsync() - Stmt->>TC: RecordStatementExecution(statementId, latency) - - loop CloudFetch Downloads - Stmt->>TC: RecordChunkDownload(chunkIndex, latency, size) - end - - Stmt->>TC: RecordStatementComplete(statementId) - - alt Batch size reached - TC->>TE: Flush batch - TE->>TS: POST /telemetry-ext - end - - App->>Conn: CloseAsync() - Conn->>TC: Flush all pending - TC->>TE: Force flush - TE->>TS: POST /telemetry-ext - TE->>TE: Stop worker -``` - -### 3.3 Integration with Existing Components - -The telemetry system will integrate with existing driver components: - -1. **DatabricksConnection**: - - Initialize telemetry collector on open - - Record connection configuration - - Flush telemetry on close - - Handle feature flag from server - -2. **DatabricksStatement**: - - Record statement execution metrics - - Track result format (inline vs CloudFetch) - - Capture operation latency - -3. **CloudFetchDownloader**: - - Record chunk download latency - - Track retry attempts - - Report compression status - -4. **Activity Infrastructure**: - - Leverage existing Activity context for correlation - - Add telemetry as Activity events for unified observability - - Maintain W3C trace context propagation - ---- - -## 4. Telemetry Components - -### 4.1 TelemetryCollector - -**Purpose**: Aggregate and buffer telemetry events per connection. 

**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryCollector`

**Responsibilities**:
- Accept telemetry events from driver operations
- Aggregate events by statement ID
- Buffer events for batching
- Provide non-blocking event ingestion
- Trigger flush on batch size or time threshold

**Interface**:
```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
{
    /// <summary>
    /// Collects and aggregates telemetry events for a connection.
    /// Thread-safe and non-blocking.
    /// </summary>
    internal sealed class TelemetryCollector : IDisposable
    {
        // Constructor
        public TelemetryCollector(
            DatabricksConnection connection,
            ITelemetryExporter exporter,
            TelemetryConfiguration config);

        // Event recording methods
        public void RecordConnectionOpen(
            TimeSpan latency,
            DriverConfiguration driverConfig);

        public void RecordStatementExecute(
            string statementId,
            TimeSpan latency,
            ExecutionResultFormat resultFormat);

        public void RecordChunkDownload(
            string statementId,
            int chunkIndex,
            TimeSpan latency,
            long bytesDownloaded,
            bool compressed);

        public void RecordOperationStatus(
            string statementId,
            int pollCount,
            TimeSpan totalLatency);

        public void RecordStatementComplete(string statementId);

        public void RecordError(
            string errorCode,
            string errorMessage,
            string? statementId = null,
            int? chunkIndex = null);

        // Flush methods
        public Task FlushAsync(CancellationToken cancellationToken = default);

        public Task FlushAllPendingAsync();

        // IDisposable
        public void Dispose();
    }
}
```

**Implementation Details**:

```csharp
internal sealed class TelemetryCollector : IDisposable
{
    private readonly DatabricksConnection _connection;
    private readonly ITelemetryExporter _exporter;
    private readonly TelemetryConfiguration _config;
    private readonly ConcurrentDictionary<string, StatementTelemetryData> _statementData;
    private readonly ConcurrentQueue<TelemetryEvent> _eventQueue;
    private readonly Timer? _flushTimer; // Nullable: only created when periodic flush is enabled
    private readonly SemaphoreSlim _flushLock;
    private long _lastFlushTime;
    private int _eventCount;
    private bool _disposed;

    public TelemetryCollector(
        DatabricksConnection connection,
        ITelemetryExporter exporter,
        TelemetryConfiguration config)
    {
        _connection = connection ?? throw new ArgumentNullException(nameof(connection));
        _exporter = exporter ?? throw new ArgumentNullException(nameof(exporter));
        _config = config ?? throw new ArgumentNullException(nameof(config));

        _statementData = new ConcurrentDictionary<string, StatementTelemetryData>();
        _eventQueue = new ConcurrentQueue<TelemetryEvent>();
        _flushLock = new SemaphoreSlim(1, 1);
        _lastFlushTime = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();

        // Start periodic flush timer
        if (_config.FlushIntervalMilliseconds > 0)
        {
            _flushTimer = new Timer(
                OnTimerFlush,
                null,
                _config.FlushIntervalMilliseconds,
                _config.FlushIntervalMilliseconds);
        }
    }

    public void RecordConnectionOpen(TimeSpan latency, DriverConfiguration driverConfig)
    {
        if (!_config.Enabled) return;

        var telemetryEvent = new TelemetryEvent
        {
            EventType = TelemetryEventType.ConnectionOpen,
            Timestamp = DateTimeOffset.UtcNow,
            OperationLatencyMs = (long)latency.TotalMilliseconds,
            DriverConfig = driverConfig,
            SessionId = _connection.SessionId,
            WorkspaceId = _connection.WorkspaceId
        };

        EnqueueEvent(telemetryEvent);
    }

    public void RecordStatementExecute(
        string statementId,
        TimeSpan latency,
        ExecutionResultFormat resultFormat)
    {
        if (!_config.Enabled || string.IsNullOrEmpty(statementId)) return;

        var stmtData = _statementData.GetOrAdd(
            statementId,
            _ => new StatementTelemetryData { StatementId = statementId });

        stmtData.ExecutionLatencyMs = (long)latency.TotalMilliseconds;
        stmtData.ResultFormat = resultFormat;
        stmtData.Timestamp = DateTimeOffset.UtcNow;
    }

    public void RecordChunkDownload(
        string statementId,
        int chunkIndex,
        TimeSpan latency,
        long bytesDownloaded,
        bool compressed)
    {
        if (!_config.Enabled || string.IsNullOrEmpty(statementId)) return;

        var stmtData = _statementData.GetOrAdd(
            statementId,
            _ => new StatementTelemetryData { StatementId = statementId });

        stmtData.ChunkDownloads.Add(new ChunkDownloadData
        {
            ChunkIndex = chunkIndex,
            LatencyMs = (long)latency.TotalMilliseconds,
            BytesDownloaded = bytesDownloaded,
            Compressed = compressed
        });

        stmtData.TotalChunks = Math.Max(stmtData.TotalChunks, chunkIndex + 1);
    }

    public void RecordStatementComplete(string statementId)
    {
        if (!_config.Enabled || string.IsNullOrEmpty(statementId)) return;

        if (_statementData.TryRemove(statementId, out var stmtData))
        {
            // Convert statement data to telemetry event
            var telemetryEvent = CreateStatementEvent(stmtData);
            EnqueueEvent(telemetryEvent);
        }
    }

    public void RecordError(
        string errorCode,
        string errorMessage,
        string? statementId = null,
        int? chunkIndex = null)
    {
        if (!_config.Enabled) return;

        var telemetryEvent = new TelemetryEvent
        {
            EventType = TelemetryEventType.Error,
            Timestamp = DateTimeOffset.UtcNow,
            ErrorCode = errorCode,
            ErrorMessage = errorMessage,
            StatementId = statementId,
            ChunkIndex = chunkIndex,
            SessionId = _connection.SessionId,
            WorkspaceId = _connection.WorkspaceId
        };

        EnqueueEvent(telemetryEvent);
    }

    private void EnqueueEvent(TelemetryEvent telemetryEvent)
    {
        // Bound the queue so a telemetry outage cannot grow memory without
        // limit (see Section 10.2.4): drop the oldest event when full.
        if (_eventCount >= _config.MaxQueueSize && _eventQueue.TryDequeue(out _))
        {
            Interlocked.Decrement(ref _eventCount);
        }

        _eventQueue.Enqueue(telemetryEvent);
        var count = Interlocked.Increment(ref _eventCount);

        // Trigger flush if batch size reached
        if (count >= _config.BatchSize)
        {
            _ = Task.Run(() => FlushAsync(CancellationToken.None));
        }
    }

    public async Task FlushAsync(CancellationToken cancellationToken = default)
    {
        if (_eventCount == 0) return;

        await _flushLock.WaitAsync(cancellationToken);
        try
        {
            var events = DequeueEvents();
            if (events.Count > 0)
            {
                await _exporter.ExportAsync(events, cancellationToken);
                _lastFlushTime = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
            }
        }
        catch (Exception ex)
        {
            // Log but don't throw - telemetry must not break driver
            Debug.WriteLine($"Telemetry flush failed: {ex.Message}");
        }
        finally
        {
            _flushLock.Release();
        }
    }

    public async Task FlushAllPendingAsync()
    {
        // Export all pending statement data
        foreach (var kvp in _statementData)
        {
            if (_statementData.TryRemove(kvp.Key, out var stmtData))
            {
                var telemetryEvent = CreateStatementEvent(stmtData);
                EnqueueEvent(telemetryEvent);
            }
        }

        // Flush event queue
        await FlushAsync(CancellationToken.None);
    }

    private List<TelemetryEvent> DequeueEvents()
    {
        var events = new List<TelemetryEvent>(_config.BatchSize);
        while (_eventQueue.TryDequeue(out var telemetryEvent) && events.Count < _config.BatchSize)
        {
            events.Add(telemetryEvent);
            Interlocked.Decrement(ref _eventCount);
        }
        return events;
    }

    private void OnTimerFlush(object? state)
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
        if (now - _lastFlushTime >= _config.FlushIntervalMilliseconds && _eventCount > 0)
        {
            _ = Task.Run(() => FlushAsync(CancellationToken.None));
        }
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;

        _flushTimer?.Dispose();

        // Flush all pending data synchronously on dispose
        FlushAllPendingAsync().GetAwaiter().GetResult();

        _flushLock?.Dispose();
    }
}
```
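
The collector above references two helpers that are not spelled out in this document: `StatementTelemetryData` (the per-statement accumulator) and `CreateStatementEvent` (its conversion into a `TelemetryEvent`). A minimal sketch is shown below; the field names mirror the Section 5 models and should be treated as assumptions until the schema is finalized.

```csharp
// Sketch only: per-statement accumulator used by TelemetryCollector.
// A ConcurrentBag is assumed because chunk downloads complete on
// parallel CloudFetch workers.
internal sealed class StatementTelemetryData
{
    public string StatementId { get; set; } = string.Empty;
    public DateTimeOffset Timestamp { get; set; }
    public long ExecutionLatencyMs { get; set; }
    public ExecutionResultFormat ResultFormat { get; set; }
    public int RetryCount { get; set; }
    public int PollCount { get; set; }
    public long TotalPollLatencyMs { get; set; }
    public int TotalChunks { get; set; }
    public ConcurrentBag<ChunkDownloadData> ChunkDownloads { get; } = new();
}

// Sketch only: folds the accumulator into a single TelemetryEvent that
// flows through the normal queue/flush path.
private TelemetryEvent CreateStatementEvent(StatementTelemetryData stmtData)
{
    return new TelemetryEvent
    {
        EventType = TelemetryEventType.StatementExecution,
        Timestamp = stmtData.Timestamp,
        StatementId = stmtData.StatementId,
        OperationLatencyMs = stmtData.ExecutionLatencyMs,
        SessionId = _connection.SessionId,
        WorkspaceId = _connection.WorkspaceId,
        SqlOperationData = new SqlOperationData
        {
            StatementId = stmtData.StatementId,
            ResultFormat = stmtData.ResultFormat,
            ExecutionLatencyMs = stmtData.ExecutionLatencyMs,
            RetryCount = stmtData.RetryCount,
            PollCount = stmtData.PollCount,
            TotalPollLatencyMs = stmtData.TotalPollLatencyMs,
            TotalChunks = stmtData.TotalChunks,
            ChunkDownloads = stmtData.ChunkDownloads.ToList()
        }
    };
}
```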

### 4.2 TelemetryExporter

**Purpose**: Export telemetry events to Databricks telemetry service.

**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryExporter`

**Responsibilities**:
- Serialize telemetry events to JSON
- Send HTTP POST requests to telemetry endpoints
- Handle authentication (OAuth tokens)
- Implement retry logic for transient failures
- Support circuit breaker pattern

**Interface**:
```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
{
    /// <summary>
    /// Exports telemetry events to the Databricks telemetry service.
    /// </summary>
    internal interface ITelemetryExporter
    {
        Task ExportAsync(
            IReadOnlyList<TelemetryEvent> events,
            CancellationToken cancellationToken = default);
    }

    internal sealed class TelemetryExporter : ITelemetryExporter
    {
        public TelemetryExporter(
            HttpClient httpClient,
            DatabricksConnection connection,
            TelemetryConfiguration config);

        public Task ExportAsync(
            IReadOnlyList<TelemetryEvent> events,
            CancellationToken cancellationToken = default);
    }
}
```

**Implementation Details**:

```csharp
internal sealed class TelemetryExporter : ITelemetryExporter
{
    private readonly HttpClient _httpClient;
    private readonly DatabricksConnection _connection;
    private readonly TelemetryConfiguration _config;
    private readonly JsonSerializerOptions _jsonOptions;
    private readonly CircuitBreaker? _circuitBreaker;

    private const string AuthenticatedPath = "/telemetry-ext";
    private const string UnauthenticatedPath = "/telemetry-unauth";

    public TelemetryExporter(
        HttpClient httpClient,
        DatabricksConnection connection,
        TelemetryConfiguration config)
    {
        _httpClient = httpClient ?? throw new ArgumentNullException(nameof(httpClient));
        _connection = connection ?? throw new ArgumentNullException(nameof(connection));
        _config = config ?? throw new ArgumentNullException(nameof(config));

        _jsonOptions = new JsonSerializerOptions
        {
            // SnakeCaseLower requires System.Text.Json 8.0 or later
            PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
            DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull,
            WriteIndented = false
        };

        if (_config.CircuitBreakerEnabled)
        {
            _circuitBreaker = new CircuitBreaker(
                _config.CircuitBreakerThreshold,
                _config.CircuitBreakerTimeout);
        }
    }

    public async Task ExportAsync(
        IReadOnlyList<TelemetryEvent> events,
        CancellationToken cancellationToken = default)
    {
        if (events == null || events.Count == 0) return;

        try
        {
            // Check circuit breaker
            if (_circuitBreaker != null && _circuitBreaker.IsOpen)
            {
                Debug.WriteLine("Telemetry circuit breaker is open, dropping events");
                return;
            }

            // Determine endpoint based on authentication status
            var isAuthenticated = _connection.IsAuthenticated;
            var path = isAuthenticated ? AuthenticatedPath : UnauthenticatedPath;
            var uri = new Uri(_connection.Host, path);

            // Create request payload
            var payload = CreateTelemetryRequest(events);
            var json = JsonSerializer.Serialize(payload, _jsonOptions);

            // HttpRequestMessage instances cannot be re-sent, so the retry
            // helper receives a factory that builds a fresh request (and
            // re-attaches authentication headers) for every attempt.
            async Task<HttpRequestMessage> CreateRequestAsync()
            {
                var httpRequest = new HttpRequestMessage(HttpMethod.Post, uri)
                {
                    Content = new StringContent(json, Encoding.UTF8, "application/json")
                };

                // Add authentication headers if authenticated
                if (isAuthenticated)
                {
                    await AddAuthenticationHeadersAsync(httpRequest, cancellationToken);
                }

                return httpRequest;
            }

            // Send request with retry
            var response = await SendWithRetryAsync(CreateRequestAsync, cancellationToken);

            // Handle response
            if (response.IsSuccessStatusCode)
            {
                _circuitBreaker?.RecordSuccess();

                // Parse response for partial failures
                var responseContent = await response.Content.ReadAsStringAsync(cancellationToken);
                var telemetryResponse = JsonSerializer.Deserialize<TelemetryResponse>(
                    responseContent,
                    _jsonOptions);

                if (telemetryResponse?.Errors?.Count > 0)
                {
                    Debug.WriteLine(
                        $"Telemetry partial failure: {telemetryResponse.Errors.Count} errors");
                }
            }
            else
            {
                _circuitBreaker?.RecordFailure();
                Debug.WriteLine(
                    $"Telemetry export failed: {response.StatusCode} - {response.ReasonPhrase}");
            }
        }
        catch (Exception ex)
        {
            _circuitBreaker?.RecordFailure();
            Debug.WriteLine($"Telemetry export exception: {ex.Message}");
            // Don't rethrow - telemetry must not break driver operations
        }
    }

    private TelemetryRequest CreateTelemetryRequest(IReadOnlyList<TelemetryEvent> events)
    {
        var protoLogs = events.Select(e => new TelemetryFrontendLog
        {
            WorkspaceId = e.WorkspaceId,
            FrontendLogEventId = Guid.NewGuid().ToString(),
            Context = new FrontendLogContext
            {
                ClientContext = new TelemetryClientContext
                {
                    TimestampMillis = e.Timestamp.ToUnixTimeMilliseconds(),
                    UserAgent = _connection.UserAgent
                }
            },
            Entry = new FrontendLogEntry
            {
                SqlDriverLog = CreateSqlDriverLog(e)
            }
        }).ToList();

        return new TelemetryRequest
        {
            ProtoLogs = protoLogs
        };
    }

    private SqlDriverLog CreateSqlDriverLog(TelemetryEvent e)
    {
        var log = new SqlDriverLog
        {
            SessionId = e.SessionId,
            SqlStatementId = e.StatementId,
            OperationLatencyMs = e.OperationLatencyMs,
            SystemConfiguration = e.DriverConfig != null
                ? CreateSystemConfiguration(e.DriverConfig)
                : null,
            DriverConnectionParams = e.DriverConfig != null
                ? CreateConnectionParameters(e.DriverConfig)
                : null
        };

        // Add SQL operation data if present
        if (e.SqlOperationData != null)
        {
            log.SqlOperation = new SqlExecutionEvent
            {
                ExecutionResult = e.SqlOperationData.ResultFormat.ToString(),
                RetryCount = e.SqlOperationData.RetryCount,
                ChunkDetails = e.SqlOperationData.ChunkDownloads?.Count > 0
                    ? CreateChunkDetails(e.SqlOperationData.ChunkDownloads)
                    : null
            };
        }

        // Add error info if present
        if (!string.IsNullOrEmpty(e.ErrorCode))
        {
            log.ErrorInfo = new DriverErrorInfo
            {
                ErrorName = e.ErrorCode,
                StackTrace = e.ErrorMessage
            };
        }

        return log;
    }

    private async Task<HttpResponseMessage> SendWithRetryAsync(
        Func<Task<HttpRequestMessage>> requestFactory,
        CancellationToken cancellationToken)
    {
        var retryCount = 0;
        var maxRetries = _config.MaxRetries;

        while (true)
        {
            // Build a fresh request per attempt; a previously sent
            // HttpRequestMessage cannot be sent again.
            using var request = await requestFactory();
            try
            {
                var response = await _httpClient.SendAsync(
                    request,
                    HttpCompletionOption.ResponseHeadersRead,
                    cancellationToken);

                // Don't retry on client errors (4xx)
                if ((int)response.StatusCode < 500)
                {
                    return response;
                }

                // Retry on server errors (5xx) if retries remaining
                if (retryCount >= maxRetries)
                {
                    return response;
                }
            }
            catch (HttpRequestException) when (retryCount < maxRetries)
            {
                // Retry on network errors
            }
            catch (TaskCanceledException) when (!cancellationToken.IsCancellationRequested && retryCount < maxRetries)
            {
                // Retry on timeout (not user cancellation)
            }

            retryCount++;
            var delay = TimeSpan.FromMilliseconds(_config.RetryDelayMs * Math.Pow(2, retryCount - 1));
            await Task.Delay(delay, cancellationToken);
        }
    }

    private async Task AddAuthenticationHeadersAsync(
        HttpRequestMessage request,
        CancellationToken cancellationToken)
    {
        // Use connection's authentication mechanism
        var authHeaders = await _connection.GetAuthenticationHeadersAsync(cancellationToken);
        foreach (var header in authHeaders)
        {
            request.Headers.TryAddWithoutValidation(header.Key, header.Value);
        }
    }
}
```

### 4.3 CircuitBreaker

**Purpose**: Prevent telemetry storms when service is unavailable.

**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.CircuitBreaker`

**Implementation**:
```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
{
    /// <summary>
    /// Circuit breaker to prevent telemetry storms.
    /// </summary>
    internal sealed class CircuitBreaker
    {
        private readonly int _failureThreshold;
        private readonly TimeSpan _timeout;
        private int _failureCount;
        private DateTime _lastFailureTime;
        private CircuitState _state;
        private readonly object _lock = new object();

        private enum CircuitState
        {
            Closed,    // Normal operation
            Open,      // Blocking requests
            HalfOpen   // Testing if service recovered
        }

        public CircuitBreaker(int failureThreshold, TimeSpan timeout)
        {
            _failureThreshold = failureThreshold;
            _timeout = timeout;
            _state = CircuitState.Closed;
        }

        public bool IsOpen
        {
            get
            {
                lock (_lock)
                {
                    // Auto-transition from Open to HalfOpen after timeout
                    if (_state == CircuitState.Open)
                    {
                        if (DateTime.UtcNow - _lastFailureTime > _timeout)
                        {
                            _state = CircuitState.HalfOpen;
                            return false;
                        }
                        return true;
                    }
                    return false;
                }
            }
        }

        public void RecordSuccess()
        {
            lock (_lock)
            {
                _failureCount = 0;
                _state = CircuitState.Closed;
            }
        }

        public void RecordFailure()
        {
            lock (_lock)
            {
                _failureCount++;
                _lastFailureTime = DateTime.UtcNow;

                if (_failureCount >= _failureThreshold)
                {
                    _state = CircuitState.Open;
                }
            }
        }
    }
}
```

### 4.4 TelemetryConfiguration

**Purpose**: Centralize all telemetry configuration.

**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryConfiguration`

**Implementation**:
```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
{
    /// <summary>
    /// Configuration for telemetry collection and export.
    /// </summary>
    public sealed class TelemetryConfiguration
    {
        // Enable/disable flags
        public bool Enabled { get; set; } = true;
        public bool ForceEnable { get; set; } = false; // Bypass feature flag

        // Batch configuration
        public int BatchSize { get; set; } = 50;
        public int FlushIntervalMilliseconds { get; set; } = 30000; // 30 seconds

        // Bounded in-memory event queue (see Section 10.2.4)
        public int MaxQueueSize { get; set; } = 1000;

        // Retry configuration
        public int MaxRetries { get; set; } = 3;
        public int RetryDelayMs { get; set; } = 500;

        // Circuit breaker configuration
        public bool CircuitBreakerEnabled { get; set; } = true;
        public int CircuitBreakerThreshold { get; set; } = 5;
        public TimeSpan CircuitBreakerTimeout { get; set; } = TimeSpan.FromMinutes(1);

        // Log level filtering
        public TelemetryLogLevel LogLevel { get; set; } = TelemetryLogLevel.Info;

        // Feature flag name
        public const string FeatureFlagName =
            "databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc";

        // Create from connection properties
        public static TelemetryConfiguration FromProperties(
            IReadOnlyDictionary<string, string> properties)
        {
            var config = new TelemetryConfiguration();

            if (properties.TryGetValue(DatabricksParameters.TelemetryEnabled, out var enabled))
            {
                config.Enabled = bool.Parse(enabled);
            }

            if (properties.TryGetValue(DatabricksParameters.TelemetryBatchSize, out var batchSize))
            {
                config.BatchSize = int.Parse(batchSize);
            }

            if (properties.TryGetValue(DatabricksParameters.TelemetryFlushIntervalMs, out var flushInterval))
            {
                config.FlushIntervalMilliseconds = int.Parse(flushInterval);
            }

            return config;
        }
    }

    public enum TelemetryLogLevel
    {
        Off = 0,
        Error = 1,
        Warn = 2,
        Info = 3,
        Debug = 4,
        Trace = 5
    }
}
```
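
To make the component interactions in Section 3.3 concrete, the sketch below shows one way `DatabricksConnection` could wire these pieces together. Member names such as `InitializeTelemetry`, `Properties`, and `_httpClient` are illustrative assumptions, not existing driver API.

```csharp
// Illustrative wiring inside DatabricksConnection (names are assumptions).
public partial class DatabricksConnection
{
    private TelemetryCollector? _telemetryCollector;
    private TelemetryConfiguration? _telemetryConfig;

    // Exposed internally so statements and CloudFetch can record events.
    internal TelemetryCollector? TelemetryCollector => _telemetryCollector;

    private void InitializeTelemetry()
    {
        // Resolve configuration from connection properties (Section 8).
        _telemetryConfig = TelemetryConfiguration.FromProperties(Properties);
        if (!_telemetryConfig.Enabled)
        {
            return; // Telemetry disabled client-side; nothing to wire up.
        }

        // Reuse the connection's HttpClient so proxy/TLS settings apply
        // to telemetry traffic as well.
        var exporter = new TelemetryExporter(_httpClient, this, _telemetryConfig);
        _telemetryCollector = new TelemetryCollector(this, exporter, _telemetryConfig);
    }
}
```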

---

## 5. Data Schema

### 5.1 Telemetry Event Model

```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.Models
{
    /// <summary>
    /// Base telemetry event.
    /// </summary>
    internal sealed class TelemetryEvent
    {
        public TelemetryEventType EventType { get; set; }
        public DateTimeOffset Timestamp { get; set; }
        public long? WorkspaceId { get; set; }
        public string? SessionId { get; set; }
        public string? StatementId { get; set; }
        public long? OperationLatencyMs { get; set; }

        // Driver configuration (connection events only)
        public DriverConfiguration? DriverConfig { get; set; }

        // SQL operation data (statement events only)
        public SqlOperationData? SqlOperationData { get; set; }

        // Error information (error events only)
        public string? ErrorCode { get; set; }
        public string? ErrorMessage { get; set; }
        public int? ChunkIndex { get; set; }
    }

    public enum TelemetryEventType
    {
        ConnectionOpen,
        StatementExecution,
        Error
    }
}
```

### 5.2 Driver Configuration Model

```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.Models
{
    /// <summary>
    /// Driver configuration snapshot (collected once per connection).
    /// </summary>
    internal sealed class DriverConfiguration
    {
        // System information
        public string? DriverName { get; set; } = "Databricks.ADBC.CSharp";
        public string? DriverVersion { get; set; }
        public string? OsName { get; set; }
        public string? OsVersion { get; set; }
        public string? RuntimeVersion { get; set; }
        public string? ProcessName { get; set; }

        // Connection configuration
        public string? AuthType { get; set; }
        public string? HostUrl { get; set; }
        public string? HttpPath { get; set; }

        // Feature flags
        public bool CloudFetchEnabled { get; set; }
        public bool Lz4DecompressionEnabled { get; set; }
        public bool DirectResultsEnabled { get; set; }
        public bool TracePropagationEnabled { get; set; }
        public bool MultipleCatalogSupport { get; set; }
        public bool PrimaryKeyForeignKeyEnabled { get; set; }

        // CloudFetch configuration
        public long MaxBytesPerFile { get; set; }
        public long MaxBytesPerFetchRequest { get; set; }
        public int MaxParallelDownloads { get; set; }
        public int PrefetchCount { get; set; }
        public int MemoryBufferSizeMb { get; set; }

        // Proxy configuration
        public bool UseProxy { get; set; }
        public string? ProxyHost { get; set; }
        public int? ProxyPort { get; set; }

        // Statement configuration
        public long BatchSize { get; set; }
        public int PollTimeMs { get; set; }

        // Direct results limits
        public long DirectResultMaxBytes { get; set; }
        public long DirectResultMaxRows { get; set; }
    }
}
```

### 5.3 SQL Operation Data Model

```csharp
namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.Models
{
    /// <summary>
    /// SQL operation metrics.
    /// </summary>
    internal sealed class SqlOperationData
    {
        public string? StatementId { get; set; }
        public ExecutionResultFormat ResultFormat { get; set; }
        public long ExecutionLatencyMs { get; set; }
        public int RetryCount { get; set; }
        public int PollCount { get; set; }
        public long TotalPollLatencyMs { get; set; }

        // CloudFetch specific
        public List<ChunkDownloadData>? ChunkDownloads { get; set; }
        public int TotalChunks { get; set; }
    }

    public enum ExecutionResultFormat
    {
        Unknown = 0,
        InlineArrow = 1,
        InlineJson = 2,
        ExternalLinks = 3  // CloudFetch
    }

    internal sealed class ChunkDownloadData
    {
        public int ChunkIndex { get; set; }
        public long LatencyMs { get; set; }
        public long BytesDownloaded { get; set; }
        public bool Compressed { get; set; }
    }
}
```

### 5.4 Server Payload Schema

The exported JSON payload matches the JDBC format for consistency:

```json
{
  "proto_logs": [
    {
      "workspace_id": 1234567890,
      "frontend_log_event_id": "550e8400-e29b-41d4-a716-446655440000",
      "context": {
        "client_context": {
          "timestamp_millis": 1698765432000,
          "user_agent": "Databricks-ADBC-CSharp/1.0.0"
        }
      },
      "entry": {
        "sql_driver_log": {
          "session_id": "01234567-89ab-cdef-0123-456789abcdef",
          "sql_statement_id": "01234567-89ab-cdef-0123-456789abcdef",
          "operation_latency_ms": 1234,
          "system_configuration": {
            "driver_name": "Databricks.ADBC.CSharp",
            "driver_version": "1.0.0",
            "os_name": "Windows",
            "os_version": "10.0.19042",
            "runtime_version": ".NET 8.0.0",
            "process_name": "PowerBI.Desktop"
          },
          "driver_connection_params": {
            "auth_type": "oauth_client_credentials",
            "cloudfetch_enabled": true,
            "lz4_decompression_enabled": true,
            "direct_results_enabled": true,
            "max_bytes_per_file": 20971520,
            "max_parallel_downloads": 3,
            "batch_size": 2000000
          },
          "sql_operation": {
            "execution_result": "EXTERNAL_LINKS",
            "retry_count": 0,
            "chunk_details": {
              "total_chunks": 10,
              "chunks_downloaded": 10,
              "total_download_latency_ms": 5432,
              "avg_chunk_size_bytes": 15728640,
              "compressed": true
            }
          },
          "error_info": null
        }
      }
    }
  ]
}
```
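
The exporter in Section 4.2 references wire-format types (`TelemetryRequest`, `TelemetryFrontendLog`, `TelemetryResponse`, and friends) that are not defined elsewhere in this document. A minimal sketch matching the payload above could look as follows; property names rely on the snake_case serializer policy from Section 4.2, and the shapes are assumptions pending alignment with the JDBC schema.

```csharp
// Sketch of the wire-format DTOs; fields mirror the JSON payload above.
// SqlDriverLog, SqlExecutionEvent, and DriverErrorInfo follow the same pattern.
internal sealed class TelemetryRequest
{
    public List<TelemetryFrontendLog>? ProtoLogs { get; set; }
}

internal sealed class TelemetryFrontendLog
{
    public long? WorkspaceId { get; set; }
    public string? FrontendLogEventId { get; set; }
    public FrontendLogContext? Context { get; set; }
    public FrontendLogEntry? Entry { get; set; }
}

internal sealed class FrontendLogContext
{
    public TelemetryClientContext? ClientContext { get; set; }
}

internal sealed class TelemetryClientContext
{
    public long TimestampMillis { get; set; }
    public string? UserAgent { get; set; }
}

internal sealed class FrontendLogEntry
{
    public SqlDriverLog? SqlDriverLog { get; set; }
}

internal sealed class TelemetryResponse
{
    public long NumProtoSuccess { get; set; }
    public List<TelemetryError>? Errors { get; set; }
}

internal sealed class TelemetryError
{
    public int Index { get; set; }
    public string? Message { get; set; }
}
```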

---

## 6. Collection Points

### 6.1 Connection Lifecycle Events

#### 6.1.1 Connection Open

**Location**: `DatabricksConnection.OpenAsync()`

**What to Collect**:
- Connection open latency
- Driver configuration snapshot
- Session ID
- Workspace ID

**Implementation**:
```csharp
public override async Task OpenAsync(CancellationToken cancellationToken = default)
{
    var sw = Stopwatch.StartNew();

    try
    {
        await base.OpenAsync(cancellationToken);

        // Initialize telemetry after successful connection
        InitializeTelemetry();

        sw.Stop();

        // Record connection open event
        _telemetryCollector?.RecordConnectionOpen(
            sw.Elapsed,
            CreateDriverConfiguration());
    }
    catch (Exception)
    {
        sw.Stop();
        // Error will be recorded by exception handler
        throw;
    }
}

private DriverConfiguration CreateDriverConfiguration()
{
    return new DriverConfiguration
    {
        DriverName = "Databricks.ADBC.CSharp",
        DriverVersion = GetType().Assembly.GetName().Version?.ToString(),
        OsName = Environment.OSVersion.Platform.ToString(),
        OsVersion = Environment.OSVersion.Version.ToString(),
        RuntimeVersion = Environment.Version.ToString(),
        ProcessName = Process.GetCurrentProcess().ProcessName,

        AuthType = DetermineAuthType(),
        HostUrl = Host?.Host,
        HttpPath = HttpPath,

        CloudFetchEnabled = UseCloudFetch,
        Lz4DecompressionEnabled = CanDecompressLz4,
        DirectResultsEnabled = _enableDirectResults,
        TracePropagationEnabled = _tracePropagationEnabled,
        MultipleCatalogSupport = _enableMultipleCatalogSupport,
        PrimaryKeyForeignKeyEnabled = _enablePKFK,

        MaxBytesPerFile = _maxBytesPerFile,
        MaxBytesPerFetchRequest = _maxBytesPerFetchRequest,
        MaxParallelDownloads = GetIntProperty(
            DatabricksParameters.CloudFetchParallelDownloads,
            3),
        PrefetchCount = GetIntProperty(
            DatabricksParameters.CloudFetchPrefetchCount,
            2),
        MemoryBufferSizeMb = GetIntProperty(
            DatabricksParameters.CloudFetchMemoryBufferSizeMb,
            200),

        UseProxy = Properties.ContainsKey(ApacheParameters.ProxyHost),
        ProxyHost = Properties.TryGetValue(ApacheParameters.ProxyHost, out var host)
            ? host
            : null,
        ProxyPort = Properties.TryGetValue(ApacheParameters.ProxyPort, out var port)
            ? int.Parse(port)
            : (int?)null,

        BatchSize = DatabricksStatement.DatabricksBatchSizeDefault,
        PollTimeMs = GetIntProperty(
            ApacheParameters.PollTimeMilliseconds,
            DatabricksConstants.DefaultAsyncExecPollIntervalMs),

        DirectResultMaxBytes = _directResultMaxBytes,
        DirectResultMaxRows = _directResultMaxRows
    };
}
```
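
`GetIntProperty` above is a small convenience helper that does not exist in the driver today; a plausible sketch, assuming a `Properties` string dictionary on the connection:

```csharp
// Hypothetical helper: read an integer connection property with a fallback.
private int GetIntProperty(string key, int defaultValue)
{
    return Properties.TryGetValue(key, out var raw) && int.TryParse(raw, out var parsed)
        ? parsed
        : defaultValue;
}
```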

#### 6.1.2 Connection Close

**Location**: `DatabricksConnection.Dispose()`

**What to Do**:
- Flush all pending telemetry
- Dispose telemetry collector

**Implementation**:
```csharp
public override void Dispose()
{
    try
    {
        // Flush telemetry before closing connection
        _telemetryCollector?.FlushAllPendingAsync().GetAwaiter().GetResult();
    }
    catch (Exception ex)
    {
        Debug.WriteLine($"Error flushing telemetry on connection close: {ex.Message}");
    }
    finally
    {
        _telemetryCollector?.Dispose();
        _telemetryCollector = null;

        base.Dispose();
    }
}
```

### 6.2 Statement Execution Events

#### 6.2.1 Statement Execute

**Location**: `DatabricksStatement.ExecuteQueryAsync()`

**What to Collect**:
- Statement execution latency
- Result format (inline vs CloudFetch)
- Statement ID

**Implementation** (the `DetermineErrorCode` helper is sketched after this subsection):
```csharp
protected override async Task<QueryResult> ExecuteQueryAsync(
    string? sqlQuery,
    CancellationToken cancellationToken = default)
{
    var sw = Stopwatch.StartNew();
    string? statementId = null;

    try
    {
        var result = await base.ExecuteQueryAsync(sqlQuery, cancellationToken);

        sw.Stop();
        statementId = result.StatementHandle?.ToSQLExecStatementId();

        // Determine result format
        var resultFormat = DetermineResultFormat(result);

        // Record statement execution
        Connection.TelemetryCollector?.RecordStatementExecute(
            statementId ?? Guid.NewGuid().ToString(),
            sw.Elapsed,
            resultFormat);

        return result;
    }
    catch (Exception ex)
    {
        sw.Stop();

        // Record error
        Connection.TelemetryCollector?.RecordError(
            DetermineErrorCode(ex),
            ex.Message,
            statementId);

        throw;
    }
}

private ExecutionResultFormat DetermineResultFormat(QueryResult result)
{
    if (result.DirectResult != null)
    {
        return ExecutionResultFormat.InlineArrow;
    }
    else if (result.ResultLinks != null && result.ResultLinks.Count > 0)
    {
        return ExecutionResultFormat.ExternalLinks;
    }
    else
    {
        return ExecutionResultFormat.Unknown;
    }
}
```

#### 6.2.2 Statement Close

**Location**: `DatabricksStatement.Dispose()`

**What to Do**:
- Mark statement as complete in telemetry

**Implementation**:
```csharp
public override void Dispose()
{
    try
    {
        // Mark statement complete (triggers export of aggregated metrics)
        if (!string.IsNullOrEmpty(_statementId))
        {
            Connection.TelemetryCollector?.RecordStatementComplete(_statementId);
        }
    }
    finally
    {
        base.Dispose();
    }
}
```
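
`DetermineErrorCode` is referenced above but not defined in this document; one hedged possibility is a simple exception-type mapping that stays PII-free:

```csharp
// Sketch: map exception types to coarse, PII-free error codes.
// The specific code strings are assumptions, not an agreed taxonomy.
private static string DetermineErrorCode(Exception ex) => ex switch
{
    OperationCanceledException => "OPERATION_CANCELLED",
    HttpRequestException => "NETWORK_ERROR",
    TimeoutException => "TIMEOUT",
    AdbcException adbc => adbc.Status.ToString(),
    _ => ex.GetType().Name
};
```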

### 6.3 CloudFetch Events

#### 6.3.1 Chunk Download

**Location**: `CloudFetchDownloader.DownloadFileAsync()`

**What to Collect**:
- Download latency per chunk
- Bytes downloaded
- Compression status
- Retry attempts

**Implementation**:
```csharp
private async Task DownloadFileAsync(
    IDownloadResult downloadResult,
    CancellationToken cancellationToken)
{
    var sw = Stopwatch.StartNew();
    var retryCount = 0;

    while (retryCount <= _maxRetries)
    {
        try
        {
            using var response = await _httpClient.GetAsync(
                downloadResult.Url,
                HttpCompletionOption.ResponseHeadersRead,
                cancellationToken);

            response.EnsureSuccessStatusCode();

            var contentLength = response.Content.Headers.ContentLength ?? 0;
            var stream = await response.Content.ReadAsStreamAsync(cancellationToken);

            // Decompress if needed
            if (_isLz4Compressed)
            {
                stream = LZ4Stream.Decode(stream);
            }

            // Copy to memory buffer
            await _memoryManager.ReserveAsync(contentLength, cancellationToken);
            var memoryStream = new MemoryStream();
            await stream.CopyToAsync(memoryStream, cancellationToken);

            sw.Stop();

            // Record successful download
            _statement.Connection.TelemetryCollector?.RecordChunkDownload(
                _statement.StatementId,
                downloadResult.ChunkIndex,
                sw.Elapsed,
                contentLength,
                _isLz4Compressed);

            downloadResult.SetData(memoryStream);
            return;
        }
        catch (Exception ex)
        {
            retryCount++;

            if (retryCount > _maxRetries)
            {
                sw.Stop();

                // Record download error
                _statement.Connection.TelemetryCollector?.RecordError(
                    "CHUNK_DOWNLOAD_ERROR",
                    ex.Message,
                    _statement.StatementId,
                    downloadResult.ChunkIndex);

                downloadResult.SetError(ex);
                throw;
            }

            await Task.Delay(_retryDelayMs * retryCount, cancellationToken);
        }
    }
}
```

#### 6.3.2 Operation Status Polling

**Location**: `DatabricksOperationStatusPoller.PollForCompletionAsync()`

**What to Collect**:
- Number of polls
- Total polling latency

**Implementation**:
```csharp
public async Task<TGetOperationStatusResp> PollForCompletionAsync(
    TOperationHandle operationHandle,
    CancellationToken cancellationToken = default)
{
    var sw = Stopwatch.StartNew();
    var pollCount = 0;

    try
    {
        TGetOperationStatusResp? statusResp = null;

        while (!cancellationToken.IsCancellationRequested)
        {
            statusResp = await GetOperationStatusAsync(operationHandle, cancellationToken);
            pollCount++;

            if (IsComplete(statusResp.OperationState))
            {
                break;
            }

            await Task.Delay(_pollIntervalMs, cancellationToken);
        }

        sw.Stop();

        // Record polling metrics
        _connection.TelemetryCollector?.RecordOperationStatus(
            operationHandle.OperationId?.Guid.ToString() ??
string.Empty, - pollCount, - sw.Elapsed); - - return statusResp!; - } - catch (Exception) - { - sw.Stop(); - throw; - } -} -``` - -### 6.4 Error Events - -#### 6.4.1 Exception Handler Integration - -**Location**: Throughout driver code - -**What to Collect**: -- Error code/type -- Error message (sanitized) -- Statement ID (if available) -- Chunk index (for download errors) - -**Implementation Pattern**: -```csharp -try -{ - // Driver operation -} -catch (DatabricksException ex) -{ - Connection.TelemetryCollector?.RecordError( - ex.ErrorCode, - SanitizeErrorMessage(ex.Message), - statementId, - chunkIndex); - - throw; -} -catch (AdbcException ex) -{ - Connection.TelemetryCollector?.RecordError( - ex.Status.ToString(), - SanitizeErrorMessage(ex.Message), - statementId); - - throw; -} -catch (Exception ex) -{ - Connection.TelemetryCollector?.RecordError( - "UNKNOWN_ERROR", - SanitizeErrorMessage(ex.Message), - statementId); - - throw; -} - -private static string SanitizeErrorMessage(string message) -{ - // Remove potential PII from error messages - // - Remove connection strings - // - Remove auth tokens - // - Remove file paths containing usernames - // - Keep only first 500 characters - - var sanitized = message; - - // Remove anything that looks like a connection string - sanitized = Regex.Replace( - sanitized, - @"token=[^;]+", - "token=***", - RegexOptions.IgnoreCase); - - // Remove Bearer tokens - sanitized = Regex.Replace( - sanitized, - @"Bearer\s+[A-Za-z0-9\-._~+/]+=*", - "Bearer ***", - RegexOptions.IgnoreCase); - - // Truncate to 500 characters - if (sanitized.Length > 500) - { - sanitized = sanitized.Substring(0, 500) + "..."; - } - - return sanitized; -} -``` - ---- - -## 7. Export Mechanism - -### 7.1 Export Flow - -``` -┌─────────────────────────────────────────────────────────────────┐ -│ Driver Operations │ -│ (Emit events to TelemetryCollector) │ -└─────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ TelemetryCollector │ -│ - Buffer events in ConcurrentQueue │ -│ - Aggregate statement metrics in ConcurrentDictionary │ -│ - Track batch size and time since last flush │ -└─────────────────────────────────────────────────────────────────┘ - │ - ┌─────────────┼─────────────┐ - │ │ │ - ▼ ▼ ▼ - Batch Size Time Based Connection Close - Threshold Periodic Flush - Reached Flush - │ │ │ - └─────────────┼─────────────┘ - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ TelemetryExporter │ -│ 1. Check circuit breaker state │ -│ 2. Serialize events to JSON │ -│ 3. Create HTTP POST request │ -│ 4. Add authentication headers (if authenticated) │ -│ 5. Send with retry logic │ -│ 6. 
Update circuit breaker on success/failure │ -└─────────────────────────────────────────────────────────────────┘ - │ - ▼ HTTP POST -┌─────────────────────────────────────────────────────────────────┐ -│ Databricks Telemetry Service │ -│ Endpoints: │ -│ - POST /telemetry-ext (authenticated) │ -│ Auth: OAuth token from connection │ -│ - POST /telemetry-unauth (unauthenticated) │ -│ For pre-authentication errors only │ -└─────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ Lumberjack Pipeline │ -│ - Regional Logfood │ -│ - Central Logfood │ -│ - Table: main.eng_lumberjack.prod_frontend_log_sql_driver_log │ -└─────────────────────────────────────────────────────────────────┘ -``` - -### 7.2 Export Triggers - -#### 7.2.1 Batch Size Threshold - -```csharp -private void EnqueueEvent(TelemetryEvent telemetryEvent) -{ - _eventQueue.Enqueue(telemetryEvent); - var count = Interlocked.Increment(ref _eventCount); - - // Trigger flush if batch size reached - if (count >= _config.BatchSize) - { - _ = Task.Run(() => FlushAsync(CancellationToken.None)); - } -} -``` - -**Default**: 50 events per batch -**Rationale**: Balance between export frequency and network overhead - -#### 7.2.2 Time-Based Periodic Flush - -```csharp -private void OnTimerFlush(object? state) -{ - var now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds(); - if (now - _lastFlushTime >= _config.FlushIntervalMilliseconds && _eventCount > 0) - { - _ = Task.Run(() => FlushAsync(CancellationToken.None)); - } -} -``` - -**Default**: 30 seconds -**Rationale**: Ensure events are exported even with low event rate - -#### 7.2.3 Connection Close Flush - -```csharp -public void Dispose() -{ - if (_disposed) return; - _disposed = true; - - _flushTimer?.Dispose(); - - // Flush all pending data synchronously on dispose - FlushAllPendingAsync().GetAwaiter().GetResult(); - - _flushLock?.Dispose(); -} -``` - -**Behavior**: Synchronous flush to ensure no data loss on connection close - -### 7.3 Retry Strategy - -**Exponential Backoff with Jitter**: - -```csharp -private async Task SendWithRetryAsync( - HttpRequestMessage request, - CancellationToken cancellationToken) -{ - var retryCount = 0; - var maxRetries = _config.MaxRetries; - var random = new Random(); - - while (true) - { - try - { - var response = await _httpClient.SendAsync( - request, - HttpCompletionOption.ResponseHeadersRead, - cancellationToken); - - // Don't retry on client errors (4xx) - if ((int)response.StatusCode < 500) - { - return response; - } - - // Retry on server errors (5xx) if retries remaining - if (retryCount >= maxRetries) - { - return response; - } - } - catch (HttpRequestException) when (retryCount < maxRetries) - { - // Retry on network errors - } - catch (TaskCanceledException) when (!cancellationToken.IsCancellationRequested && retryCount < maxRetries) - { - // Retry on timeout (not user cancellation) - } - - retryCount++; - - // Exponential backoff with jitter - var baseDelay = _config.RetryDelayMs * Math.Pow(2, retryCount - 1); - var jitter = random.Next(0, (int)(baseDelay * 0.1)); // 10% jitter - var delay = TimeSpan.FromMilliseconds(baseDelay + jitter); - - await Task.Delay(delay, cancellationToken); - } -} -``` - -**Parameters**: -- Base delay: 500ms -- Max retries: 3 -- Exponential multiplier: 2 -- Jitter: 10% of base delay - -**Retry Conditions**: -- ✅ 5xx server errors -- ✅ Network errors (HttpRequestException) -- ✅ Timeouts (TaskCanceledException, not user cancellation) 
-- ❌ 4xx client errors (don't retry) -- ❌ User cancellation - -### 7.4 Circuit Breaker - -**Purpose**: Prevent telemetry storms when service is degraded - -**State Transitions**: - -``` - Closed ──────────────────┐ - │ │ - │ Failure threshold │ Success - │ reached │ - ▼ │ - Open ◄────┐ │ - │ │ │ - │ │ Failure │ - │ │ during │ - │ │ half-open │ - │ │ │ - │ Timeout │ - │ expired │ - ▼ │ │ - HalfOpen ──┴──────────────┘ -``` - -**Configuration**: -- Failure threshold: 5 consecutive failures -- Timeout: 60 seconds -- State check: On every export attempt - -**Behavior**: -- **Closed**: Normal operation, all exports attempted -- **Open**: Drop all events, no export attempts -- **HalfOpen**: Allow one export to test if service recovered - ---- - -## 8. Configuration - -### 8.1 Connection Parameters - -Add new ADBC connection parameters in `DatabricksParameters.cs`: - -```csharp -namespace Apache.Arrow.Adbc.Drivers.Databricks -{ - public static partial class DatabricksParameters - { - // Telemetry enable/disable - public const string TelemetryEnabled = "adbc.databricks.telemetry.enabled"; - - // Force enable (bypass feature flag) - public const string TelemetryForceEnable = "adbc.databricks.telemetry.force_enable"; - - // Batch configuration - public const string TelemetryBatchSize = "adbc.databricks.telemetry.batch_size"; - public const string TelemetryFlushIntervalMs = "adbc.databricks.telemetry.flush_interval_ms"; - - // Retry configuration - public const string TelemetryMaxRetries = "adbc.databricks.telemetry.max_retries"; - public const string TelemetryRetryDelayMs = "adbc.databricks.telemetry.retry_delay_ms"; - - // Circuit breaker configuration - public const string TelemetryCircuitBreakerEnabled = "adbc.databricks.telemetry.circuit_breaker.enabled"; - public const string TelemetryCircuitBreakerThreshold = "adbc.databricks.telemetry.circuit_breaker.threshold"; - public const string TelemetryCircuitBreakerTimeoutSec = "adbc.databricks.telemetry.circuit_breaker.timeout_sec"; - - // Log level filtering - public const string TelemetryLogLevel = "adbc.databricks.telemetry.log_level"; - } -} -``` - -### 8.2 Default Values - -| Parameter | Default | Description | -|:---|:---|:---| -| `adbc.databricks.telemetry.enabled` | `true` | Enable/disable telemetry collection | -| `adbc.databricks.telemetry.force_enable` | `false` | Bypass server-side feature flag | -| `adbc.databricks.telemetry.batch_size` | `50` | Number of events per batch | -| `adbc.databricks.telemetry.flush_interval_ms` | `30000` | Flush interval in milliseconds | -| `adbc.databricks.telemetry.max_retries` | `3` | Maximum retry attempts | -| `adbc.databricks.telemetry.retry_delay_ms` | `500` | Base retry delay in milliseconds | -| `adbc.databricks.telemetry.circuit_breaker.enabled` | `true` | Enable circuit breaker | -| `adbc.databricks.telemetry.circuit_breaker.threshold` | `5` | Failure threshold | -| `adbc.databricks.telemetry.circuit_breaker.timeout_sec` | `60` | Open state timeout in seconds | -| `adbc.databricks.telemetry.log_level` | `Info` | Minimum log level (Off/Error/Warn/Info/Debug/Trace) | - -### 8.3 Example Configuration - -#### JSON Configuration File - -```json -{ - "adbc.connection.host": "https://my-workspace.databricks.com", - "adbc.connection.auth_type": "oauth", - "adbc.databricks.oauth.client_id": "my-client-id", - "adbc.databricks.oauth.client_secret": "my-secret", - - "adbc.databricks.telemetry.enabled": "true", - "adbc.databricks.telemetry.batch_size": "100", - "adbc.databricks.telemetry.flush_interval_ms": 
"60000", - "adbc.databricks.telemetry.log_level": "Info" -} -``` - -#### Programmatic Configuration - -```csharp -var properties = new Dictionary -{ - [DatabricksParameters.HostName] = "https://my-workspace.databricks.com", - [DatabricksParameters.AuthType] = "oauth", - [DatabricksParameters.OAuthClientId] = "my-client-id", - [DatabricksParameters.OAuthClientSecret] = "my-secret", - - [DatabricksParameters.TelemetryEnabled] = "true", - [DatabricksParameters.TelemetryBatchSize] = "100", - [DatabricksParameters.TelemetryFlushIntervalMs] = "60000", - [DatabricksParameters.TelemetryLogLevel] = "Info" -}; - -using var driver = new DatabricksDriver(); -using var database = driver.Open(properties); -using var connection = database.Connect(); -``` - -#### Disable Telemetry - -```csharp -var properties = new Dictionary -{ - // ... other properties ... - [DatabricksParameters.TelemetryEnabled] = "false" -}; -``` - -### 8.4 Server-Side Feature Flag - -**Feature Flag Name**: `databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc` - -**Checking Logic**: - -```csharp -private async Task IsTelemetryEnabledByServerAsync(CancellationToken cancellationToken) -{ - // Check client-side force enable first - if (_config.ForceEnable) - { - return true; - } - - try - { - // Query server for feature flag - // This happens during ApplyServerSidePropertiesAsync() - var query = $"SELECT * FROM databricks_client_config WHERE key = '{TelemetryConfiguration.FeatureFlagName}'"; - - using var statement = Connection.CreateStatement(); - using var reader = await statement.ExecuteQueryAsync(query, cancellationToken); - - if (await reader.ReadAsync(cancellationToken)) - { - var value = reader.GetString(1); // value column - return bool.TryParse(value, out var enabled) && enabled; - } - } - catch (Exception ex) - { - Debug.WriteLine($"Failed to check telemetry feature flag: {ex.Message}"); - // Default to enabled if check fails - return true; - } - - // Default to enabled - return true; -} -``` - -**Integration in Connection**: - -```csharp -internal async Task ApplyServerSidePropertiesAsync(CancellationToken cancellationToken = default) -{ - await base.ApplyServerSidePropertiesAsync(cancellationToken); - - // Check telemetry feature flag - if (_telemetryConfig != null && _telemetryConfig.Enabled) - { - var serverEnabled = await IsTelemetryEnabledByServerAsync(cancellationToken); - if (!serverEnabled) - { - _telemetryConfig.Enabled = false; - _telemetryCollector?.Dispose(); - _telemetryCollector = null; - } - } -} -``` - ---- - -## 9. 
Privacy & Data Residency - -### 9.1 Privacy Principles - -**No PII Collection**: -- ❌ Query text -- ❌ Query results -- ❌ Table names -- ❌ Column names -- ❌ User identifiers (beyond workspace/session IDs) -- ❌ IP addresses -- ❌ File paths with usernames -- ❌ Authentication credentials - -**What We Collect**: -- ✅ Operation latency metrics -- ✅ Driver configuration settings -- ✅ Error codes and sanitized messages -- ✅ Result format (inline vs CloudFetch) -- ✅ System information (OS, runtime version) -- ✅ Session and statement IDs (UUIDs) - -### 9.2 Data Sanitization - -**Error Message Sanitization**: - -```csharp -private static string SanitizeErrorMessage(string message) -{ - // Remove connection strings - message = Regex.Replace( - message, - @"token=[^;]+", - "token=***", - RegexOptions.IgnoreCase); - - // Remove Bearer tokens - message = Regex.Replace( - message, - @"Bearer\s+[A-Za-z0-9\-._~+/]+=*", - "Bearer ***", - RegexOptions.IgnoreCase); - - // Remove client secrets - message = Regex.Replace( - message, - @"client_secret=[^&\s]+", - "client_secret=***", - RegexOptions.IgnoreCase); - - // Remove basic auth - message = Regex.Replace( - message, - @"Basic\s+[A-Za-z0-9+/]+=*", - "Basic ***", - RegexOptions.IgnoreCase); - - // Remove file paths with usernames (Windows/Unix) - message = Regex.Replace( - message, - @"C:\\Users\\[^\\]+", - "C:\\Users\\***", - RegexOptions.IgnoreCase); - - message = Regex.Replace( - message, - @"/home/[^/]+", - "/home/***"); - - message = Regex.Replace( - message, - @"/Users/[^/]+", - "/Users/***"); - - // Truncate to 500 characters - if (message.Length > 500) - { - message = message.Substring(0, 500) + "..."; - } - - return message; -} -``` - -**Configuration Sanitization**: - -```csharp -private DriverConfiguration CreateDriverConfiguration() -{ - var config = new DriverConfiguration - { - // ... populate config ... - - // Sanitize sensitive fields - HostUrl = SanitizeUrl(_connection.Host?.Host), - ProxyHost = SanitizeUrl(_connection.ProxyHost) - }; - - return config; -} - -private static string? SanitizeUrl(string? 
url)
{
    if (string.IsNullOrEmpty(url)) return url;

    try
    {
        var uri = new Uri(url);
        // Return only host and scheme, no credentials or query params
        return $"{uri.Scheme}://{uri.Host}";
    }
    catch
    {
        return "***";
    }
}
```

### 9.3 Data Residency Compliance

**Lumberjack Integration**:

The Databricks telemetry service integrates with Lumberjack, which handles:
- **Data residency**: Logs stored in region-appropriate storage
- **Encryption**: At-rest and in-transit encryption
- **Retention**: Automated retention policies
- **Compliance**: GDPR, CCPA, HIPAA compliance

**Regional Processing**:

```
┌────────────────────────────────────────────────────────────┐
│                     US-based Client                        │
└────────────────────────────────────────────────────────────┘
                            │
                            ▼ POST /telemetry-ext
┌────────────────────────────────────────────────────────────┐
│                     US Control Plane                       │
│                  - Telemetry Service                       │
└────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────────┐
│                   US Regional Logfood                      │
│                   (US-based storage)                       │
└────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────────┐
│                    Central Logfood                         │
│                  (Global aggregation)                      │
└────────────────────────────────────────────────────────────┘
```

**No Cross-Region Data Transfer**:
- Telemetry sent to workspace's control plane region
- Processed and stored within that region
- Central aggregation respects data residency rules

### 9.4 Opt-Out Mechanisms

**Client-Side Opt-Out**:

```csharp
// Disable via connection properties
properties[DatabricksParameters.TelemetryEnabled] = "false";

// Or via JSON config
{
    "adbc.databricks.telemetry.enabled": "false"
}
```

**Server-Side Opt-Out**:

```sql
-- Workspace administrator can disable
SET databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc = false;
```

**Environment Variable Opt-Out**:

```bash
# Set environment variable to disable globally
export DATABRICKS_TELEMETRY_ENABLED=false
```

**Implementation**:

```csharp
private static bool IsTelemetryEnabled(IReadOnlyDictionary<string, string> properties)
{
    // Check environment variable first
    var envVar = Environment.GetEnvironmentVariable("DATABRICKS_TELEMETRY_ENABLED");
    if (!string.IsNullOrEmpty(envVar) && bool.TryParse(envVar, out var envEnabled))
    {
        return envEnabled;
    }

    // Check connection properties
    if (properties.TryGetValue(DatabricksParameters.TelemetryEnabled, out var propValue))
    {
        return bool.TryParse(propValue, out var propEnabled) && propEnabled;
    }

    // Default to enabled
    return true;
}
```

---

## 10. Error Handling

### 10.1 Principles

1. **Never Block Driver Operations**: Telemetry failures must not impact driver functionality
2. **Fail Silently**: Log errors but don't throw exceptions
3. **Degrade Gracefully**: Circuit breaker prevents cascading failures
4. **No Retry Storms**: Exponential backoff with circuit breaker

### 10.2 Error Scenarios

#### 10.2.1 Telemetry Service Unavailable

**Scenario**: Telemetry endpoint returns 503 Service Unavailable

**Handling**:
```csharp
try
{
    var response = await _httpClient.SendAsync(request, cancellationToken);

    if (response.StatusCode == HttpStatusCode.ServiceUnavailable)
    {
        _circuitBreaker?.RecordFailure();
        Debug.WriteLine("Telemetry service unavailable, will retry");
        return;
    }
}
catch (HttpRequestException ex)
{
    _circuitBreaker?.RecordFailure();
    Debug.WriteLine($"Telemetry HTTP error: {ex.Message}");
    // Don't throw - fail silently
}
```

**Result**: Circuit breaker opens after threshold, drops subsequent events until service recovers

#### 10.2.2 Network Timeout

**Scenario**: HTTP request times out

**Handling**:
```csharp
try
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    cts.CancelAfter(TimeSpan.FromSeconds(10)); // 10 second timeout

    var response = await _httpClient.SendAsync(request, cts.Token);
}
catch (TaskCanceledException) when (!cancellationToken.IsCancellationRequested)
{
    // Timeout, not user cancellation
    Debug.WriteLine("Telemetry request timeout, will retry");
    // Retry logic handles this
}
```

**Result**: Retry with exponential backoff, eventually give up if persistent

#### 10.2.3 Serialization Error

**Scenario**: JSON serialization fails for telemetry event

**Handling**:
```csharp
try
{
    var json = JsonSerializer.Serialize(request, _jsonOptions);
}
catch (JsonException ex)
{
    Debug.WriteLine($"Telemetry serialization error: {ex.Message}");
    // Skip this batch, don't crash
    return;
}
```

**Result**: Drop problematic events, continue with next batch

#### 10.2.4 Out of Memory

**Scenario**: Too many telemetry events buffered in memory

**Handling**:
```csharp
private void EnqueueEvent(TelemetryEvent telemetryEvent)
{
    // Check queue size limit
    if (_eventQueue.Count >= _config.MaxQueueSize)
    {
        Debug.WriteLine("Telemetry queue full, dropping oldest event");
        _eventQueue.TryDequeue(out _); // Drop oldest
    }

    _eventQueue.Enqueue(telemetryEvent);
}
```

**Configuration**: `MaxQueueSize = 1000` (default)

**Result**: FIFO queue with bounded size, drops oldest events when full

#### 10.2.5 Partial Failure Response

**Scenario**: Server accepts some events but rejects others

**Handling**:
```csharp
var telemetryResponse = JsonSerializer.Deserialize<TelemetryResponse>(
    responseContent,
    _jsonOptions);

if (telemetryResponse?.Errors?.Count > 0)
{
    Debug.WriteLine(
        $"Telemetry partial failure: {telemetryResponse.NumProtoSuccess} succeeded, " +
        $"{telemetryResponse.Errors.Count} failed");

    // Log details about failures
    foreach (var error in telemetryResponse.Errors)
    {
        Debug.WriteLine($"  - Event {error.Index}: {error.Message}");
    }

    // Don't retry individual events - too complex
    // Accept partial success
}
```

**Result**: Accept partial success, log details for debugging

### 10.3 Error Logging

**Debug Output**:
```csharp
// Use Debug.WriteLine for telemetry errors (not visible in production)
Debug.WriteLine($"Telemetry error: {ex.Message}");
```

**Activity Integration**:
```csharp
try
{
    await ExportAsync(events, cancellationToken);
}
catch (Exception ex)
{
    // Add telemetry error as Activity event (if tracing enabled)
    Activity.Current?.AddEvent(new ActivityEvent(
        "telemetry.export.failed",
        tags: new ActivityTagsCollection
        {
            { "error.type", ex.GetType().Name },
            { "error.message", ex.Message },
            { "event.count", events.Count }
        }));

    Debug.WriteLine($"Telemetry export failed: {ex.Message}");
}
```

**Result**: Telemetry errors captured in traces (if enabled) but don't affect driver

---
## 11. Testing Strategy

### 11.1 Unit Tests

#### 11.1.1 TelemetryCollector Tests

**File**: `TelemetryCollectorTests.cs`

```csharp
[TestClass]
public class TelemetryCollectorTests
{
    private Mock<ITelemetryExporter> _mockExporter;
    private Mock<DatabricksConnection> _mockConnection;
    private TelemetryConfiguration _config;
    private TelemetryCollector _collector;

    [TestInitialize]
    public void Setup()
    {
        _mockExporter = new Mock<ITelemetryExporter>();
        _mockConnection = new Mock<DatabricksConnection>();
        _config = new TelemetryConfiguration
        {
            Enabled = true,
            BatchSize = 10,
            FlushIntervalMilliseconds = 1000
        };

        _collector = new TelemetryCollector(
            _mockConnection.Object,
            _mockExporter.Object,
            _config);
    }

    [TestMethod]
    public void RecordConnectionOpen_AddsEventToQueue()
    {
        // Arrange
        var latency = TimeSpan.FromMilliseconds(100);
        var driverConfig = new DriverConfiguration();

        // Act
        _collector.RecordConnectionOpen(latency, driverConfig);

        // Assert
        // Verify event was queued (internal queue is private, so check via flush)
        _collector.FlushAsync().Wait();
        _mockExporter.Verify(
            e => e.ExportAsync(
                It.Is<IReadOnlyList<TelemetryEvent>>(list => list.Count == 1),
                It.IsAny<CancellationToken>()),
            Times.Once);
    }

    [TestMethod]
    public void RecordStatementExecute_AggregatesMetrics()
    {
        // Arrange
        var statementId = Guid.NewGuid().ToString();
        var latency = TimeSpan.FromMilliseconds(200);
        var resultFormat = ExecutionResultFormat.ExternalLinks;

        // Act
        _collector.RecordStatementExecute(statementId, latency, resultFormat);
        _collector.RecordStatementComplete(statementId);

        // Assert
        _collector.FlushAsync().Wait();
        _mockExporter.Verify(
            e => e.ExportAsync(
                It.Is<IReadOnlyList<TelemetryEvent>>(list =>
                    list.Count == 1 &&
                    list[0].SqlOperationData.ExecutionLatencyMs == 200),
                It.IsAny<CancellationToken>()),
            Times.Once);
    }

    [TestMethod]
    public async Task FlushAsync_TriggeredOnBatchSizeThreshold()
    {
        // Arrange - BatchSize is 10
        var driverConfig = new DriverConfiguration();

        // Act - Add 10 events
        for (int i = 0; i < 10; i++)
        {
            _collector.RecordConnectionOpen(TimeSpan.FromMilliseconds(i), driverConfig);
        }

        // Wait for async flush to complete
        await Task.Delay(100);

        // Assert
        _mockExporter.Verify(
            e => e.ExportAsync(
                It.Is<IReadOnlyList<TelemetryEvent>>(list => list.Count == 10),
                It.IsAny<CancellationToken>()),
            Times.Once);
    }

    [TestMethod]
    public async Task FlushAsync_TriggeredOnTimeInterval()
    {
        // Arrange - FlushIntervalMilliseconds is 1000
        var driverConfig = new DriverConfiguration();
        _collector.RecordConnectionOpen(TimeSpan.FromMilliseconds(100), driverConfig);

        // Act - Wait for timer to trigger flush
        await Task.Delay(1500);

        // Assert
        _mockExporter.Verify(
            e => e.ExportAsync(
                It.IsAny<IReadOnlyList<TelemetryEvent>>(),
                It.IsAny<CancellationToken>()),
            Times.AtLeastOnce);
    }

    [TestMethod]
    public void Dispose_FlushesAllPendingEvents()
    {
        // Arrange
        var driverConfig = new DriverConfiguration();
        _collector.RecordConnectionOpen(TimeSpan.FromMilliseconds(100), driverConfig);

        // Act
        _collector.Dispose();

        // Assert
        _mockExporter.Verify(
            e => e.ExportAsync(
                It.Is<IReadOnlyList<TelemetryEvent>>(list => list.Count > 0),
                It.IsAny<CancellationToken>()),
            Times.Once);
    }

    [TestMethod]
    public void RecordError_CreatesErrorEvent()
    {
        // Arrange
        var errorCode = "CONNECTION_ERROR";
        var errorMessage = "Failed to connect";
        var statementId = Guid.NewGuid().ToString();

        // Act
        _collector.RecordError(errorCode, errorMessage, statementId);

        // Assert
        _collector.FlushAsync().Wait();
        _mockExporter.Verify(
            e => e.ExportAsync(
                It.Is<IReadOnlyList<TelemetryEvent>>(list =>
                    list.Count == 1 &&
                    list[0].ErrorCode == errorCode),
                It.IsAny<CancellationToken>()),
            Times.Once);
    }
}
```

#### 11.1.2 TelemetryExporter Tests

**File**: `TelemetryExporterTests.cs`

```csharp
[TestClass]
public class TelemetryExporterTests
{
    private Mock<HttpMessageHandler> _mockHttpHandler;
    private HttpClient _httpClient;
    private Mock<DatabricksConnection> _mockConnection;
    private TelemetryConfiguration _config;
    private TelemetryExporter _exporter;

    [TestInitialize]
    public void Setup()
    {
        _mockHttpHandler = new Mock<HttpMessageHandler>();
        _httpClient = new HttpClient(_mockHttpHandler.Object);
        _mockConnection = new Mock<DatabricksConnection>();
        _mockConnection.Setup(c => c.Host).Returns(new Uri("https://test.databricks.com"));
        _mockConnection.Setup(c => c.IsAuthenticated).Returns(true);

        _config = new TelemetryConfiguration
        {
            Enabled = true,
            MaxRetries = 3,
            RetryDelayMs = 100,
            CircuitBreakerEnabled = true
        };

        _exporter = new TelemetryExporter(_httpClient, _mockConnection.Object, _config);
    }

    [TestMethod]
    public async Task ExportAsync_SendsEventsToCorrectEndpoint()
    {
        // Arrange
        var events = new List<TelemetryEvent>
        {
            new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen }
        };

        _mockHttpHandler
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(new HttpResponseMessage
            {
                StatusCode = HttpStatusCode.OK,
                Content = new StringContent("{\"num_proto_success\": 1, \"errors\": []}")
            });

        // Act
        await _exporter.ExportAsync(events);

        // Assert
        _mockHttpHandler.Protected().Verify(
            "SendAsync",
            Times.Once(),
            ItExpr.Is<HttpRequestMessage>(req =>
                req.Method == HttpMethod.Post &&
                req.RequestUri.AbsolutePath == "/telemetry-ext"),
            ItExpr.IsAny<CancellationToken>());
    }

    [TestMethod]
    public async Task ExportAsync_RetriesOnServerError()
    {
        // Arrange
        var events = new List<TelemetryEvent>
        {
            new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen }
        };

        var callCount = 0;
        _mockHttpHandler
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(() =>
            {
                callCount++;
                if (callCount < 3)
                {
                    return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);
                }
                return new HttpResponseMessage
                {
                    StatusCode = HttpStatusCode.OK,
                    Content = new StringContent("{\"num_proto_success\": 1, \"errors\": []}")
                };
            });

        // Act
        await _exporter.ExportAsync(events);

        // Assert
        Assert.AreEqual(3, callCount, "Should retry twice before succeeding");
    }

    [TestMethod]
    public async Task ExportAsync_DoesNotRetryOnClientError()
    {
        // Arrange
        var events = new List<TelemetryEvent>
        {
            new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen }
        };

        _mockHttpHandler
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(new HttpResponseMessage(HttpStatusCode.BadRequest));

        // Act
        await _exporter.ExportAsync(events);

        // Assert - Should only try once
        _mockHttpHandler.Protected().Verify(
            "SendAsync",
            Times.Once(),
            ItExpr.IsAny<HttpRequestMessage>(),
            ItExpr.IsAny<CancellationToken>());
    }

    [TestMethod]
    public async Task ExportAsync_DoesNotThrowOnFailure()
    {
        // Arrange
        var events = new List<TelemetryEvent>
        {
            new TelemetryEvent { EventType = TelemetryEventType.ConnectionOpen }
        };

        _mockHttpHandler
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ThrowsAsync(new HttpRequestException("Network error"));

        // Act & Assert - Should not throw
        await _exporter.ExportAsync(events);
    }
}
```

#### 11.1.3 CircuitBreaker Tests

**File**: `CircuitBreakerTests.cs`

```csharp
[TestClass]
public class CircuitBreakerTests
{
    [TestMethod]
    public void IsOpen_ReturnsFalseInitially()
    {
        // Arrange
        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromSeconds(60));

        // Assert
        Assert.IsFalse(cb.IsOpen);
    }

    [TestMethod]
    public void IsOpen_ReturnsTrueAfterThresholdFailures()
    {
        // Arrange
        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromSeconds(60));

        // Act
        cb.RecordFailure();
        cb.RecordFailure();
        cb.RecordFailure();

        // Assert
        Assert.IsTrue(cb.IsOpen);
    }

    [TestMethod]
    public void IsOpen_TransitionsToHalfOpenAfterTimeout()
    {
        // Arrange
        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromMilliseconds(100));

        // Act
        cb.RecordFailure();
        cb.RecordFailure();
        cb.RecordFailure();
        Assert.IsTrue(cb.IsOpen);

        // Wait for timeout
        Thread.Sleep(150);

        // Assert
        Assert.IsFalse(cb.IsOpen); // Transitions to HalfOpen, returns false
    }

    [TestMethod]
    public void RecordSuccess_ResetsCircuitBreaker()
    {
        // Arrange
        var cb = new CircuitBreaker(failureThreshold: 3, timeout: TimeSpan.FromSeconds(60));

        // Act
        cb.RecordFailure();
        cb.RecordFailure();
        cb.RecordSuccess(); // Reset
        cb.RecordFailure();

        // Assert - Should still be closed (only 1 failure after reset)
        Assert.IsFalse(cb.IsOpen);
    }
}
```

### 11.2 Integration Tests

#### 11.2.1 End-to-End Telemetry Flow

**File**: `TelemetryIntegrationTests.cs`

```csharp
[TestClass]
public class TelemetryIntegrationTests
{
    private const string TestConnectionString = "..."; // Real Databricks workspace

    [TestMethod]
    [TestCategory("Integration")]
    public async Task ConnectionOpen_SendsTelemetry()
    {
        // Arrange
        var properties = new Dictionary<string, string>
        {
            // ... connection properties ...
            [DatabricksParameters.TelemetryEnabled] = "true",
            [DatabricksParameters.TelemetryBatchSize] = "1", // Immediate flush
        };

        // Act
        using var driver = new DatabricksDriver();
        using var database = driver.Open(properties);
        using var connection = (DatabricksConnection)database.Connect();

        // Give telemetry time to export
        await Task.Delay(1000);

        // Assert - Check that telemetry was sent (via logs or server-side query)
        // This requires access to telemetry table or mock endpoint
    }

    [TestMethod]
    [TestCategory("Integration")]
    public async Task StatementExecution_SendsTelemetry()
    {
        // Arrange
        var properties = new Dictionary<string, string>
        {
            // ... connection properties ...
            [DatabricksParameters.TelemetryEnabled] = "true",
        };

        // Act
        using var driver = new DatabricksDriver();
        using var database = driver.Open(properties);
        using var connection = database.Connect();
        using var statement = connection.CreateStatement();

        var reader = await statement.ExecuteQueryAsync("SELECT 1 AS test");
        await reader.ReadAsync();

        // Close to flush telemetry
        connection.Dispose();

        // Assert - Verify telemetry sent
    }

    [TestMethod]
    [TestCategory("Integration")]
    public async Task CloudFetchDownload_SendsTelemetry()
    {
        // Arrange - Query that returns CloudFetch results
        var properties = new Dictionary<string, string>
        {
            // ... connection properties ...
            [DatabricksParameters.TelemetryEnabled] = "true",
            [DatabricksParameters.CloudFetchEnabled] = "true",
        };

        // Act
        using var driver = new DatabricksDriver();
        using var database = driver.Open(properties);
        using var connection = database.Connect();
        using var statement = connection.CreateStatement();

        // Query large result set to trigger CloudFetch
        var reader = await statement.ExecuteQueryAsync("SELECT * FROM large_table LIMIT 1000000");

        while (await reader.ReadAsync())
        {
            // Consume results
        }

        connection.Dispose();

        // Assert - Verify chunk download telemetry sent
    }
}
```
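
The integration tests above need a way to observe exported events without querying the Lumberjack table. A small in-memory `ITelemetryExporter` double (an assumption, not part of the driver) would make the assertions concrete when the collector is constructed with it directly:

```csharp
// Test double: captures exported events in memory for assertions.
internal sealed class CapturingTelemetryExporter : ITelemetryExporter
{
    private readonly ConcurrentQueue<TelemetryEvent> _exported = new();

    public IReadOnlyCollection<TelemetryEvent> Exported => _exported;

    public Task ExportAsync(
        IReadOnlyList<TelemetryEvent> events,
        CancellationToken cancellationToken = default)
    {
        foreach (var e in events)
        {
            _exported.Enqueue(e);
        }
        return Task.CompletedTask;
    }
}
```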
- [DatabricksParameters.TelemetryEnabled] = "true", - [DatabricksParameters.CloudFetchEnabled] = "true", - }; - - // Act - using var driver = new DatabricksDriver(); - using var database = driver.Open(properties); - using var connection = database.Connect(); - using var statement = connection.CreateStatement(); - - // Query large result set to trigger CloudFetch - var reader = await statement.ExecuteQueryAsync("SELECT * FROM large_table LIMIT 1000000"); - - while (await reader.ReadAsync()) - { - // Consume results - } - - connection.Dispose(); - - // Assert - Verify chunk download telemetry sent - } -} -``` - -### 11.3 Performance Tests - -#### 11.3.1 Telemetry Overhead - -**File**: `TelemetryPerformanceTests.cs` - -```csharp -[TestClass] -public class TelemetryPerformanceTests -{ - [TestMethod] - [TestCategory("Performance")] - public async Task TelemetryOverhead_LessThan1Percent() - { - // Arrange - var propertiesWithTelemetry = new Dictionary - { - // ... connection properties ... - [DatabricksParameters.TelemetryEnabled] = "true", - }; - - var propertiesWithoutTelemetry = new Dictionary - { - // ... connection properties ... - [DatabricksParameters.TelemetryEnabled] = "false", - }; - - const int iterations = 100; - - // Act - Measure with telemetry - var swWithTelemetry = Stopwatch.StartNew(); - for (int i = 0; i < iterations; i++) - { - using var driver = new DatabricksDriver(); - using var database = driver.Open(propertiesWithTelemetry); - using var connection = database.Connect(); - using var statement = connection.CreateStatement(); - var reader = await statement.ExecuteQueryAsync("SELECT 1"); - await reader.ReadAsync(); - } - swWithTelemetry.Stop(); - - // Act - Measure without telemetry - var swWithoutTelemetry = Stopwatch.StartNew(); - for (int i = 0; i < iterations; i++) - { - using var driver = new DatabricksDriver(); - using var database = driver.Open(propertiesWithoutTelemetry); - using var connection = database.Connect(); - using var statement = connection.CreateStatement(); - var reader = await statement.ExecuteQueryAsync("SELECT 1"); - await reader.ReadAsync(); - } - swWithoutTelemetry.Stop(); - - // Assert - Overhead should be < 1% - var overhead = (double)(swWithTelemetry.ElapsedMilliseconds - swWithoutTelemetry.ElapsedMilliseconds) - / swWithoutTelemetry.ElapsedMilliseconds; - - Console.WriteLine($"Telemetry overhead: {overhead:P2}"); - Assert.IsTrue(overhead < 0.01, $"Overhead {overhead:P2} exceeds 1% threshold"); - } - - [TestMethod] - [TestCategory("Performance")] - public void MemoryUsage_LessThan10MB() - { - // Arrange - var properties = new Dictionary - { - // ... connection properties ... 
- [DatabricksParameters.TelemetryEnabled] = "true", - [DatabricksParameters.TelemetryBatchSize] = "10000", // Large batch to accumulate memory - }; - - var initialMemory = GC.GetTotalMemory(forceFullCollection: true); - - // Act - Generate lots of telemetry events - using var driver = new DatabricksDriver(); - using var database = driver.Open(properties); - using var connection = (DatabricksConnection)database.Connect(); - - for (int i = 0; i < 1000; i++) - { - connection.TelemetryCollector?.RecordError( - "TEST_ERROR", - "Test error message", - Guid.NewGuid().ToString()); - } - - var finalMemory = GC.GetTotalMemory(forceFullCollection: false); - var memoryUsed = (finalMemory - initialMemory) / (1024 * 1024); // MB - - // Assert - Console.WriteLine($"Memory used: {memoryUsed}MB"); - Assert.IsTrue(memoryUsed < 10, $"Memory usage {memoryUsed}MB exceeds 10MB threshold"); - } -} -``` - -### 11.4 Mock Endpoint Testing - -**File**: `MockTelemetryEndpointTests.cs` - -```csharp -[TestClass] -public class MockTelemetryEndpointTests -{ - private TestServer _testServer; - private HttpClient _httpClient; - - [TestInitialize] - public void Setup() - { - // Create in-memory test server - _testServer = new TestServer(new WebHostBuilder() - .ConfigureServices(services => { }) - .Configure(app => - { - app.Run(async context => - { - if (context.Request.Path == "/telemetry-ext") - { - // Mock telemetry endpoint - var body = await new StreamReader(context.Request.Body).ReadToEndAsync(); - - // Validate request structure - var request = JsonSerializer.Deserialize(body); - - // Return success response - context.Response.StatusCode = 200; - await context.Response.WriteAsJsonAsync(new TelemetryResponse - { - NumProtoSuccess = request?.ProtoLogs?.Count ?? 0, - Errors = new List() - }); - } - }); - })); - - _httpClient = _testServer.CreateClient(); - } - - [TestCleanup] - public void Cleanup() - { - _httpClient?.Dispose(); - _testServer?.Dispose(); - } - - [TestMethod] - public async Task ExportAsync_SendsCorrectPayload() - { - // Arrange - var mockConnection = new Mock(); - mockConnection.Setup(c => c.Host).Returns(new Uri(_testServer.BaseAddress, "/")); - mockConnection.Setup(c => c.IsAuthenticated).Returns(true); - - var config = new TelemetryConfiguration { Enabled = true }; - var exporter = new TelemetryExporter(_httpClient, mockConnection.Object, config); - - var events = new List - { - new TelemetryEvent - { - EventType = TelemetryEventType.ConnectionOpen, - WorkspaceId = 123456, - SessionId = Guid.NewGuid().ToString(), - OperationLatencyMs = 100, - DriverConfig = new DriverConfiguration - { - DriverName = "Test", - DriverVersion = "1.0.0" - } - } - }; - - // Act - await exporter.ExportAsync(events); - - // Assert - Test server validated request structure - } -} -``` - ---- - -## 12. 
Migration & Rollout - -### 12.1 Rollout Phases - -#### Phase 1: Development & Testing (Weeks 1-3) - -**Goals**: -- Implement core telemetry components -- Add unit tests (80% coverage target) -- Test with mock endpoints - -**Deliverables**: -- `TelemetryCollector` implementation -- `TelemetryExporter` implementation -- `CircuitBreaker` implementation -- Unit test suite -- Mock endpoint tests - -**Success Criteria**: -- All unit tests pass -- Code coverage > 80% -- Performance overhead < 1% -- Memory usage < 10MB - -#### Phase 2: Internal Dogfooding (Weeks 4-5) - -**Goals**: -- Deploy to internal staging environment -- Test with real Databricks workspaces -- Validate telemetry data in Lumberjack - -**Configuration**: -```json -{ - "adbc.databricks.telemetry.enabled": "true", - "adbc.databricks.telemetry.force_enable": "true" -} -``` - -**Monitoring**: -- Query Lumberjack table for telemetry data -- Validate schema correctness -- Check for any data quality issues - -**Success Criteria**: -- Telemetry data visible in Lumberjack -- No driver functionality issues -- No performance regressions - -#### Phase 3: Opt-In Beta (Weeks 6-8) - -**Goals**: -- Release to select beta customers -- Gather feedback on telemetry value -- Monitor telemetry service load - -**Configuration**: -- Default: `telemetry.enabled = false` -- Beta customers opt-in via config - -**Monitoring**: -- Track opt-in rate -- Monitor telemetry service QPS -- Watch for any issues - -**Success Criteria**: -- 10+ beta customers opted in -- No critical issues reported -- Positive feedback on value - -#### Phase 4: Default On with Feature Flag (Weeks 9-12) - -**Goals**: -- Enable telemetry by default for new connections -- Gradual rollout via server-side feature flag - -**Configuration**: -- Client-side: `telemetry.enabled = true` (default) -- Server-side: Feature flag controls actual enablement - -**Rollout Schedule**: -- Week 9: 10% of workspaces -- Week 10: 25% of workspaces -- Week 11: 50% of workspaces -- Week 12: 100% of workspaces - -**Monitoring**: -- Track telemetry service QPS growth -- Monitor circuit breaker activation rate -- Watch for any performance impact - -**Success Criteria**: -- Telemetry service handles load -- < 0.1% customer issues -- Valuable insights being derived - -#### Phase 5: General Availability (Week 13+) - -**Goals**: -- Telemetry enabled for all workspaces -- Documentation published -- Monitoring dashboards created - -**Configuration**: -- Default: Enabled -- Opt-out available via config - -**Success Criteria**: -- 100% rollout complete -- Usage analytics dashboard live -- Error monitoring alerts configured - -### 12.2 Rollback Plan - -#### Trigger Conditions - -**Immediate Rollback**: -- Telemetry causing driver crashes -- Performance degradation > 5% -- Data privacy violation detected - -**Gradual Rollback**: -- Telemetry service overloaded (> 1000 QPS sustained) -- Circuit breaker open rate > 10% -- Customer complaints > 5/week - -#### Rollback Procedures - -**Server-Side Rollback** (Preferred): -```sql --- Disable via feature flag (affects all clients immediately) -UPDATE databricks_client_config -SET value = 'false' -WHERE key = 'databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForAdbc'; -``` - -**Client-Side Rollback**: -```json -{ - "adbc.databricks.telemetry.enabled": "false" -} -``` - -**Code Rollback**: -- Revert telemetry changes via Git -- Deploy previous driver version -- Communicate to customers - -### 12.3 Compatibility Matrix - -| Driver Version | Telemetry Support | 
Server Feature Flag Required |
-|:---|:---:|:---:|
-| < 1.0.0 | ❌ No | N/A |
-| 1.0.0 - 1.0.5 | ⚠️ Beta (opt-in) | No |
-| 1.1.0+ | ✅ GA (default on) | Yes |
-
-**Backward Compatibility**:
-- Older driver versions continue to work (no telemetry)
-- Newer driver versions work with older servers (feature flag defaults to enabled)
-- No breaking changes to ADBC API
-
-### 12.4 Documentation Plan
-
-#### User-Facing Documentation
-
-**Location**: `csharp/src/Drivers/Databricks/readme.md`
-
-**Sections to Add**:
-1. Telemetry Overview
-2. Configuration Options
-3. Privacy & Data Collection
-4. Opt-Out Instructions
-
-**Example**:
-````markdown
-## Telemetry
-
-The Databricks ADBC driver collects anonymous usage telemetry to help improve
-the driver. Telemetry is enabled by default but can be disabled.
-
-### What Data is Collected
-
-- Driver configuration (CloudFetch settings, batch size, etc.)
-- Operation latency metrics
-- Error codes and sanitized error messages
-- Result format usage (inline vs CloudFetch)
-
-### What Data is NOT Collected
-
-- Query text or results
-- Table or column names
-- Authentication credentials
-- Personally identifiable information
-
-### Disabling Telemetry
-
-To disable telemetry, set the following connection property:
-
-```json
-{
-  "adbc.databricks.telemetry.enabled": "false"
-}
-```
-
-Or via environment variable:
-
-```bash
-export DATABRICKS_TELEMETRY_ENABLED=false
-```
-````
-
-#### Internal Documentation
-
-**Location**: Internal Confluence/Wiki
-
-**Sections**:
-1. Architecture Overview
-2. Data Schema
-3. Lumberjack Table Access
-4. Dashboard Links
-5. Troubleshooting Guide
-
----
-
-## 13. Alternatives Considered
-
-### 13.1 OpenTelemetry Metrics Export
-
-**Approach**: Use OpenTelemetry metrics SDK instead of custom telemetry client.
-
-**Pros**:
-- Industry standard
-- Rich ecosystem (Prometheus, Grafana, etc.)
-- Automatic instrumentation
-- Built-in retry and batching
-
-**Cons**:
-- ❌ Requires external OTLP endpoint (not Databricks service)
-- ❌ More complex configuration for users
-- ❌ Harder to enforce server-side control (feature flags)
-- ❌ Different schema from JDBC driver
-- ❌ Additional dependency (OpenTelemetry SDK)
-
-**Decision**: Not chosen. Custom approach gives better control and consistency with JDBC.
-
-### 13.2 Activity Events Only
-
-**Approach**: Extend existing Activity/trace infrastructure to include telemetry events.
-
-**Pros**:
-- Reuses existing infrastructure
-- No new components needed
-- Unified observability model
-
-**Cons**:
-- ❌ Activities are trace-focused, not metrics-focused
-- ❌ No built-in aggregation (traces are per-operation)
-- ❌ Requires external trace backend (Jaeger, Zipkin, etc.)
-- ❌ Not centralized in Databricks
-- ❌ No server-side control
-
-**Decision**: Not chosen. Activities complement telemetry but don't replace it.
-
-### 13.3 Log-Based Telemetry
-
-**Approach**: Emit structured logs that get aggregated by log shipper.
-
-**Pros**:
-- Simple implementation
-- Leverages existing logging infrastructure
-- Easy to debug locally
-
-**Cons**:
-- ❌ Relies on customer log infrastructure
-- ❌ No guarantee logs reach Databricks
-- ❌ Hard to enforce server-side control
-- ❌ Inconsistent across deployments
-- ❌ Performance overhead of log serialization
-
-**Decision**: Not chosen. Not reliable enough for production telemetry.
-
-### 13.4 Synchronous Telemetry Export
-
-**Approach**: Export telemetry synchronously with each operation.
- -**Pros**: -- Simpler implementation (no batching) -- Guaranteed delivery (or failure) -- No background threads - -**Cons**: -- ❌ **Blocking**: Would impact driver operation latency -- ❌ High network overhead (one request per event) -- ❌ Poor performance -- ❌ Violates non-blocking requirement - -**Decision**: Not chosen. Must be asynchronous and batched. - -### 13.5 No Telemetry - -**Approach**: Don't implement telemetry, rely on customer-reported issues. - -**Pros**: -- No implementation effort -- No privacy concerns -- Simpler driver code - -**Cons**: -- ❌ **Reactive debugging only**: Wait for customer reports -- ❌ No usage insights -- ❌ Can't track feature adoption -- ❌ Harder to identify systemic issues -- ❌ Slower issue resolution - -**Decision**: Not chosen. Telemetry provides too much value. - ---- - -## 14. Open Questions - -### 14.1 Schema Evolution - -**Question**: How do we handle schema changes over time? - -**Options**: -1. **Versioned schema**: Add `schema_version` field to payload -2. **Backward compatible additions**: Only add optional fields -3. **Server-side schema validation**: Reject unknown fields - -**Recommendation**: Option 2 (backward compatible additions) + versioning - -**Action**: Define schema versioning strategy before GA - -### 14.2 Sampling - -**Question**: Should we implement sampling for high-volume workspaces? - -**Context**: Some workspaces may execute thousands of queries per second - -**Options**: -1. **No sampling**: Collect all events -2. **Client-side sampling**: Sample events before export -3. **Server-side sampling**: Server accepts all, samples during processing - -**Recommendation**: Start with no sampling, add client-side sampling if needed - -**Action**: Monitor telemetry service QPS during rollout, add sampling if > 1000 QPS sustained - -### 14.3 Custom Metrics - -**Question**: Should we allow users to add custom telemetry fields? - -**Use Case**: Enterprise customers may want to tag telemetry with internal identifiers - -**Options**: -1. **No custom fields**: Fixed schema only -2. **Tagged fields**: Allow key-value tags -3. **Extensible schema**: Allow arbitrary JSON in `metadata` field - -**Recommendation**: Option 1 for MVP, revisit based on feedback - -**Action**: Gather feedback during beta phase - -### 14.4 Real-Time Alerting - -**Question**: Should telemetry trigger real-time alerts? - -**Use Case**: Alert on-call when error rate spikes - -**Options**: -1. **No real-time alerting**: Batch processing only -2. **Server-side alerting**: Telemetry service triggers alerts -3. **Client-side alerting**: Driver triggers alerts (not recommended) - -**Recommendation**: Option 2 (server-side alerting) as future enhancement - -**Action**: Design alerting as follow-up project - -### 14.5 PII Detection - -**Question**: How do we detect and prevent PII in telemetry? - -**Current Approach**: Manual sanitization in code - -**Options**: -1. **Manual sanitization**: Regex-based (current) -2. **Automated PII detection**: ML-based PII scanner -3. **Server-side PII scrubbing**: Lumberjack scrubs PII - -**Recommendation**: Option 1 for MVP, Option 3 as enhancement - -**Action**: Audit sanitization logic, add comprehensive tests - ---- - -## 15. References - -### 15.1 Internal Documents - -- [JDBC Telemetry Design Doc](https://docs.google.com/document/d/1Ww9sWPqt-ZpGDgtRPqnIhTVyGaeFp-3wPa-xElYfnbw/edit) -- [Lumberjack Data Residency](https://databricks.atlassian.net/wiki/spaces/ENG/pages/...) 
-- [Telemetry Service API](https://github.com/databricks/universe/tree/master/...)
-
-### 15.2 External Standards
-
-- [OpenTelemetry Specification](https://opentelemetry.io/docs/specs/otel/)
-- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
-- [GDPR Compliance](https://gdpr.eu/)
-- [CCPA Compliance](https://oag.ca.gov/privacy/ccpa)
-
-### 15.3 Code References
-
-- JDBC Telemetry Implementation: `databricks-jdbc/src/main/java/com/databricks/jdbc/telemetry/`
-- ADBC Activity Infrastructure: `arrow-adbc/csharp/src/Apache.Arrow.Adbc/Tracing/`
-- Databricks ADBC Driver: `arrow-adbc/csharp/src/Drivers/Databricks/`
-
-### 15.4 Related Projects
-
-- [OpenTelemetry .NET SDK](https://github.com/open-telemetry/opentelemetry-dotnet)
-- [Polly (Resilience Library)](https://github.com/App-vNext/Polly)
-- [Apache Arrow ADBC](https://arrow.apache.org/adbc/)
-
----
-
-## Appendix A: Example Code
-
-### A.1 Full Integration Example
-
-```csharp
-using Apache.Arrow.Adbc.Drivers.Databricks;
-
-// Configure telemetry
-var properties = new Dictionary<string, string>
-{
-    [DatabricksParameters.HostName] = "https://my-workspace.databricks.com",
-    [DatabricksParameters.AuthType] = "oauth",
-    [DatabricksParameters.OAuthClientId] = "my-client-id",
-    [DatabricksParameters.OAuthClientSecret] = "my-secret",
-
-    // Telemetry configuration
-    [DatabricksParameters.TelemetryEnabled] = "true",
-    [DatabricksParameters.TelemetryBatchSize] = "50",
-    [DatabricksParameters.TelemetryFlushIntervalMs] = "30000",
-    [DatabricksParameters.TelemetryLogLevel] = "Info"
-};
-
-// Create connection
-using var driver = new DatabricksDriver();
-using var database = driver.Open(properties);
-using var connection = database.Connect();
-
-// Telemetry automatically collects:
-// - Connection open latency
-// - Driver configuration
-
-// Execute query
-using var statement = connection.CreateStatement();
-var reader = await statement.ExecuteQueryAsync("SELECT * FROM my_table LIMIT 1000000");
-
-// Telemetry automatically collects:
-// - Statement execution latency
-// - Result format (inline vs CloudFetch)
-// - Chunk download metrics (if CloudFetch)
-
-while (await reader.ReadAsync())
-{
-    // Process results
-}
-
-// Close connection
-connection.Dispose();
-
-// Telemetry automatically:
-// - Flushes all pending events
-// - Exports to Databricks service
-```
-
-### A.2 Error Handling Example
-
-```csharp
-try
-{
-    using var connection = database.Connect();
-    using var statement = connection.CreateStatement();
-    var reader = await statement.ExecuteQueryAsync("INVALID SQL");
-}
-catch (AdbcException ex)
-{
-    // Telemetry automatically records error:
-    // - Error code: ex.Status
-    // - Error message: sanitized version
-    // - Statement ID (if available)
-
-    Console.WriteLine($"Query failed: {ex.Message}");
-}
-```
-
----
-
-## Appendix B: Configuration Reference
-
-### B.1 All Telemetry Parameters
-
-| Parameter | Type | Default | Description |
-|:---|:---:|:---:|:---|
-| `adbc.databricks.telemetry.enabled` | bool | `true` | Enable/disable telemetry |
-| `adbc.databricks.telemetry.force_enable` | bool | `false` | Bypass feature flag |
-| `adbc.databricks.telemetry.batch_size` | int | `50` | Events per batch |
-| `adbc.databricks.telemetry.flush_interval_ms` | int | `30000` | Flush interval (ms) |
-| `adbc.databricks.telemetry.max_retries` | int | `3` | Max retry attempts |
-| `adbc.databricks.telemetry.retry_delay_ms` | int | `500` | Base retry delay (ms) |
-| `adbc.databricks.telemetry.circuit_breaker.enabled` | bool | `true` | Enable circuit breaker |
-| `adbc.databricks.telemetry.circuit_breaker.threshold` | int | `5` | Failure threshold |
-| `adbc.databricks.telemetry.circuit_breaker.timeout_sec` | int | `60` | Open timeout (sec) |
-| `adbc.databricks.telemetry.log_level` | enum | `Info` | Log level filter |
-
----
-
-## Appendix C: Telemetry Events Catalog
-
-### C.1 Connection Events
-
-| Event | Fields | When Emitted |
-|:---|:---|:---|
-| `ConnectionOpen` | latency, driver_config | Connection opened successfully |
-| `ConnectionError` | error_code, error_message | Connection failed to open |
-
-### C.2 Statement Events
-
-| Event | Fields | When Emitted |
-|:---|:---|:---|
-| `StatementExecute` | latency, result_format | Statement executed successfully |
-| `StatementComplete` | aggregated_metrics | Statement closed |
-| `StatementError` | error_code, error_message, statement_id | Statement execution failed |
-
-### C.3 CloudFetch Events
-
-| Event | Fields | When Emitted |
-|:---|:---|:---|
-| `ChunkDownload` | chunk_index, latency, bytes, compressed | Chunk downloaded successfully |
-| `ChunkDownloadError` | chunk_index, error_code, error_message | Chunk download failed |
-| `OperationStatus` | poll_count, total_latency | Polling completed |
-
----
-
-**Document Version**: 1.0
-**Last Updated**: 2025-10-26
-**Authors**: Design Team
-**Reviewers**: Engineering, Product, Security, Privacy
diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md
deleted file mode 100644
index 1db34a1e23..0000000000
--- a/csharp/src/Drivers/Databricks/Telemetry/telemetry-lld-summary.md
+++ /dev/null
@@ -1,280 +0,0 @@
-# Analysis: File Locations in Telemetry LLD
-
-Based on my analysis of the design document, **ALL changes are contained within the Databricks driver folder** (`/Users/sreekanth.vadigi/Desktop/projects/arrow-adbc/csharp/src/Drivers/Databricks`). Here's the complete breakdown:
-
-## ✅ New Files to Create (All in Databricks folder)
-
-### 1. Telemetry Core Components
-
-```
-/Databricks/Telemetry/
-├── TelemetryCollector.cs (New - event aggregation)
-├── TelemetryExporter.cs (New - HTTP export)
-├── ITelemetryExporter.cs (New - interface)
-├── CircuitBreaker.cs (New - resilience)
-├── TelemetryConfiguration.cs (New - config)
-└── Models/
-    ├── TelemetryEvent.cs (New - event model)
-    ├── TelemetryRequest.cs (New - request payload)
-    ├── TelemetryResponse.cs (New - response payload)
-    ├── TelemetryFrontendLog.cs (New - log wrapper)
-    ├── FrontendLogContext.cs (New - context)
-    ├── FrontendLogEntry.cs (New - entry)
-    ├── SqlDriverLog.cs (New - driver log)
-    ├── DriverConfiguration.cs (New - config snapshot)
-    ├── SqlOperationData.cs (New - SQL metrics)
-    ├── ChunkDownloadData.cs (New - chunk metrics)
-    ├── DriverErrorInfo.cs (New - error info)
-    ├── TelemetryClientContext.cs (New - client context)
-    └── StatementTelemetryData.cs (New - aggregated data)
-```
-
-### 2. Test Files
-
-```
-/Databricks.Tests/Telemetry/
-├── TelemetryCollectorTests.cs (New - unit tests)
-├── TelemetryExporterTests.cs (New - unit tests)
-├── CircuitBreakerTests.cs (New - unit tests)
-├── TelemetryIntegrationTests.cs (New - integration tests)
-├── TelemetryPerformanceTests.cs (New - perf tests)
-└── MockTelemetryEndpointTests.cs (New - mock tests)
-```
-
-## ✅ Existing Files to Modify (All in Databricks folder)
-
-### 1.
DatabricksParameters.cs - -**Location:** `/Databricks/DatabricksParameters.cs` -**Changes:** Add telemetry configuration constants - -```csharp -public const string TelemetryEnabled = "adbc.databricks.telemetry.enabled"; -public const string TelemetryBatchSize = "adbc.databricks.telemetry.batch_size"; -public const string TelemetryFlushIntervalMs = "adbc.databricks.telemetry.flush_interval_ms"; -// ... 7 more parameters -``` - -### 2. DatabricksConnection.cs - -**Location:** `/Databricks/DatabricksConnection.cs` -**Changes:** -- Add TelemetryCollector field -- Initialize telemetry in `OpenAsync()` -- Record connection configuration -- Flush telemetry in `Dispose()` -- Check server-side feature flag in `ApplyServerSidePropertiesAsync()` - -```csharp -private TelemetryCollector? _telemetryCollector; -private TelemetryConfiguration? _telemetryConfig; - -public override async Task OpenAsync(CancellationToken cancellationToken = default) -{ - // ... existing code ... - InitializeTelemetry(); - _telemetryCollector?.RecordConnectionOpen(latency, driverConfig); -} - -public override void Dispose() -{ - _telemetryCollector?.FlushAllPendingAsync().Wait(); - _telemetryCollector?.Dispose(); - base.Dispose(); -} -``` - -### 3. DatabricksStatement.cs - -**Location:** `/Databricks/DatabricksStatement.cs` -**Changes:** -- Record statement execution metrics -- Track result format -- Mark statement complete on dispose - -```csharp -protected override async Task ExecuteQueryAsync(...) -{ - var sw = Stopwatch.StartNew(); - // ... execute ... - Connection.TelemetryCollector?.RecordStatementExecute( - statementId, sw.Elapsed, resultFormat); -} - -public override void Dispose() -{ - Connection.TelemetryCollector?.RecordStatementComplete(_statementId); - base.Dispose(); -} -``` - -### 4. CloudFetchDownloader.cs - -**Location:** `/Databricks/Reader/CloudFetch/CloudFetchDownloader.cs` -**Changes:** -- Record chunk download latency -- Track retry attempts -- Report download errors - -```csharp -private async Task DownloadFileAsync(IDownloadResult downloadResult, ...) -{ - var sw = Stopwatch.StartNew(); - // ... download ... - _statement.Connection.TelemetryCollector?.RecordChunkDownload( - statementId, chunkIndex, sw.Elapsed, bytesDownloaded, compressed); -} -``` - -### 5. DatabricksOperationStatusPoller.cs - -**Location:** `/Databricks/Reader/DatabricksOperationStatusPoller.cs` -**Changes:** -- Record polling metrics - -```csharp -public async Task PollForCompletionAsync(...) -{ - var pollCount = 0; - var sw = Stopwatch.StartNew(); - // ... polling loop ... - _connection.TelemetryCollector?.RecordOperationStatus( - operationId, pollCount, sw.Elapsed); -} -``` - -### 6. Exception Handlers (Multiple Files) - -**Locations:** Throughout `/Databricks/` (wherever exceptions are caught) -**Changes:** Add telemetry error recording - -```csharp -catch (Exception ex) -{ - Connection.TelemetryCollector?.RecordError( - errorCode, SanitizeErrorMessage(ex.Message), statementId); - throw; -} -``` - -### 7. readme.md - -**Location:** `/Databricks/readme.md` -**Changes:** Add telemetry documentation section - -```markdown -## Telemetry - -The Databricks ADBC driver collects anonymous usage telemetry... 
-
-### What Data is Collected
-### What Data is NOT Collected
-### Disabling Telemetry
-```
-
-## ❌ NO Changes Outside Databricks Folder
-
-The design does **NOT** require any changes to:
-
-- ❌ Base ADBC library (`Apache.Arrow.Adbc/`)
-- ❌ Apache Spark/Hive2 drivers (`Drivers/Apache/`)
-- ❌ ADBC interfaces (`AdbcConnection`, `AdbcStatement`, etc.)
-- ❌ Activity/Tracing infrastructure (already exists, just reuse)
-- ❌ Other ADBC drivers (BigQuery, Snowflake, etc.)
-
-## 📦 External Dependencies
-
-The design **reuses existing infrastructure**:
-
-### Already Available (No Changes Needed):
-
-**Activity/Tracing** (`Apache.Arrow.Adbc.Tracing/`)
-- `ActivityTrace` - Already exists
-- `IActivityTracer` - Already exists
-- Used for correlation, not modified
-
-**HTTP Client**
-- `HttpClient` - .NET standard library
-- Already used by driver
-
-**JSON Serialization**
-- `System.Text.Json` - .NET standard library
-- Already used by driver
-
-**Testing Infrastructure**
-- MSTest/xUnit - Standard testing frameworks
-- Already used by driver tests
-
-## 📁 Complete File Tree
-
-```
-arrow-adbc/csharp/src/Drivers/Databricks/
-│
-├── Telemetry/                                ← NEW FOLDER
-│   ├── TelemetryCollector.cs                 ← NEW
-│   ├── TelemetryExporter.cs                  ← NEW
-│   ├── ITelemetryExporter.cs                 ← NEW
-│   ├── CircuitBreaker.cs                     ← NEW
-│   ├── TelemetryConfiguration.cs             ← NEW
-│   └── Models/                               ← NEW FOLDER
-│       ├── TelemetryEvent.cs                 ← NEW
-│       ├── TelemetryRequest.cs               ← NEW
-│       ├── TelemetryResponse.cs              ← NEW
-│       ├── TelemetryFrontendLog.cs           ← NEW
-│       ├── FrontendLogContext.cs             ← NEW
-│       ├── FrontendLogEntry.cs               ← NEW
-│       ├── SqlDriverLog.cs                   ← NEW
-│       ├── DriverConfiguration.cs            ← NEW
-│       ├── SqlOperationData.cs               ← NEW
-│       ├── ChunkDownloadData.cs              ← NEW
-│       ├── DriverErrorInfo.cs                ← NEW
-│       ├── TelemetryClientContext.cs         ← NEW
-│       └── StatementTelemetryData.cs         ← NEW
-│
-├── DatabricksParameters.cs                   ← MODIFY (add constants)
-├── DatabricksConnection.cs                   ← MODIFY (add telemetry)
-├── DatabricksStatement.cs                    ← MODIFY (add telemetry)
-├── readme.md                                 ← MODIFY (add docs)
-│
-├── Reader/
-│   ├── DatabricksOperationStatusPoller.cs    ← MODIFY (add telemetry)
-│   └── CloudFetch/
-│       └── CloudFetchDownloader.cs           ← MODIFY (add telemetry)
-│
-└── [Other existing files remain unchanged]
-
-arrow-adbc/csharp/test/Drivers/Databricks.Tests/
-│
-└── Telemetry/                                ← NEW FOLDER
-    ├── TelemetryCollectorTests.cs            ← NEW
-    ├── TelemetryExporterTests.cs             ← NEW
-    ├── CircuitBreakerTests.cs                ← NEW
-    ├── TelemetryIntegrationTests.cs          ← NEW
-    ├── TelemetryPerformanceTests.cs          ← NEW
-    └── MockTelemetryEndpointTests.cs         ← NEW
-```
-
-## Summary
-
-✅ **All changes are self-contained within the Databricks driver folder**
-
-**New Files:** ~24 new files (all under `/Databricks/`)
-- 5 core implementation files
-- 13 model classes
-- 6 test files
-
-**Modified Files:** ~6-8 existing files (all under `/Databricks/`)
-- DatabricksParameters.cs
-- DatabricksConnection.cs
-- DatabricksStatement.cs
-- CloudFetchDownloader.cs
-- DatabricksOperationStatusPoller.cs
-- readme.md
-- Exception handlers (scattered)
-
-**External Dependencies:** Zero new dependencies outside the folder
-- Reuses existing Activity/Tracing infrastructure
-- Uses standard .NET libraries (HttpClient, System.Text.Json)
-- No changes to base ADBC library
-
-**This is a clean, modular implementation that doesn't require any changes to the ADBC standard or other drivers!** 🎯

From f46d741ad5c4ccadf1fb1614347a978905832c59 Mon Sep 17 00:00:00 2001
From: Jade Wang
Date: Tue, 28 Oct 2025 14:55:24 -0700
Subject: [PATCH 5/6] Update telemetry-activity-based-design.md

---
 .../telemetry-activity-based-design.md        | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md
index 4d2647b9a7..5561896de3 100644
--- a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md
+++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md
@@ -1,3 +1,22 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
 # Databricks ADBC Driver: Activity-Based Telemetry Design
 
 ## Executive Summary

From 1387cccc7cd7f0e14bb8ba2eed4fd524a594c6f Mon Sep 17 00:00:00 2001
From: samikshya-chand_data
Date: Tue, 4 Nov 2025 15:16:05 -0800
Subject: [PATCH 6/6] Update design based on JDBC learnings on telemetry

---
 .../telemetry-activity-based-design.md        | 808 ++++++++++++++++--
 1 file changed, 741 insertions(+), 67 deletions(-)

diff --git a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md
index 5561896de3..1deac2fbba 100644
--- a/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md
+++ b/csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md
@@ -37,6 +37,14 @@ This document outlines an **Activity-based telemetry design** that leverages the
 - **Privacy-first**: No PII or query data collected
 - **Server-controlled**: Feature flag support for enable/disable
 
+**Enhanced Production Requirements** (from JDBC driver experience):
+- **Feature flag caching**: Per-host caching to avoid rate limiting
+- **Circuit breaker**: Protect against telemetry endpoint failures
+- **Exception swallowing**: All telemetry exceptions caught with minimal logging
+- **Per-host telemetry client**: One client per host to prevent rate limiting
+- **Graceful shutdown**: Proper cleanup with reference counting
+- **Smart exception flushing**: Only flush terminal exceptions immediately
+
 ---
 
 ## Table of Contents
@@ -44,16 +52,25 @@
 1. [Background & Motivation](#1-background--motivation)
 2. [Architecture Overview](#2-architecture-overview)
 3. [Core Components](#3-core-components)
+   - 3.1 [FeatureFlagCache (Per-Host)](#31-featureflagcache-per-host)
+   - 3.2 [TelemetryClientManager (Per-Host)](#32-telemetryclientmanager-per-host)
+   - 3.3 [Circuit Breaker](#33-circuit-breaker)
+   - 3.4 [DatabricksActivityListener](#34-databricksactivitylistener)
+   - 3.5 [MetricsAggregator](#35-metricsaggregator)
+   - 3.6 [DatabricksTelemetryExporter](#36-databrickstelemetryexporter)
 4. [Data Collection](#4-data-collection)
 5. [Export Mechanism](#5-export-mechanism)
 6. [Configuration](#6-configuration)
 7. [Privacy & Compliance](#7-privacy--compliance)
 8. [Error Handling](#8-error-handling)
+   - 8.1 [Exception Swallowing Strategy](#81-exception-swallowing-strategy)
+   - 8.2 [Terminal vs Retryable Exceptions](#82-terminal-vs-retryable-exceptions)
-9. [Testing Strategy](#9-testing-strategy)
-10. [Alternatives Considered](#10-alternatives-considered)
-11. [Implementation Checklist](#11-implementation-checklist)
-12. [Open Questions](#12-open-questions)
-13. [References](#13-references)
+9. [Graceful Shutdown](#9-graceful-shutdown)
+10. [Testing Strategy](#10-testing-strategy)
+11. [Alternatives Considered](#11-alternatives-considered)
+12. [Implementation Checklist](#12-implementation-checklist)
+13. [Open Questions](#13-open-questions)
+14. [References](#14-references)
 
 ---
 
@@ -96,22 +113,35 @@ graph TB
     A[Driver Operations] -->|Activity.Start/Stop| B[ActivitySource]
     B -->|Activity Events| C[DatabricksActivityListener]
     C -->|Aggregate Metrics| D[MetricsAggregator]
-    D -->|Batch & Buffer| E[DatabricksTelemetryExporter]
-    E -->|HTTP POST| F[Databricks Service]
-    F --> G[Lumberjack]
-
-    H[Feature Flag Service] -.->|Enable/Disable| C
+    D -->|Batch & Buffer| E[TelemetryClientManager]
+    E -->|Get Per-Host Client| F[TelemetryClient per Host]
+    F -->|Check Circuit Breaker| G[CircuitBreakerWrapper]
+    G -->|HTTP POST| H[DatabricksTelemetryExporter]
+    H --> I[Databricks Service]
+    I --> J[Lumberjack]
+
+    K[FeatureFlagCache per Host] -.->|Enable/Disable| C
+    L[Connection Open] -->|Increment RefCount| E
+    L -->|Increment RefCount| K
+    M[Connection Close] -->|Decrement RefCount| E
+    M -->|Decrement RefCount| K
 
     style C fill:#e1f5fe
     style D fill:#e1f5fe
-    style E fill:#e1f5fe
+    style E fill:#ffe0b2
+    style F fill:#ffe0b2
+    style G fill:#ffccbc
+    style K fill:#c8e6c9
 ```
 
 **Key Components:**
 1. **ActivitySource** (existing): Emits activities for all operations
-2. **DatabricksActivityListener** (new): Listens to activities, extracts metrics
-3. **MetricsAggregator** (new): Aggregates by statement, batches events
-4. **DatabricksTelemetryExporter** (new): Exports to Databricks service
+2. **FeatureFlagCache** (new): Per-host caching of feature flags with reference counting
+3. **TelemetryClientManager** (new): Manages one telemetry client per host with reference counting
+4. **CircuitBreakerWrapper** (new): Protects against failing telemetry endpoint
+5. **DatabricksActivityListener** (new): Listens to activities, extracts metrics
+6. **MetricsAggregator** (new): Aggregates by statement, batches events
+7. **DatabricksTelemetryExporter** (new): Exports to Databricks service
 
 ### 2.2 Activity Flow
 
@@ -147,7 +177,196 @@ sequenceDiagram
 
 ## 3. Core Components
 
-### 3.1 DatabricksActivityListener
+### 3.1 FeatureFlagCache (Per-Host)
+
+**Purpose**: Cache feature flag values at the host level to avoid repeated API calls and rate limiting.
+
+**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.FeatureFlagCache`
+
+#### Rationale
+- **Per-host caching**: Feature flags cached by host (not per connection) to prevent rate limiting
+- **Reference counting**: Tracks number of connections per host for proper cleanup
+- **Automatic expiration**: Refreshes cached flags after TTL expires (15 minutes)
+- **Thread-safe**: Uses ConcurrentDictionary for concurrent access from multiple connections
+
+#### Interface
+
+```csharp
+namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
+{
+    /// <summary>
+    /// Singleton that manages feature flag cache per host.
+    /// Prevents rate limiting by caching feature flag responses.
+    /// </summary>
+    internal sealed class FeatureFlagCache
+    {
+        private static readonly FeatureFlagCache Instance = new();
+        public static FeatureFlagCache GetInstance() => Instance;
+
+        /// <summary>
+        /// Gets or creates a feature flag context for the host.
+        /// Increments reference count.
+        /// </summary>
+        public FeatureFlagContext GetOrCreateContext(string host);
+
+        /// <summary>
+        /// Decrements reference count for the host.
+        /// Removes context when ref count reaches zero.
+        /// </summary>
+        public void ReleaseContext(string host);
+
+        /// <summary>
+        /// Checks if telemetry is enabled for the host.
+        /// Uses cached value if available and not expired.
+        /// </summary>
+        public Task<bool> IsTelemetryEnabledAsync(
+            string host,
+            HttpClient httpClient,
+            CancellationToken ct = default);
+    }
+
+    /// <summary>
+    /// Holds feature flag state and reference count for a host.
+    /// </summary>
+    internal sealed class FeatureFlagContext
+    {
+        public bool? TelemetryEnabled { get; set; }
+        public DateTime? LastFetched { get; set; }
+        public int RefCount { get; set; }
+        public TimeSpan CacheDuration { get; } = TimeSpan.FromMinutes(15);
+
+        public bool IsExpired => LastFetched == null ||
+            DateTime.UtcNow - LastFetched.Value > CacheDuration;
+    }
+}
+```
+
+**JDBC Reference**: `DatabricksDriverFeatureFlagsContextFactory.java:27` maintains per-compute (host) feature flag contexts with reference counting.
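+
+To make the caching contract concrete, here is a minimal illustrative sketch of how the TTL check and the flag fetch could compose. It assumes an internal `ConcurrentDictionary<string, FeatureFlagContext>` field named `_contexts`; `FetchTelemetryFlagAsync` and its endpoint are placeholders rather than a final API:
+
+```csharp
+// Illustrative sketch only: cache-first lookup with TTL-based refresh.
+public async Task<bool> IsTelemetryEnabledAsync(
+    string host, HttpClient httpClient, CancellationToken ct = default)
+{
+    if (!_contexts.TryGetValue(host, out FeatureFlagContext? context))
+        return false; // No active connection for this host.
+
+    if (context.TelemetryEnabled.HasValue && !context.IsExpired)
+        return context.TelemetryEnabled.Value; // Served from cache; no API call.
+
+    try
+    {
+        // At most one fetch per host per TTL window keeps the flag
+        // endpoint off the hot path and avoids rate limiting.
+        bool enabled = await FetchTelemetryFlagAsync(host, httpClient, ct);
+        context.TelemetryEnabled = enabled;
+        context.LastFetched = DateTime.UtcNow;
+        return enabled;
+    }
+    catch (Exception)
+    {
+        // Swallowed per the error-handling rules; fail closed (telemetry off).
+        return false;
+    }
+}
+```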
+
+---
+
+### 3.2 TelemetryClientManager (Per-Host)
+
+**Purpose**: Manage one telemetry client per host to prevent rate limiting from concurrent connections.
+
+**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TelemetryClientManager`
+
+#### Rationale
+- **One client per host**: Large customers (e.g., Celonis) open many parallel connections to the same host
+- **Prevents rate limiting**: Shared client batches events from all connections, avoiding multiple concurrent flushes
+- **Reference counting**: Tracks active connections, only closes client when last connection closes
+- **Thread-safe**: Safe for concurrent access from multiple connections
+
+#### Interface
+
+```csharp
+namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
+{
+    /// <summary>
+    /// Singleton factory that manages one telemetry client per host.
+    /// Prevents rate limiting by sharing clients across connections.
+    /// </summary>
+    internal sealed class TelemetryClientManager
+    {
+        private static readonly TelemetryClientManager Instance = new();
+        public static TelemetryClientManager GetInstance() => Instance;
+
+        /// <summary>
+        /// Gets or creates a telemetry client for the host.
+        /// Increments reference count.
+        /// </summary>
+        public ITelemetryClient GetOrCreateClient(
+            string host,
+            HttpClient httpClient,
+            TelemetryConfiguration config);
+
+        /// <summary>
+        /// Decrements reference count for the host.
+        /// Closes and removes client when ref count reaches zero.
+        /// </summary>
+        public Task ReleaseClientAsync(string host);
+    }
+
+    /// <summary>
+    /// Holds a telemetry client and its reference count.
+    /// </summary>
+    internal sealed class TelemetryClientHolder
+    {
+        public ITelemetryClient Client { get; }
+        public int RefCount { get; set; }
+    }
+}
+```
+
+**JDBC Reference**: `TelemetryClientFactory.java:27` maintains a `ConcurrentHashMap` with per-host clients and reference counting.
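+
+A minimal sketch of the reference-counting behavior this interface implies is shown below. The `TelemetryClientHolder` and `TelemetryClient` constructors are assumed for illustration; a production version would also need to guard against double release:
+
+```csharp
+// Illustrative sketch only: one shared client per host, closed on last release.
+internal sealed class TelemetryClientManager
+{
+    private readonly object _lock = new();
+    private readonly Dictionary<string, TelemetryClientHolder> _clients = new();
+
+    public ITelemetryClient GetOrCreateClient(
+        string host, HttpClient httpClient, TelemetryConfiguration config)
+    {
+        lock (_lock)
+        {
+            if (!_clients.TryGetValue(host, out TelemetryClientHolder? holder))
+            {
+                // First connection to this host: create the shared client.
+                holder = new TelemetryClientHolder(new TelemetryClient(httpClient, config));
+                _clients[host] = holder;
+            }
+            holder.RefCount++;
+            return holder.Client;
+        }
+    }
+
+    public async Task ReleaseClientAsync(string host)
+    {
+        ITelemetryClient? toClose = null;
+        lock (_lock)
+        {
+            if (_clients.TryGetValue(host, out TelemetryClientHolder? holder)
+                && --holder.RefCount == 0)
+            {
+                // Last connection for this host: remove now, close outside the lock.
+                _clients.Remove(host);
+                toClose = holder.Client;
+            }
+        }
+
+        if (toClose != null)
+        {
+            await toClose.CloseAsync(); // Flushes pending events before disposal.
+        }
+    }
+}
+```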
+
+---
+
+### 3.3 Circuit Breaker
+
+**Purpose**: Implement circuit breaker pattern to protect against failing telemetry endpoint.
+
+**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.CircuitBreaker`
+
+#### Rationale
+- **Endpoint protection**: The telemetry endpoint itself may fail or become unavailable
+- **Not just rate limiting**: Protects against 5xx errors, timeouts, network failures
+- **Resource efficiency**: Prevents wasting resources on a failing endpoint
+- **Auto-recovery**: Automatically detects when endpoint becomes healthy again
+
+#### States
+1. **Closed**: Normal operation, requests pass through
+2. **Open**: After threshold failures, all requests rejected immediately (drop events)
+3. **Half-Open**: After timeout, allows test requests to check if endpoint recovered
+
+#### Interface
+
+```csharp
+namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
+{
+    /// <summary>
+    /// Wraps telemetry exporter with circuit breaker pattern.
+    /// </summary>
+    internal sealed class CircuitBreakerTelemetryExporter : ITelemetryExporter
+    {
+        public CircuitBreakerTelemetryExporter(string host, ITelemetryExporter innerExporter);
+
+        public Task ExportAsync(
+            IReadOnlyList<TelemetryMetric> metrics,
+            CancellationToken ct = default);
+    }
+
+    /// <summary>
+    /// Singleton that manages circuit breakers per host.
+    /// </summary>
+    internal sealed class CircuitBreakerManager
+    {
+        private static readonly CircuitBreakerManager Instance = new();
+        public static CircuitBreakerManager GetInstance() => Instance;
+
+        public CircuitBreaker GetCircuitBreaker(string host);
+    }
+
+    internal sealed class CircuitBreaker
+    {
+        public CircuitBreakerConfig Config { get; }
+        public Task ExecuteAsync(Func<Task> action);
+    }
+
+    internal class CircuitBreakerConfig
+    {
+        public int FailureThreshold { get; set; } = 5;                   // Open after 5 failures
+        public TimeSpan Timeout { get; set; } = TimeSpan.FromMinutes(1); // Try again after 1 min
+        public int SuccessThreshold { get; set; } = 2;                   // Close after 2 successes
+    }
+}
+```
+
+**JDBC Reference**: `CircuitBreakerTelemetryPushClient.java:15` and `CircuitBreakerManager.java:25`
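+
+For illustration, here is a compact sketch of the `ExecuteAsync` state machine implied by `CircuitBreakerConfig`. Single-lock synchronization is a simplification, and `CircuitBreakerOpenException` is the same exception type used in Section 8.1:
+
+```csharp
+// Illustrative sketch only: Closed -> Open -> Half-Open -> Closed transitions.
+internal sealed class CircuitBreaker
+{
+    private enum State { Closed, Open, HalfOpen }
+
+    private readonly object _lock = new();
+    private State _state = State.Closed;
+    private int _failures;
+    private int _successes;
+    private DateTime _openedAt;
+
+    public CircuitBreakerConfig Config { get; } = new();
+
+    public async Task ExecuteAsync(Func<Task> action)
+    {
+        lock (_lock)
+        {
+            if (_state == State.Open)
+            {
+                if (DateTime.UtcNow - _openedAt < Config.Timeout)
+                    throw new CircuitBreakerOpenException(); // Reject fast; caller drops events.
+
+                _state = State.HalfOpen; // Timeout elapsed: allow a test request.
+                _successes = 0;
+            }
+        }
+
+        try
+        {
+            await action();
+            lock (_lock)
+            {
+                // In Half-Open, require SuccessThreshold successes before closing.
+                if (_state == State.HalfOpen && ++_successes >= Config.SuccessThreshold)
+                    _state = State.Closed;
+                _failures = 0;
+            }
+        }
+        catch
+        {
+            lock (_lock)
+            {
+                // Any failure in Half-Open, or FailureThreshold failures in Closed, opens the circuit.
+                if (_state == State.HalfOpen || ++_failures >= Config.FailureThreshold)
+                {
+                    _state = State.Open;
+                    _openedAt = DateTime.UtcNow;
+                }
+            }
+            throw; // Re-throw so the caller's TRACE-level handler still sees the error.
+        }
+    }
+}
+```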
+
+---
+
+### 3.4 DatabricksActivityListener
 
 **Purpose**: Listen to Activity events and extract metrics for Databricks telemetry.
 
@@ -161,12 +380,13 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry
     /// <summary>
     /// Custom ActivityListener that aggregates metrics from Activity events
     /// and exports them to Databricks telemetry service.
+    /// All exceptions are swallowed to prevent impacting driver operations.
     /// </summary>
     public sealed class DatabricksActivityListener : IDisposable
     {
         public DatabricksActivityListener(
-            DatabricksConnection connection,
-            ITelemetryExporter exporter,
+            string host,
+            ITelemetryClient telemetryClient,
             TelemetryConfiguration config);
 
         // Start listening to activities
@@ -180,6 +400,8 @@
     }
 }
 ```
 
+**Constructor Change**: Takes `string host` and shared `ITelemetryClient` instead of `DatabricksConnection`.
+
 #### Activity Listener Configuration
 
 ```csharp
 private ActivityListener CreateListener()
@@ -215,11 +437,16 @@ private ActivityListener CreateListener()
 
 **Non-Blocking**:
 - All processing async
 - Never blocks Activity completion
-- Failures logged but don't propagate
+- All exceptions swallowed (logged at TRACE level only)
+
+**Exception Handling**:
+- Wraps all callbacks in try-catch
+- Never throws exceptions to Activity infrastructure
+- Logs at TRACE level only to avoid customer anxiety
 
 ---
 
-### 3.2 MetricsAggregator
+### 3.5 MetricsAggregator
 
 **Purpose**: Aggregate Activity data into metrics suitable for Databricks telemetry.
 
@@ -295,10 +522,16 @@ flowchart TD
 
 **Error Handling**:
 - Activity errors (tags with `error.type`) captured
 - Never throws exceptions
+- All exceptions swallowed (logged at TRACE level only)
+
+**Terminal vs Retryable Exceptions**:
+- **Terminal exceptions**: Flush immediately (auth failures, syntax errors, etc.)
+- **Retryable exceptions**: Buffer until statement completes (network errors, 429, 503, etc.)
+- Only flush retryable exceptions if statement ultimately fails
 
 ---
 
-### 3.3 DatabricksTelemetryExporter
+### 3.6 DatabricksTelemetryExporter
 
 **Purpose**: Export aggregated metrics to Databricks telemetry service.
 
@@ -927,15 +1160,47 @@ Same as original design:
 
 ## 8. Error Handling
 
-### 8.1 Error Handling Principles
+### 8.1 Exception Swallowing Strategy
 
-Same as original design:
-1. Never block driver operations
-2. Fail silently (log only)
-3. Circuit breaker for service failures
-4. No retry storms
+**Core Principle**: Every telemetry exception must be swallowed with minimal logging to avoid customer anxiety.
+
+**Rationale** (from JDBC experience):
+- Customers become anxious when they see error logs, even if telemetry is non-blocking
+- Telemetry failures should never impact the driver's core functionality
+- **Critical**: Circuit breaker must catch errors **before** swallowing, otherwise it won't work
+
+#### Logging Levels
+- **TRACE**: Use for most telemetry errors (default)
+- **DEBUG**: Use only for circuit breaker state changes
+- **WARN/ERROR**: Never use for telemetry errors
+
+#### Exception Handling Layers
 
-### 8.2 Activity Listener Error Handling
+```mermaid
+graph TD
+    A[Driver Operation] --> B[Activity Created]
+    B --> C[ActivityListener Callback]
+    C -->|Try-Catch TRACE| D[MetricsAggregator]
+    D -->|Try-Catch TRACE| E[TelemetryClient]
+    E --> F[Circuit Breaker]
+    F -->|Sees Exception| G{Track Failure}
+    G -->|After Tracking| H[Exporter]
+    H -->|Try-Catch TRACE| I[HTTP Call]
+
+    C -.->|Exception Swallowed| J[Log at TRACE]
+    D -.->|Exception Swallowed| J
+    E -.->|Exception Swallowed| J
+    F -.->|Circuit Opens| K[Log at DEBUG]
+    H -.->|Exception Swallowed| J
+
+    style C fill:#ffccbc
+    style D fill:#ffccbc
+    style E fill:#ffccbc
+    style F fill:#ffccbc
+    style H fill:#ffccbc
+```
+
+#### Activity Listener Error Handling
 
 ```csharp
 private void OnActivityStopped(Activity activity)
@@ -946,27 +1211,371 @@ private void OnActivityStopped(Activity activity)
     }
     catch (Exception ex)
     {
-        // Log but never throw - must not impact driver
-        Debug.WriteLine($"Telemetry processing error: {ex.Message}");
+        // Swallow ALL exceptions per requirement
+        // Use TRACE level to avoid customer anxiety
+        Debug.WriteLine($"[TRACE] Telemetry listener error: {ex.Message}");
     }
 }
 ```
+
+#### MetricsAggregator Error Handling
+
+```csharp
+public void ProcessActivity(Activity activity)
+{
+    try
+    {
+        // Extract metrics, buffer, flush if needed
+    }
+    catch (Exception ex)
+    {
+        Debug.WriteLine($"[TRACE] Telemetry aggregator error: {ex.Message}");
+    }
+}
+```
+
+#### Circuit Breaker Error Handling
+
+**Important**: Circuit breaker MUST see exceptions before they are swallowed!
+
+```csharp
+public async Task ExportAsync(IReadOnlyList<TelemetryMetric> metrics)
+{
+    try
+    {
+        // Circuit breaker tracks failures BEFORE swallowing
+        await _circuitBreaker.ExecuteAsync(async () =>
+        {
+            await _innerExporter.ExportAsync(metrics);
+        });
+    }
+    catch (CircuitBreakerOpenException)
+    {
+        // Circuit is open, drop events silently
+        Debug.WriteLine($"[DEBUG] Circuit breaker OPEN - dropping telemetry");
+    }
+    catch (Exception ex)
+    {
+        // All other exceptions swallowed AFTER circuit breaker saw them
+        Debug.WriteLine($"[TRACE] Telemetry export error: {ex.Message}");
+    }
+}
+```
+
+**JDBC Reference**: `TelemetryPushClient.java:86-94` - Re-throws exception if circuit breaker enabled, allowing it to track failures before swallowing.
+
+---
+
+### 8.2 Terminal vs Retryable Exceptions
+
+**Requirement**: Do not flush exceptions immediately when they occur. Flush immediately only for **terminal exceptions**.
+ +#### Exception Classification + +**Terminal Exceptions** (flush immediately): +- Authentication failures (401, 403) +- Invalid SQL syntax errors +- Permission denied errors +- Resource not found errors (404) +- Invalid request format errors (400) + +**Retryable Exceptions** (buffer until statement completes): +- Network timeouts +- Connection errors +- Rate limiting (429) +- Service unavailable (503) +- Internal server errors (500, 502, 504) + +#### Rationale +- Some exceptions are retryable and may succeed on retry +- If a retryable exception is thrown twice but succeeds the third time, we'd flush twice unnecessarily +- Only terminal (non-retryable) exceptions should trigger immediate flush +- Statement completion should trigger flush for accumulated exceptions + +#### Exception Classifier + +```csharp +internal static class ExceptionClassifier +{ + public static bool IsTerminalException(Exception ex) + { + return ex switch + { + HttpRequestException httpEx when IsTerminalHttpStatus(httpEx) => true, + AuthenticationException => true, + UnauthorizedAccessException => true, + SqlException sqlEx when IsSyntaxError(sqlEx) => true, + _ => false + }; + } + + private static bool IsTerminalHttpStatus(HttpRequestException ex) + { + if (ex.StatusCode.HasValue) + { + var statusCode = (int)ex.StatusCode.Value; + return statusCode is 400 or 401 or 403 or 404; + } + return false; + } +} +``` + +#### Exception Buffering in MetricsAggregator + +```csharp +public void RecordException(string statementId, Exception ex) +{ + try + { + if (ExceptionClassifier.IsTerminalException(ex)) + { + // Terminal exception: flush immediately + var errorMetric = CreateErrorMetric(statementId, ex); + _ = _telemetryClient.ExportAsync(new[] { errorMetric }); + } + else + { + // Retryable exception: buffer until statement completes + _statementContexts[statementId].Exceptions.Add(ex); + } + } + catch (Exception aggregatorEx) + { + Debug.WriteLine($"[TRACE] Error recording exception: {aggregatorEx.Message}"); + } +} + +public void CompleteStatement(string statementId, bool failed) +{ + try + { + if (_statementContexts.TryRemove(statementId, out var context)) + { + // Only flush exceptions if statement ultimately failed + if (failed && context.Exceptions.Any()) + { + var errorMetrics = context.Exceptions + .Select(ex => CreateErrorMetric(statementId, ex)) + .ToList(); + _ = _telemetryClient.ExportAsync(errorMetrics); + } + } + } + catch (Exception ex) + { + Debug.WriteLine($"[TRACE] Error completing statement: {ex.Message}"); + } +} +``` + +#### Usage Example + +```csharp +string statementId = GetStatementId(); + +try +{ + var result = await ExecuteStatementAsync(statementId); + _aggregator.CompleteStatement(statementId, failed: false); +} +catch (Exception ex) +{ + // Record exception (classified as terminal or retryable) + _aggregator.RecordException(statementId, ex); + _aggregator.CompleteStatement(statementId, failed: true); + throw; // Re-throw for application handling +} +``` + +--- + ### 8.3 Failure Modes | Failure | Behavior | |---------|----------| -| Listener throws | Caught, logged, activity continues | -| Aggregator throws | Caught, logged, skip this activity | -| Exporter fails | Circuit breaker, retry with backoff | -| Circuit breaker open | Drop metrics immediately | +| Listener throws | Caught, logged at TRACE, activity continues | +| Aggregator throws | Caught, logged at TRACE, skip this activity | +| Exporter fails | Circuit breaker tracks failure, then caught and logged at TRACE | +| Circuit breaker open | 
Drop metrics immediately, log at DEBUG |
 | Out of memory | Disable listener, stop collecting |
+| Terminal exception | Flush immediately, log at TRACE |
+| Retryable exception | Buffer until statement completes |
 
 ---
 
-## 9. Testing Strategy
+## 9. Graceful Shutdown
+
+**Requirement**: Every telemetry client and HTTP client must be closed gracefully. Maintain reference counting properly to determine when to close shared resources.
+
+### 9.1 Shutdown Sequence
+
+```mermaid
+sequenceDiagram
+    participant App as Application
+    participant Conn as DatabricksConnection
+    participant Listener as ActivityListener
+    participant Manager as TelemetryClientManager
+    participant Client as TelemetryClient (shared)
+    participant FFCache as FeatureFlagCache
+
+    App->>Conn: CloseAsync()
+
+    Conn->>Listener: StopAsync()
+    Listener->>Listener: Flush pending metrics
+    Listener->>Listener: Dispose
+
+    Conn->>Manager: ReleaseClientAsync(host)
+    Manager->>Manager: Decrement RefCount
+
+    alt RefCount == 0 (Last Connection)
+        Manager->>Client: CloseAsync()
+        Client->>Client: Flush pending events
+        Client->>Client: Shutdown executor
+        Client->>Client: Close HTTP client
+    else RefCount > 0 (Other Connections Exist)
+        Manager->>Manager: Keep client alive
+    end
+
+    Conn->>FFCache: ReleaseContext(host)
+    FFCache->>FFCache: Decrement RefCount
+
+    alt RefCount == 0
+        FFCache->>FFCache: Remove context
+    else RefCount > 0
+        FFCache->>FFCache: Keep context
+    end
+```
+
+### 9.2 Connection Close Implementation
+
+```csharp
+public sealed class DatabricksConnection : AdbcConnection
+{
+    private string? _host;
+    private DatabricksActivityListener? _activityListener;
+
+    protected override async ValueTask DisposeAsyncCore()
+    {
+        if (_host != null)
+        {
 
-### 9.1 Unit Tests
+            try
+            {
+                // Step 1: Stop activity listener and flush pending metrics
+                if (_activityListener != null)
+                {
+                    await _activityListener.StopAsync();
+                    _activityListener.Dispose();
+                    _activityListener = null;
+                }
+
+                // Step 2: Release telemetry client (decrements ref count, closes if last)
+                await TelemetryClientManager.GetInstance().ReleaseClientAsync(_host);
+
+                // Step 3: Release feature flag context (decrements ref count)
+                FeatureFlagCache.GetInstance().ReleaseContext(_host);
+            }
+            catch (Exception ex)
+            {
+                // Swallow all exceptions per requirement
+                Debug.WriteLine($"[TRACE] Error during telemetry cleanup: {ex.Message}");
+            }
+        }
+
+        // Continue with normal connection cleanup (runs even when _host is null)
+        await base.DisposeAsyncCore();
+    }
+}
+```
+
+### 9.3 TelemetryClient Close Implementation
+
+```csharp
+public sealed class TelemetryClient : ITelemetryClient
+{
+    private readonly ITelemetryExporter _exporter;
+    private readonly CancellationTokenSource _cts = new();
+    private readonly Task _backgroundFlushTask;
+
+    public async Task CloseAsync()
+    {
+        try
+        {
+            // Step 1: Cancel background flush task
+            _cts.Cancel();
+
+            // Step 2: Flush all pending metrics synchronously
+            await FlushAsync(force: true);
+
+            // Step 3: Wait for background task to complete (with timeout)
+            await _backgroundFlushTask.WaitAsync(TimeSpan.FromSeconds(5));
+        }
+        catch (Exception ex)
+        {
+            // Swallow per requirement
+            Debug.WriteLine($"[TRACE] Error closing telemetry client: {ex.Message}");
+        }
+        finally
+        {
+            _cts.Dispose();
+        }
+    }
+}
+```
+
+### 9.4 Reference Counting Example
+
+**TelemetryClientHolder with Reference Counting**:
+
+```csharp
+// Connection 1 opens
+var client1 = TelemetryClientManager.GetInstance()
+    .GetOrCreateClient("host1", httpClient, config);
+// RefCount for "host1" = 1
+
+//
Connection 2 opens (same host) +var client2 = TelemetryClientManager.GetInstance() + .GetOrCreateClient("host1", httpClient, config); +// RefCount for "host1" = 2 +// client1 == client2 (same instance) + +// Connection 1 closes +await TelemetryClientManager.GetInstance().ReleaseClientAsync("host1"); +// RefCount for "host1" = 1 +// Client NOT closed (other connection still using it) + +// Connection 2 closes +await TelemetryClientManager.GetInstance().ReleaseClientAsync("host1"); +// RefCount for "host1" = 0 +// Client IS closed and removed from cache +``` + +**Same logic applies to FeatureFlagCache**. + +### 9.5 Shutdown Contracts + +**TelemetryClientManager**: +- `GetOrCreateClient()`: Atomically increments ref count +- `ReleaseClientAsync()`: Atomically decrements ref count, closes client if zero +- Thread-safe for concurrent access + +**FeatureFlagCache**: +- `GetOrCreateContext()`: Atomically increments ref count +- `ReleaseContext()`: Atomically decrements ref count, removes context if zero +- Thread-safe for concurrent access + +**TelemetryClient.CloseAsync()**: +- Synchronously flushes all pending metrics (blocks until complete) +- Cancels background flush task +- Disposes resources (HTTP client, executors, etc.) +- Never throws exceptions + +**JDBC Reference**: `TelemetryClient.java:105-139` - Synchronous close with flush and executor shutdown. + +--- + +## 10. Testing Strategy + +### 10.1 Unit Tests **DatabricksActivityListener Tests**: - `Listener_FiltersCorrectActivitySource` @@ -985,7 +1594,22 @@ private void OnActivityStopped(Activity activity) **TelemetryExporter Tests**: - Same as original design (endpoints, retry, circuit breaker) -### 9.2 Integration Tests +**New Component Tests** (per-host management): +- `FeatureFlagCache_CachesPerHost` +- `FeatureFlagCache_ExpiresAfter15Minutes` +- `FeatureFlagCache_RefCountingWorks` +- `TelemetryClientManager_OneClientPerHost` +- `TelemetryClientManager_RefCountingWorks` +- `TelemetryClientManager_ClosesOnLastRelease` +- `CircuitBreaker_OpensAfterFailures` +- `CircuitBreaker_ClosesAfterSuccesses` +- `CircuitBreaker_PerHostIsolation` +- `ExceptionClassifier_IdentifiesTerminal` +- `ExceptionClassifier_IdentifiesRetryable` +- `MetricsAggregator_BuffersRetryableExceptions` +- `MetricsAggregator_FlushesTerminalImmediately` + +### 10.2 Integration Tests **End-to-End with Activity**: - `ActivityBased_ConnectionOpen_ExportedSuccessfully` @@ -998,7 +1622,15 @@ private void OnActivityStopped(Activity activity) - `ActivityBased_CorrelationIdPreserved` - `ActivityBased_ParentChildSpansWork` -### 9.3 Performance Tests +**New Integration Tests** (production requirements): +- `MultipleConnections_SameHost_SharesClient` +- `FeatureFlagCache_SharedAcrossConnections` +- `CircuitBreaker_StopsFlushingWhenOpen` +- `GracefulShutdown_LastConnection_ClosesClient` +- `TerminalException_FlushedImmediately` +- `RetryableException_BufferedUntilComplete` + +### 10.3 Performance Tests **Overhead Measurement**: - `ActivityListener_Overhead_LessThan1Percent` @@ -1009,7 +1641,7 @@ Compare: - With listener but disabled: Should be ~0% overhead - With listener enabled: Should be < 1% overhead -### 9.4 Test Coverage Goals +### 10.4 Test Coverage Goals | Component | Unit Test Coverage | Integration Test Coverage | |-----------|-------------------|---------------------------| @@ -1017,12 +1649,16 @@ Compare: | MetricsAggregator | > 90% | > 80% | | TelemetryExporter | > 90% | > 80% | | Activity Tag Filtering | 100% | N/A | +| FeatureFlagCache | > 90% | > 80% | +| 
 
 ---
 
-## 10. Alternatives Considered
+## 11. Alternatives Considered
 
-### 10.1 Alternative 1: Separate Telemetry System
+### 11.1 Alternative 1: Separate Telemetry System
 
 **Description**: Create a dedicated telemetry collection system parallel to Activity infrastructure, with explicit TelemetryCollector and TelemetryExporter classes.
 
@@ -1048,7 +1684,7 @@
 
 ---
 
-### 10.2 Alternative 2: OpenTelemetry Metrics API Directly
+### 11.2 Alternative 2: OpenTelemetry Metrics API Directly
 
 **Description**: Use OpenTelemetry's Metrics API (`Meter` and `Counter`/`Histogram`) directly in driver code.
 
@@ -1073,7 +1709,7 @@
 
 ---
 
-### 10.3 Alternative 3: Log-Based Metrics
+### 11.3 Alternative 3: Log-Based Metrics
 
 **Description**: Write structured logs at key operations and extract metrics from logs.
 
@@ -1099,7 +1735,7 @@
 
 ---
 
-### 10.4 Why Activity-Based Approach Was Chosen
+### 11.4 Why Activity-Based Approach Was Chosen
 
 The Activity-based design was selected because it:
 
@@ -1140,9 +1776,35 @@ The Activity-based design was selected because it:
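+To illustrate the hookup this choice enables, the following sketch subscribes a standard .NET `ActivityListener` to the driver's `ActivitySource`. The source name and tag key shown are placeholders, and the aggregator call is hypothetical; the point is that metrics collection rides on activities the driver already emits.
+
+```csharp
+using System.Diagnostics;
+
+var listener = new ActivityListener
+{
+    // Placeholder name; match the driver's actual ActivitySource
+    ShouldListenTo = source => source.Name == "Apache.Arrow.Adbc.Drivers.Databricks",
+    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
+        ActivitySamplingResult.AllData,
+    ActivityStopped = activity =>
+    {
+        // Read tags the driver already sets and forward them off the hot path
+        object? statementType = activity.GetTagItem("db.statement_type");
+        double latencyMs = activity.Duration.TotalMilliseconds;
+        // aggregator.Record(activity.OperationName, statementType, latencyMs);
+    }
+};
+
+ActivitySource.AddActivityListener(listener);
+```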
 
 ---
 
-## 11. Implementation Checklist
-
-### Phase 1: Tag Definition System
+## 12. Implementation Checklist
+
+### Phase 1: Feature Flag Cache & Per-Host Management
+- [ ] Create `FeatureFlagCache` singleton with per-host contexts
+- [ ] Implement `FeatureFlagContext` with reference counting
+- [ ] Add cache expiration logic (15 minute TTL)
+- [ ] Implement `FetchFeatureFlagAsync` to call feature endpoint
+- [ ] Create `TelemetryClientManager` singleton
+- [ ] Implement `TelemetryClientHolder` with reference counting
+- [ ] Add unit tests for cache behavior and reference counting
+
+### Phase 2: Circuit Breaker
+- [ ] Create `CircuitBreaker` class with state machine
+- [ ] Create `CircuitBreakerManager` singleton (per-host breakers)
+- [ ] Create `CircuitBreakerTelemetryExporter` wrapper
+- [ ] Configure failure thresholds and timeouts
+- [ ] Add DEBUG logging for state transitions
+- [ ] Add unit tests for circuit breaker logic
+
+### Phase 3: Exception Handling
+- [ ] Create `ExceptionClassifier` for terminal vs retryable
+- [ ] Update `MetricsAggregator` to buffer retryable exceptions
+- [ ] Implement immediate flush for terminal exceptions
+- [ ] Wrap all telemetry code in try-catch blocks
+- [ ] Replace all logging with TRACE/DEBUG levels only
+- [ ] Ensure the circuit breaker sees exceptions before they are swallowed
+- [ ] Add unit tests for exception classification
+
+### Phase 4: Tag Definition System
 - [ ] Create `TagDefinitions/TelemetryTag.cs` (attribute and enums)
 - [ ] Create `TagDefinitions/ConnectionOpenEvent.cs` (connection tag definitions)
 - [ ] Create `TagDefinitions/StatementExecutionEvent.cs` (statement tag definitions)
@@ -1150,27 +1812,28 @@ The Activity-based design was selected because it:
 - [ ] Create `TagDefinitions/TelemetryTagRegistry.cs` (central registry)
 - [ ] Add unit tests for tag registry
 
-### Phase 2: Core Implementation
+### Phase 5: Core Implementation
 - [ ] Create `DatabricksActivityListener` class
-- [ ] Create `MetricsAggregator` class (using tag registry for filtering)
-- [ ] Create `DatabricksTelemetryExporter` class (reuse from original design)
+- [ ] Create `MetricsAggregator` class (with exception buffering)
+- [ ] Create `DatabricksTelemetryExporter` class
 - [ ] Add necessary tags to existing activities (using defined constants)
-- [ ] Add feature flag integration
+- [ ] Update connection to use per-host management
 
-### Phase 3: Integration
-- [ ] Initialize listener in `DatabricksConnection.OpenAsync()`
-- [ ] Stop listener in `DatabricksConnection.CloseAsync()`
+### Phase 6: Integration
+- [ ] Update `DatabricksConnection.OpenAsync()` to use the per-host managers
+- [ ] Implement graceful shutdown in `DatabricksConnection.CloseAsync()`
 - [ ] Add configuration parsing from connection string
-- [ ] Add server feature flag check
+- [ ] Wire up feature flag cache
 
-### Phase 4: Testing
-- [ ] Unit tests for ActivityListener
-- [ ] Unit tests for MetricsAggregator
-- [ ] Integration tests with real activities
+### Phase 7: Testing
+- [ ] Unit tests for all new components
+- [ ] Integration tests for per-host management
+- [ ] Integration tests for circuit breaker
+- [ ] Integration tests for graceful shutdown
 - [ ] Performance tests (overhead measurement)
-- [ ] Compatibility tests with OpenTelemetry
+- [ ] Load tests with many concurrent connections
 
-### Phase 5: Documentation
+### Phase 8: Documentation
 - [ ] Update Activity instrumentation docs
 - [ ] Document new activity tags
 - [ ] Update configuration guide
@@ -1178,9 +1841,9 @@
 
 ---
 
-## 12. Open Questions
+## 13. Open Questions
 
-### 12.1 Activity Tag Naming Conventions
+### 13.1 Activity Tag Naming Conventions
 
 **Question**: Should we use OpenTelemetry semantic conventions for tag names?
 
@@ -1191,7 +1854,7 @@
 
 This ensures compatibility with OTEL ecosystem.
 
-### 12.2 Statement Completion Detection
+### 13.2 Statement Completion Detection
 
 **Question**: How do we know when a statement is complete for aggregation?
 
@@ -1202,7 +1865,7 @@
 
 **Recommendation**: Use activity completion - cleaner and automatic.
 
-### 12.3 Performance Impact on Existing Activity Users
+### 13.3 Performance Impact on Existing Activity Users
 
 **Question**: Will adding tags impact applications that already use Activity for tracing?
 
@@ -1213,20 +1876,31 @@
 
 ---
 
-## 13. References
+## 14. References
 
-### 13.1 Related Documentation
+### 14.1 Related Documentation
 - [.NET Activity API](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing)
 - [OpenTelemetry .NET](https://opentelemetry.io/docs/languages/net/)
 - [ActivityListener Documentation](https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.activitylistener)
 
-### 13.2 Existing Code References
+### 14.2 Existing Code References
+**ADBC Driver**:
 - `ActivityTrace.cs`: Existing Activity helper
 - `DatabricksAdbcActivitySource`: Existing ActivitySource
 - Connection/Statement activities: Already instrumented
 
+**JDBC Driver** (reference implementation):
+- `TelemetryClient.java:15`: Main telemetry client with batching and flush
+- `TelemetryClientFactory.java:27`: Per-host client management with reference counting
+- `TelemetryClientHolder.java:5`: Reference counting holder
+- `CircuitBreakerTelemetryPushClient.java:15`: Circuit breaker wrapper
+- `CircuitBreakerManager.java:25`: Per-host circuit breaker management
+- `TelemetryPushClient.java:86-94`: Exception re-throwing for circuit breaker
+- `TelemetryHelper.java:60-71`: Feature flag checking
+- `DatabricksDriverFeatureFlagsContextFactory.java:27`: Per-host feature flag cache
+
 ---
 
 ## Summary