[LEADS-141] Add Latency Metrics to Evaluation Reports #127
base: main
Conversation
Walkthrough
Adds streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second), introduces a PerformanceTracker and StreamingContext in the streaming parser, and propagates the metrics through constants, models, pipeline, configuration, and tests to capture, serialize, and assert streaming timing and throughput.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Parser as parse_streaming_response
participant Context as StreamingContext
participant Perf as _PerformanceTracker
participant Proc as _process_event
participant Validator as _validate_response
Client->>Parser: send SSE stream
Parser->>Context: initialize (includes Perf)
Context->>Perf: record start timestamp
loop per SSE line
Parser->>Parser: _parse_sse_line -> event
Parser->>Proc: dispatch event with Context
alt content token or tool_call
Proc->>Perf: record TTFT if first content
Proc->>Context: append token/tool_call, increment counters
else turn_complete
Proc->>Context: update input/output token counts
else end
Proc->>Perf: record end timestamp
Perf->>Context: compute streaming_duration & tokens_per_second
Proc->>Context: set final_response
else error
Proc->>Context: record error
end
end
Parser->>Validator: validate final_response & conversation_id
Validator-->>Parser: validated
Parser->>Client: return serialized response (includes streaming metrics)
```
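As a quick orientation, here is a minimal, self-contained sketch of the loop the diagram describes. The event names, helper signatures, and field set are assumptions for illustration; they mirror the names used in this PR but are not the actual implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional

CONTENT_EVENTS = ("token", "tool_call")  # assumed content-bearing event types


@dataclass
class StreamingContext:
    """Illustrative stand-in for the PR's streaming context."""
    final_response: str = ""
    output_tokens: int = 0
    stream_start: float = field(default_factory=time.perf_counter)
    time_to_first_token: Optional[float] = None
    streaming_duration: Optional[float] = None


def _process_event(ctx: StreamingContext, event: str, data: dict[str, Any]) -> None:
    """Record TTFT on the first content event, accumulate output, close on 'end'."""
    if event in CONTENT_EVENTS:
        if ctx.time_to_first_token is None:
            ctx.time_to_first_token = time.perf_counter() - ctx.stream_start
        ctx.final_response += data.get("token", "")
        ctx.output_tokens += 1
    elif event == "end":
        ctx.streaming_duration = time.perf_counter() - ctx.stream_start


# Usage with already-decoded events:
ctx = StreamingContext()
for event, data in [("token", {"token": "Hi"}), ("token", {"token": "!"}), ("end", {})]:
    _process_event(ctx, event, data)
print(ctx.final_response, ctx.time_to_first_token, ctx.streaming_duration)
```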
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
41-66: Consider making `StreamingContext` private for consistency.
The `_PerformanceTracker` class uses an underscore prefix to indicate internal use. Since `StreamingContext` is also only used internally, consider renaming it to `_StreamingContext` for consistency.
120-139: Add type hint for the `event_data` parameter.
The `event_data` parameter is missing a full type hint. Per coding guidelines, type hints should be provided for all functions.
🔎 Proposed fix
```diff
-def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
+def _process_event(ctx: StreamingContext, event: str, event_data: dict[str, Any]) -> None:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🧠 Learnings (4)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
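To make the FAIL/ERROR split described in this learning concrete, here is a small hedged sketch; the result shape and field names are assumed for illustration only and are not the framework's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    # Hypothetical shape used only to illustrate the convention above.
    status: str            # "PASS", "FAIL", or "ERROR"
    error: Optional[str]   # None for PASS/FAIL; message text for ERROR


def classify(criteria_passed: bool, failure: Optional[Exception]) -> EvalResult:
    """FAIL means the evaluation ran and did not pass; ERROR means it could not run."""
    if failure is not None:
        return EvalResult(status="ERROR", error=str(failure))
    return EvalResult(status="PASS" if criteria_passed else "FAIL", error=None)


print(classify(False, None))                     # FAIL, error=None
print(classify(False, RuntimeError("API down"))) # ERROR, error="API down"
```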
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
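A minimal sketch of that extraction, assuming the fixed `data: ` prefix described in the learning; the real parser's surrounding code and constants module are not reproduced here.

```python
import json
from typing import Any, Optional

DATA_PREFIX = "data: "  # fixed prefix; the controlled API format never embeds it in the payload


def extract_payload(line: str) -> Optional[dict[str, Any]]:
    """Strip the SSE prefix and decode the JSON payload; ignore non-data lines."""
    if not line.startswith(DATA_PREFIX):
        return None
    return json.loads(line.replace(DATA_PREFIX, ""))


print(extract_payload('data: {"event": "token", "data": {"token": "Hi"}}'))
```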
🧬 Code graph analysis (2)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(88-117)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
APIResponse(80-149)
from_raw_response(114-149)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
🔇 Additional comments (13)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-91: LGTM! Clean implementation of streaming metrics extraction.
The streaming performance metrics extraction is well-implemented (a minimal sketch of the pattern follows below):
- Appropriate use of `getattr` with `None` defaults for optional fields
- Conditional debug logging only when metrics are present
- Consistent with the existing token usage pattern
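The `getattr`-with-`None` pattern the reviewers describe can be sketched as follows; the attribute names match the metrics added in this PR, but the response object here is a stand-in, not the real `APIResponse`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_streaming_metrics(api_response: Any) -> dict[str, Optional[float]]:
    """Pull optional streaming metrics, defaulting to None when a field is absent."""
    return {
        "time_to_first_token": getattr(api_response, "time_to_first_token", None),
        "streaming_duration": getattr(api_response, "streaming_duration", None),
        "tokens_per_second": getattr(api_response, "tokens_per_second", None),
    }


# Works for both streaming and non-streaming responses:
print(extract_streaming_metrics(SimpleNamespace(time_to_first_token=0.12)))
print(extract_streaming_metrics(SimpleNamespace()))  # all None
```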
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly propagated to evaluation results.
The streaming performance metrics are correctly extracted from `turn_data` and populated into `EvaluationResult`, following the same conditional pattern as other turn-level fields.
218-236: LGTM! Error handling improved with consistent streaming metrics.
Good refactoring to use a local `turn_data` variable for clarity, and the error result now includes streaming metrics consistently with the success path.
tests/unit/core/api/test_streaming_parser.py (1)
260-377: LGTM! Comprehensive test coverage for streaming performance metrics.
The new test class provides excellent coverage:
- Validates TTFT capture on different event types (token, tool_call, turn_complete)
- Checks metric relationships (streaming_duration ≥ time_to_first_token)
- Tests tokens_per_second calculation with and without token counts
- Verifies complete flow with all metrics
src/lightspeed_evaluation/core/models/api.py (2)
96-111: LGTM! Streaming metrics properly defined with validation.
The streaming performance metric fields are well-defined:
- Appropriate `Optional[float]` type for streaming-only metrics
- Non-negative validation with `ge=0`
- Clear descriptions indicating streaming-only applicability
134-148: LGTM! Streaming metrics correctly extracted and propagated.
The extraction logic correctly uses `.get()` to retrieve optional streaming metrics from raw data and passes them to the APIResponse constructor.
tests/unit/core/models/test_api_additional.py (1)
168-228: LGTM! Thorough test coverage for APIResponse streaming metrics.
The tests comprehensively cover:
- Direct instantiation with streaming metrics
- Default None values when metrics are absent
- Parsing from raw response with and without streaming metrics
- Token count propagation alongside streaming metrics
src/lightspeed_evaluation/core/models/data.py (1)
81-96: LGTM! Streaming metrics properly defined in TurnData with validation.
The streaming performance metrics are correctly defined with:
- Appropriate `Optional[float]` type
- Non-negative validation (`ge=0`)
- Clear descriptions
src/lightspeed_evaluation/core/api/streaming_parser.py (5)
1-14: LGTM!
The imports are appropriate for the new performance tracking functionality, and `CONTENT_EVENTS` as a tuple provides an immutable constant for event classification.
17-38: LGTM!
Good use of `time.perf_counter()` for high-resolution timing. The TTFT capture logic correctly ensures single capture via the `_first_content_received` flag.
69-85: LGTM!
The throughput calculation correctly excludes TTFT from the generation time, and the function properly handles edge cases (no tokens, missing TTFT, zero/negative generation time).
142-147: LGTM!
The validation logic correctly ensures required fields are present before returning the response. The use of `ValueError` was noted in the earlier comment about custom exceptions.
150-196: LGTM!
The existing helper functions (`_parse_sse_line`, `_parse_tool_call`, `_format_tool_sequences`) integrate correctly with the new context-based streaming flow.
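The throughput rule the reviewers describe (exclude TTFT from the generation window, return None on edge cases) can be written compactly. This is a sketch of the calculation, not the PR's exact function body.

```python
from typing import Optional


def tokens_per_second(output_tokens: int, ttft: Optional[float], total_duration: float) -> Optional[float]:
    """Throughput over the generation phase only; None when it cannot be computed."""
    if output_tokens <= 0 or ttft is None:
        return None
    generation_time = total_duration - ttft  # time spent emitting tokens after the first one
    if generation_time <= 0:
        return None
    return output_tokens / generation_time


print(tokens_per_second(120, ttft=0.4, total_duration=2.4))  # 60.0 tokens/sec
print(tokens_per_second(0, ttft=0.4, total_duration=2.4))    # None: no tokens generated
```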
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
120-140: Use custom exceptions per coding guidelines.
Line 125 raises `ValueError` for streaming API errors. Per the coding guidelines and a previous review comment, this should use `APIError` from `core.system.exceptions` instead.
🔎 Suggested fix
Update the import at the top of the file:
```diff
+from lightspeed_evaluation.core.system.exceptions import APIError
```
Then update the error handling:
```diff
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")
```
Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".
142-147: Use custom exceptions for validation errors.
Lines 145 and 147 raise `ValueError` for validation failures. Per the coding guidelines and a previous review comment, these should use `DataValidationError` from `core.system.exceptions` instead.
🔎 Suggested fix
Update the import at the top of the file:
```diff
+from lightspeed_evaluation.core.system.exceptions import DataValidationError
```
Then update the validation:
```diff
 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
```
Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".
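For context, the guideline assumes exception classes named `APIError` and `DataValidationError` living in `core.system.exceptions`; their actual definitions are not shown in this review, so the sketch below only illustrates the usage pattern with assumed, minimal class bodies.

```python
class EvaluationError(Exception):
    """Assumed common base; the real hierarchy in core.system.exceptions may differ."""


class APIError(EvaluationError):
    """Raised when the evaluated API misbehaves (e.g. an error event mid-stream)."""


class DataValidationError(EvaluationError):
    """Raised when a response is missing required fields."""


def validate(final_response: str, conversation_id: str) -> None:
    """Usage pattern: specific exception types instead of bare ValueError."""
    if not final_response:
        raise DataValidationError("No final response found in streaming output")
    if not conversation_id:
        raise DataValidationError("No Conversation ID found")
```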
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-38: Add type hints and docstrings for `_PerformanceTracker`.
Per coding guidelines, all functions and methods should have type hints, and public APIs should have Google-style docstrings. While this is a private class, adding docstrings improves maintainability.
🔎 Suggested improvements
```diff
 @dataclass
 class _PerformanceTracker:
-    """Tracks streaming performance metrics (TTFT, throughput)."""
+    """Tracks streaming performance metrics.
+
+    Captures time to first token (TTFT) and calculates throughput metrics
+    for streaming responses.
+
+    Attributes:
+        stream_start_time: Monotonic timestamp when streaming started
+        time_to_first_token: Time in seconds to receive first content token
+        _first_content_received: Internal flag tracking if first content seen
+    """

     stream_start_time: float = field(default_factory=time.perf_counter)
     time_to_first_token: Optional[float] = None
     _first_content_received: bool = False

-    def capture_ttft(self) -> None:
-        """Capture time to first token if not already captured."""
+    def capture_ttft(self) -> None:
+        """Capture time to first token if not already captured.
+
+        Records the time elapsed from stream start to first content token.
+        Subsequent calls are ignored to preserve the first measurement.
+        """
         if not self._first_content_received:
             self.time_to_first_token = time.perf_counter() - self.stream_start_time
             self._first_content_received = True
             logger.debug("Time to first token: %.3f seconds", self.time_to_first_token)

-    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
-        """Calculate streaming duration and tokens per second."""
+    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
+        """Calculate streaming duration and tokens per second.
+
+        Args:
+            output_tokens: Number of output tokens generated
+
+        Returns:
+            Tuple of (streaming_duration, tokens_per_second)
+        """
         streaming_duration = time.perf_counter() - self.stream_start_time
         tokens_per_second = _calculate_tokens_per_second(
             output_tokens, self.time_to_first_token, streaming_duration
         )
         return streaming_duration, tokens_per_second
```
Based on coding guidelines: "Provide type hints for all public functions and methods" and "Use Google-style docstrings for all public APIs".
41-66: Enhance docstrings for `StreamingContext`.
The class and method need more detailed Google-style docstrings per coding guidelines. The existing docstrings are minimal.
🔎 Suggested improvements
```diff
 @dataclass
 class StreamingContext:  # pylint: disable=too-many-instance-attributes
-    """Context for streaming response parsing."""
+    """Context for streaming response parsing.
+
+    Maintains state during SSE stream parsing, including response content,
+    conversation tracking, token counts, and performance measurements.
+
+    Attributes:
+        conversation_id: Unique conversation identifier from API
+        final_response: Complete response text after streaming
+        tool_calls: List of tool invocations from the stream
+        input_tokens: Number of input tokens used
+        output_tokens: Number of output tokens generated
+        perf: Performance tracker for TTFT and throughput metrics
+    """

     conversation_id: str = ""
     final_response: str = ""
     tool_calls: list[dict[str, Any]] = field(default_factory=list)
     input_tokens: int = 0
     output_tokens: int = 0
     perf: _PerformanceTracker = field(default_factory=_PerformanceTracker)

     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+
+        Returns:
+            Dictionary containing response text, tool calls, conversation ID,
+            token counts, and streaming performance metrics (TTFT, duration,
+            tokens per second).
+        """
         streaming_duration, tokens_per_second = self.perf.get_metrics(
             self.output_tokens
         )
         return {
             "response": self.final_response,
             "tool_calls": _format_tool_sequences(self.tool_calls),
             "conversation_id": self.conversation_id,
             "input_tokens": self.input_tokens,
             "output_tokens": self.output_tokens,
             "time_to_first_token": self.perf.time_to_first_token,
             "streaming_duration": streaming_duration,
             "tokens_per_second": tokens_per_second,
         }
```
Based on coding guidelines: "Use Google-style docstrings for all public APIs".
69-85: Consider enhancing docstring for clarity.
The function has a good implementation with proper edge case handling. The docstring could be expanded to document parameters and return value in Google-style format.
🔎 Suggested improvement
```diff
 def _calculate_tokens_per_second(
     output_tokens: int, ttft: Optional[float], total_duration: float
 ) -> Optional[float]:
-    """Calculate tokens per second, excluding TTFT from throughput calculation."""
+    """Calculate tokens per second, excluding TTFT from throughput calculation.
+
+    Args:
+        output_tokens: Number of output tokens generated
+        ttft: Time to first token in seconds, or None if not captured
+        total_duration: Total streaming duration in seconds
+
+    Returns:
+        Tokens per second throughput (output_tokens / generation_time),
+        or None if calculation is not possible (no tokens, no TTFT, or
+        negative generation time).
+    """
     if output_tokens <= 0 or ttft is None:
         return None
     generation_time = total_duration - ttft
     if generation_time <= 0:
         return None
     tokens_per_second = output_tokens / generation_time
     logger.debug(
         "Streaming performance: %.3f tokens/sec (%d tokens in %.3f sec)",
         tokens_per_second,
         output_tokens,
         generation_time,
     )
     return tokens_per_second
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/core/models/data.py
- config/system.yaml
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/api/streaming_parser.py
🧠 Learnings (6)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
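That fallback behaviour can be illustrated with a small sketch; the metric-list shape here is assumed purely for illustration and is not the project's real config model.

```python
from typing import Optional


def resolve_turn_metrics(turn_metrics: Optional[list[str]], system_defaults: list[str]) -> list[str]:
    """Explicit per-turn metrics win; otherwise fall back to system config defaults."""
    return turn_metrics if turn_metrics is not None else system_defaults


print(resolve_turn_metrics(None, ["answer_relevancy"]))  # falls back to defaults
print(resolve_turn_metrics([], ["answer_relevancy"]))    # explicit empty list is kept as-is
```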
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (1)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(88-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.13)
🔇 Additional comments (8)
tests/unit/core/api/test_streaming_parser.py (2)
36-39: LGTM! Existing test appropriately updated.
The test now verifies that performance metrics are present in the response. This ensures backward compatibility while validating the new metrics are populated.
260-377: Excellent test coverage for streaming performance metrics.
The new test suite comprehensively validates:
- Time to First Token (TTFT) capture on different event types
- Streaming duration calculation and its relationship to TTFT
- Tokens per second calculation with and without token counts
- Complete end-to-end flow with all metrics
The tests follow pytest conventions and coding guidelines.
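In spirit, those checks look something like the sketch below. It uses a stand-in parser so it runs on its own; the real tests exercise parse_streaming_response with mocked SSE lines and assert the same relationships between the metrics.

```python
import time


def fake_parse_stream(chunks: list[str]) -> dict:
    """Stand-in for the real streaming parser: emits the same three metrics."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens += len(chunk.split())
    duration = time.perf_counter() - start
    tps = tokens / (duration - ttft) if ttft is not None and duration > ttft else None
    return {"time_to_first_token": ttft, "streaming_duration": duration, "tokens_per_second": tps}


def test_streaming_duration_not_less_than_ttft():
    result = fake_parse_stream(["hello", "world again"])
    assert result["time_to_first_token"] is not None
    assert result["streaming_duration"] >= result["time_to_first_token"]


def test_no_tokens_means_no_throughput():
    result = fake_parse_stream([])
    assert result["tokens_per_second"] is None
```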
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly integrated.
The streaming performance metrics are correctly extracted from `turn_data` with appropriate `None` fallbacks, consistent with the existing pattern for token usage metrics.
214-236: LGTM! Error result consistently includes streaming metrics.
The refactoring to use a local `turn_data` variable improves readability and avoids repeated attribute access. Including streaming metrics in error results is appropriate: they provide valuable performance insights even when evaluations fail.
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-90: LGTM! Streaming metrics properly extracted and logged.
The implementation correctly:
- Uses `getattr` with `None` defaults for optional streaming fields
- Logs metrics only when available (conditional on TTFT presence)
- Employs defensive null handling in logging to prevent crashes
- Provides clear, formatted output with appropriate precision
The code follows project guidelines and integrates cleanly with existing patterns.
src/lightspeed_evaluation/core/models/api.py (2)
96-111: LGTM! Streaming metric fields properly defined.
The new fields are well-structured with:
- Appropriate types (`Optional[float]`)
- Non-negative validation (`ge=0`)
- Clear documentation indicating streaming-only availability
- Consistent naming conventions
134-138: LGTM! Metric extraction and population implemented correctly.
The extraction uses `raw_data.get()` without defaults, appropriately returning `None` for missing keys. This pattern is consistent with the optional nature of streaming metrics and aligns with existing token extraction logic.
Also applies to: 146-148
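The extraction pattern described above, sketched with an assumed minimal model; the real APIResponse carries many more fields, so this is illustrative only.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field


class MiniAPIResponse(BaseModel):
    """Illustrative subset of APIResponse: only the streaming metrics."""
    time_to_first_token: Optional[float] = Field(default=None, ge=0)
    streaming_duration: Optional[float] = Field(default=None, ge=0)
    tokens_per_second: Optional[float] = Field(default=None, ge=0)

    @classmethod
    def from_raw_response(cls, raw_data: dict[str, Any]) -> "MiniAPIResponse":
        # .get() without a default returns None for missing keys, matching the optional fields
        return cls(
            time_to_first_token=raw_data.get("time_to_first_token"),
            streaming_duration=raw_data.get("streaming_duration"),
            tokens_per_second=raw_data.get("tokens_per_second"),
        )


print(MiniAPIResponse.from_raw_response({"time_to_first_token": 0.2}))
```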
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
88-117: LGTM! Well-structured refactoring with clear separation of concerns.
The refactored implementation:
- Uses context-based state management for cleaner code
- Separates parsing, event processing, and validation into distinct phases
- Captures TTFT at the correct point (first content event)
- Includes comprehensive documentation of performance metrics
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
120-147: Use custom exceptions from `core.system.exceptions` for error handling.
Per the coding guidelines and previous review feedback, replace `ValueError` with custom exceptions:
- Line 125: Use `APIError` for streaming API errors
- Lines 145, 147: Use `DataValidationError` for validation failures
🔎 Proposed fix
```diff
+from lightspeed_evaluation.core.system.exceptions import APIError, DataValidationError
+
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")

 # ... rest unchanged ...

 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
```
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
41-66: `StreamingContext` is a public class; add a Google-style docstring for the `to_response_dict` method.
Per coding guidelines, public APIs should use Google-style docstrings. The `to_response_dict` method lacks a return type description.
🔎 Suggested docstring enhancement
```diff
 def to_response_dict(self) -> dict[str, Any]:
-    """Convert context to response dictionary."""
+    """Convert context to response dictionary.
+
+    Returns:
+        dict[str, Any]: Dictionary containing response data and streaming
+        performance metrics (time_to_first_token, streaming_duration,
+        tokens_per_second).
+    """
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/core/constants.py
- config/system.yaml
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
📓 Path-based instructions (1)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
🧠 Learnings (6)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.13)
🔇 Additional comments (9)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-38: Well-designed performance tracker with proper encapsulation.
The `_PerformanceTracker` dataclass cleanly encapsulates TTFT capture logic with idempotent `capture_ttft()` and correctly uses `time.perf_counter()` for high-resolution timing. The separation of concerns between timing capture and metrics calculation is good.
69-85: Tokens-per-second calculation correctly excludes TTFT from throughput.
The logic properly calculates generation throughput by subtracting TTFT from total duration, and handles edge cases (zero/negative tokens, missing TTFT, zero/negative generation time) by returning `None`.
88-117: Refactored parsing flow is cleaner and more maintainable.
The event-driven approach with `StreamingContext`, `_process_event`, and `_validate_response` improves readability and testability. The TTFT capture on `CONTENT_EVENTS` is correctly placed.
src/lightspeed_evaluation/core/models/__init__.py (1)
15-15: LGTM!
The `StreamingMetricsMixin` import and export follow the established patterns. The categorization under the `# Mixins` comment is appropriate.
Also applies to: 47-48
src/lightspeed_evaluation/core/models/mixins.py (1)
1-33: Clean mixin design with proper validation constraints.
The `StreamingMetricsMixin` follows Pydantic best practices:
- Uses `Optional[float]` with `default=None` for streaming-only metrics
- Applies `ge=0` validation to prevent negative values
- Includes descriptive field documentation
This provides a reusable component for `APIResponse`, `TurnData`, and `EvaluationResult`.
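A condensed sketch of the mixin pattern being praised here; the three field names match the PR, while the inheriting classes below are trimmed stand-ins for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field


class StreamingMetricsMixin(BaseModel):
    """Shared streaming metric fields; ge=0 rejects negatives, None means 'not streaming'."""
    time_to_first_token: Optional[float] = Field(default=None, ge=0)
    streaming_duration: Optional[float] = Field(default=None, ge=0)
    tokens_per_second: Optional[float] = Field(default=None, ge=0)


class TurnDataSketch(StreamingMetricsMixin):
    """Stand-in for TurnData: inherits the metric fields plus its own data."""
    response: str = ""


class EvaluationResultSketch(StreamingMetricsMixin):
    """Stand-in for EvaluationResult."""
    result: str = "PASS"


print(TurnDataSketch(response="hi", time_to_first_token=0.3))
```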
src/lightspeed_evaluation/core/models/api.py (2)
7-8: Inheritance from `StreamingMetricsMixin` is appropriate.
Since `StreamingMetricsMixin` extends `BaseModel`, `APIResponse` correctly inherits Pydantic functionality along with the streaming metric fields.
Also applies to: 82-82
119-134: Streaming metrics extraction and propagation looks correct.
The `from_raw_response` method properly extracts the three streaming metrics using `.get()` (defaulting to `None`), which aligns with the mixin's field defaults. The metrics are passed to the constructor where Pydantic's `ge=0` validation will apply.
src/lightspeed_evaluation/core/models/data.py (2)
10-10: `TurnData` correctly inherits streaming metrics via mixin.
The inheritance change from `BaseModel` to `StreamingMetricsMixin` is appropriate. Since the mixin extends `BaseModel`, all existing Pydantic functionality is preserved while adding the three streaming metric fields with proper validation.
Also applies to: 36-36
384-384: `EvaluationResult` correctly inherits streaming metrics via mixin.
The mixin inheritance provides `time_to_first_token`, `streaming_duration`, and `tokens_per_second` fields with `ge=0` validation, addressing the previous review feedback about validation constraints.
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Data / Schema
Tests