[LEADS-141] Add Latency Metrics to Evaluation Reports #127
base: main
Conversation
Walkthrough
Adds streaming performance metrics (time_to_first_token, streaming_duration, tokens_per_second), introduces a PerformanceTracker and StreamingContext in the streaming parser, and propagates the metrics through constants, models, pipeline, configuration, and tests to capture, serialize, and assert streaming timing and throughput.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant Parser as parse_streaming_response
participant Context as StreamingContext
participant Perf as _PerformanceTracker
participant Proc as _process_event
participant Validator as _validate_response
Client->>Parser: send SSE stream
Parser->>Context: initialize (includes Perf)
Context->>Perf: record start timestamp
loop per SSE line
Parser->>Parser: _parse_sse_line -> event
Parser->>Proc: dispatch event with Context
alt content token or tool_call
Proc->>Perf: record TTFT if first content
Proc->>Context: append token/tool_call, increment counters
else turn_complete
Proc->>Context: update input/output token counts
else end
Proc->>Perf: record end timestamp
Perf->>Context: compute streaming_duration & tokens_per_second
Proc->>Context: set final_response
else error
Proc->>Context: record error
end
end
Parser->>Validator: validate final_response & conversation_id
Validator-->>Parser: validated
Parser->>Client: return serialized response (includes streaming metrics)
```
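As a quick orientation, here is a minimal, self-contained sketch of the loop the diagram describes. The event names, helper signatures, and field set are assumptions for illustration; they mirror the names used in this PR but are not the actual implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional

CONTENT_EVENTS = ("token", "tool_call")  # assumed content-bearing event types


@dataclass
class StreamingContext:
    """Illustrative stand-in for the PR's streaming context."""
    final_response: str = ""
    output_tokens: int = 0
    stream_start: float = field(default_factory=time.perf_counter)
    time_to_first_token: Optional[float] = None
    streaming_duration: Optional[float] = None


def _process_event(ctx: StreamingContext, event: str, data: dict[str, Any]) -> None:
    """Record TTFT on the first content event, accumulate output, close on 'end'."""
    if event in CONTENT_EVENTS:
        if ctx.time_to_first_token is None:
            ctx.time_to_first_token = time.perf_counter() - ctx.stream_start
        ctx.final_response += data.get("token", "")
        ctx.output_tokens += 1
    elif event == "end":
        ctx.streaming_duration = time.perf_counter() - ctx.stream_start


# Usage with already-decoded events:
ctx = StreamingContext()
for event, data in [("token", {"token": "Hi"}), ("token", {"token": "!"}), ("end", {})]:
    _process_event(ctx, event, data)
print(ctx.final_response, ctx.time_to_first_token, ctx.streaming_duration)
```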
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
41-66: Consider making `StreamingContext` private for consistency.
The `_PerformanceTracker` class uses an underscore prefix to indicate internal use. Since `StreamingContext` is also only used internally, consider renaming it to `_StreamingContext` for consistency.
120-139: Add type hint for the `event_data` parameter.
The `event_data` parameter is missing a full type hint. Per coding guidelines, type hints should be provided for all functions.
🔎 Proposed fix
```diff
-def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
+def _process_event(ctx: StreamingContext, event: str, event_data: dict[str, Any]) -> None:
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/constants.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
tests/unit/core/models/test_api_additional.py
🧠 Learnings (4)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
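To make the FAIL/ERROR split described in this learning concrete, here is a small hedged sketch; the result shape and field names are assumed for illustration only and are not the framework's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    # Hypothetical shape used only to illustrate the convention above.
    status: str            # "PASS", "FAIL", or "ERROR"
    error: Optional[str]   # None for PASS/FAIL; message text for ERROR


def classify(criteria_passed: bool, failure: Optional[Exception]) -> EvalResult:
    """FAIL means the evaluation ran and did not pass; ERROR means it could not run."""
    if failure is not None:
        return EvalResult(status="ERROR", error=str(failure))
    return EvalResult(status="PASS" if criteria_passed else "FAIL", error=None)


print(classify(False, None))                     # FAIL, error=None
print(classify(False, RuntimeError("API down"))) # ERROR, error="API down"
```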
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
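A minimal sketch of that extraction, assuming the fixed `data: ` prefix described in the learning; the real parser's surrounding code and constants module are not reproduced here.

```python
import json
from typing import Any, Optional

DATA_PREFIX = "data: "  # fixed prefix; the controlled API format never embeds it in the payload


def extract_payload(line: str) -> Optional[dict[str, Any]]:
    """Strip the SSE prefix and decode the JSON payload; ignore non-data lines."""
    if not line.startswith(DATA_PREFIX):
        return None
    return json.loads(line.replace(DATA_PREFIX, ""))


print(extract_payload('data: {"event": "token", "data": {"token": "Hi"}}'))
```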
🧬 Code graph analysis (2)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(88-117)
tests/unit/core/models/test_api_additional.py (1)
src/lightspeed_evaluation/core/models/api.py (2)
APIResponse(80-149)
from_raw_response(114-149)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.13)
🔇 Additional comments (13)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-91: LGTM! Clean implementation of streaming metrics extraction.
The streaming performance metrics extraction is well-implemented (a minimal sketch of the pattern follows below):
- Appropriate use of `getattr` with `None` defaults for optional fields
- Conditional debug logging only when metrics are present
- Consistent with the existing token usage pattern
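The `getattr`-with-`None` pattern the reviewers describe can be sketched as follows; the attribute names match the metrics added in this PR, but the response object here is a stand-in, not the real `APIResponse`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_streaming_metrics(api_response: Any) -> dict[str, Optional[float]]:
    """Pull optional streaming metrics, defaulting to None when a field is absent."""
    return {
        "time_to_first_token": getattr(api_response, "time_to_first_token", None),
        "streaming_duration": getattr(api_response, "streaming_duration", None),
        "tokens_per_second": getattr(api_response, "tokens_per_second", None),
    }


# Works for both streaming and non-streaming responses:
print(extract_streaming_metrics(SimpleNamespace(time_to_first_token=0.12)))
print(extract_streaming_metrics(SimpleNamespace()))  # all None
```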
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly propagated to evaluation results.
The streaming performance metrics are correctly extracted from `turn_data` and populated into `EvaluationResult`, following the same conditional pattern as other turn-level fields.
218-236: LGTM! Error handling improved with consistent streaming metrics.
Good refactoring to use a local `turn_data` variable for clarity, and the error result now includes streaming metrics consistently with the success path.
tests/unit/core/api/test_streaming_parser.py (1)
260-377: LGTM! Comprehensive test coverage for streaming performance metrics.
The new test class provides excellent coverage:
- Validates TTFT capture on different event types (token, tool_call, turn_complete)
- Checks metric relationships (streaming_duration ≥ time_to_first_token)
- Tests tokens_per_second calculation with and without token counts
- Verifies complete flow with all metrics
src/lightspeed_evaluation/core/models/api.py (2)
96-111: LGTM! Streaming metrics properly defined with validation.
The streaming performance metric fields are well-defined:
- Appropriate `Optional[float]` type for streaming-only metrics
- Non-negative validation with `ge=0`
- Clear descriptions indicating streaming-only applicability
134-148: LGTM! Streaming metrics correctly extracted and propagated.
The extraction logic correctly uses `.get()` to retrieve optional streaming metrics from raw data and passes them to the APIResponse constructor.
tests/unit/core/models/test_api_additional.py (1)
168-228: LGTM! Thorough test coverage for APIResponse streaming metrics.
The tests comprehensively cover:
- Direct instantiation with streaming metrics
- Default None values when metrics are absent
- Parsing from raw response with and without streaming metrics
- Token count propagation alongside streaming metrics
src/lightspeed_evaluation/core/models/data.py (1)
81-96: LGTM! Streaming metrics properly defined in TurnData with validation.
The streaming performance metrics are correctly defined with:
- Appropriate `Optional[float]` type
- Non-negative validation (`ge=0`)
- Clear descriptions
src/lightspeed_evaluation/core/api/streaming_parser.py (5)
1-14: LGTM!
The imports are appropriate for the new performance tracking functionality, and `CONTENT_EVENTS` as a tuple provides an immutable constant for event classification.
17-38: LGTM!
Good use of `time.perf_counter()` for high-resolution timing. The TTFT capture logic correctly ensures single capture via the `_first_content_received` flag.
69-85: LGTM!
The throughput calculation correctly excludes TTFT from the generation time, and the function properly handles edge cases (no tokens, missing TTFT, zero/negative generation time).
142-147: LGTM!
The validation logic correctly ensures required fields are present before returning the response. The use of `ValueError` was noted in the earlier comment about custom exceptions.
150-196: LGTM!
The existing helper functions (`_parse_sse_line`, `_parse_tool_call`, `_format_tool_sequences`) integrate correctly with the new context-based streaming flow.
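The throughput rule the reviewers describe (exclude TTFT from the generation window, return None on edge cases) can be written compactly. This is a sketch of the calculation, not the PR's exact function body.

```python
from typing import Optional


def tokens_per_second(output_tokens: int, ttft: Optional[float], total_duration: float) -> Optional[float]:
    """Throughput over the generation phase only; None when it cannot be computed."""
    if output_tokens <= 0 or ttft is None:
        return None
    generation_time = total_duration - ttft  # time spent emitting tokens after the first one
    if generation_time <= 0:
        return None
    return output_tokens / generation_time


print(tokens_per_second(120, ttft=0.4, total_duration=2.4))  # 60.0 tokens/sec
print(tokens_per_second(0, ttft=0.4, total_duration=2.4))    # None: no tokens generated
```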
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/api/streaming_parser.py (2)
120-140: Use custom exceptions per coding guidelines.
Line 125 raises `ValueError` for streaming API errors. Per the coding guidelines and a previous review comment, this should use `APIError` from `core.system.exceptions` instead.
🔎 Suggested fix
Update the import at the top of the file:
```diff
+from lightspeed_evaluation.core.system.exceptions import APIError
```
Then update the error handling:
```diff
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")
```
Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".
142-147: Use custom exceptions for validation errors.
Lines 145 and 147 raise `ValueError` for validation failures. Per the coding guidelines and a previous review comment, these should use `DataValidationError` from `core.system.exceptions` instead.
🔎 Suggested fix
Update the import at the top of the file:
```diff
+from lightspeed_evaluation.core.system.exceptions import DataValidationError
```
Then update the validation:
```diff
 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
```
Based on coding guidelines: "Use custom exceptions from core.system.exceptions for error handling".
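For context, the guideline assumes exception classes named `APIError` and `DataValidationError` living in `core.system.exceptions`; their actual definitions are not shown in this review, so the sketch below only illustrates the usage pattern with assumed, minimal class bodies.

```python
class EvaluationError(Exception):
    """Assumed common base; the real hierarchy in core.system.exceptions may differ."""


class APIError(EvaluationError):
    """Raised when the evaluated API misbehaves (e.g. an error event mid-stream)."""


class DataValidationError(EvaluationError):
    """Raised when a response is missing required fields."""


def validate(final_response: str, conversation_id: str) -> None:
    """Usage pattern: specific exception types instead of bare ValueError."""
    if not final_response:
        raise DataValidationError("No final response found in streaming output")
    if not conversation_id:
        raise DataValidationError("No Conversation ID found")
```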
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-38: Add type hints and docstrings for `_PerformanceTracker`.
Per coding guidelines, all functions and methods should have type hints, and public APIs should have Google-style docstrings. While this is a private class, adding docstrings improves maintainability.
🔎 Suggested improvements
```diff
 @dataclass
 class _PerformanceTracker:
-    """Tracks streaming performance metrics (TTFT, throughput)."""
+    """Tracks streaming performance metrics.
+
+    Captures time to first token (TTFT) and calculates throughput metrics
+    for streaming responses.
+
+    Attributes:
+        stream_start_time: Monotonic timestamp when streaming started
+        time_to_first_token: Time in seconds to receive first content token
+        _first_content_received: Internal flag tracking if first content seen
+    """

     stream_start_time: float = field(default_factory=time.perf_counter)
     time_to_first_token: Optional[float] = None
     _first_content_received: bool = False

-    def capture_ttft(self) -> None:
-        """Capture time to first token if not already captured."""
+    def capture_ttft(self) -> None:
+        """Capture time to first token if not already captured.
+
+        Records the time elapsed from stream start to first content token.
+        Subsequent calls are ignored to preserve the first measurement.
+        """
         if not self._first_content_received:
             self.time_to_first_token = time.perf_counter() - self.stream_start_time
             self._first_content_received = True
             logger.debug("Time to first token: %.3f seconds", self.time_to_first_token)

-    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
-        """Calculate streaming duration and tokens per second."""
+    def get_metrics(self, output_tokens: int) -> tuple[float, Optional[float]]:
+        """Calculate streaming duration and tokens per second.
+
+        Args:
+            output_tokens: Number of output tokens generated
+
+        Returns:
+            Tuple of (streaming_duration, tokens_per_second)
+        """
         streaming_duration = time.perf_counter() - self.stream_start_time
         tokens_per_second = _calculate_tokens_per_second(
             output_tokens, self.time_to_first_token, streaming_duration
         )
         return streaming_duration, tokens_per_second
```
Based on coding guidelines: "Provide type hints for all public functions and methods" and "Use Google-style docstrings for all public APIs".
41-66: Enhance docstrings for `StreamingContext`.
The class and method need more detailed Google-style docstrings per coding guidelines. The existing docstrings are minimal.
🔎 Suggested improvements
```diff
 @dataclass
 class StreamingContext:  # pylint: disable=too-many-instance-attributes
-    """Context for streaming response parsing."""
+    """Context for streaming response parsing.
+
+    Maintains state during SSE stream parsing, including response content,
+    conversation tracking, token counts, and performance measurements.
+
+    Attributes:
+        conversation_id: Unique conversation identifier from API
+        final_response: Complete response text after streaming
+        tool_calls: List of tool invocations from the stream
+        input_tokens: Number of input tokens used
+        output_tokens: Number of output tokens generated
+        perf: Performance tracker for TTFT and throughput metrics
+    """

     conversation_id: str = ""
     final_response: str = ""
     tool_calls: list[dict[str, Any]] = field(default_factory=list)
     input_tokens: int = 0
     output_tokens: int = 0
     perf: _PerformanceTracker = field(default_factory=_PerformanceTracker)

     def to_response_dict(self) -> dict[str, Any]:
-        """Convert context to response dictionary."""
+        """Convert context to response dictionary.
+
+        Returns:
+            Dictionary containing response text, tool calls, conversation ID,
+            token counts, and streaming performance metrics (TTFT, duration,
+            tokens per second).
+        """
         streaming_duration, tokens_per_second = self.perf.get_metrics(
             self.output_tokens
         )
         return {
             "response": self.final_response,
             "tool_calls": _format_tool_sequences(self.tool_calls),
             "conversation_id": self.conversation_id,
             "input_tokens": self.input_tokens,
             "output_tokens": self.output_tokens,
             "time_to_first_token": self.perf.time_to_first_token,
             "streaming_duration": streaming_duration,
             "tokens_per_second": tokens_per_second,
         }
```
Based on coding guidelines: "Use Google-style docstrings for all public APIs".
69-85: Consider enhancing docstring for clarity.
The function has a good implementation with proper edge case handling. The docstring could be expanded to document parameters and return value in Google-style format.
🔎 Suggested improvement
```diff
 def _calculate_tokens_per_second(
     output_tokens: int, ttft: Optional[float], total_duration: float
 ) -> Optional[float]:
-    """Calculate tokens per second, excluding TTFT from throughput calculation."""
+    """Calculate tokens per second, excluding TTFT from throughput calculation.
+
+    Args:
+        output_tokens: Number of output tokens generated
+        ttft: Time to first token in seconds, or None if not captured
+        total_duration: Total streaming duration in seconds
+
+    Returns:
+        Tokens per second throughput (output_tokens / generation_time),
+        or None if calculation is not possible (no tokens, no TTFT, or
+        negative generation time).
+    """
     if output_tokens <= 0 or ttft is None:
         return None
     generation_time = total_duration - ttft
     if generation_time <= 0:
         return None
     tokens_per_second = output_tokens / generation_time
     logger.debug(
         "Streaming performance: %.3f tokens/sec (%d tokens in %.3f sec)",
         tokens_per_second,
         output_tokens,
         generation_time,
     )
     return tokens_per_second
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/core/models/data.py
- config/system.yaml
- src/lightspeed_evaluation/core/constants.py
- tests/unit/core/models/test_api_additional.py
🧰 Additional context used
📓 Path-based instructions (3)
**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use pytest mocking (mocker fixture) instead of unittest.mock
Files:
tests/unit/core/api/test_streaming_parser.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Mirror test structure to source code organization, with test files named test_*.py and test functions/classes named test_* and Test*
Files:
tests/unit/core/api/test_streaming_parser.py
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/pipeline/evaluation/amender.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/api/streaming_parser.py
🧠 Learnings (6)
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
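That fallback behaviour can be illustrated with a small sketch; the metric-list shape here is assumed purely for illustration and is not the project's real config model.

```python
from typing import Optional


def resolve_turn_metrics(turn_metrics: Optional[list[str]], system_defaults: list[str]) -> list[str]:
    """Explicit per-turn metrics win; otherwise fall back to system config defaults."""
    return turn_metrics if turn_metrics is not None else system_defaults


print(resolve_turn_metrics(None, ["answer_relevancy"]))  # falls back to defaults
print(resolve_turn_metrics([], ["answer_relevancy"]))    # explicit empty list is kept as-is
```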
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
🧬 Code graph analysis (1)
tests/unit/core/api/test_streaming_parser.py (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
parse_streaming_response(88-117)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.13)
🔇 Additional comments (8)
tests/unit/core/api/test_streaming_parser.py (2)
36-39: LGTM! Existing test appropriately updated.
The test now verifies that performance metrics are present in the response. This ensures backward compatibility while validating the new metrics are populated.
260-377: Excellent test coverage for streaming performance metrics.
The new test suite comprehensively validates:
- Time to First Token (TTFT) capture on different event types
- Streaming duration calculation and its relationship to TTFT
- Tokens per second calculation with and without token counts
- Complete end-to-end flow with all metrics
The tests follow pytest conventions and coding guidelines.
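In spirit, those checks look something like the sketch below. It uses a stand-in parser so it runs on its own; the real tests exercise parse_streaming_response with mocked SSE lines and assert the same relationships between the metrics.

```python
import time


def fake_parse_stream(chunks: list[str]) -> dict:
    """Stand-in for the real streaming parser: emits the same three metrics."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens += len(chunk.split())
    duration = time.perf_counter() - start
    tps = tokens / (duration - ttft) if ttft is not None and duration > ttft else None
    return {"time_to_first_token": ttft, "streaming_duration": duration, "tokens_per_second": tps}


def test_streaming_duration_not_less_than_ttft():
    result = fake_parse_stream(["hello", "world again"])
    assert result["time_to_first_token"] is not None
    assert result["streaming_duration"] >= result["time_to_first_token"]


def test_no_tokens_means_no_throughput():
    result = fake_parse_stream([])
    assert result["tokens_per_second"] is None
```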
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
185-192: LGTM! Streaming metrics properly integrated.
The streaming performance metrics are correctly extracted from `turn_data` with appropriate `None` fallbacks, consistent with the existing pattern for token usage metrics.
214-236: LGTM! Error result consistently includes streaming metrics.
The refactoring to use a local `turn_data` variable improves readability and avoids repeated attribute access. Including streaming metrics in error results is appropriate: they provide valuable performance insights even when evaluations fail.
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
72-90: LGTM! Streaming metrics properly extracted and logged.
The implementation correctly:
- Uses `getattr` with `None` defaults for optional streaming fields
- Logs metrics only when available (conditional on TTFT presence)
- Employs defensive null handling in logging to prevent crashes
- Provides clear, formatted output with appropriate precision
The code follows project guidelines and integrates cleanly with existing patterns.
src/lightspeed_evaluation/core/models/api.py (2)
96-111: LGTM! Streaming metric fields properly defined.
The new fields are well-structured with:
- Appropriate types (`Optional[float]`)
- Non-negative validation (`ge=0`)
- Clear documentation indicating streaming-only availability
- Consistent naming conventions
134-138: LGTM! Metric extraction and population implemented correctly.
The extraction uses `raw_data.get()` without defaults, appropriately returning `None` for missing keys. This pattern is consistent with the optional nature of streaming metrics and aligns with existing token extraction logic.
Also applies to: 146-148
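The extraction pattern described above, sketched with an assumed minimal model; the real APIResponse carries many more fields, so this is illustrative only.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field


class MiniAPIResponse(BaseModel):
    """Illustrative subset of APIResponse: only the streaming metrics."""
    time_to_first_token: Optional[float] = Field(default=None, ge=0)
    streaming_duration: Optional[float] = Field(default=None, ge=0)
    tokens_per_second: Optional[float] = Field(default=None, ge=0)

    @classmethod
    def from_raw_response(cls, raw_data: dict[str, Any]) -> "MiniAPIResponse":
        # .get() without a default returns None for missing keys, matching the optional fields
        return cls(
            time_to_first_token=raw_data.get("time_to_first_token"),
            streaming_duration=raw_data.get("streaming_duration"),
            tokens_per_second=raw_data.get("tokens_per_second"),
        )


print(MiniAPIResponse.from_raw_response({"time_to_first_token": 0.2}))
```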
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
88-117: LGTM! Well-structured refactoring with clear separation of concerns.
The refactored implementation:
- Uses context-based state management for cleaner code
- Separates parsing, event processing, and validation into distinct phases
- Captures TTFT at the correct point (first content event)
- Includes comprehensive documentation of performance metrics
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
120-147: Use custom exceptions from `core.system.exceptions` for error handling.
Per the coding guidelines and previous review feedback, replace `ValueError` with custom exceptions:
- Line 125: Use `APIError` for streaming API errors
- Lines 145, 147: Use `DataValidationError` for validation failures
🔎 Proposed fix
```diff
+from lightspeed_evaluation.core.system.exceptions import APIError, DataValidationError
+
 def _process_event(ctx: StreamingContext, event: str, event_data: dict) -> None:
     """Process a single streaming event and update context."""
     if event == "error" and "token" in event_data:
         error_message = event_data["token"]
         logger.error("Received error event from streaming API: %s", error_message)
-        raise ValueError(f"Streaming API error: {error_message}")
+        raise APIError(f"Streaming API error: {error_message}")

 # ... rest unchanged ...

 def _validate_response(ctx: StreamingContext) -> None:
     """Validate that required response fields are present."""
     if not ctx.final_response:
-        raise ValueError("No final response found in streaming output")
+        raise DataValidationError("No final response found in streaming output")
     if not ctx.conversation_id:
-        raise ValueError("No Conversation ID found")
+        raise DataValidationError("No Conversation ID found")
```
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/api/streaming_parser.py (1)
41-66: `StreamingContext` is a public class; add a Google-style docstring for the `to_response_dict` method.
Per coding guidelines, public APIs should use Google-style docstrings. The `to_response_dict` method lacks a return type description.
🔎 Suggested docstring enhancement
```diff
 def to_response_dict(self) -> dict[str, Any]:
-    """Convert context to response dictionary."""
+    """Convert context to response dictionary.
+
+    Returns:
+        dict[str, Any]: Dictionary containing response data and streaming
+        performance metrics (time_to_first_token, streaming_duration,
+        tokens_per_second).
+    """
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- config/system.yaml
- src/lightspeed_evaluation/core/api/streaming_parser.py
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/api.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/mixins.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
🚧 Files skipped from review as they are similar to previous changes (6)
- src/lightspeed_evaluation/core/constants.py
- config/system.yaml
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- tests/unit/core/api/test_streaming_parser.py
- tests/unit/core/models/test_api_additional.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
📓 Path-based instructions (1)
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Provide type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels in lightspeed_evaluation modules
Files:
src/lightspeed_evaluation/core/models/mixins.py
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/api.py
src/lightspeed_evaluation/core/models/data.py
🧠 Learnings (6)
📚 Learning: 2025-09-08T11:23:50.164Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/api/streaming_parser.py:21-27
Timestamp: 2025-09-08T11:23:50.164Z
Learning: In src/lightspeed_evaluation/core/api/streaming_parser.py, the streaming response uses a fixed structure where `line.replace(DATA_PREFIX, "")` is sufficient for extracting JSON data, as the API format is controlled and won't contain "data: " within the JSON payload itself.
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
src/lightspeed_evaluation/core/models/api.py
📚 Learning: 2025-12-11T10:05:06.422Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-11T10:05:06.422Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions for error handling
Applied to files:
src/lightspeed_evaluation/core/api/streaming_parser.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/api.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
src/lightspeed_evaluation/core/models/data.py (1)
src/lightspeed_evaluation/core/models/mixins.py (1)
StreamingMetricsMixin(8-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.13)
🔇 Additional comments (9)
src/lightspeed_evaluation/core/api/streaming_parser.py (3)
17-38: Well-designed performance tracker with proper encapsulation.
The `_PerformanceTracker` dataclass cleanly encapsulates TTFT capture logic with idempotent `capture_ttft()` and correctly uses `time.perf_counter()` for high-resolution timing. The separation of concerns between timing capture and metrics calculation is good.
69-85: Tokens-per-second calculation correctly excludes TTFT from throughput.
The logic properly calculates generation throughput by subtracting TTFT from total duration, and handles edge cases (zero/negative tokens, missing TTFT, zero/negative generation time) by returning `None`.
88-117: Refactored parsing flow is cleaner and more maintainable.
The event-driven approach with `StreamingContext`, `_process_event`, and `_validate_response` improves readability and testability. The TTFT capture on `CONTENT_EVENTS` is correctly placed.
src/lightspeed_evaluation/core/models/__init__.py (1)
15-15: LGTM!
The `StreamingMetricsMixin` import and export follow the established patterns. The categorization under the `# Mixins` comment is appropriate.
Also applies to: 47-48
src/lightspeed_evaluation/core/models/mixins.py (1)
1-33: Clean mixin design with proper validation constraints.
The `StreamingMetricsMixin` follows Pydantic best practices:
- Uses `Optional[float]` with `default=None` for streaming-only metrics
- Applies `ge=0` validation to prevent negative values
- Includes descriptive field documentation
This provides a reusable component for `APIResponse`, `TurnData`, and `EvaluationResult`.
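A condensed sketch of the mixin pattern being praised here; the three field names match the PR, while the inheriting classes below are trimmed stand-ins for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field


class StreamingMetricsMixin(BaseModel):
    """Shared streaming metric fields; ge=0 rejects negatives, None means 'not streaming'."""
    time_to_first_token: Optional[float] = Field(default=None, ge=0)
    streaming_duration: Optional[float] = Field(default=None, ge=0)
    tokens_per_second: Optional[float] = Field(default=None, ge=0)


class TurnDataSketch(StreamingMetricsMixin):
    """Stand-in for TurnData: inherits the metric fields plus its own data."""
    response: str = ""


class EvaluationResultSketch(StreamingMetricsMixin):
    """Stand-in for EvaluationResult."""
    result: str = "PASS"


print(TurnDataSketch(response="hi", time_to_first_token=0.3))
```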
src/lightspeed_evaluation/core/models/api.py (2)
7-8: Inheritance from `StreamingMetricsMixin` is appropriate.
Since `StreamingMetricsMixin` extends `BaseModel`, `APIResponse` correctly inherits Pydantic functionality along with the streaming metric fields.
Also applies to: 82-82
119-134: Streaming metrics extraction and propagation looks correct.
The `from_raw_response` method properly extracts the three streaming metrics using `.get()` (defaulting to `None`), which aligns with the mixin's field defaults. The metrics are passed to the constructor where Pydantic's `ge=0` validation will apply.
src/lightspeed_evaluation/core/models/data.py (2)
10-10: `TurnData` correctly inherits streaming metrics via mixin.
The inheritance change from `BaseModel` to `StreamingMetricsMixin` is appropriate. Since the mixin extends `BaseModel`, all existing Pydantic functionality is preserved while adding the three streaming metric fields with proper validation.
Also applies to: 36-36
384-384: `EvaluationResult` correctly inherits streaming metrics via mixin.
The mixin inheritance provides `time_to_first_token`, `streaming_duration`, and `tokens_per_second` fields with `ge=0` validation, addressing the previous review feedback about validation constraints.
Description
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Data / Schema
Tests