-
Notifications
You must be signed in to change notification settings - Fork 9
vibcoded partial fit #679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
vibcoded partial fit #679
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements incremental training support for TMC Shapley valuation using models with partial_fit() capability. The implementation allows for significant performance improvements (2-10x faster) when using compatible models like SGDClassifier, by updating models incrementally as data points are added sequentially, rather than retraining from scratch for each subset.
- Introduces
PartialFitModelUtilityclass that extendsModelUtilitywith incremental training capabilities - Adds
PermutationEvaluationStrategyWithPartialFitto manage partial_fit state across permutations - Implements automatic detection and strategy selection for utilities supporting partial_fit
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/pydvl/valuation/utility/modelutility.py | Adds PartialFitModelUtility class with incremental training logic and state management |
| src/pydvl/valuation/samplers/permutation.py | Adds PermutationEvaluationStrategyWithPartialFit strategy and automatic detection in make_strategy() |
| test_partial_fit_simple.py | Provides unit tests for partial_fit functionality including detection, incremental training, and strategy selection |
| examples/partial_fit_tmc_example.py | Demonstrates usage comparison between standard and partial_fit utilities with TMC Shapley |
| PARTIAL_FIT_FEATURE.md | Comprehensive documentation covering implementation details, usage, and performance considerations |
| QUICK_START_PARTIAL_FIT.md | Quick start guide showing simple migration from ModelUtility to PartialFitModelUtility |
| IMPLEMENTATION_SUMMARY.md | Summary of implementation changes and testing instructions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if hasattr(self._current_model, "classes_") or not hasattr( | ||
| self._current_model, "partial_fit" | ||
| ): |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition not hasattr(self._current_model, 'partial_fit') is redundant since _can_use_partial_fit() already checks self._supports_partial_fit, and we only reach this code when can_use_partial is True. This check will never be False in this context. Consider simplifying to just check for classes_ attribute.
| if hasattr(self._current_model, "classes_") or not hasattr( | |
| self._current_model, "partial_fit" | |
| ): | |
| if hasattr(self._current_model, "classes_"): |
| if self.training_data is None: | ||
| raise ValueError("No training data provided") | ||
|
|
||
| new_indices = set(sample.subset) - self._current_indices |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting to a sorted list and then to an array may produce non-deterministic ordering if the training is run in parallel with different workers processing the same permutations. While the set difference is correct, the sorted order depends on the hash values of integers which are deterministic. However, consider if this ordering matters for reproducibility across different Python versions or implementations.
| new_indices = set(sample.subset) - self._current_indices | |
| new_indices = set(sample.subset) - self._current_indices | |
| # Sort indices to ensure deterministic ordering across runs and Python versions |
|
|
||
| self.truncation.reset(self.utility) | ||
| truncated = False | ||
| curr = prev = float(self.utility(None)) |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is duplicated from the parent class PermutationEvaluationStrategy.process() at line 295. The entire process method is nearly identical except for the added reset_partial_fit_state() call. Consider refactoring to call the parent's process method with a hook for resetting state, or extract the common logic to reduce code duplication.
| assert utility._current_model is None | ||
| assert len(utility._current_indices) == 0 |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These assertions test private attributes (_current_model and _current_indices), which couples the tests to implementation details. Consider testing the observable behavior instead, such as verifying that calling the utility with a new sample after reset starts fresh training rather than using partial_fit.
| show_warnings=False, | ||
| progress=False, | ||
| ) | ||
| valuation_standard.fit(train_data) |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The example doesn't demonstrate timing comparisons which would better showcase the performance benefits mentioned in the PR description. Consider adding timing measurements to show the actual speedup achieved with partial_fit.
| X, y = make_classification(n_samples=30, n_features=5, random_state=42) | ||
| X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) | ||
|
|
||
| train_data = Dataset.from_arrays(X_train, y_train) |
Copilot
AI
Nov 19, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable train_data is not used.
| train_data = Dataset.from_arrays(X_train, y_train) |
|
Hi |
No description provided.