Skip to content

Conversation

@GrachevArtem
Copy link

No description provided.

@mdbenito mdbenito requested a review from Copilot November 19, 2025 10:55
Copilot finished reviewing on behalf of mdbenito November 19, 2025 10:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements incremental training support for TMC Shapley valuation using models with partial_fit() capability. The implementation allows for significant performance improvements (2-10x faster) when using compatible models like SGDClassifier, by updating models incrementally as data points are added sequentially, rather than retraining from scratch for each subset.

  • Introduces PartialFitModelUtility class that extends ModelUtility with incremental training capabilities
  • Adds PermutationEvaluationStrategyWithPartialFit to manage partial_fit state across permutations
  • Implements automatic detection and strategy selection for utilities supporting partial_fit

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/pydvl/valuation/utility/modelutility.py Adds PartialFitModelUtility class with incremental training logic and state management
src/pydvl/valuation/samplers/permutation.py Adds PermutationEvaluationStrategyWithPartialFit strategy and automatic detection in make_strategy()
test_partial_fit_simple.py Provides unit tests for partial_fit functionality including detection, incremental training, and strategy selection
examples/partial_fit_tmc_example.py Demonstrates usage comparison between standard and partial_fit utilities with TMC Shapley
PARTIAL_FIT_FEATURE.md Comprehensive documentation covering implementation details, usage, and performance considerations
QUICK_START_PARTIAL_FIT.md Quick start guide showing simple migration from ModelUtility to PartialFitModelUtility
IMPLEMENTATION_SUMMARY.md Summary of implementation changes and testing instructions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +497 to +499
if hasattr(self._current_model, "classes_") or not hasattr(
self._current_model, "partial_fit"
):
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition not hasattr(self._current_model, 'partial_fit') is redundant since _can_use_partial_fit() already checks self._supports_partial_fit, and we only reach this code when can_use_partial is True. This check will never be False in this context. Consider simplifying to just check for classes_ attribute.

Suggested change
if hasattr(self._current_model, "classes_") or not hasattr(
self._current_model, "partial_fit"
):
if hasattr(self._current_model, "classes_"):

Copilot uses AI. Check for mistakes.
if self.training_data is None:
raise ValueError("No training data provided")

new_indices = set(sample.subset) - self._current_indices
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting to a sorted list and then to an array may produce non-deterministic ordering if the training is run in parallel with different workers processing the same permutations. While the set difference is correct, the sorted order depends on the hash values of integers which are deterministic. However, consider if this ordering matters for reproducibility across different Python versions or implementations.

Suggested change
new_indices = set(sample.subset) - self._current_indices
new_indices = set(sample.subset) - self._current_indices
# Sort indices to ensure deterministic ordering across runs and Python versions

Copilot uses AI. Check for mistakes.

self.truncation.reset(self.utility)
truncated = False
curr = prev = float(self.utility(None))
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is duplicated from the parent class PermutationEvaluationStrategy.process() at line 295. The entire process method is nearly identical except for the added reset_partial_fit_state() call. Consider refactoring to call the parent's process method with a hook for resetting state, or extract the common logic to reduce code duplication.

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +77
assert utility._current_model is None
assert len(utility._current_indices) == 0
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertions test private attributes (_current_model and _current_indices), which couples the tests to implementation details. Consider testing the observable behavior instead, such as verifying that calling the utility with a new sample after reset starts fresh training rather than using partial_fit.

Copilot uses AI. Check for mistakes.
show_warnings=False,
progress=False,
)
valuation_standard.fit(train_data)
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The example doesn't demonstrate timing comparisons which would better showcase the performance benefits mentioned in the PR description. Consider adding timing measurements to show the actual speedup achieved with partial_fit.

Copilot uses AI. Check for mistakes.
X, y = make_classification(n_samples=30, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

train_data = Dataset.from_arrays(X_train, y_train)
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable train_data is not used.

Suggested change
train_data = Dataset.from_arrays(X_train, y_train)

Copilot uses AI. Check for mistakes.
@GrachevArtem
Copy link
Author

Hi
Sorry we supposed to do this PR originaly in our private fork and then put it here. It is not ready yet and I forgot to delete.
Are you interested in something like this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant