Labels
enhancement (New feature or request)
Description
Why
Version 0.23.0 addresses multiple performance and usability issues discovered when importing large codebases (e.g., SQLAlchemy with 2000+ features):
- Silent I/O Operations: Many long-running I/O operations had no progress indicators, causing 30-60 second gaps with no user feedback
- Sequential Bottlenecks: Critical operations like `model_dump()` and file hash computation were executed sequentially, creating slowdowns at 500+ features
- Data Loss Risk: Interrupted imports could lose progress if features weren't saved early
- Stale Feature Data: Incremental imports didn't validate existing features for correctness or improvements in analysis logic
- Missing Progress Visibility: Users had no visibility into progress during enhanced analysis, contract extraction, and enrichment phases
This release ensures consistent performance, better user feedback, and data integrity for large codebase imports.
What Changes
Progress Reporting Enhancements
- Enhanced Analysis Setup: Added spinner progress for file discovery (`repo.rglob("*.py")`), filtering, and hash collection phases
  - Eliminates 30-60 second silent wait periods during file discovery
  - Shows real-time status: "Preparing enhanced analysis..." → "Discovering Python files..." → "Filtering X files..." → "Ready to analyze X files"
- Contract Loading: Added progress bar for parallel YAML contract loading
  - Shows "Loading X existing contract(s)..." with completion count
- Enrichment Context Operations: Added spinner progress for hash comparison, context building, and file writing
- File Hash Computation: Added progress bar for file hash computation phase within source tracking (eliminates 30+ second silent gap)
- Plan Enrichment: Added progress bars for plan enrichment and edge case story addition phases
- Contract Validation: Added progress bars for contract validation with Specmatic
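The phase messages for enhanced analysis setup can be sketched with a plain callback. This is a minimal illustration, not the SpecFact implementation: `discover_python_files`, the `report` callback, and the `skip_dirs` filter are hypothetical names, and a real CLI would route these messages to a spinner instead of a list or `print`.

```python
from pathlib import Path
from typing import Callable, Iterable


def discover_python_files(
    repo: Path,
    report: Callable[[str], None],
    skip_dirs: Iterable[str] = (".git", ".venv", "__pycache__"),
) -> list[Path]:
    """Find Python files under repo, reporting each phase as it starts."""
    skip = set(skip_dirs)
    report("Preparing enhanced analysis...")

    report("Discovering Python files...")
    candidates = list(repo.rglob("*.py"))  # the slow, previously silent step

    report(f"Filtering {len(candidates)} files...")
    kept = [
        path
        for path in candidates
        # Drop any file that has a skipped directory somewhere in its path.
        if not skip.intersection(path.relative_to(repo).parts)
    ]

    report(f"Ready to analyze {len(kept)} files")
    return sorted(kept)
```

Calling `discover_python_files(Path("."), print)` prints one line per phase, so the user sees activity before the slow `rglob` call returns.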
Performance Optimizations
- Bundle Save Performance: Moved `feature.model_dump()` serialization from sequential to parallel execution
  - Pass Feature objects directly to parallel tasks instead of pre-dumped dicts
  - Call `model_dump()` inside the `save_artifact()` function (executes in parallel across 8 worker threads)
  - Eliminates slowdown at 500+ features during save operations
- Source Linking Performance: Optimized `link_to_specs` method with:
  - Pre-computed AST parsing results and file hashes before parallel processing
  - Inverted index for O(1) file stem lookups
  - Set operations for O(1) membership checks
  - Single `set.union()` operation for collecting candidate stems
- File Hash-based Caching: Added caching to avoid recalculating unchanged artifacts:
  - `RelationshipMapper`: Caches file hashes, AST parsing, and analysis results
  - `GraphAnalyzer`: Caches file hashes, imports, and module names
  - Contract extraction: Only regenerates if source files changed
  - Enrichment context: Compares hash to avoid unnecessary writes
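The bundle save change can be sketched as follows. This is an illustration under assumptions, not the code in `project.py`: the `Feature` dataclass is a stand-in for the real model, `save_features` is a hypothetical wrapper, and JSON is used only to keep the example stdlib-only. The point it shows is that the pool receives Feature objects and `model_dump()` runs inside each worker.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Feature:
    """Stand-in for the real model; only model_dump() matters here."""

    key: str
    title: str
    stories: list[str] = field(default_factory=list)

    def model_dump(self) -> dict:
        return asdict(self)


def save_artifact(feature: Feature, out_dir: Path) -> Path:
    """Serialize and write one feature; runs inside a worker thread."""
    path = out_dir / f"{feature.key}.json"
    payload = feature.model_dump()  # serialization happens in the worker
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def save_features(
    features: list[Feature], out_dir: Path, workers: int = 8
) -> list[Path]:
    """Hand Feature objects (not pre-dumped dicts) to the thread pool."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: save_artifact(f, out_dir), features))
```

How much a thread pool helps with the serialization step itself depends on the workload; the file writes are the part that most clearly overlaps across threads.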
Feature Validation
- New `--revalidate-features` Flag: Forces re-analysis of features even if file hashes haven't changed
- Automatic Validation: Validates existing features when restarting import (checks for orphaned files, missing files, structure issues)
- Validation Reporting: Shows warnings only for actual problems (orphaned or missing files), reduces noise
Early Save Checkpoint
- Checkpoint After Analysis: Saves extracted features immediately after `_analyze_codebase` completes
- Prevents Data Loss: If import is interrupted during subsequent phases, features are already persisted
- Resume Support: Allows resuming interrupted imports without losing progress
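One way to picture the checkpoint is an atomic write right after analysis, plus a loader that a restarted import can call. This is a sketch under assumptions: `save_checkpoint`, `load_checkpoint`, the JSON format, and the checkpoint path are all hypothetical and not taken from the SpecFact code.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(features: list[dict], checkpoint: Path) -> None:
    """Write features atomically so an interrupt never leaves a partial file."""
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(features, handle)
        # Rename over the target; readers see the old or new file, never half.
        os.replace(tmp_name, checkpoint)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(checkpoint: Path) -> Optional[list[dict]]:
    """Return previously saved features, or None if no checkpoint exists."""
    if not checkpoint.exists():
        return None
    return json.loads(checkpoint.read_text(encoding="utf-8"))
```

On restart, a non-`None` result from `load_checkpoint` means the analysis phase can be skipped and the later phases resumed from the saved features.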
Incremental Import Improvements
- Fixed Logic: Only triggers full feature re-analysis if source files actually changed or `--revalidate-features` is used
- Optimized Bundle Loading: Avoids loading bundle twice when checking for source file changes
- Better Change Detection: Improved incremental change detection with progress feedback
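The re-analysis decision described above can be expressed as a small pure function over stored and current file hashes. The names `changed_files` and `needs_reanalysis` are hypothetical, and the mapping of path to hash string is an assumed representation; only the rule itself (re-analyze when files changed or when the flag is set) comes from the description.

```python
def changed_files(stored: dict[str, str], current: dict[str, str]) -> set[str]:
    """Return paths that were added, removed, or whose hash differs."""
    modified = {
        path
        for path in stored.keys() & current.keys()
        if stored[path] != current[path]
    }
    added_or_removed = stored.keys() ^ current.keys()
    return modified | added_or_removed


def needs_reanalysis(
    stored: dict[str, str], current: dict[str, str], revalidate: bool = False
) -> bool:
    """Re-analyze only when sources changed or revalidation was requested."""
    return revalidate or bool(changed_files(stored, current))
```

Here `revalidate` plays the role of `--revalidate-features`: it forces re-analysis even when the two hash maps are identical.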
Documentation Updates
- New Guide: Created `docs/guides/import-features.md` with comprehensive documentation
- Updated Commands Reference: Added `--revalidate-features` flag documentation
- Updated Examples: Added examples for re-validation and resuming interrupted imports
- Updated README: Added timing information, checkpoint details, and performance notes
Files Modified:
- `src/specfact_cli/commands/import_cmd.py`: Progress reporting, validation, checkpointing, caching
- `src/specfact_cli/models/project.py`: Bundle save performance optimization
- `src/specfact_cli/utils/source_scanner.py`: Linking performance optimizations
- `src/specfact_cli/analyzers/relationship_mapper.py`: Caching and progress callbacks
- `src/specfact_cli/analyzers/graph_analyzer.py`: Caching and progress callbacks
- `docs/`: Comprehensive documentation updates
- `CHANGELOG.md`: Version 0.23.0 entry
Acceptance Criteria
- All progress reporting implemented for I/O operations
- Bundle save performance optimized (parallel `model_dump()`)
- Source linking performance optimized (pre-computation, inverted index, set operations)
- File hash-based caching implemented for enhanced analysis
- Feature validation with `--revalidate-features` flag
- Early save checkpoint implemented
- Incremental import logic fixed
- Documentation updated
- All tests pass
- Type checking passes
Dependencies
- None (standalone release)
Related Issues/PRs
- Part of ongoing performance improvements for large codebase handling
- Builds on previous progress reporting work
Additional Context
This release focuses on improving the user experience and performance when working with large codebases. The optimizations follow consistent patterns:
- Move expensive operations to parallel execution (bundle save, source linking)
- Add progress indicators for all long-running operations (I/O, analysis, validation)
- Use caching to avoid redundant work (file hashes, AST parsing, analysis results)
- Provide data integrity features (checkpointing, validation)
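The caching pattern in the list above amounts to keying each per-file result by a content hash and recomputing only on a mismatch. The sketch below is an assumption-laden illustration: `HashKeyedCache` is a hypothetical class, SHA-256 is an assumed hash choice, and import extraction stands in for whatever analysis the real analyzers cache.

```python
import ast
import hashlib
from pathlib import Path


class HashKeyedCache:
    """Cache per-file import lists, invalidated when file content changes."""

    def __init__(self) -> None:
        self._entries: dict[Path, tuple[str, list[str]]] = {}
        self.misses = 0  # counts how often the file was actually re-parsed

    def imports_for(self, path: Path) -> list[str]:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        cached = self._entries.get(path)
        if cached is not None and cached[0] == digest:
            return cached[1]  # unchanged file: skip parsing entirely

        self.misses += 1
        tree = ast.parse(path.read_text(encoding="utf-8"))
        names: set[str] = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module)
        result = sorted(names)
        self._entries[path] = (digest, result)
        return result
```

Hashing still reads the file on every call, but it avoids the more expensive parse and analysis when nothing changed.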
Performance Impact:
- Bundle save: Eliminates sequential bottleneck, consistent performance at all sizes
- Source linking: Reduced from O(n*m) to O(n*k + n*c) complexity
- Enhanced analysis: Caching prevents recalculation of unchanged files
- Overall: Consistent performance regardless of bundle size (tested with 2000+ features)
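The source linking improvement rests on replacing a scan over every file per feature with dictionary lookups. The following is a rough sketch of that idea only: `build_stem_index` and `link_feature` are hypothetical helpers, and matching features to files by lowercase file stem is an assumption about how `link_to_specs` selects candidates.

```python
from collections import defaultdict
from pathlib import Path


def build_stem_index(files: list[Path]) -> dict[str, list[Path]]:
    """Inverted index: file stem -> all files with that stem."""
    index: dict[str, list[Path]] = defaultdict(list)
    for path in files:
        index[path.stem.lower()].append(path)
    return dict(index)


def link_feature(
    keyword_groups: list[set[str]], index: dict[str, list[Path]]
) -> list[Path]:
    """Resolve a feature's candidate stems to files via the index."""
    # Collect all candidate stems with a single union call.
    candidates: set[str] = set().union(*keyword_groups)
    # Set intersection keeps only stems that exist; each lookup is O(1).
    matched = index.keys() & candidates
    return sorted(path for stem in matched for path in index[stem])
```

The index is built once up front, so each feature pays only for its own candidate stems and matches instead of a pass over the whole file list.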
User Experience Impact:
- No more silent 30-60 second gaps during import
- Clear progress feedback for all operations
- Ability to resume interrupted imports
- Validation helps identify and fix issues early
Status
Done