
[Change] Optimize bundle save performance for large codebases (500+ features) #91

@djm81

Description


Why

Version 0.23.0 addresses multiple performance and usability issues discovered when importing large codebases (e.g., SQLAlchemy with 2000+ features):

  1. Silent I/O Operations: Many long-running I/O operations had no progress indicators, causing 30-60 second gaps with no user feedback
  2. Sequential Bottlenecks: Critical operations like model_dump() and file hash computation were executed sequentially, creating slowdowns at 500+ features
  3. Data Loss Risk: Interrupted imports could lose progress if features weren't saved early
  4. Stale Feature Data: Incremental imports didn't re-validate existing features, so they could miss correctness problems and improvements to the analysis logic
  5. Missing Progress Visibility: Users had no visibility into progress during enhanced analysis, contract extraction, and enrichment phases

This release ensures consistent performance, better user feedback, and data integrity for large codebase imports.

What Changes

Progress Reporting Enhancements

  • Enhanced Analysis Setup: Added spinner progress for file discovery (repo.rglob("*.py")), filtering, and hash collection phases (see the sketch after this list)
    • Eliminates 30-60 second silent wait periods during file discovery
    • Shows real-time status: "Preparing enhanced analysis..." → "Discovering Python files..." → "Filtering X files..." → "Ready to analyze X files"
  • Contract Loading: Added progress bar for parallel YAML contract loading
    • Shows "Loading X existing contract(s)..." with completion count
  • Enrichment Context Operations: Added spinner progress for hash comparison, context building, and file writing
  • File Hash Computation: Added progress bar for file hash computation phase within source tracking (eliminates 30+ second silent gap)
  • Plan Enrichment: Added progress bars for plan enrichment and edge case story addition phases
  • Contract Validation: Added progress bars for contract validation with Specmatic
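
A minimal sketch of the file-discovery progress, assuming the CLI uses rich for spinners; the library choice and function shape are assumptions, while the status messages and the repo.rglob("*.py") call come from the item above:

```python
from pathlib import Path

from rich.progress import Progress, SpinnerColumn, TextColumn


def discover_python_files(repo: Path, excluded_parts: set[str]) -> list[Path]:
    """File discovery and filtering with spinner feedback instead of a silent wait."""
    with Progress(SpinnerColumn(), TextColumn("{task.description}")) as progress:
        task = progress.add_task("Preparing enhanced analysis...", total=None)

        progress.update(task, description="Discovering Python files...")
        candidates = list(repo.rglob("*.py"))  # previously a 30-60 second silent gap

        progress.update(task, description=f"Filtering {len(candidates)} files...")
        files = [p for p in candidates if not excluded_parts.intersection(p.parts)]

        progress.update(task, description=f"Ready to analyze {len(files)} files")
        return files
```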

Performance Optimizations

  • Bundle Save Performance: Moved feature.model_dump() serialization from sequential to parallel execution (first sketch after this list)
    • Pass Feature objects directly to parallel tasks instead of pre-dumped dicts
    • Call model_dump() inside save_artifact() function (executes in parallel across 8 worker threads)
    • Eliminates slowdown at 500+ features during save operations
  • Source Linking Performance: Optimized the link_to_specs method (second sketch below) with:
    • Pre-computed AST parsing results and file hashes before parallel processing
    • Inverted index for O(1) file stem lookups
    • Set operations for O(1) membership checks
    • Single set.union() operation for collecting candidate stems
  • File Hash-based Caching: Added caching to avoid recalculating unchanged artifacts (third sketch below):
    • RelationshipMapper: Caches file hashes, AST parsing, and analysis results
    • GraphAnalyzer: Caches file hashes, imports, and module names
    • Contract extraction: Only regenerates if source files changed
    • Enrichment context: Compares hash to avoid unnecessary writes
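
A sketch of the bundle-save change described above, assuming Feature is a Pydantic v2 model; the Feature fields, output layout, and exact save_artifact signature are illustrative, while the 8-thread pool and the move of model_dump() into the worker come from the items above:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from pydantic import BaseModel


class Feature(BaseModel):
    # Stand-in for the real Feature model; only here to make the sketch runnable.
    key: str
    title: str


def save_artifact(feature: Feature, out_dir: Path) -> Path:
    """Serialization happens inside the worker, so model_dump() runs in parallel."""
    data = feature.model_dump()  # moved here from the sequential caller
    path = out_dir / f"{feature.key}.json"
    path.write_text(json.dumps(data, indent=2))
    return path


def save_bundle(features: list[Feature], out_dir: Path) -> list[Path]:
    # Feature objects are passed directly to the pool instead of pre-dumped dicts.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda feature: save_artifact(feature, out_dir), features))
```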
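
An illustrative sketch of the inverted-index and set-operation ideas behind the link_to_specs optimization; all names below are hypothetical, only the techniques (stem index, O(1) membership checks, a single set.union()) come from the items above:

```python
from collections import defaultdict
from pathlib import Path


def build_stem_index(files: list[Path]) -> dict[str, set[Path]]:
    """Inverted index: file stem -> paths, so lookups are O(1) instead of a scan."""
    index: defaultdict[str, set[Path]] = defaultdict(set)
    for path in files:
        index[path.stem].add(path)
    return dict(index)


def candidate_files(stem_groups: list[set[str]], index: dict[str, set[Path]]) -> set[Path]:
    # One set.union() call collects every candidate stem for the feature.
    stems: set[str] = set().union(*stem_groups)
    # Dict lookups and set membership checks are O(1) per stem.
    return {path for stem in stems if stem in index for path in index[stem]}
```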
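
A sketch of the hash-based caching pattern shared by RelationshipMapper, GraphAnalyzer, and the contract/enrichment steps: cache results keyed by file content hash and recompute only when the hash changes. The class below is a simplified stand-in, not the real API:

```python
import hashlib
from pathlib import Path
from typing import Any


class HashCachedAnalyzer:
    """Skips re-analysis of files whose content hash has not changed."""

    def __init__(self) -> None:
        self._hashes: dict[Path, str] = {}
        self._results: dict[Path, Any] = {}

    @staticmethod
    def _file_hash(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def analyze(self, path: Path) -> Any:
        digest = self._file_hash(path)
        if self._hashes.get(path) == digest:
            return self._results[path]  # unchanged file: reuse the cached result
        result = self._run_analysis(path)  # expensive AST parse / graph analysis
        self._hashes[path] = digest
        self._results[path] = result
        return result

    def _run_analysis(self, path: Path) -> Any:
        raise NotImplementedError("concrete analyzers implement the real work")
```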

Feature Validation

  • New --revalidate-features Flag: Forces re-analysis of features even if file hashes haven't changed
  • Automatic Validation: Validates existing features when restarting an import, checking for orphaned files, missing files, and structure issues (a sketch follows this list)
  • Validation Reporting: Shows warnings only for actual problems (orphaned or missing files), reducing noise
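
A hedged sketch of the automatic validation pass; the warning categories (orphaned files, missing files) come from the items above, while the data shapes and function signature are assumptions:

```python
from pathlib import Path


def validate_feature(feature_dir: Path, linked_sources: list[Path]) -> list[str]:
    """Return warnings only for actual problems, keeping the output low-noise."""
    warnings: list[str] = []
    if not feature_dir.exists() or not any(feature_dir.iterdir()):
        warnings.append(f"orphaned feature: {feature_dir} has no artifact files")
    for source in linked_sources:
        if not source.exists():
            warnings.append(f"missing source file: {source}")
    return warnings
```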

Early Save Checkpoint

  • Checkpoint After Analysis: Saves extracted features immediately after _analyze_codebase completes (sketched below)
  • Prevents Data Loss: If import is interrupted during subsequent phases, features are already persisted
  • Resume Support: Allows resuming interrupted imports without losing progress
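
A sketch of the checkpoint-and-resume ordering, assuming features can be serialized to JSON; _analyze_codebase is named in the item above, everything else here is illustrative:

```python
import json
from pathlib import Path


def analyze_with_checkpoint(repo: Path, checkpoint: Path) -> list[dict]:
    """Persist features right after analysis so later phases cannot lose them."""
    if checkpoint.exists():
        # Resuming an interrupted import: the features were already saved.
        return json.loads(checkpoint.read_text())

    features = _analyze_codebase(repo)           # expensive extraction phase
    checkpoint.write_text(json.dumps(features))  # early save checkpoint
    return features


def _analyze_codebase(repo: Path) -> list[dict]:
    # Placeholder for the real analysis: one record per discovered Python file.
    return [{"source": str(path)} for path in repo.rglob("*.py")]
```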

Incremental Import Improvements

  • Fixed Logic: Only triggers full feature re-analysis if source files actually changed or --revalidate-features is used (see the sketch after this list)
  • Optimized Bundle Loading: Avoids loading bundle twice when checking for source file changes
  • Better Change Detection: Improved incremental change detection with progress feedback
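
An illustrative sketch of the fixed incremental gate: full re-analysis runs only when a tracked source file changed (or disappeared) or when --revalidate-features is passed. The data shape is an assumption:

```python
import hashlib
from pathlib import Path


def _file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reanalysis(tracked: dict[Path, str], revalidate_features: bool) -> bool:
    """tracked maps each source file to the hash recorded in the bundle."""
    if revalidate_features:
        return True  # --revalidate-features forces re-analysis
    return any(
        not path.exists() or _file_hash(path) != stored_hash
        for path, stored_hash in tracked.items()
    )
```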

Documentation Updates

  • New Guide: Created docs/guides/import-features.md with comprehensive documentation
  • Updated Commands Reference: Added --revalidate-features flag documentation
  • Updated Examples: Added examples for re-validation and resuming interrupted imports
  • Updated README: Added timing information, checkpoint details, and performance notes

Files Modified:

  • src/specfact_cli/commands/import_cmd.py: Progress reporting, validation, checkpointing, caching
  • src/specfact_cli/models/project.py: Bundle save performance optimization
  • src/specfact_cli/utils/source_scanner.py: Linking performance optimizations
  • src/specfact_cli/analyzers/relationship_mapper.py: Caching and progress callbacks
  • src/specfact_cli/analyzers/graph_analyzer.py: Caching and progress callbacks
  • docs/: Comprehensive documentation updates
  • CHANGELOG.md: Version 0.23.0 entry

Acceptance Criteria

  • All progress reporting implemented for I/O operations
  • Bundle save performance optimized (parallel model_dump())
  • Source linking performance optimized (pre-computation, inverted index, set operations)
  • File hash-based caching implemented for enhanced analysis
  • Feature validation with --revalidate-features flag
  • Early save checkpoint implemented
  • Incremental import logic fixed
  • Documentation updated
  • All tests pass
  • Type checking passes

Dependencies

  • None (standalone release)

Related Issues/PRs

  • Part of ongoing performance improvements for large codebase handling
  • Builds on previous progress reporting work

Additional Context

This release focuses on improving the user experience and performance when working with large codebases. The optimizations follow consistent patterns:

  1. Move expensive operations to parallel execution (bundle save, source linking)
  2. Add progress indicators for all long-running operations (I/O, analysis, validation)
  3. Use caching to avoid redundant work (file hashes, AST parsing, analysis results)
  4. Provide data integrity features (checkpointing, validation)

Performance Impact:

  • Bundle save: Eliminates sequential bottleneck, consistent performance at all sizes
  • Source linking: Reduced from O(n*m) to O(n*k + n*c) complexity
  • Enhanced analysis: Caching prevents recalculation of unchanged files
  • Overall: Consistent performance regardless of bundle size (tested with 2000+ features)

User Experience Impact:

  • No more silent 30-60 second gaps during import
  • Clear progress feedback for all operations
  • Ability to resume interrupted imports
  • Validation helps identify and fix issues early

Metadata

Labels: enhancement (New feature or request)
Status: Done
Milestone: none