---
oncalls:
 - odai_jarvis
llms-gk: devmate_executorch_cadence_md
apply_to_regex: ".*"
---

# Cadence AOT (Ahead-of-Time) Compilation Domain

## Overview

The Cadence AOT domain provides ahead-of-time compilation for Cadence/Xtensa DSP backends within ExecuTorch. It transforms PyTorch models into optimized graphs suitable for execution on Cadence DSP architectures, including the HiFi Audio and Fusion G3 processors. The compilation pipeline covers quantization, graph transformations, memory planning, and backend-specific optimizations.

## Directory Structure

```
aot/
├── TARGETS                    # Build configuration with oncall info
├── __init__.py                # Package initialization
├── compiler.py                # Main compilation entry points
├── compiler_funcs.py          # Helper compilation functions
├── compiler_utils.py          # Compilation utilities
├── graph_builder.py           # Test graph construction utilities
├── program_builder.py         # Program creation for tests
├── passes.py                  # Pass management and orchestration
├── pass_utils.py              # Pass utilities and registration
├── quantizer/                 # Quantization infrastructure
│   ├── quantizer.py           # Cadence-specific quantizers
│   ├── fusion_pass.py         # Quantization fusion patterns
│   ├── patterns.py            # Quantization patterns
│   └── utils.py               # Quantization utilities
├── memory_planning.py         # Memory allocation algorithms
├── memory_planning_algo.py    # Memory planning implementations
├── memory_constraints.py      # Memory placement constraints
├── functions*.yaml            # Function definitions per DSP variant
├── ops_registrations.py       # Operator registrations
├── ref_implementations.py     # Reference operator implementations
├── fuse_ops.py                # Operation fusion passes
├── remove_ops.py              # Operation removal passes
├── replace_ops.py             # Operation replacement passes
├── reorder_ops.py             # Operation reordering passes
├── simplify_ops.py            # Operation simplification passes
├── decompose_ops.py           # Operation decomposition passes
├── type_dispatch.py           # Type-based dispatch optimization
├── utils.py                   # General utilities
└── tests/                     # Comprehensive test suite
```

## Core Components

### Compiler (`compiler.py`)
Main compilation orchestrator providing:
- **Model Tracing**: Export PyTorch models to an `ExportedProgram` using `trace()`
- **Quantization Pipeline**: Full PT2 quantization via `quantize_pt2()`, including prepare/convert/fuse
- **Edge Lowering**: Transform to the EXIR edge dialect via `export_to_edge()`
- **Backend Optimization**: Apply Cadence-specific passes via `export_to_cadence()`
- **ExecuTorch Generation**: Final `.pte` file creation with memory planning

**Key APIs:**
- `quantize_and_export_to_executorch()`: End-to-end compilation (see the sketch below)
- `export_to_cadence()`: Edge lowering with optimizations
- `trace()`: Model export with selective op preservation
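
A minimal usage sketch of the end-to-end entry point follows. The import path and the `(model, example_inputs)` signature are assumptions inferred from this document, so check `compiler.py` for the current API.

```python
import torch

# Sketch only: the import path and the (model, example_inputs) signature are
# assumptions inferred from this document, not verified against compiler.py.
from executorch.backends.cadence.aot.compiler import (
    quantize_and_export_to_executorch,
)


class SmallModel(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = SmallModel().eval()
example_inputs = (torch.randn(1, 16),)

# Quantize, lower through the Cadence pass pipeline, and serialize to a .pte
# payload (the .buffer attribute is the standard ExecuTorch program buffer).
exec_prog = quantize_and_export_to_executorch(model, example_inputs)
with open("small_model.pte", "wb") as f:
    f.write(exec_prog.buffer)
```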

### Pass Management (`passes.py`)
Orchestrates the graph transformation pipeline:
- **Pass Registration**: Decorator-based pass registration with opt levels (sketched after this list)
- **Pipeline Phases**: Initialize → Remove → Fusion → Replace → Simplify → Finalize
- **Opt Level Filtering**: Conditional pass execution based on the optimization level
- **ATen vs. EXIR**: Separate pass systems for the two IR levels

**Pass Categories:**
- `InitializePipeline`: Dead code elimination, spec propagation
- `CadenceRemoveNops`: Remove no-op operations
- `CadenceFuseOpsInGraph`: Operation fusion for efficiency
- `CadenceReplaceOpsInGraph`: Backend-specific replacements
- `FinalizePipeline`: Final cleanup and validation
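
The snippet below is a hypothetical illustration of what decorator-based registration with opt levels can look like; the names `register_cadence_pass`, `PassAttribute`, and `PASS_REGISTRY` are illustrative and should be checked against `pass_utils.py` and `passes.py`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Type

from executorch.exir.pass_base import ExportPass, PassResult


@dataclass(frozen=True)
class PassAttribute:
    opt_level: int  # pass runs only when the compile opt level is at least this


# Illustrative registry; the real one lives in pass_utils.py.
PASS_REGISTRY: Dict[Type[ExportPass], PassAttribute] = {}


def register_cadence_pass(attr: PassAttribute) -> Callable:
    def wrapper(cls: Type[ExportPass]) -> Type[ExportPass]:
        PASS_REGISTRY[cls] = attr
        return cls

    return wrapper


@register_cadence_pass(PassAttribute(opt_level=1))
class RemoveNopExamplePass(ExportPass):
    def call(self, graph_module) -> PassResult:
        # A real pass would rewrite graph_module.graph here; this stub
        # reports "no change" so downstream re-verification can be skipped.
        return PassResult(graph_module, False)


def passes_for_opt_level(opt_level: int) -> List[Type[ExportPass]]:
    """Select the registered passes enabled at the requested opt level."""
    return [cls for cls, attr in PASS_REGISTRY.items() if attr.opt_level <= opt_level]
```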

### Memory Planning (`memory_planning.py`)
Advanced memory allocation for constrained DSP environments:
- **Multi-Memory Hierarchy**: Support for different memory types (DRAM, SRAM, etc.)
- **Greedy Algorithms**: `PositionBasedGreedyWithHierarchy`, `GreedyWithHeuristic`
- **Lifetime Analysis**: Tensor lifetime tracking for optimal placement
- **Constraint System**: Memory placement rules and restrictions
- **Peak Usage Analysis**: Memory consumption profiling and optimization

**Algorithms:**
- Position-based greedy: places each tensor in the fastest memory that can hold it, walking the hierarchy in order
- Heuristic-based: gap-filling strategy from arXiv:2001.03288
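
To make the idea concrete, here is a toy, self-contained illustration of lifetime-aware greedy placement across a fastest-first hierarchy. It is not the planner in `memory_planning.py`: the real algorithms also compute byte offsets and honor alignment and the placement rules from `memory_constraints.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TensorSpec:
    name: str
    size: int        # bytes
    lifetime: range  # [first_use, last_use) in topological node order


@dataclass
class MemoryBank:
    name: str
    capacity: int  # bytes; banks are listed fastest-first
    placed: List[TensorSpec] = field(default_factory=list)

    def fits(self, t: TensorSpec) -> bool:
        # Only tensors whose lifetimes overlap with t compete for space.
        overlap = sum(
            p.size
            for p in self.placed
            if p.lifetime.stop > t.lifetime.start and t.lifetime.stop > p.lifetime.start
        )
        return overlap + t.size <= self.capacity


def place(tensors: List[TensorSpec], banks: List[MemoryBank]) -> None:
    """Greedy placement: try the fastest bank first, fall back to slower ones."""
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        bank: Optional[MemoryBank] = next((b for b in banks if b.fits(t)), None)
        if bank is None:
            raise MemoryError(f"no bank can hold {t.name}")
        bank.placed.append(t)


# Example: the second activation overflows the fast bank and spills to SRAM.
banks = [MemoryBank("dtcm", 64 * 1024), MemoryBank("sram", 512 * 1024)]
tensors = [
    TensorSpec("act0", 32 * 1024, range(0, 3)),
    TensorSpec("act1", 48 * 1024, range(2, 5)),
]
place(tensors, banks)
```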

### Graph Builder (`graph_builder.py`)
Test infrastructure for graph construction:
- **Programmatic Graph Creation**: Build test graphs without PyTorch tracing
- **FakeTensor Support**: Efficient shape/dtype propagation
- **Single-Op Testing**: Isolated operator validation
- **Custom Graph Patterns**: Complex multi-op scenarios
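
A short sketch of building a single-op test graph is below; the builder method names (`placeholder`, `call_operator`, `output`, `get_graph_module`) follow the usual builder style but are assumptions that should be checked against `graph_builder.py`.

```python
import torch
from executorch.backends.cadence.aot.graph_builder import GraphBuilder
from executorch.exir.dialects._ops import ops as exir_ops

# Single-op test graph sketch; method names are assumptions (see lead-in).
builder = GraphBuilder()
x = builder.placeholder("x", torch.randn(2, 8))
y = builder.call_operator(op=exir_ops.edge.aten.relu.default, args=(x,))
builder.output([y])
graph_module = builder.get_graph_module()
```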

## System Flows

### Quantization Pipeline
```
PyTorch Model
 ↓ trace() - Export with selective ops
ExportedProgram
 ↓ prepare_pt2() - Insert observers
Prepared Model
 ↓ calibration - Run calibration data
Calibrated Model
 ↓ convert_pt2() - Insert fake quantization
Converted Model
 ↓ fuse_pt2() - Fuse dq→op→q patterns
Quantized Model
```
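
In code, the stages above might be driven roughly as follows. This is a sketch only: the argument lists are assumptions taken from the stage names, and `CadenceDefaultQuantizer` plus the calibration data are placeholders for whatever quantizer and dataset a real flow uses.

```python
import torch

# Staged sketch of the quantization pipeline; signatures are assumptions.
from executorch.backends.cadence.aot.compiler import (
    convert_pt2,
    fuse_pt2,
    prepare_pt2,
    trace,
)
from executorch.backends.cadence.aot.quantizer.quantizer import (
    CadenceDefaultQuantizer,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)
calibration_batches = [(torch.randn(1, 16),) for _ in range(8)]  # placeholder data

quantizer = CadenceDefaultQuantizer()
exported = trace(model, example_inputs)       # export with selective ops
prepared = prepare_pt2(exported, quantizer)   # insert observers
for batch in calibration_batches:             # run calibration data
    prepared(*batch)
converted = convert_pt2(prepared)             # insert fake quantization
quantized = fuse_pt2(converted, quantizer)    # fuse dq→op→q patterns
```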

### Compilation Pipeline
```
Quantized Model
 ↓ apply_torch_ops_passes() - ATen-level transforms
Torch-Optimized Model
 ↓ to_edge() - Lower to EXIR edge dialect
Edge Program
 ↓ apply_exir_ops_passes() - Cadence optimizations
Cadence Program
 ↓ memory_planning - Allocate tensors
Planned Program
 ↓ to_executorch() - Serialize to .pte
ExecuTorch Program
```
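
A sketch of driving the lowering half, continuing from the `quantized` model and `example_inputs` of the previous sketch. It assumes `export_to_cadence()` returns an edge-program manager whose `to_executorch()` runs memory planning and serialization and whose result exposes `.buffer`; verify this against `compiler.py`.

```python
# Lowering-half sketch; the return types noted above are assumptions.
from executorch.backends.cadence.aot.compiler import export_to_cadence

cadence_prog = export_to_cadence(quantized, example_inputs)  # Cadence EXIR passes applied
exec_prog = cadence_prog.to_executorch()                     # memory planning + serialization
with open("model.pte", "wb") as f:
    f.write(exec_prog.buffer)
```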

### Pass Execution
```
Graph Input
 ↓ InitializePipeline - Setup and spec propagation
 ↓ RemoveRedundantOps - Clean unnecessary operations
 ↓ CadenceReorderOps - Reorder for efficiency
 ↓ CadenceRemoveNops - Remove no-ops
 ↓ CadenceFuseOps - Fuse operations
 ↓ CadenceReplaceOps - Backend replacements
 ↓ CadenceSimplifyOps - Simplify expressions
 ↓ FinalizePipeline - Final cleanup
Optimized Graph
```
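
Conceptually, the phases form an ordered pass list applied to the edge graph. The driver below is illustrative only; the real orchestration in `passes.py` additionally filters by opt level and keeps ATen-level and EXIR-level pass lists separate.

```python
from typing import Iterable, Type

import torch.fx
from executorch.exir.pass_base import ExportPass


def run_pipeline(
    graph_module: torch.fx.GraphModule,
    pass_classes: Iterable[Type[ExportPass]],
) -> torch.fx.GraphModule:
    """Apply each pass in order, keeping the (possibly rewritten) module."""
    for pass_cls in pass_classes:
        result = pass_cls()(graph_module)  # ExportPass.__call__ returns a PassResult
        if result is not None:
            graph_module = result.graph_module
    return graph_module
```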

## Testing Strategy

Comprehensive test coverage in the `tests/` directory:
- **Unit Tests**: Individual pass validation (`test_*_passes.py`)
- **Integration Tests**: End-to-end compilation flows
- **Memory Tests**: Memory planning algorithm validation
- **Graph Builder Tests**: Programmatic graph construction
- **Reference Tests**: Operator implementation validation

**Test Patterns:**
- Parameterized tests for multiple DSP variants
- Graph comparison utilities for before/after validation
- Tolerance-based numerical validation
- Mock backends for isolation testing
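
For example, tolerance-based validation typically runs the same inputs through the module before and after a rewrite and compares the outputs. The test below is a generic, self-contained sketch (the "transformed" module is just an FX-traced copy standing in for the output of a Cadence pass), not a copy of the existing tests.

```python
import unittest

import torch


class Reference(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + 1.0)


class TestRewriteNumerics(unittest.TestCase):
    """Generic sketch of tolerance-based before/after validation."""

    def test_rewrite_preserves_outputs(self) -> None:
        ref = Reference()
        # Stand-in for "graph after a pass": an FX-traced copy of the module.
        transformed = torch.fx.symbolic_trace(ref)
        inputs = (torch.randn(4, 16),)
        torch.testing.assert_close(
            transformed(*inputs), ref(*inputs), rtol=1e-3, atol=1e-3
        )


if __name__ == "__main__":
    unittest.main()
```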

## Dependencies

**Core ExecuTorch:**
- `executorch.exir`: Export IR and compilation pipeline
- `executorch.runtime`: Runtime integration and memory management
- `executorch.kernels`: Portable and quantized operators

**External:**
- `torch`: PyTorch 2.x for model export and quantization
- `torchao`: Advanced quantization algorithms
- `tabulate`: Memory planning visualization

**Cadence-Specific:**
- Hardware-specific operator implementations in `../hifi/` and `../fusion_g3/`
- Runtime components in `../runtime/`

## DSP Variant Support

The domain supports multiple Cadence DSP architectures:
- **HiFi Audio**: Audio processing optimizations (`functions_hifi.yaml`)
- **Fusion G3**: General compute optimizations (`functions_fusion_g3.yaml`)
- **Vision**: Computer vision workloads (`functions_vision.yaml`)
- **Generic**: Base functionality (`functions.yaml`)

Each variant has specialized operator implementations and optimization passes tuned to the specific DSP's capabilities and memory hierarchy.

## Key Optimizations

**Graph-Level:**
- Operation fusion to reduce memory traffic
- Dead code elimination and redundant operation removal
- Type-based dispatch for optimal kernel selection
- Tensor layout transformations for DSP efficiency

**Memory-Level:**
- Multi-tier memory hierarchy utilization
- Lifetime-based tensor placement
- Memory usage minimization for constrained environments
- IDMA (Internal DMA) operation optimization

**Quantization:**
- INT8/INT16 quantization for DSP efficiency
- Custom quantization patterns for Cadence operators
- Calibration-based quantization parameter optimization
- Post-training quantization (PTQ) focus