
Commit cf225fe

Zonglin Peng authored and meta-codesync[bot] committed
create backend cadence domain context
Differential Revision: D84163932
1 parent 7e0c7d0 commit cf225fe

File tree

8 files changed: +962 −0 lines changed
Lines changed: 202 additions & 0 deletions
---
oncalls:
- odai_jarvis
llms-gk: devmate_executorch_cadence_md
apply_to_regex: ".*"
---

# Cadence AOT (Ahead-of-Time) Compilation Domain

## Overview

The Cadence AOT domain provides ahead-of-time compilation for Cadence/Xtensa DSP backends within ExecuTorch. It transforms PyTorch models into optimized graphs suitable for execution on Cadence DSP architectures, including HiFi Audio and Fusion G3 processors. The compilation pipeline covers quantization, graph transformations, memory planning, and backend-specific optimizations.

## Directory Structure

```
aot/
├── TARGETS                  # Build configuration with oncall info
├── __init__.py              # Package initialization
├── compiler.py              # Main compilation entry points
├── compiler_funcs.py        # Helper compilation functions
├── compiler_utils.py        # Compilation utilities
├── graph_builder.py         # Test graph construction utilities
├── program_builder.py       # Program creation for tests
├── passes.py                # Pass management and orchestration
├── pass_utils.py            # Pass utilities and registration
├── quantizer/               # Quantization infrastructure
│   ├── quantizer.py         # Cadence-specific quantizers
│   ├── fusion_pass.py       # Quantization fusion patterns
│   ├── patterns.py          # Quantization patterns
│   └── utils.py             # Quantization utilities
├── memory_planning.py       # Memory allocation algorithms
├── memory_planning_algo.py  # Memory planning implementations
├── memory_constraints.py    # Memory placement constraints
├── functions*.yaml          # Function definitions per DSP variant
├── ops_registrations.py     # Operator registrations
├── ref_implementations.py   # Reference operator implementations
├── fuse_ops.py              # Operation fusion passes
├── remove_ops.py            # Operation removal passes
├── replace_ops.py           # Operation replacement passes
├── reorder_ops.py           # Operation reordering passes
├── simplify_ops.py          # Operation simplification passes
├── decompose_ops.py         # Operation decomposition passes
├── type_dispatch.py         # Type-based dispatch optimization
├── utils.py                 # General utilities
└── tests/                   # Comprehensive test suite
```

## Core Components

### Compiler (`compiler.py`)
Main compilation orchestrator providing:
- **Model Tracing**: Export PyTorch models to ExportedProgram using `trace()`
- **Quantization Pipeline**: Full PT2 quantization via `quantize_pt2()`, including prepare/convert/fuse
- **Edge Lowering**: Transform to the EXIR edge dialect via `export_to_edge()`
- **Backend Optimization**: Apply Cadence-specific passes via `export_to_cadence()`
- **ExecuTorch Generation**: Final `.pte` file creation with memory planning

**Key APIs:**
- `quantize_and_export_to_executorch()`: End-to-end compilation (a usage sketch follows)
- `export_to_cadence()`: Edge lowering with optimizations
- `trace()`: Model export with selective op preservation
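A minimal usage sketch, assuming `quantize_and_export_to_executorch()` takes a module plus example inputs and returns an ExecuTorch program with a serialized buffer; the exact signature may differ:

```python
import torch

# Named above as the end-to-end entry point; treat the call shape as an assumption.
from executorch.backends.cadence.aot.compiler import quantize_and_export_to_executorch


class SmallModel(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = SmallModel().eval()
example_inputs = (torch.randn(1, 16),)

# Trace, quantize, lower to the edge dialect, run Cadence passes, and memory-plan.
exec_prog = quantize_and_export_to_executorch(model, example_inputs)

# Serialize the planned program to a .pte file for the DSP runtime.
with open("small_model.pte", "wb") as f:
    f.write(exec_prog.buffer)
```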
### Pass Management (`passes.py`)
Orchestrates the graph transformation pipeline:
- **Pass Registration**: Decorator-based pass registration with opt levels (sketched below)
- **Pipeline Phases**: Initialize → Remove → Fusion → Replace → Simplify → Finalize
- **Opt Level Filtering**: Conditional pass execution based on optimization level
- **ATEN vs EXIR**: Separate pass systems for the two IR levels

**Pass Categories:**
- `InitializePipeline`: Dead code elimination, spec propagation
- `CadenceRemoveNops`: Remove no-op operations
- `CadenceFuseOpsInGraph`: Operation fusion for efficiency
- `CadenceReplaceOpsInGraph`: Backend-specific replacements
- `FinalizePipeline`: Final cleanup and validation
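A minimal sketch of what decorator-based registration with opt-level filtering can look like; the registry, decorator, and pass here are illustrative stand-ins, not the actual `passes.py` API:

```python
from executorch.exir.pass_base import ExportPass, PassResult

# Illustrative registry: pass class -> minimum opt level that enables it.
PASS_REGISTRY: dict[type, int] = {}


def register_cadence_pass(opt_level: int):
    """Hypothetical decorator that records a pass with its opt level."""
    def wrapper(cls: type) -> type:
        PASS_REGISTRY[cls] = opt_level
        return cls
    return wrapper


@register_cadence_pass(opt_level=1)
class RemoveNoopCloneSketchPass(ExportPass):
    """Toy example: erase clone nodes, which are no-ops for inference."""

    def call(self, graph_module) -> PassResult:
        modified = False
        for node in list(graph_module.graph.nodes):
            if node.op == "call_function" and "clone" in str(node.target):
                node.replace_all_uses_with(node.args[0])
                graph_module.graph.erase_node(node)
                modified = True
        graph_module.recompile()
        return PassResult(graph_module, modified)


def passes_for(opt_level: int) -> list[type]:
    """Opt-level filtering: only run passes enabled at or below this level."""
    return [cls for cls, lvl in PASS_REGISTRY.items() if lvl <= opt_level]
```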

### Memory Planning (`memory_planning.py`)
Advanced memory allocation for constrained DSP environments:
- **Multi-Memory Hierarchy**: Support for different memory types (DRAM, SRAM, etc.)
- **Greedy Algorithms**: `PositionBasedGreedyWithHierarchy`, `GreedyWithHeuristic`
- **Lifetime Analysis**: Tensor lifetime tracking for optimal placement
- **Constraint System**: Memory placement rules and restrictions
- **Peak Usage Analysis**: Memory consumption profiling and optimization

**Algorithms:**
- Position-based greedy: Fills fast memory first, in hierarchy order (see the toy sketch below)
- Heuristic-based: Gap-filling strategy from the arXiv paper 2001.03288
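A toy illustration of lifetime-based greedy placement (not the real `PositionBasedGreedyWithHierarchy` implementation): tensors whose lifetimes overlap must not share addresses, while non-overlapping tensors may reuse the same offset.

```python
from dataclasses import dataclass


@dataclass
class TensorSpec:
    name: str
    size: int   # bytes
    start: int  # index of the first node using the tensor
    end: int    # index of the last node using the tensor


def greedy_place(specs: list[TensorSpec]) -> dict[str, tuple[int, int]]:
    """Place largest tensors first; bump each past live, conflicting tensors."""
    placements: dict[str, tuple[int, int]] = {}
    by_name = {s.name: s for s in specs}
    for spec in sorted(specs, key=lambda s: s.size, reverse=True):
        offset = 0
        for other, (o_off, o_size) in sorted(placements.items(), key=lambda kv: kv[1][0]):
            o = by_name[other]
            lifetimes_overlap = not (spec.end < o.start or o.end < spec.start)
            addresses_overlap = offset < o_off + o_size and o_off < offset + spec.size
            if lifetimes_overlap and addresses_overlap:
                offset = o_off + o_size  # slide past the conflicting allocation
        placements[spec.name] = (offset, spec.size)
    return placements


# "a" dies at node 2 and "c" starts at node 3, so "c" reuses offset 0;
# only "b" overlaps "a" in time and gets pushed to offset 64.
print(greedy_place([
    TensorSpec("a", 64, 0, 2),
    TensorSpec("b", 32, 1, 3),
    TensorSpec("c", 64, 3, 4),
]))
```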
### Graph Builder (`graph_builder.py`)
Test infrastructure for graph construction:
- **Programmatic Graph Creation**: Build test graphs without PyTorch tracing (see the sketch below)
- **FakeTensor Support**: Efficient shape/dtype propagation
- **Single Op Testing**: Isolated operator validation
- **Custom Graph Patterns**: Complex multi-op scenarios
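A short sketch of programmatic construction; the `GraphBuilder` method names here (`placeholder`, `call_operator`, `output`, `get_graph_module`) follow the description above but should be treated as assumptions:

```python
import torch

from executorch.backends.cadence.aot.graph_builder import GraphBuilder
from executorch.exir.dialects._ops import ops as exir_ops

builder = GraphBuilder()
# Inputs are backed by FakeTensors, so shapes/dtypes propagate without real data.
x = builder.placeholder("x", torch.randn(2, 8))
y = builder.call_operator(op=exir_ops.edge.aten.add.Tensor, args=(x, x))
builder.output([y])

graph_module = builder.get_graph_module()  # ready for single-op pass testing
```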

## System Flows

### Quantization Pipeline
```
PyTorch Model
  ↓ trace() - Export with selective ops
ExportedProgram
  ↓ prepare_pt2() - Insert observers
Prepared Model
  ↓ calibration - Run calibration data
Calibrated Model
  ↓ convert_pt2() - Insert fake quantization
Converted Model
  ↓ fuse_pt2() - Fuse dq→op→q patterns
Quantized Model
```
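The same flow as code, a sketch assuming the `prepare_pt2()` / `convert_pt2()` / `fuse_pt2()` wrappers accept a quantizer alongside the model (exact signatures may differ):

```python
import torch

from executorch.backends.cadence.aot.compiler import (
    convert_pt2,
    fuse_pt2,
    prepare_pt2,
    trace,
)
from executorch.backends.cadence.aot.quantizer.quantizer import CadenceDefaultQuantizer

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 8),)
calibration_batches = [(torch.randn(1, 8),) for _ in range(4)]

quantizer = CadenceDefaultQuantizer()
exported = trace(model, example_inputs)          # ExportedProgram
prepared = prepare_pt2(exported, quantizer)      # observers inserted
for batch in calibration_batches:                # run calibration data
    prepared(*batch)
converted = convert_pt2(prepared)                # fake quantization inserted
quantized = fuse_pt2(converted, quantizer)       # dq→op→q patterns fused
```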

### Compilation Pipeline
```
Quantized Model
  ↓ apply_torch_ops_passes() - ATEN-level transforms
Torch-Optimized Model
  ↓ to_edge() - Lower to EXIR edge dialect
Edge Program
  ↓ apply_exir_ops_passes() - Cadence optimizations
Cadence Program
  ↓ memory_planning - Allocate tensors
Planned Program
  ↓ to_executorch() - Serialize to .pte
ExecuTorch Program
```

### Pass Execution
```
Graph Input
  ↓ InitializePipeline - Setup and spec propagation
  ↓ RemoveRedundantOps - Clean unnecessary operations
  ↓ CadenceReorderOps - Reorder for efficiency
  ↓ CadenceRemoveNops - Remove no-ops
  ↓ CadenceFuseOps - Fuse operations
  ↓ CadenceReplaceOps - Backend replacements
  ↓ CadenceSimplifyOps - Simplify expressions
  ↓ FinalizePipeline - Final cleanup
Optimized Graph
```

## Testing Strategy

Comprehensive test coverage in the `tests/` directory:
- **Unit Tests**: Individual pass validation (`test_*_passes.py`)
- **Integration Tests**: End-to-end compilation flows
- **Memory Tests**: Memory planning algorithm validation
- **Graph Builder Tests**: Programmatic graph construction
- **Reference Tests**: Operator implementation validation

**Test Patterns:**
- Parameterized tests for multiple DSP variants (a hypothetical shape is sketched below)
- Graph comparison utilities for before/after validation
- Tolerance-based numerical validation
- Mock backends for isolation testing
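A hypothetical shape for such a test; the variant list and the lowering step are stand-ins, not real test code:

```python
import unittest

import torch
from parameterized import parameterized


class TestAddKernel(unittest.TestCase):
    @parameterized.expand([("hifi",), ("fusion_g3",)])
    def test_add_matches_eager(self, variant: str) -> None:
        x, y = torch.randn(4), torch.randn(4)
        expected = x + y
        # Stand-in: a real test would lower and run the graph for `variant`.
        actual = x + y
        # Tolerance-based numerical validation, as described above.
        torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-3)
```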

## Dependencies

**Core ExecuTorch:**
- `executorch.exir`: Export IR and compilation pipeline
- `executorch.runtime`: Runtime integration and memory management
- `executorch.kernels`: Portable and quantized operators

**External:**
- `torch`: PyTorch 2.x for model export and quantization
- `torchao`: Advanced quantization algorithms
- `tabulate`: Memory planning visualization

**Cadence-Specific:**
- Hardware-specific operator implementations in `../hifi/`, `../fusion_g3/`
- Runtime components in `../runtime/`

## DSP Variant Support

The domain supports multiple Cadence DSP architectures:
- **HiFi Audio**: Audio processing optimizations (`functions_hifi.yaml`)
- **Fusion G3**: General compute optimizations (`functions_fusion_g3.yaml`)
- **Vision**: Computer vision workloads (`functions_vision.yaml`)
- **Generic**: Base functionality (`functions.yaml`)

Each variant has specialized operator implementations and optimization passes tuned to that DSP's capabilities and memory hierarchy.
## Key Optimizations

**Graph-Level:**
- Operation fusion to reduce memory traffic
- Dead code elimination and redundant-operation removal
- Type-based dispatch for optimal kernel selection
- Tensor layout transformations for DSP efficiency

**Memory-Level:**
- Multi-tier memory hierarchy utilization
- Lifetime-based tensor placement
- Memory usage minimization for constrained environments
- IDMA (Internal DMA) operation optimization

**Quantization:**
- INT8/INT16 quantization for DSP efficiency
- Custom quantization patterns for Cadence operators
- Calibration-based quantization parameter optimization
- Focus on post-training quantization (PTQ)
Lines changed: 65 additions & 0 deletions
---
oncalls:
- odai_jarvis
llms-gk: devmate_executorch_cadence_md
apply_to_regex: ".*"
---

# Cadence Quantizer Domain

## Overview
PT2E quantization infrastructure for Cadence DSPs. Provides pattern-based graph annotation, fusion passes, and quantization configs for HiFi/Fusion G3/Vision.
## Directory Structure
```
aot/quantizer/
├── quantizer.py    # Quantizer classes & configs
├── patterns.py     # Pattern definitions
├── fusion_pass.py  # Graph fusion
└── utils.py        # Quantization utilities
```
## Components

### Quantizers (`quantizer.py`)
- `CadenceAtenQuantizer` - Base quantizer
- `CadenceDefaultQuantizer` - Standard ops
- `CadenceWithLayerNormQuantizer` - Layer norm
- `CadenceWakeWordQuantizer` - Wake word (add/cat)
- `CadenceFusedConvReluQuantizer` - Fused conv+relu
- `CadenceWithSoftmaxQuantizer` - A16 softmax
### Patterns (`patterns.py`)
Linear, MatMul, BMM, Conv1d/2d, ConvRelu, ReLU, Add, Cat, LayerNorm, Softmax

### Fusion (`fusion_pass.py`)
Transforms dequant→op→quant sequences into fused quantized ops. Maps ATen ops to Cadence ops.

### Utils (`utils.py`)
- `quantize_tensor_multiplier()` - Fixed-point conversion (illustrated below)
- `get_bias_qparams()` - Bias quantization
- `find_sequential_partitions_aten()` - Pattern matching
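For illustration, a gemmlowp-style version of the fixed-point conversion that `quantize_tensor_multiplier()` is described as performing; the real helper's signature and conventions may differ:

```python
import math


def quantize_multiplier(real_multiplier: float) -> tuple[int, int]:
    """Decompose M in (0, 1) as (M0 / 2**31) * 2**(-shift), M0 in [2**30, 2**31)."""
    assert 0.0 < real_multiplier < 1.0
    significand, exponent = math.frexp(real_multiplier)  # M = significand * 2**exponent
    quantized = round(significand * (1 << 31))           # Q31 fixed point
    if quantized == (1 << 31):                           # rounding pushed us to 2**31
        quantized //= 2
        exponent += 1
    return quantized, -exponent


# E.g. a requantization scale input_scale * weight_scale / output_scale:
m0, shift = quantize_multiplier(0.0072)  # m0 ≈ 1.98e9 (Q31), shift = 7
```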
## Configs
- `qconfig_A8W8` - int8 asymmetric
- `qconfig_A8W8sym` - int8 symmetric weight
- `qconfig_A16` - int16 activation
## Workflow
```
Model → prepare_pt2e → calibrate → convert_pt2e → QuantFusion → DSP binary
```
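A sketch of this workflow in code; the `prepare_pt2e`/`convert_pt2e` import path varies across torch/torchao versions, so treat it as an assumption:

```python
import torch

from executorch.backends.cadence.aot.quantizer.quantizer import CadenceDefaultQuantizer
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 8),)

exported = torch.export.export(model, example_inputs).module()
prepared = prepare_pt2e(exported, CadenceDefaultQuantizer())  # annotate + observe
prepared(*example_inputs)                                     # calibrate
converted = convert_pt2e(prepared)                            # insert q/dq ops
# QuantFusion (fusion_pass.py) then rewrites dq→op→q into fused Cadence ops.
```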
## Testing
```bash
buck2 test //executorch/backends/cadence/aot/quantizer/...
```
## Dependencies
- `torch.fx`, `torchao.quantization.pt2e`
- `executorch.exir.pass_base`
## Adding Patterns
1. Define the pattern in `patterns.py`
2. Add its fusion in `fusion_pass.py`
3. Register it in `quantizer.py` (a hypothetical sketch of step 1 follows)
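A hypothetical sketch of a new pattern; the class shape, method names, and the fused target are assumptions for illustration, not the actual `patterns.py` API:

```python
import torch


class MulPattern:
    """Illustrative pattern: which ATen op to match and what to fuse it into."""

    def partition_types(self) -> list:
        # Ops the partitioner should group into one quantizable partition.
        return [torch.ops.aten.mul.Tensor]

    def replacement_op(self):
        # Hypothetical fused target; requires Cadence op registrations to resolve.
        return torch.ops.cadence.quantized_mul
```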
Lines changed: 1 addition & 0 deletions
# This file ensures the .llms/rules directory exists in version control
