[Docs] Add list of indexing autotuning docs (#1027)
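
The docs in this change describe three accepted forms for `indexing`: a single strategy, a list with one entry per load, and empty/omitted. A minimal sketch of how those forms could expand to one strategy per load is below. It is illustrative only: `resolve_indexing`, `VALID_STRATEGIES`, and `num_loads` are hypothetical names and not part of Helion, and raising an error on a length mismatch is an assumption that the docs do not state.

```python
from typing import List, Optional, Sequence, Union

# Strategy names taken from the docs in this change.
VALID_STRATEGIES = ("pointer", "tensor_descriptor", "block_ptr")


def resolve_indexing(
    indexing: Optional[Union[str, Sequence[str]]], num_loads: int
) -> List[str]:
    """Hypothetical helper: expand an `indexing` value to one strategy per load."""
    if indexing is None or (not isinstance(indexing, str) and len(indexing) == 0):
        # Empty/omitted: every load defaults to pointer indexing.
        return ["pointer"] * num_loads
    if isinstance(indexing, str):
        # Single strategy: applies to all loads.
        strategies = [indexing] * num_loads
    else:
        # List form: one entry per load, in the order the loads appear.
        strategies = list(indexing)
        if len(strategies) != num_loads:
            # Assumption: a length mismatch is treated as an error in this sketch.
            raise ValueError(
                f"expected {num_loads} indexing entries, got {len(strategies)}"
            )
    for strategy in strategies:
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"unknown indexing strategy: {strategy!r}")
    return strategies


print(resolve_indexing("block_ptr", 2))
# ['block_ptr', 'block_ptr']
print(resolve_indexing(["pointer", "block_ptr", "tensor_descriptor"], 3))
# ['pointer', 'block_ptr', 'tensor_descriptor']
print(resolve_indexing(None, 2))
# ['pointer', 'pointer']
```

The list form maps positionally, which matches the "First load / Second load / Third load" comments in the `kernel_mixed_indexing` example added to `docs/api/config.md`.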

oulgen · web-flow · commit 0efcf06e7d79 · 2025-10-26T09:50:28.000-07:00
diff --git a/README.md b/README.md
@@ -35,6 +35,7 @@ portable between different hardware. Helion automates and autotunes over:
 
    * Automatically calculates strides and indices.
    * Autotunes choices among various indexing methods (pointers, block pointers, TensorDescriptors).
+   * Supports per-load indexing strategies for fine-grained memory access control.
 
 2. **Masking:**
 
@@ -257,10 +258,14 @@ Reorders the program IDs (PIDs) of the generated kernel for improved L2
 cache behavior. A value of `1` disables this optimization, while higher
 values specify the grouping size.
 
-* **indexing** (`"pointer"`, `"tensor_descriptor"` or `"block_ptr"`):
-Specifies the type of indexing code to generate. The `"tensor_descriptor"`
-option uses Tensor Memory Accelerators (TMAs) but requires a Hopper or
-newer GPU and the latest development version of Triton.
+* **indexing** (`"pointer"`, `"tensor_descriptor"`, `"block_ptr"`, or a list of these):
+Specifies the memory indexing strategy for load operations. Can be:
+  - A single strategy (applies to all loads): `indexing="block_ptr"`
+  - A list of strategies (one per load operation, in the order the loads appear): `indexing=["pointer", "block_ptr", "tensor_descriptor"]`
+  - Empty/omitted (defaults to `"pointer"` for all loads)
+
+  The `"tensor_descriptor"` option uses the Tensor Memory Accelerator (TMA) but
+  requires a Hopper or newer GPU and the latest development version of Triton.
 
 * **pid\_type** (`"flat"`, `"xyz"`, `"persistent_blocked"`, or `"persistent_interleaved"`):
   Specifies the program ID mapping strategy. `"flat"` uses only the x-dimension,
diff --git a/docs/api/config.md b/docs/api/config.md
@@ -109,10 +109,30 @@ Configs are typically discovered automatically through autotuning, but can also
 
 .. autoattribute:: Config.indexing
 
-   Memory indexing strategy:
+   Memory indexing strategy for load operations. Can be specified as:
 
-   - ``"pointer"``: Pointer-based indexing
-   - ``"tensor_descriptor"``: Tensor descriptor indexing
+   **Single strategy (applies to all loads - backward compatible):**
+
+   .. code-block:: python
+
+      indexing="block_ptr"  # All loads use block pointers
+
+   **Per-load strategies (list, one entry per load operation, in order):**
+
+   .. code-block:: python
+
+      indexing=["pointer", "block_ptr", "tensor_descriptor"]
+
+   **Empty/omitted (defaults to** ``"pointer"`` **for all loads):**
+
+   .. code-block:: python
+
+      # indexing not specified - all loads use pointer indexing
+
+   **Valid strategies:**
+
+   - ``"pointer"``: Pointer-based indexing (default)
+   - ``"tensor_descriptor"``: Tensor descriptor indexing (requires a Hopper or newer GPU)
    - ``"block_ptr"``: Block pointer indexing
 ```
 
@@ -185,6 +205,42 @@ def kernel_with_eviction(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
 # hl.load(x, [tile], eviction_policy="evict_first")
 ```
 
+### Per-Load Indexing Example
+
+```python
+import torch
+import helion
+import helion.language as hl
+
+# Single indexing strategy for all loads (backward compatible)
+@helion.kernel(config={"indexing": "block_ptr"})
+def kernel_uniform_indexing(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
+    out = torch.empty_like(x)
+    for tile in hl.tile(x.size(0)):
+        a = hl.load(x, [tile])  # Uses block_ptr
+        b = hl.load(y, [tile])  # Uses block_ptr
+        out[tile] = a + b
+    return out
+
+# Per-load indexing strategies for fine-grained control
+@helion.kernel(
+    config={
+        "block_size": 16,
+        "indexing": ["pointer", "block_ptr", "tensor_descriptor"],
+    }
+)
+def kernel_mixed_indexing(
+    x: torch.Tensor, y: torch.Tensor, z: torch.Tensor
+) -> torch.Tensor:
+    out = torch.empty_like(x)
+    for tile in hl.tile(x.size(0)):
+        a = hl.load(x, [tile])  # First load: pointer indexing
+        b = hl.load(y, [tile])  # Second load: block_ptr indexing
+        c = hl.load(z, [tile])  # Third load: tensor_descriptor indexing
+        out[tile] = a + b + c
+    return out
+```
+
 ### Config Serialization
 
 ```python
diff --git a/docs/index.md b/docs/index.md
@@ -35,6 +35,7 @@ portable between different hardware. Helion automates and autotunes over:
 
     * Automatically calculates strides and indices.
     * Autotunes choices among various indexing methods (pointers, block pointers, TensorDescriptors).
+    * Supports per-load indexing strategies for fine-grained memory access control.
 
 2. **Masking:**