Commit 33b6fd9

Merge pull request #112 from hek14/dev_ucm_sparse
[Feature] [Doc] UCMSparse framework
2 parents 8ea70c9 + e400ebc commit 33b6fd9

File tree

12 files changed: +1453 −1 lines changed

docs/source/feature/sparse_attn.md

Lines changed: 40 additions & 0 deletions
@@ -1 +1,41 @@
# Sparse Attention

## Motivations

Attention mechanisms, especially in LLMs, are often the latency bottleneck during inference because of their computational complexity. Although attention is essential for capturing contextual relationships, traditional attention must process all token interactions, which leads to significant delays.

<p align="center">
<img alt="UCM" src="../images/attention_overhead.png" width="80%">
</p>

Researchers have found that attention in LLMs is highly dispersed:

<p align="center">
<img alt="UCM" src="../images/attention_sparsity.png" width="80%">
</p>

This motivates researchers to actively develop sparse attention algorithms that address the latency issue. These algorithms reduce the number of token interactions by focusing only on the most relevant parts of the input, thereby lowering computation and memory requirements.
While promising, the gap between theoretical prototypes and practical implementations in inference frameworks remains a significant challenge.

Many existing frameworks, like vLLM, are optimized for traditional attention mechanisms. Adapting them for sparse attention can be complex and may require substantial modifications to the underlying architecture.
Issues such as maintaining compatibility with existing model architectures, ensuring efficient memory usage, and leveraging hardware acceleration must be addressed to facilitate the adoption of sparse attention in real-world applications.

We present a **unified sparse attention framework** under UCM. A unified framework streamlines the integration of various sparse attention algorithms into inference engines such as vLLM by providing **standardized interfaces and utilities** that simplify the implementation process.
With UCM, researchers can rapidly prototype and test different sparse attention algorithms without extensive re-engineering of the inference engine, and shared optimizations within the framework help ensure that the performance gains of sparse attention are realized in real-world scenarios.

## Architecture

### Overview

The core concept of our UCMSparse attention framework is to offload the complete Key-Value (KV) cache to a dedicated KV cache store. The sparse attention algorithm then identifies the KV pairs most relevant to the current context and selectively loads only those portions of the KV cache from storage into High Bandwidth Memory (HBM). This design significantly reduces the HBM footprint while accelerating generation.

<p align="center">
<img alt="UCM" src="../images/sparse_attn_arch.png" width="80%">
</p>

### Key Concepts

- UCMSparse in scheduler: this instance lives in the same process as the `EngineCore` and acts as a sparse attention budget controller. It estimates the number of slots required by a specific sparse attention algorithm, and `KVCacheManager` then allocates the necessary blocks based on `num_slots_sparse`. For example, `ESA` needs only 20%~30% of the blocks used by normal attention.
- UCMSparse in model_runner: this instance lives in the same process as the `Worker`.
  A typical sparse attention algorithm works like this (see the sketch after this list):
  1. In prefill, it dumps the full KV cache from HBM to storage.
  2. In decode, it retrieves the most relevant blocks based on the context and loads them from storage to HBM.
  3. In decode, it also dumps newly generated blocks to keep the latest context accessible.
- With fine-grained task scheduling, retrieval and loading can be executed asynchronously and overlap with model execution, so UCMSparse introduces no extra overhead, and generation speed is boosted by the reduced computational load and fewer memory accesses.
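The worker-side flow above can be condensed into a minimal, self-contained sketch. Everything in it (the dict-based store, the dot-product scoring, the block size) is a hypothetical illustration of the dump/retrieve/load pattern, not the actual UCMSparse API:

```python
import numpy as np

BLOCK = 16                                  # tokens per KV block (illustrative)
store: dict[str, list[np.ndarray]] = {}     # stands in for the external KV cache store

def prefill_dump(req: str, kv_blocks: list[np.ndarray]) -> None:
    # 1. Prefill: dump the full KV cache from HBM to storage.
    store[req] = list(kv_blocks)

def decode_step(req: str, query: np.ndarray, budget: int) -> list[np.ndarray]:
    # 2. Decode: score each stored block against the current query and keep
    #    only the `budget` most relevant ones (the sparse selection).
    scores = [float(query @ blk.mean(axis=0)) for blk in store[req]]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:budget]
    hbm_blocks = [store[req][i] for i in top]   # load only these blocks into HBM
    # 3. Decode also dumps newly generated KV so the latest context stays accessible.
    store[req].append(np.random.randn(BLOCK, query.shape[0]))
    return hbm_blocks

# A request with 8 KV blocks, of which only 2 are loaded per decode step.
prefill_dump("req-0", [np.random.randn(BLOCK, 64) for _ in range(8)])
print(len(decode_step("req-0", np.random.randn(64), budget=2)), "blocks loaded to HBM")
```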
See `ESA` for more details.
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Sparse Attention

This document provides a usage example and configuration guide for **sparse attention**, which is increasingly recognized for its ability to mitigate the challenges associated with High Bandwidth Memory (HBM) usage and to enhance the efficiency of large language models (LLMs).

## Configuration

To use sparse attention, you need to configure the `ucm_sparse_method` field in your model's launch configuration.

### Example:

```python
kv_connector_extra_config={
    "ucm_connector_name": "UcmDram",
    "ucm_connector_config": {
        "max_cache_size": 5368709120,
        "kv_block_size": 262144,
    },
    "ucm_sparse_method": "ESA"  # specify the sparse attention algorithm here
}
```

## Launching Inference

### Offline Inference

To start **offline inference** with sparse attention, modify the script `examples/offline_inference.py` to include the `ucm_sparse_method` field, and use a long prompt to see the acceleration effects:

```python
# In examples/offline_inference.py
ktc = KVTransferConfig(
    ...
    kv_connector_extra_config={
        "ucm_connector_name": "UcmDram",
        "ucm_connector_config": {
            "max_cache_size": 5368709120,
            "kv_block_size": 262144,
        },
        "ucm_sparse_method": "ESA"  # specify the sparse attention algorithm here
    }
)

prompts = [
    "PUT A LONG PROMPT HERE TO SEE ACCELERATION EFFECTS."
]
```

Then run the script as follows:

```bash
cd examples/
export PYTHONHASHSEED=123456
python offline_inference.py
```

### Online Inference

For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
export PYTHONHASHSEED=123456
vllm serve /home/models/Qwen2.5-14B-Instruct \
  --max-model-len 20000 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.87 \
  --trust-remote-code \
  --port 7800 \
  --kv-transfer-config \
  '{
    "kv_connector": "UnifiedCacheConnectorV1",
    "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "ucm_connector_name": "UcmNfsStore",
      "ucm_connector_config": {
        "storage_backends": "/mnt/test",
        "kv_block_size": 33554432
      },
      "ucm_sparse_method": "ESA"
    }
  }'
```

If you see logs like the following:

```bash
INFO: Started server process [1049932]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Congratulations, you have successfully started the vLLM server with the NFS connector!
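Optionally, you can confirm the server is reachable before sending completions. A minimal check against the standard OpenAI-compatible `/v1/models` endpoint, assuming the `requests` package is installed and the port matches the command above:

```python
# Optional sanity check: list the models served by the OpenAI-compatible server.
import requests

resp = requests.get("http://localhost:7800/v1/models")
print(resp.status_code)   # expect 200
print(resp.json())        # served model(s), e.g. /home/models/Qwen2.5-14B-Instruct
```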

After the vLLM server has started successfully, you can interact with the API as follows:

```bash
curl http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/models/Qwen2.5-14B-Instruct",
    "prompt": "PUT A LONG PROMPT HERE TO SEE ACCELERATION EFFECTS.",
    "max_tokens": 100,
    "temperature": 0
  }'
```
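The same request can also be issued from Python. A minimal sketch using the official `openai` client (assumed to be installed; the placeholder API key is arbitrary since the server above was started without an API key):

```python
# Equivalent completion request via the OpenAI Python client (illustrative).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="EMPTY")
completion = client.completions.create(
    model="/home/models/Qwen2.5-14B-Instruct",
    prompt="PUT A LONG PROMPT HERE TO SEE ACCELERATION EFFECTS.",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)
```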
3 binary image files added (87.1 KB, 177 KB, 103 KB)

examples/offline_inference.py

Lines changed: 2 additions & 1 deletion
@@ -30,6 +30,7 @@ def build_llm_with_uc(module_path: str, name: str, model: str):
                 "max_cache_size": 5368709120,
                 "kv_block_size": 262144,
             },
+            "ucm_sparse_method": "ESA",
         },
     )

@@ -66,7 +67,7 @@ def print_output(
 def main():
     module_path = "unifiedcache.integration.vllm.uc_connector"
     name = "UnifiedCacheConnectorV1"
-    model = "/home/models/Qwen2.5-14B-Instruct"
+    model = os.getenv("MODEL_PATH", "/home/models/Qwen2.5-14B-Instruct")
 
     setup_environment_variables()
 
Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
"""
UcmSparseBase provides interfaces for implementing general sparse attention algorithms in vLLM.

The class provides the following primitives:
    Scheduler-side: runs in the scheduler and binds metadata, which
        is used by the worker side to retrieve/load KV cache.
        estimate_num_slots_sparsed() - get the number of required slots.
        update_state_after_alloc() - update UcmSparse state after
            temporary buffer allocation by the CacheManager.
        request_finished_in_scheduler() - called when a request is finished, with
            the computed KV cache blocks for the request.
            Returns metadata for the next step.

    Worker-side: runs in each worker and retrieves/loads KV cache.
        execute_begin() - hook at the beginning of "ModelRunner->execute_model".
        execute_finished() - hook at the end of "ModelRunner->execute_model".
        attention_begin() - hook at the beginning of "unified_attention".
        attention_finished() - hook at the end of "unified_attention".
        request_finished_in_worker() - release resources, such as block features.
"""

from __future__ import annotations

import enum
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, Dict, List, Optional, Union

if TYPE_CHECKING:
    from vllm.v1.core.sched.output import SchedulerOutput
    from vllm.v1.request import Request
    from vllm.attention.backends.abstract import AttentionMetadata
    from unifiedcache.ucm_connector.base import UcmKVStoreBase
    from vllm.config import VllmConfig

import torch
from vllm.distributed.kv_transfer import get_kv_transfer_group, has_kv_transfer_group
from vllm.forward_context import ForwardContext
from vllm_ascend.worker.npu_input_batch import CachedRequestState, InputBatch

INVALID_SLOT = -1


class UcmSparseRole(enum.Enum):
    # sparser running in the scheduler process
    SCHEDULER = 0

    # sparser running in the worker process
    WORKER = 1


class UcmSparseMetadata(ABC):  # noqa: B024
    """
    Abstract metadata used to communicate between the
    scheduler UcmSparse instance and the worker UcmSparse instance.
    """

    pass


class UcmSparseBase(ABC):
    """
    A general interface for implementing sparse attention algorithms in vLLM.
    """

    def __init__(self, vllm_config: VllmConfig, role: UcmSparseRole):
        self._sparse_metadata: Optional[UcmSparseMetadata] = None
        self._vllm_config = vllm_config
        self._role = role

    @property
    def role(self) -> UcmSparseRole:
        return self._role

    # ==============================
    # Worker-side methods
    # ==============================

    def bind_sparse_metadata(self, sparse_metadata: UcmSparseMetadata) -> None:
        """Set the sparse metadata from the scheduler.

        This function should be called by the model runner every time
        before the model execution. The metadata will be used for runtime
        KV cache loading and saving.

        Args:
            sparse_metadata (UcmSparseMetadata): the sparse metadata.
        """
        self._sparse_metadata = sparse_metadata

    def clear_sparse_metadata(self) -> None:
        """Clear the sparse metadata.

        This function should be called by the model runner every time
        after the model execution.
        """
        self._sparse_metadata = None

    def _get_sparse_metadata(self) -> UcmSparseMetadata:
        """Get the sparse metadata.

        This function should only be called inside UCMSparse.

        Returns:
            UcmSparseMetadata: the UCM sparse metadata.
        """
        # Should only be called while set to valid metadata.
        assert self._sparse_metadata is not None
        return self._sparse_metadata

    def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
        """
        Args:
            kv_caches: dictionary mapping layer names to KV cache tensors.
        """
        pass

    def execute_begin(self, scheduler_output: SchedulerOutput):
        """
        This is called at the beginning of the "ModelRunner->execute_model" function.
        """
        pass

    def execute_finished(self):
        """
        This is called at the end of the "ModelRunner->execute_model" function.
        """
        pass

    def attention_begin(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        layer_name: str,
        forward_context: ForwardContext,
    ) -> None:
        """
        This is called at the beginning of "unified_attention".
        A sparse attention algorithm can modify forward_context.attn_metadata if necessary.
        (UC_TODO: is modifying the dataclass not allowed in Python?)
        """
        pass

    def attention_finished(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attn_output: torch.Tensor,
        layer_name: str,
        forward_context: ForwardContext,
    ) -> None:
        """
        This is called at the end of "unified_attention".
        """
        pass

    def request_finished_in_worker(self, request_id: Union[int, str]):
        """
        This function releases the resources of finished requests on the worker side.
        """
        pass

    # ==============================
    # Scheduler-side methods
    # ==============================

    @abstractmethod
    def request_begin(self, request_id: Union[int, str], prompt_token_ids: List[int]):
        """
        This is called at the beginning of the "Scheduler->add_request" function.
        """
        pass

    def request_finished_in_scheduler(self, request_id: Union[int, str]):
        """
        This is called inside the "Scheduler->finish_requests" function.
        Generates the metadata required by the worker-side UcmSparse instance.
        """
        pass

    def estimate_num_slots_sparsed(self, request: Request) -> int:
        """
        This is called by the "Scheduler->schedule" function to estimate the number of required blocks.
        """
        pass

    def update_state_after_alloc(self, request: Request, num_blocks: int):
        """
        Update UcmSparse state after block allocation.
        """
        pass

    def build_sparse_meta(
        self,
        scheduler_output: SchedulerOutput,
        requests: dict[str, CachedRequestState],
        input_batch: InputBatch,
    ) -> UcmSparseMetadata:
        """
        Build the sparse metadata for this step.
        """
        pass
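As an illustration of how this interface is meant to be subclassed, here is a minimal, hypothetical sketch that is not part of this commit. It assumes the definitions above are importable and that vLLM's `Request` exposes `num_tokens`; the class name and trivial bodies are placeholders:

```python
# Hypothetical, minimal UcmSparseBase subclass: it applies no sparsity and only
# implements the abstract scheduler-side hook. Illustrative only.
class NoOpSparse(UcmSparseBase):
    def request_begin(self, request_id, prompt_token_ids):
        # A real algorithm would record per-request state here
        # (e.g. block features used later for retrieval).
        pass

    def estimate_num_slots_sparsed(self, request):
        # No sparsification: keep every token's KV, so the "budget" equals the
        # full request length. An ESA-style algorithm would return ~20-30% of it.
        return request.num_tokens
```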
