
Commit 09505bb

[Doc] Release Note and Known Issue List for IPEX XPU 2.8 (#5745)

1 parent: c6427e5
2 files changed: 41 additions, 20 deletions

docs/tutorials/known_issues.md

Lines changed: 3 additions & 20 deletions

````diff
@@ -81,26 +81,9 @@ Troubleshooting
 - **Cause**: The C++ compiler is not activated. `torch.compile` needs to find the correct `cl.exe` path.
 - **Solution**: Open "Developer Command Prompt for VS 2022" or follow [Visual Studio Developer Command Prompt and Developer PowerShell](https://learn.microsoft.com/en-us/visualstudio/ide/reference/command-prompt-powershell?view=vs-2022#developer-command-prompt) to activate the Visual Studio environment.
 
-- **Problem**: LoweringException: ImportError: cannot import name 'intel' from 'triton._C.libtriton'
-- **Cause**: Installing Triton causes pytorch-triton-xpu to stop working.
-- **Solution**: Resolve the issue with the following commands:
-
-  ```bash
-  pip list | grep triton
-  # If triton related packages are listed, remove them
-  pip uninstall triton
-  pip uninstall pytorch-triton-xpu
-  # Reinstall the correct version of pytorch-triton-xpu
-  pip install pytorch-triton-xpu==3.3.0 --index-url https://download.pytorch.org/whl/xpu
-  ```
-
-- **Problem**: RuntimeError: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type, when executing DLRMv2 BF16 training on a 4-card Intel® Data Center GPU Max platform.
-- **Cause**: The issue exists in the default SYCL path of oneCCL 2021.14, which uses two IPC exchanges.
-- **Solution**: Use `export CCL_ATL_TRANSPORT=ofi` to work around it.
-
-- **Problem**: Segmentation fault when executing LLaMa2-70B inference on the Intel® Data Center GPU Max platform, based on online quantization.
-- **Cause**: The issue exists in Intel Neural Compressor (INC) v3.3: during the initial import of INC, the accelerator is cached with `lru_cache`, so setting `INC_TARGET_DEVICE` in the INC transformers-like API afterwards does not take effect. This results in two devices being present in the model, leading to the memory-related errors seen in the error messages.
-- **Solution**: Run the workload with `INC_TARGET_DEVICE="cpu" python` to work around it, if using online quantization.
+- **Problem**: System hang when executing llama3-8b and phi3-mini FSDP fine-tuning cases based on the XCCL backend on the Intel® Data Center GPU Max platform; the hang occurs after workload completion and before the process exits.
+- **Cause**: Compatibility issue between accelerate v1.8.1 and transformers v4.51.3.
+- **Solution**: Use torch-ccl in place of XCCL as a workaround.
 
 ## Performance Issue
````
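The workaround in the new known-issues entry, replacing XCCL with torch-ccl, can be sketched at the model-script level. This is a hypothetical helper, not part of IPEX: `pick_backend` is an illustrative name, and the `ccl` path additionally requires the `oneccl_bindings_for_pytorch` package to be installed before the process group is initialized.

```python
import torch.distributed as dist

def pick_backend(prefer_xccl: bool = True) -> str:
    """Return the process-group backend name for Intel GPU runs.

    Hypothetical helper: prefer PyTorch's built-in XCCL backend, but
    allow forcing torch-ccl's 'ccl' backend, e.g. to work around the
    FSDP hang described above.
    """
    # is_xccl_available appears in recent PyTorch; guard for older builds.
    if prefer_xccl and getattr(dist, "is_xccl_available", lambda: False)():
        return "xccl"
    return "ccl"

backend = pick_backend(prefer_xccl=False)  # force the torch-ccl fallback
if backend == "ccl":
    # torch-ccl registers the 'ccl' backend when imported (assumes the
    # oneccl_bindings_for_pytorch package is installed):
    # import oneccl_bindings_for_pytorch
    pass
# dist.init_process_group(backend=backend)  # then run FSDP fine-tuning as usual
```
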

docs/tutorials/releases.md

Lines changed: 38 additions & 0 deletions
````diff
@@ -1,6 +1,44 @@
 Releases
 =============
 
+We launched Intel® Extension for PyTorch\* in 2020 with the goal of extending the official PyTorch\* to simplify achieving high performance on Intel® CPU and GPU platforms. Over the years, we have successfully upstreamed most of our features and optimizations for Intel® platforms into PyTorch\*. Moving forward, our strategy is to focus on developing new features and supporting upcoming platform launches directly within PyTorch\*. We are discontinuing active development on Intel® Extension for PyTorch\*, effective immediately after the 2.8 release. We will continue to provide critical bug fixes and security patches throughout the PyTorch\* 2.9 timeframe to ensure a smooth transition for our partners and the community.
+
+## 2.8.10+xpu
+
+Intel® Extension for PyTorch\* v2.8.10+xpu is the new release which supports Intel® GPU platforms (Intel® Arc™ Graphics family, Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 Mobile Processors and Intel® Data Center GPU Max Series) based on PyTorch\* 2.8.0.
+
+### Highlights
+
+- Intel® oneDNN v3.8.1 integration
+- Intel® Deep Learning Essentials 2025.1.3 compatibility
+- Large Language Model (LLM) optimization
+
+  Intel® Extension for PyTorch\* optimizes the performance of Qwen3, along with other typical LLM models, on Intel® GPU platforms, with the supported transformers version upgraded to [4.51.3](https://github.com/huggingface/transformers/releases/tag/v4.51.3). A full list of optimized LLM models is available in the [LLM Optimizations Overview](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/llm.html). Intel® Extension for PyTorch\* also adds support for more custom kernels, such as `selective_scan_fn`, `causal_conv1d_fn` and `causal_conv1d_update`, to support the functionality of the [Jamba](https://arxiv.org/abs/2403.19887) model.
+
+- PyTorch\* XCCL adoption for distributed scenarios
+
+  Intel® Extension for PyTorch\* adopts the PyTorch\* XCCL backend for distributed scenarios on the Intel® GPU platform. We observed that the scaling performance using PyTorch\* XCCL is on par with oneCCL Bindings for PyTorch\* (torch-ccl) for validated AI workloads. As a result, we will discontinue active development of torch-ccl immediately after the 2.8 release.
+
+  A pseudocode example illustrating the transition from torch-ccl to PyTorch\* XCCL at the model script level is shown below:
+
+  ```python
+  import torch
+
+  if torch.distributed.is_xccl_available():
+      torch.distributed.init_process_group(backend='xccl')
+  else:
+      import oneccl_bindings_for_pytorch
+      torch.distributed.init_process_group(backend='ccl')
+  ```
+
+- Redundant code removal
+
+  Intel® Extension for PyTorch\* no longer overrides the device allocator; it is recommended to use the allocator provided by PyTorch\* instead. Intel® Extension for PyTorch\* also removes all overridden oneMKL and oneDNN related operators except GEMM and SDPA.
+
+### Known Issues
+
+Please refer to the [Known Issues webpage](./known_issues.md).
+
 ## 2.7.10+xpu
 
 Intel® Extension for PyTorch\* v2.7.10+xpu is the new release which supports Intel® GPU platforms (Intel® Arc™ Graphics family, Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 Mobile Processors and Intel® Data Center GPU Max Series) based on PyTorch\* 2.7.0.
````
