
Conversation

@snnn (Member) commented Oct 15, 2025

This commit introduces BFloat16 support for Gemm and MatMul operators on the CPU execution provider.

Key changes:

  • Added BFloat16 data type and moved related files to onnxruntime/core/common.
  • Implemented MlasBf16AccelerationSupported to detect hardware support for BFloat16 (see the sketch after this list).
  • Added Gemm and MatMul kernels for BFloat16 using Eigen.
  • Registered the new kernels for the CPU execution provider.
  • Added unit tests for BFloat16 Gemm and MatMul.
  • Fixed ambiguous comparison operators for BFloat16.
  • Moved endian.h/float8.h/float16.h from onnxruntime_framework.lib to onnxruntime_common.lib, because onnxruntime_util.lib depends on these headers; this avoids a circular dependency.
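
A minimal sketch of how such a hardware check can sit on top of the cpuinfo library (which this PR also wires up on Windows). The body of MlasBf16AccelerationSupported is not shown in this conversation, so the function below is an assumption based on cpuinfo's public API, not the PR's actual code:

```cpp
// Hedged sketch: detect native BFloat16 support via the cpuinfo library.
// The real MlasBf16AccelerationSupported may use different checks.
#include <cpuinfo.h>

bool Bf16AccelerationSupportedSketch() {
  // cpuinfo_initialize() is safe to call repeatedly; if it fails, every
  // feature query would report false anyway.
  if (!cpuinfo_initialize()) {
    return false;
  }
#if defined(__x86_64__) || defined(_M_X64)
  return cpuinfo_has_x86_avx512bf16();  // AVX512-BF16 (Cooper Lake and newer)
#elif defined(__aarch64__) || defined(_M_ARM64)
  return cpuinfo_has_arm_bf16();        // Armv8.6-A BF16 extension
#else
  return false;
#endif
}
```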

snnn added 2 commits October 15, 2025 12:44
@github-actions bot (Contributor) left a comment

You can commit the suggested changes from lintrunner.

Comment on lines 2431 to 2432

```cpp
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 13, int64_t, MatMul)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kOnnxDomain, 13, BFloat16, MatMul)>,
```

Suggested change: whitespace/formatting only.
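
For context, a BuildKernelCreateInfo entry like the one above typically pairs with a typed kernel definition elsewhere in the CPU provider. A hedged sketch of what that definition might look like, patterned on onnxruntime's existing typed registrations (the exact builder arguments are an assumption, not copied from this PR):

```cpp
// Hedged sketch of a typed CPU kernel registration for BFloat16 MatMul,
// modeled on the float/int64_t registrations; details may differ in the PR.
ONNX_CPU_OPERATOR_TYPED_KERNEL(
    MatMul,                                   // op name
    13,                                       // opset version
    BFloat16,                                  // type parameter
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<BFloat16>()),
    MatMul<BFloat16>);
```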

Comment on lines 185 to 186

```cpp
                      ptrdiff_t N, ptrdiff_t K, BFloat16 alpha, const BFloat16* A, const BFloat16* B, BFloat16 beta,
                      BFloat16* C, ThreadPool*) {
```

Suggested change: whitespace/formatting only.

Further context from the same function, with lintrunner comments pinned to individual lines (each suggested change is whitespace/formatting only):

```cpp
      switch (TransB) {
        case CblasNoTrans:
          C_mat.noalias() += alpha_bfloat * (ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(B), N, K) *
                                             ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(A), K, M));
          return;
        case CblasTrans:
          C_mat.noalias() += alpha_bfloat * (ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(B), K, N).transpose() *
                                             ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(A), K, M));
      // ...
      switch (TransB) {
        case CblasNoTrans:
          C_mat.noalias() += alpha_bfloat * (ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(B), N, K) *
                                             ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(A), M, K).transpose());
          return;
        case CblasTrans:
          C_mat.noalias() += alpha_bfloat * (ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(B), K, N).transpose() *
                                             ConstEigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<const Eigen::bfloat16*>(A), M, K).transpose());
```
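
Pieced together, here is a self-contained sketch of what this Eigen fallback computes. The `EigenMatrixMap` aliases, the `CBLAS_TRANSPOSE` enum, and the beta handling are assumptions reconstructed from the visible fragments, not the PR's exact code:

```cpp
// Hedged reconstruction of the Eigen-based BFloat16 GEMM fallback.
#include <Eigen/Core>
#include <cstddef>

// Assumed aliases, mirroring onnxruntime's math utilities.
template <typename T>
using EigenMatrixMap = Eigen::Map<Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>>;
template <typename T>
using ConstEigenMatrixMap = Eigen::Map<const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>>;

// CBLAS convention values; assumed for self-containedness.
enum CBLAS_TRANSPOSE { CblasNoTrans = 111, CblasTrans = 112 };

void GemmBf16Sketch(CBLAS_TRANSPOSE TransA, CBLAS_TRANSPOSE TransB,
                    ptrdiff_t M, ptrdiff_t N, ptrdiff_t K,
                    Eigen::bfloat16 alpha, const Eigen::bfloat16* A,
                    const Eigen::bfloat16* B, Eigen::bfloat16 beta,
                    Eigen::bfloat16* C) {
  // Row-major M x N output viewed as a column-major N x M Eigen matrix.
  auto C_mat = EigenMatrixMap<Eigen::bfloat16>(C, N, M);
  // Standard GEMM semantics: C = alpha * op(A) * op(B) + beta * C.
  if (beta == Eigen::bfloat16(0.f)) {
    C_mat.setZero();  // overwrite; never read possibly-uninitialized C
  } else {
    C_mat *= beta;
  }
  switch (TransA) {
    case CblasNoTrans:
      switch (TransB) {
        case CblasNoTrans:
          C_mat.noalias() += alpha * (ConstEigenMatrixMap<Eigen::bfloat16>(B, N, K) *
                                      ConstEigenMatrixMap<Eigen::bfloat16>(A, K, M));
          return;
        case CblasTrans:
          C_mat.noalias() += alpha * (ConstEigenMatrixMap<Eigen::bfloat16>(B, K, N).transpose() *
                                      ConstEigenMatrixMap<Eigen::bfloat16>(A, K, M));
          return;
      }
      return;
    case CblasTrans:
      switch (TransB) {
        case CblasNoTrans:
          C_mat.noalias() += alpha * (ConstEigenMatrixMap<Eigen::bfloat16>(B, N, K) *
                                      ConstEigenMatrixMap<Eigen::bfloat16>(A, M, K).transpose());
          return;
        case CblasTrans:
          C_mat.noalias() += alpha * (ConstEigenMatrixMap<Eigen::bfloat16>(B, K, N).transpose() *
                                      ConstEigenMatrixMap<Eigen::bfloat16>(A, M, K).transpose());
          return;
      }
      return;
  }
}
```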

Comment on lines 168 to 169

```cpp
             VECTOR_HEAD(X_bf16), VECTOR_HEAD(W_bf16), kZero_bf16, VECTOR_HEAD(Y_bf16),
             tp.get());
```

Suggested change: whitespace/formatting only.

Comment on lines 175 to 176

```cpp
             VECTOR_HEAD(X_fp32), VECTOR_HEAD(W_fp32), 0.0f, VECTOR_HEAD(Y_ref),
             tp.get());
```

Suggested change: whitespace/formatting only.

Comment on lines 211 to 212

```cpp
             VECTOR_HEAD(X_bf16), VECTOR_HEAD(W_bf16), kZero_bf16, VECTOR_HEAD(Y_bf16),
             tp.get());
```

Suggested change: whitespace/formatting only.

Comment on lines 218 to 219

```cpp
             VECTOR_HEAD(X_fp32), VECTOR_HEAD(W_fp32), 0.0f, VECTOR_HEAD(Y_ref),
             tp.get());
```

Suggested change: whitespace/formatting only.
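
These calls follow a common test pattern: run the GEMM in BFloat16, run the same problem through the fp32 path as a reference, and compare with a tolerance loose enough for bfloat16's 8-bit significand. A minimal sketch of that pattern, assuming `GemmBf16Sketch` from above and a naive fp32 reference (the helper name, shapes, and tolerance are illustrative, not the test's actual code; `VECTOR_HEAD` in the fragments presumably yields a vector's data pointer):

```cpp
// Hedged sketch of the bf16-vs-fp32 reference comparison used in such tests.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

void CheckBf16AgainstFp32(ptrdiff_t M, ptrdiff_t N, ptrdiff_t K,
                          const std::vector<float>& X_fp32,
                          const std::vector<float>& W_fp32) {
  assert(X_fp32.size() == static_cast<size_t>(M * K));
  assert(W_fp32.size() == static_cast<size_t>(K * N));

  // Round inputs to bfloat16 and run the bf16 kernel (alpha = 1, beta = 0).
  std::vector<Eigen::bfloat16> X_bf16(X_fp32.begin(), X_fp32.end());
  std::vector<Eigen::bfloat16> W_bf16(W_fp32.begin(), W_fp32.end());
  std::vector<Eigen::bfloat16> Y_bf16(static_cast<size_t>(M * N));
  GemmBf16Sketch(CblasNoTrans, CblasNoTrans, M, N, K,
                 Eigen::bfloat16(1.f), X_bf16.data(), W_bf16.data(),
                 Eigen::bfloat16(0.f), Y_bf16.data());

  // fp32 reference: naive row-major triple loop.
  std::vector<float> Y_ref(static_cast<size_t>(M * N), 0.0f);
  for (ptrdiff_t m = 0; m < M; ++m)
    for (ptrdiff_t n = 0; n < N; ++n)
      for (ptrdiff_t k = 0; k < K; ++k)
        Y_ref[m * N + n] += X_fp32[m * K + k] * W_fp32[k * N + n];

  // bfloat16 keeps ~8 significand bits, so allow ~1% relative error.
  for (size_t i = 0; i < Y_ref.size(); ++i) {
    const float got = static_cast<float>(Y_bf16[i]);
    const float tol = 1e-2f * std::max(1.0f, std::fabs(Y_ref[i]));
    assert(std::fabs(got - Y_ref[i]) <= tol);
  }
}
```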

@github-actions bot (Contributor) left a comment

You can commit the suggested changes from lintrunner.

```cpp
  InitializeCpuInfo();
}
```

```cpp
*/
void WindowsEnv::InitializeCpuInfo() {
  // Initialize cpuinfo once on Windows similar to PosixEnv constructor.
  (void)cpuinfo_initialize(); //Ignore the error if it failed to initialize
```

Suggested change:

```cpp
  (void)cpuinfo_initialize(); // Ignore the error if it failed to initialize
```

@yuslepukhin (Member) left a comment

  1. I am not seeing tests specifically for BFloat16.
  2. I think we had better separate the refactoring from the BFloat16 changes.

Comment on lines 873 to +875

```cpp
void WindowsEnv::InitializeCpuInfo() {
  // Initialize cpuinfo once on Windows similar to PosixEnv constructor.
```

@yuslepukhin (Member): Do we need a macro here for whether cpuinfo is supported?

@snnn (Member, Author): I added code to force the library to be available on Windows.

Comment on the beta handling in the BFloat16 GEMM fallback:

```cpp
                      ptrdiff_t N, ptrdiff_t K, BFloat16 alpha, const BFloat16* A, const BFloat16* B, BFloat16 beta,
                      BFloat16* C, ThreadPool*) {
  auto C_mat = EigenMatrixMap<Eigen::bfloat16>(reinterpret_cast<Eigen::bfloat16*>(C), N, M);
  if (beta == BFloat16(0.f)) {
```

@yuslepukhin (Member): Is beta an input, a local variable, or a mistyped output? It is not clear from the code.
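
For reference, beta in the standard GEMM contract is an input scale applied to the existing contents of C (C = alpha * op(A) * op(B) + beta * C). A sketch of the conventional handling, which the fragment above appears to follow (the Eigen conversion detail is an assumption):

```cpp
// Standard GEMM beta handling: when beta == 0, overwrite C rather than
// scale it, so possibly-uninitialized output memory cannot propagate NaNs.
if (beta == BFloat16(0.f)) {
  C_mat.setZero();
} else {
  C_mat *= Eigen::bfloat16(static_cast<float>(beta));
}
```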

@snnn (Member, Author) commented Oct 16, 2025

> I am not seeing tests specifically for BFloat16

I added tests:

```
onnxruntime_test_all --gtest_filter=MathBFloat16GemmTests/*
```

And I tested them on my local machine, which has BFloat16 support.

@snnn (Member, Author) commented Oct 16, 2025

I will split the refactoring (the renaming of the files) into a new PR.

snnn added a commit that referenced this pull request Oct 17, 2025
Move `endian.h`, `float16.h`, and `float8.h` from `core/framework/` to
`core/common/` to avoid circular dependencies and improve architectural
layering.

## Motivation

These headers define fundamental data types that are used across
multiple low-level libraries:
- `onnxruntime_common` (foundation layer)
- `onnxruntime_mlas` (math library, depends on common)
- `onnxruntime_util` (utilities, depends on common)
- `onnxruntime_graph` (graph IR, depends on common)

Previously, these types were in `core/framework/`, which is part of the
`onnxruntime_framework` library that sits at a higher architectural
level. This created circular dependency issues, since mlas uses
`float16.h`.

## Changes

### File Moves (3 files):
- `include/onnxruntime/core/framework/endian.h` →
`include/onnxruntime/core/common/endian.h`
- `include/onnxruntime/core/framework/float16.h` →
`include/onnxruntime/core/common/float16.h`
- `include/onnxruntime/core/framework/float8.h` →
`include/onnxruntime/core/common/float8.h`

### Include Path Updates (53 files):
Updated all references from:
- `core/framework/endian.h` → `core/common/endian.h`
- `core/framework/float16.h` → `core/common/float16.h`
- `core/framework/float8.h` → `core/common/float8.h`
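
For example, a consumer of the half-precision type updates its include like so:

```cpp
// Before: header lived in the framework layer.
#include "core/framework/float16.h"
// After: header lives in the common layer, reachable from mlas/util/graph.
#include "core/common/float16.h"
```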

Affected components:
- Contrib ops (CPU, CUDA, ROCm)
- Core framework and utilities
- Providers (CPU, CUDA, CANN, QNN, OpenVINO, MIGraphX)
- Tests
- Training code

## Architectural Benefits

This change establishes clearer architectural boundaries:

```
Level 0 (Foundation):
  onnxruntime_common (includes endian, float16, float8)
  onnxruntime_mlas → depends on common

Level 1 (Core):
  onnxruntime_util → depends on common
  onnxruntime_graph → depends on common

Level 2 (Framework):
  onnxruntime_framework → depends on common
```

By placing fundamental types in `common`, we ensure:
1. No circular dependencies between library targets
2. Lower-level libraries can access these types without pulling in
framework
3. Clear separation between fundamental types (common) and
framework-specific types like int4, float4 (framework)

This PR is split from #26317 as suggested by the reviewer.
Successfully merging this pull request may close these issues.

bfloat16 causing an error: NOT_IMPLEMENTED: Could not find an implementation for MatMul(13)
