Conversation

@johnnynunez

No description provided.

@gemini-code-assist

Summary of Changes

Hello @johnnynunez, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving and fixing the build process for ARM architectures, specifically on aarch64 Linux. It integrates more precise ARM CPU instruction detection, applies appropriate compiler optimizations to leverage advanced ARM features, and refines core matrix multiplication routines to ensure compatibility and optimal performance. These changes are vital for enhancing the project's support and efficiency on ARM-based systems.

Highlights

  • ARM CPU Instruction Detection: Enhanced the setup.py script to accurately detect ARM CPU features like Scalable Vector Extension (SVE) and NEON on aarch64 Linux systems, improving the build configuration for ARM architectures.
  • ARM-specific CMake Flags: Introduced explicit C and C++ compiler flags (-march=armv8.6-a+...) for aarch64 Linux builds within setup.py, enabling a broad range of advanced ARMv8.6-a features for optimized performance (a hedged sketch of this wiring follows this list).
  • Quantization Matrix Multiplication Adjustments: Modified iqk_mul_mat.inc and tinyblas_cpu_sgemm.inc to adapt quantized matrix multiplication functions for ARM, including changes to function signatures and to how scales are loaded for the Q8_0 and Q8_1 types using NEON intrinsics.
  • Simplified ARM SGEMM Dispatch: In sgemm.cpp, the ARM-specific SGEMM (Single-precision General Matrix Multiply) dispatch logic was simplified to directly utilize ARMv8.2+ optimized functions, removing runtime hardware capability checks for a more streamlined approach.
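The concrete '+feature' suffixes behind "-march=armv8.6-a+..." are elided above. Purely as a hedged illustration, the setup.py wiring described in the second highlight would look roughly like the sketch below; the feature list shown is an assumption, not the PR's actual string:

    import platform

    def add_aarch64_flags(cmake_args):
        # Hedged sketch: the '+' feature modifiers below are illustrative
        # assumptions; the PR hardcodes its own armv8.6-a feature string.
        if platform.system() == "Linux" and platform.machine() == "aarch64":
            arch = "-march=armv8.6-a+sve+sve2+fp16+dotprod"  # assumed example
            cmake_args += [
                "-DCMAKE_C_FLAGS=" + arch,
                "-DCMAKE_CXX_FLAGS=" + arch,
            ]
        return cmake_args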


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to add support for ARM builds, but it introduces several critical issues that need to be addressed. The changes to the build script break support for Windows and use hardcoded CPU features that will limit ARM compatibility. More critically, changes in sgemm.cpp break builds for all non-ARM architectures by unconditionally selecting ARM-specific kernels. There are also some performance and clarity issues in the NEON implementation for quantized matrix multiplication. Please review the suggested changes to resolve these regressions and improve the implementation.

Comment on lines 181 to +214
(previous implementation, removed by this PR)

    if sys.platform.startswith("linux"):
        with open('/proc/cpuinfo', 'r', encoding="utf-8") as cpu_f:
            cpuinfo = cpu_f.read()
        flags_line = [line for line in cpuinfo.split('\n') if line.startswith('flags')][0]
        flags = flags_line.split(':')[1].strip().split(' ')
        # fancy with AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-VNNI
        for flag in flags:
            if 'avx512bw' in flag:
                return 'fancy'
        for flag in flags:
            if 'avx512' in flag:
                return 'avx512'
        for flag in flags:
            if 'avx2' in flag:
                return 'avx2'
        raise ValueError("Unsupported cpu Instructions: {}".format(flags_line))
    elif sys.platform == "win32":
        from cpufeature.extension import CPUFeature

        if CPUFeature.get("AVX512bw", False):
            return 'fancy'
        if CPUFeature.get("AVX512f", False):
            return 'avx512'
        if CPUFeature.get("AVX2", False):
            return 'avx2'
        raise ValueError("Unsupported cpu Instructions: {}".format(str(CPUFeature)))
    else:
        raise ValueError("Unsupported platform: {}".format(sys.platform))

(new implementation, added by this PR)

    if platform.machine() == "aarch64":
        # Adapt this part based on GH200's /proc/cpuinfo
        for line in cpuinfo.split('\n'):
            if line.startswith('Features'):
                features_line = line
                features = features_line.split(':')[1].strip().split(' ')
                if 'sve' in features:  # Example: Scalable Vector Extension
                    return 'sve'  # Or a custom label
                elif 'neon' in features:
                    return 'neon'
                else:
                    print("Using generic Arm CPU instructions")
                    return 'native_arm'  # Or a default Arm label
        print("Warning: Could not find 'Features' line in /proc/cpuinfo on aarch64. Using native.")
        return 'native'  # Fallback for aarch64 if 'Features' not found
    else:  # Assume x86-like if not aarch64
        for line in cpuinfo.split('\n'):
            if line.startswith('flags'):
                flags_line = line
                flags = flags_line.split(':')[1].strip().split(' ')
                if 'avx512bw' in flags:
                    return 'fancy'
                elif 'avx512' in flags:
                    return 'avx512'
                elif 'avx2' in flags:
                    return 'avx2'
                raise ValueError("Unsupported cpu Instructions: {}".format(flags_line))
        print("Warning: Could not find 'flags' line in /proc/cpuinfo on x86-like. Using native.")
        return 'native'  # Fallback for x86-like if 'flags' not found


high

This change removes support for Windows (win32) and other non-Linux platforms from the get_cpu_instruct method. The previous implementation had specific logic for Windows and a fallback for other platforms. On non-Linux systems, this function will now return None, which will likely cause failures later in the build process. While this PR focuses on ARM (which is often Linux-based), breaking existing platform support is a significant regression. Please restore the support for other platforms like Windows to avoid breaking builds for other users.
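For reference, here is a minimal sketch (not the PR's code) of how the previous win32 branch could be kept alongside the new aarch64 detection, assuming the cpufeature package remains a dependency on Windows; the 'asimd' check is an addition based on how aarch64 kernels report NEON in /proc/cpuinfo:

    import platform
    import sys

    def get_cpu_instruct():
        # Hedged sketch only: keeps the original win32/cpufeature branch while
        # adding the aarch64 Features-based detection from this PR.
        if sys.platform.startswith("linux"):
            with open('/proc/cpuinfo', 'r', encoding="utf-8") as cpu_f:
                cpuinfo = cpu_f.read()
            if platform.machine() == "aarch64":
                for line in cpuinfo.split('\n'):
                    if line.startswith('Features'):
                        features = line.split(':')[1].strip().split(' ')
                        if 'sve' in features:
                            return 'sve'
                        # aarch64 kernels list NEON as 'asimd' in Features
                        if 'asimd' in features or 'neon' in features:
                            return 'neon'
                return 'native'
            for line in cpuinfo.split('\n'):
                if line.startswith('flags'):
                    flags = line.split(':')[1].strip().split(' ')
                    if any('avx512bw' in f for f in flags):
                        return 'fancy'
                    if any('avx512' in f for f in flags):
                        return 'avx512'
                    if any('avx2' in f for f in flags):
                        return 'avx2'
                    raise ValueError("Unsupported cpu Instructions: {}".format(line))
            return 'native'
        elif sys.platform == "win32":
            from cpufeature.extension import CPUFeature
            if CPUFeature.get("AVX512bw", False):
                return 'fancy'
            if CPUFeature.get("AVX512f", False):
                return 'avx512'
            if CPUFeature.get("AVX2", False):
                return 'avx2'
            raise ValueError("Unsupported cpu Instructions: {}".format(str(CPUFeature)))
        else:
            raise ValueError("Unsupported platform: {}".format(sys.platform))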

Comment on lines +4453 to +4455
        const block_q8_0 * y4 = (const block_q8_0 *)y[iy] + i;
        float16_t d_val = GGML_FP16_TO_FP32(y4->d);
        return vdup_n_f16(d_val);


medium

The conversion from ggml_fp16_t to float16_t via GGML_FP16_TO_FP32 is inefficient. It involves a round trip from half-precision to single-precision and back to half-precision, which is unnecessary and adds overhead in an inline function. You can directly use y4->d with vdup_n_f16 as the compiler can handle the cast.

        return vdup_n_f16(y4->d);


@KMSorSMS KMSorSMS Oct 24, 2025


@johnnynunez Since you have tested on your device, could you please attach a screenshot of it running and your platform info from nvidia-smi and lscpu, for example? (Our next version will drop support for llamafile for performance reasons.)


@johnnynunez johnnynunez Oct 24, 2025


GH200 @KMSorSMS
[Screenshot attached: 2025-10-24 07:44:36]

Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   64
  On-line CPU(s) list:    0-63
Vendor ID:                ARM
  Model name:             Neoverse-V2
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   64
    Socket(s):            1
    Stepping:             r0p0
    BogoMIPS:             2000.00
    Flags:                  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
NUMA:                     
  NUMA node(s):           9
  NUMA node0 CPU(s):      0-63
  NUMA node1 CPU(s):      
  NUMA node2 CPU(s):      
  NUMA node3 CPU(s):      
  NUMA node4 CPU(s):      
  NUMA node5 CPU(s):      
  NUMA node6 CPU(s):      
  NUMA node7 CPU(s):      
  NUMA node8 CPU(s):      
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

@johnnynunez

[screenshot attached]

@johnnynunez johnnynunez changed the title from "[NVIDIA] Fix ARM builds" to "Fix ARM builds" on Oct 24, 2025
@ovowei ovowei requested a review from KMSorSMS October 24, 2025 10:16
@KMSorSMS KMSorSMS requested a review from chenht2022 October 24, 2025 12:11
@KMSorSMS

KMSorSMS commented Oct 29, 2025

@johnnynunez I think we should not hardcode the arch features. Have you tried native?

        if platform.system() == "Linux" and platform.machine() == "aarch64":
            # Using -march=native is more portable as it optimizes for the host CPU.
            cmake_args += [
                "-DCMAKE_C_FLAGS=-march=native",
                "-DCMAKE_CXX_FLAGS=-march=native",
            ]

Going further, I suppose we can add some feature detection to the CMakeLists (https://github.com/kvcache-ai/ktransformers/blob/main/csrc/ktransformers_ext/CMakeLists.txt). Will you make this change, or should I try to find an ARM machine to fix and test it?

Hey, I'm just wondering if you missed this? You may also want to resolve the conflict with the current main branch. Sorry that we have blocked this PR: we are going to migrate our matmul kernel part to other high-performance libraries, so this feature (llamafile) will be deprecated in the future, and the GPTQ or AWQ quantization methods seem more efficient for CPU hardware. But an update to the llamafile part is welcome, and if it merges smoothly, we will accept it.
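As a hedged sketch of the non-hardcoded direction suggested above (the label names and -march mappings below are illustrative assumptions, not the repository's actual API):

    import platform

    # Hypothetical mapping from the label returned by the PR's CPU detection
    # (e.g. 'sve' or 'neon') to a -march string, instead of one hardcoded flag.
    ARM_MARCH_BY_LABEL = {
        "sve": "-march=armv8.2-a+sve",   # assumed baseline for SVE-capable cores
        "neon": "-march=armv8.2-a",      # assumed NEON/ASIMD baseline
    }

    def arm_cmake_flags(cpu_label):
        if platform.system() == "Linux" and platform.machine() == "aarch64":
            march = ARM_MARCH_BY_LABEL.get(cpu_label, "-march=native")
            return ["-DCMAKE_C_FLAGS=" + march, "-DCMAKE_CXX_FLAGS=" + march]
        return []

    # Example: cmake_args += arm_cmake_flags(get_cpu_instruct())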

@johnnynunez

(Quoting KMSorSMS's comment above about not hardcoding the arch features and trying -march=native.)

I'm coming home after the PyTorch conference. I will check when I arrive.

@KMSorSMS

ok, thanks for your assistance ❤️.
