
fix: GCC 13 compilation failure on AArch64 (neonbfdot BF16 + always_inline) #346

Merged
ashvardanian merged 3 commits into ashvardanian:main-dev from swasik:fix/gcc13-aarch64-bf16-fp16
Apr 13, 2026

Conversation

@swasik

@swasik swasik commented Apr 8, 2026

Contributor

Summary

NumKong fails to compile with GCC 13 on AArch64 (tested on AWS Graviton4 / Neoverse V2, Ubuntu 24.04). This PR fixes two independent root causes.

Bug 1 — Assembler rejects FP16 instructions in neonbfdot code

Symptom: Error: selected processor does not support 'fmov v2.8h,1.0e+0'

Root cause: All 8 neonbfdot headers use #pragma GCC target("arch=armv8.6-a+simd+bf16") — without +fp16. GCC 13's optimizer materializes BF16 constants (e.g. vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))) as fmov v.8h, 1.0, which is a FEAT_FP16 instruction. Without +fp16 in the pragma, GCC emits .arch armv8.6-a+crc in the assembly (no fp16), and the assembler rejects the fmov.

All 6 failing fmov v.8h instructions originate in nk_reduce_moments_bf16_neonbfdot but the same pattern exists in all neonbfdot headers.

Fix: armv8.6-a+simd+bf16 → armv8.6-a+simd+bf16+fp16 in all 8 neonbfdot headers (both #pragma GCC target and #pragma clang attribute push).

Affected headers

| Header | Lines |
| --- | --- |
| include/numkong/curved/neonbfdot.h | 39, 42 |
| include/numkong/dot/neonbfdot.h | 72, 75 |
| include/numkong/dots/neonbfdot.h | 22, 25 |
| include/numkong/each/neonbfdot.h | 45, 48 |
| include/numkong/mesh/neonbfdot.h | 44, 47 |
| include/numkong/reduce/neonbfdot.h | 25, 28 |
| include/numkong/spatial/neonbfdot.h | 44, 47 |
| include/numkong/spatials/neonbfdot.h | 23, 26 |

Bug 2 — always_inline target mismatch with user CFLAGS

Symptom: error: inlining failed — always_inline function 'nk_u1x8_popcount_' target specific option mismatch

Root cause: build.rs compiles all dispatch files with a single cc::Build that has no -march flag. NK_INTERNAL (always_inline) helpers in types.h are compiled at the TU-level baseline target. If the user (or the build environment) supplies -march=native or other CFLAGS that raise the baseline above what a #pragma GCC target region specifies, GCC refuses to inline: the callee's richer target is not a subset of the caller's restricted pragma target.

Concrete example: nk_u1x8_popcount_ (types.h:1506) is NK_INTERNAL, compiled at the TU baseline. set/neon.h:58 pushes #pragma GCC target("arch=armv8-a+simd") and calls it at line 82. With -march=native (Neoverse V2), the baseline is much richer than armv8-a+simd, so GCC refuses the inline.

Fix: In build.rs, add -march=armv8-a+simd as the explicit baseline for aarch64 builds. This matches the lowest pragma target in the library, so every pragma scope is a superset and inlining succeeds.
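The shape of the code that triggers the mismatch can be sketched roughly as below. This is an illustration only, not the library's source: `SKETCH_INTERNAL`, `sketch_popcount_u8`, and `sketch_count_set_bits` are hypothetical stand-ins for `NK_INTERNAL`, `nk_u1x8_popcount_`, and the caller in set/neon.h, and the `push_options`/`pop_options` pairing is an assumption about how the region is delimited. The pragma is guarded so the snippet still builds on other compilers and architectures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for NK_INTERNAL; the real macro may differ. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_INTERNAL static inline __attribute__((always_inline))
#else
#define SKETCH_INTERNAL static inline
#endif

/* Helper defined outside any pragma region, so it inherits the
 * translation unit's baseline target (whatever -march resolves to). */
SKETCH_INTERNAL unsigned sketch_popcount_u8(uint8_t byte) {
    unsigned count = 0;
    for (; byte; byte &= (uint8_t)(byte - 1)) ++count; /* clear lowest set bit */
    return count;
}

/* Region with a restricted target. If the baseline above is richer than
 * this (for example via -march=native), GCC may refuse the forced inline. */
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC target("arch=armv8-a+simd")
#endif

static unsigned sketch_count_set_bits(uint8_t const *bytes, size_t length) {
    unsigned total = 0;
    for (size_t i = 0; i != length; ++i) total += sketch_popcount_u8(bytes[i]);
    return total;
}

#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```

With the default AArch64 baseline this should compile cleanly; the failure described above is expected only when the baseline is raised beyond the pragma's target.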

Testing

  • Environment: GCC 13.3.0, GNU Binutils 2.42, ARM Neoverse V2 (AWS Graviton4), Ubuntu 24.04
  • Before: cargo build -r fails with assembler errors on dispatch_bf16.c
  • After: CC=gcc cargo build -r succeeds; binary runs correctly

Notes

  • SVE BF16 headers (svebfdot.h) are not affected — they load BF16 values from memory and use svbfdot_f32 directly, never triggering the constant-materialization pattern that produces FP16 fmov instructions.
  • The +fp16 addition to the Clang pragmas is for consistency; Clang does not exhibit either bug.
  • The -march=armv8-a+simd flag uses flag_if_supported() so it is silently ignored on compilers that don't accept it.

Two independent GCC 13 bugs prevent compilation on AArch64 (e.g. AWS Graviton):

1. neonbfdot headers: GCC 13 optimizes BF16 constant materialization
   (vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))) into 'fmov v.8h, 1.0',
   which is a FEAT_FP16 instruction.  Without +fp16 in the pragma target,
   the assembler rejects it.  Fix: armv8.6-a+simd+bf16 -> +bf16+fp16
   in all 8 neonbfdot headers (both GCC and Clang pragmas).

2. build.rs: Without an explicit -march baseline, user CFLAGS or
   -march=native set a rich TU-level target.  When set/neon.h downgrades
   to armv8-a+simd via pragma, GCC refuses to inline NK_INTERNAL
   (always_inline) helpers from types.h compiled at the richer baseline.
   Fix: add -march=armv8-a+simd as explicit baseline for aarch64 builds.

Tested: GCC 13.3.0 on ARM Neoverse V2 (Graviton4), Ubuntu 24.04.
@swasik

swasik commented Apr 8, 2026

Contributor Author

@ashvardanian FYI: this is a fix for a compilation bug I ran into while experimenting on an AWS r8g instance.

@ashvardanian

ashvardanian commented Apr 8, 2026

Owner

Very interesting! Thanks for flagging this, @swasik!

Let me think more about the build.rs changes. We probably need a more systemic solution across all platforms and SDKs.

As for the pragmas, do we actually need FP16 features for BF16 logic? Maybe we can harden the kernels somehow to avoid incorrect code-gen?

@ashvardanian

Owner

Assuming you are already on that machine, can you please check GCC 14 and recent Clang code-gen as well? Are they immune to this issue?

@swasik

swasik commented Apr 8, 2026

Contributor Author

Clang 18 works fine without the fix. I will check GCC 14 later but I am not sure if it will be available easily on Ubuntu LTS 24.

@ashvardanian

Owner

GCC 14 should be available from default repositories on Ubuntu 24 🤗

@swasik

swasik commented Apr 10, 2026

Contributor Author

> GCC 14 should be available from default repositories on Ubuntu 24 🤗

You are right - strange that it is not installed by default. I changed to GCC 14 and it compiles successfully.

@ashvardanian ashvardanian changed the base branch from main to main-dev April 13, 2026 14:40
GCC 13's optimizer lowers `vdupq_n_u16(X)` to `fmov v.8h, #imm` (a
FEAT_FP16 encoding) whenever X matches a representable FP16 immediate,
which includes bf16 bit patterns like 1.0 (0x3F80). Inside a `+bf16`-only
pragma region this fails to assemble.
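
To make the bit-pattern coincidence concrete, here is a small portable sketch. It decodes a bf16 pattern to `float` and checks whether the same 16 bits fit the 8-bit expanded-immediate layout used by half-precision `fmov` (sign, inverted exponent bit, two replicated exponent bits, six payload bits, six trailing zeros). The function names are hypothetical, and the layout check reflects my reading of the Arm immediate encoding rather than anything taken from GCC's source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* bf16 is the upper half of an IEEE-754 binary32, so widening is a shift. */
static float sketch_bf16_bits_to_f32(uint16_t bits) {
    uint32_t wide = (uint32_t)bits << 16;
    float value;
    memcpy(&value, &wide, sizeof value);
    return value;
}

/* Assumed FP16 immediate layout: a : NOT(b) : b b : c d e f g h : 000000.
 * Bits 14..12 must therefore read 011 (b = 1) or 100 (b = 0),
 * and the low six bits must be zero. */
static bool sketch_is_fp16_fmov_immediate(uint16_t bits) {
    if (bits & 0x003Fu) return false;
    unsigned top = (bits >> 12) & 0x7u;
    return top == 0x3u || top == 0x4u;
}

/* Self-check for the pattern discussed above: 0x3F80 is 1.0 in bf16
 * and also happens to satisfy the FP16 immediate layout. */
static void sketch_check_bf16_one(void) {
    assert(sketch_bf16_bits_to_f32(0x3F80) == 1.0f);
    assert(sketch_is_fp16_fmov_immediate(0x3F80));
}
```

Patterns with any of the low six bits set, or with bits 14..12 outside the two allowed values (zero, for instance), fail the check and would not be candidates for this lowering under the stated assumption.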

Introduce `nk_u16x8_splat_` in `cast/neon.h` with an empty `__asm__`
constraint on the scalar source, forcing GCC to emit `mov w; dup v.8h, w`
instead — valid under plain armv8-a+simd, still loop-invariant-hoistable,
no-op on Clang, skipped on MSVC. Apply it at the two bf16 sites in
reduce/neonbfdot.h and at the four f16 sites in reduce/neonfhm.h for
uniformity.
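
A minimal sketch of the empty-`__asm__` idea follows. It is not the actual `nk_u16x8_splat_` from `cast/neon.h`: the names `sketch_opaque_u16` and `sketch_u16x8_splat` are hypothetical, and the choice of a `"+r"` constraint is an assumption about how the scalar is hidden from the optimizer. The NEON part is guarded so the snippet also compiles on non-Arm hosts.

```c
#include <stdint.h>
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
#include <arm_neon.h>
#endif

/* Pass the scalar through an empty asm statement with a read-write
 * register constraint. The value is unchanged, but the compiler can no
 * longer treat it as a known constant. The asm is deliberately not
 * volatile, so it can still be hoisted out of loops. */
static inline uint16_t sketch_opaque_u16(uint16_t value) {
#if defined(__GNUC__) || defined(__clang__)
    __asm__("" : "+r"(value));
#endif
    return value;
}

#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
/* Broadcast from the now-opaque general-purpose register, so the
 * compiler has no constant to turn into a vector immediate. */
static inline uint16x8_t sketch_u16x8_splat(uint16_t value) {
    return vdupq_n_u16(sketch_opaque_u16(value));
}
#endif
```

A call site would then write something like `vreinterpretq_bf16_u16(sketch_u16x8_splat(0x3F80))` in place of the direct `vdupq_n_u16` call.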

Revert the prophylactic `+fp16` additions to the eight neonbfdot pragmas
— the helper makes them unnecessary and avoids entangling FP16 hardware
requirements into BF16 kernels.
…n fix

The `-march=armv8-a+simd` baseline for aarch64 is one facet of a broader
target-baseline policy that deserves its own PR — it applies equally to
x86_64, riscv64, powerpc64le, and loongarch64, and interacts with
CMakeLists.txt's `-march=native` behavior and setup.py's riscv handling.

Revert that hunk here so PR ashvardanian#346 ships purely as the neonbfdot codegen
fix (the `fmov v.8h, 1.0` assembler failure), and follow up with a
dedicated baseline-policy PR afterwards.
@ashvardanian ashvardanian merged commit fc3d8ec into ashvardanian:main-dev Apr 13, 2026
@ashvardanian

Owner

Thanks for your help, @swasik! I've split your suggestions into two separate streams, because they are pretty independent and important enough individually. The first one is merged, and the second one is on the way 🤗

ashvardanian pushed a commit that referenced this pull request Apr 14, 2026
### Minor

- Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
- Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
- Add: Span-based matrix `_into` APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
- Add: OpenMP for Python & JavaScript (499ecc9)
- Add: Granite Rapids AMX for F16 & F32 (28036ea)

### Patch

- Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
- Make: Detect illegal instructions in macOS CI (289cdaf)
- Fix: Drop `-march=` on macOS setup.py builds (28aac74)
- Fix: Exclude `std::signal` from WASM builds (14814c5)
- Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
- Make: Drop `+nosimd` from AArch64 baseline (23f5195)
- Make: Forbid auto-vectorization in portable baseline builds (43e8324)
- Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
- Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
- Improve: Log faulting capability detection (a401f8a)
- Improve: Log faulting kernel on fatal signals in `nk_test` (22c7c79)
- Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
- Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
- Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
- Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
- Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
- Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
- Fix: GCC requires +sme prefix in target attribute for __arm_sc_* stubs (291dc0a)
- Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
- Fix: OpenMP wheel builds on macOS and Windows (f569121)
- Fix: Add target("sme") to __arm_sc_* stubs for GCC compatibility (ad2add0)
- Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
- Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
- Improve: Manual SME streaming control, single enter/exit per API call (6432837)
- Fix: Update `cdist` edge-case test for re-added `threads=` kwarg (50681af)
- Make: Allow force-enabling ISA targets via environment variables (0e58702)
- Improve: Abandon F32→F64 via Ozaki on Granite Rapids (94a5f19)
- Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
- Make: Standardize CI compilers and add Windows test job (9a22ea4)
- Make: Shrink serial fallbacks with scoped size optimization (83154a8)
- Make: Compress Windows builds (e30ad3d)
- Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)