Aquileo | fix: GCC 13 compilation failure on AArch64 (neonbfdot BF16 + always_inline) by swasik · Pull Request #346 · ashvardanian/NumKong

swasik · 2026-04-08T10:36:12Z

Summary

NumKong fails to compile with GCC 13 on AArch64 (tested on AWS Graviton4 / Neoverse V2, Ubuntu 24.04). This PR fixes two independent root causes.

Bug 1 — Assembler rejects FP16 instructions in neonbfdot code

Symptom: Error: selected processor does not support 'fmov v2.8h,1.0e+0'

Root cause: All 8 neonbfdot headers use #pragma GCC target("arch=armv8.6-a+simd+bf16") — without +fp16. GCC 13's optimizer materializes BF16 constants (e.g. vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))) as fmov v.8h, 1.0, which is a FEAT_FP16 instruction. Without +fp16 in the pragma, GCC emits .arch armv8.6-a+crc in the assembly (no fp16), and the assembler rejects the fmov.

All 6 failing fmov v.8h instructions originate in nk_reduce_moments_bf16_neonbfdot but the same pattern exists in all neonbfdot headers.

Fix: armv8.6-a+simd+bf16 → armv8.6-a+simd+bf16+fp16 in all 8 neonbfdot headers (both #pragma GCC target and #pragma clang attribute push).
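To see why `0x3F80` is a BF16 constant in the first place, here is a minimal, portable sketch that widens a BF16 bit pattern to `float`. The helper name `sketch_bf16_bits_to_f32` is hypothetical and not part of NumKong; it relies only on BF16 being the upper 16 bits of an IEEE-754 binary32 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a NumKong API): interpret 16 raw bits as BF16
 * and widen them to float. BF16 keeps the sign, the 8 exponent bits, and
 * the top 7 mantissa bits of binary32, so widening is a 16-bit left shift. */
static float sketch_bf16_bits_to_f32(uint16_t bits) {
    uint32_t widened = (uint32_t)bits << 16;
    float value;
    memcpy(&value, &widened, sizeof value); /* bit copy, no aliasing tricks */
    return value;
}
```

Under this interpretation `0x3F80` decodes to `1.0f`, which is the value the `vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))` pattern is meant to broadcast.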

Affected headers

| Header | Lines |
| --- | --- |
| `include/numkong/curved/neonbfdot.h` | 39, 42 |
| `include/numkong/dot/neonbfdot.h` | 72, 75 |
| `include/numkong/dots/neonbfdot.h` | 22, 25 |
| `include/numkong/each/neonbfdot.h` | 45, 48 |
| `include/numkong/mesh/neonbfdot.h` | 44, 47 |
| `include/numkong/reduce/neonbfdot.h` | 25, 28 |
| `include/numkong/spatial/neonbfdot.h` | 44, 47 |
| `include/numkong/spatials/neonbfdot.h` | 23, 26 |

Bug 2 — `always_inline` target mismatch with user CFLAGS

Symptom: error: inlining failed — always_inline function 'nk_u1x8_popcount_' target specific option mismatch

Root cause: build.rs compiles all dispatch files with a single cc::Build that has no -march flag. NK_INTERNAL (always_inline) helpers in types.h are compiled at the TU-level baseline target. If the user (or the build environment) supplies -march=native or other CFLAGS that raise the baseline above what a #pragma GCC target region specifies, GCC refuses to inline: the callee's richer target is not a subset of the caller's restricted pragma target.

Concrete example: nk_u1x8_popcount_ (types.h:1506) is NK_INTERNAL, compiled at the TU baseline. set/neon.h:58 pushes #pragma GCC target("arch=armv8-a+simd") and calls it at line 82. With -march=native (Neoverse V2), the baseline is much richer than armv8-a+simd, so GCC refuses the inline.

Fix: In build.rs, add -march=armv8-a+simd as the explicit baseline for aarch64 builds. This matches the lowest pragma target in the library, so every pragma scope is a superset and inlining succeeds.
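The shape of the problem can be sketched in a few lines of C. This is an illustration under assumptions, not NumKong's code: `SKETCH_INTERNAL`, `sketch_u1x8_popcount`, and `sketch_popcount_bytes` are hypothetical stand-ins for `NK_INTERNAL`, `nk_u1x8_popcount_`, and a kernel in `set/neon.h`, and the preprocessor guard around the pragma is added here only so the snippet also compiles on other compilers and architectures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for NK_INTERNAL: a forced-inline helper. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_INTERNAL static inline __attribute__((always_inline))
#else
#define SKETCH_INTERNAL static inline
#endif

/* Defined outside any target pragma, so it is compiled with the
 * translation unit's baseline target (whatever -march resolves to). */
SKETCH_INTERNAL unsigned sketch_u1x8_popcount(uint8_t byte) {
    unsigned count = 0;
    while (byte) {
        byte &= (uint8_t)(byte - 1); /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* The caller lives inside a region that pins the target explicitly.
 * Guarded so that only GCC on AArch64 sees the pragma. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__aarch64__)
#pragma GCC push_options
#pragma GCC target("arch=armv8-a+simd")
#endif

size_t sketch_popcount_bytes(uint8_t const *bytes, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += sketch_u1x8_popcount(bytes[i]); /* must be inlined here */
    return total;
}

#if defined(__GNUC__) && !defined(__clang__) && defined(__aarch64__)
#pragma GCC pop_options
#endif
```

With GCC's default AArch64 baseline the helper and the pragma region are expected to agree on the target, so the forced inline goes through. Per the root cause described above, building the same file with a richer baseline such as `-march=native` is what would trigger the "target specific option mismatch" error at the call site.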

Testing

Environment: GCC 13.3.0, GNU Binutils 2.42, ARM Neoverse V2 (AWS Graviton4), Ubuntu 24.04
Before: cargo build -r fails with assembler errors on dispatch_bf16.c
After: CC=gcc cargo build -r succeeds; binary runs correctly

Notes

SVE BF16 headers (svebfdot.h) are not affected — they load BF16 values from memory and use svbfdot_f32 directly, never triggering the constant-materialization pattern that produces FP16 fmov instructions.
The +fp16 addition to the Clang pragmas is for consistency; Clang does not exhibit either bug.
The -march=armv8-a+simd flag uses flag_if_supported() so it is silently ignored on compilers that don't accept it.

Two independent GCC 13 bugs prevent compilation on AArch64 (e.g. AWS Graviton):

1. neonbfdot headers: GCC 13 optimizes BF16 constant materialization (vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))) into 'fmov v.8h, 1.0', which is a FEAT_FP16 instruction. Without +fp16 in the pragma target, the assembler rejects it. Fix: armv8.6-a+simd+bf16 -> +bf16+fp16 in all 8 neonbfdot headers (both GCC and Clang pragmas).
2. build.rs: Without an explicit -march baseline, user CFLAGS or -march=native set a rich TU-level target. When set/neon.h downgrades to armv8-a+simd via pragma, GCC refuses to inline NK_INTERNAL (always_inline) helpers from types.h compiled at the richer baseline. Fix: add -march=armv8-a+simd as explicit baseline for aarch64 builds.

Tested: GCC 13.3.0 on ARM Neoverse V2 (Graviton4), Ubuntu 24.04.

swasik · 2026-04-08T10:38:42Z

@ashvardanian FYI: this is the fix for a compilation bug I hit while experimenting on an AWS r8g instance.

ashvardanian · 2026-04-08T10:43:58Z

Very interesting! Thanks for flagging this, @swasik!

Let me think more about the build.rs changes. We probably need a more systemic solution across all platforms and SDKs.

As for the pragmas, do we actually need FP16 features for BF16 logic? Maybe we can harden the kernels somehow to avoid incorrect code-gen?

ashvardanian · 2026-04-08T10:44:48Z

Assuming you are already on that machine, can you please check GCC 14 and recent Clang code-gen as well? Are they immune to this issue?

swasik · 2026-04-08T11:30:41Z

Clang 18 works fine without the fix. I will check GCC 14 later but I am not sure if it will be available easily on Ubuntu LTS 24.

ashvardanian · 2026-04-08T12:04:24Z

GCC 14 should be available from default repositories on Ubuntu 24 🤗

swasik · 2026-04-10T14:08:39Z

> GCC 14 should be available from default repositories on Ubuntu 24 🤗

You are right - strange that it is not installed by default. I changed to GCC 14 and it compiles successfully.

GCC 13's optimizer lowers `vdupq_n_u16(X)` to `fmov v.8h, #imm` (a FEAT_FP16 encoding) whenever X matches a representable FP16 immediate, which includes bf16 bit patterns like 1.0 (0x3F80). Inside a `+bf16`-only pragma region this fails to assemble.

Introduce `nk_u16x8_splat_` in `cast/neon.h` with an empty `__asm__` constraint on the scalar source, forcing GCC to emit `mov w; dup v.8h, w` instead: valid under plain armv8-a+simd, still loop-invariant-hoistable, no-op on Clang, skipped on MSVC. Apply it at the two bf16 sites in reduce/neonbfdot.h and at the four f16 sites in reduce/neonfhm.h for uniformity.

Revert the prophylactic `+fp16` additions to the eight neonbfdot pragmas: the helper makes them unnecessary and avoids entangling FP16 hardware requirements into BF16 kernels.
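The empty-`__asm__` idea from this commit message can be sketched roughly as below. This is not the merged `nk_u16x8_splat_` helper: the names `sketch_opaque_u16` and `sketch_u16x8_splat_store` are hypothetical, and the scalar fallback exists only so the snippet compiles off AArch64. The non-volatile empty `__asm__` with a `"+r"` operand tells the compiler the scalar may have changed, so it can no longer treat the splat as a compile-time constant.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: make a scalar opaque to constant propagation.
 * The asm body is empty, so no instruction is emitted for it; the "+r"
 * constraint only forces the value through a general-purpose register.
 * Left non-volatile so the compiler may still hoist it out of loops. */
static inline uint16_t sketch_opaque_u16(uint16_t value) {
#if defined(__GNUC__) || defined(__clang__)
    __asm__("" : "+r"(value));
#endif
    return value;
}

/* Hypothetical helper: broadcast one u16 into 8 lanes and store them.
 * On AArch64 the broadcast goes through vdupq_n_u16 on the opaque scalar;
 * elsewhere a plain loop stands in for the vector instruction. */
static inline void sketch_u16x8_splat_store(uint16_t out[8], uint16_t value) {
    uint16_t opaque = sketch_opaque_u16(value);
#if defined(__aarch64__)
    vst1q_u16(out, vdupq_n_u16(opaque));
#else
    for (int lane = 0; lane < 8; ++lane) out[lane] = opaque;
#endif
}
```

Calling `sketch_u16x8_splat_store(lanes, 0x3F80)` fills all eight lanes with the BF16 bit pattern for 1.0. Which instructions a given GCC version actually emits for this should be checked in the generated assembly rather than assumed from the sketch.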

…n fix

The `-march=armv8-a+simd` baseline for aarch64 is one facet of a broader target-baseline policy that deserves its own PR: it applies equally to x86_64, riscv64, powerpc64le, and loongarch64, and interacts with CMakeLists.txt's `-march=native` behavior and setup.py's riscv handling. Revert that hunk here so PR ashvardanian#346 ships purely as the neonbfdot codegen fix (the `fmov v.8h, 1.0` assembler failure), and follow up with a dedicated baseline-policy PR afterwards.

ashvardanian · 2026-04-13T19:22:06Z

Thanks for your help, @swasik! I've split your suggestions into two separate streams, cause they are pretty independent and important enough individually. First one is merged, and the second one is on the way 🤗

### Minor

- Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
- Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
- Add: Span-based matrix `_into` APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
- Add: OpenMP for Python & JavaScript (499ecc9)
- Add: Granite Rapids AMX for F16 & F32 (28036ea)

### Patch

- Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
- Make: Detect illegal instructions in macOS CI (289cdaf)
- Fix: Drop `-march=` on macOS setup.py builds (28aac74)
- Fix: Exclude `std::signal` from WASM builds (14814c5)
- Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
- Make: Drop `+nosimd` from AArch64 baseline (23f5195)
- Make: Forbid auto-vectorization in portable baseline builds (43e8324)
- Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
- Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
- Improve: Log faulting capability detection (a401f8a)
- Improve: Log faulting kernel on fatal signals in `nk_test` (22c7c79)
- Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
- Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
- Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
- Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
- Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
- Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
- Fix: GCC requires +sme prefix in target attribute for __arm_sc_* stubs (291dc0a)
- Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
- Fix: OpenMP wheel builds on macOS and Windows (f569121)
- Fix: Add target("sme") to __arm_sc_* stubs for GCC compatibility (ad2add0)
- Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
- Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
- Improve: Manual SME streaming control, single enter/exit per API call (6432837)
- Fix: Update `cdist` edge-case test for re-added `threads=` kwarg (50681af)
- Make: Allow force-enabling ISA targets via environment variables (0e58702)
- Improve: Abandon F32→F64 via Ozaki on Granite Rapids (94a5f19)
- Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
- Make: Standardize CI compilers and add Windows test job (9a22ea4)
- Make: Shrink serial fallbacks with scoped size optimization (83154a8)
- Make: Compress Windows builds (e30ad3d)
- Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)

ashvardanian changed the base branch from main to main-dev April 13, 2026 14:40

ashvardanian added 2 commits April 13, 2026 14:51

ashvardanian merged commit fc3d8ec into ashvardanian:main-dev Apr 13, 2026

ashvardanian merged 3 commits into ashvardanian:main-dev from swasik:fix/gcc13-aarch64-bf16-fp16