fix: GCC 13 compilation failure on AArch64 (neonbfdot BF16 + always_inline) #346
Two independent GCC 13 bugs prevent compilation on AArch64 (e.g. AWS Graviton):

1. neonbfdot headers: GCC 13 optimizes BF16 constant materialization (`vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))`) into `fmov v.8h, 1.0`, which is a FEAT_FP16 instruction. Without `+fp16` in the pragma target, the assembler rejects it. Fix: `armv8.6-a+simd+bf16` -> `+bf16+fp16` in all 8 neonbfdot headers (both GCC and Clang pragmas).
2. build.rs: Without an explicit `-march` baseline, user CFLAGS or `-march=native` set a rich TU-level target. When `set/neon.h` downgrades to `armv8-a+simd` via pragma, GCC refuses to inline `NK_INTERNAL` (`always_inline`) helpers from `types.h` compiled at the richer baseline. Fix: add `-march=armv8-a+simd` as explicit baseline for aarch64 builds.

Tested: GCC 13.3.0 on ARM Neoverse V2 (Graviton4), Ubuntu 24.04.
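To make the constant in point 1 concrete, here is a minimal sketch of how the bit pattern `0x3F80` reads as a BF16 value. It relies only on the fact that bfloat16 is the upper 16 bits of an IEEE-754 binary32; the function name `example_bf16_bits_to_f32` is hypothetical and not part of NumKong.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a NumKong function): widen a raw bfloat16 bit
 * pattern to float. bfloat16 keeps the sign, the 8 exponent bits, and the
 * top 7 mantissa bits of binary32, so shifting left by 16 recovers the value. */
static float example_bf16_bits_to_f32(uint16_t bits) {
    uint32_t widened = (uint32_t)bits << 16;
    float result;
    assert(sizeof(result) == sizeof(widened));  /* float must be 32 bits wide */
    memcpy(&result, &widened, sizeof(result));  /* bit-cast without aliasing issues */
    return result;
}
```

With this decoding, `example_bf16_bits_to_f32(0x3F80)` is `1.0f`, which matches the `1.0` operand in the rejected `fmov`.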
@ashvardanian FYI: this is the fix for a compilation bug I hit while playing on an AWS r8g instance.
Very interesting! Thanks for flagging this, @swasik! Let me think more about the build.rs changes. We probably need a more systemic solution across all platforms and SDKs. As for the pragmas, do we actually need FP16 features for BF16 logic? Maybe we can harden the kernels somehow to avoid incorrect code-gen?
Assuming you are already on that machine, can you please check GCC 14 and recent Clang code-gen as well? Are they immune to this issue?
Clang 18 works fine without the fix. I will check GCC 14 later, but I am not sure it will be easily available on Ubuntu 24.04 LTS.
GCC 14 should be available from default repositories on Ubuntu 24 🤗 |
You are right - strange that it is not installed by default. I changed to GCC 14 and it compiles successfully. |
GCC 13's optimizer lowers `vdupq_n_u16(X)` to `fmov v.8h, #imm` (a FEAT_FP16 encoding) whenever X matches a representable FP16 immediate, which includes bf16 bit patterns like 1.0 (0x3F80). Inside a `+bf16`-only pragma region this fails to assemble. Introduce `nk_u16x8_splat_` in `cast/neon.h` with an empty `__asm__` constraint on the scalar source, forcing GCC to emit `mov w; dup v.8h, w` instead — valid under plain armv8-a+simd, still loop-invariant-hoistable, no-op on Clang, skipped on MSVC. Apply it at the two bf16 sites in reduce/neonbfdot.h and at the four f16 sites in reduce/neonfhm.h for uniformity. Revert the prophylactic `+fp16` additions to the eight neonbfdot pragmas — the helper makes them unnecessary and avoids entangling FP16 hardware requirements into BF16 kernels.
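The helper described in this commit can be sketched roughly as follows. This is not the code from `cast/neon.h`: the names `example_opaque_u16` and `example_u16x8_splat` are hypothetical, and the `"+r"` constraint is an assumption about what "an empty `__asm__` constraint on the scalar source" looks like. The NEON part is guarded so the snippet still compiles on other targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
#include <arm_neon.h>
#define EXAMPLE_HAS_NEON 1
#endif

/* Hypothetical helper: pass a scalar through an empty, non-volatile asm
 * statement. The "+r" constraint (an assumption) tells the compiler the
 * register may have been modified, so the value is no longer a known
 * constant. No instruction is emitted; on other compilers this is a no-op. */
static inline uint16_t example_opaque_u16(uint16_t value) {
#if defined(__GNUC__) || defined(__clang__)
    __asm__("" : "+r"(value));
#endif
    return value;
}

#if defined(EXAMPLE_HAS_NEON)
/* Hypothetical stand-in for the splat helper: broadcast an opaque scalar,
 * so the compiler has to move it into a general register and duplicate it
 * instead of picking a floating-point immediate form. */
static inline uint16x8_t example_u16x8_splat(uint16_t value) {
    return vdupq_n_u16(example_opaque_u16(value));
}
#endif

/* Sanity check: the barrier must not change the value. */
static void example_check(void) {
    assert(example_opaque_u16(0x3F80) == 0x3F80);
#if defined(EXAMPLE_HAS_NEON)
    assert(vgetq_lane_u16(example_u16x8_splat(0x3F80), 7) == 0x3F80);
#endif
}
```

Because the asm statement is not `volatile` and only depends on its operand, the compiler should remain free to hoist the splat out of a loop, which is consistent with the "loop-invariant-hoistable" remark above.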
…n fix The `-march=armv8-a+simd` baseline for aarch64 is one facet of a broader target-baseline policy that deserves its own PR — it applies equally to x86_64, riscv64, powerpc64le, and loongarch64, and interacts with CMakeLists.txt's `-march=native` behavior and setup.py's riscv handling. Revert that hunk here so PR ashvardanian#346 ships purely as the neonbfdot codegen fix (the `fmov v.8h, 1.0` assembler failure), and follow up with a dedicated baseline-policy PR afterwards.
Thanks for your help, @swasik! I've split your suggestions into two separate streams, because they are pretty independent and important enough individually. The first one is merged, and the second one is on the way 🤗
### Minor

- Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
- Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
- Add: Span-based matrix `_into` APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
- Add: OpenMP for Python & JavaScript (499ecc9)
- Add: Granite Rapids AMX for F16 & F32 (28036ea)

### Patch

- Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
- Make: Detect illegal instructions in macOS CI (289cdaf)
- Fix: Drop `-march=` on macOS setup.py builds (28aac74)
- Fix: Exclude `std::signal` from WASM builds (14814c5)
- Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
- Make: Drop `+nosimd` from AArch64 baseline (23f5195)
- Make: Forbid auto-vectorization in portable baseline builds (43e8324)
- Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
- Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
- Improve: Log faulting capability detection (a401f8a)
- Improve: Log faulting kernel on fatal signals in `nk_test` (22c7c79)
- Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
- Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
- Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
- Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
- Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
- Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
- Fix: GCC requires +sme prefix in target attribute for __arm_sc_* stubs (291dc0a)
- Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
- Fix: OpenMP wheel builds on macOS and Windows (f569121)
- Fix: Add target("sme") to __arm_sc_* stubs for GCC compatibility (ad2add0)
- Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
- Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
- Improve: Manual SME streaming control, single enter/exit per API call (6432837)
- Fix: Update `cdist` edge-case test for re-added `threads=` kwarg (50681af)
- Make: Allow force-enabling ISA targets via environment variables (0e58702)
- Improve: Abandon F32→F64 via Ozaki on Granite Rapids (94a5f19)
- Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
- Make: Standardize CI compilers and add Windows test job (9a22ea4)
- Make: Shrink serial fallbacks with scoped size optimization (83154a8)
- Make: Compress Windows builds (e30ad3d)
- Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)
Summary
NumKong fails to compile with GCC 13 on AArch64 (tested on AWS Graviton4 / Neoverse V2, Ubuntu 24.04). This PR fixes two independent root causes.
Bug 1: Assembler rejects FP16 instructions in neonbfdot code

Symptom: `Error: selected processor does not support 'fmov v2.8h,1.0e+0'`

Root cause: All 8 `neonbfdot` headers use `#pragma GCC target("arch=armv8.6-a+simd+bf16")` without `+fp16`. GCC 13's optimizer materializes BF16 constants (e.g. `vreinterpretq_bf16_u16(vdupq_n_u16(0x3F80))`) as `fmov v.8h, 1.0`, which is a FEAT_FP16 instruction. Without `+fp16` in the pragma, GCC emits `.arch armv8.6-a+crc` in the assembly (no fp16), and the assembler rejects the `fmov`.

All 6 failing `fmov v.8h` instructions originate in `nk_reduce_moments_bf16_neonbfdot`, but the same pattern exists in all neonbfdot headers.

Fix: `armv8.6-a+simd+bf16` → `armv8.6-a+simd+bf16+fp16` in all 8 neonbfdot headers (both `#pragma GCC target` and `#pragma clang attribute push`).

Affected headers

- `include/numkong/curved/neonbfdot.h`
- `include/numkong/dot/neonbfdot.h`
- `include/numkong/dots/neonbfdot.h`
- `include/numkong/each/neonbfdot.h`
- `include/numkong/mesh/neonbfdot.h`
- `include/numkong/reduce/neonbfdot.h`
- `include/numkong/spatial/neonbfdot.h`
- `include/numkong/spatials/neonbfdot.h`

Bug 2: `always_inline` target mismatch with user CFLAGS

Symptom: `error: inlining failed — always_inline function 'nk_u1x8_popcount_' target specific option mismatch`

Root cause: `build.rs` compiles all dispatch files with a single `cc::Build` that has no `-march` flag. `NK_INTERNAL` (`always_inline`) helpers in `types.h` are compiled at the TU-level baseline target. If the user (or the build environment) supplies `-march=native` or other CFLAGS that raise the baseline above what a `#pragma GCC target` region specifies, GCC refuses to inline: the callee's richer target is not a subset of the caller's restricted pragma target.

Concrete example: `nk_u1x8_popcount_` (`types.h:1506`) is `NK_INTERNAL`, compiled at the TU baseline. `set/neon.h:58` pushes `#pragma GCC target("arch=armv8-a+simd")` and calls it at line 82. With `-march=native` (Neoverse V2), the baseline is much richer than `armv8-a+simd`, so GCC refuses the inline.

Fix: In `build.rs`, add `-march=armv8-a+simd` as the explicit baseline for aarch64 builds. This matches the lowest pragma target in the library, so every pragma scope is a superset and inlining succeeds.

Testing

- Without the fix, `cargo build -r` fails with assembler errors on `dispatch_bf16.c`
- With the fix, `CC=gcc cargo build -r` succeeds; binary runs correctly

Notes

- The SVE headers (`svebfdot.h`) are not affected: they load BF16 values from memory and use `svbfdot_f32` directly, never triggering the constant-materialization pattern that produces FP16 `fmov` instructions.
- The `+fp16` addition to the Clang pragmas is for consistency; Clang does not exhibit either bug.
- The `-march=armv8-a+simd` flag uses `flag_if_supported()` so it is silently ignored on compilers that don't accept it.
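The inlining rule behind Bug 2 can be illustrated with a small sketch of the direction that works: an `always_inline` helper compiled at the baseline, called from a pragma region whose target is a superset of that baseline. All names here (`EXAMPLE_INTERNAL`, `example_u8_popcount`, `example_popcount_bytes`) are hypothetical, and the pragma strings (`"popcnt"` on x86-64, `"+crc"` on AArch64) are chosen only because they add to the baseline rather than replace it. They are not the targets NumKong uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical analogue of NK_INTERNAL: force inlining where supported. */
#if defined(__GNUC__)
#define EXAMPLE_INTERNAL static inline __attribute__((always_inline))
#else
#define EXAMPLE_INTERNAL static inline
#endif

/* Compiled at the translation-unit baseline, like a helper in a shared
 * types header. Counts set bits in one byte with a SWAR reduction. */
EXAMPLE_INTERNAL unsigned example_u8_popcount(uint8_t x) {
    x = (uint8_t)(x - ((x >> 1) & 0x55));          /* 2-bit partial sums */
    x = (uint8_t)((x & 0x33) + ((x >> 2) & 0x33)); /* 4-bit partial sums */
    return (unsigned)((x + (x >> 4)) & 0x0F);      /* fold the two nibbles */
}

/* Open a region whose target only ADDS features to the baseline, so the
 * helper's target stays a subset of the caller's and inlining is allowed.
 * Restricted to GCC on two architectures to keep the sketch portable. */
#if defined(__GNUC__) && !defined(__clang__) && \
    (defined(__x86_64__) || defined(__aarch64__))
#define EXAMPLE_HAS_TARGET_REGION 1
#pragma GCC push_options
#if defined(__x86_64__)
#pragma GCC target("popcnt")
#else
#pragma GCC target("+crc")
#endif
#endif

/* Caller inside the richer region; the baseline helper inlines here. */
static unsigned example_popcount_bytes(const uint8_t *bytes, unsigned count) {
    unsigned total = 0;
    assert(bytes || count == 0);
    for (unsigned i = 0; i < count; ++i) total += example_u8_popcount(bytes[i]);
    return total;
}

#if defined(EXAMPLE_HAS_TARGET_REGION)
#pragma GCC pop_options
#endif
```

The failure described above is the mirror image: if the translation unit were built with a richer baseline than the pragma region (for example `-march=native` combined with an `arch=` pragma that resets the target), the helper's target would no longer be a subset of the caller's. A real library would also need runtime dispatch before calling code built for an optional extension.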