Added missing benchmarks, so that every autotuned specialization is now benchmarked.
Added a new CMake option, BENCHMARK_USE_AMDSMI, which is OFF by default. When set to ON, benchmarks use AMD SMI to output more GPU statistics.
Added the first tested example program for device_search, which is linked in the documentation.
Added apply_config_improvements.py, which generates improved configs by taking the best specializations from old and new configs.
Run the script with --help for usage instructions, and see projects/rocprim/docs/concepts/tuning.rst for documentation.
Kernel Tuner proof-of-concept.
Enhanced SPIR-V support and performance.
Optimizations
Improved performance of the onesweep variant of device_radix_sort.
Resolved issues
Fixed an issue where rocprim::device_scan_by_key failed when performing an "in-place" inclusive scan that reuses "keys" as the output. The fix adds a buffer that stores the last key of each block (excluding the last block). It only affects the specific case of reusing "keys" as output in an inclusive scan and does not affect other cases.
Fixed benchmark build error on Windows.
Fixed offload compress build option.
Fixed float_bit_mask for rocprim::half.
Fixed handling of undefined behaviour when __builtin_clz, __builtin_ctz, and similar builtins are called.
Fixed potential build error with rocprim::detail::histogram_impl.
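The device_scan_by_key fix listed above can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the rocPRIM implementation: the function name blocked_scan_by_key_in_place and its parameters are hypothetical, blocks are processed sequentially instead of concurrently, and the scan operator is fixed to integer addition. It shows why the first element of a block needs the last key of the previous block, and why that key has to be saved before the output overwrites it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side model of a blocked inclusive scan-by-key whose output aliases the keys.
// Hypothetical helper for illustration only; not the rocPRIM implementation.
std::vector<int> blocked_scan_by_key_in_place(std::vector<int> keys,
                                              const std::vector<int>& values,
                                              std::size_t block_size,
                                              bool save_boundary_keys)
{
    const std::size_t n          = keys.size();
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // Last key of every block except the final one, captured before any write.
    std::vector<int> boundary_keys;
    if(save_boundary_keys)
    {
        for(std::size_t b = 0; b + 1 < num_blocks; ++b)
        {
            boundary_keys.push_back(keys[(b + 1) * block_size - 1]);
        }
    }

    int carry = 0; // running sum of the current segment
    for(std::size_t b = 0; b < num_blocks; ++b)
    {
        const std::size_t begin = b * block_size;
        const std::size_t end   = std::min(begin + block_size, n);

        // Each block first loads its own keys, as a GPU block would into registers.
        const std::vector<int> local(keys.begin() + begin, keys.begin() + end);

        for(std::size_t i = begin; i < end; ++i)
        {
            bool same_segment = false;
            if(i > begin)
            {
                same_segment = local[i - begin] == local[i - begin - 1];
            }
            else if(b > 0)
            {
                // The predecessor key belongs to the previous block, whose storage
                // already holds output values unless the key was saved beforehand.
                const int prev = save_boundary_keys ? boundary_keys[b - 1] : keys[begin - 1];
                same_segment   = local[0] == prev;
            }
            carry   = same_segment ? carry + values[i] : values[i];
            keys[i] = carry; // the output reuses the key storage
        }
    }
    return keys;
}
```

With keys {1, 1, 1, 1, 2, 2}, all values equal to 1, and a block size of 2, the variant that saves the boundary keys yields {1, 2, 3, 4, 1, 2}. The variant that reads the overwritten storage compares against output values instead of keys and yields {1, 2, 1, 2, 3, 4}.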
Known issues
Potential hang in rocprim::partition_threeway with large input sizes on later ROCm builds. A workaround is currently in place.