rocPRIM 4.2.0 for ROCm 7.2.0

@rocm-ci released this 21 Jan 18:58

Added

  • Added missing benchmarks, so that every autotuned specialization is now benchmarked.
  • Added a new CMake option, BENCHMARK_USE_AMDSMI, which is OFF by default. When set to ON, it lets benchmarks use AMD SMI to output more GPU statistics.
  • Added the first tested example program for device_search, which is linked in the documentation.
  • Added apply_config_improvements.py, which generates improved configs by taking the best specializations from old and new configs.
    • Run the script with --help for usage instructions, and see projects/rocprim/docs/concepts/tuning.rst for documentation.
  • Added a Kernel Tuner proof-of-concept.
  • Enhanced SPIR-V support and performance.

Optimizations

  • Improved performance of the onesweep variant of device_radix_sort.

Resolved issues

  • Fixed an issue where rocprim::device_scan_by_key failed when performing an "in-place" inclusive scan that reuses "keys" as the output. The fix adds a buffer that stores the last key of each block (excluding the last block). It only affects this specific case of reusing "keys" as output in an inclusive scan; other cases are unchanged.
  • Fixed benchmark build error on Windows.
  • Fixed offload compress build option.
  • Fixed float_bit_mask for rocprim::half.
  • Fixed handling of undefined behaviour when __builtin_clz, __builtin_ctz, and similar builtins are called.
  • Fixed potential build error with rocprim::detail::histogram_impl.
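
For context on the builtin fix above: in GCC and Clang, the result of __builtin_clz(x) and __builtin_ctz(x) is undefined when x is 0. The following is a minimal host-side sketch of one common way to guard that case. It is not rocPRIM's actual fix; the helper names safe_clz and safe_ctz are hypothetical, and returning the bit width for a zero input is an assumed convention.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helpers for illustration only; these are not part of rocPRIM.
// Number of bits in the 32-bit operand type.
constexpr int bit_width_u32 = static_cast<int>(sizeof(std::uint32_t) * CHAR_BIT);

// Count leading zeros. The builtin's result is undefined for 0,
// so zero is handled explicitly (assumed convention: return 32).
inline int safe_clz(std::uint32_t x)
{
    return x == 0 ? bit_width_u32 : __builtin_clz(x);
}

// Count trailing zeros, with the same explicit guard for 0.
inline int safe_ctz(std::uint32_t x)
{
    return x == 0 ? bit_width_u32 : __builtin_ctz(x);
}

// Small self-check covering the zero case and a few ordinary inputs.
inline void self_check()
{
    assert(safe_clz(0u) == 32);
    assert(safe_ctz(0u) == 32);
    assert(safe_clz(1u) == 31);
    assert(safe_ctz(8u) == 3);
}
```

The explicit branch keeps the zero case out of the builtin entirely, so the result no longer depends on how a given compiler or target happens to lower it.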

Known issues

  • Potential hang with rocprim::partition_threeway with large input data sizes on later ROCm builds. A workaround is currently in place.