January 25, 2026

CPU vs. ROCm vs. RustiCL

System: Asus Z13 32GB (2025), AMD Ryzen AI Max+ 395 with Radeon 8060S iGPU

Darktable Benchmark Report

Surprising victory for RustiCL on the APU

In contrast to the AMD RX 9060 XT 8GB benchmark, which used a dedicated graphics card, this one runs on an integrated graphics unit (iGPU): the Radeon 8060S in AMD's new Ryzen AI Max+ 395 (“Strix Halo”). The result is reversed here: the open-source driver significantly outperforms the official solution.

The results in detail:

1st place: RustiCL (Mesa OpenCL)

  • Time: 3.560 seconds
  • Analysis: RustiCL delivers outstanding performance here. It is approximately 65% faster than the ROCm implementation on the same hardware (3.560 s versus 5.883 s, a speedup of about 1.65x). This suggests that the Mesa driver is already very well optimized for this very new integrated RDNA 3.5 architecture (Radeon 8060S), or that it handles the APU’s shared memory more efficiently.

2nd place: ROCm (gfx1151)

  • Time: 5.883 seconds
  • Analysis: ROCm disappoints somewhat here. At almost 6 seconds, it is only about 8% faster than the CPU-only run. One possible reason is the overhead of the ROCm stack, which weighs more heavily on smaller, integrated GPUs than on large dedicated cards. In addition, the “tiling” section of the report shows that ROCm splits the image into tiles (denoiseprofile: 2x1, atrous: 2x1), while RustiCL appears to process the image in one pass. Tiling often costs time because of the extra overhead.

3rd place: CPU only (Ryzen AI Max+ 395)

  • Time: 6.393 seconds
  • Analysis: The CPU performance is very respectable for a mobile chip and comes remarkably close to the ROCm result. This shows how inefficient ROCm is in this particular scenario: it offers hardly any advantage over the processor itself.
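
The percentages quoted in the analysis can be checked with a few lines of Python. This is only a sketch of the arithmetic: the three times are copied from the results above, while the helper names `speedup` and `time_saved_pct` are made up for illustration and are not part of darktable or the benchmark tool.

```python
# Total run times in seconds, copied from the three results above.
times = {"RustiCL": 3.560, "ROCm": 5.883, "CPU": 6.393}


def speedup(slow: float, fast: float) -> float:
    """Factor by which the faster run beats the slower one (slow / fast)."""
    return slow / fast


def time_saved_pct(slow: float, fast: float) -> float:
    """Share of the slower run's time that the faster run saves, in percent."""
    return (slow - fast) / slow * 100


# Compare each pair as (slower backend, faster backend).
for slow_name, fast_name in [("ROCm", "RustiCL"), ("CPU", "ROCm"), ("CPU", "RustiCL")]:
    slow, fast = times[slow_name], times[fast_name]
    print(
        f"{fast_name} vs. {slow_name}: "
        f"{speedup(slow, fast):.2f}x, "
        f"{time_saved_pct(slow, fast):.1f}% less time"
    )
```

Note that “65% faster” refers to the speedup factor (about 1.65x); measured as time saved, RustiCL needs roughly 39% less time than ROCm. Both numbers describe the same gap.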

Conclusion

With this modern AMD APU (Ryzen AI Max+ 395), RustiCL is currently the clearly better choice for Darktable.

  • RustiCL uses the hardware efficiently and avoids unnecessary tiling.
  • ROCm seems to (still) have problems on this specific integrated architecture, whether due to driver maturity or unfavorable memory management, resulting in unnecessary tiling and poorer performance.