AArch64 Atomics: Might They Be Harming Your Performance?

PPoPP (2023)

Abstract
Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs. In this work we explore the performance of the CAS and LL-SC approaches to implementing CAS operations on recent high-performance AArch64 CPUs, namely the A64FX, ThunderX2 (TX2), and Graviton3. We observe that these instructions can lead to fundamentally different performance profiles. On A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC pairs show the expected linear increase. For high thread counts, this translates into LL-SC being more than 20× faster than CAS. On TX2 and Graviton3, LL-SC brings more modest (but still significant) 2-3× speedups. We characterise the conditions under which each approach delivers better performance on each CPU.
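To make the kind of operation being measured concrete, the following is a minimal C++ sketch of a contended CAS retry loop. It is an illustration only, not the paper's benchmark: the helper name `run_cas_counter`, the shared-counter workload, and the memory orderings are all assumptions chosen for brevity.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not from the paper): every thread increments one
// shared counter `iters` times, retrying its CAS until it succeeds.
std::uint64_t run_cas_counter(unsigned num_threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};

    auto worker = [&counter, iters] {
        for (std::uint64_t i = 0; i < iters; ++i) {
            std::uint64_t expected = counter.load(std::memory_order_relaxed);
            // On failure, compare_exchange_weak overwrites `expected` with
            // the value currently stored, so the loop retries with fresh data.
            while (!counter.compare_exchange_weak(expected, expected + 1,
                                                  std::memory_order_acq_rel,
                                                  std::memory_order_relaxed)) {
                // Another thread won the race; try again.
            }
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker);
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter.load();
}
```

Calling `run_cas_counter(4, 1000)` should return 4000, since every increment eventually succeeds. Which instructions implement the loop depends on the compiler and target flags: with GCC or Clang, targeting `-march=armv8-a` typically yields an exclusive load/store (LL-SC) loop, while `-march=armv8.1-a` (which enables the LSE extension) typically yields a CAS instruction. Inspecting the generated assembly is the reliable way to confirm which variant a given build uses.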