On 1/23/20 10:25 AM, Waiman Long wrote:
> On 1/23/20 6:35 AM, Will Deacon wrote:
>> Hi folks,
>>
>> (I think Lihao is travelling at the moment, so he may be delayed in his
>> replies)
>>
>> On Wed, Jan 22, 2020 at 12:24:58PM -0500, Waiman Long wrote:
>>> On 1/22/20 6:45 AM, Lihao Liang wrote:
>>>> On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan wrote:
>>>>> Summary
>>>>> -------
>>>>>
>>>>> Lock throughput can be increased by handing a lock to a waiter on the
>>>>> same NUMA node as the lock holder, provided care is taken to avoid
>>>>> starvation of waiters on other NUMA nodes. This patch introduces CNA
>>>>> (compact NUMA-aware lock) as the slow path for qspinlock. It is
>>>>> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>>>>>
>>>> Thanks for your patches. The experimental results look promising!
>>>>
>>>> I understand that the new CNA qspinlock uses randomization to achieve
>>>> long-term fairness, and provides the numa_spinlock_threshold parameter
>>>> for users to tune. As Linux runs extremely diverse workloads, it is not
>>>> clear how randomization affects its fairness, and how users with
>>>> different requirements are supposed to tune this parameter.
>>>>
>>>> To this end, Will and I consider it beneficial to be able to answer the
>>>> following question:
>>>>
>>>> With different values of numa_spinlock_threshold and
>>>> SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
>>>> sockets have to wait to acquire the lock? This is particularly relevant
>>>> in high contention situations when new threads keep arriving on the same
>>>> socket as the lock holder.
>>>>
>>>> In this email, I try to provide some formal analysis to address this
>>>> question. Let's assume the probability for the lock to stay on the
>>>> same socket is *at least* p, which corresponds to the probability for
>>>> the function probably(unsigned int num_bits) in the patch to return *false*,
>>>> where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
>>>> function.
>>> That is not strictly true from my understanding of the code. The
>>> probably() function does not come into play if a secondary queue is
>>> present. Also calling cna_scan_main_queue() doesn't guarantee that a
>>> waiter in the same node can be found. So the simple mathematical
>>> analysis isn't that applicable in this case. One will have to do an
>>> actual simulation to find out what the actual behavior will be.
>> It's certainly true that the analysis is based on the worst-case scenario,
>> but I think it's still worth considering. For example, the secondary queue
>> does not exist initially so it seems a bit odd that we only instantiate it
>> with < 1% probability.
>>
>> That said, my real concern with any of this is that it makes formal
>> modelling and analysis of the qspinlock considerably more challenging. I
>> would /really/ like to see an update to the TLA+ model we have of the
>> current implementation [1] and preferably also the userspace version I
>> hacked together [2] so that we can continue to test and validate changes
>> to the code outside of the usual kernel stress-testing.
> I do agree that the current CNA code is hard to model. The CNA lock
> behaves like a regular qspinlock in many cases. If the lock becomes
> fairly contended with waiters from different nodes, it will
> opportunistically switch to CNA mode where preference is given to
> waiters in the same node.
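
Just to put a number on the "< 1%" above: as discussed, when the
secondary queue is empty the main queue is only scanned when
probably(SHUFFLE_REDUCTION_PROB_ARG) returns false, and probably() is
essentially a biased coin flip on the low num_bits bits of a
pseudo-random value, so it returns false roughly once every
2^num_bits calls. A minimal sketch of the idea (not the actual code
from the patch -- if I remember right, that one keeps its own per-CPU
seed instead of calling prandom_u32()):

#include <linux/random.h>	/* prandom_u32() */
#include <linux/types.h>

/*
 * Illustration only. Returns false when the low num_bits bits of a
 * pseudo-random value are all zero, i.e. with probability about
 * 1/2^num_bits.
 */
static bool probably(unsigned int num_bits)
{
	return prandom_u32() & ((1U << num_bits) - 1);
}

With the default SHUFFLE_REDUCTION_PROB_ARG of 7 that works out to
1/128, or about 0.8% -- the "< 1%" Will refers to -- and changing the
parameter moves this probability by powers of two.
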
BTW, I added the attached draft lock_event patch on top of the v9 CNA
patch series to observe the behavior of the CNA lock. Using a 2-socket
96-thread x86-64 server, the lock event output after boot up was:

cna_intra_max=1942
cna_mainscan_hit=134
cna_merge_queue=73
cna_prescan_hit=16662
cna_prescan_miss=268
cna_splice_new=352
cna_splice_old=2415
lock_pending=130090
lock_slowpath=191868
lock_use_node2=135

After resetting the counts and running a 96-thread lock stress test for
10s, I got

cna_intra_max=65536
cna_mainscan_hit=46
cna_merge_queue=661
cna_prescan_hit=42486841
cna_prescan_miss=68
cna_splice_new=676
cna_splice_old=402
lock_pending=11012
lock_slowpath=44332335
lock_use_node2=57203

So the cna_intra_max does go to the maximum of 64k.

Cheers,
Longman
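
P.S. For anyone who wants to map the counter names above to code: they
come from the existing lock event framework (CONFIG_LOCK_EVENT_COUNTS)
and are read from /sys/kernel/debug/lock_event_counts. Roughly, the
draft patch just adds new LOCK_EVENT() entries and bumps them from the
CNA slow path; a sketch of the pattern (not the actual hunks from the
attached patch, and found_local_waiter is only a placeholder
condition):

/* kernel/locking/lock_events_list.h */
LOCK_EVENT(cna_prescan_hit)	/* # of CNA pre-scan hits */
LOCK_EVENT(cna_prescan_miss)	/* # of CNA pre-scan misses */

/* somewhere in the CNA slow path (kernel/locking/qspinlock_cna.h) */
	if (found_local_waiter)			/* placeholder condition */
		lockevent_inc(cna_prescan_hit);
	else
		lockevent_inc(cna_prescan_miss);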