> On Mar 17, 2022, at 5:45 PM, Dave Hansen wrote:
>
> On 3/17/22 17:20, Nadav Amit wrote:
>> I don’t have other data right now. Let me run some measurements later
>> tonight. I understand your explanation, but I still do not see how
>> much “later” can the lazy check be that it really matters. Just
>> strange.
>
> These will-it-scale tests are really brutal. They're usually sitting in
> really tight kernel entry/exit loops. Everything is pounding on kernel
> locks and bouncing cachelines around like crazy. It might only be a few
> thousand cycles between two successive kernel entries.
>
> Things like the call_single_queue cacheline have to be dragged from
> other CPUs *and* there are locks that you can spin on. While a thread
> is doing all this spinning, it is forcing more and more threads into the
> lazy TLB state. The longer you spin, the more threads have entered the
> kernel, contended on the mmap_lock and gone idle.
>
> Is it really surprising that a loop that can take hundreds of locks can
> take a long time?
>
> 	for_each_cpu(cpu, cfd->cpumask) {
> 		csd_lock(csd);
> 		...
> 	}

Thanks for the clarification. It took me some time to rehash. Yes, my
patch should be reverted.

I think I now see what you are talking about: this loop can take a lot
of time, which I did not appreciate before. I am still not sure it is
exactly as you describe (unless I am missing something), so while I
think you are right, let me share my understanding.

Going over the overheads that are (or are not) induced in the loop:

(1) Contention on csd_lock(): csd_lock() is a per-CPU lock, not a
cross-core lock. In this workload, which does not use asynchronous
IPIs, it is not contended.

(2) Cache misses on csd_lock(): I am not sure this really induces
overhead, or at least it does not have to. On one hand, although two
CSDs can reside in the same cacheline, we eventually run
csd_lock_wait() to check that our smp-call was served, so the CSD's
cacheline should end up in the IPI initiator's cache, ready for the
next invocation. On the other hand, since csd_lock_wait() only reads,
MESI would leave the CSD cacheline in the shared state, and the next
csd_lock() (a write) would have to invalidate the other copies. Still,
since there are no data dependencies, I would expect the CPU to
continue executing speculatively past csd_lock() even on a cache miss.
So I do not see this as a real overhead.

(3) Cacheline misses on llist_add(): this makes perfect sense, and I do
not see an easy way around it, especially since on x86 FAA and CAS
apparently have similar latency.

(4) cpu_tlbstate_shared.is_lazy: this read is only added back once you
revert the patch.

One thing to note, although I am not sure it is really relevant here
(your explanation about cores stuck on the mmap_lock makes sense), is
that there are two possible reasons for fewer IPIs: (a) skipped
shootdowns, i.e. remote CPUs are found to be lazy; and (b) saved IPIs,
i.e. llist_add() finds that another IPI is already pending.

I have some vague ideas on how to shorten the loop in
smp_call_function_many_cond(), but I guess for now revert is the way to
go.

Thanks for your patience. Yes, revert.
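
For reference, and mostly to make sure we are talking about the same
thing, here is roughly the shape of the loop I have in mind. This is a
simplified sketch from memory, not the exact kernel/smp.c code (field
names and details vary between kernel versions):

	/* Sketch of the per-CPU loop in smp_call_function_many_cond() */
	for_each_cpu(cpu, cfd->cpumask) {
		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

		/*
		 * Point (4)/(a): with the revert, the condition function
		 * reads the remote CPU's cpu_tlbstate_shared.is_lazy and
		 * skips lazy CPUs entirely.
		 */
		if (cond_func && !cond_func(cpu, info))
			continue;

		/*
		 * Points (1)/(2): per-CPU CSD lock; uncontended here, but
		 * the write may miss if the line is still shared.
		 */
		csd_lock(csd);
		csd->func = func;
		csd->info = info;

		/*
		 * Point (3)/(b): atomic insertion into the *remote* CPU's
		 * call_single_queue; the return value tells us whether the
		 * list was empty, i.e. whether an IPI is needed at all.
		 */
		if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu)))
			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
	}

	/* One batched IPI to the CPUs that actually need one. */
	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);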
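
Regarding (3): as far as I remember, llist_add() boils down to a
cmpxchg loop on the remote list head, roughly like the following
(again from memory; the upstream lib/llist.c code goes through
llist_add_batch() and looks a bit different), which is why I do not
expect FAA to buy us much over CAS on x86:

	/* Roughly what llist_add() does: push onto a lock-free list head. */
	static inline bool llist_add(struct llist_node *new,
				     struct llist_head *head)
	{
		struct llist_node *first;

		do {
			new->next = first = READ_ONCE(head->first);
		} while (cmpxchg(&head->first, first, new) != first);

		/* True if the list was empty, i.e. no IPI is pending yet. */
		return !first;
	}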
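
And regarding (4)/(a): with the revert, the lazy filtering on the
initiator side is roughly the following (a sketch of what I recall from
arch/x86/mm/tlb.c; the exact signatures are approximate):

	/*
	 * Condition function: skip CPUs that are in lazy TLB mode; they
	 * will deal with the stale TLB when they switch back to the mm.
	 */
	static bool tlb_is_not_lazy(int cpu, void *data)
	{
		return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
	}

	/* Case (a), skipped shootdowns: lazy CPUs never get an IPI. */
	on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
			      (void *)info, true, cpumask);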