On 3/17/22 17:20, Nadav Amit wrote:
> I don’t have other data right now. Let me run some measurements later
> tonight. I understand your explanation, but I still do not see how
> much “later” the lazy check can be that it really matters. Just
> strange.

These will-it-scale tests are really brutal. They're usually sitting in
really tight kernel entry/exit loops. Everything is pounding on kernel
locks and bouncing cachelines around like crazy. It might only be a few
thousand cycles between two successive kernel entries.

Things like the call_single_queue cacheline have to be dragged from
other CPUs *and* there are locks that you can spin on. While a thread
is doing all this spinning, it is forcing more and more threads into
the lazy TLB state. The longer you spin, the more threads have entered
the kernel, contended on the mmap_lock and gone idle.

Is it really surprising that a loop that can take hundreds of locks can
take a long time?

        for_each_cpu(cpu, cfd->cpumask) {
                csd_lock(csd);
                ...
        }
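To make the spinning concrete, here's a rough userspace sketch of the
pattern, *not* the actual kernel/smp.c code: fake_csd, fake_cpu,
pending[], NR_FAKE_CPUS and NR_ROUNDS are all invented names standing
in for the per-CPU csd, the remote CPU's IPI handler and
call_single_queue. Each target "CPU" owns one csd slot, and the sender
has to spin until the previous request on that slot has been consumed
before it can queue the next one.

/*
 * Rough userspace sketch, not kernel code: fake_csd, fake_cpu and
 * pending[] are invented stand-ins for the per-CPU csd, the remote
 * CPU's IPI handler, and call_single_queue respectively.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_FAKE_CPUS    4
#define NR_ROUNDS       1000

struct fake_csd {
    atomic_int locked;          /* stands in for CSD_FLAG_LOCK */
    void (*func)(void *);
    void *info;
};

static struct fake_csd csd_of[NR_FAKE_CPUS];
static atomic_int pending[NR_FAKE_CPUS];  /* stands in for call_single_queue */

/* Sender side, roughly what csd_lock() amounts to: spin, then take the slot. */
static void fake_csd_lock(struct fake_csd *csd)
{
    while (atomic_exchange_explicit(&csd->locked, 1, memory_order_acquire))
        ;   /* the sender burns time here until the target is done */
}

static void fake_csd_unlock(struct fake_csd *csd)
{
    atomic_store_explicit(&csd->locked, 0, memory_order_release);
}

static void noop(void *info) { (void)info; }

/* A "target CPU": consume the queued request, then release the csd. */
static void *fake_cpu(void *arg)
{
    int cpu = (int)(long)arg;

    for (int i = 0; i < NR_ROUNDS; i++) {
        while (!atomic_load(&pending[cpu]))
            ;
        csd_of[cpu].func(csd_of[cpu].info);
        atomic_store(&pending[cpu], 0);
        fake_csd_unlock(&csd_of[cpu]);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NR_FAKE_CPUS];

    for (long cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
        pthread_create(&tid[cpu], NULL, fake_cpu, (void *)cpu);

    /* The sender: analogue of the for_each_cpu() loop quoted above. */
    for (int i = 0; i < NR_ROUNDS; i++) {
        for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
            fake_csd_lock(&csd_of[cpu]);  /* may spin on a "remote" line */
            csd_of[cpu].func = noop;
            csd_of[cpu].info = NULL;
            atomic_store(&pending[cpu], 1);
        }
    }

    for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
        pthread_join(tid[cpu], NULL);

    puts("done");
    return 0;
}

Make the targets slower, or just add more of them, and the sender's
time in that loop grows with them. That's the kind of serialization
I'm talking about above.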