On 3/17/22 17:20, Nadav Amit wrote:
> I don’t have other data right now. Let me run some measurements later
> tonight. I understand your explanation, but I still do not see how
> much “later” the lazy check can be that it really matters. Just
> strange.

These will-it-scale tests are really brutal. They're usually sitting in
really tight kernel entry/exit loops. Everything is pounding on kernel
locks and bouncing cachelines around like crazy. It might only be a few
thousand cycles between two successive kernel entries.

Things like the call_single_queue cacheline have to be dragged from
other CPUs *and* there are locks that you can spin on. While a thread
is doing all this spinning, it is forcing more and more threads into
the lazy TLB state. The longer you spin, the more threads have entered
the kernel, contended on the mmap_lock and gone idle.

Is it really surprising that a loop that can take hundreds of locks can
take a long time?

        for_each_cpu(cpu, cfd->cpumask) {
                csd_lock(csd);
                ...
        }
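To make the spinning concrete, here's a rough userspace sketch of the
pattern, *not* the actual kernel/smp.c code: fake_csd, fake_cpu,
pending[], NR_FAKE_CPUS and NR_ROUNDS are all invented names standing
in for the per-CPU csd, the remote CPU's IPI handler and
call_single_queue. Each target "CPU" owns one csd slot, and the sender
has to spin until the previous request on that slot has been consumed
before it can queue the next one.

/*
 * Rough userspace sketch, not kernel code: fake_csd, fake_cpu and
 * pending[] are invented stand-ins for the per-CPU csd, the remote
 * CPU's IPI handler, and call_single_queue respectively.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_FAKE_CPUS    4
#define NR_ROUNDS       1000

struct fake_csd {
    atomic_int locked;          /* stands in for CSD_FLAG_LOCK */
    void (*func)(void *);
    void *info;
};

static struct fake_csd csd_of[NR_FAKE_CPUS];
static atomic_int pending[NR_FAKE_CPUS];  /* stands in for call_single_queue */

/* Sender side, roughly what csd_lock() amounts to: spin, then take the slot. */
static void fake_csd_lock(struct fake_csd *csd)
{
    while (atomic_exchange_explicit(&csd->locked, 1, memory_order_acquire))
        ;   /* the sender burns time here until the target is done */
}

static void fake_csd_unlock(struct fake_csd *csd)
{
    atomic_store_explicit(&csd->locked, 0, memory_order_release);
}

static void noop(void *info) { (void)info; }

/* A "target CPU": consume the queued request, then release the csd. */
static void *fake_cpu(void *arg)
{
    int cpu = (int)(long)arg;

    for (int i = 0; i < NR_ROUNDS; i++) {
        while (!atomic_load(&pending[cpu]))
            ;
        csd_of[cpu].func(csd_of[cpu].info);
        atomic_store(&pending[cpu], 0);
        fake_csd_unlock(&csd_of[cpu]);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NR_FAKE_CPUS];

    for (long cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
        pthread_create(&tid[cpu], NULL, fake_cpu, (void *)cpu);

    /* The sender: analogue of the for_each_cpu() loop quoted above. */
    for (int i = 0; i < NR_ROUNDS; i++) {
        for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
            fake_csd_lock(&csd_of[cpu]);  /* may spin on a "remote" line */
            csd_of[cpu].func = noop;
            csd_of[cpu].info = NULL;
            atomic_store(&pending[cpu], 1);
        }
    }

    for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++)
        pthread_join(tid[cpu], NULL);

    puts("done");
    return 0;
}

Make the targets slower, or just add more of them, and the sender's
time in that loop grows with them. That's the kind of serialization
I'm talking about above.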