> On Mar 17, 2022, at 5:45 PM, Dave Hansen wrote:
>
> On 3/17/22 17:20, Nadav Amit wrote:
>> I don’t have other data right now. Let me run some measurements later
>> tonight. I understand your explanation, but I still do not see how
>> much “later” can the lazy check be that it really matters. Just
>> strange.
>
> These will-it-scale tests are really brutal. They're usually sitting in
> really tight kernel entry/exit loops. Everything is pounding on kernel
> locks and bouncing cachelines around like crazy. It might only be a few
> thousand cycles between two successive kernel entries.
>
> Things like the call_single_queue cacheline have to be dragged from
> other CPUs *and* there are locks that you can spin on. While a thread
> is doing all this spinning, it is forcing more and more threads into the
> lazy TLB state. The longer you spin, the more threads have entered the
> kernel, contended on the mmap_lock and gone idle.
>
> Is it really surprising that a loop that can take hundreds of locks can
> take a long time?
>
> 	for_each_cpu(cpu, cfd->cpumask) {
> 		csd_lock(csd);
> 		...
> 	}

Thanks for the clarification. It took me some time to rehash. Yes, my
patch should be reverted.

I think I now see what you are talking about: this loop can take a lot
of time, which I did not appreciate before. I am still not sure it is
exactly as you describe (unless I am missing something), so while I
think you are right, let me share my understanding.

Going over the overheads that are (or are not) induced in the loop:

(1) Contention on csd_lock(): csd_lock() is a per-CPU lock, not a
cross-core lock. In this workload, which does not use asynchronous
IPIs, it is not contended.

(2) Cache misses on csd_lock(): I am not sure this really induces
overhead, or at least it does not have to. On one hand, although two
CSDs can reside in the same cacheline, we eventually run
csd_lock_wait() to check that our smp-call was served, so the CSD's
cacheline should end up in the IPI initiator's cache, ready for the
next invocation. On the other hand, since csd_lock_wait() only reads,
MESI would leave the CSD cacheline in the shared state, and the next
csd_lock() (a write) would have to invalidate the other copies. Still,
since there are no data dependencies, I would expect the CPU to
continue executing speculatively past csd_lock() even on a cache miss.
So I do not see this as a real overhead.

(3) Cacheline misses on llist_add(): this makes perfect sense, and I do
not see an easy way around it, especially since on x86 FAA and CAS
apparently have similar latency.

(4) cpu_tlbstate_shared.is_lazy: this read is only added back once you
revert the patch.

One thing to note, although I am not sure it is really relevant here
(your explanation about cores stuck on the mmap_lock makes sense), is
that there are two possible reasons for fewer IPIs: (a) skipped
shootdowns, i.e. remote CPUs are found to be lazy; and (b) saved IPIs,
i.e. llist_add() finds that another IPI is already pending.

I have some vague ideas on how to shorten the loop in
smp_call_function_many_cond(), but I guess for now revert is the way to
go.

Thanks for your patience. Yes, revert.
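
For reference, and mostly to make sure we are talking about the same
thing, here is roughly the shape of the loop I have in mind. This is a
simplified sketch from memory, not the exact kernel/smp.c code (field
names and details vary between kernel versions):

	/* Sketch of the per-CPU loop in smp_call_function_many_cond() */
	for_each_cpu(cpu, cfd->cpumask) {
		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

		/*
		 * Point (4)/(a): with the revert, the condition function
		 * reads the remote CPU's cpu_tlbstate_shared.is_lazy and
		 * skips lazy CPUs entirely.
		 */
		if (cond_func && !cond_func(cpu, info))
			continue;

		/*
		 * Points (1)/(2): per-CPU CSD lock; uncontended here, but
		 * the write may miss if the line is still shared.
		 */
		csd_lock(csd);
		csd->func = func;
		csd->info = info;

		/*
		 * Point (3)/(b): atomic insertion into the *remote* CPU's
		 * call_single_queue; the return value tells us whether the
		 * list was empty, i.e. whether an IPI is needed at all.
		 */
		if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu)))
			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
	}

	/* One batched IPI to the CPUs that actually need one. */
	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);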
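
Regarding (3): as far as I remember, llist_add() boils down to a
cmpxchg loop on the remote list head, roughly like the following
(again from memory; the upstream lib/llist.c code goes through
llist_add_batch() and looks a bit different), which is why I do not
expect FAA to buy us much over CAS on x86:

	/* Roughly what llist_add() does: push onto a lock-free list head. */
	static inline bool llist_add(struct llist_node *new,
				     struct llist_head *head)
	{
		struct llist_node *first;

		do {
			new->next = first = READ_ONCE(head->first);
		} while (cmpxchg(&head->first, first, new) != first);

		/* True if the list was empty, i.e. no IPI is pending yet. */
		return !first;
	}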
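
And regarding (4)/(a): with the revert, the lazy filtering on the
initiator side is roughly the following (a sketch of what I recall from
arch/x86/mm/tlb.c; the exact signatures are approximate):

	/*
	 * Condition function: skip CPUs that are in lazy TLB mode; they
	 * will deal with the stale TLB when they switch back to the mm.
	 */
	static bool tlb_is_not_lazy(int cpu, void *data)
	{
		return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
	}

	/* Case (a), skipped shootdowns: lazy CPUs never get an IPI. */
	on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
			      (void *)info, true, cpumask);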