Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock

From: Guo Ren <guoren@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Kogan <alex.kogan@oracle.com>,
	linux@armlinux.org.uk, mingo@redhat.com, will.deacon@arm.com,
	arnd@arndb.de, longman@redhat.com, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com,
	jglauber@marvell.com, steven.sistare@oracle.com,
	daniel.m.jordan@oracle.com, dave.dice@oracle.com
Subject: Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock
Date: Fri, 4 Aug 2023 20:19:25 -0400
Message-ID: <CAJF2gTQB7pvLYCkZ+9Xmdtmv4ZVynC2embAza-sgPwL8c-D-sw@mail.gmail.com>
In-Reply-To: <20230804182312.GO212435@hirez.programming.kicks-ass.net>

On Fri, Aug 4, 2023 at 2:24 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Aug 04, 2023 at 10:17:35AM -0400, Guo Ren wrote:
>
> > > See, this is where the ARM64 WFE would come in handy; I don't suppose
> > > RISC-V has anything like that?
> > Em... arm64 smp_cond_load only could save power consumption or release
> > the pipeline resources of an SMT processor. When (Node1 cpu64) is in
> > the WFE state, it still needs (Node0 cpu1) to write the value to give
> > a cross-NUMA signal. So I didn't see what WFE related to reducing
> > cross-Numa transactions, or I missed something. Sorry
>
> The benefit is that WFE significantly reduces the memory traffic. Since
> it 'suspends' the core and waits for a write-notification instead of
> busy polling the memory location you get a ton less loads.
Em... I observed something different: when a long lock queue built up
because of a store-buffer delay problem in the lock torture test, all
the interconnects went quiet and there was no more memory traffic.
Every core was looping on loads of a "different" cacheline held in its
own L1 cache, which is how queued_spinlock arranges its waiters. So I
don't see any memory traffic on the bus.
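
To illustrate what I mean by each core looping on its own cacheline,
here is a rough userspace sketch of an MCS-style queue lock in C11
atomics. It is not the kernel's qspinlock code; the names mcs_node,
mcs_lock, mcs_lock_acquire and mcs_lock_release are hypothetical and
only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-waiter node; ideally each one sits in its own cacheline. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;              /* a waiter spins only on this flag */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail; /* last waiter in the queue, or NULL */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
    atomic_store_explicit(&node->next, (struct mcs_node *)NULL,
                          memory_order_relaxed);
    atomic_store_explicit(&node->locked, true, memory_order_relaxed);

    /* One shared-cacheline access: swap ourselves in as the new tail. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
    if (prev == NULL)
        return;                      /* queue was empty: lock acquired */

    /* Link behind the predecessor so it can find us on release. */
    atomic_store_explicit(&prev->next, node, memory_order_release);

    /* Local spin: these loads hit in our own L1 until the predecessor
     * writes node->locked, so they generate no bus traffic meanwhile. */
    while (atomic_load_explicit(&node->locked, memory_order_acquire))
        ;
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *node)
{
    struct mcs_node *next =
        atomic_load_explicit(&node->next, memory_order_acquire);

    if (next == NULL) {
        /* No visible successor: try to mark the queue empty. */
        struct mcs_node *expected = node;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, (struct mcs_node *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;

        /* A successor swapped the tail but has not linked itself yet. */
        while ((next = atomic_load_explicit(&node->next,
                                            memory_order_acquire)) == NULL)
            ;
    }

    /* The single cross-core write that hands the lock to the successor. */
    atomic_store_explicit(&next->locked, false, memory_order_release);
}
```

In this sketch the only cross-core transactions per handover are the
tail exchange on entry and the one store to next->locked on release;
the waiting loop itself stays inside the waiter's own cache.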

For LL + WFE, AFAIK, LL is a load instruction that grabs the
cacheline from the bus into the L1 cache and sets the reservation set
(Arm calls it the exclusive monitor). If any cacheline invalidation
request (readunique/cleanunique/...) comes in, WFE retires and
the reservation set is cleared. So from a cacheline perspective,
there is no difference between "LL+WFE" and "looping loads."
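
As a rough sketch of the two waiting styles, assuming GCC or Clang
(for the __atomic builtins and inline asm): on arm64 the helper below
uses an exclusive load plus WFE, following the same instruction
pattern as the arm64 kernel's __cmpwait helper, and on other
architectures it degenerates into a plain looping load. The names
cmpwait_u32 and spin_until_changed are hypothetical.

```c
#include <stdint.h>

/* Wait hint: on arm64, arm the exclusive monitor with LDXR and park in
 * WFE until the line is written; elsewhere do nothing, so the caller's
 * loop becomes a plain looping load. */
static inline void cmpwait_u32(uint32_t *ptr, uint32_t old)
{
#if defined(__aarch64__)
    uint32_t tmp;

    __asm__ volatile(
        "sevl\n\t"                            /* set the local event ...    */
        "wfe\n\t"                             /* ... and consume it         */
        "ldxr %w[tmp], %[v]\n\t"              /* LL: arms the monitor       */
        "eor  %w[tmp], %w[tmp], %w[old]\n\t"
        "cbnz %w[tmp], 1f\n\t"                /* already changed: no wait   */
        "wfe\n"                               /* wait for monitor to clear  */
        "1:"
        : [tmp] "=&r" (tmp), [v] "+Q" (*ptr)
        : [old] "r" (old)
        : "memory");
#else
    (void)ptr;
    (void)old;
#endif
}

/* Return the first value observed at *ptr that differs from old. */
static uint32_t spin_until_changed(uint32_t *ptr, uint32_t old)
{
    uint32_t val;

    while ((val = __atomic_load_n(ptr, __ATOMIC_ACQUIRE)) == old)
        cmpwait_u32(ptr, old);

    return val;
}
```

In both variants the waiter holds the cacheline in its L1 and only
reacts once another core's write invalidates it; the difference is
what the pipeline does in between.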

Let's look at two scenarios for LL+WFE: multiple cores, and multiple
threads on one core:
 - In the multi-core case, WFE gives no more benefit than loop loading
from my perspective. The only thing WFE can do is "suspend the core"
(I borrowed your words here), but that can't be a deep sleep because
responding to the wake-up event has the highest priority. As you said,
we should prevent "terribly contended" situations, so WFE must keep
reacting fast in the pipeline rather than sleeping deeply. That's WFI
stuff. Loop loading can also reduce power consumption given a proper
micro-arch design: when the pipeline gets into a loop-loading state,
the loop buffer mechanism starts, no instruction fetch happens, and
everything except the "loop buffer" and the "LSU load path" (the
frontend included) can suspend for a while. So loop loading is not as
terrible as you thought.

 - In the case of multiple threads on one core, introducing an ISA
instruction (WFE) to solve the loop-loading problem is worthwhile
because the waiting thread can release the processor's pipeline
resources to its sibling threads.

>

--
Best Regards
 Guo Ren