Re: per-cpu thoughts

From: Mark Rutland <mark.rutland@arm.com>
To: Paul Walmsley <paul.walmsley@sifive.com>
Cc: "Paul Walmsley" <paul@pwsan.com>,
	"Björn Töpel" <bjorn.topel@gmail.com>,
	"Palmer Dabbelt" <palmer@sifive.com>,
	will.deacon@arm.com, catalin.marinas@arm.com,
	"Nick Kossifidis" <mick@ics.forth.gr>,
	"Christopher Lameter" <cl@linux.com>,
	linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: per-cpu thoughts
Date: Tue, 12 Mar 2019 11:23:50 +0000	[thread overview]
Message-ID: <20190312112349.GA35803@lakrids.cambridge.arm.com> (raw)
In-Reply-To: <alpine.DEB.2.21.9999.1903111110520.11892@viisi.sifive.com>

On Mon, Mar 11, 2019 at 11:39:56AM -0700, Paul Walmsley wrote:
> On Mon, 11 Mar 2019, Mark Rutland wrote:
> > On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
> > > On Mon, 11 Mar 2019, Björn Töpel wrote:
> > > > 
> > > > But the generic one disables interrupts, right?
> > > > 
> > > > I believe the rational for RV is similar to ARM's; AMO+preemption
> > > > disable regions is *slightly* better than the generic, but not as good
> > > > as the IA one. Or am I missing something?
> > > 
> > > There's been a discussion going on in a private thread about this that I 
> > > unfortunately didn't add you to.  The discussion is still ongoing, but I 
> > > think Christoph and myself and a few other folks have agreed that the 
> > > preempt_disable/enable is not needed for the amoadd approach.  This is 
> > > since the apparent intention of the preemption disable/enable is to ensure 
> > > the correctness of the counter increment; however there is no risk of 
> > > incorrectness in an amoadd sequence since the atomic add is locked across 
> > > all of the cache coherency domain. 
> > 
> > We also thought that initially, but there's a sbutle race that can
> > occur, and so we added code to disable preemption in commit:
> > 
> >   f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
> > 
> > The problem on arm64 is that our atomics take a single base register,
> > and we have to generate the percpu address with separate instructions
> > from the atomic itself. That means we can get preempted between address
> > generation and the atomic, which is problematic for sequences like:
> > 
> > 	// Thread-A			// Thread-B
> > 
> > 	this_cpu_add(var)
> > 					local_irq_disable(flags)
> > 					...
> > 					v = __this_cpu_read(var);
> > 					v = some_function(v);
> > 					__this_cpu_write(var, v);
> > 					...
> > 					local_irq_restore(flags)
> > 
> > ... which can unexpectedly race as:
> > 
> > 
> > 	// Thread-A			// Thread-B
> > 	
> > 	< generate CPU X addr >
> > 	< preempted >
> > 
> > 					< scheduled on CPU X >
> > 					local_irq_disable(flags);
> > 					v = __this_cpu_read(var);
> > 
> > 	< scheduled on CPU Y >
> > 	< add to CPU X's var >
> > 					v = some_function(v);
> > 					__this_cpu_write(var, v);
> > 					local_irq_restore(flags);
> > 
> > ... and hence we lose an update to a percpu variable.
> 
> Makes sense, and thank you very much for the detailed sequences.  
> Open-coded per-cpu code sequences would also cause RISC-V to be exposed to 
> the same issue.  Christoph mentioned this, but at the time my attention 
> was focused on per-cpu counters, and not the broader range of per-cpu code 
> sequences.
> 
> > I suspect RISC-V would have the same problem, unless its AMOs can
> > generate the percpu address and perform the update in a single
> > instruction.
> 
> My understanding is that many of Christoph's per-cpu performance concerns 
> revolve around counters in the VM code, such as:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355

The mod_*_state() functions are the only ones which mess with
preemption, and that should only mandate a few locally-visible
modifications of preempt_count.

Similar cases apply within SLUB, and I'd hoped to improve that with my
this-cpu-reg branch, but I didn't see a measureable improvement on
workloads I tried.

Have you seen a measureable performance problem here?

> and probably elsewhere by now.  It may be worth creating a distinct API 
> for those counters.  If only increment, decrement, and read operations are 
> needed, there shouldn't be a need to disable or re-enable 
> preemption in those code paths - assuming that one is either able to 
> tolerate the occasional cache line bounce or retries in a long LL/SC 
> sequence.  Any opinions on that?

I'm afraid I don't understand this code well enough to say whether that
would be safe.

It's not clear to me whether there would be a measureable performance
difference, as I'd expect fiddling with preempt_count to be relatively
cheap. The AMOs themselves don't need to enforce ordering here, and only
a few compiler barriers are necessary.

Thanks,
Mark.

> > FWIW, I had a go at building percpu ops that didn't need to disable
> > preemption, but that required LL/SC atomics, reserving a GPR for the
> > percpu offset, and didn't result in a measurable difference in practice.
> > The patches are at:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5
> 
> Very interesting indeed.  Thank you for sharing that,
> 
> - Paul

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv