Hi Mark,

On Mon, 11 Mar 2019, Mark Rutland wrote:

> On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
> > On Mon, 11 Mar 2019, Björn Töpel wrote:
> > >
> > > But the generic one disables interrupts, right?
> > >
> > > I believe the rationale for RV is similar to ARM's; AMO+preemption
> > > disable regions is *slightly* better than the generic, but not as good
> > > as the IA one. Or am I missing something?
> >
> > There's been a discussion going on in a private thread about this that I
> > unfortunately didn't add you to. The discussion is still ongoing, but I
> > think Christoph, myself, and a few other folks have agreed that the
> > preempt_disable/enable is not needed for the amoadd approach. This is
> > because the apparent intention of the preemption disable/enable is to
> > ensure the correctness of the counter increment; however, there is no
> > risk of incorrectness in an amoadd sequence, since the atomic add is
> > locked across all of the cache coherency domain.
>
> We also thought that initially, but there's a subtle race that can
> occur, and so we added code to disable preemption in commit:
>
> f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
>
> The problem on arm64 is that our atomics take a single base register,
> and we have to generate the percpu address with separate instructions
> from the atomic itself. That means we can get preempted between address
> generation and the atomic, which is problematic for sequences like:
>
> // Thread-A                      // Thread-B
>
> this_cpu_add(var)
>                                  local_irq_disable(flags)
>                                  ...
>                                  v = __this_cpu_read(var);
>                                  v = some_function(v);
>                                  __this_cpu_write(var, v);
>                                  ...
>                                  local_irq_restore(flags)
>
> ... which can unexpectedly race as:
>
> // Thread-A                      // Thread-B
>
> < generate CPU X addr >
> < preempted >
>                                  < scheduled on CPU X >
>                                  local_irq_disable(flags);
>                                  v = __this_cpu_read(var);
> < scheduled on CPU Y >
> < add to CPU X's var >
>                                  v = some_function(v);
>                                  __this_cpu_write(var, v);
>                                  local_irq_restore(flags);
>
> ... and hence we lose an update to a percpu variable.

Makes sense, and thank you very much for the detailed sequences.
Open-coded per-cpu code sequences would also expose RISC-V to the same
issue. Christoph mentioned this, but at the time my attention was
focused on per-cpu counters rather than the broader range of per-cpu
code sequences.

> I suspect RISC-V would have the same problem, unless its AMOs can
> generate the percpu address and perform the update in a single
> instruction.

My understanding is that many of Christoph's per-cpu performance
concerns revolve around counters in the VM code, such as:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355

and probably elsewhere by now. It may be worth creating a distinct API
for those counters. If only increment, decrement, and read operations
are needed, there shouldn't be a need to disable or re-enable preemption
in those code paths - assuming that one is either able to tolerate the
occasional cache line bounce or retries in a long LL/SC sequence (a
rough sketch of what such an interface might look like is in the
postscript below). Any opinions on that?

> FWIW, I had a go at building percpu ops that didn't need to disable
> preemption, but that required LL/SC atomics, reserving a GPR for the
> percpu offset, and didn't result in a measurable difference in practice.
> The patches are at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Very interesting indeed. Thank you for sharing that,

- Paul
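
P.S. To make the counter-only idea above a bit more concrete, here is a
rough, untested sketch of what such an interface could look like. The
relaxed_percpu_counter_* names are hypothetical rather than an existing
kernel API, and the sketch assumes the architecture provides full
atomic RMW ops (e.g. RISC-V AMOs). No preempt_disable()/preempt_enable()
pair is needed: if the task migrates between fetching the per-cpu
pointer and performing the add, the update simply lands in the previous
CPU's slot, so the aggregate sum stays correct at the cost of an
occasional cross-CPU cache line bounce.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>

struct relaxed_percpu_counter {
	/* one atomic slot per CPU, allocated with alloc_percpu(atomic_long_t) */
	atomic_long_t __percpu *count;
};

static inline void relaxed_percpu_counter_add(struct relaxed_percpu_counter *c,
					      long delta)
{
	/*
	 * raw_cpu_ptr() may hand back a stale CPU's slot if we are
	 * migrated right after reading it, but the atomic add still
	 * lands on a coherent location, so no update is ever lost.
	 */
	atomic_long_add(delta, raw_cpu_ptr(c->count));
}

static inline long relaxed_percpu_counter_read(struct relaxed_percpu_counter *c)
{
	long sum = 0;
	int cpu;

	/* fold every CPU's slot into a single (momentary) total */
	for_each_possible_cpu(cpu)
		sum += atomic_long_read(per_cpu_ptr(c->count, cpu));

	return sum;
}

Increment and decrement would just be relaxed_percpu_counter_add(c, 1)
and relaxed_percpu_counter_add(c, -1); whether the extra read-side cost
is acceptable for the vmstat-style counters is exactly the open
question above.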