Hi Mark,

On Mon, 11 Mar 2019, Mark Rutland wrote:

> On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
> > On Mon, 11 Mar 2019, Björn Töpel wrote:
> > >
> > > But the generic one disables interrupts, right?
> > >
> > > I believe the rationale for RV is similar to ARM's; AMO+preemption
> > > disable regions is *slightly* better than the generic, but not as good
> > > as the IA one. Or am I missing something?
> >
> > There's been a discussion going on in a private thread about this that I
> > unfortunately didn't add you to. The discussion is still ongoing, but I
> > think Christoph, myself, and a few other folks have agreed that the
> > preempt_disable/enable is not needed for the amoadd approach. This is
> > because the apparent intention of the preemption disable/enable is to
> > ensure the correctness of the counter increment; however, there is no
> > risk of incorrectness in an amoadd sequence, since the atomic add is
> > locked across all of the cache coherency domain.
>
> We also thought that initially, but there's a subtle race that can
> occur, and so we added code to disable preemption in commit:
>
> f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
>
> The problem on arm64 is that our atomics take a single base register,
> and we have to generate the percpu address with separate instructions
> from the atomic itself. That means we can get preempted between address
> generation and the atomic, which is problematic for sequences like:
>
> // Thread-A                      // Thread-B
>
> this_cpu_add(var)
>                                  local_irq_disable(flags)
>                                  ...
>                                  v = __this_cpu_read(var);
>                                  v = some_function(v);
>                                  __this_cpu_write(var, v);
>                                  ...
>                                  local_irq_restore(flags)
>
> ... which can unexpectedly race as:
>
> // Thread-A                      // Thread-B
>
> < generate CPU X addr >
> < preempted >
>                                  < scheduled on CPU X >
>                                  local_irq_disable(flags);
>                                  v = __this_cpu_read(var);
> < scheduled on CPU Y >
> < add to CPU X's var >
>                                  v = some_function(v);
>                                  __this_cpu_write(var, v);
>                                  local_irq_restore(flags);
>
> ... and hence we lose an update to a percpu variable.

Makes sense, and thank you very much for the detailed sequences.
Open-coded per-cpu code sequences would also expose RISC-V to the same
issue. Christoph mentioned this, but at the time my attention was
focused on per-cpu counters rather than the broader range of per-cpu
code sequences.

> I suspect RISC-V would have the same problem, unless its AMOs can
> generate the percpu address and perform the update in a single
> instruction.

My understanding is that many of Christoph's per-cpu performance
concerns revolve around counters in the VM code, such as:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355

and probably elsewhere by now. It may be worth creating a distinct API
for those counters. If only increment, decrement, and read operations
are needed, there shouldn't be a need to disable or re-enable preemption
in those code paths - assuming that one is either able to tolerate the
occasional cache line bounce or retries in a long LL/SC sequence (a
rough sketch of what such an interface might look like is in the
postscript below). Any opinions on that?

> FWIW, I had a go at building percpu ops that didn't need to disable
> preemption, but that required LL/SC atomics, reserving a GPR for the
> percpu offset, and didn't result in a measurable difference in practice.
> The patches are at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Very interesting indeed. Thank you for sharing that,

- Paul
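
P.S. To make the counter-only idea above a bit more concrete, here is a
rough, untested sketch of what such an interface could look like. The
relaxed_percpu_counter_* names are hypothetical rather than an existing
kernel API, and the sketch assumes the architecture provides full
atomic RMW ops (e.g. RISC-V AMOs). No preempt_disable()/preempt_enable()
pair is needed: if the task migrates between fetching the per-cpu
pointer and performing the add, the update simply lands in the previous
CPU's slot, so the aggregate sum stays correct at the cost of an
occasional cross-CPU cache line bounce.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>

struct relaxed_percpu_counter {
	/* one atomic slot per CPU, allocated with alloc_percpu(atomic_long_t) */
	atomic_long_t __percpu *count;
};

static inline void relaxed_percpu_counter_add(struct relaxed_percpu_counter *c,
					      long delta)
{
	/*
	 * raw_cpu_ptr() may hand back a stale CPU's slot if we are
	 * migrated right after reading it, but the atomic add still
	 * lands on a coherent location, so no update is ever lost.
	 */
	atomic_long_add(delta, raw_cpu_ptr(c->count));
}

static inline long relaxed_percpu_counter_read(struct relaxed_percpu_counter *c)
{
	long sum = 0;
	int cpu;

	/* fold every CPU's slot into a single (momentary) total */
	for_each_possible_cpu(cpu)
		sum += atomic_long_read(per_cpu_ptr(c->count, cpu));

	return sum;
}

Increment and decrement would just be relaxed_percpu_counter_add(c, 1)
and relaxed_percpu_counter_add(c, -1); whether the extra read-side cost
is acceptable for the vmstat-style counters is exactly the open
question above.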