linux-arm-kernel.lists.infradead.org archive mirror
* Re: per-cpu thoughts
       [not found]       ` <CAJ+HfNh6bc1A8vskxqJ9CP11aPdTgrPS2SOgLGXMLyefogFLtw@mail.gmail.com>
@ 2019-03-11 15:26         ` Paul Walmsley
  2019-03-11 16:48           ` Mark Rutland
                             ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Paul Walmsley @ 2019-03-11 15:26 UTC (permalink / raw)
  To: Björn Töpel, Christopher Lameter
  Cc: Paul Walmsley, catalin.marinas, Palmer Dabbelt, will.deacon,
	Nick Kossifidis, linux-riscv, linux-arm-kernel

+ the ARM64 guys and lakml

On Mon, 11 Mar 2019, Björn Töpel wrote:

> On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com> wrote:
> >
> > On Mon, 11 Mar 2019, Björn Töpel wrote:
> >
> > > > Thanks a bunch!  I feel like the best option here is to just use the AMOs
> > > > without disabling preemption and ensuring that all other accesses are atomic
> > > > (via AMOs or LR/SC).  The only reason I can see that wouldn't be the way to go
> > > > would be if it requires non-arch modifications, as if we go down that path
> > > > we'll be able to handle the performance edge cases in the implementation.
> > > >
> > >
> > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt
> > > disable, only preemption.)
> >
> > If you disable preemption then you don't need AMO anymore. In fact that is
> > the default fallback generic implementation. Just don't do anything and you
> > already have that solution.
> 
> But the generic one disables interrupts, right?
> 
> I believe the rationale for RV is similar to ARM's; AMO+preemption
> disable regions are *slightly* better than the generic, but not as good
> as the IA one. Or am I missing something?

There's been a discussion going on in a private thread about this that I 
unfortunately didn't add you to.  The discussion is still ongoing, but I 
think Christoph and myself and a few other folks have agreed that the 
preempt_disable/enable is not needed for the amoadd approach.  This is 
since the apparent intention of the preemption disable/enable is to ensure 
the correctness of the counter increment; however there is no risk of 
incorrectness in an amoadd sequence since the atomic add is locked across 
all of the cache coherency domain.  Christoph, would you disagree with 
that characterization?

There are a few outstanding points that we're trying to talk through, but 
it should be fine for an initial implementation to start with the 
amoadd-based approach.

As far as the ARM LSE atomic implementation goes, I'm not an expert on 
those instructions. If those instructions are locked across all of the 
caches for the cores in the Linux system also, then they probably don't 
need the preempt_disable/enable either - assuming our collective 
understanding of the purpose of the preempt_disable/enable is correct.

All this is, of course, assuming there is no secondary purpose to the 
preempt_disable/enable that we haven't managed to elicit yet.


- Paul

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: per-cpu thoughts
  2019-03-11 15:26         ` per-cpu thoughts Paul Walmsley
@ 2019-03-11 16:48           ` Mark Rutland
  2019-03-11 18:39             ` Paul Walmsley
  2019-03-12  4:26           ` Christopher Lameter
  2019-03-22 14:51           ` Nick Kossifidis
  2 siblings, 1 reply; 14+ messages in thread
From: Mark Rutland @ 2019-03-11 16:48 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, catalin.marinas, Nick Kossifidis,
	Christopher Lameter, linux-riscv, linux-arm-kernel

Hi Paul,

On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
> + the ARM64 guys and lakml
> 
> On Mon, 11 Mar 2019, Björn Töpel wrote:
> 
> > On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com> wrote:
> > >
> > > On Mon, 11 Mar 2019, Björn Töpel wrote:
> > >
> > > > > Thanks a bunch!  I feel like the best option here is to just use the AMOs
> > > > > without disabling preemption and ensuring that all other accesses are atomic
> > > > > (via AMOs or LR/SC).  The only reason I can see that wouldn't be the way to go
> > > > > would be if it requires non-arch modifications, as if we go down that path
> > > > > we'll be able to handle the performance edge cases in the implementation.
> > > >
> > > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt
> > > > disable, only preemption.)
> > >
> > > If you disable preemption then you don't need AMO anymore. In fact that is
> > > the default fallback generic implementation. Just don't do anything and you
> > > already have that solution.
> > 
> > But the generic one disables interrupts, right?
> > 
> > I believe the rationale for RV is similar to ARM's; AMO+preemption
> > disable regions are *slightly* better than the generic, but not as good
> > as the IA one. Or am I missing something?
> 
> There's been a discussion going on in a private thread about this that I 
> unfortunately didn't add you to.  The discussion is still ongoing, but I 
> think Christoph and myself and a few other folks have agreed that the 
> preempt_disable/enable is not needed for the amoadd approach.  This is 
> since the apparent intention of the preemption disable/enable is to ensure 
> the correctness of the counter increment; however there is no risk of 
> incorrectness in an amoadd sequence since the atomic add is locked across 
> all of the cache coherency domain. 

We also thought that initially, but there's a subtle race that can
occur, and so we added code to disable preemption in commit:

  f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")

The problem on arm64 is that our atomics take a single base register,
and we have to generate the percpu address with separate instructions
from the atomic itself. That means we can get preempted between address
generation and the atomic, which is problematic for sequences like:

	// Thread-A			// Thread-B

	this_cpu_add(var)
					local_irq_disable(flags)
					...
					v = __this_cpu_read(var);
					v = some_function(v);
					__this_cpu_write(var, v);
					...
					local_irq_restore(flags)

... which can unexpectedly race as:


	// Thread-A			// Thread-B
	
	< generate CPU X addr >
	< preempted >

					< scheduled on CPU X >
					local_irq_disable(flags);
					v = __this_cpu_read(var);

	< scheduled on CPU Y >
	< add to CPU X's var >
					v = some_function(v);
					__this_cpu_write(var, v);
					local_irq_restore(flags);

... and hence we lose an update to a percpu variable.

I suspect RISC-V would have the same problem, unless its AMOs can
generate the percpu address and perform the update in a single
instruction.

> There are a few outstanding points that we're trying to talk through, but 
> it should be fine for an initial implementation to start with the 
> amoadd-based approach.
> 
> As far as the ARM LSE atomic implementation goes, I'm not an expert on 
> those instructions. If those instructions are locked across all of the 
> caches for the cores in the Linux system also, then they probably don't 
> need the preempt_disable/enable either - assuming our collective 
> understanding of the purpose of the preempt_disable/enable is correct.
> 
> All this is, of course, assuming there is no secondary purpose to the 
> preempt_disable/enable that we haven't managed to elicit yet.

Sorry to be the bearer of bad news! :)

FWIW, I had a go at building percpu ops that didn't need to disable
preemption, but that required LL/SC atomics, reserving a GPR for the
percpu offset, and didn't result in a measurable difference in practice.
The patches are at:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Thanks,
Mark.


* Re: per-cpu thoughts
  2019-03-11 16:48           ` Mark Rutland
@ 2019-03-11 18:39             ` Paul Walmsley
  2019-03-12 11:23               ` Mark Rutland
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Walmsley @ 2019-03-11 18:39 UTC (permalink / raw)
  To: Mark Rutland, Christopher Lameter
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, Paul Walmsley, catalin.marinas, Nick Kossifidis,
	linux-riscv, linux-arm-kernel

Hi Mark,

On Mon, 11 Mar 2019, Mark Rutland wrote:
> On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
> > On Mon, 11 Mar 2019, Björn Töpel wrote:
> > > 
> > > But the generic one disables interrupts, right?
> > > 
> > > I believe the rationale for RV is similar to ARM's; AMO+preemption
> > > disable regions are *slightly* better than the generic, but not as good
> > > as the IA one. Or am I missing something?
> > 
> > There's been a discussion going on in a private thread about this that I 
> > unfortunately didn't add you to.  The discussion is still ongoing, but I 
> > think Christoph and myself and a few other folks have agreed that the 
> > preempt_disable/enable is not needed for the amoadd approach.  This is 
> > since the apparent intention of the preemption disable/enable is to ensure 
> > the correctness of the counter increment; however there is no risk of 
> > incorrectness in an amoadd sequence since the atomic add is locked across 
> > all of the cache coherency domain. 
> 
> We also thought that initially, but there's a subtle race that can
> occur, and so we added code to disable preemption in commit:
> 
>   f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
> 
> The problem on arm64 is that our atomics take a single base register,
> and we have to generate the percpu address with separate instructions
> from the atomic itself. That means we can get preempted between address
> generation and the atomic, which is problematic for sequences like:
> 
> 	// Thread-A			// Thread-B
> 
> 	this_cpu_add(var)
> 					local_irq_disable(flags)
> 					...
> 					v = __this_cpu_read(var);
> 					v = some_function(v);
> 					__this_cpu_write(var, v);
> 					...
> 					local_irq_restore(flags)
> 
> ... which can unexpectedly race as:
> 
> 
> 	// Thread-A			// Thread-B
> 	
> 	< generate CPU X addr >
> 	< preempted >
> 
> 					< scheduled on CPU X >
> 					local_irq_disable(flags);
> 					v = __this_cpu_read(var);
> 
> 	< scheduled on CPU Y >
> 	< add to CPU X's var >
> 					v = some_function(v);
> 					__this_cpu_write(var, v);
> 					local_irq_restore(flags);
> 
> ... and hence we lose an update to a percpu variable.

Makes sense, and thank you very much for the detailed sequences.  
Open-coded per-cpu code sequences would also cause RISC-V to be exposed to 
the same issue.  Christoph mentioned this, but at the time my attention 
was focused on per-cpu counters, and not the broader range of per-cpu code 
sequences.

> I suspect RISC-V would have the same problem, unless its AMOs can
> generate the percpu address and perform the update in a single
> instruction.

My understanding is that many of Christoph's per-cpu performance concerns 
revolve around counters in the VM code, such as:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355

and probably elsewhere by now.  It may be worth creating a distinct API 
for those counters.  If only increment, decrement, and read operations are 
needed, there shouldn't be a need to disable or re-enable 
preemption in those code paths - assuming that one is either able to 
tolerate the occasional cache line bounce or retries in a long LL/SC 
sequence.  Any opinions on that?

> FWIW, I had a go at building percpu ops that didn't need to disable
> preemption, but that required LL/SC atomics, reserving a GPR for the
> percpu offset, and didn't result in a measurable difference in practice.
> The patches are at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Very interesting indeed.  Thank you for sharing that,


- Paul


* Re: per-cpu thoughts
  2019-03-11 15:26         ` per-cpu thoughts Paul Walmsley
  2019-03-11 16:48           ` Mark Rutland
@ 2019-03-12  4:26           ` Christopher Lameter
  2019-03-12 14:21             ` Paul Walmsley
  2019-03-22 14:51           ` Nick Kossifidis
  2 siblings, 1 reply; 14+ messages in thread
From: Christopher Lameter @ 2019-03-12  4:26 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv,
	linux-arm-kernel

On Mon, 11 Mar 2019, Paul Walmsley wrote:

> There's been a discussion going on in a private thread about this that I
> unfortunately didn't add you to.  The discussion is still ongoing, but I
> think Christoph and myself and a few other folks have agreed that the
> preempt_disable/enable is not needed for the amoadd approach.  This is
> since the apparent intention of the preemption disable/enable is to ensure
> the correctness of the counter increment; however there is no risk of
> incorrectness in an amoadd sequence since the atomic add is locked across
> all of the cache coherency domain.  Christoph, would you disagree with
> that characterization?

No. As I said before all write operations have to happen from the local
cpu to a per cpu segment.

> There are a few outstanding points that we're trying to talk through, but
> it should be fine for an initial implementation to start with the
> amoadd-based approach.
>
> As far as the ARM LSE atomic implementation goes, I'm not an expert on
> those instructions. If those instructions are locked across all of the
> caches for the cores in the Linux system also, then they probably don't
> need the preempt_disable/enable either - assuming our collective
> understanding of the purpose of the preempt_disable/enable is correct.
>
> All this is, of course, assuming there is no secondary purpose to the
> preempt_disable/enable that we haven't managed to elicit yet.

The intention is that the write occurs to the local per cpu segment and
that any per cpu RMW activities (like this_cpu_inc()) appear to be
"atomic" in regards to other processes running on the same cpu.



* Re: per-cpu thoughts
  2019-03-11 18:39             ` Paul Walmsley
@ 2019-03-12 11:23               ` Mark Rutland
  2019-03-12 16:01                 ` Paul Walmsley
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Rutland @ 2019-03-12 11:23 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, catalin.marinas, Nick Kossifidis,
	Christopher Lameter, linux-riscv, linux-arm-kernel

On Mon, Mar 11, 2019 at 11:39:56AM -0700, Paul Walmsley wrote:
> On Mon, 11 Mar 2019, Mark Rutland wrote:
> > On Mon, Mar 11, 2019 at 08:26:45AM -0700, Paul Walmsley wrote:
> > > On Mon, 11 Mar 2019, Björn Töpel wrote:
> > > > 
> > > > But the generic one disables interrupts, right?
> > > > 
> > > > I believe the rationale for RV is similar to ARM's; AMO+preemption
> > > > disable regions are *slightly* better than the generic, but not as good
> > > > as the IA one. Or am I missing something?
> > > 
> > > There's been a discussion going on in a private thread about this that I 
> > > unfortunately didn't add you to.  The discussion is still ongoing, but I 
> > > think Christoph and myself and a few other folks have agreed that the 
> > > preempt_disable/enable is not needed for the amoadd approach.  This is 
> > > since the apparent intention of the preemption disable/enable is to ensure 
> > > the correctness of the counter increment; however there is no risk of 
> > > incorrectness in an amoadd sequence since the atomic add is locked across 
> > > all of the cache coherency domain. 
> > 
> > We also thought that initially, but there's a subtle race that can
> > occur, and so we added code to disable preemption in commit:
> > 
> >   f3eab7184ddcd486 ("arm64: percpu: Make this_cpu accessors pre-empt safe")
> > 
> > The problem on arm64 is that our atomics take a single base register,
> > and we have to generate the percpu address with separate instructions
> > from the atomic itself. That means we can get preempted between address
> > generation and the atomic, which is problematic for sequences like:
> > 
> > 	// Thread-A			// Thread-B
> > 
> > 	this_cpu_add(var)
> > 					local_irq_disable(flags)
> > 					...
> > 					v = __this_cpu_read(var);
> > 					v = some_function(v);
> > 					__this_cpu_write(var, v);
> > 					...
> > 					local_irq_restore(flags)
> > 
> > ... which can unexpectedly race as:
> > 
> > 
> > 	// Thread-A			// Thread-B
> > 	
> > 	< generate CPU X addr >
> > 	< preempted >
> > 
> > 					< scheduled on CPU X >
> > 					local_irq_disable(flags);
> > 					v = __this_cpu_read(var);
> > 
> > 	< scheduled on CPU Y >
> > 	< add to CPU X's var >
> > 					v = some_function(v);
> > 					__this_cpu_write(var, v);
> > 					local_irq_restore(flags);
> > 
> > ... and hence we lose an update to a percpu variable.
> 
> Makes sense, and thank you very much for the detailed sequences.  
> Open-coded per-cpu code sequences would also cause RISC-V to be exposed to 
> the same issue.  Christoph mentioned this, but at the time my attention 
> was focused on per-cpu counters, and not the broader range of per-cpu code 
> sequences.
> 
> > I suspect RISC-V would have the same problem, unless its AMOs can
> > generate the percpu address and perform the update in a single
> > instruction.
> 
> My understanding is that many of Christoph's per-cpu performance concerns 
> revolve around counters in the VM code, such as:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355

The mod_*_state() functions are the only ones which mess with
preemption, and that should only mandate a few locally-visible
modifications of preempt_count.

Similar cases apply within SLUB, and I'd hoped to improve that with my
this-cpu-reg branch, but I didn't see a measurable improvement on
workloads I tried.

Have you seen a measurable performance problem here?

> and probably elsewhere by now.  It may be worth creating a distinct API 
> for those counters.  If only increment, decrement, and read operations are 
> needed, there shouldn't be a need to disable or re-enable 
> preemption in those code paths - assuming that one is either able to 
> tolerate the occasional cache line bounce or retries in a long LL/SC 
> sequence.  Any opinions on that?

I'm afraid I don't understand this code well enough to say whether that
would be safe.

It's not clear to me whether there would be a measurable performance
difference, as I'd expect fiddling with preempt_count to be relatively
cheap. The AMOs themselves don't need to enforce ordering here, and only
a few compiler barriers are necessary.

Thanks,
Mark.

> > FWIW, I had a go at building percpu ops that didn't need to disable
> > preemption, but that required LL/SC atomics, reserving a GPR for the
> > percpu offset, and didn't result in a measurable difference in practice.
> > The patches are at:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?h=arm64/this-cpu-reg&id=84ee5f23f93d4a650e828f831da9ed29c54623c5
> 
> Very interesting indeed.  Thank you for sharing that,
> 
> - Paul



* Re: per-cpu thoughts
  2019-03-12  4:26           ` Christopher Lameter
@ 2019-03-12 14:21             ` Paul Walmsley
  2019-03-12 17:42               ` Christopher Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Walmsley @ 2019-03-12 14:21 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Paul Walmsley, catalin.marinas, Björn Töpel,
	Palmer Dabbelt, will.deacon, Paul Walmsley, Nick Kossifidis,
	linux-riscv, linux-arm-kernel

On Tue, 12 Mar 2019, Christopher Lameter wrote:

> On Mon, 11 Mar 2019, Paul Walmsley wrote:
> 
> > There's been a discussion going on in a private thread about this that I
> > unfortunately didn't add you to.  The discussion is still ongoing, but I
> > think Christoph and myself and a few other folks have agreed that the
> > preempt_disable/enable is not needed for the amoadd approach.  This is
> > since the apparent intention of the preemption disable/enable is to ensure
> > the correctness of the counter increment; however there is no risk of
> > incorrectness in an amoadd sequence since the atomic add is locked across
> > all of the cache coherency domain.  Christoph, would you disagree with
> > that characterization?
> 
> No. As I said before all write operations have to happen from the local
> cpu to a per cpu segment.

In light of the thread that's going on with Mark, given the open-coded 
__this_cpu_* operations, it's clear we don't have a choice - we'd need to 
disable preemption for the basic operations.

The counters, though, may not need the preemption disable/reenable.  
Christoph, you expressed earlier that you think the overhead of the 
preempt_disable/enable is quite high.  Do you think it's worth creating a 
separate, restricted implementation for per-cpu counters?


- Paul


* Re: per-cpu thoughts
  2019-03-12 11:23               ` Mark Rutland
@ 2019-03-12 16:01                 ` Paul Walmsley
  2019-03-12 17:34                   ` Christopher Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Paul Walmsley @ 2019-03-12 16:01 UTC (permalink / raw)
  To: Mark Rutland, Christopher Lameter
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, Paul Walmsley, catalin.marinas, Nick Kossifidis,
	linux-riscv, linux-arm-kernel

Hi Mark,

On Tue, 12 Mar 2019, Mark Rutland wrote:

> On Mon, Mar 11, 2019 at 11:39:56AM -0700, Paul Walmsley wrote:
> 
> > My understanding is that many of Christoph's per-cpu performance concerns 
> > revolve around counters in the VM code, such as:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/vmstat.c#n355
> 
> The mod_*_state() functions are the only ones which mess with
> preemption, and that should only mandate a few locally-visible
> modifications of preempt_count.

Also, the __{inc,dec}_*_state() functions call __this_cpu_{inc,dec}_return(),
which tweaks the preemption count.

> Similar cases apply within SLUB, and I'd hoped to improve that with my
> this-cpu-reg branch, but I didn't see a measurable improvement on
> workloads I tried.

That certainly suggests that all of this could be much ado about 
nothing, or at least very little.  One observation is that some of the 
performance concerns that Christoph is expressing here may be about 
ensuring predictable and minimal latency bounds, rather than raw 
throughput.

> Have you seen a measureable performance problem here?

Not yet.  The two motivations at the moment are:

1. to determine how our initial per-arch implementation for percpu.h 
should look, and 

2. to get a high-level view of whether unlocked base + offset increment 
instructions are worthwhile, from people who know more than I do about 
them.

So far the counters look like a distinct use-case - one that might have 
relaxed requirements wrt preemption changes.

> > and probably elsewhere by now.  It may be worth creating a distinct API 
> > for those counters.  If only increment, decrement, and read operations are 
> > needed, there shouldn't be a need to disable or re-enable 
> > preemption in those code paths - assuming that one is either able to 
> > tolerate the occasional cache line bounce or retries in a long LL/SC 
> > sequence.  Any opinions on that?
> 
> I'm afraid I don't understand this code well enough to say whether that
> would be safe.

That makes two of us.  Have followed up with Christoph in a separate 
thread with lakml cc'ed.

> It's not clear to me whether there would be a measureable performance
> difference, as I'd expect fiddling with preempt_count to be relatively
> cheap. The AMOs themselves don't need to enforce ordering here, and only
> a few compiler barriers are necessary.

OK.  I have been assuming that the risk of a scheduler call in 
preempt_enable() is what Christoph is concerned about here:

https://lore.kernel.org/linux-riscv/b0653f7a6f1bc0c9329d37de690d3bed@mailhost.ics.forth.gr/T/#m6e609e26a9e5405c4a7e2dbd5ca8c969cada5c36

If it is possible to eliminate the latency risk from a 'simple' counter 
increment/decrement by creating a restricted API, that may be worthwhile.

Christoph has also been concerned that the AMO operations will carry an 
unacceptable performance overhead.  But the RISC-V AMO operations can be 
written such that they don't have the ordering restrictions that the Intel 
LOCK-prefixed operations do, and thus those concerns may not apply -- at 
least not to the same extent.  Perhaps this is also true for the ARM LSE 
atomics.


- Paul


* Re: per-cpu thoughts
  2019-03-12 16:01                 ` Paul Walmsley
@ 2019-03-12 17:34                   ` Christopher Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-03-12 17:34 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Mark Rutland, Paul Walmsley, Björn Töpel,
	Palmer Dabbelt, will.deacon, catalin.marinas, Nick Kossifidis,
	linux-riscv, linux-arm-kernel

On Tue, 12 Mar 2019, Paul Walmsley wrote:

> > Similar cases apply within SLUB, and I'd hoped to improve that with my
> > this-cpu-reg branch, but I didn't see a measurable improvement on
> > workloads I tried.
>
> That certainly suggests that all of this could be much ado about
> nothing, or at least very little.  One observation is that some of the
> performance concerns that Christoph is expressing here may be about
> ensuring predictable and minimal latency bounds, rather than raw
> throughput.

The performance concerns are mainly about scaling RISC V to many cores,
which will create contention for counter handling. The scalable counter
system (ZVCs) was developed to address these issues, and later the
this_cpu operations were added to optimize that performance further.

At this point on RISC V with just a couple of cores you may not see too
much of an effect. In fact a UP system would be running faster if it does
not use these schemes since there is no contention. Scalability of counter
operations becomes a challenge as core counts increase.

> OK.  I have been assuming that the risk of a scheduler call in
> preempt_enable() is what Christoph is concerned about here:
>
> https://lore.kernel.org/linux-riscv/b0653f7a6f1bc0c9329d37de690d3bed@mailhost.ics.forth.gr/T/#m6e609e26a9e5405c4a7e2dbd5ca8c969cada5c36
>
> If it is possible to eliminate the latency risk from a 'simple' counter
> increment/decrement by creating a restricted API, that may be worthwhile.
>
> Christoph has also been concerned that the AMO operations will carry an
> unacceptable performance overhead.  But the RISC-V AMO operations can be
> written such that they don't have the ordering restrictions that the Intel
> LOCK-prefixed operations do, and thus those concerns may not apply -- at
> least not to the same extent.  Perhaps this is also true for the ARM LSE
> atomics.

My main concern at this point is to ensure that RISC V has the proper
setup for the future and that decisions are made so that scaling up of RISC V
to hundreds of cores (if not more) does not become a bottleneck. One of
the use cases here for us is likely to have extreme parallel operations
for HPC style compute.

The main issue for the core VM is to limit the overhead of the statistics and
counter operations. Introducing atomic operations in key fault paths has
caused performance regressions in the past.



* Re: per-cpu thoughts
  2019-03-12 14:21             ` Paul Walmsley
@ 2019-03-12 17:42               ` Christopher Lameter
  2019-03-12 17:59                 ` Gary Guo
  2019-03-13 20:15                 ` Paul Walmsley
  0 siblings, 2 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-03-12 17:42 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv,
	linux-arm-kernel

On Tue, 12 Mar 2019, Paul Walmsley wrote:

> The counters, though, may not need the preemption disable/reenable.
> Christoph, you expressed earlier that you think the overhead of the
> preempt_disable/enable is quite high.  Do you think it's worth creating a
> separate, restricted implementation for per-cpu counters?

As I have always said: I would like to have per cpu atomic instructions
added on RISC V that work like those on Intel: single-instruction counter
operations relative to a per cpu base address, please.

I think the attempt to reengineer the core counter mechanisms on Linux is
not that realistic and would require you to first know the cross platform
issues that have driven the development of these things in the first
place. Sorry that I have been just superficially involved in these
discussions but I have a hard time seeing this going anywhere. There are
already established core implementations and various arches' takes on this
issue, and those have been around for longer than a decade. It will be hard
to come up with something better.

Can we focus on the RISC V instruction support? I also do not think that
this is a currently pressing issue but it will be when you scale up RISC V
to many cores (especially hundreds or thousands of concurrent hardware
threads like what our interest is likely going to be in coming years and
likely also when RISC V is going to be used for enterprise / cloud data
services).

* Re: per-cpu thoughts
  2019-03-12 17:42               ` Christopher Lameter
@ 2019-03-12 17:59                 ` Gary Guo
  2019-03-13 18:58                   ` Christopher Lameter
  2019-03-13 20:15                 ` Paul Walmsley
  1 sibling, 1 reply; 14+ messages in thread
From: Gary Guo @ 2019-03-12 17:59 UTC (permalink / raw)
  To: Christopher Lameter, Paul Walmsley
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, catalin.marinas, Nick Kossifidis, linux-riscv,
	linux-arm-kernel

The added instruction would require three operands (base, offset,
value), which raises an obvious encoding problem. Also, as RISC-V's
base ISA is already frozen, even if we get this instruction it will be
an extension, and it is not going to be implemented by all hardware.
Linux will still need to support hardware without it.
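On hardware without such an extension the operation has to be synthesized from the base ISA: first an address computation from the per-cpu base and the offset, then the atomic memory operation (on RISC-V, an add from a per-cpu base register followed by something like `amoadd.d zero, a0, (t0)`). That split is also where the encoding problem comes from — base, offset, and value make three source operands. A hedged C11 model of the two-step sequence:

```c
#include <stdatomic.h>
#include <stddef.h>

/* The separate address computation: per-cpu base plus offset.  A
 * single base+offset+value instruction would fold this add in, which
 * is exactly the third-operand encoding problem described above. */
static _Atomic long *percpu_ptr(_Atomic long *base, size_t off_words)
{
	return base + off_words;
}

static long percpu_fetch_add(_Atomic long *base, size_t off_words, long v)
{
	_Atomic long *p = percpu_ptr(base, off_words);
	/* A migration landing between percpu_ptr() and this RMW would
	 * update the old CPU's slot -- still atomic and never corrupt,
	 * just charged to the "wrong" CPU's counter. */
	return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}
```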

On 12/03/2019 17:42, Christopher Lameter wrote:
> On Tue, 12 Mar 2019, Paul Walmsley wrote:
> 
>> The counters, though, may not need the preemption disable/reenable.
>> Christoph, you expressed earlier that you think the overhead of the
>> preempt_disable/enable is quite high.  Do you think it's worth creating a
>> separate, restricted implementation for per-cpu counters?
> 
> As I have always said: I would like to have per-cpu atomic instructions
> added on RISC-V that work like those on Intel. Single instructions that
> operate on counters addressed relative to a per-cpu base, please.
>
> I think the attempt to reengineer the core counter mechanisms in Linux is
> not that realistic; it would require you to first know the cross-platform
> issues that have driven the development of these things in the first
> place. Sorry that I have been only superficially involved in these
> discussions, but I have a hard time seeing this going anywhere. There are
> already established core implementations, and the various arches' takes
> on this issue have been around for longer than a decade. It will be hard
> to come up with something better.
>
> Can we focus on the RISC-V instruction support? I also do not think that
> this is a currently pressing issue, but it will be when you scale RISC-V
> up to many cores (especially hundreds or thousands of concurrent hardware
> threads, which is likely where our interest is going to be in the coming
> years, and likely also when RISC-V is used for enterprise / cloud data
> services).
> 

* Re: per-cpu thoughts
  2019-03-12 17:59                 ` Gary Guo
@ 2019-03-13 18:58                   ` Christopher Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-03-13 18:58 UTC (permalink / raw)
  To: Gary Guo
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, Paul Walmsley, catalin.marinas, Nick Kossifidis,
	linux-riscv, linux-arm-kernel

On Tue, 12 Mar 2019, Gary Guo wrote:

> The added instruction would require three operands (base, offset,
> value), which raises an obvious encoding problem. Also, as RISC-V's
> base ISA is already frozen, even if we get this instruction it will be
> an extension, and it is not going to be implemented by all hardware.
> Linux will still need to support hardware without it.

We have the ability to dynamically patch the kernel at boot as the
features of the architecture are detected. That won't be an issue; we
already do this on x86 for numerous variants of the instruction set.

The per-cpu atomic instructions require a base register that contains the
pointer to the base of the per-cpu area of the currently executing
processor, and then an offset that may be an immediate or also in a
register.

And yes, then the value of the increment for the most general case of
counter handling. We also need RMW operations like cmpxchg.
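The boot-time patching referred to here is the kernel's alternatives mechanism, which rewrites instruction bytes in place once CPU features are known. Its effect can be sketched with a boot-time-resolved function pointer (closer in spirit to the kernel's static calls); all names below are illustrative, not real kernel APIs:

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic long counter;

/* Variant used when the (hypothetical) per-cpu AMO extension exists;
 * the real implementation would emit the single extension instruction. */
static void pcpu_add_ext(long v)
{
	atomic_fetch_add_explicit(&counter, v, memory_order_relaxed);
}

/* Fallback for base-ISA hardware. */
static void pcpu_add_fallback(long v)
{
	atomic_fetch_add_explicit(&counter, v, memory_order_relaxed);
}

/* Resolved once during "boot", analogous to apply_alternatives() on
 * x86 patching call sites after feature detection. */
static void (*pcpu_add)(long) = pcpu_add_fallback;

static void boot_patch(bool cpu_has_pcpu_amo)
{
	pcpu_add = cpu_has_pcpu_amo ? pcpu_add_ext : pcpu_add_fallback;
}
```

The real mechanism avoids even the indirect-call overhead by patching the instruction stream itself, which is why it is viable on hot paths like counter increments.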


* Re: per-cpu thoughts
  2019-03-12 17:42               ` Christopher Lameter
  2019-03-12 17:59                 ` Gary Guo
@ 2019-03-13 20:15                 ` Paul Walmsley
  1 sibling, 0 replies; 14+ messages in thread
From: Paul Walmsley @ 2019-03-13 20:15 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Paul Walmsley, catalin.marinas, Björn Töpel,
	Palmer Dabbelt, will.deacon, Paul Walmsley, Nick Kossifidis,
	linux-riscv, linux-arm-kernel

Hi Christoph

On Tue, 12 Mar 2019, Christopher Lameter wrote:

> On Tue, 12 Mar 2019, Paul Walmsley wrote:
> 
> > The counters, though, may not need the preemption disable/reenable.
> > Christoph, you expressed earlier that you think the overhead of the
> > preempt_disable/enable is quite high.  Do you think it's worth creating a
> > separate, restricted implementation for per-cpu counters?
> 
> As I have always said: I would like to have per-cpu atomic instructions
> added on RISC-V that work like those on Intel. Single instructions that
> operate on counters addressed relative to a per-cpu base, please.

Speaking from a RISC-V point of view, you could propose those kinds of 
instructions as a RISC-V extension.  One good way might be to start a 
discussion in the isa-dev Google Group first.  Others who agree with you 
might speak up and be willing to help with the technical details:

https://groups.google.com/a/groups.riscv.org/forum/#!forum/isa-dev

Organizations that are members of the RISC-V Foundation can propose a new 
RISC-V Foundation working group to produce new instructions.  There are a 
few of these extension task groups in progress now.

> I think the attempt to reengineer the core counter mechanisms in Linux is
> not that realistic; it would require you to first know the cross-platform
> issues that have driven the development of these things in the first
> place. Sorry that I have been only superficially involved in these
> discussions, but I have a hard time seeing this going anywhere. There are
> already established core implementations, and the various arches' takes
> on this issue have been around for longer than a decade. It will be hard
> to come up with something better.

You might be right and you are certainly more of an expert in the per-cpu 
code than most of us.  But if the existing implementation was designed 
around x86 instruction set constraints which might not apply to RISC-V (or 
ARM), there might be room for improvement.

> Can we focus on the RISC-V instruction support? I also do not think that
> this is a currently pressing issue, but it will be when you scale RISC-V
> up to many cores (especially hundreds or thousands of concurrent hardware
> threads, which is likely where our interest is going to be in the coming
> years, and likely also when RISC-V is used for enterprise / cloud data
> services).

If the software discussion isn't productive, you could always take your 
concern directly to the architecture and microarchitecture mailing lists 
involved with the RISC-V Foundation.  That way there's less need to talk 
it through with those of us who are primarily software focused.  I would 
expect those engineers will probably go through a similar process:

1. they would want to become familiar with the minimal code sequences

2. if they think that an alternate code sequence would perform better, or 
just as well, they would probably expect someone to try it first

3. if they think that a given microarchitectural optimization would 
resolve the issue, without needing to add new instructions, they would 
probably be less interested in adding them

4. they might want to see traces that illustrate that there is an 
underlying problem

This is more or less the approach that we're trying to take here, although 
with more of a software focus.  We know you as a serious contributor and 
maintainer from the Linux community, which the hardware engineers might 
not, so it's easier to justify spending more time initially trying to 
understand the issue.  Also, if it turns out that software optimizations 
are possible, it may help other architectures like ARM.


- Paul


* Re: per-cpu thoughts
  2019-03-11 15:26         ` per-cpu thoughts Paul Walmsley
  2019-03-11 16:48           ` Mark Rutland
  2019-03-12  4:26           ` Christopher Lameter
@ 2019-03-22 14:51           ` Nick Kossifidis
  2019-03-22 17:57             ` Christopher Lameter
  2 siblings, 1 reply; 14+ messages in thread
From: Nick Kossifidis @ 2019-03-22 14:51 UTC (permalink / raw)
  To: Paul Walmsley
  Cc: Paul Walmsley, Björn Töpel, Palmer Dabbelt,
	will.deacon, catalin.marinas, Nick Kossifidis,
	Christopher Lameter, linux-riscv, linux-arm-kernel

On 2019-03-11 17:26, Paul Walmsley wrote:
> + the ARM64 guys and lakml
> 
> On Mon, 11 Mar 2019, Björn Töpel wrote:
> 
>> On Mon, 11 Mar 2019 at 15:56, Christopher Lameter <cl@linux.com> wrote:
>> >
>> > On Mon, 11 Mar 2019, Björn Töpel wrote:
>> >
>> > > > Thanks a bunch!  I feel like the best option here is to just use the AMOs
>> > > > without disabling preemption and ensuring that all other accesses are atomic
>> > > > (via AMOs or LR/SC).  The only reason I can see that wouldn't be the way to go
>> > > > would be if it requires non-arch modifications, as if we go down that path
>> > > > we'll be able to handle the performance edge cases in the implementation.
>> > > >
>> > >
>> > > Hmm, you mean AMO *with* preempt_{dis,en}able, right? (No, interrupt
>> > > disable, only preemption.)
>> >
>> > If you disable preemption then you don't need AMO anymore. In fact that is
>> > the default fallback generic implementation. Just don't do anything and you
>> > already have that solution.
>> 
>> But the generic one disables interrupts, right?
>> 
>> I believe the rationale for RV is similar to ARM's; AMO+preemption
>> disable regions are *slightly* better than the generic, but not as good
>> as the IA one. Or am I missing something?
> 
> There's been a discussion going on in a private thread about this that I
> unfortunately didn't add you to.  The discussion is still ongoing, but I
> think Christoph and myself and a few other folks have agreed that the
> preempt_disable/enable is not needed for the amoadd approach.  This is
> since the apparent intention of the preemption disable/enable is to
> ensure the correctness of the counter increment; however there is no
> risk of incorrectness in an amoadd sequence, since the atomic add is
> locked across all of the cache coherency domain.  Christoph, would you
> disagree with that characterization?
> 

Some further input on this...

https://www.kernel.org/doc/Documentation/preempt-locking.txt

"Note that you do not need to explicitly prevent preemption if you are
holding any locks or interrupts are disabled, since preemption is
implicitly disabled in those cases.

But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
disabling preemption - any cond_resched() or cond_resched_lock() might
trigger a reschedule if the preempt count is 0. A simple printk() might
trigger a reschedule. So use this implicit preemption-disabling property
only if you know that the affected codepath does not do any of this. Best
policy is to use this only for small, atomic code that you wrote and which
calls no complex functions."
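The hazard that document describes — a reschedule landing in the middle of a plain read-modify-write — can be made deterministic in a userspace model. Here the "preemption point" is an explicit callback standing in for, say, a printk() that actually triggered a reschedule:

```c
static long counter;

/* Plain RMW with an explicit window where a reschedule could land. */
static void unsafe_add(long v, void (*preemption_point)(void))
{
	long tmp = counter;        /* read */
	if (preemption_point)
		preemption_point(); /* the reschedule happens here */
	counter = tmp + v;         /* write-back clobbers any update
				    * made during the window */
}

/* The task we switched to also touches the counter. */
static void other_task_runs(void)
{
	counter += 100;
}
```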


* Re: per-cpu thoughts
  2019-03-22 14:51           ` Nick Kossifidis
@ 2019-03-22 17:57             ` Christopher Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-03-22 17:57 UTC (permalink / raw)
  To: Nick Kossifidis
  Cc: Paul Walmsley, catalin.marinas, Björn Töpel,
	Palmer Dabbelt, will.deacon, Paul Walmsley, linux-riscv,
	linux-arm-kernel

On Fri, 22 Mar 2019, Nick Kossifidis wrote:

> But keep in mind that 'irqs disabled' is a fundamentally unsafe way of
> disabling preemption - any cond_resched() or cond_resched_lock() might trigger
> a reschedule if the preempt count is 0. A simple printk() might trigger a
> reschedule. So use this implicit preemption-disabling property only if you
> know that the affected codepath does not do any of this. Best policy is to use
> this only for small, atomic code that you wrote and which calls no complex
> functions."

Really? The preempt count tracks hardirq status if you use the
kernel-provided way of disabling and enabling interrupts. See

linux/include/preempt.h:

/*
 * We put the hardirq and softirq counter into the preemption
 * counter. The bitmask has the following meaning:
 *
 * - bits 0-7 are the preemption count (max preemption depth: 256)
 * - bits 8-15 are the softirq count (max # of softirqs: 256)
 *
 * The hardirq count could in theory be the same as the number of
 * interrupts in the system, but we run all interrupt handlers with
 * interrupts disabled, so we cannot have nesting interrupts. Though
 * there are a few palaeontologic drivers which reenable interrupts in
 * the handler, so we need more than one bit here.
 *
 *         PREEMPT_MASK:        0x000000ff
 *         SOFTIRQ_MASK:        0x0000ff00
 *         HARDIRQ_MASK:        0x000f0000
 *             NMI_MASK:        0x00100000
 * PREEMPT_NEED_RESCHED:        0x80000000
 */
#define PREEMPT_BITS    8
#define SOFTIRQ_BITS    8
#define HARDIRQ_BITS    4
#define NMI_BITS        1
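The layout in that comment can be checked with a few lines of C. The masks below are the upstream constants quoted above; the shift values follow directly from the mask positions:

```c
#define PREEMPT_MASK  0x000000ffU
#define SOFTIRQ_MASK  0x0000ff00U
#define HARDIRQ_MASK  0x000f0000U
#define NMI_MASK      0x00100000U

#define PREEMPT_SHIFT 0
#define SOFTIRQ_SHIFT 8
#define HARDIRQ_SHIFT 16
#define NMI_SHIFT     20

/* Extract the preemption-depth field (bits 0-7). */
static unsigned int preempt_field(unsigned int count)
{
	return (count & PREEMPT_MASK) >> PREEMPT_SHIFT;
}

/* Extract the hardirq nesting field (bits 16-19). */
static unsigned int hardirq_field(unsigned int count)
{
	return (count & HARDIRQ_MASK) >> HARDIRQ_SHIFT;
}

/* in_interrupt()-style predicate: any hardirq/softirq/NMI nesting. */
static int in_irq_context(unsigned int count)
{
	return (count & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK)) != 0;
}
```

This is the point Christopher is making: because hardirq entry bumps this same counter, code that disables interrupts through the kernel's helpers is visible to the preemption machinery.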



end of thread, other threads:[~2019-03-22 17:57 UTC | newest]

Thread overview: 14+ messages
     [not found] <CAJ+HfNgSx49dRXDdJGovb=auAjSB703bN9tULjibW1OKHuHcPg@mail.gmail.com>
     [not found] ` <mhng-79f13983-1ea9-4f9a-949a-87dac2060f42@palmer-si-x1c4>
     [not found]   ` <CAJ+HfNhmUdz4y81VNnTOsvyvSNw3i3E=qR+F0Qa7XThSOAiJ1A@mail.gmail.com>
     [not found]     ` <010001696d414b3a-d35fa0a2-01fa-4e8c-be57-ff703610755a-000000@email.amazonses.com>
     [not found]       ` <CAJ+HfNh6bc1A8vskxqJ9CP11aPdTgrPS2SOgLGXMLyefogFLtw@mail.gmail.com>
2019-03-11 15:26         ` per-cpu thoughts Paul Walmsley
2019-03-11 16:48           ` Mark Rutland
2019-03-11 18:39             ` Paul Walmsley
2019-03-12 11:23               ` Mark Rutland
2019-03-12 16:01                 ` Paul Walmsley
2019-03-12 17:34                   ` Christopher Lameter
2019-03-12  4:26           ` Christopher Lameter
2019-03-12 14:21             ` Paul Walmsley
2019-03-12 17:42               ` Christopher Lameter
2019-03-12 17:59                 ` Gary Guo
2019-03-13 18:58                   ` Christopher Lameter
2019-03-13 20:15                 ` Paul Walmsley
2019-03-22 14:51           ` Nick Kossifidis
2019-03-22 17:57             ` Christopher Lameter
