* Should this_cpu_read() be volatile?
From: Nadav Amit @ 2018-12-08  0:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Vlastimil Babka, Linux-MM, LKML, X86 ML,
	Ingo Molnar, Thomas Gleixner

[Resend, changing title & adding LKML and some others]

On Dec 7, 2018, at 3:12 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

[ We can start a new thread, since I have the tendency to hijack threads. ]

> On Dec 7, 2018, at 12:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Dec 06, 2018 at 09:26:24AM -0800, Nadav Amit wrote:
>>> On Dec 6, 2018, at 2:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, Dec 06, 2018 at 12:28:26AM -0800, Nadav Amit wrote:
>>>> [ +Peter ]
>>>> 

[snip]

>>>> 
>>>> *But* there is one thing that may require some attention: patch
>>>> b59167ac7bafd ("x86/percpu: Fix this_cpu_read()") set ordering constraints
>>>> on the VM_ARGS() evaluation, and it appears to impose (unnecessary)
>>>> constraints on other pieces of code as well.
>>>> 
>>>> These constraints are due to the addition of the volatile keyword for
>>>> this_cpu_read() by the patch. This affects at least 68 functions in my
>>>> kernel build, some of which are hot (I think), e.g., finish_task_switch(),
>>>> smp_x86_platform_ipi() and select_idle_sibling().
>>>> 
>>>> Peter, perhaps the solution was too big of a hammer? Is it possible instead
>>>> to create a separate "this_cpu_read_once()" with the volatile keyword? Such
>>>> a function can be used for native_sched_clock() and other seqlocks, etc.
>>> 
>>> No. As the commit message says, this_cpu_read() _must_ imply READ_ONCE(). If
>>> you want something else, use something else, there's plenty other
>>> options available.
>>> 
>>> There's this_cpu_op_stable(), but also __this_cpu_read() and
>>> raw_this_cpu_read() (which currently don't differ from this_cpu_read()
>>> but could).
>> 
>> Would setting the inline assembly memory operand both as input and output be
>> better than using the “volatile”?
> 
> I don't know.. I'm forever befuddled by the exact semantics of gcc
> inline asm.
> 
>> I think that if you do that, the compiler would treat this_cpu_read()
>> as something that changes the per-cpu variable, which would make it invalid
>> to re-read the value. At the same time, it would not prevent reordering the
>> read with other stuff.
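
[ For concreteness, the idiom in question would look roughly like this.
  This is an illustrative sketch, not kernel code: the function name is
  made up, and the asm merely models the percpu load. Listing the slot
  as a "+m" operand makes GCC assume the asm may modify it, so a
  previously read value cannot be reused, while the asm itself is not
  volatile and so can still be reordered, or dropped if its result and
  the memory "write" are unused: ]

        static inline unsigned int percpu_read_rmw(unsigned int *slot)
        {
                unsigned int val;

                /* "+m" tells GCC the asm may read *and* write *slot */
                asm("movl %1, %0" : "=r" (val), "+m" (*slot));
                return val;
        }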
> 
> So the thing is; as I wrote, the generic version of this_cpu_*() is:
> 
> 	local_irq_save();
> 	__this_cpu_*();
> 	local_irq_restore();
> 
> And per local_irq_{save,restore}() including compiler barriers that
> cannot be reordered around either.
> 
> And per the principle of least surprise, I think our primitives should
> have similar semantics.

I guess so, but as you’ll see below, the end result is ugly.

> I'm actually having difficulty finding the this_cpu_read() in any of the
> functions you mention, so I cannot make any concrete suggestions other
> than pointing at the alternative functions available.


So I dug deeper into the code to understand a couple of differences. In the
case of select_idle_sibling(), the patch (Peter's) increases the function
code size by 123 bytes (over a baseline of 986). The per-cpu variable is
read through the following call chain:

	select_idle_sibling()
	=> select_idle_cpu()
	=> local_clock()
	=> raw_smp_processor_id()

This results in two more calls to sched_clock_cpu(), as the compiler assumes
the processor id may change in between (which obviously cannot happen here).
There may be more changes around that I did not fully analyze. But at the
very least, reading the processor id should not be "volatile".
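
[ To see the effect in isolation, here is a minimal user-space model.
  All names are made up; the asm stands in for the %gs-relative percpu
  load the kernel actually emits: ]

        static int cpu_number;                  /* models the per-cpu variable */

        static inline int cpu_volatile(void)
        {
                int cpu;

                /* volatile: GCC must emit the load at every call site */
                asm volatile("movl %1, %0" : "=r" (cpu) : "m" (cpu_number));
                return cpu;
        }

        static inline int cpu_plain(void)
        {
                int cpu;

                /* not volatile: GCC may reuse a previously loaded value */
                asm("movl %1, %0" : "=r" (cpu) : "m" (cpu_number));
                return cpu;
        }

        int sum_volatile(void) { return cpu_volatile() + cpu_volatile(); } /* two loads */
        int sum_plain(void) { return cpu_plain() + cpu_plain(); }          /* may fold to one */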

As for finish_task_switch(), the impact is only a few bytes, but still
unnecessary. It appears that with your patch, preempt_count() causes multiple
reads of __preempt_count in this code:

       if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
                     "corrupted preempt_count: %s/%d/0x%x\n",
                     current->comm, current->pid, preempt_count()))
               preempt_count_set(FORK_PREEMPT_COUNT);

Again, this is unwarranted, as the preemption count cannot be changed by an
interrupt (any interrupt restores it before returning to this code).
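
[ For illustration, caching the count in a local forces a single read of
  __preempt_count regardless of how this_cpu_read() is implemented. A
  sketch of the shape such code could take, not a proposed patch: ]

       {
               int count = preempt_count();    /* read __preempt_count once */

               if (WARN_ONCE(count != 2*PREEMPT_DISABLE_OFFSET,
                             "corrupted preempt_count: %s/%d/0x%x\n",
                             current->comm, current->pid, count))
                       preempt_count_set(FORK_PREEMPT_COUNT);
       }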




* Re: Should this_cpu_read() be volatile?
From: Peter Zijlstra @ 2018-12-08 10:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Matthew Wilcox, Vlastimil Babka, Linux-MM, LKML, X86 ML,
	Ingo Molnar, Thomas Gleixner

On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote:

> > I'm actually having difficulty finding the this_cpu_read() in any of the
> > functions you mention, so I cannot make any concrete suggestions other
> > than pointing at the alternative functions available.
> 
> 
> So I dug deeper into the code to understand a couple of differences. In the
> case of select_idle_sibling(), the patch (Peter's) increases the function
> code size by 123 bytes (over a baseline of 986). The per-cpu variable is
> read through the following call chain:
> 
> 	select_idle_sibling()
> 	=> select_idle_cpu()
> 	=> local_clock()
> 	=> raw_smp_processor_id()
> 
> This results in two more calls to sched_clock_cpu(), as the compiler assumes
> the processor id may change in between (which obviously cannot happen here).

That is the thing with raw_smp_processor_id(): it is allowed to be used
in preemptible context, and there it _obviously_ can change between
subsequent invocations.

So again, this change is actually good.

If we want to fix select_idle_cpu(), we should maybe not use
local_clock() there but use sched_clock_cpu() with a stable argument;
this code runs with IRQs disabled, and therefore the CPU number is stable
for us here.
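
[ Roughly like this; an untested sketch, with the function body
  paraphrased rather than quoted from the kernel source: ]

        static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
        {
                int this = smp_processor_id();          /* IRQs are off: stable */
                u64 time = sched_clock_cpu(this);

                /* ... scan for an idle CPU ... */

                time = sched_clock_cpu(this) - time;    /* no forced re-read of the id */
                /* ... */
                return target;
        }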

> There may be more changes around that I did not fully analyze. But at
> the very least, reading the processor id should not be "volatile".
> 
> As for finish_task_switch(), the impact is only a few bytes, but still
> unnecessary. It appears that with your patch, preempt_count() causes multiple
> reads of __preempt_count in this code:
> 
>        if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
>                      "corrupted preempt_count: %s/%d/0x%x\n",
>                      current->comm, current->pid, preempt_count()))
>                preempt_count_set(FORK_PREEMPT_COUNT);

My patch proposed here:

  https://marc.info/?l=linux-mm&m=154409548410209

would actually fix that one, I think: preempt_count() uses
raw_cpu_read_4(), which will lose the volatile with that patch.


* Re: Should this_cpu_read() be volatile?
From: Nadav Amit @ 2018-12-10  0:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Vlastimil Babka, Linux-MM, LKML, X86 ML,
	Ingo Molnar, Thomas Gleixner, Andy Lutomirski

> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Dec 07, 2018 at 04:40:52PM -0800, Nadav Amit wrote:
> 
>>> I'm actually having difficulty finding the this_cpu_read() in any of the
>>> functions you mention, so I cannot make any concrete suggestions other
>>> than pointing at the alternative functions available.
>> 
>> 
>> So I dug deeper into the code to understand a couple of differences. In the
>> case of select_idle_sibling(), the patch (Peter's) increases the function
>> code size by 123 bytes (over a baseline of 986). The per-cpu variable is
>> read through the following call chain:
>> 
>> 	select_idle_sibling()
>> 	=> select_idle_cpu()
>> 	=> local_clock()
>> 	=> raw_smp_processor_id()
>> 
>> This results in two more calls to sched_clock_cpu(), as the compiler assumes
>> the processor id may change in between (which obviously cannot happen here).
> 
> That is the thing with raw_smp_processor_id(): it is allowed to be used
> in preemptible context, and there it _obviously_ can change between
> subsequent invocations.
> 
> So again, this change is actually good.
> 
> If we want to fix select_idle_cpu(), we should maybe not use
> local_clock() there but use sched_clock_cpu() with a stable argument;
> this code runs with IRQs disabled, and therefore the CPU number is stable
> for us here.
> 
>> There may be more changes around that I did not fully analyze. But at
>> the very least, reading the processor id should not be "volatile".
>> 
>> As for finish_task_switch(), the impact is only a few bytes, but still
>> unnecessary. It appears that with your patch, preempt_count() causes multiple
>> reads of __preempt_count in this code:
>> 
>>       if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
>>                     "corrupted preempt_count: %s/%d/0x%x\n",
>>                     current->comm, current->pid, preempt_count()))
>>               preempt_count_set(FORK_PREEMPT_COUNT);
> 
> My patch proposed here:
> 
>  https://marc.info/?l=linux-mm&m=154409548410209
> 
> would actually fix that one, I think: preempt_count() uses
> raw_cpu_read_4(), which will lose the volatile with that patch.

Sorry for the spam from yesterday. That's what happens when I try to write
emails on my phone while I'm distracted.

I tested the patch you referenced, and it certainly improves the situation
for reads, but there are still small and big issues lying around.

The biggest one is that (I think) smp_processor_id() should apparently use
__this_cpu_read(). After all, when preemption checks are on, it is already
validated that smp_processor_id() is not used while preemption is enabled,
and IRQs are not supposed to change its value. Otherwise, the generated code
of many functions is affected.
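
[ I.e., something along these lines. A hypothetical sketch of the
  definition, not actual kernel source: ]

        #ifdef CONFIG_DEBUG_PREEMPT
        # define smp_processor_id()     debug_smp_processor_id()
        #else
        # define smp_processor_id()     __this_cpu_read(cpu_number)
        #endif

        /* raw_smp_processor_id() would keep the volatile this_cpu_read(),
         * since it must stay valid, and be re-read, in preemptible context. */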

There are all kinds of other smaller issues, such as set_irq_regs() and
get_irq_regs(), which should run with interrupts disabled. They affect the
generated code in do_IRQ() and others.

But beyond that, there are so many places in the code that use
this_cpu_read() while IRQs are guaranteed to be disabled. For example,
arch/x86/mm/tlb.c is full of this_cpu_read/write(), and almost(?) all of them
should be running with interrupts disabled. Having said that, in my build
only flush_tlb_func_common() was affected.



* Re: Should this_cpu_read() be volatile?
From: Peter Zijlstra @ 2018-12-10  8:55 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Matthew Wilcox, Vlastimil Babka, Linux-MM, LKML, X86 ML,
	Ingo Molnar, Thomas Gleixner, Andy Lutomirski

On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote:
> > On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> > My patch proposed here:
> > 
> >  https://marc.info/?l=linux-mm&m=154409548410209
> > 
> > would actually fix that one, I think: preempt_count() uses
> > raw_cpu_read_4(), which will lose the volatile with that patch.

> I tested the patch you referenced, and it certainly improves the situation
> for reads, but there are still small and big issues lying around.

I'm sure :-(. This has been 'festering' for a long while, it seems,
especially in x86-specific code, where for a long time we all assumed the
various per-cpu APIs were in fact the same (which turns out to very much
not be true).

> The biggest one is that (I think) smp_processor_id() should apparently use
> __this_cpu_read().

Agreed, and note that this will also improve code generation on !x86.

However, I'm not sure the current !debug definition:

#define smp_processor_id() raw_smp_processor_id()

is actually correct. Whereas raw_smp_processor_id() must be
this_cpu_read() to avoid CSE, we actually want to allow CSE on
smp_processor_id() etc.
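
[ Concretely, the raw_ variant may not be CSE'd because its caller may
  be preemptible; a hypothetical snippet: ]

        int a = raw_smp_processor_id();
        cond_resched();                         /* the task may migrate here */
        int b = raw_smp_processor_id();         /* folding this into 'a' would be wrong */

        /* smp_processor_id() callers have preemption disabled, so there the
         * two reads really would be equal, and folding them is desirable. */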

> There are all kinds of other smaller issues, such as set_irq_regs() and
> get_irq_regs(), which should run with interrupts disabled. They affect the
> generated code in do_IRQ() and others.
> 
> But beyond that, there are so many places in the code that use
> this_cpu_read() while IRQs are guaranteed to be disabled. For example,
> arch/x86/mm/tlb.c is full of this_cpu_read/write(), and almost(?) all of them
> should be running with interrupts disabled. Having said that, in my build
> only flush_tlb_func_common() was affected.

This all feels like something static analysis could help with; such
tools would also make sense for !x86 where the difference between the
various per-cpu accessors is even bigger.



* Re: Should this_cpu_read() be volatile?
From: Nadav Amit @ 2018-12-11 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Matthew Wilcox, Vlastimil Babka, Linux-MM, LKML, X86 ML,
	Ingo Molnar, Thomas Gleixner, Andy Lutomirski

> On Dec 10, 2018, at 12:55 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Sun, Dec 09, 2018 at 04:57:43PM -0800, Nadav Amit wrote:
>>> On Dec 8, 2018, at 2:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> My patch proposed here:
>>> 
>>> https://marc.info/?l=linux-mm&m=154409548410209
>>> 
>>> would actually fix that one, I think: preempt_count() uses
>>> raw_cpu_read_4(), which will lose the volatile with that patch.
> 
>> I tested the patch you referenced, and it certainly improves the situation
>> for reads, but there are still small and big issues lying around.
> 
> I'm sure :-(. This has been 'festering' for a long while, it seems,
> especially in x86-specific code, where for a long time we all assumed the
> various per-cpu APIs were in fact the same (which turns out to very much
> not be true).
> 
>> The biggest one is that (I think) smp_processor_id() should apparently use
>> __this_cpu_read().
> 
> Agreed, and note that this will also improve code generation on !x86.
> 
> However, I'm not sure the current !debug definition:
> 
> #define smp_processor_id() raw_smp_processor_id()
> 
> is actually correct. Whereas raw_smp_processor_id() must be
> this_cpu_read() to avoid CSE, we actually want to allow CSE on
> smp_processor_id() etc.

Yes. That makes sense.

> 
>> There are all kinds of other smaller issues, such as set_irq_regs() and
>> get_irq_regs(), which should run with interrupts disabled. They affect the
>> generated code in do_IRQ() and others.
>> 
>> But beyond that, there are so many places in the code that use
>> this_cpu_read() while IRQs are guaranteed to be disabled. For example,
>> arch/x86/mm/tlb.c is full of this_cpu_read/write(), and almost(?) all of them
>> should be running with interrupts disabled. Having said that, in my build
>> only flush_tlb_func_common() was affected.
> 
> This all feels like something static analysis could help with; such
> tools would also make sense for !x86 where the difference between the
> various per-cpu accessors is even bigger.

If something like that existed, it could also allow us to get rid of
local_irq_save() (and use local_irq_disable() instead).
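
[ The win: local_irq_save() must read and stash the flags because it has
  to be safe even when IRQs are already off, while local_irq_disable()
  just clears IF. A sketch of the two forms: ]

        unsigned long flags;

        local_irq_save(flags);          /* save flags, then disable: safe anywhere */
        /* ... critical section ... */
        local_irq_restore(flags);

        local_irq_disable();            /* cheaper, but caller must know IRQs are on */
        /* ... critical section ... */
        local_irq_enable();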

