linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
@ 2019-08-30 14:08 Russell King - ARM Linux admin
  2019-08-30 15:24 ` Oleg Nesterov
  2019-08-30 15:30 ` Linus Torvalds
  0 siblings, 2 replies; 75+ messages in thread
From: Russell King - ARM Linux admin @ 2019-08-30 14:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai,
	Linus Torvalds, Mike Galbraith, Thomas Gleixner,
	Vladimir Davydov, Kirill Tkhai, Ingo Molnar, linux-kernel

In commit 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and
try_get_task_struct()") probe_kernel_address() is used without checking
it's return value, resulting in undefined behaviour of this function.

I stumbled over this due to looking at the definition of this function,
due to a patch submission for ARM, and delving into the internals of the
probe_kernel_*() functions.

Essentially, probe_kernel_address() uses probe_kernel_read(), which
eventually uses __copy_from_user_inatomic().
__copy_from_user_inatomic() is defined thusly:

 * NOTE: only copy_from_user() zero-pads the destination in case of short copy.
 * Neither __copy_from_user() nor __copy_from_user_inatomic() zero anything
 * at all; their callers absolutely must check the return value.

which means that when probe_kernel_address() returns -EFAULT, the
destination is left uninitialised.  In the case of
task_rcu_dereference(), this means that "siginfo" can be used without
having been initialised, resulting in this function returning an
indeterminant result (based on the value of an uninitialised variable
on the stack.)

One option would be to explicitly initialise "sighand" on function
entry, or maybe check the return value of probe_kernel_address().  It
looks non-trivial due to the interaction with put_task_struct().  I
suggest someone who knows this code needs to patch this issue.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 14:08 [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value Russell King - ARM Linux admin
@ 2019-08-30 15:24 ` Oleg Nesterov
  2019-08-30 15:30 ` Linus Torvalds
  1 sibling, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2019-08-30 15:24 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai,
	Linus Torvalds, Mike Galbraith, Thomas Gleixner,
	Vladimir Davydov, Kirill Tkhai, Ingo Molnar, linux-kernel

On 08/30, Russell King - ARM Linux admin wrote:
>
> which means that when probe_kernel_address() returns -EFAULT, the
> destination is left uninitialised.  In the case of
> task_rcu_dereference(), this means that "siginfo" can be used without
> having been initialised,

Yes, but this is fine, please see the long comment below (case 2).

If probe_kernel_address() fails, "sighand" is not initialized. but this
doesn't differ from the case when we inspect the random value if this
task_struct was freed, then reallocated as another thing, then freed and
reallocated as task_struct again.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 14:08 [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value Russell King - ARM Linux admin
  2019-08-30 15:24 ` Oleg Nesterov
@ 2019-08-30 15:30 ` Linus Torvalds
  2019-08-30 15:40   ` Russell King - ARM Linux admin
  2019-08-30 15:41   ` Linus Torvalds
  1 sibling, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2019-08-30 15:30 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Oleg Nesterov, Peter Zijlstra, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Vladimir Davydov,
	Kirill Tkhai, Ingo Molnar, Linux List Kernel Mailing

On Fri, Aug 30, 2019 at 7:08 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> which means that when probe_kernel_address() returns -EFAULT, the
> destination is left uninitialised.  In the case of
> task_rcu_dereference(), this means that "siginfo" can be used without
> having been initialised, resulting in this function returning an
> indeterminant result (based on the value of an uninitialised variable
> on the stack.)

Do you actually see that behavior?

Because the foillowing lines:

        smp_rmb();
        if (unlikely(task != READ_ONCE(*ptask)))
                goto retry;

are what is supposed to protect it - yes, it could have faulted, but
only if 'task' isn't valid any more, and we just re-checked it.

              Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 15:30 ` Linus Torvalds
@ 2019-08-30 15:40   ` Russell King - ARM Linux admin
  2019-08-30 15:43     ` Linus Torvalds
  2019-08-30 15:41   ` Linus Torvalds
  1 sibling, 1 reply; 75+ messages in thread
From: Russell King - ARM Linux admin @ 2019-08-30 15:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Peter Zijlstra, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Vladimir Davydov,
	Kirill Tkhai, Ingo Molnar, Linux List Kernel Mailing

On Fri, Aug 30, 2019 at 08:30:00AM -0700, Linus Torvalds wrote:
> On Fri, Aug 30, 2019 at 7:08 AM Russell King - ARM Linux admin
> <linux@armlinux.org.uk> wrote:
> >
> > which means that when probe_kernel_address() returns -EFAULT, the
> > destination is left uninitialised.  In the case of
> > task_rcu_dereference(), this means that "siginfo" can be used without
> > having been initialised, resulting in this function returning an
> > indeterminant result (based on the value of an uninitialised variable
> > on the stack.)
> 
> Do you actually see that behavior?

No, it was an observation of the code.

> Because the foillowing lines:
> 
>         smp_rmb();
>         if (unlikely(task != READ_ONCE(*ptask)))
>                 goto retry;
> 
> are what is supposed to protect it - yes, it could have faulted, but
> only if 'task' isn't valid any more, and we just re-checked it.

Ah, ok.  Might be worth some comments - I find the comments in that
function particularly unhelpful (even after Oleg mentions this is
case 2.)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 15:30 ` Linus Torvalds
  2019-08-30 15:40   ` Russell King - ARM Linux admin
@ 2019-08-30 15:41   ` Linus Torvalds
  2019-08-30 16:09     ` Oleg Nesterov
  1 sibling, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-08-30 15:41 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Oleg Nesterov, Peter Zijlstra, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing

On Fri, Aug 30, 2019 at 8:30 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Do you actually see that behavior?
>
> Because the foillowing lines:
>
>         smp_rmb();
>         if (unlikely(task != READ_ONCE(*ptask)))
>                 goto retry;

Side note: that code had better not be performance-critical, because
"probe_kernel_address()" is actually really really slow.

We really should do a real set of "read kernel with fault handling" functions.

We have *one* right now: load_unaligned_zeropad(), but that one
assumes that at least the first byte is valid and that the fault can
only be because of unaligned page crossers.

The problem with "probe_kernel_address()" is that it really just
re-uses the user access functions, and then plays games to make them
work for kernel addresses. Games we shouldn't play, and it's all very
expensive when it really shouldn't need to be. Including changing
limits, but also doing all the system instructions to allow user
accesses (PAN on ARM, clac/stac on x86).

Doing a set of "access kernel with error handling" should be trivial,
it's just that every architecture needs to do it. So we'd probably
need to do something where architectures can say "I have it", and fall
back on the silly legacy implementation otherwise..

             Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 15:40   ` Russell King - ARM Linux admin
@ 2019-08-30 15:43     ` Linus Torvalds
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2019-08-30 15:43 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Oleg Nesterov, Peter Zijlstra, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Vladimir Davydov,
	Kirill Tkhai, Ingo Molnar, Linux List Kernel Mailing

On Fri, Aug 30, 2019 at 8:40 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> Ah, ok.  Might be worth some comments - I find the comments in that
> function particularly unhelpful (even after Oleg mentions this is
> case 2.)

Yeah, a comment like "This might fault if we race with the task
scheduling away and being destroyed, but we check that below".

But that code has some performance issues too, as mentioned. Which is
all kinds of sad since it clearly _tries_ to perform well with RCU
locking and optimistic accesses etc.

             Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 15:41   ` Linus Torvalds
@ 2019-08-30 16:09     ` Oleg Nesterov
  2019-08-30 16:21       ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2019-08-30 16:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing

On 08/30, Linus Torvalds wrote:
>
> Side note: that code had better not be performance-critical, because
> "probe_kernel_address()" is actually really really slow.

Yes, please see

	[PATCH 2/3] introduce probe_slab_address()
	https://lore.kernel.org/lkml/20141027195425.GC11736@redhat.com/

I sent 5 years ago ;) Do you think

	/*
	 * Same as probe_kernel_address(), but @addr must be the valid pointer
	 * to a slab object, potentially freed/reused/unmapped.
	 */
	#ifdef CONFIG_DEBUG_PAGEALLOC
	#define probe_slab_address(addr, retval)	\
		probe_kernel_address(addr, retval)
	#else
	#define probe_slab_address(addr, retval)	\
		({							\
			(retval) = *(typeof(retval) *)(addr);		\
			0;						\
		})
	#endif

can work?

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 16:09     ` Oleg Nesterov
@ 2019-08-30 16:21       ` Linus Torvalds
  2019-08-30 16:44         ` Oleg Nesterov
  2019-08-30 19:36         ` Eric W. Biederman
  0 siblings, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2019-08-30 16:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing

On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
>
> Yes, please see
>
>         [PATCH 2/3] introduce probe_slab_address()
>         https://lore.kernel.org/lkml/20141027195425.GC11736@redhat.com/
>
> I sent 5 years ago ;) Do you think
>
>         /*
>          * Same as probe_kernel_address(), but @addr must be the valid pointer
>          * to a slab object, potentially freed/reused/unmapped.
>          */
>         #ifdef CONFIG_DEBUG_PAGEALLOC
>         #define probe_slab_address(addr, retval)        \
>                 probe_kernel_address(addr, retval)
>         #else
>         #define probe_slab_address(addr, retval)        \
>                 ({                                                      \
>                         (retval) = *(typeof(retval) *)(addr);           \
>                         0;                                              \
>                 })
>         #endif
>
> can work?

Ugh. I would much rather handle the general case, because honestly,
tracing has had a lot of issues with our hacky "probe_kernel_read()"
stuff that bases itself on user addresses.

It's also one of the few remaining users of "set_fs()" in core code,
and we really should try to get rid of those.

So your code would work for this particular case, but not for other
cases that can trap simply because the pointer isn't reliable (tracing
being the main case for that - but if the source of the pointer itself
might have been free'd, you would also have that situation).

So I'd really prefer to have something a bit fancier. On most
architectures, doing a good exception fixup for kernel addresses is
really not that hard.

On x86, for example, we actually have *exactly* that. The
"__get_user_asm()" macro is basically it. It purely does a load
instruction from an unchecked address.

(It's a really odd syntax, but you could remove the __chk_user_ptr()
from the __get_user_size() macro, and now you'd have basically a "any
regular size kernel access with exception handlng").

But yes, your hack is I guess optimal for this particular case where
you simply can depend on "we know the pointer was valid, we just don't
know if it was freed".

Hmm. Don't we RCU-free the task struct? Because then we don't even
need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
the pointer as long as we have the RCU read lock. We do that in other
cases.

                    Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 16:21       ` Linus Torvalds
@ 2019-08-30 16:44         ` Oleg Nesterov
  2019-08-30 16:58           ` Linus Torvalds
  2019-08-30 19:36         ` Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2019-08-30 16:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing

On 08/30, Linus Torvalds wrote:
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock.

For example,

	rcu_read_lock();
	p = task_rcu_dereference(&cpu_rq(cpu)->curr);
	rcu_read_unlock();

->curr is not protected by RCU, the last schedule does put_task_struct()
in finish_task_switch().

Of course we can change this and add another call_rcu (actually we can do
better), and after that we do not need task_rcu_dereference() at all.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 16:44         ` Oleg Nesterov
@ 2019-08-30 16:58           ` Linus Torvalds
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2019-08-30 16:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing

On Fri, Aug 30, 2019 at 9:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> ->curr is not protected by RCU, the last schedule does put_task_struct()
> in finish_task_switch().

Right you are.

It's only the sighand allocation itself that is RCU-allocated (using
SLAB_TYPESAFE_BY_RCU, so only the backing page freeing is RCU-delayed,
it can be re-used immediately).

For some reason I thought the main thread struct was too, but that was
just my fevered imagination.

> Of course we can change this and add another call_rcu (actually we can do
> better), and after that we do not need task_rcu_dereference() at all.

No, we wouldn't do another RCU call, we'd just make task_struct_cachep
be SLAB_TYPESAFE_BY_RCU too. In the reuse case, you have no cost at
all.

However, the overhead of RCU freeing is real. It's much less for the
SLAB_TYPESAFE_BY_RCU case (at least for small allocations) than for
"free every single allocaiton by RCU", but it's still very real.

(For small allocations, you only take the RCU hit when you free the
backing store of the pages, which is much less frequent - but for
something big like the task_struct, I don't know how much of a
buffering the slab allocation ends up being)

So it's probably better to take the hit at task_rcu_dereference() than
add RCU logic to the task freeing.

              Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 16:21       ` Linus Torvalds
  2019-08-30 16:44         ` Oleg Nesterov
@ 2019-08-30 19:36         ` Eric W. Biederman
  2019-09-02 13:40           ` Oleg Nesterov
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-08-30 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>>
>> Yes, please see
>>
>>         [PATCH 2/3] introduce probe_slab_address()
>>         https://lore.kernel.org/lkml/20141027195425.GC11736@redhat.com/
>>
>> I sent 5 years ago ;) Do you think
>>
>>         /*
>>          * Same as probe_kernel_address(), but @addr must be the valid pointer
>>          * to a slab object, potentially freed/reused/unmapped.
>>          */
>>         #ifdef CONFIG_DEBUG_PAGEALLOC
>>         #define probe_slab_address(addr, retval)        \
>>                 probe_kernel_address(addr, retval)
>>         #else
>>         #define probe_slab_address(addr, retval)        \
>>                 ({                                                      \
>>                         (retval) = *(typeof(retval) *)(addr);           \
>>                         0;                                              \
>>                 })
>>         #endif
>>
>> can work?
>
> Ugh. I would much rather handle the general case, because honestly,
> tracing has had a lot of issues with our hacky "probe_kernel_read()"
> stuff that bases itself on user addresses.
>
> It's also one of the few remaining users of "set_fs()" in core code,
> and we really should try to get rid of those.
>
> So your code would work for this particular case, but not for other
> cases that can trap simply because the pointer isn't reliable (tracing
> being the main case for that - but if the source of the pointer itself
> might have been free'd, you would also have that situation).
>
> So I'd really prefer to have something a bit fancier. On most
> architectures, doing a good exception fixup for kernel addresses is
> really not that hard.
>
> On x86, for example, we actually have *exactly* that. The
> "__get_user_asm()" macro is basically it. It purely does a load
> instruction from an unchecked address.
>
> (It's a really odd syntax, but you could remove the __chk_user_ptr()
> from the __get_user_size() macro, and now you'd have basically a "any
> regular size kernel access with exception handlng").
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock. We do that in other
> cases.

Sort of.  The rcu delay happens when release_task calls
delayed_put_task_struct.  Which unfortunately means that anytime after
exit_notify, release_task can operate on a task.  So it is possible
that by the time do_dead_task is called the rcu grace period is up.


Which is the problem the users of task_rcu_dereference are fighting.
They are performing an rcu walk on the set of cups in task_numa_migrate
and in the userspace membarrier system calls.

For a short while we the rcu delay in put_task_struct but that required
changes all of the place and was just a pain to work with.

Then I did:
> commit 8c7904a00b06d2ee51149794b619e07369fcf9d4
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Fri Mar 31 02:31:37 2006 -0800
> 
>     [PATCH] task: RCU protect task->usage
>     
>     A big problem with rcu protected data structures that are also reference
>     counted is that you must jump through several hoops to increase the reference
>     count.  I think someone finally implemented atomic_inc_not_zero(&count) to
>     automate the common case.  Unfortunately this means you must special case the
>     rcu access case.
>     
>     When data structures are only visible via rcu in a manner that is not
>     determined by the reference count on the object (i.e.  tasks are visible until
>     their zombies are reaped) there is a much simpler technique we can employ.
>     Simply delaying the decrement of the reference count until the rcu interval is
>     over.
>     
>     What that means is that the proc code that looks up a task and later
>     wants to sleep can now do:
>     
>     rcu_read_lock();
>     task = find_task_by_pid(some_pid);
>     if (task) {
>             get_task_struct(task);
>     }
>     rcu_read_unlock();
>     
>     The effect on the rest of the kernel is that put_task_struct becomes cheaper
>     and immediate, and in the case where the task has been reaped it frees the
>     task immediate instead of unnecessarily waiting an until the rcu interval is
>     over.
>     
>     Cleanup of task_struct does not happen when its reference count drops to
>     zero, instead cleanup happens when release_task is called.  Tasks can only
>     be looked up via rcu before release_task is called.  All rcu protected
>     members of task_struct are freed by release_task.
>     
>     Therefore we can move call_rcu from put_task_struct into release_task.  And
>     we can modify release_task to not immediately release the reference count
>     but instead have it call put_task_struct from the function it gives to
>     call_rcu.
>     
>     The end result:
>     
>     - get_task_struct is safe in an rcu context where we have just looked
>       up the task.
>     
>     - put_task_struct() simplifies into its old pre rcu self.
>     
>     This reorganization also makes put_task_struct uncallable from modules as
>     it is not exported but it does not appear to be called from any modules so
>     this should not be an issue, and is trivially fixed.
>     
>     Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>

About a decade later task_struct grew some new rcu users and Oleg
introduced task_rcu_dereference to deal with them:

> commit 150593bf869393d10a79f6bd3df2585ecc20a9bb
> Author: Oleg Nesterov <oleg@redhat.com>
> Date:   Wed May 18 19:02:18 2016 +0200
> 
>     sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
>     
>     Generally task_struct is only protected by RCU if it was found on a
>     RCU protected list (say, for_each_process() or find_task_by_vpid()).
>     
>     As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
>     drops the (potentially) last reference without RCU gp, this means
>     that we need to fix the code which uses foreign_rq->curr under
>     rcu_read_lock().
>     
>     Add a new helper which can be used to dereference rq->curr or any
>     other pointer to task_struct assuming that it should be cleared or
>     updated before the final put_task_struct(). It returns non-NULL
>     only if this task can't go away before rcu_read_unlock().
>     
>     ( Also add try_get_task_struct() to make it easier to use this API
>       correctly. )

So I think it makes a lot of sense to change how we do this.  Either
moving the rcu delay back into put_task_struct or doing halfway like
creating a put_dead_task_struct that will perform the rcu delay after
a task has been taken off the run queues and has stopped being a zombie.

Something like:
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..bf323418094e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,7 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_dead_task_struct(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..3a85bc2e8031 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_dead_task_struct(struct task_struct *task)
+{
+	bool delay = false;
+	unsigned long flags;
+
+	/* Is the task both reaped and no longer being scheduled? */
+	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	if ((task->state == TASK_DEAD) &&
+	    (cmpxchg(&task->exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
+		delay = true;
+	raw_spin_lock_irqrestore(&task->pi_lock, flags);
+
+	/* If both are true use rcu delay the put_task_struct */
+	if (delay)
+		call_rcu(&task->rcu, delayed_put_task_struct);
+	else
+		put_task_struct(task);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_dead_task_struct(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 *    If we return NULL - we can pretend that we actually noticed that
-	 *    *ptask was updated when the previous task has exited. Or pretend
-	 *    that probe_slab_address(&sighand) reads NULL.
-	 *
-	 *    If we return the new task (because sighand is not NULL for any
-	 *    reason) - this is fine too. This (new) task can't go away before
-	 *    another gp pass.
-	 *
-	 *    And note: We could even eliminate the false positive if re-read
-	 *    task->sighand once again to avoid the falsely NULL. But this case
-	 *    is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..5b697c0572ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_dead_task_struct(prev);
 	}
 
 	tick_nohz_task_switch();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..c3e1a302211a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..74df8e0dfc84 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(&cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(&cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);



Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-08-30 19:36         ` Eric W. Biederman
@ 2019-09-02 13:40           ` Oleg Nesterov
  2019-09-02 13:53             ` Peter Zijlstra
  2019-09-02 17:04             ` Eric W. Biederman
  0 siblings, 2 replies; 75+ messages in thread
From: Oleg Nesterov @ 2019-09-02 13:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing

On 08/30, Eric W. Biederman wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>  	put_task_struct(tsk);
>  }
>  
> +void put_dead_task_struct(struct task_struct *task)
> +{
> +	bool delay = false;
> +	unsigned long flags;
> +
> +	/* Is the task both reaped and no longer being scheduled? */
> +	raw_spin_lock_irqsave(&task->pi_lock, flags);
> +	if ((task->state == TASK_DEAD) &&
> +	    (cmpxchg(&task->exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
> +		delay = true;
> +	raw_spin_lock_irqrestore(&task->pi_lock, flags);
> +
> +	/* If both are true use rcu delay the put_task_struct */
> +	if (delay)
> +		call_rcu(&task->rcu, delayed_put_task_struct);
> +	else
> +		put_task_struct(task);
> +}
>  
>  void release_task(struct task_struct *p)
>  {
> @@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
>  
>  	write_unlock_irq(&tasklist_lock);
>  	release_thread(p);
> -	call_rcu(&p->rcu, delayed_put_task_struct);
> +	put_dead_task_struct(p);

I had a similar change in mind, see below. This is subjective, but to me
it looks more simple and clean.

Oleg.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811..1f9b021 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1134,7 +1134,10 @@ struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	struct rcu_head			rcu;
+	union {
+		bool			xxx;
+		struct rcu_head		rcu;
+	};
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7..baacfce 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void call_delayed_put_task_struct(struct task_struct *p)
+{
+	if (xchg(&p->xxx, 1))
+		call_rcu(&p->rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,7 +227,7 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	call_delayed_put_task_struct(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1..e90f6de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,11 +900,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/*
-	 * One for us, one for whoever does the "release_task()" (usually
-	 * parent)
-	 */
-	refcount_set(&tsk->usage, 2);
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f1..e77389c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		call_delayed_put_task_struct(prev);
 	}
 
 	tick_nohz_task_switch();


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-09-02 13:40           ` Oleg Nesterov
@ 2019-09-02 13:53             ` Peter Zijlstra
  2019-09-02 14:44               ` Oleg Nesterov
  2019-09-02 17:04             ` Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-02 13:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing

On Mon, Sep 02, 2019 at 03:40:03PM +0200, Oleg Nesterov wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8dc1811..1f9b021 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1134,7 +1134,10 @@ struct task_struct {
>  
>  	struct tlbflush_unmap_batch	tlb_ubc;
>  
> -	struct rcu_head			rcu;
> +	union {
> +		bool			xxx;
> +		struct rcu_head		rcu;
> +	};
>  
>  	/* Cache last used pipe for splice(): */
>  	struct pipe_inode_info		*splice_pipe;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a75b6a7..baacfce 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>  	put_task_struct(tsk);
>  }
>  
> +void call_delayed_put_task_struct(struct task_struct *p)
> +{
> +	if (xchg(&p->xxx, 1))
> +		call_rcu(&p->rcu, delayed_put_task_struct);
> +}

I think this is the first usage of xchg() on _Bool, also not all archs
implement xchg8()


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-09-02 13:53             ` Peter Zijlstra
@ 2019-09-02 14:44               ` Oleg Nesterov
  2019-09-02 16:20                 ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2019-09-02 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing

On 09/02, Peter Zijlstra wrote:
>
> On Mon, Sep 02, 2019 at 03:40:03PM +0200, Oleg Nesterov wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 8dc1811..1f9b021 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1134,7 +1134,10 @@ struct task_struct {
> >  
> >  	struct tlbflush_unmap_batch	tlb_ubc;
> >  
> > -	struct rcu_head			rcu;
> > +	union {
> > +		bool			xxx;
> > +		struct rcu_head		rcu;
> > +	};
> >  
> >  	/* Cache last used pipe for splice(): */
> >  	struct pipe_inode_info		*splice_pipe;
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index a75b6a7..baacfce 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
> >  	put_task_struct(tsk);
> >  }
> >  
> > +void call_delayed_put_task_struct(struct task_struct *p)
> > +{
> > +	if (xchg(&p->xxx, 1))
> > +		call_rcu(&p->rcu, delayed_put_task_struct);
> > +}
> 
> I think this is the first usage of xchg() on _Bool, also not all archs
> implement xchg8()

I didn't even notice I used "bool" ;)


speaking of the users of task_rcu_dereference(), membarrier_global_expedited()
does

		rcu_read_lock();
		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {

This asks for READ_ONCE, but this is minor. Why can't p->mm be freed?

I guess it is fine to read the garbage from &p->mm->membarrier_state if we race
with the exiting task, but in theory this looks unsafe if CONFIG_DEBUG_PAGEALLOC.

Another possible user of probe_slab_address() or I am totally confused?

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-09-02 14:44               ` Oleg Nesterov
@ 2019-09-02 16:20                 ` Peter Zijlstra
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-02 16:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Mathieu Desnoyers

On Mon, Sep 02, 2019 at 04:44:24PM +0200, Oleg Nesterov wrote:

> speaking of the users of task_rcu_dereference(), membarrier_global_expedited()
> does
> 
> 		rcu_read_lock();
> 		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
> 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> 
> This asks for READ_ONCE, but this is minor. Why can't p->mm be freed?
> 
> I guess it is fine to read the garbage from &p->mm->membarrier_state if we race
> with the exiting task, but in theory this looks unsafe if CONFIG_DEBUG_PAGEALLOC.
> 
> Another possible user of probe_slab_address() or I am totally confused?

You're quite right; that's busted.

Due to the lack of READ_ONCE() on p->mm, the above can in fact turn into
a NULL deref when we hit do_exit() around exit_mm(). The first p->mm
read is before and sees !NULL, the second is after and does observe
NULL, and *bang*.

I suppose it wants to be something like:

	mm = READ_ONCE(p->mm);
	if (mm && probe_address())

(I'm not sure _slab_ is a useful part of the name; it should work on
kernel memory irrespective of the allocator)

If it got freed, that CPU already just did something that implies
smp_mb() so we're good. So whatever garbage gets read, is fine. Either
we do a superfluous IPI or not is immaterial.



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-09-02 13:40           ` Oleg Nesterov
  2019-09-02 13:53             ` Peter Zijlstra
@ 2019-09-02 17:04             ` Eric W. Biederman
  2019-09-02 17:34               ` Linus Torvalds
  1 sibling, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-02 17:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing

Oleg Nesterov <oleg@redhat.com> writes:

> On 08/30, Eric W. Biederman wrote:
>>
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>>  	put_task_struct(tsk);
>>  }
>>  
>> +void put_dead_task_struct(struct task_struct *task)
>> +{
>> +	bool delay = false;
>> +	unsigned long flags;
>> +
>> +	/* Is the task both reaped and no longer being scheduled? */
>> +	raw_spin_lock_irqsave(&task->pi_lock, flags);
>> +	if ((task->state == TASK_DEAD) &&
>> +	    (cmpxchg(&task->exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
>> +		delay = true;
>> +	raw_spin_lock_irqrestore(&task->pi_lock, flags);
>> +
>> +	/* If both are true use rcu delay the put_task_struct */
>> +	if (delay)
>> +		call_rcu(&task->rcu, delayed_put_task_struct);
>> +	else
>> +		put_task_struct(task);
>> +}
>>  
>>  void release_task(struct task_struct *p)
>>  {
>> @@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
>>  
>>  	write_unlock_irq(&tasklist_lock);
>>  	release_thread(p);
>> -	call_rcu(&p->rcu, delayed_put_task_struct);
>> +	put_dead_task_struct(p);
>
> I had a similar change in mind, see below. This is subjective, but to me
> it looks more simple and clean.
>
> Oleg.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8dc1811..1f9b021 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1134,7 +1134,10 @@ struct task_struct {
>  
>  	struct tlbflush_unmap_batch	tlb_ubc;
>  
> -	struct rcu_head			rcu;
> +	union {
> +		bool			xxx;
> +		struct rcu_head		rcu;
> +	};
>  
>  	/* Cache last used pipe for splice(): */
>  	struct pipe_inode_info		*splice_pipe;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a75b6a7..baacfce 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>  	put_task_struct(tsk);
>  }
>  
> +void call_delayed_put_task_struct(struct task_struct *p)
> +{
> +	if (xchg(&p->xxx, 1))
> +		call_rcu(&p->rcu, delayed_put_task_struct);
> +}

I like using the storage we will later use for the rcu_head.

Is the intention your new variable xxx start as 0, and the only
on the second write it becomes 1 and we take action?

That should work but it is a funny way to encode a decrement.  I think
it would be more straight forward to use refcount_dec_and_test.

So something like this:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..99a4518b9b17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1142,7 +1142,10 @@ struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	struct rcu_head			rcu;
+	union {
+		refcount_t		rcu_users;
+		struct rcu_head		rcu;
+	};
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..8bd51af44bf8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,7 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..a42fd8889036 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_task_struct_rcu_user(struct task_struct *task)
+{
+	if (refcount_dec_and_test(&task->rcu_users))
+		call_rcu(&task->rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,10 +227,10 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_task_struct_rcu_user(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
 		goto repeat;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..dc4799210e05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,11 +900,15 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/*
-	 * One for us, one for whoever does the "release_task()" (usually
-	 * parent)
+	/* One for the user space visible state that goes away when
+	 * the processes zombie is reaped.
+	 * One for the reference from the scheduler.
+	 *
+	 * The reference count is ignored and free_task is called
+	 * directly until copy_process completes.
 	 */
-	refcount_set(&tsk->usage, 2);
+	refcount_set(&tsk->rcu_users, 2);
+	refcount_set(&tsk->usage, 1); /* One for the rcu users */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..69015b7c28da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}
 
 	tick_nohz_task_switch();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..082f8ba2b1f4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -892,7 +892,7 @@ struct rq {
 	 */
 	unsigned long		nr_uninterruptible;
 
-	struct task_struct	*curr;
+	struct task_struct __rcu *curr;
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;


Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value
  2019-09-02 17:04             ` Eric W. Biederman
@ 2019-09-02 17:34               ` Linus Torvalds
  2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-02 17:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing

On Mon, Sep 2, 2019 at 10:04 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I like using the storage we will later use for the rcu_head.
>
> Is the intention your new variable xxx start as 0, and the only
> on the second write it becomes 1 and we take action?
>
> That should work but it is a funny way to encode a decrement.  I think
> it would be more straight forward to use refcount_dec_and_test.
>
> So something like this:

I like how this patch looks. It makes more sense to me than some of
the ad-hoc cases, and I wonder if this might be a pattern in general.

We have a very different "some users don't need RCU" in the dentry
code, and recently in the credential handling code. So I wonder if
this is a larger pattern, but I think your patch looks good
independently on its own.

But this is all based on "that patch _feels_ conceptually right",
rather than any deep thinking or (God forbid) any actual testing.

                Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 0/3] task: Making tasks on the runqueue rcu protected
  2019-09-02 17:34               ` Linus Torvalds
@ 2019-09-03  4:50                 ` Eric W. Biederman
  2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
                                     ` (5 more replies)
  0 siblings, 6 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-03  4:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


I have split this work into 3 simple patches, so the code is straight
forward to review and so that if any mistakes slip in it is easy to
bisect them.  In the process of review what it takes to remove
task_rcu_dereference I found yet another user of tasks on the
runqueue in rcu context; the rcuwait_event code.  That code only needs
it now unnecessary limits removed.

I have lightly tested it, and read through everything I can think of
that might be an issue.

Peter would this be a good fit for your scheduler tree?  If not I will
toss it onto a branch someplace and send it to Linus when the merge
window opens.

Oleg do you have any issues with this code?

Eric W. Biederman (3):
      task: Add a count of task rcu users
      task: RCU protect tasks on the runqueue
      task: Clean house now that tasks on the runqueue are rcu protected

 include/linux/rcuwait.h    | 20 +++----------
 include/linux/sched.h      |  5 +++-
 include/linux/sched/task.h |  2 +-
 kernel/exit.c              | 74 ++++------------------------------------------
 kernel/fork.c              |  8 +++--
 kernel/sched/core.c        |  7 +++--
 kernel/sched/fair.c        |  2 +-
 kernel/sched/membarrier.c  |  4 +--
 8 files changed, 27 insertions(+), 95 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 1/3] task: Add a count of task rcu users
  2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
@ 2019-09-03  4:51                   ` Eric W. Biederman
  2019-09-04 14:36                     ` Oleg Nesterov
  2019-09-04 14:44                     ` Frederic Weisbecker
  2019-09-03  4:52                   ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
                                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-03  4:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


Add a count of the number of rcu users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference, and just use rcu_dereference.

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched.h      | 5 ++++-
 include/linux/sched/task.h | 1 +
 kernel/exit.c              | 7 ++++++-
 kernel/fork.c              | 7 +++----
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..99a4518b9b17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1142,7 +1142,10 @@ struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	struct rcu_head			rcu;
+	union {
+		refcount_t		rcu_users;
+		struct rcu_head		rcu;
+	};
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4c44c37236b2 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -116,6 +116,7 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..2e259286f4e7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_task_struct_rcu_user(struct task_struct *task)
+{
+	if (refcount_dec_and_test(&task->rcu_users))
+		call_rcu(&task->rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,7 +227,7 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_task_struct_rcu_user(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..9f04741d5c70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/*
-	 * One for us, one for whoever does the "release_task()" (usually
-	 * parent)
-	 */
+	/* One for the user space visible state that goes away when reaped. */
+	refcount_set(&tsk->rcu_users, 1);
+	/* One for the rcu users, and one for the scheduler */
 	refcount_set(&tsk->usage, 2);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
  2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
@ 2019-09-03  4:52                   ` Eric W. Biederman
  2019-09-03  7:41                     ` Peter Zijlstra
  2019-09-04 14:22                     ` [PATCH 2/3] task: RCU protect tasks on the runqueue Frederic Weisbecker
  2019-09-03  4:52                   ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
                                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-03  4:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


In the ordinary case today the rcu grace period of a task comes when a
task is reaped, well after the task has left the runqueue.  This
change guarantees that the rcu grace period always happens after a
task has left the runqueue.  As this is something that usaually happens
today I do not expect any code correctness problems with this change.
At most I anticipate timing challenges.

The only code that will run later are the functions
perf_event_delayed_put, and trace-sched_process_free.  The function
perf_event_delayed_put in the final analysis is just a WARN_ON for
cases that I assume should never happen.  So I don't see any problem
with delaying it.

The function trace_sched_process_free is a trace point and thus user
space visible.   The strangest dependencies can happen but short
of the bizarre it appears to me that trace_sched_process_free is
getting a slightly more accurate picture of when a task struct
is free.  As it is now guaranteed that the process will not be
on the runqueue.

Resources for a process are freed in release_task or in __put_task_struct
when the reference count goes to 0.  Both of which are happening at
effectively the same time as before.  The rcu grace period is just
potentially happening a little bit later in the timeline.

In the common case of a process being reaped after it leaves the
runqueue everything will happen exactly as before.

In the case where a task self reaps we are pretty much guaranteed that
the rcu grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c       | 11 +++++++----
 kernel/sched/core.c |  7 ++++---
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f04741d5c70..7a74ade4e7d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/* One for the user space visible state that goes away when reaped. */
-	refcount_set(&tsk->rcu_users, 1);
-	/* One for the rcu users, and one for the scheduler */
-	refcount_set(&tsk->usage, 2);
+	/*
+	 * One for the user space visible state that goes away when reaped.
+	 * One for the scheduler.
+	 */
+	refcount_set(&tsk->rcu_users, 2);
+	/* One for the rcu users */
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..802958407369 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}
 
 	tick_nohz_task_switch();
@@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
-		rq->curr = next;
+		rcu_assign_pointer(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -5863,7 +5863,8 @@ void init_idle(struct task_struct *idle, int cpu)
 	__set_task_cpu(idle, cpu);
 	rcu_read_unlock();
 
-	rq->curr = rq->idle = idle;
+	rq->idle = idle;
+	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected
  2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
  2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
  2019-09-03  4:52                   ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
@ 2019-09-03  4:52                   ` Eric W. Biederman
  2019-09-03  9:45                     ` kbuild test robot
  2019-09-03 13:06                     ` Oleg Nesterov
  2019-09-03 13:58                   ` [PATCH 0/3] task: Making tasks on the runqueue " Oleg Nesterov
                                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-03  4:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


Use rcu_dereference instead of task_rcu_dereference.

Remove task_rcu_dereference.

Remove the complications of rcuwait that were in place because tasks
on the runqueue were not rcu protected.  It is now safe to call
wake_up_process if the target was know to be on the runqueue in the
current rcu interval.

Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/rcuwait.h    | 20 +++---------
 include/linux/sched/task.h |  1 -
 kernel/exit.c              | 67 --------------------------------------
 kernel/sched/fair.c        |  2 +-
 kernel/sched/membarrier.c  |  4 +--
 5 files changed, 7 insertions(+), 87 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 563290fc194f..75c97e4bbc57 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -6,16 +6,11 @@
 
 /*
  * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner; where it is forbidden to use
- * after exit_notify(). task_struct is not properly rcu protected,
- * unless dealing with rcu-aware lists, ie: find_task_by_*().
+ * task in an rcu-safe manner.
  *
- * Alternatively we have task_rcu_dereference(), but the return
- * semantics have different implications which would break the
- * wakeup side. The only time @task is non-nil is when a user is
- * blocked (or checking if it needs to) on a condition, and reset
- * as soon as we know that the condition has succeeded and are
- * awoken.
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
  */
 struct rcuwait {
 	struct task_struct __rcu *task;
@@ -37,13 +32,6 @@ extern void rcuwait_wake_up(struct rcuwait *w);
  */
 #define rcuwait_wait_event(w, condition)				\
 ({									\
-	/*								\
-	 * Complain if we are called after do_exit()/exit_notify(),     \
-	 * as we cannot rely on the rcu critical region for the		\
-	 * wakeup side.							\
-	 */                                                             \
-	WARN_ON(current->exit_state);                                   \
-									\
 	rcu_assign_pointer((w)->task, current);				\
 	for (;;) {							\
 		/*							\
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4c44c37236b2..8bd51af44bf8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,6 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/exit.c b/kernel/exit.c
index 2e259286f4e7..f943773622fc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -234,69 +234,6 @@ void release_task(struct task_struct *p)
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 *    If we return NULL - we can pretend that we actually noticed that
-	 *    *ptask was updated when the previous task has exited. Or pretend
-	 *    that probe_slab_address(&sighand) reads NULL.
-	 *
-	 *    If we return the new task (because sighand is not NULL for any
-	 *    reason) - this is fine too. This (new) task can't go away before
-	 *    another gp pass.
-	 *
-	 *    And note: We could even eliminate the false positive if re-read
-	 *    task->sighand once again to avoid the falsely NULL. But this case
-	 *    is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
@@ -316,10 +253,6 @@ void rcuwait_wake_up(struct rcuwait *w)
 	 */
 	smp_mb(); /* (B) */
 
-	/*
-	 * Avoid using task_rcu_dereference() magic as long as we are careful,
-	 * see comment in rcuwait_wait_event() regarding ->exit_state.
-	 */
 	task = rcu_dereference(w->task);
 	if (task)
 		wake_up_process(task);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..215c640e2a6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..b14250a11608 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03  4:52                   ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
@ 2019-09-03  7:41                     ` Peter Zijlstra
  2019-09-03  7:47                       ` Peter Zijlstra
  2019-09-04 14:22                     ` [PATCH 2/3] task: RCU protect tasks on the runqueue Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-03  7:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On Mon, Sep 02, 2019 at 11:52:01PM -0500, Eric W. Biederman wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2b037f195473..802958407369 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c

> @@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
>  
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> -		rq->curr = next;
> +		rcu_assign_pointer(rq->curr, next);
>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating

This one is sad; it puts a (potentially) expensive barrier in here. And
I'm not sure I can explain the need for it. That is, we've not changed
@next before this and don't need to 'publish' it as such.

Can we use RCU_INIT_POINTER() or simply WRITE_ONCE(), here?


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03  7:41                     ` Peter Zijlstra
@ 2019-09-03  7:47                       ` Peter Zijlstra
  2019-09-03 16:44                         ` Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-03  7:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On Tue, Sep 03, 2019 at 09:41:17AM +0200, Peter Zijlstra wrote:
> On Mon, Sep 02, 2019 at 11:52:01PM -0500, Eric W. Biederman wrote:
> 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..802958407369 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> 
> > @@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
> >  
> >  	if (likely(prev != next)) {
> >  		rq->nr_switches++;
> > -		rq->curr = next;
> > +		rcu_assign_pointer(rq->curr, next);
> >  		/*
> >  		 * The membarrier system call requires each architecture
> >  		 * to have a full memory barrier after updating
> 
> This one is sad; it puts a (potentially) expensive barrier in here. And
> I'm not sure I can explain the need for it. That is, we've not changed
> @next before this and don't need to 'publish' it as such.
> 
> Can we use RCU_INIT_POINTER() or simply WRITE_ONCE(), here?

That is, I'm thinking we qualify for point 3 (both a and b) of
RCU_INIT_POINTER().

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected
  2019-09-03  4:52                   ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
@ 2019-09-03  9:45                     ` kbuild test robot
  2019-09-03 13:06                     ` Oleg Nesterov
  1 sibling, 0 replies; 75+ messages in thread
From: kbuild test robot @ 2019-09-03  9:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: kbuild-all, Linus Torvalds, Oleg Nesterov,
	Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso

Hi "Eric,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[cannot apply to v5.3-rc7 next-20190902]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Eric-W-Biederman/task-Making-tasks-on-the-runqueue-rcu-protected/20190903-130546
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.1-rc1-7-g2b96cd8-dirty
        make ARCH=x86_64 allmodconfig
        make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)

>> kernel/sched/membarrier.c:74:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> kernel/sched/membarrier.c:74:21: sparse:    struct task_struct [noderef] <asn:4> *
>> kernel/sched/membarrier.c:74:21: sparse:    struct task_struct *
   kernel/sched/membarrier.c:153:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/membarrier.c:153:21: sparse:    struct task_struct [noderef] <asn:4> *
   kernel/sched/membarrier.c:153:21: sparse:    struct task_struct *

vim +74 kernel/sched/membarrier.c

    32	
    33	static int membarrier_global_expedited(void)
    34	{
    35		int cpu;
    36		bool fallback = false;
    37		cpumask_var_t tmpmask;
    38	
    39		if (num_online_cpus() == 1)
    40			return 0;
    41	
    42		/*
    43		 * Matches memory barriers around rq->curr modification in
    44		 * scheduler.
    45		 */
    46		smp_mb();	/* system call entry is not a mb. */
    47	
    48		/*
    49		 * Expedited membarrier commands guarantee that they won't
    50		 * block, hence the GFP_NOWAIT allocation flag and fallback
    51		 * implementation.
    52		 */
    53		if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
    54			/* Fallback for OOM. */
    55			fallback = true;
    56		}
    57	
    58		cpus_read_lock();
    59		for_each_online_cpu(cpu) {
    60			struct task_struct *p;
    61	
    62			/*
    63			 * Skipping the current CPU is OK even through we can be
    64			 * migrated at any point. The current CPU, at the point
    65			 * where we read raw_smp_processor_id(), is ensured to
    66			 * be in program order with respect to the caller
    67			 * thread. Therefore, we can skip this CPU from the
    68			 * iteration.
    69			 */
    70			if (cpu == raw_smp_processor_id())
    71				continue;
    72	
    73			rcu_read_lock();
  > 74			p = rcu_dereference(cpu_rq(cpu)->curr);
    75			if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
    76					   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
    77				if (!fallback)
    78					__cpumask_set_cpu(cpu, tmpmask);
    79				else
    80					smp_call_function_single(cpu, ipi_mb, NULL, 1);
    81			}
    82			rcu_read_unlock();
    83		}
    84		if (!fallback) {
    85			preempt_disable();
    86			smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
    87			preempt_enable();
    88			free_cpumask_var(tmpmask);
    89		}
    90		cpus_read_unlock();
    91	
    92		/*
    93		 * Memory barrier on the caller thread _after_ we finished
    94		 * waiting for the last IPI. Matches memory barriers around
    95		 * rq->curr modification in scheduler.
    96		 */
    97		smp_mb();	/* exit from system call is not a mb */
    98		return 0;
    99	}
   100	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected
  2019-09-03  4:52                   ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
  2019-09-03  9:45                     ` kbuild test robot
@ 2019-09-03 13:06                     ` Oleg Nesterov
  1 sibling, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2019-09-03 13:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On 09/02, Eric W. Biederman wrote:
>
> @@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
>  		return;
>  
>  	rcu_read_lock();
> -	cur = task_rcu_dereference(&dst_rq->curr);
> +	cur = rcu_dereference(dst_rq->curr);
>  	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
>  		cur = NULL;

afaics rq->curr can't be NULL, so you can also simplify the "if" check

	cur = task_rcu_dereference(&dst_rq->curr);
	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
		cur = NULL;

Same for membarrier_global_expedited/membarrier_private_expedited changed
by this patch.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 0/3] task: Making tasks on the runqueue rcu protected
  2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
                                     ` (2 preceding siblings ...)
  2019-09-03  4:52                   ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
@ 2019-09-03 13:58                   ` Oleg Nesterov
  2019-09-03 15:44                   ` Linus Torvalds
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
  5 siblings, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2019-09-03 13:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On 09/02, Eric W. Biederman wrote:
>
> Oleg do you have any issues with this code?

OK, let it be refcount_t, I agree it looks more readable.

> Eric W. Biederman (3):
>       task: Add a count of task rcu users
>       task: RCU protect tasks on the runqueue
>       task: Clean house now that tasks on the runqueue are rcu protected
> 
>  include/linux/rcuwait.h    | 20 +++----------
>  include/linux/sched.h      |  5 +++-
>  include/linux/sched/task.h |  2 +-
>  kernel/exit.c              | 74 ++++------------------------------------------
>  kernel/fork.c              |  8 +++--
>  kernel/sched/core.c        |  7 +++--
>  kernel/sched/fair.c        |  2 +-
>  kernel/sched/membarrier.c  |  4 +--
>  8 files changed, 27 insertions(+), 95 deletions(-)

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 0/3] task: Making tasks on the runqueue rcu protected
  2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
                                     ` (3 preceding siblings ...)
  2019-09-03 13:58                   ` [PATCH 0/3] task: Making tasks on the runqueue " Oleg Nesterov
@ 2019-09-03 15:44                   ` Linus Torvalds
  2019-09-03 19:46                     ` Peter Zijlstra
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
  5 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-03 15:44 UTC (permalink / raw)
  To: Eric W. Biederman, Paul E. McKenney
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On Mon, Sep 2, 2019 at 9:50 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I have split this work into 3 simple patches, so the code is straight
> forward to review and so that if any mistakes slip in it is easy to
> bisect them.  In the process of review what it takes to remove
> task_rcu_dereference I found yet another user of tasks on the
> runqueue in rcu context; the rcuwait_event code.  That code only needs
> it now unnecessary limits removed.

Looks very good to me.

I think PeterZ is right that the rcu_assign_pointer() in [PATCH 2/3]
could be a RCU_INIT_POINTER() due to condition #3 in the
RCU_INIT_POINTER rules. The initialization of the pointer value simply
has nothing to do with what the pointer points to - we're just
switching it to another case.

That said, it won't affect any of the core architectures much, because
smp_store_release() isn't that expensive (it's just a compiler barrier
on x86, it's a cheap instruction on arm64, and it should be very cheap
on any other architecture too unless they do insane things - even on
powerpc, which is about the worst case for any barriers, it's just an
lwsync).

It might be good to have Paul _look_ at it, and because of the minimal
performance impact I don't worry about it too much if it happens
later, but it should be something we keep in mind.

                  Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03  7:47                       ` Peter Zijlstra
@ 2019-09-03 16:44                         ` Eric W. Biederman
  2019-09-03 17:08                           ` Linus Torvalds
                                             ` (3 more replies)
  0 siblings, 4 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-03 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 03, 2019 at 09:41:17AM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 02, 2019 at 11:52:01PM -0500, Eric W. Biederman wrote:
>> 
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 2b037f195473..802958407369 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> 
>> > @@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
>> >  
>> >  	if (likely(prev != next)) {
>> >  		rq->nr_switches++;
>> > -		rq->curr = next;
>> > +		rcu_assign_pointer(rq->curr, next);
>> >  		/*
>> >  		 * The membarrier system call requires each architecture
>> >  		 * to have a full memory barrier after updating
>> 
>> This one is sad; it puts a (potentially) expensive barrier in here. And
>> I'm not sure I can explain the need for it. That is, we've not changed
>> @next before this and don't need to 'publish' it as such.
>> 
>> Can we use RCU_INIT_POINTER() or simply WRITE_ONCE(), here?
>
> That is, I'm thinking we qualify for point 3 (both a and b) of
> RCU_INIT_POINTER().

I don't think point (b) is a concern on any widely visible architecture.
After taking a quick skim through the users it does appear to me that
we almost definitely have changes to the task_struct since the last time
another cpu say that structure (3 a) and that we have cases where
reading stale values in the task_struct will result in incorrect
operation of the code.

The concern of point (b) is the old alpha caching case where you could
dereference a pointer and get a stale copy of the data structure.  This
is a concern when an you are following the pointer from another cpu.

From my quick skim the cases I can see where point (b) might apply are
in fair.c:task_numa_compare lots of fields in task_struct are read.  It
looks like reading a stale (old/wrong) value of cur->numa_group could be
very inexplicable and weird.  Similarly in the membarrier code reading a
pre-exec version of cur->mm could give completely inexplicable results.
Finally in rcuwait_wake_up reading a stale version of the process
cur->state could cause incorrect or missed wake ups in wake_up_process.


There might already be enough barriers in the scheduler that the barrier
in rcu_update_pointer is redundant.  The comment about membarrier at
least suggests that for processes that return to userspace we have a
full memory barrier.

So with a big fat comment explaining why it is safe we could potentially
use RCU_INIT_POINTER.  I currently don't see where the appropriate
barriers are so I can not write that comment or with a clear conscious
write the code to use RCU_INIT_POINTER instead of rcu_assign_pointer.

Eric


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 16:44                         ` Eric W. Biederman
@ 2019-09-03 17:08                           ` Linus Torvalds
  2019-09-03 18:13                             ` Eric W. Biederman
  2019-09-14 12:31                           ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-03 17:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On Tue, Sep 3, 2019 at 9:45 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> So with a big fat comment explaining why it is safe we could potentially
> use RCU_INIT_POINTER.  I currently don't see where the appropriate
> barriers are so I can not write that comment or with a clear conscious
> write the code to use RCU_INIT_POINTER instead of rcu_assign_pointer.

The only difference ends up being that RCU_INIT_POINTER() is just a
store, while rcu_assign_pointer() uses a smp_store_release().

(There is some build-time special case code to make
rcu_assign_pointer(NULL) avoid the store_release, but that is
irrelevant for this discussion).

So from a memory ordering standpoint,
RCU_INIT_POINTER-vs-rcu_assign_pointer doesn't change what pointer you
get (on the other CPU that does the reading), but only whether the
stores to behind the pointer have been ordered wrt the reading too.

Which no existing case can care about, since it didn't use to have any
ordering anyway before this patch series. The individual values read
off the thread pointer had their own individual memory ordering rules
(ie instead of making the _pointer_ be the serialization point, we
have rules for how "p->on_cpu" is ordered wrt the rq lock etc).

So one argument for just using RCU_INIT_POINTER is that it's the same
ordering that we had before, and then it's up to any users of that
pointer to order any accesses to any fields in 'struct task_struct'.

Conversely, one argument for using rcu_assign_pointer() is that when
we pair it with an RCU read, we get certain ordering guarantees
automatically. So _if_ we have fields that change when a process is
put on the run-queue, and the RCU users want to read those fields,
then the release/acquire semantics might perform better than potential
existing smp memory barriers we might have right now.

                 Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 17:08                           ` Linus Torvalds
@ 2019-09-03 18:13                             ` Eric W. Biederman
  2019-09-03 19:18                               ` Linus Torvalds
  2019-09-03 19:42                               ` [PATCH 2/3] task: RCU protect tasks on the runqueue Peter Zijlstra
  0 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-03 18:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Sep 3, 2019 at 9:45 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> So with a big fat comment explaining why it is safe we could potentially
>> use RCU_INIT_POINTER.  I currently don't see where the appropriate
>> barriers are so I can not write that comment or with a clear conscious
>> write the code to use RCU_INIT_POINTER instead of rcu_assign_pointer.
>
> The only difference ends up being that RCU_INIT_POINTER() is just a
> store, while rcu_assign_pointer() uses a smp_store_release().
>
> (There is some build-time special case code to make
> rcu_assign_pointer(NULL) avoid the store_release, but that is
> irrelevant for this discussion).
>
> So from a memory ordering standpoint,
> RCU_INIT_POINTER-vs-rcu_assign_pointer doesn't change what pointer you
> get (on the other CPU that does the reading), but only whether the
> stores to behind the pointer have been ordered wrt the reading too.

Which is my understanding.

> Which no existing case can care about, since it didn't use to have any
> ordering anyway before this patch series. The individual values read
> off the thread pointer had their own individual memory ordering rules
> (ie instead of making the _pointer_ be the serialization point, we
> have rules for how "p->on_cpu" is ordered wrt the rq lock etc).

Which would not be a regression if an existing case cared about it.

There are so few architectures where this is a real difference (anything
except alpha?) that we could have subtle bugs that have not been tracked
down for a long time.

I keep finding subtle bugs in much older and less subtle cases so I know
it can happen that very minor bugs can get overlooked.

> So one argument for just using RCU_INIT_POINTER is that it's the same
> ordering that we had before, and then it's up to any users of that
> pointer to order any accesses to any fields in 'struct task_struct'.

I agree that RCU_INIT_POINTER is equivalent to what we have now.

> Conversely, one argument for using rcu_assign_pointer() is that when
> we pair it with an RCU read, we get certain ordering guarantees
> automatically. So _if_ we have fields that change when a process is
> put on the run-queue, and the RCU users want to read those fields,
> then the release/acquire semantics might perform better than potential
> existing smp memory barriers we might have right now.

I think this is where I am looking a things differently than you and
Peter.  Why does it have to be ___schedule() that changes the value
in the task_struct?  Why can't it be something else that changes the
value and then proceeds to call schedule()?

What is the size of the window of changes that is relevant?

If we use RCU_INIT_POINTER if there was something that changed
task_struct and then called schedule() what ensures that a remote cpu
that has a stale copy of task_struct cached will update it's cache
after following the new value rq->curr?  Don't we need
rcu_assign_pointer to get that guarantee?

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 18:13                             ` Eric W. Biederman
@ 2019-09-03 19:18                               ` Linus Torvalds
  2019-09-03 20:06                                 ` Peter Zijlstra
  2019-09-03 19:42                               ` [PATCH 2/3] task: RCU protect tasks on the runqueue Peter Zijlstra
  1 sibling, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-03 19:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

On Tue, Sep 3, 2019 at 11:13 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I think this is where I am looking a things differently than you and
> Peter.  Why does it have to be ___schedule() that changes the value
> in the task_struct?  Why can't it be something else that changes the
> value and then proceeds to call schedule()?

No, I think we're in violent agreement here: it's _not_ necessary
schedule that changes any values at all. The values behind the pointer
are live both before - and even more so _after_ - we put the process
on the percpu rq.

> What is the size of the window of changes that is relevant?

It's not the _size_ of the window that is relevant. It's the _direction_.

A "smp_store_release()" is only an ordering wrt previous changes. Why
would _previous_ changes be special? They aren't. In many ways, the
task struct before it is on the runqueue is fairly static. It's only
_after_ it is on the runqueue that the process starts doing things to
its own data structures.

> If we use RCU_INIT_POINTER if there was something that changed
> task_struct and then called schedule() what ensures that a remote cpu
> that has a stale copy of task_struct cached will update it's cache
> after following the new value rq->curr?  Don't we need
> rcu_assign_pointer to get that guarantee?

Why are those "before you called schedule" modifications special?

It's _way_ more likely that the task struct fields will change _after_
it was scheduled in.

So in many ways, the scheduling point isn't really the most natural
place for a barrier. It's just an event. But it's not clear why
"changes before that" should be synchronized or be a special case. The
process was visible other ways long before it's being actively run.

So why add a barrier to the scheduler when it's not clear that it makes sense?

Now, if you can point to some particular field where that ordering
makes sense for the particular case of "make it active on the
runqueue" vs "look up the task from the runqueue using RCU", then I do
think that the whole release->acquire consistency makes sense.

But it's not clear that such a field exists, particularly when this is
in no way the *common* way to even get a task pointer, and other paths
do *not* use the runqueue as the serialization point.

See what I'm saying?

Is the runqueue addition point special for synchronization? I don't
see it, and it historically has never been.

But *IF* it is, then yes, then rcu_assign_pointer() would make sense.

                    Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 18:13                             ` Eric W. Biederman
  2019-09-03 19:18                               ` Linus Torvalds
@ 2019-09-03 19:42                               ` Peter Zijlstra
  1 sibling, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-03 19:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

On Tue, Sep 03, 2019 at 01:13:22PM -0500, Eric W. Biederman wrote:

> I think this is where I am looking a things differently than you and
> Peter.  Why does it have to be ___schedule() that changes the value
> in the task_struct?  Why can't it be something else that changes the
> value and then proceeds to call schedule()?

If you call schedule() you will pass through plenty that already implies
smp_mb() before writing the ->curr pointer. If you care about that case,
adding RELEASE semantics to that store gains you absolutely nothing
except a marginally slower kernel.

> If we use RCU_INIT_POINTER if there was something that changed
> task_struct and then called schedule() what ensures that a remote cpu
> that has a stale copy of task_struct cached will update it's cache
> after following the new value rq->curr?  Don't we need
> rcu_assign_pointer to get that guarantee?

That whole construct doesn't really make sense: one it is very rare to
change task_struct content for !current tasks (and if we do, it must be
with atomic ops, because then there can be concurrency), secondly when
calling schedule() there is no guarantee on what @next will be.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 0/3] task: Making tasks on the runqueue rcu protected
  2019-09-03 15:44                   ` Linus Torvalds
@ 2019-09-03 19:46                     ` Peter Zijlstra
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-03 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Paul E. McKenney, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Tue, Sep 03, 2019 at 08:44:40AM -0700, Linus Torvalds wrote:

> That said, it won't affect any of the core architectures much, because
> smp_store_release() isn't that expensive (it's just a compiler barrier
> on x86, it's a cheap instruction on arm64, and it should be very cheap
> on any other architecture too unless they do insane things - even on
> powerpc, which is about the worst case for any barriers, it's just an
> lwsync).

Right, x86/s390/Sparc it's a compiler barrier, ARM64 has a store-release
op which is relatively cheap, Power has an LWSYNC, but the rest does
store-release with smp_mb() -- and this includes ARM.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 19:18                               ` Linus Torvalds
@ 2019-09-03 20:06                                 ` Peter Zijlstra
  2019-09-03 21:32                                   ` Paul E. McKenney
  2019-09-27  8:10                                   ` [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr tip-bot2 for Eric W. Biederman
  0 siblings, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-03 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

On Tue, Sep 03, 2019 at 12:18:47PM -0700, Linus Torvalds wrote:
> Now, if you can point to some particular field where that ordering
> makes sense for the particular case of "make it active on the
> runqueue" vs "look up the task from the runqueue using RCU", then I do
> think that the whole release->acquire consistency makes sense.
> 
> But it's not clear that such a field exists, particularly when this is
> in no way the *common* way to even get a task pointer, and other paths
> do *not* use the runqueue as the serialization point.

Even if we could find a case (and I'm not seeing one in a hurry), I
would try really hard to avoid adding extra barriers here and instead
make the consumer a little more complicated if at all possible.

The Power folks got rid of a SYNC (yes, more expensive than LWSYNC) from
their __switch_to() implementation and that had a measurable impact.

9145effd626d ("powerpc/64: Drop explicit hwsync in context switch")

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 20:06                                 ` Peter Zijlstra
@ 2019-09-03 21:32                                   ` Paul E. McKenney
  2019-09-05 20:02                                     ` Eric W. Biederman
  2019-09-27  8:10                                   ` [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr tip-bot2 for Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-03 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Eric W. Biederman, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso, mpe

On Tue, Sep 03, 2019 at 10:06:03PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 03, 2019 at 12:18:47PM -0700, Linus Torvalds wrote:
> > Now, if you can point to some particular field where that ordering
> > makes sense for the particular case of "make it active on the
> > runqueue" vs "look up the task from the runqueue using RCU", then I do
> > think that the whole release->acquire consistency makes sense.
> > 
> > But it's not clear that such a field exists, particularly when this is
> > in no way the *common* way to even get a task pointer, and other paths
> > do *not* use the runqueue as the serialization point.
> 
> Even if we could find a case (and I'm not seeing one in a hurry), I
> would try really hard to avoid adding extra barriers here and instead
> make the consumer a little more complicated if at all possible.
> 
> The Power folks got rid of a SYNC (yes, more expensive than LWSYNC) from
> their __switch_to() implementation and that had a measurable impact.
> 
> 9145effd626d ("powerpc/64: Drop explicit hwsync in context switch")

The patch [1] looks good to me.  And yes, if the structure pointed to by
the second argument of rcu_assign_pointer() is already visible to readers,
it is OK to instead use RCU_INIT_POINTER().  Yes, this loses ordering.
But weren't these simple assignments before RCU got involved?

As a very rough rule of thumb, LWSYNC is about twice as fast as SYNC.
(Depends on workload, exact details of the hardware, timing, phase of
the moon, you name it.)  So removing the LWSYNC is likely to provide
measureable benefit, but I must defer to the powerpc maintainers.
To that end, I added Michael on CC.

							Thanx, Paul

[1] https://lore.kernel.org/lkml/878sr6t21a.fsf_-_@x220.int.ebiederm.org/

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03  4:52                   ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
  2019-09-03  7:41                     ` Peter Zijlstra
@ 2019-09-04 14:22                     ` Frederic Weisbecker
  1 sibling, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2019-09-04 14:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Mon, Sep 02, 2019 at 11:52:01PM -0500, Eric W. Biederman wrote:
> 
> In the ordinary case today the rcu grace period of a task comes when a
> task is reaped, well after the task has left the runqueue.  This
> change guarantees that the rcu grace period always happens after a
> task has left the runqueue.  As this is something that usaually happens
> today I do not expect any code correctness problems with this change.
> At most I anticipate timing challenges.

What do you consider as the reaping point here? If this is the call to
release_task(), it can happen way before the task forever leaves the runqueue.

Let alone the RCU call to delayed_put_task_struct() can happen way before
the target leaves the runqueue, either after autoreap or normal reaping.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] task: Add a count of task rcu users
  2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
@ 2019-09-04 14:36                     ` Oleg Nesterov
  2019-09-04 14:44                     ` Frederic Weisbecker
  1 sibling, 0 replies; 75+ messages in thread
From: Oleg Nesterov @ 2019-09-04 14:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Russell King - ARM Linux admin, Peter Zijlstra,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso

On 09/02, Eric W. Biederman wrote:
>
> @@ -900,10 +900,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> -	/*
> -	 * One for us, one for whoever does the "release_task()" (usually
> -	 * parent)
> -	 */
> +	/* One for the user space visible state that goes away when reaped. */
> +	refcount_set(&tsk->rcu_users, 1);

forgot to mention...

This is minor, but we don't really need to initialize child->rcu_users in
dup_task_struct(), we can just add

	.rcu_users = REFCOUNT_INIT(2),

into init_task's initializer.

Until we have a refcount_inc(task->rcu_users) user, of course.

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] task: Add a count of task rcu users
  2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
  2019-09-04 14:36                     ` Oleg Nesterov
@ 2019-09-04 14:44                     ` Frederic Weisbecker
  2019-09-04 15:32                       ` Oleg Nesterov
  1 sibling, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2019-09-04 14:44 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Mon, Sep 02, 2019 at 11:51:34PM -0500, Eric W. Biederman wrote:
> 
> Add a count of the number of rcu users (currently 1) of the task
> struct so that we can later add the scheduler case and get rid of the
> very subtle task_rcu_dereference, and just use rcu_dereference.
> 
> As suggested by Oleg have the count overlap rcu_head so that no
> additional space in task_struct is required.
> 
> Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
> Inspired-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched.h      | 5 ++++-
>  include/linux/sched/task.h | 1 +
>  kernel/exit.c              | 7 ++++++-
>  kernel/fork.c              | 7 +++----
>  4 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9f51932bd543..99a4518b9b17 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1142,7 +1142,10 @@ struct task_struct {
>  
>  	struct tlbflush_unmap_batch	tlb_ubc;
>  
> -	struct rcu_head			rcu;
> +	union {
> +		refcount_t		rcu_users;
> +		struct rcu_head		rcu;

So what happens if, say:


           CPU 1                         CPU 2
   --------------------------------------------------------------
   rcu_read_lock()
   p = rcu_dereference(rq->task)
   if (refcount_inc_not_zero(p->rcu_users)) {
       .....
                                         release_task() {
                                             put_task_struct_rcu_user() {
                                                 call_rcu() {
                                                     queue rcu_head
                                                 }
                                             }
                                         }
       put_task_struct_rcu_user(); //here rcu_users has been overwritten


Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] task: Add a count of task rcu users
  2019-09-04 14:44                     ` Frederic Weisbecker
@ 2019-09-04 15:32                       ` Oleg Nesterov
  2019-09-04 16:33                         ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Oleg Nesterov @ 2019-09-04 15:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso

On 09/04, Frederic Weisbecker wrote:
>
> So what happens if, say:
>
>
>            CPU 1                         CPU 2
>    --------------------------------------------------------------
>    rcu_read_lock()
>    p = rcu_dereference(rq->task)
>    if (refcount_inc_not_zero(p->rcu_users)) {
>        .....
>                                          release_task() {
>                                              put_task_struct_rcu_user() {
>                                                  call_rcu() {
>                                                      queue rcu_head

in this particular case call_rcu() won't be called, so

>                                                  }
>                                              }
>                                          }
>        put_task_struct_rcu_user(); //here rcu_users has been overwritten

rcu_users won't be overwritten.

But nobody should try to increment ->rcu_users,

	rcu_read_lock();
	p = rcu_dereference(rq->task);
	refcount_inc_not_zero(p->rcu_users);

is already wrong because both release_task/last-schedule can happen in
between, before refcount_inc_not_zero().

Oleg.


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] task: Add a count of task rcu users
  2019-09-04 15:32                       ` Oleg Nesterov
@ 2019-09-04 16:33                         ` Frederic Weisbecker
  2019-09-04 18:20                           ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2019-09-04 16:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Linus Torvalds,
	Russell King - ARM Linux admin, Peter Zijlstra, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso

On Wed, Sep 04, 2019 at 05:32:46PM +0200, Oleg Nesterov wrote:
> On 09/04, Frederic Weisbecker wrote:
> >
> > So what happens if, say:
> >
> >
> >            CPU 1                         CPU 2
> >    --------------------------------------------------------------
> >    rcu_read_lock()
> >    p = rcu_dereference(rq->task)
> >    if (refcount_inc_not_zero(p->rcu_users)) {
> >        .....
> >                                          release_task() {
> >                                              put_task_struct_rcu_user() {
> >                                                  call_rcu() {
> >                                                      queue rcu_head
> 
> in this particular case call_rcu() won't be called, so

Yeah I confused myself in finding the scenario but like you say below, release_task()
can simply happen between the p = rcu_dereference() and the refcount_inc to break things.

I thought the point of these rcu_users was to be able to do:

rcu_read_lock()
p = rcu_dereference(task)
if (!refcount_inc_not_zero(p->rcu_users)) {
    rcu_read_unlock();
    return -1;
}
rcu_read_unlock();

//do stuff with task

put_task_struct_rcu_user(p);

Thanks.

> 
> >                                                  }
> >                                              }
> >                                          }
> >        put_task_struct_rcu_user(); //here rcu_users has been overwritten
> 
> rcu_users won't be overwritten.
> 
> But nobody should try to increment ->rcu_users,
> 
> 	rcu_read_lock();
> 	p = rcu_dereference(rq->task);
> 	refcount_inc_not_zero(p->rcu_users);
> 
> is already wrong because both release_task/last-schedule can happen in
> between, before refcount_inc_not_zero().
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] task: Add a count of task rcu users
  2019-09-04 16:33                         ` Frederic Weisbecker
@ 2019-09-04 18:20                           ` Linus Torvalds
  2019-09-05 14:59                             ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-04 18:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Eric W. Biederman, Russell King - ARM Linux admin,
	Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Wed, Sep 4, 2019 at 9:33 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> I thought the point of these rcu_users was to be able to do:
>
> rcu_read_lock()
> p = rcu_dereference(task)
> if (!refcount_inc_not_zero(p->rcu_users)) {

No. Because of the shared state, you can't do that from RCU context.

But you *could* increase rcu_users from within the process context
itself (as long as you do it before the exit path, ie during normal
system call execution), or possibly while holding the tasklist_lock
and verifying that the task hasn't died yet.

I'm not sure there is any sensible case for doing that, though. It
would have to have a similar pattern to the runqueue use, where you
add a new RCU lookup point for the task. I'm sure something like that
_could_ exist, I just can't think of any right now.

             Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 1/3] task: Add a count of task rcu users
  2019-09-04 18:20                           ` Linus Torvalds
@ 2019-09-05 14:59                             ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2019-09-05 14:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Eric W. Biederman, Russell King - ARM Linux admin,
	Peter Zijlstra, Chris Metcalf, Christoph Lameter, Kirill Tkhai,
	Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Wed, Sep 04, 2019 at 11:20:03AM -0700, Linus Torvalds wrote:
> On Wed, Sep 4, 2019 at 9:33 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > I thought the point of these rcu_users was to be able to do:
> >
> > rcu_read_lock()
> > p = rcu_dereference(task)
> > if (!refcount_inc_not_zero(p->rcu_users)) {
> 
> No. Because of the shared state, you can't do that from RCU context.
> 
> But you *could* increase rcu_users from within the process context
> itself (as long as you do it before the exit path, ie during normal
> system call execution), or possibly while holding the tasklist_lock
> and verifying that the task hasn't died yet.
> 
> I'm not sure there is any sensible case for doing that, though. It
> would have to have a similar pattern to the runqueue use, where you
> add a new RCU lookup point for the task. I'm sure something like that
> _could_ exist, I just can't think of any right now.

Yeah indeed, in fact I just hadn't read the patches entirely so I missed
the point. I see now that the purpose of rcu_users is to postpone the RCU
delayed put_task_struct() at least once we are done with both final schedule
out and release_task().

Sorry for the noise :o)

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-03 21:32                                   ` Paul E. McKenney
@ 2019-09-05 20:02                                     ` Eric W. Biederman
  2019-09-05 20:55                                       ` Paul E. McKenney
  2019-09-06  7:07                                       ` Peter Zijlstra
  0 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-05 20:02 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso, mpe

"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Tue, Sep 03, 2019 at 10:06:03PM +0200, Peter Zijlstra wrote:
>> On Tue, Sep 03, 2019 at 12:18:47PM -0700, Linus Torvalds wrote:
>> > Now, if you can point to some particular field where that ordering
>> > makes sense for the particular case of "make it active on the
>> > runqueue" vs "look up the task from the runqueue using RCU", then I do
>> > think that the whole release->acquire consistency makes sense.
>> > 
>> > But it's not clear that such a field exists, particularly when this is
>> > in no way the *common* way to even get a task pointer, and other paths
>> > do *not* use the runqueue as the serialization point.
>> 
>> Even if we could find a case (and I'm not seeing one in a hurry), I
>> would try really hard to avoid adding extra barriers here and instead
>> make the consumer a little more complicated if at all possible.
>> 
>> The Power folks got rid of a SYNC (yes, more expensive than LWSYNC) from
>> their __switch_to() implementation and that had a measurable impact.
>> 
>> 9145effd626d ("powerpc/64: Drop explicit hwsync in context switch")
>
> The patch [1] looks good to me.  And yes, if the structure pointed to by
> the second argument of rcu_assign_pointer() is already visible to readers,
> it is OK to instead use RCU_INIT_POINTER().  Yes, this loses ordering.
> But weren't these simple assignments before RCU got involved?
>
> As a very rough rule of thumb, LWSYNC is about twice as fast as SYNC.
> (Depends on workload, exact details of the hardware, timing, phase of
> the moon, you name it.)  So removing the LWSYNC is likely to provide
> measureable benefit, but I must defer to the powerpc maintainers.
> To that end, I added Michael on CC.
>
> [1] https://lore.kernel.org/lkml/878sr6t21a.fsf_-_@x220.int.ebiederm.org/

Paul, what is the purpose of the barrier in rcu_assign_pointer?

My intuition says it is the assignment half of rcu_dereference, and that
anything that rcu_dereference does not need is too strong.

My basic understanding is that only alpha ever has the memory ordering
issue that rcu_dereference deals with.  So I am a bit surprised that
this is anything other than a noop for anything except alpha.

In my patch I used rcu_assign_pointer because that is the canonically
correct way to do things.  Peter makes a good case that adding an extra
barrier in ___schedule could be detrimental to system performance.
At the same time if there is a correctness issue on alpha that we have
been overlooking because of low testing volume on alpha I don't want to
just let this slide and have very subtle bugs.

The practical concern is that people have been really wanting to do
lockless and rcu operations on tasks in the runqueue for a while and
there are several very clever pieces of code doing that now.  By
changing the location of the rcu put I am trying to make these uses
ordinary rcu uses.

The uses in question are the pieces of code I update in:
https://lore.kernel.org/lkml/8736het20c.fsf_-_@x220.int.ebiederm.org/

In short.  Why is rcu_assign_pointer() more than WRITE_ONCE() on
anything but alpha?  Do other architectures need more?  Is the trade off
worth it if we avoid using rcu_assign_pointer on performance critical
paths.

Eric

p.s. I am being slow at working through all of this as I am dealing
     with my young baby son, and busy packing for the conference.

     I might not be able to get back to this discussion until after
     I have landed in Lisbon on Saturday night.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-05 20:02                                     ` Eric W. Biederman
@ 2019-09-05 20:55                                       ` Paul E. McKenney
  2019-09-06  7:07                                       ` Peter Zijlstra
  1 sibling, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-05 20:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso, mpe

On Thu, Sep 05, 2019 at 03:02:49PM -0500, Eric W. Biederman wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
> > On Tue, Sep 03, 2019 at 10:06:03PM +0200, Peter Zijlstra wrote:
> >> On Tue, Sep 03, 2019 at 12:18:47PM -0700, Linus Torvalds wrote:
> >> > Now, if you can point to some particular field where that ordering
> >> > makes sense for the particular case of "make it active on the
> >> > runqueue" vs "look up the task from the runqueue using RCU", then I do
> >> > think that the whole release->acquire consistency makes sense.
> >> > 
> >> > But it's not clear that such a field exists, particularly when this is
> >> > in no way the *common* way to even get a task pointer, and other paths
> >> > do *not* use the runqueue as the serialization point.
> >> 
> >> Even if we could find a case (and I'm not seeing one in a hurry), I
> >> would try really hard to avoid adding extra barriers here and instead
> >> make the consumer a little more complicated if at all possible.
> >> 
> >> The Power folks got rid of a SYNC (yes, more expensive than LWSYNC) from
> >> their __switch_to() implementation and that had a measurable impact.
> >> 
> >> 9145effd626d ("powerpc/64: Drop explicit hwsync in context switch")
> >
> > The patch [1] looks good to me.  And yes, if the structure pointed to by
> > the second argument of rcu_assign_pointer() is already visible to readers,
> > it is OK to instead use RCU_INIT_POINTER().  Yes, this loses ordering.
> > But weren't these simple assignments before RCU got involved?
> >
> > As a very rough rule of thumb, LWSYNC is about twice as fast as SYNC.
> > (Depends on workload, exact details of the hardware, timing, phase of
> > the moon, you name it.)  So removing the LWSYNC is likely to provide
> > measureable benefit, but I must defer to the powerpc maintainers.
> > To that end, I added Michael on CC.
> >
> > [1] https://lore.kernel.org/lkml/878sr6t21a.fsf_-_@x220.int.ebiederm.org/
> 
> Paul, what is the purpose of the barrier in rcu_assign_pointer?
> 
> My intuition says it is the assignment half of rcu_dereference, and that
> anything that rcu_dereference does not need is too strong.
> 
> My basic understanding is that only alpha ever has the memory ordering
> issue that rcu_dereference deals with.  So I am a bit surprised that
> this is anything other than a noop for anything except alpha.

Yes, only Alpha needs an actual memory-barrier instruction in
rcu_dereference().  And it is the only one that gets one.

> In my patch I used rcu_assign_pointer because that is the canonically
> correct way to do things.  Peter makes a good case that adding an extra
> barrier in ___schedule could be detrimental to system performance.
> At the same time if there is a correctness issue on alpha that we have
> been overlooking because of low testing volume on alpha I don't want to
> just let this slide and have very subtle bugs.

But rcu_assign_pointer() needs a memory-barrier instruction on the
weakly ordered architectures: ARM, powerpc, Itanium, and so on.

> The practical concern is that people have been really wanting to do
> lockless and rcu operations on tasks in the runqueue for a while and
> there are several very clever pieces of code doing that now.  By
> changing the location of the rcu put I am trying to make these uses
> ordinary rcu uses.
> 
> The uses in question are the pieces of code I update in:
> https://lore.kernel.org/lkml/8736het20c.fsf_-_@x220.int.ebiederm.org/

The rcu_assign_pointer() at the end of rcuwait_wait_event(), right?

> In short.  Why is rcu_assign_pointer() more than WRITE_ONCE() on
> anything but alpha?  Do other architectures need more?  Is the trade off
> worth it if we avoid using rcu_assign_pointer on performance critical
> paths.

Note the difference between the read-side rcu_dereference(), which
does not require any memory-barrier instructions except on Alpha,
and the update-side rcu_assign_pointer() which does require a
memory-barrier instruction on quite a few architectures.  Even on
the strongly ordered architectures (x86, s390, ...) a compiler
barrier() is required for rcu_assign_pointer().

Except note the exceptional cases where RCU_INIT_POINTER() may be
used in place of rcu_assign_pointer(), which are called out in
RCU_INIT_POINTER()'s header comment:

 * Initialize an RCU-protected pointer in special cases where readers
 * do not need ordering constraints on the CPU or the compiler.  These
 * special cases are:
 *
 * 1.	This use of RCU_INIT_POINTER() is NULLing out the pointer *or*
 * 2.	The caller has taken whatever steps are required to prevent
 *	RCU readers from concurrently accessing this pointer *or*
 * 3.	The referenced data structure has already been exposed to
 *	readers either at compile time or via rcu_assign_pointer() *and*
 *
 *	a.	You have not made *any* reader-visible changes to
 *		this structure since then *or*
 *	b.	It is OK for readers accessing this structure from its
 *		new location to see the old state of the structure.  (For
 *		example, the changes were to statistical counters or to
 *		other state where exact synchronization is not required.)
 *
 * Failure to follow these rules governing use of RCU_INIT_POINTER() will
 * result in impossible-to-diagnose memory corruption.  As in the structures
 * will look OK in crash dumps, but any concurrent RCU readers might
 * see pre-initialized values of the referenced data structure.  So
 * please be very careful how you use RCU_INIT_POINTER()!!!

If current is already visible (which it should be unless
rcuwait_wait_event() is invoked at task-creation time), then
RCU_INIT_POINTER() could be used in rcuwait_wait_event().

> Eric
> 
> p.s. I am being slow at working through all of this as I am dealing
>      with my young baby son, and busy packing for the conference.

Congratulations on the baby son!!!  I remember those times well, but
they were more than three decades ago for my oldest.  ;-)

>      I might not be able to get back to this discussion until after
>      I have landed in Lisbon on Saturday night.

Looking forward to it!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-05 20:02                                     ` Eric W. Biederman
  2019-09-05 20:55                                       ` Paul E. McKenney
@ 2019-09-06  7:07                                       ` Peter Zijlstra
  2019-09-09 12:22                                         ` Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-06  7:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul E. McKenney, Linus Torvalds, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso, mpe

On Thu, Sep 05, 2019 at 03:02:49PM -0500, Eric W. Biederman wrote:
> Paul, what is the purpose of the barrier in rcu_assign_pointer?
> 
> My intuition says it is the assignment half of rcu_dereference, and that
> anything that rcu_dereference does not need is too strong.

I see that Paul has also replied, but let me give another take on it.

RCU does *two* different but related things. It provide object
consistency and it provides object lifetime management.

Object consistency is provided by rcu_assign_pointer() /
rcu_dereference:

 - rcu_assign_pointer() is a PUBLISH operation, it is meant to expose an
object to the world. In that respect it needs to ensure that all
previous stores to this object are complete before it writes the
pointer and exposes the object.

To this purpose, rcu_assign_pointer() is an smp_store_release().

 - rcu_dereference() is a CONSUME operation. It matches the PUBLISH from
above and guarantees that any further loads/stores to the observed
object come after.

Naturally this would be an smp_load_acquire(), _HOWEVER_, since we need
the address of the object in order to construct a load/store from/to
said object, we can rely on the address-dependency to provide this
barrier (except for Alpha, which is fscking weird). After all if you
haven't completed the load of the (object) base address, you cannot
compute the object member address and issue any subsequent memops, now
can you ;-)

Now the thing here is that the 'rq->curr = next' assignment does _NOT_
expose the task (our object). The task is exposed the moment it enters
the PID hash. It is this that sets us free of the PUBLISH requirement
and thus we can use RCU_INIT_POINTER().


The object lifetime thing is provided by
rcu_read_load()/rcu_read_unlock() on the one hand and
synchronize_rcu()/call_rcu() on the other. That ensures that if you
observe an object inside the RCU read side section, the object storage
must stay valid.

And it is *that* properpy we wish to make use of.


Does this make sense to you?

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-06  7:07                                       ` Peter Zijlstra
@ 2019-09-09 12:22                                         ` Eric W. Biederman
  2019-09-25  7:36                                           ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-09 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Linus Torvalds, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso, mpe

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Sep 05, 2019 at 03:02:49PM -0500, Eric W. Biederman wrote:
>> Paul, what is the purpose of the barrier in rcu_assign_pointer?
>> 
>> My intuition says it is the assignment half of rcu_dereference, and that
>> anything that rcu_dereference does not need is too strong.
>
> I see that Paul has also replied, but let me give another take on it.

I appreciate his reply as well.  I don't think he picked up that we
are talking about my change to do rcu_assign_pointer(rq->curr, next)
in schedule.

> RCU does *two* different but related things. It provide object
> consistency and it provides object lifetime management.
>
> Object consistency is provided by rcu_assign_pointer() /
> rcu_dereference:
>
>  - rcu_assign_pointer() is a PUBLISH operation, it is meant to expose an
> object to the world. In that respect it needs to ensure that all
> previous stores to this object are complete before it writes the
> pointer and exposes the object.
>
> To this purpose, rcu_assign_pointer() is an smp_store_release().
>
>  - rcu_dereference() is a CONSUME operation. It matches the PUBLISH from
> above and guarantees that any further loads/stores to the observed
> object come after.
>
> Naturally this would be an smp_load_acquire(), _HOWEVER_, since we need
> the address of the object in order to construct a load/store from/to
> said object, we can rely on the address-dependency to provide this
> barrier (except for Alpha, which is fscking weird). After all if you
> haven't completed the load of the (object) base address, you cannot
> compute the object member address and issue any subsequent memops, now
> can you ;-)

I follow and agree till here.

I was definitely confused ealier on which half of the rcu_assign_pointer()
rcu_dereference was important on most cpus.   As I currently understand
things, in a normal case we need a memory barrier between the assignment
of the data to a structure and the assignment of a pointer to that
structure to ensure there is an ordering between the pointer and
the data.

That memory barrier is usually provided by rcu_assign_pointer.

> Now the thing here is that the 'rq->curr = next' assignment does _NOT_
> expose the task (our object). The task is exposed the moment it enters
> the PID hash. It is this that sets us free of the PUBLISH requirement
> and thus we can use RCU_INIT_POINTER().

So I disagree about the PID hash here.  I don't think exposure is
necessarily the right concept to think this through.

> The object lifetime thing is provided by
> rcu_read_load()/rcu_read_unlock() on the one hand and
> synchronize_rcu()/call_rcu() on the other. That ensures that if you
> observe an object inside the RCU read side section, the object storage
> must stay valid.
>
> And it is *that* properpy we wish to make use of.

I absolutely agree that object lifetime is something we want to make use
of.

> Does this make sense to you?

Not completely.

Let me see if I can explain my confusion in terms of task_numa_compare.

The function task_numa_comare now does:

	rcu_read_lock();
	cur = rcu_dereference(dst_rq->curr);

Then it proceeds to examine a few fields of cur.

	cur->flags;
        cur->cpus_ptr;
        cur->numa_group;
        cur->numa_faults[];
        cur->total_numa_faults;
        cur->se.cfs_rq;

My concern here is the ordering between setting rq->curr and and setting
those other fields.  If we read values of those fields that were not
current when we set cur than I think task_numa_compare would give an
incorrect answer.

Now in __schedule the code does:

	rq_lock(rq, &rf);
        smp_mp__after_spinlock();

	...

	next = pick_next_task(rq, prev, &rf);

	if (prev != next) {
        	rq->nr_switches++
        	rq->curr = next; // rcu_assign_pointer(rq->curr, next);
        }

Which basically leads me back to Linus's analsysis where he pointed out
that it is the actions while the task is running before it calls
scheudle that in general cause the fields to be set.

Given that we have a full memory barrier for taking the previous task
off the cpu everything for the next task is guaranteed to be ordered
with the assignment to rq->curr except whatever fields pick_next_task()
decides to change.

From what I can see pick_next_task_fair does not mess with any of
the fields that task_numa_compare uses.  So the existing memory barrier
should be more than sufficient for that case.

So I think I just need to write up a patch 4/3 that replaces the my
"rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
next)".  And includes a great big comment to that effect.



Now maybe I am wrong.  Maybe we can afford to see data that was changed
before the task became rq->curr in task_numa_compare.  But I don't think
so.  I really think it is that full memory barrier just a bit earlier
in __schedule that is keeping us safe.

Am I wrong?

Eric





^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 1/4] task: Add a count of task rcu users
  2019-09-03 16:44                         ` Eric W. Biederman
  2019-09-03 17:08                           ` Linus Torvalds
@ 2019-09-14 12:31                           ` Eric W. Biederman
  2019-09-14 12:31                           ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
  2019-09-14 12:32                           ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
  3 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


Add a count of the number of rcu users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference, and just use rcu_dereference.

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched.h      | 5 ++++-
 include/linux/sched/task.h | 1 +
 kernel/exit.c              | 7 ++++++-
 kernel/fork.c              | 7 +++----
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..99a4518b9b17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1142,7 +1142,10 @@ struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	struct rcu_head			rcu;
+	union {
+		refcount_t		rcu_users;
+		struct rcu_head		rcu;
+	};
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4c44c37236b2 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -116,6 +116,7 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..2e259286f4e7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_task_struct_rcu_user(struct task_struct *task)
+{
+	if (refcount_dec_and_test(&task->rcu_users))
+		call_rcu(&task->rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,7 +227,7 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_task_struct_rcu_user(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..9f04741d5c70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/*
-	 * One for us, one for whoever does the "release_task()" (usually
-	 * parent)
-	 */
+	/* One for the user space visible state that goes away when reaped. */
+	refcount_set(&tsk->rcu_users, 1);
+	/* One for the rcu users, and one for the scheduler */
 	refcount_set(&tsk->usage, 2);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue
  2019-09-03 16:44                         ` Eric W. Biederman
  2019-09-03 17:08                           ` Linus Torvalds
  2019-09-14 12:31                           ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
@ 2019-09-14 12:31                           ` Eric W. Biederman
  2019-09-14 12:32                           ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
  3 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


In the ordinary case today the rcu grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user in finish_task_switch to
guarantee that the there is a rcu lifetime after the task
leaves the runqueue.

Besides the change in the start of the rcu grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the rcu grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c       | 11 +++++++----
 kernel/sched/core.c |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f04741d5c70..7a74ade4e7d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/* One for the user space visible state that goes away when reaped. */
-	refcount_set(&tsk->rcu_users, 1);
-	/* One for the rcu users, and one for the scheduler */
-	refcount_set(&tsk->usage, 2);
+	/*
+	 * One for the user space visible state that goes away when reaped.
+	 * One for the scheduler.
+	 */
+	refcount_set(&tsk->rcu_users, 2);
+	/* One for the rcu users */
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..69015b7c28da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}
 
 	tick_nohz_task_switch();
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code
  2019-09-03 16:44                         ` Eric W. Biederman
                                             ` (2 preceding siblings ...)
  2019-09-14 12:31                           ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
@ 2019-09-14 12:32                           ` Eric W. Biederman
  3 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso


Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch.

In particular now that there tasks exiting the runqueue exprience
a rcu grace period none of the work performed by task_rcu_dereference
excpet the rcu_dereference is necessary so replace task_rcu_dereference
with rcu_dereference.

Remove the code in rcuwait_wait_event that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a rcu grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up as it is no longer relevant.

Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/rcuwait.h    | 20 +++---------
 include/linux/sched/task.h |  1 -
 kernel/exit.c              | 67 --------------------------------------
 kernel/sched/fair.c        |  2 +-
 kernel/sched/membarrier.c  |  4 +--
 5 files changed, 7 insertions(+), 87 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 563290fc194f..75c97e4bbc57 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -6,16 +6,11 @@
 
 /*
  * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner; where it is forbidden to use
- * after exit_notify(). task_struct is not properly rcu protected,
- * unless dealing with rcu-aware lists, ie: find_task_by_*().
+ * task in an rcu-safe manner.
  *
- * Alternatively we have task_rcu_dereference(), but the return
- * semantics have different implications which would break the
- * wakeup side. The only time @task is non-nil is when a user is
- * blocked (or checking if it needs to) on a condition, and reset
- * as soon as we know that the condition has succeeded and are
- * awoken.
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
  */
 struct rcuwait {
 	struct task_struct __rcu *task;
@@ -37,13 +32,6 @@ extern void rcuwait_wake_up(struct rcuwait *w);
  */
 #define rcuwait_wait_event(w, condition)				\
 ({									\
-	/*								\
-	 * Complain if we are called after do_exit()/exit_notify(),     \
-	 * as we cannot rely on the rcu critical region for the		\
-	 * wakeup side.							\
-	 */                                                             \
-	WARN_ON(current->exit_state);                                   \
-									\
 	rcu_assign_pointer((w)->task, current);				\
 	for (;;) {							\
 		/*							\
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4c44c37236b2..8bd51af44bf8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,6 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/exit.c b/kernel/exit.c
index 2e259286f4e7..f943773622fc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -234,69 +234,6 @@ void release_task(struct task_struct *p)
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 *    If we return NULL - we can pretend that we actually noticed that
-	 *    *ptask was updated when the previous task has exited. Or pretend
-	 *    that probe_slab_address(&sighand) reads NULL.
-	 *
-	 *    If we return the new task (because sighand is not NULL for any
-	 *    reason) - this is fine too. This (new) task can't go away before
-	 *    another gp pass.
-	 *
-	 *    And note: We could even eliminate the false positive if re-read
-	 *    task->sighand once again to avoid the falsely NULL. But this case
-	 *    is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
@@ -316,10 +253,6 @@ void rcuwait_wake_up(struct rcuwait *w)
 	 */
 	smp_mb(); /* (B) */
 
-	/*
-	 * Avoid using task_rcu_dereference() magic as long as we are careful,
-	 * see comment in rcuwait_wait_event() regarding ->exit_state.
-	 */
 	task = rcu_dereference(w->task);
 	if (task)
 		wake_up_process(task);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..215c640e2a6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..b14250a11608 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 1/4] task: Add a count of task rcu users
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
@ 2019-09-14 12:33                     ` Eric W. Biederman
  2019-09-15 13:54                       ` Paul E. McKenney
  2019-09-27  8:10                       ` [tip: sched/urgent] tasks: Add a count of task RCU users tip-bot2 for Eric W. Biederman
  2019-09-14 12:33                     ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
                                       ` (3 subsequent siblings)
  4 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso,
	Paul E. McKenney, Linus Torvalds


Add a count of the number of rcu users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference, and just use rcu_dereference.

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched.h      | 5 ++++-
 include/linux/sched/task.h | 1 +
 kernel/exit.c              | 7 ++++++-
 kernel/fork.c              | 7 +++----
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..99a4518b9b17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1142,7 +1142,10 @@ struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	struct rcu_head			rcu;
+	union {
+		refcount_t		rcu_users;
+		struct rcu_head		rcu;
+	};
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..4c44c37236b2 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -116,6 +116,7 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..2e259286f4e7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_task_struct_rcu_user(struct task_struct *task)
+{
+	if (refcount_dec_and_test(&task->rcu_users))
+		call_rcu(&task->rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,7 +227,7 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_task_struct_rcu_user(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..9f04741d5c70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/*
-	 * One for us, one for whoever does the "release_task()" (usually
-	 * parent)
-	 */
+	/* One for the user space visible state that goes away when reaped. */
+	refcount_set(&tsk->rcu_users, 1);
+	/* One for the rcu users, and one for the scheduler */
 	refcount_set(&tsk->usage, 2);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
  2019-09-14 12:33                     ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
@ 2019-09-14 12:33                     ` Eric W. Biederman
  2019-09-15 14:07                       ` Paul E. McKenney
  2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: " tip-bot2 for Eric W. Biederman
  2019-09-14 12:34                     ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
                                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso,
	Paul E. McKenney, Linus Torvalds


In the ordinary case today the rcu grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user in finish_task_switch to
guarantee that the there is a rcu lifetime after the task
leaves the runqueue.

Besides the change in the start of the rcu grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the rcu grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c       | 11 +++++++----
 kernel/sched/core.c |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f04741d5c70..7a74ade4e7d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/* One for the user space visible state that goes away when reaped. */
-	refcount_set(&tsk->rcu_users, 1);
-	/* One for the rcu users, and one for the scheduler */
-	refcount_set(&tsk->usage, 2);
+	/*
+	 * One for the user space visible state that goes away when reaped.
+	 * One for the scheduler.
+	 */
+	refcount_set(&tsk->rcu_users, 2);
+	/* One for the rcu users */
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..69015b7c28da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}
 
 	tick_nohz_task_switch();
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
  2019-09-14 12:33                     ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
  2019-09-14 12:33                     ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
@ 2019-09-14 12:34                     ` Eric W. Biederman
  2019-09-15 14:32                       ` Paul E. McKenney
  2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), " tip-bot2 for Eric W. Biederman
  2019-09-14 12:35                     ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
  2019-09-14 17:43                     ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
  4 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso,
	Paul E. McKenney, Linus Torvalds


Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch.

In particular now that there tasks exiting the runqueue exprience
a rcu grace period none of the work performed by task_rcu_dereference
excpet the rcu_dereference is necessary so replace task_rcu_dereference
with rcu_dereference.

Remove the code in rcuwait_wait_event that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a rcu grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up as it is no longer relevant.

Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/rcuwait.h    | 20 +++---------
 include/linux/sched/task.h |  1 -
 kernel/exit.c              | 67 --------------------------------------
 kernel/sched/fair.c        |  2 +-
 kernel/sched/membarrier.c  |  4 +--
 5 files changed, 7 insertions(+), 87 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 563290fc194f..75c97e4bbc57 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -6,16 +6,11 @@
 
 /*
  * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner; where it is forbidden to use
- * after exit_notify(). task_struct is not properly rcu protected,
- * unless dealing with rcu-aware lists, ie: find_task_by_*().
+ * task in an rcu-safe manner.
  *
- * Alternatively we have task_rcu_dereference(), but the return
- * semantics have different implications which would break the
- * wakeup side. The only time @task is non-nil is when a user is
- * blocked (or checking if it needs to) on a condition, and reset
- * as soon as we know that the condition has succeeded and are
- * awoken.
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
  */
 struct rcuwait {
 	struct task_struct __rcu *task;
@@ -37,13 +32,6 @@ extern void rcuwait_wake_up(struct rcuwait *w);
  */
 #define rcuwait_wait_event(w, condition)				\
 ({									\
-	/*								\
-	 * Complain if we are called after do_exit()/exit_notify(),     \
-	 * as we cannot rely on the rcu critical region for the		\
-	 * wakeup side.							\
-	 */                                                             \
-	WARN_ON(current->exit_state);                                   \
-									\
 	rcu_assign_pointer((w)->task, current);				\
 	for (;;) {							\
 		/*							\
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4c44c37236b2..8bd51af44bf8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,6 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/exit.c b/kernel/exit.c
index 2e259286f4e7..f943773622fc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -234,69 +234,6 @@ void release_task(struct task_struct *p)
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 *    If we return NULL - we can pretend that we actually noticed that
-	 *    *ptask was updated when the previous task has exited. Or pretend
-	 *    that probe_slab_address(&sighand) reads NULL.
-	 *
-	 *    If we return the new task (because sighand is not NULL for any
-	 *    reason) - this is fine too. This (new) task can't go away before
-	 *    another gp pass.
-	 *
-	 *    And note: We could even eliminate the false positive if re-read
-	 *    task->sighand once again to avoid the falsely NULL. But this case
-	 *    is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
@@ -316,10 +253,6 @@ void rcuwait_wake_up(struct rcuwait *w)
 	 */
 	smp_mb(); /* (B) */
 
-	/*
-	 * Avoid using task_rcu_dereference() magic as long as we are careful,
-	 * see comment in rcuwait_wait_event() regarding ->exit_state.
-	 */
 	task = rcu_dereference(w->task);
 	if (task)
 		wake_up_process(task);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..215c640e2a6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..b14250a11608 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH v2 4/4] task: RCUify the assignment of rq->curr
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
                                       ` (2 preceding siblings ...)
  2019-09-14 12:34                     ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
@ 2019-09-14 12:35                     ` Eric W. Biederman
  2019-09-15 14:41                       ` Paul E. McKenney
  2019-09-20 23:02                       ` Frederic Weisbecker
  2019-09-14 17:43                     ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
  4 siblings, 2 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-14 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Russell King - ARM Linux admin, Chris Metcalf,
	Christoph Lameter, Kirill Tkhai, Mike Galbraith, Thomas Gleixner,
	Ingo Molnar, Linux List Kernel Mailing, Davidlohr Bueso,
	Paul E. McKenney, Linus Torvalds


The current task on the runqueue is currently read with rcu_dereference().

To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
to be paird with rcu_assign_pointer of rq->curr.  Which provides the
memory barrier necessary to order assignments to the task_struct
and the assignment to rq->curr.

Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been show that additional barriers in that code
will reduce the performance of the scheduler.  So I will attempt to
describe below why you can effectively have ordinary rcu semantics
without any additional barriers.

The assignment of rq->curr in init_idle is a slow path called once
per cpu and that can use rcu_assign_pointer() without any concerns.

As I write this there are effectively two users of rcu_dereference on
rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
that only looks at "->mm" after the rcu_dereference.  Then there is
task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only access: "->flags",
"->cpus_ptr", "->numa_group", "->numa_faults[]",
"->total_numa_faults", and "->se.cfs_rq".

The code in __schedule() essentially does:
	rq_lock(...);
	smp_mb__after_spinlock();

	next = pick_next_task(...);
	rq->curr = next;

	context_switch(prev, next);

At the start of the function the rq_lock/smp_mb__after_spinlock
pair provides a full memory barrier.  Further there is a full memory barrier
in context_switch().

This means that any task that has already run and modified itself (the
common case) has already seen two memory barriers before __schedule()
runs and begins executing.  A task that modifies itself then sees a
third full memory barrier pair with the rq_lock();

For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers present from the taking and release the
pi_lock and the rq_lock as the processes is enqueued as well as the
full memory barrier at the start of __schedule() assuming __schedule()
happens on the same cpu.

This means that by the time we reach the assignment of rq->curr
except for values on the task struct modified in pick_next_task
the code has the same guarantees as if it used rcu_assign_pointer.

Reading through all of the implementations of pick_next_task it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", "->dl".  These fields are the sched_entity structures
of the varies schedulers.

Further "->se.cfs_rq" is only changed in cgroup attach/move operations
initialized by userspace.

Unless I have missed something this means that in practice that the
users of "rcu_dereerence(rq->curr)" get normal rcu semantics of
rcu_dereference() for the fields the care about, despite the
assignment of rq->curr in __schedule() ot using rcu_assign_pointer.

Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/sched/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 69015b7c28da..668262806942 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
-		rq->curr = next;
+		/*
+		 * RCU users of rcu_dereference(rq->curr) may not see
+		 * changes to task_struct made by pick_next_task().
+		 */
+		RCU_INIT_POINTER(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -5863,7 +5867,8 @@ void init_idle(struct task_struct *idle, int cpu)
 	__set_task_cpu(idle, cpu);
 	rcu_read_unlock();
 
-	rq->curr = rq->idle = idle;
+	rq->idle = idle;
+	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
-- 
2.21.0.dirty


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected
       [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
                                       ` (3 preceding siblings ...)
  2019-09-14 12:35                     ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
@ 2019-09-14 17:43                     ` Linus Torvalds
  2019-09-17 17:38                       ` Eric W. Biederman
  4 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-14 17:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

On Sat, Sep 14, 2019 at 5:30 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I have reworked these patches one more time to make it clear that the
> first 3 patches only fix task_struct so that it experiences a rcu grace
> period after it leaves the runqueue for the last time.

I remain a fan of these patches, and the added comment on the last one
is I think a sufficient clarification of the issue.

But it's patch 3 that makes me go "yeah, this is the right approach",
because it just removes subtle code in favor of something that is
understandable.

Yes, most of the lines removed may be comments, and so it doesn't
actually remove a lot of _code_, but I think the comments are a result
of just how subtle and fragile our current approach is, and the new
model not needing them as much is I think a real issue (rather than
just Eric being less verbose in the new comments and removing lines of
code that way).

Can anybody see anything wrong with the series? Because I'd love to
have it for 5.4,

             Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 1/4] task: Add a count of task rcu users
  2019-09-14 12:33                     ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
@ 2019-09-15 13:54                       ` Paul E. McKenney
  2019-09-27  8:10                       ` [tip: sched/urgent] tasks: Add a count of task RCU users tip-bot2 for Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 13:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

On Sat, Sep 14, 2019 at 07:33:34AM -0500, Eric W. Biederman wrote:
> 
> Add a count of the number of rcu users (currently 1) of the task
> struct so that we can later add the scheduler case and get rid of the
> very subtle task_rcu_dereference, and just use rcu_dereference.
> 
> As suggested by Oleg have the count overlap rcu_head so that no
> additional space in task_struct is required.
> 
> Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
> Inspired-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched.h      | 5 ++++-
>  include/linux/sched/task.h | 1 +
>  kernel/exit.c              | 7 ++++++-
>  kernel/fork.c              | 7 +++----
>  4 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9f51932bd543..99a4518b9b17 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1142,7 +1142,10 @@ struct task_struct {
>  
>  	struct tlbflush_unmap_batch	tlb_ubc;
>  
> -	struct rcu_head			rcu;
> +	union {
> +		refcount_t		rcu_users;
> +		struct rcu_head		rcu;
> +	};
>  
>  	/* Cache last used pipe for splice(): */
>  	struct pipe_inode_info		*splice_pipe;
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 0497091e40c1..4c44c37236b2 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -116,6 +116,7 @@ static inline void put_task_struct(struct task_struct *t)
>  }
>  
>  struct task_struct *task_rcu_dereference(struct task_struct **ptask);
> +void put_task_struct_rcu_user(struct task_struct *task);
>  
>  #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
>  extern int arch_task_struct_size __read_mostly;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 5b4a5dcce8f8..2e259286f4e7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>  	put_task_struct(tsk);
>  }
>  
> +void put_task_struct_rcu_user(struct task_struct *task)
> +{
> +	if (refcount_dec_and_test(&task->rcu_users))
> +		call_rcu(&task->rcu, delayed_put_task_struct);

We "instantly" transition from the union being ->rcu_user to being ->rcu,
so there needs to be some mechanism that has previously made sure that
nothing is going to attempt to use ->rcu on this task.

We cannot be relying solely on something like atomic_add_unless(),
because call_rcu() will likely result in ->rcu being non-zero.

> +}
>  
>  void release_task(struct task_struct *p)
>  {
> @@ -222,7 +227,7 @@ void release_task(struct task_struct *p)
>  
>  	write_unlock_irq(&tasklist_lock);
>  	release_thread(p);
> -	call_rcu(&p->rcu, delayed_put_task_struct);
> +	put_task_struct_rcu_user(p);

This, along with the pre-existing initialization of ->rcu_user to two 
below gives some hope, at least assuming that release_task() is invoked
after no one can possibly try to increment ->rcu_user.  And in v5.2,
release_task() is invoked from these places:

o	de_thread().  On this one, I must defer to Oleg, Peter, and crew.
	It might be that the list removals while write-holding the
	tasklist_lock do the trick, but that assumes that this lock is
	involved in acquiring a reference.

o	find_child_reaper().  This is invoked via exit_notify() from
	do_exit(), just after the call to exit_tasks_rcu_start().
	This is OK from a Tasks RCU perspective because it is using
	separate synchornization.  Something earlier must prevent
	new ->rcu_user references.

o	wait_task_zombie().  Here is hoping that the check for
	EXIT_DEAD is helpful here.

I am not seeing how this would be safe, but then again this is only the
first patch.  Plus there is much about this use case that I do not know.
OK, on to the other patches...

							Thanx, Paul

>  
>  	p = leader;
>  	if (unlikely(zap_leader))
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2852d0e76ea3..9f04741d5c70 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -900,10 +900,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	if (orig->cpus_ptr == &orig->cpus_mask)
>  		tsk->cpus_ptr = &tsk->cpus_mask;
>  
> -	/*
> -	 * One for us, one for whoever does the "release_task()" (usually
> -	 * parent)
> -	 */
> +	/* One for the user space visible state that goes away when reaped. */
> +	refcount_set(&tsk->rcu_users, 1);
> +	/* One for the rcu users, and one for the scheduler */
>  	refcount_set(&tsk->usage, 2);
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	tsk->btrace_seq = 0;
> -- 
> 2.21.0.dirty
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue
  2019-09-14 12:33                     ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
@ 2019-09-15 14:07                       ` Paul E. McKenney
  2019-09-15 14:09                         ` Paul E. McKenney
  2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: " tip-bot2 for Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 14:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

On Sat, Sep 14, 2019 at 07:33:58AM -0500, Eric W. Biederman wrote:
> 
> In the ordinary case today the rcu grace period for a task_struct is
> triggered when another process wait's for it's zombine and causes the
> kernel to call release_task().  As the waiting task has to receive a
> signal and then act upon it before this happens, typically this will
> occur after the original task as been removed from the runqueue.
> 
> Unfortunaty in some cases such as self reaping tasks it can be shown
> that release_task() will be called starting the grace period for
> task_struct long before the task leaves the runqueue.
> 
> Therefore use put_task_struct_rcu_user in finish_task_switch to
> guarantee that the there is a rcu lifetime after the task
> leaves the runqueue.
> 
> Besides the change in the start of the rcu grace period for the
> task_struct this change may cause perf_event_delayed_put and
> trace_sched_process_free.  The function perf_event_delayed_put boils
> down to just a WARN_ON for cases that I assume never show happen.  So
> I don't see any problem with delaying it.
> 
> The function trace_sched_process_free is a trace point and thus
> visible to user space.  Occassionally userspace has the strangest
> dependencies so this has a miniscule chance of causing a regression.
> This change only changes the timing of when the tracepoint is called.
> The change in timing arguably gives userspace a more accurate picture
> of what is going on.  So I don't expect there to be a regression.
> 
> In the case where a task self reaps we are pretty much guaranteed that
> the rcu grace period is delayed.  So we should get quite a bit of
> coverage in of this worst case for the change in a normal threaded
> workload.  So I expect any issues to turn up quickly or not at all.
> 
> I have lightly tested this change and everything appears to work
> fine.
> 
> Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
> Inspired-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/fork.c       | 11 +++++++----
>  kernel/sched/core.c |  2 +-
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9f04741d5c70..7a74ade4e7d6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  	if (orig->cpus_ptr == &orig->cpus_mask)
>  		tsk->cpus_ptr = &tsk->cpus_mask;
>  
> -	/* One for the user space visible state that goes away when reaped. */
> -	refcount_set(&tsk->rcu_users, 1);
> -	/* One for the rcu users, and one for the scheduler */
> -	refcount_set(&tsk->usage, 2);
> +	/*
> +	 * One for the user space visible state that goes away when reaped.
> +	 * One for the scheduler.
> +	 */
> +	refcount_set(&tsk->rcu_users, 2);

OK, this would allow us to add a later decrement-and-test of
->rcu_users ...

> +	/* One for the rcu users */
> +	refcount_set(&tsk->usage, 1);
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	tsk->btrace_seq = 0;
>  #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2b037f195473..69015b7c28da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>  		/* Task is done with its stack. */
>  		put_task_stack(prev);
>  
> -		put_task_struct(prev);
> +		put_task_struct_rcu_user(prev);

... which is here.  And this looks to be invoked from the __schedule()
called from do_task_dead() at the very end of do_exit().

This looks plausible, but still requires that it no longer be possible to
enter an RCU read-side critical section that might increment ->rcu_users
after this point in time.  This might be enforced by a grace period
between the time that the task was removed from its lists and the current
time (seems unlikely, though, in that case why bother with call_rcu()?) or
by some other synchronization.

On to the next patch!

							Thanx, Paul

>  	}
>  
>  	tick_nohz_task_switch();
> -- 
> 2.21.0.dirty
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue
  2019-09-15 14:07                       ` Paul E. McKenney
@ 2019-09-15 14:09                         ` Paul E. McKenney
  0 siblings, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 14:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

On Sun, Sep 15, 2019 at 07:07:52AM -0700, Paul E. McKenney wrote:
> On Sat, Sep 14, 2019 at 07:33:58AM -0500, Eric W. Biederman wrote:
> > 
> > In the ordinary case today the rcu grace period for a task_struct is
> > triggered when another process wait's for it's zombine and causes the

Oh, and "waits for its", just to hit the grammar en passant...  ;-)

							Thanx, Paul

> > kernel to call release_task().  As the waiting task has to receive a
> > signal and then act upon it before this happens, typically this will
> > occur after the original task as been removed from the runqueue.
> > 
> > Unfortunaty in some cases such as self reaping tasks it can be shown
> > that release_task() will be called starting the grace period for
> > task_struct long before the task leaves the runqueue.
> > 
> > Therefore use put_task_struct_rcu_user in finish_task_switch to
> > guarantee that the there is a rcu lifetime after the task
> > leaves the runqueue.
> > 
> > Besides the change in the start of the rcu grace period for the
> > task_struct this change may cause perf_event_delayed_put and
> > trace_sched_process_free.  The function perf_event_delayed_put boils
> > down to just a WARN_ON for cases that I assume never show happen.  So
> > I don't see any problem with delaying it.
> > 
> > The function trace_sched_process_free is a trace point and thus
> > visible to user space.  Occassionally userspace has the strangest
> > dependencies so this has a miniscule chance of causing a regression.
> > This change only changes the timing of when the tracepoint is called.
> > The change in timing arguably gives userspace a more accurate picture
> > of what is going on.  So I don't expect there to be a regression.
> > 
> > In the case where a task self reaps we are pretty much guaranteed that
> > the rcu grace period is delayed.  So we should get quite a bit of
> > coverage in of this worst case for the change in a normal threaded
> > workload.  So I expect any issues to turn up quickly or not at all.
> > 
> > I have lightly tested this change and everything appears to work
> > fine.
> > 
> > Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Inspired-by: Oleg Nesterov <oleg@redhat.com>
> > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > ---
> >  kernel/fork.c       | 11 +++++++----
> >  kernel/sched/core.c |  2 +-
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9f04741d5c70..7a74ade4e7d6 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> >  	if (orig->cpus_ptr == &orig->cpus_mask)
> >  		tsk->cpus_ptr = &tsk->cpus_mask;
> >  
> > -	/* One for the user space visible state that goes away when reaped. */
> > -	refcount_set(&tsk->rcu_users, 1);
> > -	/* One for the rcu users, and one for the scheduler */
> > -	refcount_set(&tsk->usage, 2);
> > +	/*
> > +	 * One for the user space visible state that goes away when reaped.
> > +	 * One for the scheduler.
> > +	 */
> > +	refcount_set(&tsk->rcu_users, 2);
> 
> OK, this would allow us to add a later decrement-and-test of
> ->rcu_users ...
> 
> > +	/* One for the rcu users */
> > +	refcount_set(&tsk->usage, 1);
> >  #ifdef CONFIG_BLK_DEV_IO_TRACE
> >  	tsk->btrace_seq = 0;
> >  #endif
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..69015b7c28da 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> >  		/* Task is done with its stack. */
> >  		put_task_stack(prev);
> >  
> > -		put_task_struct(prev);
> > +		put_task_struct_rcu_user(prev);
> 
> ... which is here.  And this looks to be invoked from the __schedule()
> called from do_task_dead() at the very end of do_exit().
> 
> This looks plausible, but still requires that it no longer be possible to
> enter an RCU read-side critical section that might increment ->rcu_users
> after this point in time.  This might be enforced by a grace period
> between the time that the task was removed from its lists and the current
> time (seems unlikely, though, in that case why bother with call_rcu()?) or
> by some other synchronization.
> 
> On to the next patch!
> 
> 							Thanx, Paul
> 
> >  	}
> >  
> >  	tick_nohz_task_switch();
> > -- 
> > 2.21.0.dirty
> > 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code
  2019-09-14 12:34                     ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
@ 2019-09-15 14:32                       ` Paul E. McKenney
  2019-09-15 17:07                         ` Linus Torvalds
  2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), " tip-bot2 for Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 14:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

On Sat, Sep 14, 2019 at 07:34:30AM -0500, Eric W. Biederman wrote:
> 
> Remove work arounds that were written before there was a grace period
> after tasks left the runqueue in finish_task_switch.
> 
> In particular now that there tasks exiting the runqueue exprience
> a rcu grace period none of the work performed by task_rcu_dereference
> excpet the rcu_dereference is necessary so replace task_rcu_dereference
> with rcu_dereference.
> 
> Remove the code in rcuwait_wait_event that checks to ensure the current
> task has not exited.  It is no longer necessary as it is guaranteed
> that any running task will experience a rcu grace period after it
> leaves the run queueue.
> 
> Remove the comment in rcuwait_wake_up as it is no longer relevant.
> 
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
> Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

First, what am I looking for?

I am looking for something that prevents the following:

o	Task A acquires a reference to Task B's task_struct while
	protected only by RCU, and is just about to increment ->rcu_users
	when it is delayed.  Maybe its vCPU is preempted or something.

o	Task B begins exiting.

o	Task B removes itself from all the lists.  (Which doesn't
	matter to Task A, which already has an RCU-protected reference
	to Task B's task_struct structure.

o	Task B does a bunch of stuff protected by various locks and atomic
	operations, but does not wait for an RCU grace period to elapse.
	(Or am I wrong about that?)

o	Task B does its final context switch, invoking finish_task_switch(),
	in turn invoking put_task_struct_rcu_user(), which does the
	final decrement of ->rcu_user to zero.	And then immediately
	overwrites that value by invoking call_rcu().

o	Task A resumes, sees a (bogus) non-zero ->rcu_user and proceeds
	to mangle the freelist.  Or worse yet, something just now
	allocated as some other data structure.

(If this really can happen, one easy fix is of course to remove the
union so that ->rcu and ->rcu_users don't sit on top of each other.)

With that, on to the patch!

Which does not do anything that I can see to prevent the above sequence
of events.

On to the next patch!

							Thanx, Paul

> ---
>  include/linux/rcuwait.h    | 20 +++---------
>  include/linux/sched/task.h |  1 -
>  kernel/exit.c              | 67 --------------------------------------
>  kernel/sched/fair.c        |  2 +-
>  kernel/sched/membarrier.c  |  4 +--
>  5 files changed, 7 insertions(+), 87 deletions(-)
> 
> diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
> index 563290fc194f..75c97e4bbc57 100644
> --- a/include/linux/rcuwait.h
> +++ b/include/linux/rcuwait.h
> @@ -6,16 +6,11 @@
>  
>  /*
>   * rcuwait provides a way of blocking and waking up a single
> - * task in an rcu-safe manner; where it is forbidden to use
> - * after exit_notify(). task_struct is not properly rcu protected,
> - * unless dealing with rcu-aware lists, ie: find_task_by_*().
> + * task in an rcu-safe manner.
>   *
> - * Alternatively we have task_rcu_dereference(), but the return
> - * semantics have different implications which would break the
> - * wakeup side. The only time @task is non-nil is when a user is
> - * blocked (or checking if it needs to) on a condition, and reset
> - * as soon as we know that the condition has succeeded and are
> - * awoken.
> + * The only time @task is non-nil is when a user is blocked (or
> + * checking if it needs to) on a condition, and reset as soon as we
> + * know that the condition has succeeded and are awoken.
>   */
>  struct rcuwait {
>  	struct task_struct __rcu *task;
> @@ -37,13 +32,6 @@ extern void rcuwait_wake_up(struct rcuwait *w);
>   */
>  #define rcuwait_wait_event(w, condition)				\
>  ({									\
> -	/*								\
> -	 * Complain if we are called after do_exit()/exit_notify(),     \
> -	 * as we cannot rely on the rcu critical region for the		\
> -	 * wakeup side.							\
> -	 */                                                             \
> -	WARN_ON(current->exit_state);                                   \
> -									\
>  	rcu_assign_pointer((w)->task, current);				\
>  	for (;;) {							\
>  		/*							\
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 4c44c37236b2..8bd51af44bf8 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -115,7 +115,6 @@ static inline void put_task_struct(struct task_struct *t)
>  		__put_task_struct(t);
>  }
>  
> -struct task_struct *task_rcu_dereference(struct task_struct **ptask);
>  void put_task_struct_rcu_user(struct task_struct *task);
>  
>  #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 2e259286f4e7..f943773622fc 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -234,69 +234,6 @@ void release_task(struct task_struct *p)
>  		goto repeat;
>  }
>  
> -/*
> - * Note that if this function returns a valid task_struct pointer (!NULL)
> - * task->usage must remain >0 for the duration of the RCU critical section.
> - */
> -struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> -{
> -	struct sighand_struct *sighand;
> -	struct task_struct *task;
> -
> -	/*
> -	 * We need to verify that release_task() was not called and thus
> -	 * delayed_put_task_struct() can't run and drop the last reference
> -	 * before rcu_read_unlock(). We check task->sighand != NULL,
> -	 * but we can read the already freed and reused memory.
> -	 */
> -retry:
> -	task = rcu_dereference(*ptask);
> -	if (!task)
> -		return NULL;
> -
> -	probe_kernel_address(&task->sighand, sighand);
> -
> -	/*
> -	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
> -	 * was already freed we can not miss the preceding update of this
> -	 * pointer.
> -	 */
> -	smp_rmb();
> -	if (unlikely(task != READ_ONCE(*ptask)))
> -		goto retry;
> -
> -	/*
> -	 * We've re-checked that "task == *ptask", now we have two different
> -	 * cases:
> -	 *
> -	 * 1. This is actually the same task/task_struct. In this case
> -	 *    sighand != NULL tells us it is still alive.
> -	 *
> -	 * 2. This is another task which got the same memory for task_struct.
> -	 *    We can't know this of course, and we can not trust
> -	 *    sighand != NULL.
> -	 *
> -	 *    In this case we actually return a random value, but this is
> -	 *    correct.
> -	 *
> -	 *    If we return NULL - we can pretend that we actually noticed that
> -	 *    *ptask was updated when the previous task has exited. Or pretend
> -	 *    that probe_slab_address(&sighand) reads NULL.
> -	 *
> -	 *    If we return the new task (because sighand is not NULL for any
> -	 *    reason) - this is fine too. This (new) task can't go away before
> -	 *    another gp pass.
> -	 *
> -	 *    And note: We could even eliminate the false positive if re-read
> -	 *    task->sighand once again to avoid the falsely NULL. But this case
> -	 *    is very unlikely so we don't care.
> -	 */
> -	if (!sighand)
> -		return NULL;
> -
> -	return task;
> -}
> -
>  void rcuwait_wake_up(struct rcuwait *w)
>  {
>  	struct task_struct *task;
> @@ -316,10 +253,6 @@ void rcuwait_wake_up(struct rcuwait *w)
>  	 */
>  	smp_mb(); /* (B) */
>  
> -	/*
> -	 * Avoid using task_rcu_dereference() magic as long as we are careful,
> -	 * see comment in rcuwait_wait_event() regarding ->exit_state.
> -	 */
>  	task = rcu_dereference(w->task);
>  	if (task)
>  		wake_up_process(task);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc9cfeaac8bd..215c640e2a6b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
>  		return;
>  
>  	rcu_read_lock();
> -	cur = task_rcu_dereference(&dst_rq->curr);
> +	cur = rcu_dereference(dst_rq->curr);
>  	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
>  		cur = NULL;
>  
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index aa8d75804108..b14250a11608 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
>  			continue;
>  
>  		rcu_read_lock();
> -		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		p = rcu_dereference(cpu_rq(cpu)->curr);
>  		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
>  				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
>  			if (!fallback)
> @@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
>  		if (cpu == raw_smp_processor_id())
>  			continue;
>  		rcu_read_lock();
> -		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> +		p = rcu_dereference(cpu_rq(cpu)->curr);
>  		if (p && p->mm == current->mm) {
>  			if (!fallback)
>  				__cpumask_set_cpu(cpu, tmpmask);
> -- 
> 2.21.0.dirty
> 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-14 12:35                     ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
@ 2019-09-15 14:41                       ` Paul E. McKenney
  2019-09-15 17:59                         ` Eric W. Biederman
  2019-09-20 23:02                       ` Frederic Weisbecker
  1 sibling, 1 reply; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 14:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
> 
> The current task on the runqueue is currently read with rcu_dereference().
> 
> To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
> to be paird with rcu_assign_pointer of rq->curr.  Which provides the
> memory barrier necessary to order assignments to the task_struct
> and the assignment to rq->curr.
> 
> Unfortunately the assignment of rq->curr in __schedule is a hot path,
> and it has already been show that additional barriers in that code
> will reduce the performance of the scheduler.  So I will attempt to
> describe below why you can effectively have ordinary rcu semantics
> without any additional barriers.
> 
> The assignment of rq->curr in init_idle is a slow path called once
> per cpu and that can use rcu_assign_pointer() without any concerns.
> 
> As I write this there are effectively two users of rcu_dereference on
> rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
> that only looks at "->mm" after the rcu_dereference.  Then there is
> task_numa_compare() in kernel/sched/fair.c.  My best reading of the
> code shows that task_numa_compare only access: "->flags",
> "->cpus_ptr", "->numa_group", "->numa_faults[]",
> "->total_numa_faults", and "->se.cfs_rq".
> 
> The code in __schedule() essentially does:
> 	rq_lock(...);
> 	smp_mb__after_spinlock();
> 
> 	next = pick_next_task(...);
> 	rq->curr = next;
> 
> 	context_switch(prev, next);
> 
> At the start of the function the rq_lock/smp_mb__after_spinlock
> pair provides a full memory barrier.  Further there is a full memory barrier
> in context_switch().
> 
> This means that any task that has already run and modified itself (the
> common case) has already seen two memory barriers before __schedule()
> runs and begins executing.  A task that modifies itself then sees a
> third full memory barrier pair with the rq_lock();
> 
> For a brand new task that is enqueued with wake_up_new_task() there
> are the memory barriers present from the taking and release the
> pi_lock and the rq_lock as the processes is enqueued as well as the
> full memory barrier at the start of __schedule() assuming __schedule()
> happens on the same cpu.
> 
> This means that by the time we reach the assignment of rq->curr
> except for values on the task struct modified in pick_next_task
> the code has the same guarantees as if it used rcu_assign_pointer.
> 
> Reading through all of the implementations of pick_next_task it
> appears pick_next_task is limited to modifying the task_struct fields
> "->se", "->rt", "->dl".  These fields are the sched_entity structures
> of the varies schedulers.

s/varies/various/ for whatever that is worth.

> Further "->se.cfs_rq" is only changed in cgroup attach/move operations
> initialized by userspace.
> 
> Unless I have missed something this means that in practice that the
> users of "rcu_dereerence(rq->curr)" get normal rcu semantics of
> rcu_dereference() for the fields the care about, despite the
> assignment of rq->curr in __schedule() ot using rcu_assign_pointer.

The reasoning makes sense.  I have not double-checked all the code.

> Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/sched/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 69015b7c28da..668262806942 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>  
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> -		rq->curr = next;
> +		/*
> +		 * RCU users of rcu_dereference(rq->curr) may not see
> +		 * changes to task_struct made by pick_next_task().
> +		 */
> +		RCU_INIT_POINTER(rq->curr, next);
>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating
> @@ -5863,7 +5867,8 @@ void init_idle(struct task_struct *idle, int cpu)
>  	__set_task_cpu(idle, cpu);
>  	rcu_read_unlock();
>  
> -	rq->curr = rq->idle = idle;
> +	rq->idle = idle;
> +	rcu_assign_pointer(rq->curr, idle);
>  	idle->on_rq = TASK_ON_RQ_QUEUED;
>  #ifdef CONFIG_SMP
>  	idle->on_cpu = 1;
> -- 
> 2.21.0.dirty

So this looks good in and of itself, but I still do not see what prevents
the unfortunate sequence of events called out in my previous email.
On the other hand, if ->rcu and ->rcu_users were not allocated on top
of each other by a union, I would be happy to provide a Reviewed-by.

And I am fundamentally distrusting of a refcount_dec_and_test() that
is immediately followed by code that clobbers the now-zero value.
Yes, this does have valid use cases, but it has a lot more invalid
use cases.  The valid use cases have excluded all increments somehow
else, so that the refcount_dec_and_test() call's only job is to
synchronize between concurrent calls to put_task_struct_rcu_user().
But I am not seeing the "excluded all increments somehow".

So, what am I missing here?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code
  2019-09-15 14:32                       ` Paul E. McKenney
@ 2019-09-15 17:07                         ` Linus Torvalds
  2019-09-15 18:47                           ` Paul E. McKenney
  0 siblings, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2019-09-15 17:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric W. Biederman, Peter Zijlstra, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Sun, Sep 15, 2019 at 7:32 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> First, what am I looking for?
>
> I am looking for something that prevents the following:
>
> o       Task A acquires a reference to Task B's task_struct while
>         protected only by RCU, and is just about to increment ->rcu_users
>         when it is delayed.  Maybe its vCPU is preempted or something.

Where exactly do you see "increment ->rcu_users"

There are _no_ users that can increment rcu_users. The thing is
initialized to '2' when the process is created, and nobody ever
increments it. EVER.

It's only ever decremented, and when it hits zero we know that both
users are gone, and we start the rcu-delayed free.

            Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-15 14:41                       ` Paul E. McKenney
@ 2019-09-15 17:59                         ` Eric W. Biederman
  2019-09-15 18:25                           ` Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-15 17:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

"Paul E. McKenney" <paulmck@kernel.org> writes:

> So this looks good in and of itself, but I still do not see what prevents
> the unfortunate sequence of events called out in my previous email.
> On the other hand, if ->rcu and ->rcu_users were not allocated on top
> of each other by a union, I would be happy to provide a Reviewed-by.
>
> And I am fundamentally distrusting of a refcount_dec_and_test() that
> is immediately followed by code that clobbers the now-zero value.
> Yes, this does have valid use cases, but it has a lot more invalid
> use cases.  The valid use cases have excluded all increments somehow
> else, so that the refcount_dec_and_test() call's only job is to
> synchronize between concurrent calls to put_task_struct_rcu_user().
> But I am not seeing the "excluded all increments somehow".
>
> So, what am I missing here?

Probably only that the users of the task_struct in this sense are now
quite mature.

The two data structures that allow rcu access to the task_struct are
the pid hash and the runqueue.    The practical problem is that they
have two very different lifetimes.  So we need some kind of logic that
let's us know when they are both done.  A recount does that job very
well.

Placing the recount on the same storage as the unused (at that point)
rcu_head removes the need to be clever in other ways to avoid bloating
the task_struct.

If you really want a reference to the task_struct from rcu context you
can just use get_task_struct.  Because until the grace period completes
it is guaranteed that the task_struct has a positive count.

Right now I can't imagine a use case for wanting to increase rcu_users
anywhere or to decrease rcu_users except where we do.  If there is such
a case most likely it will increase the reference count at
initialization time.

If anyone validly wants to increment rcu_users from an rcu critical
section we can move it out of the union at that time.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-15 17:59                         ` Eric W. Biederman
@ 2019-09-15 18:25                           ` Eric W. Biederman
  2019-09-15 18:48                             ` Paul E. McKenney
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-15 18:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

ebiederm@xmission.com (Eric W. Biederman) writes:

> "Paul E. McKenney" <paulmck@kernel.org> writes:
>
>> So this looks good in and of itself, but I still do not see what prevents
>> the unfortunate sequence of events called out in my previous email.
>> On the other hand, if ->rcu and ->rcu_users were not allocated on top
>> of each other by a union, I would be happy to provide a Reviewed-by.
>>
>> And I am fundamentally distrusting of a refcount_dec_and_test() that
>> is immediately followed by code that clobbers the now-zero value.
>> Yes, this does have valid use cases, but it has a lot more invalid
>> use cases.  The valid use cases have excluded all increments somehow
>> else, so that the refcount_dec_and_test() call's only job is to
>> synchronize between concurrent calls to put_task_struct_rcu_user().
>> But I am not seeing the "excluded all increments somehow".
>>
>> So, what am I missing here?
>
> Probably only that the users of the task_struct in this sense are now
> quite mature.
>
> The two data structures that allow rcu access to the task_struct are
> the pid hash and the runqueue.    The practical problem is that they
> have two very different lifetimes.  So we need some kind of logic that
> let's us know when they are both done.  A recount does that job very
> well.
>
> Placing the recount on the same storage as the unused (at that point)
> rcu_head removes the need to be clever in other ways to avoid bloating
> the task_struct.
>
> If you really want a reference to the task_struct from rcu context you
> can just use get_task_struct.  Because until the grace period completes
> it is guaranteed that the task_struct has a positive count.
>
> Right now I can't imagine a use case for wanting to increase rcu_users
> anywhere or to decrease rcu_users except where we do.  If there is such
> a case most likely it will increase the reference count at
> initialization time.
>
> If anyone validly wants to increment rcu_users from an rcu critical
> section we can move it out of the union at that time.

Paul were you worrying about incrementing rcu_users because Frederic
Weisbecker brought the concept up earlier in the review?

It was his confusion that the point of rcu_users was so that it could
be incremented from an rcu critical section.  That definitely is not
the point of rcu_users.

If you were wondering about someone messing with rcu_users from an rcu
critical region independently that does suggest the code could use
a "comment saying don't do that!"  Multiple people getting confused
about the purpose of a reference count independently does suggest there
is a human factor problem in there somewhere.

Eric


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code
  2019-09-15 17:07                         ` Linus Torvalds
@ 2019-09-15 18:47                           ` Paul E. McKenney
  0 siblings, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Peter Zijlstra, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso

On Sun, Sep 15, 2019 at 10:07:24AM -0700, Linus Torvalds wrote:
> On Sun, Sep 15, 2019 at 7:32 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > First, what am I looking for?
> >
> > I am looking for something that prevents the following:
> >
> > o       Task A acquires a reference to Task B's task_struct while
> >         protected only by RCU, and is just about to increment ->rcu_users
> >         when it is delayed.  Maybe its vCPU is preempted or something.
> 
> Where exactly do you see "increment ->rcu_users"
> 
> There are _no_ users that can increment rcu_users. The thing is
> initialized to '2' when the process is created, and nobody ever
> increments it. EVER.
> 
> It's only ever decremented, and when it hits zero we know that both
> users are gone, and we start the rcu-delayed free.

Color me blind and apologies for the noise!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-15 18:25                           ` Eric W. Biederman
@ 2019-09-15 18:48                             ` Paul E. McKenney
  0 siblings, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2019-09-15 18:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Linus Torvalds

On Sun, Sep 15, 2019 at 01:25:02PM -0500, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > "Paul E. McKenney" <paulmck@kernel.org> writes:
> >
> >> So this looks good in and of itself, but I still do not see what prevents
> >> the unfortunate sequence of events called out in my previous email.
> >> On the other hand, if ->rcu and ->rcu_users were not allocated on top
> >> of each other by a union, I would be happy to provide a Reviewed-by.
> >>
> >> And I am fundamentally distrusting of a refcount_dec_and_test() that
> >> is immediately followed by code that clobbers the now-zero value.
> >> Yes, this does have valid use cases, but it has a lot more invalid
> >> use cases.  The valid use cases have excluded all increments somehow
> >> else, so that the refcount_dec_and_test() call's only job is to
> >> synchronize between concurrent calls to put_task_struct_rcu_user().
> >> But I am not seeing the "excluded all increments somehow".
> >>
> >> So, what am I missing here?
> >
> > Probably only that the users of the task_struct in this sense are now
> > quite mature.
> >
> > The two data structures that allow rcu access to the task_struct are
> > the pid hash and the runqueue.    The practical problem is that they
> > have two very different lifetimes.  So we need some kind of logic that
> > let's us know when they are both done.  A recount does that job very
> > well.
> >
> > Placing the recount on the same storage as the unused (at that point)
> > rcu_head removes the need to be clever in other ways to avoid bloating
> > the task_struct.
> >
> > If you really want a reference to the task_struct from rcu context you
> > can just use get_task_struct.  Because until the grace period completes
> > it is guaranteed that the task_struct has a positive count.
> >
> > Right now I can't imagine a use case for wanting to increase rcu_users
> > anywhere or to decrease rcu_users except where we do.  If there is such
> > a case most likely it will increase the reference count at
> > initialization time.
> >
> > If anyone validly wants to increment rcu_users from an rcu critical
> > section we can move it out of the union at that time.
> 
> Paul were you worrying about incrementing rcu_users because Frederic
> Weisbecker brought the concept up earlier in the review?
> 
> It was his confusion that the point of rcu_users was so that it could
> be incremented from an rcu critical section.  That definitely is not
> the point of rcu_users.
> 
> If you were wondering about someone messing with rcu_users from an rcu
> critical region independently that does suggest the code could use
> a "comment saying don't do that!"  Multiple people getting confused
> about the purpose of a reference count independently does suggest there
> is a human factor problem in there somewhere.

I would welcome a "this is never incremented" comment or some such.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected
  2019-09-14 17:43                     ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
@ 2019-09-17 17:38                       ` Eric W. Biederman
  2019-09-25  7:51                         ` Peter Zijlstra
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-17 17:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Sep 14, 2019 at 5:30 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> I have reworked these patches one more time to make it clear that the
>> first 3 patches only fix task_struct so that it experiences a rcu grace
>> period after it leaves the runqueue for the last time.
>
> I remain a fan of these patches, and the added comment on the last one
> is I think a sufficient clarification of the issue.
>
> But it's patch 3 that makes me go "yeah, this is the right approach",
> because it just removes subtle code in favor of something that is
> understandable.
>
> Yes, most of the lines removed may be comments, and so it doesn't
> actually remove a lot of _code_, but I think the comments are a result
> of just how subtle and fragile our current approach is, and the new
> model not needing them as much is I think a real issue (rather than
> just Eric being less verbose in the new comments and removing lines of
> code that way).

In fact the comments I add are orthogonal to the comments I removed.
My last patch stands on it's own.  It can be applied with or without the
rest.   I just needed to know which of the ordinary rcu guarantees were
or were not present in the code.

> Can anybody see anything wrong with the series? Because I'd love to
> have it for 5.4,

Peter,

I am more than happy for these to come through your tree.  However
if this is one thing to many I will be happy to send Linus a pull
request myself early next week.

Eric

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-14 12:35                     ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
  2019-09-15 14:41                       ` Paul E. McKenney
@ 2019-09-20 23:02                       ` Frederic Weisbecker
  2019-09-26  1:49                         ` Eric W. Biederman
  1 sibling, 1 reply; 75+ messages in thread
From: Frederic Weisbecker @ 2019-09-20 23:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney, Linus Torvalds

On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
> 
> The current task on the runqueue is currently read with rcu_dereference().
> 
> To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
> to be paird with rcu_assign_pointer of rq->curr.  Which provides the
> memory barrier necessary to order assignments to the task_struct
> and the assignment to rq->curr.
> 
> Unfortunately the assignment of rq->curr in __schedule is a hot path,
> and it has already been show that additional barriers in that code
> will reduce the performance of the scheduler.  So I will attempt to
> describe below why you can effectively have ordinary rcu semantics
> without any additional barriers.
> 
> The assignment of rq->curr in init_idle is a slow path called once
> per cpu and that can use rcu_assign_pointer() without any concerns.
> 
> As I write this there are effectively two users of rcu_dereference on
> rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
> that only looks at "->mm" after the rcu_dereference.  Then there is
> task_numa_compare() in kernel/sched/fair.c.  My best reading of the
> code shows that task_numa_compare only access: "->flags",
> "->cpus_ptr", "->numa_group", "->numa_faults[]",
> "->total_numa_faults", and "->se.cfs_rq".
> 
> The code in __schedule() essentially does:
> 	rq_lock(...);
> 	smp_mb__after_spinlock();
> 
> 	next = pick_next_task(...);
> 	rq->curr = next;
> 
> 	context_switch(prev, next);
> 
> At the start of the function the rq_lock/smp_mb__after_spinlock
> pair provides a full memory barrier.  Further there is a full memory barrier
> in context_switch().
> 
> This means that any task that has already run and modified itself (the
> common case) has already seen two memory barriers before __schedule()
> runs and begins executing.  A task that modifies itself then sees a
> third full memory barrier pair with the rq_lock();
> 
> For a brand new task that is enqueued with wake_up_new_task() there
> are the memory barriers present from the taking and release the
> pi_lock and the rq_lock as the processes is enqueued as well as the
> full memory barrier at the start of __schedule() assuming __schedule()
> happens on the same cpu.
> 
> This means that by the time we reach the assignment of rq->curr
> except for values on the task struct modified in pick_next_task
> the code has the same guarantees as if it used rcu_assign_pointer.
> 
> Reading through all of the implementations of pick_next_task it
> appears pick_next_task is limited to modifying the task_struct fields
> "->se", "->rt", "->dl".  These fields are the sched_entity structures
> of the varies schedulers.
> 
> Further "->se.cfs_rq" is only changed in cgroup attach/move operations
> initialized by userspace.
> 
> Unless I have missed something this means that in practice that the
> users of "rcu_dereerence(rq->curr)" get normal rcu semantics of
> rcu_dereference() for the fields the care about, despite the
> assignment of rq->curr in __schedule() ot using rcu_assign_pointer.
> 
> Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/sched/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 69015b7c28da..668262806942 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>  
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> -		rq->curr = next;
> +		/*
> +		 * RCU users of rcu_dereference(rq->curr) may not see
> +		 * changes to task_struct made by pick_next_task().
> +		 */
> +		RCU_INIT_POINTER(rq->curr, next);

It would be nice to have more explanations in the comments as to why we
don't use rcu_assign_pointer() here (the very fast-path issue) and why
it is expected to be fine (the rq_lock() + post spinlock barrier) under
which condition. Some short summary of the changelog. Because that line
implies way too many subtleties.

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
  2019-09-09 12:22                                         ` Eric W. Biederman
@ 2019-09-25  7:36                                           ` Peter Zijlstra
  0 siblings, 0 replies; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-25  7:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul E. McKenney, Linus Torvalds, Oleg Nesterov,
	Russell King - ARM Linux admin, Chris Metcalf, Christoph Lameter,
	Kirill Tkhai, Mike Galbraith, Thomas Gleixner, Ingo Molnar,
	Linux List Kernel Mailing, Davidlohr Bueso, mpe

On Mon, Sep 09, 2019 at 07:22:15AM -0500, Eric W. Biederman wrote:

> Let me see if I can explain my confusion in terms of task_numa_compare.
> 
> The function task_numa_comare now does:
> 
> 	rcu_read_lock();
> 	cur = rcu_dereference(dst_rq->curr);
> 
> Then it proceeds to examine a few fields of cur.
> 
> 	cur->flags;
>         cur->cpus_ptr;
>         cur->numa_group;
>         cur->numa_faults[];
>         cur->total_numa_faults;
>         cur->se.cfs_rq;
> 
> My concern here is the ordering between setting rq->curr and and setting
> those other fields.  If we read values of those fields that were not
> current when we set cur than I think task_numa_compare would give an
> incorrect answer.

That code is explicitly without serialization. One memory barrier isn't
going to make any difference what so ever.

Specifically, those numbers will change _after_ we make it current, not
before.

> From what I can see pick_next_task_fair does not mess with any of
> the fields that task_numa_compare uses.  So the existing memory barrier
> should be more than sufficient for that case.

There should not be, but even if there is, load-balancing in general
(very much including the NUMA balancing) is completely unserialized and
prone to reading stale data at all times.

This is on purpose; it minimizes the interference of load-balancing, and
if we make a 'wrong' choice, the next pass can fix it up.

> So I think I just need to write up a patch 4/3 that replaces the my
> "rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
> next)".  And includes a great big comment to that effect.
> 
> 
> 
> Now maybe I am wrong.  Maybe we can afford to see data that was changed
> before the task became rq->curr in task_numa_compare.  But I don't think
> so.  I really think it is that full memory barrier just a bit earlier
> in __schedule that is keeping us safe.
> 
> Am I wrong?

Not in so far that we now both seem to agree on using
RCU_INIT_POINTER(); but we're still disagreeing on why.

My argument is that since we can already obtain any task_struct pointer
through find_task_by_vpid() and friends, any access to its individual
members must already have serialzation rules (or explicitly none, as for
load-balancing).

I'm viewing this as just another variant of find_task_by_vpid(), except
we index by CPU instead of PID. There can be changes to task_struct
between to invocations of find_task_by_vpid(), any user will have to
somehow deal with that. This is no different.

Take for instance p->cpus_ptr, it has very specific serialzation rules
(which that NUMA balancer thing blatantly ignores), as do a lot of other
task_struct fields.

The memory barrier in question isn't going to help with any of that.

Specifically, if we care about the consistency of the numbers changed by
pick_next_task(), we must acquire rq->lock (of course, once we do that,
we can immediately use rq->curr without further RCU use).

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected
  2019-09-17 17:38                       ` Eric W. Biederman
@ 2019-09-25  7:51                         ` Peter Zijlstra
  2019-09-26  1:11                           ` Eric W. Biederman
  0 siblings, 1 reply; 75+ messages in thread
From: Peter Zijlstra @ 2019-09-25  7:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

On Tue, Sep 17, 2019 at 12:38:04PM -0500, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:

> > Can anybody see anything wrong with the series? Because I'd love to
> > have it for 5.4,
> 
> Peter,
> 
> I am more than happy for these to come through your tree.  However
> if this is one thing to many I will be happy to send Linus a pull
> request myself early next week.

Yeah, sorry for being late, I fell ill after LPC and am only now
getting back to things.

I see nothing wrong with these patches; if they've not been picked up
(and I'm not seeing them in Linus' tree yet) I'll pick them up now and
munge them together with Mathieu's membarrier patches and get them to
Linus in a few days.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected
  2019-09-25  7:51                         ` Peter Zijlstra
@ 2019-09-26  1:11                           ` Eric W. Biederman
  0 siblings, 0 replies; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-26  1:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 17, 2019 at 12:38:04PM -0500, Eric W. Biederman wrote:
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> > Can anybody see anything wrong with the series? Because I'd love to
>> > have it for 5.4,
>> 
>> Peter,
>> 
>> I am more than happy for these to come through your tree.  However
>> if this is one thing to many I will be happy to send Linus a pull
>> request myself early next week.
>
> Yeah, sorry for being late, I fell ill after LPC and am only now
> getting back to things.
>
> I see nothing wrong with these patches; if they've not been picked up
> (and I'm not seeing them in Linus' tree yet) I'll pick them up now and
> munge them together with Mathieu's membarrier patches and get them to
> Linus in a few days.

Sounds good.  I had some distractions so I wasn't able to get this yet.
So I am more than happy for you to pick these up.  This is better coming
through your tree in any event.

Eric


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-20 23:02                       ` Frederic Weisbecker
@ 2019-09-26  1:49                         ` Eric W. Biederman
  2019-09-26 12:42                           ` Frederic Weisbecker
  0 siblings, 1 reply; 75+ messages in thread
From: Eric W. Biederman @ 2019-09-26  1:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney, Linus Torvalds

Frederic Weisbecker <frederic@kernel.org> writes:

> On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 69015b7c28da..668262806942 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>>  
>>  	if (likely(prev != next)) {
>>  		rq->nr_switches++;
>> -		rq->curr = next;
>> +		/*
>> +		 * RCU users of rcu_dereference(rq->curr) may not see
>> +		 * changes to task_struct made by pick_next_task().
>> +		 */
>> +		RCU_INIT_POINTER(rq->curr, next);
>
> It would be nice to have more explanations in the comments as to why we
> don't use rcu_assign_pointer() here (the very fast-path issue) and why
> it is expected to be fine (the rq_lock() + post spinlock barrier) under
> which condition. Some short summary of the changelog. Because that line
> implies way too many subtleties.

Crucially that line documents the standard rules don't apply,
and it documents which guarantees a new user of the code can probably
count on.  I say probably because the comment may go stale before I new
user of rcu appears.  I have my hopes things are simple enough at that
location that if the comment needs to be changed it can be.

If it is not obvious from reading the code that calls
"task_rcu_dereference(rq->curr)" now "rcu_dereference(rq->curr)" why we
don't need the guarantees from rcu_assign_pointet() my sense is that
it should be those locations that document what guarantees they need.

Of the several different locations that use this my sense is that they
all have different requirements.

- The rcuwait code just needs the lifetime change as it never dereferences
  rq->curr.

- The membarrier code just looks at rq->curr->mm for a moment so it
  hardly needs anything.  I suspect we might be able to make the rcu
  critical section smaller in that code.

- I don't know the code in task_numa_compare() well enough even to make an
  educated guess.  Peter asserts (if I read his reply correctly) it is
  all just a heuristic so stale values should not matter.

  My reading of the code strongly suggests that we have the ordinary
  rcu_assign_pointer() guarantees there.  The few fields that are not
  covered by the ordinary guarantees do not appear to be read.  So even
  if Peter is wrong RCU_INIT_POINTER appears safe to me.

  I also don't think we will have confusion with people reading the
  code and expecting ordinary rcu_dereference semantics().

I can't possibly see putting the above several lines in a meaningful
comment where RCU_INIT_POINTER is called.  Especially in a comment
that will survive changes to any of those functions.  My experience
is comments that try that are almost always overlooked when someone
updates the code.

I barely found all of the comments that depended upon the details of
task_rcu_dereference and updated them in my patchset, when I removed
the need for task_rcu_dereference.

I don't think it would be wise to put a comment that is a wall of words
in the middle of __schedule().  I think it will become inaccurate with
time and because it is a lot of words I think it will be ignored.


As for the __schedule: It is the heart of the scheduler.  It is
performance code.  It is clever code.  It is likely to stay that way
because it is the scheduler.  There are good technical reasons for the
code is the way it is, and anyone changing the scheduler in a
responsible manner that includes benchmarking should find those
technical reasons quickly enough.


So I think a quick word to the wise is enough.  Comments are certainly
not enough to prevent people being careless and making foolish mistakes.

Eric


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
  2019-09-26  1:49                         ` Eric W. Biederman
@ 2019-09-26 12:42                           ` Frederic Weisbecker
  0 siblings, 0 replies; 75+ messages in thread
From: Frederic Weisbecker @ 2019-09-26 12:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Peter Zijlstra, Oleg Nesterov, Russell King - ARM Linux admin,
	Chris Metcalf, Christoph Lameter, Kirill Tkhai, Mike Galbraith,
	Thomas Gleixner, Ingo Molnar, Linux List Kernel Mailing,
	Davidlohr Bueso, Paul E. McKenney, Linus Torvalds

On Wed, Sep 25, 2019 at 08:49:17PM -0500, Eric W. Biederman wrote:
> Frederic Weisbecker <frederic@kernel.org> writes:
> 
> > On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 69015b7c28da..668262806942 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
> >>  
> >>  	if (likely(prev != next)) {
> >>  		rq->nr_switches++;
> >> -		rq->curr = next;
> >> +		/*
> >> +		 * RCU users of rcu_dereference(rq->curr) may not see
> >> +		 * changes to task_struct made by pick_next_task().
> >> +		 */
> >> +		RCU_INIT_POINTER(rq->curr, next);
> >
> > It would be nice to have more explanations in the comments as to why we
> > don't use rcu_assign_pointer() here (the very fast-path issue) and why
> > it is expected to be fine (the rq_lock() + post spinlock barrier) under
> > which condition. Some short summary of the changelog. Because that line
> > implies way too many subtleties.
> 
> Crucially that line documents the standard rules don't apply,
> and it documents which guarantees a new user of the code can probably
> count on.  I say probably because the comment may go stale before I new
> user of rcu appears.  I have my hopes things are simple enough at that
> location that if the comment needs to be changed it can be.

At least I can't understand that line without referring to the changelog.

> 
> If it is not obvious from reading the code that calls
> "task_rcu_dereference(rq->curr)" now "rcu_dereference(rq->curr)" why we
> don't need the guarantees from rcu_assign_pointet() my sense is that
> it should be those locations that document what guarantees they need.

Both sides should probably have comments.

> 
> Of the several different locations that use this my sense is that they
> all have different requirements.
> 
> - The rcuwait code just needs the lifetime change as it never dereferences
>   rq->curr.
> 
> - The membarrier code just looks at rq->curr->mm for a moment so it
>   hardly needs anything.  I suspect we might be able to make the rcu
>   critical section smaller in that code.
> 
> - I don't know the code in task_numa_compare() well enough even to make an
>   educated guess.  Peter asserts (if I read his reply correctly) it is
>   all just a heuristic so stale values should not matter.
> 
>   My reading of the code strongly suggests that we have the ordinary
>   rcu_assign_pointer() guarantees there.  The few fields that are not
>   covered by the ordinary guarantees do not appear to be read.  So even
>   if Peter is wrong RCU_INIT_POINTER appears safe to me.
> 
>   I also don't think we will have confusion with people reading the
>   code and expecting ordinary rcu_dereference semantics().
> 
> I can't possibly see putting the above several lines in a meaningful
> comment where RCU_INIT_POINTER is called.  Especially in a comment
> that will survive changes to any of those functions.  My experience
> is comments that try that are almost always overlooked when someone
> updates the code.

That's ok, it's the nature of comments, they get out of date. But at
least they provide a link to history so we can rewind to find the initial
how and why for a tricky line.

I bet nobody wants git blame as a base for their text editors.

> 
> I barely found all of the comments that depended upon the details of
> task_rcu_dereference and updated them in my patchset, when I removed
> the need for task_rcu_dereference.
> 
> I don't think it would be wise to put a comment that is a wall of words
> in the middle of __schedule().  I think it will become inaccurate with
> time and because it is a lot of words I think it will be ignored.
> 
> 
> As for the __schedule: It is the heart of the scheduler.  It is
> performance code.  It is clever code.  It is likely to stay that way
> because it is the scheduler.  There are good technical reasons for the
> code is the way it is, and anyone changing the scheduler in a
> responsible manner that includes benchmarking should find those
> technical reasons quickly enough.
> 
> 
> So I think a quick word to the wise is enough.  Comments are certainly
> not enough to prevent people being careless and making foolish mistakes.

Well it's not even about preventing anything, it's only about making
a line of cryptic code understandable for reviewers. No need for thorough
details, indeed anyone making use of that code or modifying it has to dive
into the deep guts anyway.

So how about that:

/*
 * Avoid rcu_dereference() in this very fast path.
 * Instead rely on full barrier implied by rq_lock() + smp_mb__after_spinlock().
 * Warning: In-between writes may be missed by readers (eg: pick_next_task())
 */

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  2019-09-14 12:34                     ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
  2019-09-15 14:32                       ` Paul E. McKenney
@ 2019-09-27  8:10                       ` tip-bot2 for Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: tip-bot2 for Eric W. Biederman @ 2019-09-27  8:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric W. Biederman, Peter Zijlstra (Intel),
	Chris Metcalf, Christoph Lameter, Davidlohr Bueso, Kirill Tkhai,
	Linus Torvalds, Mike Galbraith, Oleg Nesterov, Paul E. McKenney,
	Russell King - ARM Linux admin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     154abafc68bfb7c2ef2ad5308a3b2de8968c3f61
Gitweb:        https://git.kernel.org/tip/154abafc68bfb7c2ef2ad5308a3b2de8968c3f61
Author:        Eric W. Biederman <ebiederm@xmission.com>
AuthorDate:    Sat, 14 Sep 2019 07:34:30 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 25 Sep 2019 17:42:29 +02:00

tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code

Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/rcuwait.h    | 20 ++---------
 include/linux/sched/task.h |  1 +-
 kernel/exit.c              | 67 +-------------------------------------
 kernel/sched/fair.c        |  2 +-
 kernel/sched/membarrier.c  |  4 +-
 5 files changed, 7 insertions(+), 87 deletions(-)

diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h
index 563290f..75c97e4 100644
--- a/include/linux/rcuwait.h
+++ b/include/linux/rcuwait.h
@@ -6,16 +6,11 @@
 
 /*
  * rcuwait provides a way of blocking and waking up a single
- * task in an rcu-safe manner; where it is forbidden to use
- * after exit_notify(). task_struct is not properly rcu protected,
- * unless dealing with rcu-aware lists, ie: find_task_by_*().
+ * task in an rcu-safe manner.
  *
- * Alternatively we have task_rcu_dereference(), but the return
- * semantics have different implications which would break the
- * wakeup side. The only time @task is non-nil is when a user is
- * blocked (or checking if it needs to) on a condition, and reset
- * as soon as we know that the condition has succeeded and are
- * awoken.
+ * The only time @task is non-nil is when a user is blocked (or
+ * checking if it needs to) on a condition, and reset as soon as we
+ * know that the condition has succeeded and are awoken.
  */
 struct rcuwait {
 	struct task_struct __rcu *task;
@@ -37,13 +32,6 @@ extern void rcuwait_wake_up(struct rcuwait *w);
  */
 #define rcuwait_wait_event(w, condition)				\
 ({									\
-	/*								\
-	 * Complain if we are called after do_exit()/exit_notify(),     \
-	 * as we cannot rely on the rcu critical region for the		\
-	 * wakeup side.							\
-	 */                                                             \
-	WARN_ON(current->exit_state);                                   \
-									\
 	rcu_assign_pointer((w)->task, current);				\
 	for (;;) {							\
 		/*							\
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 153a683..4b1c3b6 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -119,7 +119,6 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
 void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/kernel/exit.c b/kernel/exit.c
index 3bcaec2..a46a50d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -234,69 +234,6 @@ repeat:
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 *    If we return NULL - we can pretend that we actually noticed that
-	 *    *ptask was updated when the previous task has exited. Or pretend
-	 *    that probe_slab_address(&sighand) reads NULL.
-	 *
-	 *    If we return the new task (because sighand is not NULL for any
-	 *    reason) - this is fine too. This (new) task can't go away before
-	 *    another gp pass.
-	 *
-	 *    And note: We could even eliminate the false positive if re-read
-	 *    task->sighand once again to avoid the falsely NULL. But this case
-	 *    is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
@@ -316,10 +253,6 @@ void rcuwait_wake_up(struct rcuwait *w)
 	 */
 	smp_mb(); /* (B) */
 
-	/*
-	 * Avoid using task_rcu_dereference() magic as long as we are careful,
-	 * see comment in rcuwait_wait_event() regarding ->exit_state.
-	 */
 	task = rcu_dereference(w->task);
 	if (task)
 		wake_up_process(task);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3101c66..5bc2399 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1602,7 +1602,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d758..b14250a 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr
  2019-09-03 20:06                                 ` Peter Zijlstra
  2019-09-03 21:32                                   ` Paul E. McKenney
@ 2019-09-27  8:10                                   ` tip-bot2 for Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: tip-bot2 for Eric W. Biederman @ 2019-09-27  8:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric W. Biederman, Peter Zijlstra (Intel),
	Chris Metcalf, Christoph Lameter, Davidlohr Bueso, Kirill Tkhai,
	Linus Torvalds, Mike Galbraith, Oleg Nesterov, Paul E. McKenney,
	Russell King - ARM Linux admin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     5311a98fef7d0dc2e8040ae0e18f5568d6d1dd5a
Gitweb:        https://git.kernel.org/tip/5311a98fef7d0dc2e8040ae0e18f5568d6d1dd5a
Author:        Eric W. Biederman <ebiederm@xmission.com>
AuthorDate:    Sat, 14 Sep 2019 07:35:02 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 25 Sep 2019 17:42:29 +02:00

tasks, sched/core: RCUify the assignment of rq->curr

The current task on the runqueue is currently read with rcu_dereference().

To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
to be paired with rcu_assign_pointer() of rq->curr.  Which provides the
memory barrier necessary to order assignments to the task_struct
and the assignment to rq->curr.

Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been show that additional barriers in that code
will reduce the performance of the scheduler.  So I will attempt to
describe below why you can effectively have ordinary RCU semantics
without any additional barriers.

The assignment of rq->curr in init_idle is a slow path called once
per cpu and that can use rcu_assign_pointer() without any concerns.

As I write this there are effectively two users of rcu_dereference() on
rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
that only looks at "->mm" after the rcu_dereference().  Then there is
task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only access: "->flags",
"->cpus_ptr", "->numa_group", "->numa_faults[]",
"->total_numa_faults", and "->se.cfs_rq".

The code in __schedule() essentially does:
	rq_lock(...);
	smp_mb__after_spinlock();

	next = pick_next_task(...);
	rq->curr = next;

	context_switch(prev, next);

At the start of the function the rq_lock/smp_mb__after_spinlock
pair provides a full memory barrier.  Further there is a full memory barrier
in context_switch().

This means that any task that has already run and modified itself (the
common case) has already seen two memory barriers before __schedule()
runs and begins executing.  A task that modifies itself then sees a
third full memory barrier pair with the rq_lock();

For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers present from the taking and release the
pi_lock and the rq_lock as the processes is enqueued as well as the
full memory barrier at the start of __schedule() assuming __schedule()
happens on the same cpu.

This means that by the time we reach the assignment of rq->curr
except for values on the task struct modified in pick_next_task
the code has the same guarantees as if it used rcu_assign_pointer().

Reading through all of the implementations of pick_next_task it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", "->dl".  These fields are the sched_entity structures
of the varies schedulers.

Further "->se.cfs_rq" is only changed in cgroup attach/move operations
initialized by userspace.

Unless I have missed something this means that in practice that the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields the care about, despite the
assignment of rq->curr in __schedule() ot using rcu_assign_pointer.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c |  9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e5fefb..84c7116 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4033,7 +4033,11 @@ static void __sched notrace __schedule(bool preempt)
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
-		rq->curr = next;
+		/*
+		 * RCU users of rcu_dereference(rq->curr) may not see
+		 * changes to task_struct made by pick_next_task().
+		 */
+		RCU_INIT_POINTER(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6060,7 +6064,8 @@ void init_idle(struct task_struct *idle, int cpu)
 	__set_task_cpu(idle, cpu);
 	rcu_read_unlock();
 
-	rq->curr = rq->idle = idle;
+	rq->idle = idle;
+	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [tip: sched/urgent] tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  2019-09-14 12:33                     ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
  2019-09-15 14:07                       ` Paul E. McKenney
@ 2019-09-27  8:10                       ` tip-bot2 for Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: tip-bot2 for Eric W. Biederman @ 2019-09-27  8:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric W. Biederman, Peter Zijlstra (Intel),
	Chris Metcalf, Christoph Lameter, Davidlohr Bueso, Kirill Tkhai,
	Linus Torvalds, Mike Galbraith, Paul E. McKenney,
	Russell King - ARM Linux admin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     0ff7b2cfbae36ebcd216c6a5ad7f8534eebeaee2
Gitweb:        https://git.kernel.org/tip/0ff7b2cfbae36ebcd216c6a5ad7f8534eebeaee2
Author:        Eric W. Biederman <ebiederm@xmission.com>
AuthorDate:    Sat, 14 Sep 2019 07:33:58 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 25 Sep 2019 17:42:29 +02:00

tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue

In the ordinary case today the RCU grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that the there is a RCU lifetime after the task
leaves the runqueue.

Besides the change in the start of the RCU grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the RCU grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/fork.c       | 11 +++++++----
 kernel/sched/core.c |  2 +-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 7eefe33..d6e5525 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,10 +902,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/* One for the user space visible state that goes away when reaped. */
-	refcount_set(&tsk->rcu_users, 1);
-	/* One for the rcu users, and one for the scheduler */
-	refcount_set(&tsk->usage, 2);
+	/*
+	 * One for the user space visible state that goes away when reaped.
+	 * One for the scheduler.
+	 */
+	refcount_set(&tsk->rcu_users, 2);
+	/* One for the rcu users */
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06961b9..5e5fefb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3254,7 +3254,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}
 
 	tick_nohz_task_switch();

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [tip: sched/urgent] tasks: Add a count of task RCU users
  2019-09-14 12:33                     ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
  2019-09-15 13:54                       ` Paul E. McKenney
@ 2019-09-27  8:10                       ` tip-bot2 for Eric W. Biederman
  1 sibling, 0 replies; 75+ messages in thread
From: tip-bot2 for Eric W. Biederman @ 2019-09-27  8:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Eric W. Biederman, Peter Zijlstra (Intel),
	Chris Metcalf, Christoph Lameter, Davidlohr Bueso, Kirill Tkhai,
	Linus Torvalds, Mike Galbraith, Paul E. McKenney,
	Russell King - ARM Linux admin, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     3fbd7ee285b2bbc6eebd15a3c8786d9776a402a8
Gitweb:        https://git.kernel.org/tip/3fbd7ee285b2bbc6eebd15a3c8786d9776a402a8
Author:        Eric W. Biederman <ebiederm@xmission.com>
AuthorDate:    Sat, 14 Sep 2019 07:33:34 -05:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 25 Sep 2019 17:42:29 +02:00

tasks: Add a count of task RCU users

Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h      | 5 ++++-
 include/linux/sched/task.h | 1 +
 kernel/exit.c              | 7 ++++++-
 kernel/fork.c              | 7 +++----
 4 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e2e9196..8e43e54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1147,7 +1147,10 @@ struct task_struct {
 
 	struct tlbflush_unmap_batch	tlb_ubc;
 
-	struct rcu_head			rcu;
+	union {
+		refcount_t		rcu_users;
+		struct rcu_head		rcu;
+	};
 
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 3d90ed8..153a683 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -120,6 +120,7 @@ static inline void put_task_struct(struct task_struct *t)
 }
 
 struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 22ab6a4..3bcaec2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_task_struct_rcu_user(struct task_struct *task)
+{
+	if (refcount_dec_and_test(&task->rcu_users))
+		call_rcu(&task->rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,7 +227,7 @@ repeat:
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_task_struct_rcu_user(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index 1d1cd06..7eefe33 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,10 +902,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/*
-	 * One for us, one for whoever does the "release_task()" (usually
-	 * parent)
-	 */
+	/* One for the user space visible state that goes away when reaped. */
+	refcount_set(&tsk->rcu_users, 1);
+	/* One for the rcu users, and one for the scheduler */
 	refcount_set(&tsk->usage, 2);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2019-09-27  8:11 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-30 14:08 [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value Russell King - ARM Linux admin
2019-08-30 15:24 ` Oleg Nesterov
2019-08-30 15:30 ` Linus Torvalds
2019-08-30 15:40   ` Russell King - ARM Linux admin
2019-08-30 15:43     ` Linus Torvalds
2019-08-30 15:41   ` Linus Torvalds
2019-08-30 16:09     ` Oleg Nesterov
2019-08-30 16:21       ` Linus Torvalds
2019-08-30 16:44         ` Oleg Nesterov
2019-08-30 16:58           ` Linus Torvalds
2019-08-30 19:36         ` Eric W. Biederman
2019-09-02 13:40           ` Oleg Nesterov
2019-09-02 13:53             ` Peter Zijlstra
2019-09-02 14:44               ` Oleg Nesterov
2019-09-02 16:20                 ` Peter Zijlstra
2019-09-02 17:04             ` Eric W. Biederman
2019-09-02 17:34               ` Linus Torvalds
2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
2019-09-04 14:36                     ` Oleg Nesterov
2019-09-04 14:44                     ` Frederic Weisbecker
2019-09-04 15:32                       ` Oleg Nesterov
2019-09-04 16:33                         ` Frederic Weisbecker
2019-09-04 18:20                           ` Linus Torvalds
2019-09-05 14:59                             ` Frederic Weisbecker
2019-09-03  4:52                   ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
2019-09-03  7:41                     ` Peter Zijlstra
2019-09-03  7:47                       ` Peter Zijlstra
2019-09-03 16:44                         ` Eric W. Biederman
2019-09-03 17:08                           ` Linus Torvalds
2019-09-03 18:13                             ` Eric W. Biederman
2019-09-03 19:18                               ` Linus Torvalds
2019-09-03 20:06                                 ` Peter Zijlstra
2019-09-03 21:32                                   ` Paul E. McKenney
2019-09-05 20:02                                     ` Eric W. Biederman
2019-09-05 20:55                                       ` Paul E. McKenney
2019-09-06  7:07                                       ` Peter Zijlstra
2019-09-09 12:22                                         ` Eric W. Biederman
2019-09-25  7:36                                           ` Peter Zijlstra
2019-09-27  8:10                                   ` [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr tip-bot2 for Eric W. Biederman
2019-09-03 19:42                               ` [PATCH 2/3] task: RCU protect tasks on the runqueue Peter Zijlstra
2019-09-14 12:31                           ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-14 12:31                           ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-14 12:32                           ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-04 14:22                     ` [PATCH 2/3] task: RCU protect tasks on the runqueue Frederic Weisbecker
2019-09-03  4:52                   ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
2019-09-03  9:45                     ` kbuild test robot
2019-09-03 13:06                     ` Oleg Nesterov
2019-09-03 13:58                   ` [PATCH 0/3] task: Making tasks on the runqueue " Oleg Nesterov
2019-09-03 15:44                   ` Linus Torvalds
2019-09-03 19:46                     ` Peter Zijlstra
     [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
2019-09-14 12:33                     ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-15 13:54                       ` Paul E. McKenney
2019-09-27  8:10                       ` [tip: sched/urgent] tasks: Add a count of task RCU users tip-bot2 for Eric W. Biederman
2019-09-14 12:33                     ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-15 14:07                       ` Paul E. McKenney
2019-09-15 14:09                         ` Paul E. McKenney
2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: " tip-bot2 for Eric W. Biederman
2019-09-14 12:34                     ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-15 14:32                       ` Paul E. McKenney
2019-09-15 17:07                         ` Linus Torvalds
2019-09-15 18:47                           ` Paul E. McKenney
2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), " tip-bot2 for Eric W. Biederman
2019-09-14 12:35                     ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
2019-09-15 14:41                       ` Paul E. McKenney
2019-09-15 17:59                         ` Eric W. Biederman
2019-09-15 18:25                           ` Eric W. Biederman
2019-09-15 18:48                             ` Paul E. McKenney
2019-09-20 23:02                       ` Frederic Weisbecker
2019-09-26  1:49                         ` Eric W. Biederman
2019-09-26 12:42                           ` Frederic Weisbecker
2019-09-14 17:43                     ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
2019-09-17 17:38                       ` Eric W. Biederman
2019-09-25  7:51                         ` Peter Zijlstra
2019-09-26  1:11                           ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).