From: Frederic Weisbecker <frederic@kernel.org>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Oleg Nesterov <oleg@redhat.com>,
Russell King - ARM Linux admin <linux@armlinux.org.uk>,
Chris Metcalf <cmetcalf@ezchip.com>,
Christoph Lameter <cl@linux.com>, Kirill Tkhai <tkhai@yandex.ru>,
Mike Galbraith <efault@gmx.de>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@kernel.org>,
Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
Davidlohr Bueso <dave@stgolabs.net>,
"Paul E. McKenney" <paulmck@kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr
Date: Sat, 21 Sep 2019 01:02:49 +0200 [thread overview]
Message-ID: <20190920230247.GA6449@lenoir> (raw)
In-Reply-To: <87ftkzdpjd.fsf_-_@x220.int.ebiederm.org>
On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
>
> The current task on the runqueue is currently read with rcu_dereference().
>
> To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
> to be paird with rcu_assign_pointer of rq->curr. Which provides the
> memory barrier necessary to order assignments to the task_struct
> and the assignment to rq->curr.
>
> Unfortunately the assignment of rq->curr in __schedule is a hot path,
> and it has already been show that additional barriers in that code
> will reduce the performance of the scheduler. So I will attempt to
> describe below why you can effectively have ordinary rcu semantics
> without any additional barriers.
>
> The assignment of rq->curr in init_idle is a slow path called once
> per cpu and that can use rcu_assign_pointer() without any concerns.
>
> As I write this there are effectively two users of rcu_dereference on
> rq->curr. There is the membarrier code in kernel/sched/membarrier.c
> that only looks at "->mm" after the rcu_dereference. Then there is
> task_numa_compare() in kernel/sched/fair.c. My best reading of the
> code shows that task_numa_compare only access: "->flags",
> "->cpus_ptr", "->numa_group", "->numa_faults[]",
> "->total_numa_faults", and "->se.cfs_rq".
>
> The code in __schedule() essentially does:
> rq_lock(...);
> smp_mb__after_spinlock();
>
> next = pick_next_task(...);
> rq->curr = next;
>
> context_switch(prev, next);
>
> At the start of the function the rq_lock/smp_mb__after_spinlock
> pair provides a full memory barrier. Further there is a full memory barrier
> in context_switch().
>
> This means that any task that has already run and modified itself (the
> common case) has already seen two memory barriers before __schedule()
> runs and begins executing. A task that modifies itself then sees a
> third full memory barrier pair with the rq_lock();
>
> For a brand new task that is enqueued with wake_up_new_task() there
> are the memory barriers present from the taking and release the
> pi_lock and the rq_lock as the processes is enqueued as well as the
> full memory barrier at the start of __schedule() assuming __schedule()
> happens on the same cpu.
>
> This means that by the time we reach the assignment of rq->curr
> except for values on the task struct modified in pick_next_task
> the code has the same guarantees as if it used rcu_assign_pointer.
>
> Reading through all of the implementations of pick_next_task it
> appears pick_next_task is limited to modifying the task_struct fields
> "->se", "->rt", "->dl". These fields are the sched_entity structures
> of the varies schedulers.
>
> Further "->se.cfs_rq" is only changed in cgroup attach/move operations
> initialized by userspace.
>
> Unless I have missed something this means that in practice that the
> users of "rcu_dereerence(rq->curr)" get normal rcu semantics of
> rcu_dereference() for the fields the care about, despite the
> assignment of rq->curr in __schedule() ot using rcu_assign_pointer.
>
> Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> kernel/sched/core.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 69015b7c28da..668262806942 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>
> if (likely(prev != next)) {
> rq->nr_switches++;
> - rq->curr = next;
> + /*
> + * RCU users of rcu_dereference(rq->curr) may not see
> + * changes to task_struct made by pick_next_task().
> + */
> + RCU_INIT_POINTER(rq->curr, next);
It would be nice to have more explanations in the comments as to why we
don't use rcu_assign_pointer() here (the very fast-path issue) and why
it is expected to be fine (the rq_lock() + post spinlock barrier) under
which condition. Some short summary of the changelog. Because that line
implies way too many subtleties.
Thanks.
next prev parent reply other threads:[~2019-09-20 23:02 UTC|newest]
Thread overview: 75+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-30 14:08 [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value Russell King - ARM Linux admin
2019-08-30 15:24 ` Oleg Nesterov
2019-08-30 15:30 ` Linus Torvalds
2019-08-30 15:40 ` Russell King - ARM Linux admin
2019-08-30 15:43 ` Linus Torvalds
2019-08-30 15:41 ` Linus Torvalds
2019-08-30 16:09 ` Oleg Nesterov
2019-08-30 16:21 ` Linus Torvalds
2019-08-30 16:44 ` Oleg Nesterov
2019-08-30 16:58 ` Linus Torvalds
2019-08-30 19:36 ` Eric W. Biederman
2019-09-02 13:40 ` Oleg Nesterov
2019-09-02 13:53 ` Peter Zijlstra
2019-09-02 14:44 ` Oleg Nesterov
2019-09-02 16:20 ` Peter Zijlstra
2019-09-02 17:04 ` Eric W. Biederman
2019-09-02 17:34 ` Linus Torvalds
2019-09-03 4:50 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
2019-09-03 4:51 ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
2019-09-04 14:36 ` Oleg Nesterov
2019-09-04 14:44 ` Frederic Weisbecker
2019-09-04 15:32 ` Oleg Nesterov
2019-09-04 16:33 ` Frederic Weisbecker
2019-09-04 18:20 ` Linus Torvalds
2019-09-05 14:59 ` Frederic Weisbecker
2019-09-03 4:52 ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
2019-09-03 7:41 ` Peter Zijlstra
2019-09-03 7:47 ` Peter Zijlstra
2019-09-03 16:44 ` Eric W. Biederman
2019-09-03 17:08 ` Linus Torvalds
2019-09-03 18:13 ` Eric W. Biederman
2019-09-03 19:18 ` Linus Torvalds
2019-09-03 20:06 ` Peter Zijlstra
2019-09-03 21:32 ` Paul E. McKenney
2019-09-05 20:02 ` Eric W. Biederman
2019-09-05 20:55 ` Paul E. McKenney
2019-09-06 7:07 ` Peter Zijlstra
2019-09-09 12:22 ` Eric W. Biederman
2019-09-25 7:36 ` Peter Zijlstra
2019-09-27 8:10 ` [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr tip-bot2 for Eric W. Biederman
2019-09-03 19:42 ` [PATCH 2/3] task: RCU protect tasks on the runqueue Peter Zijlstra
2019-09-14 12:31 ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-14 12:31 ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-14 12:32 ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-04 14:22 ` [PATCH 2/3] task: RCU protect tasks on the runqueue Frederic Weisbecker
2019-09-03 4:52 ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
2019-09-03 9:45 ` kbuild test robot
2019-09-03 13:06 ` Oleg Nesterov
2019-09-03 13:58 ` [PATCH 0/3] task: Making tasks on the runqueue " Oleg Nesterov
2019-09-03 15:44 ` Linus Torvalds
2019-09-03 19:46 ` Peter Zijlstra
[not found] ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
2019-09-14 12:33 ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-15 13:54 ` Paul E. McKenney
2019-09-27 8:10 ` [tip: sched/urgent] tasks: Add a count of task RCU users tip-bot2 for Eric W. Biederman
2019-09-14 12:33 ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-15 14:07 ` Paul E. McKenney
2019-09-15 14:09 ` Paul E. McKenney
2019-09-27 8:10 ` [tip: sched/urgent] tasks, sched/core: " tip-bot2 for Eric W. Biederman
2019-09-14 12:34 ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-15 14:32 ` Paul E. McKenney
2019-09-15 17:07 ` Linus Torvalds
2019-09-15 18:47 ` Paul E. McKenney
2019-09-27 8:10 ` [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), " tip-bot2 for Eric W. Biederman
2019-09-14 12:35 ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
2019-09-15 14:41 ` Paul E. McKenney
2019-09-15 17:59 ` Eric W. Biederman
2019-09-15 18:25 ` Eric W. Biederman
2019-09-15 18:48 ` Paul E. McKenney
2019-09-20 23:02 ` Frederic Weisbecker [this message]
2019-09-26 1:49 ` Eric W. Biederman
2019-09-26 12:42 ` Frederic Weisbecker
2019-09-14 17:43 ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
2019-09-17 17:38 ` Eric W. Biederman
2019-09-25 7:51 ` Peter Zijlstra
2019-09-26 1:11 ` Eric W. Biederman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190920230247.GA6449@lenoir \
--to=frederic@kernel.org \
--cc=cl@linux.com \
--cc=cmetcalf@ezchip.com \
--cc=dave@stgolabs.net \
--cc=ebiederm@xmission.com \
--cc=efault@gmx.de \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@armlinux.org.uk \
--cc=mingo@kernel.org \
--cc=oleg@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=tkhai@yandex.ru \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).