linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Russell King - ARM Linux admin <linux@armlinux.org.uk>,
	Chris Metcalf <cmetcalf@ezchip.com>,
	Christoph Lameter <cl@linux.com>, Kirill Tkhai <tkhai@yandex.ru>,
	Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@kernel.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	mpe@ellerman.id.au
Subject: Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
Date: Mon, 09 Sep 2019 07:22:15 -0500	[thread overview]
Message-ID: <87woehek20.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <20190906070736.GR2349@hirez.programming.kicks-ass.net> (Peter Zijlstra's message of "Fri, 6 Sep 2019 09:07:36 +0200")

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Sep 05, 2019 at 03:02:49PM -0500, Eric W. Biederman wrote:
>> Paul, what is the purpose of the barrier in rcu_assign_pointer?
>> 
>> My intuition says it is the assignment half of rcu_dereference, and that
>> anything that rcu_dereference does not need is too strong.
>
> I see that Paul has also replied, but let me give another take on it.

I appreciate his reply as well.  I don't think he picked up that we
are talking about my change to do rcu_assign_pointer(rq->curr, next)
in schedule.

> RCU does *two* different but related things. It provide object
> consistency and it provides object lifetime management.
>
> Object consistency is provided by rcu_assign_pointer() /
> rcu_dereference:
>
>  - rcu_assign_pointer() is a PUBLISH operation, it is meant to expose an
> object to the world. In that respect it needs to ensure that all
> previous stores to this object are complete before it writes the
> pointer and exposes the object.
>
> To this purpose, rcu_assign_pointer() is an smp_store_release().
>
>  - rcu_dereference() is a CONSUME operation. It matches the PUBLISH from
> above and guarantees that any further loads/stores to the observed
> object come after.
>
> Naturally this would be an smp_load_acquire(), _HOWEVER_, since we need
> the address of the object in order to construct a load/store from/to
> said object, we can rely on the address-dependency to provide this
> barrier (except for Alpha, which is fscking weird). After all if you
> haven't completed the load of the (object) base address, you cannot
> compute the object member address and issue any subsequent memops, now
> can you ;-)

I follow and agree till here.

I was definitely confused ealier on which half of the rcu_assign_pointer()
rcu_dereference was important on most cpus.   As I currently understand
things, in a normal case we need a memory barrier between the assignment
of the data to a structure and the assignment of a pointer to that
structure to ensure there is an ordering between the pointer and
the data.

That memory barrier is usually provided by rcu_assign_pointer.

> Now the thing here is that the 'rq->curr = next' assignment does _NOT_
> expose the task (our object). The task is exposed the moment it enters
> the PID hash. It is this that sets us free of the PUBLISH requirement
> and thus we can use RCU_INIT_POINTER().

So I disagree about the PID hash here.  I don't think exposure is
necessarily the right concept to think this through.

> The object lifetime thing is provided by
> rcu_read_load()/rcu_read_unlock() on the one hand and
> synchronize_rcu()/call_rcu() on the other. That ensures that if you
> observe an object inside the RCU read side section, the object storage
> must stay valid.
>
> And it is *that* properpy we wish to make use of.

I absolutely agree that object lifetime is something we want to make use
of.

> Does this make sense to you?

Not completely.

Let me see if I can explain my confusion in terms of task_numa_compare.

The function task_numa_comare now does:

	rcu_read_lock();
	cur = rcu_dereference(dst_rq->curr);

Then it proceeds to examine a few fields of cur.

	cur->flags;
        cur->cpus_ptr;
        cur->numa_group;
        cur->numa_faults[];
        cur->total_numa_faults;
        cur->se.cfs_rq;

My concern here is the ordering between setting rq->curr and and setting
those other fields.  If we read values of those fields that were not
current when we set cur than I think task_numa_compare would give an
incorrect answer.

Now in __schedule the code does:

	rq_lock(rq, &rf);
        smp_mp__after_spinlock();

	...

	next = pick_next_task(rq, prev, &rf);

	if (prev != next) {
        	rq->nr_switches++
        	rq->curr = next; // rcu_assign_pointer(rq->curr, next);
        }

Which basically leads me back to Linus's analsysis where he pointed out
that it is the actions while the task is running before it calls
scheudle that in general cause the fields to be set.

Given that we have a full memory barrier for taking the previous task
off the cpu everything for the next task is guaranteed to be ordered
with the assignment to rq->curr except whatever fields pick_next_task()
decides to change.

From what I can see pick_next_task_fair does not mess with any of
the fields that task_numa_compare uses.  So the existing memory barrier
should be more than sufficient for that case.

So I think I just need to write up a patch 4/3 that replaces the my
"rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
next)".  And includes a great big comment to that effect.



Now maybe I am wrong.  Maybe we can afford to see data that was changed
before the task became rq->curr in task_numa_compare.  But I don't think
so.  I really think it is that full memory barrier just a bit earlier
in __schedule that is keeping us safe.

Am I wrong?

Eric





  reply	other threads:[~2019-09-09 12:22 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-30 14:08 [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value Russell King - ARM Linux admin
2019-08-30 15:24 ` Oleg Nesterov
2019-08-30 15:30 ` Linus Torvalds
2019-08-30 15:40   ` Russell King - ARM Linux admin
2019-08-30 15:43     ` Linus Torvalds
2019-08-30 15:41   ` Linus Torvalds
2019-08-30 16:09     ` Oleg Nesterov
2019-08-30 16:21       ` Linus Torvalds
2019-08-30 16:44         ` Oleg Nesterov
2019-08-30 16:58           ` Linus Torvalds
2019-08-30 19:36         ` Eric W. Biederman
2019-09-02 13:40           ` Oleg Nesterov
2019-09-02 13:53             ` Peter Zijlstra
2019-09-02 14:44               ` Oleg Nesterov
2019-09-02 16:20                 ` Peter Zijlstra
2019-09-02 17:04             ` Eric W. Biederman
2019-09-02 17:34               ` Linus Torvalds
2019-09-03  4:50                 ` [PATCH 0/3] task: Making tasks on the runqueue rcu protected Eric W. Biederman
2019-09-03  4:51                   ` [PATCH 1/3] task: Add a count of task rcu users Eric W. Biederman
2019-09-04 14:36                     ` Oleg Nesterov
2019-09-04 14:44                     ` Frederic Weisbecker
2019-09-04 15:32                       ` Oleg Nesterov
2019-09-04 16:33                         ` Frederic Weisbecker
2019-09-04 18:20                           ` Linus Torvalds
2019-09-05 14:59                             ` Frederic Weisbecker
2019-09-03  4:52                   ` [PATCH 2/3] task: RCU protect tasks on the runqueue Eric W. Biederman
2019-09-03  7:41                     ` Peter Zijlstra
2019-09-03  7:47                       ` Peter Zijlstra
2019-09-03 16:44                         ` Eric W. Biederman
2019-09-03 17:08                           ` Linus Torvalds
2019-09-03 18:13                             ` Eric W. Biederman
2019-09-03 19:18                               ` Linus Torvalds
2019-09-03 20:06                                 ` Peter Zijlstra
2019-09-03 21:32                                   ` Paul E. McKenney
2019-09-05 20:02                                     ` Eric W. Biederman
2019-09-05 20:55                                       ` Paul E. McKenney
2019-09-06  7:07                                       ` Peter Zijlstra
2019-09-09 12:22                                         ` Eric W. Biederman [this message]
2019-09-25  7:36                                           ` Peter Zijlstra
2019-09-27  8:10                                   ` [tip: sched/urgent] tasks, sched/core: RCUify the assignment of rq->curr tip-bot2 for Eric W. Biederman
2019-09-03 19:42                               ` [PATCH 2/3] task: RCU protect tasks on the runqueue Peter Zijlstra
2019-09-14 12:31                           ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-14 12:31                           ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-14 12:32                           ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-04 14:22                     ` [PATCH 2/3] task: RCU protect tasks on the runqueue Frederic Weisbecker
2019-09-03  4:52                   ` [PATCH 3/3] task: Clean house now that tasks on the runqueue are rcu protected Eric W. Biederman
2019-09-03  9:45                     ` kbuild test robot
2019-09-03 13:06                     ` Oleg Nesterov
2019-09-03 13:58                   ` [PATCH 0/3] task: Making tasks on the runqueue " Oleg Nesterov
2019-09-03 15:44                   ` Linus Torvalds
2019-09-03 19:46                     ` Peter Zijlstra
     [not found]                   ` <87muf7f4bf.fsf_-_@x220.int.ebiederm.org>
2019-09-14 12:33                     ` [PATCH v2 1/4] task: Add a count of task rcu users Eric W. Biederman
2019-09-15 13:54                       ` Paul E. McKenney
2019-09-27  8:10                       ` [tip: sched/urgent] tasks: Add a count of task RCU users tip-bot2 for Eric W. Biederman
2019-09-14 12:33                     ` [PATCH v2 2/4] task: Ensure tasks are available for a grace period after leaving the runqueue Eric W. Biederman
2019-09-15 14:07                       ` Paul E. McKenney
2019-09-15 14:09                         ` Paul E. McKenney
2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: " tip-bot2 for Eric W. Biederman
2019-09-14 12:34                     ` [PATCH v2 3/4] task: With a grace period after finish_task_switch, remove unnecessary code Eric W. Biederman
2019-09-15 14:32                       ` Paul E. McKenney
2019-09-15 17:07                         ` Linus Torvalds
2019-09-15 18:47                           ` Paul E. McKenney
2019-09-27  8:10                       ` [tip: sched/urgent] tasks, sched/core: With a grace period after finish_task_switch(), " tip-bot2 for Eric W. Biederman
2019-09-14 12:35                     ` [PATCH v2 4/4] task: RCUify the assignment of rq->curr Eric W. Biederman
2019-09-15 14:41                       ` Paul E. McKenney
2019-09-15 17:59                         ` Eric W. Biederman
2019-09-15 18:25                           ` Eric W. Biederman
2019-09-15 18:48                             ` Paul E. McKenney
2019-09-20 23:02                       ` Frederic Weisbecker
2019-09-26  1:49                         ` Eric W. Biederman
2019-09-26 12:42                           ` Frederic Weisbecker
2019-09-14 17:43                     ` [PATCH v2 0/4] task: Making tasks on the runqueue rcu protected Linus Torvalds
2019-09-17 17:38                       ` Eric W. Biederman
2019-09-25  7:51                         ` Peter Zijlstra
2019-09-26  1:11                           ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87woehek20.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=cl@linux.com \
    --cc=cmetcalf@ezchip.com \
    --cc=dave@stgolabs.net \
    --cc=efault@gmx.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=mingo@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=oleg@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=tkhai@yandex.ru \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).