From: Peter Zijlstra <peterz@infradead.org>
To: Phil Auld <pauld@redhat.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Waiman Long <longman@redhat.com>, Ingo Molnar <mingo@kernel.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] sched: Fix nr_uninterruptible race causing increasing load average
Date: Thu, 8 Jul 2021 09:54:06 +0200	[thread overview]
Message-ID: <YOavHgRUBM6cc95s@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <YOatszHNZc9XRbYB@hirez.programming.kicks-ass.net>

On Thu, Jul 08, 2021 at 09:48:03AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 08, 2021 at 09:26:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > > ("sched: Fix loadavg accounting race") causes increasing values of load
> > > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > > CPU not always seeing the write to task->sched_contributes_to_load in
> > > __schedule(). Missing that write, we fail to decrement nr_uninterruptible
> > > when waking up a task that incremented nr_uninterruptible when it slept.
> > > 
> > > The rq->lock serialization is insufficient across different rq->locks.
> > > 
> > > Add smp_wmb() to __schedule() and smp_rmb() before the read in
> > > ttwu_do_activate().
> > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 4ca80df205ce..ced7074716eb 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > >  
> > >  	lockdep_assert_held(&rq->lock);
> > >  
> > > +	/* Pairs with smp_wmb in __schedule() */
> > > +	smp_rmb();
> > >  	if (p->sched_contributes_to_load)
> > >  		rq->nr_uninterruptible--;
> > >  
> > 
> > Is this really needed?! (this question is a big fat clue that the comment is
> > insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> > and hence the p->sched_contributes_to_load load must already happen after.
> > 
> > > @@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
> > >  				!(prev_state & TASK_NOLOAD) &&
> > >  				!(prev->flags & PF_FROZEN);
> > >  
> > > +			/*
> > > +			 * Make sure the previous write is ordered before p->on_rq etc so
> > > +			 * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
> > > +			 */
> > > +			smp_wmb();
> > >  			if (prev->sched_contributes_to_load)
> > >  				rq->nr_uninterruptible++;
> > 
> > That comment is terrible, look at all the other barrier comments around
> > there for clues; in effect you're worrying about:
> > 
> > 	p->sched_contributes_to_load = X	R1 = p->on_rq
> > 	WMB					RMB
> > 	p->on_rq = Y				R2 = p->sched_contributes_to_load
> > 
> > Right?
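
In LKMM litmus-test form this is the classic message-passing shape; a
minimal sketch with stand-in variable names for the two fields (with the
proposed wmb/rmb pair the "exists" outcome is forbidden, without them it
is allowed on weakly ordered machines):

	C MP+wmb+rmb

	(*
	 * P0 stands in for __schedule(): set sched_contributes_to_load,
	 * then update p->on_rq.  P1 stands in for ttwu_do_activate():
	 * observe p->on_rq, then read sched_contributes_to_load.
	 *)

	{}

	P0(int *contributes_to_load, int *on_rq)
	{
		WRITE_ONCE(*contributes_to_load, 1);
		smp_wmb();
		WRITE_ONCE(*on_rq, 1);
	}

	P1(int *contributes_to_load, int *on_rq)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*on_rq);
		smp_rmb();
		r2 = READ_ONCE(*contributes_to_load);
	}

	exists (1:r1=1 /\ 1:r2=0)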
> > 
> > 
> > Bah bah bah.. I so detest having to add barriers here for silly
> > accounting. Let me think about this a little.
> 
> I got the below:
> 
> __schedule()					ttwu()
> 
> rq_lock()					raw_spin_lock(&p->pi_lock)
> smp_mb__after_spinlock();			smp_mb__after_spinlock();
> 
> p->sched_contributes_to_load = X;		if (READ_ONCE(p->on_rq) && ...)
> 						  goto unlock;
> 						smp_acquire__after_ctrl_dep();
> 
> 						smp_cond_load_acquire(&p->on_cpu, !VAL)
> 
> deactivate_task()
>   p->on_rq = 0;
> 
> context_switch()
>   finish_task_switch()
>     finish_task()
>       smp_store_release(p->on_cpu, 0);
> 
> 						ttwu_queue()
> 						  rq_lock()
> 						    ttwu_do_activate()
> 						      if (p->sched_contributes_to_load)
> 						        ...
> 						  rq_unlock()
> 						raw_spin_unlock(&p->pi_lock);
>     finish_lock_switch()
>       rq_unlock();
> 
> 
> 
> The only way for ttwu() to end up in an enqueue is if it did a
> LOAD-ACQUIRE on ->on_cpu, 

That's not completely true; there's the WF_ON_CPU case, but in that
scenario we IPI the CPU doing __schedule(), so everything happens on a
single CPU in program order (UP/PO) and must trivially work.

> but that orders with the STORE-RELEASE on the
> same, which ensures the p->sched_contributes_to_load LOAD must happen
> after the STORE.
> 
> What am I missing? Your Changelog/comments provide insufficient clues..
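
Reduced to the same litmus form, the release/acquire argument looks like
this (names again stand-ins; smp_cond_load_acquire() is approximated by a
plain load-acquire and the ->on_cpu handoff value is abstracted, so this
is only a sketch of the claimed ordering):

	C MP+release+acquire

	(*
	 * P0 stands in for __schedule()/finish_task(): the store to
	 * sched_contributes_to_load is ordered before the release of
	 * ->on_cpu.  P1 stands in for ttwu(): the acquire of ->on_cpu is
	 * ordered before the read in ttwu_do_activate().  The "exists"
	 * outcome is forbidden, i.e. once the waker has observed the
	 * ->on_cpu release it must also observe the flag, without any
	 * additional wmb/rmb.
	 *)

	{}

	P0(int *contributes_to_load, int *on_cpu)
	{
		WRITE_ONCE(*contributes_to_load, 1);
		smp_store_release(on_cpu, 1);
	}

	P1(int *contributes_to_load, int *on_cpu)
	{
		int r0;
		int r1;

		r0 = smp_load_acquire(on_cpu);
		r1 = READ_ONCE(*contributes_to_load);
	}

	exists (1:r0=1 /\ 1:r1=0)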

Thread overview: 12+ messages
2021-07-07 19:04 [PATCH] sched: Fix nr_uninterruptible race causing increasing load average Phil Auld
2021-07-08  7:26 ` Peter Zijlstra
2021-07-08  7:48   ` Peter Zijlstra
2021-07-08  7:54     ` Peter Zijlstra [this message]
2021-07-08 14:54       ` Phil Auld
2021-07-09 12:57         ` Peter Zijlstra
2021-07-11 13:19           ` Phil Auld
2021-07-08 13:25   ` Phil Auld
2021-07-09 11:38     ` Peter Zijlstra
2021-07-11 12:57       ` Phil Auld
2021-07-23 13:38       ` Phil Auld
2021-07-28 15:45         ` Phil Auld
