All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hillf Danton <hdanton@sina.com>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Yu Liao <liaoyu15@huawei.com>,
	fweisbec@gmail.com, mingo@kernel.org, liwei391@huawei.com,
	adobriyan@gmail.com, mirsad.todorovac@alu.unizg.hr,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
Date: Wed,  1 Feb 2023 22:01:17 +0800	[thread overview]
Message-ID: <20230201140117.539-1-hdanton@sina.com> (raw)
In-Reply-To: <Y9pU4cunJd3aI9+S@lothringen>

On Wed, 1 Feb 2023 13:02:41 +0100 Frederic Weisbecker <frederic@kernel.org>
> On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> > On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > Seriously this procfs accuracy is the least of the problems and if this
> > > would be the only issue then we could trivially fix it by declaring that
> > > the procfs output might go backwards. It's an estimate after all. If
> > > there would be a real reason to ensure monotonicity there then we could
> > > easily do that in the readout code.
> > > 
> > > But the real issue is that both get_cpu_idle_time_us() and
> > > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > > worse than the above procfs idle time going backwards.
> > > 
> > > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > > numbers.
> > > 
> > > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > > Introduce a function to update the idle statistics").
> > 
> > [...]
> > 
> > > 
> > > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > >       better idea.
> > 
> > Provided the percpu rule is enforced, the random numbers mentioned above
> > could be erased without another spinlock added.
> > 
> > Hillf
> > +++ b/kernel/time/tick-sched.c
> > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> >  /*
> >   * Updates the per-CPU time idle statistics counters
> >   */
> > -static void
> > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > +				int io, u64 *last_update_time)
> >  {
> >  	ktime_t delta;
> >  
> > +	if (last_update_time)
> > +		*last_update_time = ktime_to_us(now);
> > +
> >  	if (ts->idle_active) {
> >  		delta = ktime_sub(now, ts->idle_entrytime);
> > +
> > +		/* update is only expected on the local CPU */
> > +		if (cpu != smp_processor_id()) {
> 
> Why not just updating it only on idle exit then?

This aligns to idle exit as much as it can by disallowing remote update.
> 
> > +			if (io)
> 
> I fear it's not up to the caller to decides if the idle time is IO or not.

Could you specify a bit on your concern, given the callers of this function?
> 
> > +				delta = ktime_add(ts->iowait_sleeptime, delta);
> > +			else
> > +				delta = ktime_add(ts->idle_sleeptime, delta);
> > +			return ktime_to_us(delta);

Based on the above comments, I guest you missed this line which prevents
get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.
> > +		}
> > +
> >  		if (nr_iowait_cpu(cpu) > 0)
> >  			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> >  		else
> 
> But you kept the old update above.
> 
> So if this is not the local CPU, what do you do?
> 
> You'd need to return (without updating iowait_sleeptime):
> 
>       ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> 
> Right?

Yes, the diff goes as you suggest.

> But then you may race with the local updater, risking to return
> the delta added twice. So you need at least a seqcount.

Add seqcount if needed. No problem.
> 
> But in the end, nr_iowait_cpu() is broken because that counter can be
> decremented remotely and so the whole thing is beyond repair:
> 
> CPU 0                       CPU  1                    CPU 2
> -----                       -----                     ------
> //io_schedule() TASK A
> current->in_iowait = 1
> rq(0)->nr_iowait++
> //switch to idle
>                     // READ /proc/stat
>                     // See nr_iowait_cpu(0) == 1
>                     return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> 
>                                                       //try_to_wake_up(TASK A)
>                                                       rq(0)->nr_iowait--
> //idle exit
> // See nr_iowait_cpu(0) == 0
> ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Ah see your point.

The diff disallows remotely updating ts, and it is updated in idle exit
after my proposal, so what nr_iowait_cpu() breaks is mitigated.

Thanks for taking a look, particularly the race linked to nr_iowait_cpu().

Hillf


  reply	other threads:[~2023-02-01 14:01 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-28  2:00 [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us() Yu Liao
2023-01-31 14:44 ` Thomas Gleixner
2023-01-31 18:35   ` Alexey Dobriyan
2023-01-31 19:59     ` Peter Zijlstra
2023-01-31 19:57   ` Peter Zijlstra
2023-01-31 21:11     ` Frederic Weisbecker
2023-02-01  9:03       ` Peter Zijlstra
2023-02-01  4:53   ` Hillf Danton
2023-02-01 12:02     ` Frederic Weisbecker
2023-02-01 14:01       ` Hillf Danton [this message]
2023-02-01 14:28         ` Frederic Weisbecker
2023-02-08 15:19   ` [PATCH] timers/nohz: Restructure and reshuffle struct tick_sched Frederic Weisbecker
2023-02-06  7:03 ` [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us() kernel test robot
2023-02-07  5:25 ` Mirsad Goran Todorovac
2023-02-07  8:03   ` Mirsad Goran Todorovac

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230201140117.539-1-hdanton@sina.com \
    --to=hdanton@sina.com \
    --cc=adobriyan@gmail.com \
    --cc=frederic@kernel.org \
    --cc=fweisbec@gmail.com \
    --cc=liaoyu15@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=liwei391@huawei.com \
    --cc=mingo@kernel.org \
    --cc=mirsad.todorovac@alu.unizg.hr \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.