Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()

From: Hillf Danton <hdanton@sina.com>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Yu Liao <liaoyu15@huawei.com>,
	fweisbec@gmail.com, mingo@kernel.org, liwei391@huawei.com,
	adobriyan@gmail.com, mirsad.todorovac@alu.unizg.hr,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
Date: Wed,  1 Feb 2023 22:01:17 +0800	[thread overview]
Message-ID: <20230201140117.539-1-hdanton@sina.com> (raw)
In-Reply-To: <Y9pU4cunJd3aI9+S@lothringen>

On Wed, 1 Feb 2023 13:02:41 +0100 Frederic Weisbecker <frederic@kernel.org>
> On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> > On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > Seriously this procfs accuracy is the least of the problems and if this
> > > would be the only issue then we could trivially fix it by declaring that
> > > the procfs output might go backwards. It's an estimate after all. If
> > > there would be a real reason to ensure monotonicity there then we could
> > > easily do that in the readout code.
> > > 
> > > But the real issue is that both get_cpu_idle_time_us() and
> > > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > > worse than the above procfs idle time going backwards.
> > > 
> > > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > > numbers.
> > > 
> > > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > > Introduce a function to update the idle statistics").
> > 
> > [...]
> > 
> > > 
> > > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > >       better idea.
> > 
> > Provided the percpu rule is enforced, the random numbers mentioned above
> > could be erased without another spinlock added.
> > 
> > Hillf
> > +++ b/kernel/time/tick-sched.c
> > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> >  /*
> >   * Updates the per-CPU time idle statistics counters
> >   */
> > -static void
> > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > +				int io, u64 *last_update_time)
> >  {
> >  	ktime_t delta;
> >  
> > +	if (last_update_time)
> > +		*last_update_time = ktime_to_us(now);
> > +
> >  	if (ts->idle_active) {
> >  		delta = ktime_sub(now, ts->idle_entrytime);
> > +
> > +		/* update is only expected on the local CPU */
> > +		if (cpu != smp_processor_id()) {
> 
> Why not just updating it only on idle exit then?

This aligns to idle exit as much as it can by disallowing remote update.
> 
> > +			if (io)
> 
> I fear it's not up to the caller to decides if the idle time is IO or not.

Could you specify a bit on your concern, given the callers of this function?
> 
> > +				delta = ktime_add(ts->iowait_sleeptime, delta);
> > +			else
> > +				delta = ktime_add(ts->idle_sleeptime, delta);
> > +			return ktime_to_us(delta);

Based on the above comments, I guest you missed this line which prevents
get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.
> > +		}
> > +
> >  		if (nr_iowait_cpu(cpu) > 0)
> >  			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> >  		else
> 
> But you kept the old update above.
> 
> So if this is not the local CPU, what do you do?
> 
> You'd need to return (without updating iowait_sleeptime):
> 
>       ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> 
> Right?

Yes, the diff goes as you suggest.

> But then you may race with the local updater, risking to return
> the delta added twice. So you need at least a seqcount.

Add seqcount if needed. No problem.
> 
> But in the end, nr_iowait_cpu() is broken because that counter can be
> decremented remotely and so the whole thing is beyond repair:
> 
> CPU 0                       CPU  1                    CPU 2
> -----                       -----                     ------
> //io_schedule() TASK A
> current->in_iowait = 1
> rq(0)->nr_iowait++
> //switch to idle
>                     // READ /proc/stat
>                     // See nr_iowait_cpu(0) == 1
>                     return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
> 
>                                                       //try_to_wake_up(TASK A)
>                                                       rq(0)->nr_iowait--
> //idle exit
> // See nr_iowait_cpu(0) == 0
> ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Ah see your point.

The diff disallows remotely updating ts, and it is updated in idle exit
after my proposal, so what nr_iowait_cpu() breaks is mitigated.

Thanks for taking a look, particularly the race linked to nr_iowait_cpu().

Hillf