[0/8] sched: Fix hot-unplug regressions

Message ID 20210116113033.608340773@infradead.org

Message

Peter Zijlstra Jan. 16, 2021, 11:30 a.m. UTC
Hi,

These patches (no longer 4) seem to fix all the hotplug regressions, as per
nearly 100 18*SRCU-P runs overnight.

I did clean up the patches, so possibly I wrecked it again. I've started new
runs and will again leave them running overnight.

Paul, if you could please also throw your monster machine at it.

Comments

Peter Zijlstra Jan. 16, 2021, 3:25 p.m. UTC | #1
On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> These patches (no longer 4) seem to fix all the hotplug regressions, as per
> nearly 100 18*SRCU-P runs overnight.
> 
> I did clean up the patches, so possibly I wrecked it again. I've started new
> runs and will again leave them running overnight.

Hurph... I've got one splat from this version, one I've not seen before:

[   68.712848] Dying CPU not properly vacated!
...
[   68.744448] CPU1 enqueued tasks (2 total):
[   68.745018]  pid: 14, name: rcu_preempt
[   68.745557]  pid: 18, name: migration/1

Paul, rcu_preempt is from rcu_spawn_gp_kthread(), right? Afaict that
doesn't even have affinity... /me wonders HTH that ended up on the
runqueue so late.
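
For reference, the splat is emitted by the vacate check in sched_cpu_dying();
a simplified sketch of it, assuming the mainline kernel/sched/core.c layout
(the exact code in the patched tree may differ slightly):

int sched_cpu_dying(unsigned int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct rq_flags rf;

	sched_tick_stop(cpu);

	rq_lock_irqsave(rq, &rf);
	/* Only the stopper task may still be runnable at this point. */
	if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
		WARN(true, "Dying CPU not properly vacated!");
		dump_rq_tasks(rq, __func__);	/* the pid/name lines above */
	}
	rq_unlock_irqrestore(rq, &rf);
	/* ... */
	return 0;
}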
Paul E. McKenney Jan. 16, 2021, 3:45 p.m. UTC | #2
On Sat, Jan 16, 2021 at 04:25:58PM +0100, Peter Zijlstra wrote:
> On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> > Hi,
> > 
> > These patches (no longer 4) seem to fix all the hotplug regressions, as per
> > nearly 100 18*SRCU-P runs overnight.
> > 
> > I did clean up the patches, so possibly I wrecked it again. I've started new
> > runs and will again leave them running overnight.
> 
> Hurph... I've got one splat from this version, one I've not seen before:
> 
> [   68.712848] Dying CPU not properly vacated!
> ...
> [   68.744448] CPU1 enqueued tasks (2 total):
> [   68.745018]  pid: 14, name: rcu_preempt
> [   68.745557]  pid: 18, name: migration/1
> 
> Paul, rcu_preempt is from rcu_spawn_gp_kthread(), right? Afaict that
> doesn't even have affinity... /me wonders HTH that ended up on the
> runqueue so late.

Yes, rcu_preempt is from rcu_spawn_gp_kthread(), and you are right that
the kernel code does not bind it anywhere.  If this is rcutorture,
there isn't enough of a userspace to do the binding there, either.
Wakeups for the rcu_preempt task can happen in odd places, though.
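
For reference, the relevant part of rcu_spawn_gp_kthread() looks roughly
like this (simplified from kernel/rcu/tree.c; priority setup and the
grace-period bookkeeping are elided):

static int __init rcu_spawn_gp_kthread(void)
{
	struct task_struct *t;

	/* rcu_state.name is "rcu_preempt" on preemptible kernels. */
	t = kthread_create(rcu_gp_kthread, NULL, "%s", rcu_state.name);
	if (WARN_ONCE(IS_ERR(t), "%s: Could not start grace-period kthread\n",
		      __func__))
		return 0;
	/* No kthread_bind()/set_cpus_allowed_ptr() anywhere here. */
	rcu_state.gp_kthread = t;
	wake_up_process(t);
	return 0;
}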

Grasping at straws...  Would Frederic's series help?  This is in
-rcu here:

cfd941c rcu/nocb: Detect unsafe checks for offloaded rdp
028d407 rcu: Remove superfluous rdp fetch
38e216a rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
53775fd rcu/nocb: Perform deferred wake up before last idle's need_resched() check
1fbabce rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
2856844 entry: Explicitly flush pending rcuog wakeup before last rescheduling points
4d959df sched: Report local wake up on resched blind zone within idle loop
2617331 entry: Report local wake up on resched blind zone while resuming to user
79acd12 timer: Report ignored local enqueue in nohz mode

I have been including these in all of my tests of your patches.

							Thanx, Paul
Paul E. McKenney Jan. 16, 2021, 3:48 p.m. UTC | #3
On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> Hi,
> 
> These patches (no longer 4) seem to fix all the hotplug regressions, as per
> nearly 100 18*SRCU-P runs overnight.

Nice!!!

> I did clean up the patches, so possibly I wrecked it again. I've started new
> runs and will again leave them running overnight.
> 
> Paul, if you could please also throw your monster machine at it.

Will do as soon as the tests I started yesterday complete, which should
be this afternoon, Pacific Time.

My thought is to do the full set of scenarios overnight, then try
hammering SRCU-P and/or whatever else shows up in the overnight
test.  Seem reasonable?

							Thanx, Paul
Peter Zijlstra Jan. 16, 2021, 6:51 p.m. UTC | #4
On Sat, Jan 16, 2021 at 07:45:42AM -0800, Paul E. McKenney wrote:
> On Sat, Jan 16, 2021 at 04:25:58PM +0100, Peter Zijlstra wrote:
> > On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > These patches (no longer 4) seem to fix all the hotplug regressions, as per
> > > nearly 100 18*SRCU-P runs overnight.
> > > 
> > > I did clean up the patches, so possibly I wrecked it again. I've started new
> > > runs and will again leave them running overnight.
> > 
> > Hurph... I've got one splat from this version, one I've not seen before:
> > 
> > [   68.712848] Dying CPU not properly vacated!
> > ...
> > [   68.744448] CPU1 enqueued tasks (2 total):
> > [   68.745018]  pid: 14, name: rcu_preempt
> > [   68.745557]  pid: 18, name: migration/1
> > 
> > Paul, rcu_preempt is from rcu_spawn_gp_kthread(), right? Afaict that
> > doesn't even have affinity... /me wonders HTH that ended up on the
> > runqueue so late.
> 
> Yes, rcu_preempt is from rcu_spawn_gp_kthread(), and you are right that
> the kernel code does not bind it anywhere.  If this is rcutorture,
> there isn't enough of a userspace to do the binding there, either.
> Wakeups for the rcu_preempt task can happen in odd places, though.
> 
> Grasping at straws...

My current straw is that the wakeup lands on the wakelist before ttwu()
refuses to wake onto the CPU, and then lands on the RQ after we've
waited. Which seems near impossible...
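
Roughly the interleaving I'm imagining, sketched with names from
kernel/sched/core.c (purely a hypothesis at this point, not a confirmed
trace):

/*
 *   waker CPU                        dying CPU (CPU1)
 *   ---------                        ----------------
 *   ttwu(p)
 *     cpu = select_task_rq(p)        // CPU1 still looks eligible
 *     ttwu_queue_wakelist(p, cpu)
 *       __ttwu_queue_wakelist()      balance_push() pushes everything away,
 *         // p parked on CPU1's      sched_cpu_wait_empty() sees an empty rq
 *         // wake_list, IPI sent     and lets hotplug continue
 *                                    sched_ttwu_pending()
 *                                      ttwu_do_activate(rq, p)
 *                                      // p enqueued on CPU1 only now,
 *                                      // after the wait -> splat above
 */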

I'll keep staring..
Paul E. McKenney Jan. 18, 2021, 5:28 a.m. UTC | #5
On Sat, Jan 16, 2021 at 07:48:59AM -0800, Paul E. McKenney wrote:
> On Sat, Jan 16, 2021 at 12:30:33PM +0100, Peter Zijlstra wrote:
> > Hi,
> > 
> > These patches (no longer 4) seem to fix all the hotplug regressions, as per
> > nearly 100 18*SRCU-P runs overnight.
> 
> Nice!!!
> 
> > I did clean up the patches, so possibly I wrecked it again. I've started new
> > runs and will again leave them running overnight.
> > 
> > Paul, if you could please also throw your monster machine at it.
> 
> Will do as soon as the tests I started yesterday complete, which should
> be this afternoon, Pacific Time.
> 
> My thought is to do the full set of scenarios overnight, then try
> hammering SRCU-P and/or whatever else shows up in the overnight
> test.  Seem reasonable?

And the SRCU-P runs did fine.  I got some task hangs on TREE03 and (to
a lesser extent) TREE04.  These might well be my fault, so I will try
bisecting tomorrow, Pacific Time.

							Thanx, Paul