* Possible bug in CPU hotplug handling (4.14.0-rc5)
@ 2017-10-20 17:37 Tvrtko Ursulin
  2017-10-20 18:14 ` Thomas Gleixner
  0 siblings, 1 reply; 2+ messages in thread
From: Tvrtko Ursulin @ 2017-10-20 17:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	Sebastian Andrzej Siewior, Paul E. McKenney, Boris Ostrovsky


Hi all,

I think I've found a bug in the CPU hotplug handling when multi-instance
states are used. This is with the 4.14.0-rc5 kernel.

I have not attempted to get to the bottom of the issue to propose an actual
fix, since the logic there looks somewhat complex, but thought I would first
seek the opinion of people who know this area.

What I think is happening is that when a multi-instance state is registered
last, the per-cpu st->node field remains "sticky", i.e. it stays set to the
address of the node belonging to the client which registered last.

The next hotplug event will then invoke all the registered callbacks passing
that same st->node to every client. Obviously bad things happen then, ranging
from oopses to silent memory corruption.
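
For reference, by "registered" I mean the usual multi-instance sequence; a
rough sketch with made-up names and error handling trimmed, just to show
which address ends up stashed in st->node:

struct my_client {
        struct hlist_node node;  /* this address is what st->node ends up holding */
        /* ... driver private data ... */
};

static struct my_client client;

static int my_online(unsigned int cpu, struct hlist_node *node)
{
        /* per-instance online callback */
        return 0;
}

static int my_offline(unsigned int cpu, struct hlist_node *node)
{
        /* per-instance offline callback */
        return 0;
}

static int my_register(void)
{
        int slot;

        /* Dynamically allocated multi-instance state... */
        slot = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
                                       "my/driver:online",
                                       my_online, my_offline);
        if (slot < 0)
                return slot;

        /* ...and one instance hooked into it. */
        return cpuhp_state_add_instance(slot, &client.node);
}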

I have added some custom log messages to catch this; if you follow them
through, this is what happens. First, a multi-instance state client
registers:

 cpuhp store callbacks state=176 name=perf/x86/intel/i915:online multi=1
 i915 cpuhp slot 176
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=0 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=0 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=1 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=1 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=2 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=2 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=3 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=3 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=4 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=4 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=5 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=5 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=6 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=6 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=7 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=7 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp added state=176 node=ffff88021db38390

Then a hotplug event happens:

 cpuhp store callbacks state=176 name=perf/x86/intel/i915:online multi=1
 i915 cpuhp slot 176
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=0 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=0 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=1 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=1 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=2 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=2 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=3 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=3 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=4 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=4 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=5 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=5 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=6 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=6 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp_issue_call state=176 node=ffff88021db38390
 cpuhp_thread_fun cpu=7 state=176 bringup=1 node=ffff88021db38390
 cpuhp_invoke_callback multi-single cpu=7 state=176 node=ffff88021db38390
 cpuhp_thread_fun result=0
 cpuhp added state=176 node=ffff88021db38390
... etc ..

As you can see, the node belonging to the i915 client is passed to all other
registered callbacks. (This is due to how cpuhp_thread_fun calls
cpuhp_invoke_callback with st->node set, which then takes a different path in
the latter.)
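
Paraphrasing the relevant logic as I read it (a simplified sketch, not the
exact code from kernel/cpu.c, using its internal struct cpuhp_step):

static int invoke_multi_state(unsigned int cpu, struct cpuhp_step *step,
                              bool bringup, struct hlist_node *node)
{
        int (*cbm)(unsigned int cpu, struct hlist_node *node);
        int ret = 0;

        cbm = bringup ? step->startup.multi : step->teardown.multi;

        /*
         * Instance add/remove path: invoke the callback for exactly this
         * one instance. On a plain CPU up/down node should be NULL, but
         * if the stale st->node is passed in, every multi-instance state
         * takes this branch with i915's node.
         */
        if (node)
                return cbm(cpu, node);

        /* State transition path: invoke the callback for every instance. */
        hlist_for_each(node, &step->list) {
                ret = cbm(cpu, node);
                if (ret)
                        break;
        }
        return ret;
}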

When another multi-instance client, for example padata (via pcrypt), is
present, it will end up dereferencing its internal data structure through the
wrong pointer.
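
That is because a typical multi-instance callback recovers its private
structure from the node pointer it is handed, so a foreign node silently
lands it in someone else's memory. A sketch with made-up names:

struct my_instance {
        struct hlist_node node;
        cpumask_t cpumask;
};

static int my_instance_online(unsigned int cpu, struct hlist_node *node)
{
        /* Recover the owning structure from the embedded hlist_node. */
        struct my_instance *inst = hlist_entry(node, struct my_instance, node);

        /*
         * If node actually belongs to the i915 instance, inst now points
         * into i915's object and the write below corrupts it (or oopses).
         */
        cpumask_set_cpu(cpu, &inst->cpumask);
        return 0;
}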

In this particular case I have ensured our client is the last to register
by re-loading it after boot. But I guess the same could happen depending on
the order of registrations during boot.

I can work around this in i915 by registering, and immediately
unregistering, a dummy state after the real one has been registered. This
ensures the st->node field gets cleared.
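
Something along these lines (untested sketch, names made up); as I understand
it, registering a plain dynamic state issues its callback with a NULL node on
the online CPUs, which overwrites the stale per-cpu pointer:

static int dummy_online(unsigned int cpu)
{
        return 0;       /* no-op, only the side effect matters */
}

static void i915_flush_stale_cpuhp_node(void)
{
        int slot;

        slot = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "i915:dummy",
                                 dummy_online, NULL);
        if (slot >= 0)
                cpuhp_remove_state(slot);
}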

If desired, I could try to cook up a simple reproducer module? Or perhaps
what is happening is immediately obvious to people experienced in this
area?

Kind regards,

Tvrtko


* Re: Possible bug in CPU hotplug handling (4.14.0-rc5)
  2017-10-20 17:37 Possible bug in CPU hotplug handling (4.14.0-rc5) Tvrtko Ursulin
@ 2017-10-20 18:14 ` Thomas Gleixner
  0 siblings, 0 replies; 2+ messages in thread
From: Thomas Gleixner @ 2017-10-20 18:14 UTC (permalink / raw)
  To: Tvrtko Ursulin
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra,
	Sebastian Andrzej Siewior, Paul E. McKenney, Boris Ostrovsky

On Fri, 20 Oct 2017, Tvrtko Ursulin wrote:
 
> Hi all,
> 
> I think I've found a bug in the CPU hotplug handling when multi-instance
> states are used. This is with the 4.14.0-rc5 kernel.

> I have not attempted to get to the bottom of the issue to propose an actual
> fix, since the logic there looks somewhat complex, but thought I would first
> seek the opinion of people who know this area.

These commits fiddled with that area post rc1, so it's probably buried
somewhere in there.

1db49484f21e ("smp/hotplug: Hotplug state fail injection")
5ebe7742fff8 ("smp/hotplug: Differentiate the AP completion between up and down")
5f4b55e10645 ("smp/hotplug: Differentiate the AP-work lockdep class between up and down")
724a86881d03 ("smp/hotplug: Callback vs state-machine consistency")
4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
96abb968549c ("smp/hotplug: Allow external multi-instance rollback")

I'll have a look at it on Sunday (tomorrow I'm conferencing) with Peter, so
we should be able to figure it out, unless you beat us to it.

Thanks,

	tglx

