On Tue, 2020-03-10 at 09:09 +0100, Juergen Gross wrote:
> Offlining a cpu with core scheduling active can result in a hanging
> system. The reason is that the scheduling resource and unit of the
> to-be-removed cpus need to be split in order to remove the cpu from
> its cpupool and move it to the idle scheduler. In case one of the
> involved cpus happens to have received a sched slave event due to a
> vcpu formerly having been running on that cpu being woken up again,
> it can happen that this cpu will enter sched_wait_rendezvous_in()
> while its scheduling resource is just about to be split. It might
> wait forever for the other sibling to join, which will never happen
> due to the resources already being modified.
>
> This can easily be avoided by:
> - resetting the rendezvous counters of the idle unit which is kept
> - checking for a new scheduling resource in
>   sched_wait_rendezvous_in() after reacquiring the scheduling lock
>   and resetting the counters in that case without scheduling another
>   vcpu
> - moving schedule resource modifications (in schedule_cpu_rm()) and
>   retrieving (schedule(), sched_slave() is fine already, others are
>   not critical) into locked regions
>
> Reported-by: Igor Druzhinin
> Signed-off-by: Juergen Gross
>
Reviewed-by: Dario Faggioli

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<> (Raistlin Majere)