* [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
@ 2020-08-04  3:29 Michael Roth
  2020-08-04 13:35 ` Michael Ellerman
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Roth @ 2020-08-04  3:29 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nathan Lynch, Cedric Le Goater, Greg Kurz

For a power9 KVM guest with XIVE enabled, running a test loop
where we hotplug 384 vcpus and then unplug them, the following traces
can be seen (generally within a few loops) either from the unplugged
vcpu:

  [ 1767.353447] cpu 65 (hwid 65) Ready to die...
  [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
  [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
  [ 1767.952322] ------------[ cut here ]------------
  [ 1767.952323] kernel BUG at lib/list_debug.c:56!
  [ 1767.952325] Oops: Exception in kernel mode, sig: 5 [#1]
  [ 1767.952326] LE SMP NR_CPUS=2048 NUMA pSeries
  [ 1767.952328] Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat_ipv6 nf_nat_ipv6 nft_chain_route_ipv6 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_chain_route_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink uio_pdrv_genirq ip_tables xfs libcrc32c sd_mod sg xts vmx_crypto virtio_net net_failover failover virtio_scsi dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
  [ 1767.952352] CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1
  [ 1767.952354] NIP:  c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac
  [ 1767.952355] REGS: c0000009e5a17840 TRAP: 0700   Not tainted  (4.18.0-221.el8.ppc64le)
  [ 1767.952355] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28000842  XER: 20040000
  [ 1767.952360] CFAR: c0000000001ffe64 IRQMASK: 1
  [ 1767.952360] GPR00: c0000000007ab508 c0000009e5a17ac0 c000000001ac0700 0000000000000054
  [ 1767.952360] GPR04: c0000009f056cf90 c0000009f05f5628 c0000009ed40d654 c000000001c90700
  [ 1767.952360] GPR08: 0000000000000007 c0000009f0573980 00000009ef2b0000 7562202c38303230
  [ 1767.952360] GPR12: 0000000000000000 c0000007fff6ce80 c00a000002470208 0000000000000000
  [ 1767.952360] GPR16: 0000000000000001 c0000009f05fbb00 0000000000000800 c0000009ff3d4980
  [ 1767.952360] GPR20: c0000009f05fbb10 5deadbeef0000100 5deadbeef0000200 0000000000187961
  [ 1767.952360] GPR24: c0000009e5a17b78 0000000000000000 0000000000000001 ffffffffffffffff
  [ 1767.952360] GPR28: c00a000002470200 c0000009f05fbb10 c0000009f05fbb10 0000000000000000
  [ 1767.952375] NIP [c0000000007ab50c] __list_del_entry_valid+0xac/0x100
  [ 1767.952376] LR [c0000000007ab508] __list_del_entry_valid+0xa8/0x100
  [ 1767.952377] Call Trace:
  [ 1767.952378] [c0000009e5a17ac0] [c0000000007ab508] __list_del_entry_valid+0xa8/0x100 (unreliable)
  [ 1767.952381] [c0000009e5a17b20] [c000000000476e18] free_pcppages_bulk+0x1f8/0x940
  [ 1767.952383] [c0000009e5a17c20] [c00000000047a9a0] free_unref_page+0xd0/0x100
  [ 1767.952386] [c0000009e5a17c50] [c0000000000bc2a8] xive_spapr_cleanup_queue+0x148/0x1b0
  [ 1767.952388] [c0000009e5a17cf0] [c0000000000b96dc] xive_teardown_cpu+0x1bc/0x240
  [ 1767.952391] [c0000009e5a17d30] [c00000000010bf28] pseries_mach_cpu_die+0x78/0x2f0
  [ 1767.952393] [c0000009e5a17de0] [c00000000005c8d8] cpu_die+0x48/0x70
  [ 1767.952394] [c0000009e5a17e00] [c000000000021cf0] arch_cpu_idle_dead+0x20/0x40
  [ 1767.952397] [c0000009e5a17e20] [c0000000001b4294] do_idle+0x2f4/0x4c0
  [ 1767.952399] [c0000009e5a17ea0] [c0000000001b46a8] cpu_startup_entry+0x38/0x40
  [ 1767.952400] [c0000009e5a17ed0] [c00000000005c43c] start_secondary+0x7bc/0x8f0
  [ 1767.952403] [c0000009e5a17f90] [c00000000000ac70] start_secondary_prolog+0x10/0x14

or on the worker thread handling the unplug:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2
  [ 1538.360736] BUG: Bad page state in process kworker/u768:3  pfn:95de1
  [ 1538.360746] cpu 314 (hwid 314) Ready to die...
  [ 1538.360784] page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0
  [ 1538.361881] flags: 0x5ffffc00000000()
  [ 1538.361908] raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000
  [ 1538.361955] raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
  [ 1538.362002] page dumped because: nonzero mapcount
  [ 1538.362033] Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables nft_compat ip_set nf_tables nfnetlink sunrpc xts vmx_crypto ip_tables xfs libcrc32c sd_mod sg virtio_net net_failover virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod
  [ 1538.362613] CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1
  [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
  [ 1538.362725] Call Trace:
  [ 1538.362743] [c0000009d4adf590] [c000000000e0e0fc] dump_stack+0xb0/0xf4 (unreliable)
  [ 1538.362789] [c0000009d4adf5d0] [c000000000475dfc] bad_page+0x12c/0x1b0
  [ 1538.362827] [c0000009d4adf660] [c0000000004784bc] free_pcppages_bulk+0x5bc/0x940
  [ 1538.362871] [c0000009d4adf760] [c000000000478c38] page_alloc_cpu_dead+0x118/0x120
  [ 1538.362918] [c0000009d4adf7b0] [c00000000015b898] cpuhp_invoke_callback.constprop.5+0xb8/0x760
  [ 1538.362969] [c0000009d4adf820] [c00000000015eee8] _cpu_down+0x188/0x340
  [ 1538.363007] [c0000009d4adf890] [c00000000015d75c] cpu_down+0x5c/0xa0
  [ 1538.363045] [c0000009d4adf8c0] [c00000000092c544] cpu_subsys_offline+0x24/0x40
  [ 1538.363091] [c0000009d4adf8e0] [c0000000009212f0] device_offline+0xf0/0x130
  [ 1538.363129] [c0000009d4adf920] [c00000000010aee4] dlpar_offline_cpu+0x1c4/0x2a0
  [ 1538.363174] [c0000009d4adf9e0] [c00000000010b2f8] dlpar_cpu_remove+0xb8/0x190
  [ 1538.363219] [c0000009d4adfa60] [c00000000010b4fc] dlpar_cpu_remove_by_index+0x12c/0x150
  [ 1538.363264] [c0000009d4adfaf0] [c00000000010ca24] dlpar_cpu+0x94/0x800
  [ 1538.363302] [c0000009d4adfc00] [c000000000102cc8] pseries_hp_work_fn+0x128/0x1e0
  [ 1538.363347] [c0000009d4adfc70] [c00000000018aa84] process_one_work+0x304/0x5d0
  [ 1538.363394] [c0000009d4adfd10] [c00000000018b5cc] worker_thread+0xcc/0x7a0
  [ 1538.363433] [c0000009d4adfdc0] [c00000000019567c] kthread+0x1ac/0x1c0
  [ 1538.363469] [c0000009d4adfe30] [c00000000000b7dc] ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
    drain_pages
      drain_pages_zone
        free_pcppages_bulk

where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
an early call is made to __cpu_die()->pseries_cpu_die(), which runs
a loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for
25 iterations before giving up, however, and in the trace above this
results in the following being printed only ~0.1 seconds after the
hotplug worker thread begins processing the unplug request:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and proceeds with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by inserting an msleep() after each RTAS call to avoid
pseries_cpu_die() returning prematurely, and double the number of
attempts so we wait at least a total of 5 seconds. While this isn't an
ideal solution, it is similar to how we dealt with a similar issue for
cede_offline mode in the past (940ce422a3).

Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Cedric Le Goater <clg@kaod.org>
Cc: Greg Kurz <groug@kaod.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index c6e0d8abf75e..3cb172758052 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
 	int cpu_status = 1;
 	unsigned int pcpu = get_hard_smp_processor_id(cpu);
 
-	for (tries = 0; tries < 25; tries++) {
+	for (tries = 0; tries < 50; tries++) {
 		cpu_status = smp_query_cpu_stopped(pcpu);
 		if (cpu_status == QCSS_STOPPED ||
 		    cpu_status == QCSS_HARDWARE_ERROR)
 			break;
-		cpu_relax();
-
+		msleep(100);
 	}
 
 	if (cpu_status != 0) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-04  3:29 [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death Michael Roth
@ 2020-08-04 13:35 ` Michael Ellerman
  2020-08-04 14:16   ` Greg Kurz
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Ellerman @ 2020-08-04 13:35 UTC (permalink / raw)
  To: Michael Roth, linuxppc-dev
  Cc: Nathan Lynch, Cedric Le Goater, Thiago Jung Bauermann, Greg Kurz

Hi Mike,

There is a bit of history to this code, but not in a good way :)

Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> For a power9 KVM guest with XIVE enabled, running a test loop
> where we hotplug 384 vcpus and then unplug them, the following traces
> can be seen (generally within a few loops) either from the unplugged
> vcpu:
>
>   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
...
>
> At that point the worker thread assumes the unplugged CPU is in some
> unknown/dead state and proceeds with the cleanup, causing the race with
> the XIVE cleanup code executed by the unplugged CPU.
>
> Fix this by inserting an msleep() after each RTAS call to avoid

We previously had an msleep(), but it was removed:

  b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")

> pseries_cpu_die() returning prematurely, and double the number of
> attempts so we wait at least a total of 5 seconds. While this isn't an
> ideal solution, it is similar to how we dealt with a similar issue for
> cede_offline mode in the past (940ce422a3).

Thiago tried to fix this previously but there was a bit of discussion
that didn't quite resolve:

  https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/


Spinning forever seems like a bad idea, but as has been demonstrated at
least twice now, continuing when we don't know the state of the other
CPU can lead to straight up crashes.

So I think I'm persuaded that it's preferable to have the kernel stuck
spinning rather than oopsing.

I'm 50/50 on whether we should have a cond_resched() in the loop. My
first instinct is no, if we're stuck here for 20s a stack trace would be
good. But then we will probably hit that on some big and/or heavily
loaded machine.

So possibly we should call cond_resched() but have some custom logic in
the loop to print a warning if we are stuck for more than some
sufficiently long amount of time.
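
Roughly something like the following - a completely untested sketch,
where the 120s threshold and the warning text are just placeholders for
whatever we end up settling on:

  static void pseries_cpu_die(unsigned int cpu)
  {
          unsigned int pcpu = get_hard_smp_processor_id(cpu);
          unsigned long warn_at = jiffies + 120 * HZ;     /* placeholder */
          int cpu_status;

          for (;;) {
                  cpu_status = smp_query_cpu_stopped(pcpu);
                  if (cpu_status == QCSS_STOPPED ||
                      cpu_status == QCSS_HARDWARE_ERROR)
                          break;

                  /* Keep the softlockup/RCU watchdogs quiet while we wait. */
                  cond_resched();

                  /* ... but still warn if we appear to be stuck. */
                  if (time_after(jiffies, warn_at)) {
                          pr_warn("CPU %u (hwid %u) still not stopped\n",
                                  cpu, pcpu);
                          dump_stack();
                          warn_at = jiffies + 120 * HZ;
                  }

                  msleep(100);
          }

          /* existing "Querying DEAD?" reporting and paca cleanup as before */
  }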


> Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588

This is not public.

I tend to trim Bugzilla links from the change log, because I'm not
convinced they will last forever, but it is good to have them in the
mail archive.

cheers

> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Cedric Le Goater <clg@kaod.org>
> Cc: Greg Kurz <groug@kaod.org>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index c6e0d8abf75e..3cb172758052 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
>  	int cpu_status = 1;
>  	unsigned int pcpu = get_hard_smp_processor_id(cpu);
>  
> -	for (tries = 0; tries < 25; tries++) {
> +	for (tries = 0; tries < 50; tries++) {
>  		cpu_status = smp_query_cpu_stopped(pcpu);
>  		if (cpu_status == QCSS_STOPPED ||
>  		    cpu_status == QCSS_HARDWARE_ERROR)
>  			break;
> -		cpu_relax();
> -
> +		msleep(100);
>  	}
>  
>  	if (cpu_status != 0) {
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-04 13:35 ` Michael Ellerman
@ 2020-08-04 14:16   ` Greg Kurz
  2020-08-05  3:07     ` Michael Ellerman
  0 siblings, 1 reply; 13+ messages in thread
From: Greg Kurz @ 2020-08-04 14:16 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, linuxppc-dev, Michael Roth, Thiago Jung Bauermann,
	Cedric Le Goater

On Tue, 04 Aug 2020 23:35:10 +1000
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Hi Mike,
> 
> There is a bit of history to this code, but not in a good way :)
> 
> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> > For a power9 KVM guest with XIVE enabled, running a test loop
> > where we hotplug 384 vcpus and then unplug them, the following traces
> > can be seen (generally within a few loops) either from the unplugged
> > vcpu:
> >
> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
> ...
> >
> > At that point the worker thread assumes the unplugged CPU is in some
> > unknown/dead state and proceeds with the cleanup, causing the race with
> > the XIVE cleanup code executed by the unplugged CPU.
> >
> > Fix this by inserting an msleep() after each RTAS call to avoid
> 
> We previously had an msleep(), but it was removed:
> 
>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> 

Ah, I hadn't seen that one...

> > pseries_cpu_die() returning prematurely, and double the number of
> > attempts so we wait at least a total of 5 seconds. While this isn't an
> > ideal solution, it is similar to how we dealt with a similar issue for
> > cede_offline mode in the past (940ce422a3).
> 
> Thiago tried to fix this previously but there was a bit of discussion
> that didn't quite resolve:
> 
>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
> 

Yeah it appears that the motivation at the time was to make the "Querying DEAD?"
messages disappear and to avoid potentially concurrent calls to rtas-stop-self
which is prohibited by PAPR... not fixing actual crashes.

> 
> Spinning forever seems like a bad idea, but as has been demonstrated at
> least twice now, continuing when we don't know the state of the other
> CPU can lead to straight up crashes.
> 
> So I think I'm persuaded that it's preferable to have the kernel stuck
> spinning rather than oopsing.
> 

+1

> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> first instinct is no, if we're stuck here for 20s a stack trace would be
> good. But then we will probably hit that on some big and/or heavily
> loaded machine.
> 
> So possibly we should call cond_resched() but have some custom logic in
> the loop to print a warning if we are stuck for more than some
> sufficiently long amount of time.
> 

How long should that be ?

> 
> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> 
> This is not public.
> 

I'll have a look at changing that.

> I tend to trim Bugzilla links from the change log, because I'm not
> convinced they will last forever, but it is good to have them in the
> mail archive.
> 
> cheers
> 

Cheers,

--
Greg

> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Cedric Le Goater <clg@kaod.org>
> > Cc: Greg Kurz <groug@kaod.org>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index c6e0d8abf75e..3cb172758052 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
> >  	int cpu_status = 1;
> >  	unsigned int pcpu = get_hard_smp_processor_id(cpu);
> >  
> > -	for (tries = 0; tries < 25; tries++) {
> > +	for (tries = 0; tries < 50; tries++) {
> >  		cpu_status = smp_query_cpu_stopped(pcpu);
> >  		if (cpu_status == QCSS_STOPPED ||
> >  		    cpu_status == QCSS_HARDWARE_ERROR)
> >  			break;
> > -		cpu_relax();
> > -
> > +		msleep(100);
> >  	}
> >  
> >  	if (cpu_status != 0) {
> > -- 
> > 2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-04 14:16   ` Greg Kurz
@ 2020-08-05  3:07     ` Michael Ellerman
  2020-08-05  4:01       ` Thiago Jung Bauermann
                         ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Michael Ellerman @ 2020-08-05  3:07 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Nathan Lynch, linuxppc-dev, Michael Roth, Thiago Jung Bauermann,
	Cedric Le Goater

Greg Kurz <groug@kaod.org> writes:
> On Tue, 04 Aug 2020 23:35:10 +1000
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>> There is a bit of history to this code, but not in a good way :)
>> 
>> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>> > For a power9 KVM guest with XIVE enabled, running a test loop
>> > where we hotplug 384 vcpus and then unplug them, the following traces
>> > can be seen (generally within a few loops) either from the unplugged
>> > vcpu:
>> >
>> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>> >   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
>> ...
>> >
>> > At that point the worker thread assumes the unplugged CPU is in some
>> > unknown/dead state and proceeds with the cleanup, causing the race with
>> > the XIVE cleanup code executed by the unplugged CPU.
>> >
>> > Fix this by inserting an msleep() after each RTAS call to avoid
>> 
>> We previously had an msleep(), but it was removed:
>> 
>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>
> Ah, I hadn't seen that one...
>
>> > pseries_cpu_die() returning prematurely, and double the number of
>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>> > ideal solution, it is similar to how we dealt with a similar issue for
>> > cede_offline mode in the past (940ce422a3).
>> 
>> Thiago tried to fix this previously but there was a bit of discussion
>> that didn't quite resolve:
>> 
>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
>
> Yeah it appears that the motivation at the time was to make the "Querying DEAD?"
> messages disappear and to avoid potentially concurrent calls to rtas-stop-self
> which is prohibited by PAPR... not fixing actual crashes.

I'm pretty sure at one point we were triggering crashes *in* RTAS via
this path, I think that got resolved.

>> Spinning forever seems like a bad idea, but as has been demonstrated at
>> least twice now, continuing when we don't know the state of the other
>> CPU can lead to straight up crashes.
>> 
>> So I think I'm persuaded that it's preferable to have the kernel stuck
>> spinning rather than oopsing.
>> 
>
> +1
>
>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> first instinct is no, if we're stuck here for 20s a stack trace would be
>> good. But then we will probably hit that on some big and/or heavily
>> loaded machine.
>> 
>> So possibly we should call cond_resched() but have some custom logic in
>> the loop to print a warning if we are stuck for more than some
>> sufficiently long amount of time.
>
> How long should that be ?

Yeah good question.

I guess step one would be seeing how long it can take on the 384 vcpu
machine. And we can probably test on some other big machines.

Hopefully Nathan can give us some idea of how long he's seen it take on
large systems? I know he was concerned about the 20s timeout of the
softlockup detector.

Maybe a minute or two?

>> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
>> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
>> 
>> This is not public.
>
> I'll have a look at changing that.

Thanks.

cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-05  3:07     ` Michael Ellerman
@ 2020-08-05  4:01       ` Thiago Jung Bauermann
  2020-08-05  4:37       ` Michael Roth
  2020-08-07  7:05       ` Nathan Lynch
  2 siblings, 0 replies; 13+ messages in thread
From: Thiago Jung Bauermann @ 2020-08-05  4:01 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Cedric Le Goater, linuxppc-dev, Greg Kurz, Michael Roth


Michael Ellerman <mpe@ellerman.id.au> writes:

> Greg Kurz <groug@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> There is a bit of history to this code, but not in a good way :)
>>>
>>> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>>> > For a power9 KVM guest with XIVE enabled, running a test loop
>>> > where we hotplug 384 vcpus and then unplug them, the following traces
>>> > can be seen (generally within a few loops) either from the unplugged
>>> > vcpu:
>>> >
>>> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>>> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>>> >   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
>>> ...
>>> >
>>> > At that point the worker thread assumes the unplugged CPU is in some
>>> > unknown/dead state and proceeds with the cleanup, causing the race with
>>> > the XIVE cleanup code executed by the unplugged CPU.
>>> >
>>> > Fix this by inserting an msleep() after each RTAS call to avoid
>>>
>>> We previously had an msleep(), but it was removed:
>>>
>>>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>>
>> Ah, I hadn't seen that one...
>>
>>> > pseries_cpu_die() returning prematurely, and double the number of
>>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>>> > ideal solution, it is similar to how we dealt with a similar issue for
>>> > cede_offline mode in the past (940ce422a3).
>>>
>>> Thiago tried to fix this previously but there was a bit of discussion
>>> that didn't quite resolve:
>>>
>>>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
>>
>> Yeah it appears that the motivation at the time was to make the "Querying DEAD?"
>> messages disappear and to avoid potentially concurrent calls to rtas-stop-self
>> which is prohibited by PAPR... not fixing actual crashes.
>
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.

Yes, pHyp's RTAS now tolerates concurrent calls to stop-self. The
original bug that was reported when I worked on this ended in an RTAS
crash because of this timeout. The crash was fixed then.

--
Thiago Jung Bauermann
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-05  3:07     ` Michael Ellerman
  2020-08-05  4:01       ` Thiago Jung Bauermann
@ 2020-08-05  4:37       ` Michael Roth
  2020-08-05 22:29         ` Michael Roth
  2020-08-07  7:05       ` Nathan Lynch
  2 siblings, 1 reply; 13+ messages in thread
From: Michael Roth @ 2020-08-05  4:37 UTC (permalink / raw)
  To: Greg Kurz, Michael Ellerman
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater, Thiago Jung Bauermann

Quoting Michael Ellerman (2020-08-04 22:07:08)
> Greg Kurz <groug@kaod.org> writes:
> > On Tue, 04 Aug 2020 23:35:10 +1000
> > Michael Ellerman <mpe@ellerman.id.au> wrote:
> >> There is a bit of history to this code, but not in a good way :)
> >> 
> >> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> >> > For a power9 KVM guest with XIVE enabled, running a test loop
> >> > where we hotplug 384 vcpus and then unplug them, the following traces
> >> > can be seen (generally within a few loops) either from the unplugged
> >> > vcpu:
> >> >
> >> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >> >   [ 1767.952311] list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048
> >> ...
> >> >
> >> > At that point the worker thread assumes the unplugged CPU is in some
> >> > unknown/dead state and proceeds with the cleanup, causing the race with
> >> > the XIVE cleanup code executed by the unplugged CPU.
> >> >
> >> > Fix this by inserting an msleep() after each RTAS call to avoid
> >> 
> >> We previously had an msleep(), but it was removed:
> >> 
> >>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> >
> > Ah, I hadn't seen that one...
> >
> >> > pseries_cpu_die() returning prematurely, and double the number of
> >> > attempts so we wait at least a total of 5 seconds. While this isn't an
> >> > ideal solution, it is similar to how we dealt with a similar issue for
> >> > cede_offline mode in the past (940ce422a3).
> >> 
> >> Thiago tried to fix this previously but there was a bit of discussion
> >> that didn't quite resolve:
> >> 
> >>   https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauerman@linux.ibm.com/
> >
> > Yeah it appears that the motivation at the time was to make the "Querying DEAD?"
> > messages disappear and to avoid potentially concurrent calls to rtas-stop-self
> > which is prohibited by PAPR... not fixing actual crashes.
> 
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.
> 
> >> Spinning forever seems like a bad idea, but as has been demonstrated at
> >> least twice now, continuing when we don't know the state of the other
> >> CPU can lead to straight up crashes.
> >> 
> >> So I think I'm persuaded that it's preferable to have the kernel stuck
> >> spinning rather than oopsing.
> >> 
> >
> > +1
> >
> >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >> first instinct is no, if we're stuck here for 20s a stack trace would be
> >> good. But then we will probably hit that on some big and/or heavily
> >> loaded machine.
> >> 
> >> So possibly we should call cond_resched() but have some custom logic in
> >> the loop to print a warning if we are stuck for more than some
> >> sufficiently long amount of time.
> >
> > How long should that be ?
> 
> Yeah good question.
> 
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
> 
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.
> 
> Maybe a minute or two?

Hmm, so I took a stab at this where I called cond_resched() after
every 5 seconds of polling and printed a warning at the same time (FWIW
that doesn't seem to trigger any warnings on a loaded 96-core mihawk
system using KVM running the 384vcpu unplug loop)

But it sounds like that's not quite what you had in mind. How frequently
do you think we should call cond_resched()? Maybe after 25 iterations
of polling smp_query_cpu_stopped() to keep original behavior somewhat
similar?

I'll let the current patch run on the mihawk system overnight in the
meantime so we at least have that data point, but would be good to
know what things look like on a large pHyp machine.

Thanks!

> 
> >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> >> 
> >> This is not public.
> >
> > I'll have a look at changing that.
> 
> Thanks.
> 
> cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-05  4:37       ` Michael Roth
@ 2020-08-05 22:29         ` Michael Roth
  2020-08-05 22:31           ` Michael Roth
  2020-08-06 12:51           ` Michael Ellerman
  0 siblings, 2 replies; 13+ messages in thread
From: Michael Roth @ 2020-08-05 22:29 UTC (permalink / raw)
  To: Greg Kurz, Michael Ellerman
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater, Thiago Jung Bauermann

Quoting Michael Roth (2020-08-04 23:37:32)
> Quoting Michael Ellerman (2020-08-04 22:07:08)
> > Greg Kurz <groug@kaod.org> writes:
> > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > >> least twice now, continuing when we don't know the state of the other
> > >> CPU can lead to straight up crashes.
> > >> 
> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > >> spinning rather than oopsing.
> > >> 
> > >
> > > +1
> > >
> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > >> good. But then we will probably hit that on some big and/or heavily
> > >> loaded machine.
> > >> 
> > >> So possibly we should call cond_resched() but have some custom logic in
> > >> the loop to print a warning if we are stuck for more than some
> > >> sufficiently long amount of time.
> > >
> > > How long should that be ?
> > 
> > Yeah good question.
> > 
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> > 
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> > 
> > Maybe a minute or two?
> 
> Hmm, so I took a stab at this where I called cond_resched() after
> every 5 seconds of polling and printed a warning at the same time (FWIW
> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> system using KVM running the 384vcpu unplug loop)
> 
> But it sounds like that's not quite what you had in mind. How frequently
> do you think we should call cond_resched()? Maybe after 25 iterations
> of polling smp_query_cpu_stopped() to keep original behavior somewhat
> similar?
> 
> I'll let the current patch run on the mihawk system overnight in the
> meantime so we at least have that data point, but would be good to
> know what things look like on a large pHyp machine.

At one point I did manage to get the system in a state where unplug
operations were taking 1-2s, but still not enough to trigger any
5s warning, and I wasn't able to reproduce that in subsequent runs.

I also tried reworking the patch so that we print a warning and
cond_resched() after 200 ms to make sure that path gets executed, but
only managed to trigger the warning twice after a few hours.

So, if we print a warning after a couple minutes, that seems pretty
conservative as far as avoiding spurious warnings. And if we
cond_resched() after 25 loops of polling (~0.1 ms in the cases
that caused the original crash), that would avoid most of the
default RCU/lockup warnings.

But having a second timeout to trigger the cond_resched() after some
set interval like 2s seems more deterministic since we're less
susceptible to longer delays due to things like the RTAS calls
contending for QEMU's global mutex in the KVM case.
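
Concretely, something along these lines inside the polling loop in
pseries_cpu_die() is what I have in mind (untested, and both intervals
are made-up values), with resched_at and warn_at initialized from
jiffies before entering the loop:

          /* yield on a coarse interval rather than on every iteration */
          if (time_after(jiffies, resched_at)) {
                  cond_resched();
                  resched_at = jiffies + 2 * HZ;
          }

          /* and separately, complain if the unplug appears to be stuck */
          if (time_after(jiffies, warn_at)) {
                  pr_warn("still waiting for cpu %u to stop\n", cpu);
                  warn_at = jiffies + 120 * HZ;
          }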


> 
> Thanks!
> 
> > 
> > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > >> 
> > >> This is not public.
> > >
> > > I'll have a look at changing that.
> > 
> > Thanks.
> > 
> > cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-05 22:29         ` Michael Roth
@ 2020-08-05 22:31           ` Michael Roth
  2020-08-06 12:51           ` Michael Ellerman
  1 sibling, 0 replies; 13+ messages in thread
From: Michael Roth @ 2020-08-05 22:31 UTC (permalink / raw)
  To: Greg Kurz, Michael Ellerman
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater, Thiago Jung Bauermann

Quoting Michael Roth (2020-08-05 17:29:28)
> Quoting Michael Roth (2020-08-04 23:37:32)
> > Quoting Michael Ellerman (2020-08-04 22:07:08)
> > > Greg Kurz <groug@kaod.org> writes:
> > > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > > Michael Ellerman <mpe@ellerman.id.au> wrote:
> > > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > > >> least twice now, continuing when we don't know the state of the other
> > > >> CPU can lead to straight up crashes.
> > > >> 
> > > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > > >> spinning rather than oopsing.
> > > >> 
> > > >
> > > > +1
> > > >
> > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > > >> good. But then we will probably hit that on some big and/or heavily
> > > >> loaded machine.
> > > >> 
> > > >> So possibly we should call cond_resched() but have some custom logic in
> > > >> the loop to print a warning if we are stuck for more than some
> > > >> sufficiently long amount of time.
> > > >
> > > > How long should that be ?
> > > 
> > > Yeah good question.
> > > 
> > > I guess step one would be seeing how long it can take on the 384 vcpu
> > > machine. And we can probably test on some other big machines.
> > > 
> > > Hopefully Nathan can give us some idea of how long he's seen it take on
> > > large systems? I know he was concerned about the 20s timeout of the
> > > softlockup detector.
> > > 
> > > Maybe a minute or two?
> > 
> > Hmm, so I took a stab at this where I called cond_resched() after
> > every 5 seconds of polling and printed a warning at the same time (FWIW
> > that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> > system using KVM running the 384vcpu unplug loop)
> > 
> > But it sounds like that's not quite what you had in mind. How frequently
> > do you think we should call cond_resched()? Maybe after 25 iterations
> > of polling smp_query_cpu_stopped() to keep original behavior somewhat
> > similar?
> > 
> > I'll let the current patch run on the mihawk system overnight in the
> > meantime so we at least have that data point, but would be good to
> > know what things look like on a large pHyp machine.
> 
> At one point I did manage to get the system in a state where unplug
> operations were taking 1-2s, but still not enough to trigger any
> 5s warning, and I wasn't able to reproduce that in subsequent runs.
> 
> I also tried reworking the patch so that we print a warning and
> cond_resched() after 200 ms to make sure that path gets executed, but
> only managed to trigger the warning twice after a few hours.
> 
> So, if we print a warning after a couple minutes, that seems pretty
> conservative as far as avoiding spurious warnings. And if we
> cond_resched() after 25 loops of polling (~0.1 ms in the cases

~0.1 seconds I mean

> that caused the original crash), that would avoid most of the
> default RCU/lockup warnings.
> 
> But having a second timeout to trigger the cond_resched() after some
> set interval like 2s seems more deterministic since we're less
> susceptible to longer delays due to things like the RTAS calls
> contending for QEMU's global mutex in the KVM case.
> 
> 
> > 
> > Thanks!
> > 
> > > 
> > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
> > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > > >> 
> > > >> This is not public.
> > > >
> > > > I'll have a look at changing that.
> > > 
> > > Thanks.
> > > 
> > > cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-05 22:29         ` Michael Roth
  2020-08-05 22:31           ` Michael Roth
@ 2020-08-06 12:51           ` Michael Ellerman
  1 sibling, 0 replies; 13+ messages in thread
From: Michael Ellerman @ 2020-08-06 12:51 UTC (permalink / raw)
  To: Michael Roth, Greg Kurz
  Cc: Nathan Lynch, linuxppc-dev, Cedric Le Goater, Thiago Jung Bauermann

Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> Quoting Michael Roth (2020-08-04 23:37:32)
>> Quoting Michael Ellerman (2020-08-04 22:07:08)
>> > Greg Kurz <groug@kaod.org> writes:
>> > > On Tue, 04 Aug 2020 23:35:10 +1000
>> > > Michael Ellerman <mpe@ellerman.id.au> wrote:
>> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
>> > >> least twice now, continuing when we don't know the state of the other
>> > >> CPU can lead to straight up crashes.
>> > >> 
>> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
>> > >> spinning rather than oopsing.
>> > >> 
>> > >
>> > > +1
>> > >
>> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
>> > >> good. But then we will probably hit that on some big and/or heavily
>> > >> loaded machine.
>> > >> 
>> > >> So possibly we should call cond_resched() but have some custom logic in
>> > >> the loop to print a warning if we are stuck for more than some
>> > >> sufficiently long amount of time.
>> > >
>> > > How long should that be ?
>> > 
>> > Yeah good question.
>> > 
>> > I guess step one would be seeing how long it can take on the 384 vcpu
>> > machine. And we can probably test on some other big machines.
>> > 
>> > Hopefully Nathan can give us some idea of how long he's seen it take on
>> > large systems? I know he was concerned about the 20s timeout of the
>> > softlockup detector.
>> > 
>> > Maybe a minute or two?
>> 
>> Hmm, so I took a stab at this where I called cond_resched() after
>> every 5 seconds of polling and printed a warning at the same time (FWIW
>> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
>> system using KVM running the 384vcpu unplug loop)
>> 
>> But it sounds like that's not quite what you had in mind. How frequently
>> do you think we should call cond_resched()? Maybe after 25 iterations
>> of polling smp_query_cpu_stopped() to keep original behavior somewhat
>> similar?

I think we can just call it on every iteration; it should be cheap
compared to an RTAS call.

The concern was just that by doing that you effectively prevent the
softlockup detector from reporting you as stuck in that path. Hence the
desire to manually print a warning after ~60s or something.

cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-05  3:07     ` Michael Ellerman
  2020-08-05  4:01       ` Thiago Jung Bauermann
  2020-08-05  4:37       ` Michael Roth
@ 2020-08-07  7:05       ` Nathan Lynch
  2020-08-11  5:39         ` Michael Roth
  2 siblings, 1 reply; 13+ messages in thread
From: Nathan Lynch @ 2020-08-07  7:05 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kurz, linuxppc-dev, Michael Roth, Thiago Jung Bauermann,
	Cedric Le Goater

Hi everyone,

Michael Ellerman <mpe@ellerman.id.au> writes:
> Greg Kurz <groug@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <mpe@ellerman.id.au> wrote:
>>> Spinning forever seems like a bad idea, but as has been demonstrated at
>>> least twice now, continuing when we don't know the state of the other
>>> CPU can lead to straight up crashes.
>>> 
>>> So I think I'm persuaded that it's preferable to have the kernel stuck
>>> spinning rather than oopsing.
>>> 
>>
>> +1
>>
>>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
>>> first instinct is no, if we're stuck here for 20s a stack trace would be
>>> good. But then we will probably hit that on some big and/or heavily
>>> loaded machine.
>>> 
>>> So possibly we should call cond_resched() but have some custom logic in
>>> the loop to print a warning if we are stuck for more than some
>>> sufficiently long amount of time.
>>
>> How long should that be ?
>
> Yeah good question.
>
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
>
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.

Maybe I'm not quite clear what this is referring to, but I don't think
stop-self/query-stopped-state latency increases with processor count, at
least not on PowerVM. And IIRC I was concerned with the earlier patch's
potential for causing the softlockup watchdog to rightly complain by
polling the stopped state without ever scheduling away.

The fact that smp_query_cpu_stopped() kind of collapses the two distinct
results from the query-cpu-stopped-state RTAS call into one return value
may make it harder than necessary to reason about the questions around
cond_resched() and whether to warn.

Sorry to pull this stunt but I have had some code sitting in a neglected
branch that I think gets the logic around this right.

What we should have is a simple C wrapper for the RTAS call that reflects the
architected inputs and outputs:

================================================================
(-- rtas.c --)

/**
 * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
 * @hwcpu: Identifies the processor thread to be queried.
 * @status: Pointer to status, valid only on success.
 *
 * Determine whether the given processor thread is in the stopped
 * state.  If successful and @status is non-NULL, the thread's status
 * is stored to @status.
 *
 * Return:
 * * 0   - Success
 * * -1  - Hardware error
 * * -2  - Busy, try again later
 */
int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
{
       unsigned int cpu_status;
       int token;
       int fwrc;

       token = rtas_token("query-cpu-stopped-state");

       fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
       if (fwrc != 0)
               goto out;

       if (status != NULL)
               *status = cpu_status;
out:
       return fwrc;
}
================================================================


And then a utility function that waits for the remote thread to enter
stopped state, with higher-level logic for rescheduling and warning. The
fact that smp_query_cpu_stopped() currently does not handle a -2/busy
status is a bug, fixed below by using rtas_busy_delay(). Note the
justification for the explicit cond_resched() in the outer loop:

================================================================
(-- rtas.h --)

/* query-cpu-stopped-state CPU_status */
#define RTAS_QCSS_STATUS_STOPPED     0
#define RTAS_QCSS_STATUS_IN_PROGRESS 1
#define RTAS_QCSS_STATUS_NOT_STOPPED 2

(-- pseries/hotplug-cpu.c --)

/**
 * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
 */
static void wait_for_cpu_stopped(unsigned int cpu)
{
       unsigned int status;
       unsigned int hwcpu;

       hwcpu = get_hard_smp_processor_id(cpu);

       do {
               int fwrc;

               /*
                * rtas_busy_delay() will yield only if RTAS returns a
                * busy status. Since query-cpu-stopped-state can
                * yield RTAS_QCSS_STATUS_IN_PROGRESS or
                * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
                * period before the target thread stops, we must take
                * care to explicitly reschedule while polling.
                */
               cond_resched();

               do {
                       fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
               } while (rtas_busy_delay(fwrc));

               if (fwrc == 0)
                       continue;

               pr_err_ratelimited("query-cpu-stopped-state for "
                                  "thread 0x%x returned %d\n",
                                  hwcpu, fwrc);
               goto out;

       } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
                status == RTAS_QCSS_STATUS_IN_PROGRESS);

       if (status != RTAS_QCSS_STATUS_STOPPED) {
               pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
                                  "status %d for thread 0x%x\n",
                                  status, hwcpu);
       }
out:
       return;
}

[...]

static void pseries_cpu_die(unsigned int cpu)
{
       wait_for_cpu_stopped(cpu);
       paca_ptrs[cpu]->cpu_start = 0;
}
================================================================

wait_for_cpu_stopped() should be able to accommodate a time-based
warning if necessary, but speaking as a likely recipient of any bug
reports that would arise here, I'm not convinced of the need and I
don't know what a good value would be. It's relatively easy to sample
the stack of a task that's apparently failing to make progress, plus I
probably would use 'perf probe' or similar to report the inputs and
outputs for the RTAS call.
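
For what it's worth, if we do end up wanting one it's only a few lines
on top of wait_for_cpu_stopped() above (untested, and the threshold is
arbitrary): initialize a deadline before the outer loop, e.g.

       unsigned long warn_at = jiffies + 120 * HZ;

and then, just after the cond_resched() at the top of each pass:

       if (time_after(jiffies, warn_at)) {
               pr_warn("still waiting for thread 0x%x to stop\n", hwcpu);
               dump_stack();
               warn_at = jiffies + 120 * HZ;
       }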

I'm happy to make this a proper submission after I can clean it up and
retest it, or Michael R. is welcome to appropriate it, assuming it's
acceptable.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-07  7:05       ` Nathan Lynch
@ 2020-08-11  5:39         ` Michael Roth
  2020-08-11 11:56           ` Michael Ellerman
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Roth @ 2020-08-11  5:39 UTC (permalink / raw)
  To: Michael Ellerman, Nathan Lynch
  Cc: linuxppc-dev, Greg Kurz, Thiago Jung Bauermann, Cedric Le Goater

Quoting Nathan Lynch (2020-08-07 02:05:09)
> Hi everyone,
> 
> Michael Ellerman <mpe@ellerman.id.au> writes:
> > Greg Kurz <groug@kaod.org> writes:
> >> On Tue, 04 Aug 2020 23:35:10 +1000
> >> Michael Ellerman <mpe@ellerman.id.au> wrote:
> >>> Spinning forever seems like a bad idea, but as has been demonstrated at
> >>> least twice now, continuing when we don't know the state of the other
> >>> CPU can lead to straight up crashes.
> >>> 
> >>> So I think I'm persuaded that it's preferable to have the kernel stuck
> >>> spinning rather than oopsing.
> >>> 
> >>
> >> +1
> >>
> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >>> first instinct is no, if we're stuck here for 20s a stack trace would be
> >>> good. But then we will probably hit that on some big and/or heavily
> >>> loaded machine.
> >>> 
> >>> So possibly we should call cond_resched() but have some custom logic in
> >>> the loop to print a warning if we are stuck for more than some
> >>> sufficiently long amount of time.
> >>
> >> How long should that be ?
> >
> > Yeah good question.
> >
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> >
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> 
> Maybe I'm not quite clear what this is referring to, but I don't think
> stop-self/query-stopped-state latency increases with processor count, at
> least not on PowerVM. And IIRC I was concerned with the earlier patch's
> potential for causing the softlockup watchdog to rightly complain by
> polling the stopped state without ever scheduling away.
> 
> The fact that smp_query_cpu_stopped() kind of collapses the two distinct
> results from the query-cpu-stopped-state RTAS call into one return value
> may make it harder than necessary to reason about the questions around
> cond_resched() and whether to warn.
> 
> Sorry to pull this stunt but I have had some code sitting in a neglected
> branch that I think gets the logic around this right.
> 
> What we should have is a simple C wrapper for the RTAS call that reflects the
> architected inputs and outputs:
> 
> ================================================================
> (-- rtas.c --)
> 
> /**
>  * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
>  * @hwcpu: Identifies the processor thread to be queried.
>  * @status: Pointer to status, valid only on success.
>  *
>  * Determine whether the given processor thread is in the stopped
>  * state.  If successful and @status is non-NULL, the thread's status
>  * is stored to @status.
>  *
>  * Return:
>  * * 0   - Success
>  * * -1  - Hardware error
>  * * -2  - Busy, try again later
>  */
> int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
> {
>        unsigned int cpu_status;
>        int token;
>        int fwrc;
> 
>        token = rtas_token("query-cpu-stopped-state");
> 
>        fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
>        if (fwrc != 0)
>                goto out;
> 
>        if (status != NULL)
>                *status = cpu_status;
> out:
>        return fwrc;
> }
> ================================================================
> 
> 
> And then a utility function that waits for the remote thread to enter
> stopped state, with higher-level logic for rescheduling and warning. The
> fact that smp_query_cpu_stopped() currently does not handle a -2/busy
> status is a bug, fixed below by using rtas_busy_delay(). Note the
> justification for the explicit cond_resched() in the outer loop:
> 
> ================================================================
> (-- rtas.h --)
> 
> /* query-cpu-stopped-state CPU_status */
> #define RTAS_QCSS_STATUS_STOPPED     0
> #define RTAS_QCSS_STATUS_IN_PROGRESS 1
> #define RTAS_QCSS_STATUS_NOT_STOPPED 2
> 
> (-- pseries/hotplug-cpu.c --)
> 
> /**
>  * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
>  */
> static void wait_for_cpu_stopped(unsigned int cpu)
> {
>        unsigned int status;
>        unsigned int hwcpu;
> 
>        hwcpu = get_hard_smp_processor_id(cpu);
> 
>        do {
>                int fwrc;
> 
>                /*
>                 * rtas_busy_delay() will yield only if RTAS returns a
>                 * busy status. Since query-cpu-stopped-state can
>                 * yield RTAS_QCSS_STATUS_IN_PROGRESS or
>                 * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
>                 * period before the target thread stops, we must take
>                 * care to explicitly reschedule while polling.
>                 */
>                cond_resched();
> 
>                do {
>                        fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
>                } while (rtas_busy_delay(fwrc));
> 
>                if (fwrc == 0)
>                        continue;
> 
>                pr_err_ratelimited("query-cpu-stopped-state for "
>                                   "thread 0x%x returned %d\n",
>                                   hwcpu, fwrc);
>                goto out;
> 
>        } while (status == RTAS_QCSS_STATUS_NOT_STOPPED ||
>                 status == RTAS_QCSS_STATUS_IN_PROGRESS);
> 
>        if (status != RTAS_QCSS_STATUS_STOPPED) {
>                pr_err_ratelimited("query-cpu-stopped-state yielded unknown "
>                                   "status %d for thread 0x%x\n",
>                                   status, hwcpu);
>        }
> out:
>        return;
> }
> 
> [...]
> 
> static void pseries_cpu_die(unsigned int cpu)
> {
>        wait_for_cpu_stopped(cpu);
>        paca_ptrs[cpu]->cpu_start = 0;
> }
> ================================================================
> 
> wait_for_cpu_stopped() should be able to accommodate a time-based
> warning if necessary, but speaking as a likely recipient of any bug
> reports that would arise here, I'm not convinced of the need and I
> don't know what a good value would be. It's relatively easy to sample
> the stack of a task that's apparently failing to make progress, plus I
> probably would use 'perf probe' or similar to report the inputs and
> outputs for the RTAS call.

I think if we make the timeout sufficiently high like 2 minutes or so
it wouldn't hurt and if we did see them it would probably point to an
actual bug. But I don't have a strong feeling either way.

> 
> I'm happy to make this a proper submission after I can clean it up and
> retest it, or Michael R. is welcome to appropriate it, assuming it's
> acceptable.
> 

I've given it a shot with this patch and it seems to be holding up in
testing. If we don't think the ~2 minutes warning message is needed I
can clean it up to post:

https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e

I'd likely break the refactoring patches out to a separate patch under
Nathan's name since it fixes a separate bug potentially.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-11  5:39         ` Michael Roth
@ 2020-08-11 11:56           ` Michael Ellerman
  2020-08-11 14:46             ` Nathan Lynch
  0 siblings, 1 reply; 13+ messages in thread
From: Michael Ellerman @ 2020-08-11 11:56 UTC (permalink / raw)
  To: Michael Roth, Nathan Lynch
  Cc: linuxppc-dev, Greg Kurz, Thiago Jung Bauermann, Cedric Le Goater

Michael Roth <mdroth@linux.vnet.ibm.com> writes:
> Quoting Nathan Lynch (2020-08-07 02:05:09)
...
>> wait_for_cpu_stopped() should be able to accommodate a time-based
>> warning if necessary, but speaking as a likely recipient of any bug
>> reports that would arise here, I'm not convinced of the need and I
>> don't know what a good value would be. It's relatively easy to sample
>> the stack of a task that's apparently failing to make progress, plus I
>> probably would use 'perf probe' or similar to report the inputs and
>> outputs for the RTAS call.
>
> I think if we make the timeout sufficiently high like 2 minutes or so
> it wouldn't hurt and if we did see them it would probably point to an
> actual bug. But I don't have a strong feeling either way.

I think we should print a warning after 2 minutes.

It's true that there are fairly easy mechanisms to work out where the
thread is stuck, but customers are unlikely to use them. They're just
going to report that it's stuck with no further info, and probably
reboot the machine before we get a chance to get any further info.

Whereas if the kernel prints a warning with a stack trace we at least
have that to go on in an initial bug report.

>> I'm happy to make this a proper submission after I can clean it up and
>> retest it, or Michael R. is welcome to appropriate it, assuming it's
>> acceptable.
>> 
>
> I've given it a shot with this patch and it seems to be holding up in
> testing. If we don't think the ~2 minutes warning message is needed I
> can clean it up to post:
>
> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
>
> I'd likely break the refactoring patches out to a separate patch under
> Nathan's name since it fixes a separate bug potentially.

While I like Nathan's refactoring, we probably want to do the minimal
fix first to ease backporting.

Then do the refactoring on top of that.

cheers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
  2020-08-11 11:56           ` Michael Ellerman
@ 2020-08-11 14:46             ` Nathan Lynch
  0 siblings, 0 replies; 13+ messages in thread
From: Nathan Lynch @ 2020-08-11 14:46 UTC (permalink / raw)
  To: Michael Ellerman, Michael Roth
  Cc: linuxppc-dev, Greg Kurz, Thiago Jung Bauermann, Cedric Le Goater

Michael Ellerman <mpe@ellerman.id.au> writes:

> Michael Roth <mdroth@linux.vnet.ibm.com> writes:
>> Quoting Nathan Lynch (2020-08-07 02:05:09)
> ...
>>> wait_for_cpu_stopped() should be able to accommodate a time-based
>>> warning if necessary, but speaking as a likely recipient of any bug
>>> reports that would arise here, I'm not convinced of the need and I
>>> don't know what a good value would be. It's relatively easy to sample
>>> the stack of a task that's apparently failing to make progress, plus I
>>> probably would use 'perf probe' or similar to report the inputs and
>>> outputs for the RTAS call.
>>
>> I think if we make the timeout sufficiently high like 2 minutes or so
>> it wouldn't hurt and if we did see them it would probably point to an
>> actual bug. But I don't have a strong feeling either way.
>
> I think we should print a warning after 2 minutes.
>
> It's true that there are fairly easy mechanisms to work out where the
> thread is stuck, but customers are unlikely to use them. They're just
> going to report that it's stuck with no further info, and probably
> reboot the machine before we get a chance to get any further info.
>
> Whereas if the kernel prints a warning with a stack trace we at least
> have that to go on in an initial bug report.
>
>>> I'm happy to make this a proper submission after I can clean it up and
>>> retest it, or Michael R. is welcome to appropriate it, assuming it's
>>> acceptable.
>>> 
>>
>> I've given it a shot with this patch and it seems to be holding up in
>> testing. If we don't think the ~2 minutes warning message is needed I
>> can clean it up to post:
>>
>> https://github.com/mdroth/linux/commit/354b8c97bf0dc1146e36aa72273f5b33fe90d09e
>>
>> I'd likely break the refactoring patches out to a separate patch under
>> Nathan's name since it fixes a separate bug potentially.
>
> While I like Nathan's refactoring, we probably want to do the minimal
> fix first to ease backporting.
>
> Then do the refactoring on top of that.

Fair enough, thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread
