linux-arm-kernel.lists.infradead.org archive mirror
* arm64 torture test hotplug failures (offlining causes -EBUSY)
@ 2023-01-16 17:03 Joel Fernandes
  2023-01-16 18:03 ` Marc Zyngier
  2023-01-16 18:32 ` Zhouyi Zhou
  0 siblings, 2 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-16 17:03 UTC (permalink / raw)
  To: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas
  Cc: rcu, Paul E. McKenney

Hello,
I am seeing -EBUSY returned a lot during torture_onoff() when running
rcutorture on arm64. This causes hotplug failure 30% of the time. I am
also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.

This causes warnings in torture tests:
[  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
[  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
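For triaging a console log, failures like the two above can be tallied with a
short script. This is only a sketch: the regular expression is modelled on the
two lines quoted above, and parse_offline_failures is a hypothetical helper
name, not part of rcutorture.

```python
import errno
import re
from collections import Counter

# Pattern modelled on the torture_onoff lines quoted above (an assumption
# about the log format; adjust it if your kernel prints something different).
FAIL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\S*torture_onoff task: "
    r"offline (?P<cpu>\d+) failed: errno (?P<err>-?\d+)"
)


def parse_offline_failures(log_text):
    """Return a Counter keyed by (cpu, errno name) for failed offline attempts."""
    counts = Counter()
    for line in log_text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            code = abs(int(m.group("err")))
            # Translate the numeric errno (16) into its symbolic name (EBUSY).
            name = errno.errorcode.get(code, str(code))
            counts[(int(m.group("cpu")), name)] += 1
    return counts


sample = """\
[  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
[  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
"""
failures = parse_offline_failures(sample)
print(failures)
```

Grouping by CPU number makes it easy to check whether only CPU 0 is affected.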

Full kernel log here:
http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log

Any ideas on why this is happening and only for CPU 0 (presumably the
boot CPU)? I'd personally need these warnings to go away for my tests
as this causes rcutorture's tests to not cleanly pass for me. It
appears remove_cpu() -> device_offline() is what returns the error.

You can browse through all the torture test artifacts here, which
contain everything (build logs, vmlinux, etc).
http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/

Thanks for your help!

 - Joel

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-16 17:03 arm64 torture test hotplug failures (offlining causes -EBUSY) Joel Fernandes
@ 2023-01-16 18:03 ` Marc Zyngier
  2023-01-16 22:43   ` Joel Fernandes
  2023-01-16 18:32 ` Zhouyi Zhou
  1 sibling, 1 reply; 34+ messages in thread
From: Marc Zyngier @ 2023-01-16 18:03 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Mark Rutland,
	Catalin Marinas, rcu, Paul E. McKenney

Hi Joel,

On Mon, 16 Jan 2023 17:03:31 +0000,
Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> Hello,
> I am seeing -EBUSY returned a lot during torture_onoff() when running
> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> 
> This causes warnings in torture tests:
> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> 
> Full kernel log here:
> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> 
> Any ideas on why this is happening and only for CPU 0 (presumably the
> boot CPU)? I'd personally need these warnings to go away for my tests
> as this causes rcutorture's tests to not cleanly pass for me. It
> appears remove_cpu() -> device_offline() is what returns the error.

I've taken your kernel for a ride as a KVM guest (probably similar to
what you are doing), and saw the same thing (CPU0 not offlining):

[   64.555845] Detected VIPT I-cache on CPU4
[   64.556146] GICv3: CPU4: found redistributor 4 region 0:0x000000003ff70000
[   64.556689] CPU4: Booted secondary processor 0x0000000004 [0x612f0290]
[   69.823670] rcu-torture:torture_onoff task: offline 0 failed: errno -16
[   73.991960] psci: CPU7 killed (polled 0 ms)
[   74.239626] rcu-torture: rcu_torture_read_exit: Start of episode
[   74.243863] rcu-torture: rcu_torture_read_exit: End of episode

I then tried v6.2-rc4 with defconfig + RCU_TORTURE and your command
line, and CPU0 does seem to hotplug off correctly:

[   47.217109] psci: CPU3 killed (polled 0 ms)
[   52.241009] Detected VIPT I-cache on CPU3
[   52.241227] cacheinfo: Unable to detect cache hierarchy for CPU 3
[   52.241481] GICv3: CPU3: found redistributor 3 region 0:0x000000003ff50000
[   52.241849] CPU3: Booted secondary processor 0x0000000003 [0x612f0290]
[   56.337011] psci: CPU0 killed (polled 0 ms)
[...]
[  121.090339] rcu-torture: Free-Block Circulation:  922 920 919 918 917 916 914 913 912 911 0
[  125.574311] Detected VIPT I-cache on CPU0
[  125.574557] cacheinfo: Unable to detect cache hierarchy for CPU 0
[  125.574901] GICv3: CPU0: found redistributor 0 region 0:0x000000003fef0000
[  125.575322] CPU0: Booted secondary processor 0x0000000000 [0x612f0290]
[  130.176893] rcu-torture: rcu_torture_read_exit: Start of episode
[  130.317001] psci: CPU0 killed (polled 0 ms)
[...]
[  225.588999] Detected VIPT I-cache on CPU0
[  225.589224] cacheinfo: Unable to detect cache hierarchy for CPU 0
[  225.589535] GICv3: CPU0: found redistributor 0 region 0:0x000000003fef0000
[  225.589946] CPU0: Booted secondary processor 0x0000000000 [0x612f0290]

No such error is being reported.
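The two excerpts above can be compared mechanically by counting, per CPU, how
often an offline succeeded or failed. The sketch below assumes that a
"psci: CPUn killed" line marks a completed offline and that a torture_onoff
"failed" line marks a rejected one; offline_summary is a hypothetical helper
name.

```python
import re
from collections import defaultdict

KILLED_RE = re.compile(r"psci: CPU(\d+) killed")
FAILED_RE = re.compile(r"torture_onoff task: offline (\d+) failed: errno (-?\d+)")


def offline_summary(log_text):
    """Count completed ("killed") and rejected ("failed") offlines per CPU."""
    summary = defaultdict(lambda: {"killed": 0, "failed": 0})
    for line in log_text.splitlines():
        m = KILLED_RE.search(line)
        if m:
            summary[int(m.group(1))]["killed"] += 1
            continue
        m = FAILED_RE.search(line)
        if m:
            summary[int(m.group(1))]["failed"] += 1
    return dict(summary)


# Lines copied from the first excerpt (Joel's kernel).
joel_kernel_log = """\
[   69.823670] rcu-torture:torture_onoff task: offline 0 failed: errno -16
[   73.991960] psci: CPU7 killed (polled 0 ms)
"""
# Lines copied from the second excerpt (v6.2-rc4).
v62rc4_log = """\
[   47.217109] psci: CPU3 killed (polled 0 ms)
[   56.337011] psci: CPU0 killed (polled 0 ms)
[  130.317001] psci: CPU0 killed (polled 0 ms)
"""
joel_summary = offline_summary(joel_kernel_log)
v62rc4_summary = offline_summary(v62rc4_log)
print("Joel's kernel:", joel_summary)
print("v6.2-rc4:", v62rc4_summary)
```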

Is there anything special in your config that would help triggering
this with the current tip of tree?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-16 17:03 arm64 torture test hotplug failures (offlining causes -EBUSY) Joel Fernandes
  2023-01-16 18:03 ` Marc Zyngier
@ 2023-01-16 18:32 ` Zhouyi Zhou
  2023-01-16 22:38   ` Joel Fernandes
  1 sibling, 1 reply; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-16 18:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

Hi Joel

On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
> I am seeing -EBUSY returned a lot during torture_onoff() when running
> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
>
> This causes warnings in torture tests:
> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>
> Full kernel log here:
> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
>
> Any ideas on why this is happening and only for CPU 0 (presumably the
> boot CPU)? I'd personally need these warnings to go away for my tests
> as this causes rcutorture's tests to not cleanly pass for me. It
> appears remove_cpu() -> device_offline() is what returns the error.
>
I guess this is probably because CPU 0 is the tick_do_timer_cpu in
nohz_full mode, which prevents that CPU from going offline [1]. We have
discussed this topic, but there is no agreement yet on how to solve it.

[1] https://lore.kernel.org/lkml/20221127175317.GF4001@paulmck-ThinkPad-P17-Gen-1/T/
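The check described above can be pictured with a small userspace model. This
is a simplified sketch, not the kernel code: the function and parameter names
only mirror the kernel identifiers for readability, and the behaviour shown is
an assumption based on the explanation above.

```python
import errno


def tick_nohz_cpu_down(cpu, tick_nohz_full_running, tick_do_timer_cpu):
    """Toy model: refuse to offline the timekeeping CPU while nohz_full runs."""
    if tick_nohz_full_running and tick_do_timer_cpu == cpu:
        return -errno.EBUSY
    return 0


# CPU 0 holds the timekeeping duty and nohz_full is active: offline is refused.
print(tick_nohz_cpu_down(0, True, 0))
# Another CPU, or a kernel without nohz_full running, is allowed to go offline.
print(tick_nohz_cpu_down(1, True, 0))
print(tick_nohz_cpu_down(0, False, 0))
```

In this model only the CPU that owns the timekeeping duty is refused, which
would match failures being reported for a single CPU.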

Thanks
Zhouyi
> You can browse through all the torture test artifacts here which
> contains everything (build logs, vmlinux, etc).
> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/
>
> Thanks for your help!
>
>  - Joel

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-16 18:32 ` Zhouyi Zhou
@ 2023-01-16 22:38   ` Joel Fernandes
  2023-01-17  0:15     ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-16 22:38 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

Hi Zhouyi,

On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
>
[..]
> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hello,
> > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> >
> > This causes warnings in torture tests:
> > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >
> > Full kernel log here:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> >
> > Any ideas on why this is happening and only for CPU 0 (presumably the
> > boot CPU)? I'd personally need these warnings to go away for my tests
> > as this causes rcutorture's tests to not cleanly pass for me. It
> > appears remove_cpu() -> device_offline() is what returns the error.
> >
> I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> nohz_full mode, which prevents that CPU from going offline [1]. We have
> discussed this topic, but there is no agreement yet on how to solve it.

But I am seeing the issue in TRACE02 config which is:
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set

So that is not NO_HZ_FULL:
http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
However, I can't seem to find the full kernel logs for that.

Also, other than the TRACE02 fail, I only see the issue with configs
with CONFIG_NO_HZ_FULL=y
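To sort configs into nohz_full and non-nohz_full groups, a config file or
fragment can be classified with a few lines of Python. This is a sketch under
assumptions: parse_kconfig and nohz_mode are hypothetical helper names, the
TRACE02 lines are the two quoted above, and the TREE04 lines are the values
quoted elsewhere in this thread.

```python
import re


def parse_kconfig(text):
    """Map option name to value; '# CONFIG_X is not set' is recorded as 'n'."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = "n"
            continue
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
    return opts


def nohz_mode(opts):
    """Classify a parsed config by its NO_HZ options."""
    if opts.get("CONFIG_NO_HZ_FULL", "n") == "y":
        return "nohz_full"
    if opts.get("CONFIG_NO_HZ_IDLE", "n") == "y":
        return "nohz_idle"
    return "periodic-or-unknown"


trace02 = "CONFIG_NO_HZ_IDLE=y\n# CONFIG_NO_HZ_FULL is not set\n"
tree04 = "CONFIG_NO_HZ_IDLE=n\nCONFIG_NO_HZ_FULL=y\n"
print("TRACE02:", nohz_mode(parse_kconfig(trace02)))
print("TREE04:", nohz_mode(parse_kconfig(tree04)))
```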

Can you try TRACE02 specifically, and see if you can reproduce the
same issue on your setup? Meanwhile, I'll try to trace what is
returning the -EBUSY.

Thanks!

 - Joel

>
> [1] https://lore.kernel.org/lkml/20221127175317.GF4001@paulmck-ThinkPad-P17-Gen-1/T/
>
> Thanks
> Zhouyi
> > You can browse through all the torture test artifacts here which
> > contains everything (build logs, vmlinux, etc).
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/
> >
> > Thanks for your help!
> >
> >  - Joel

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-16 18:03 ` Marc Zyngier
@ 2023-01-16 22:43   ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-16 22:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Mark Rutland,
	Catalin Marinas, rcu, Paul E. McKenney

Hi Marc,
Thanks a lot for taking a look.

On Mon, Jan 16, 2023 at 1:03 PM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Joel,
>
> On Mon, 16 Jan 2023 17:03:31 +0000,
> Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hello,
> > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> >
> > This causes warnings in torture tests:
> > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >
> > Full kernel log here:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> >
> > Any ideas on why this is happening and only for CPU 0 (presumably the
> > boot CPU)? I'd personally need these warnings to go away for my tests
> > as this causes rcutorture's tests to not cleanly pass for me. It
> > appears remove_cpu() -> device_offline() is what returns the error.
>
> I've taken your kernel for a ride as a KVM guest (probably similar to
> what you are doing), and saw the same thing (CPU0 not offlining):
>
> [   64.555845] Detected VIPT I-cache on CPU4
> [   64.556146] GICv3: CPU4: found redistributor 4 region 0:0x000000003ff70000
> [   64.556689] CPU4: Booted secondary processor 0x0000000004 [0x612f0290]
> [   69.823670] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> [   73.991960] psci: CPU7 killed (polled 0 ms)
> [   74.239626] rcu-torture: rcu_torture_read_exit: Start of episode
> [   74.243863] rcu-torture: rcu_torture_read_exit: End of episode
>
> I then tried v6.2-rc4 with defconfig + RCU_TORTURE and your command
> line, and CPU0 does seem to hotplug off correctly:

Interesting. Can you try the config fragment of the failing config on
6.2-rc4 [1]?

[1] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/ConfigFragment

Notably, it has the following, which Zhouyi said he was able to repro
with on another arch as well:
CONFIG_NO_HZ_IDLE=n
CONFIG_NO_HZ_FULL=y

> [   47.217109] psci: CPU3 killed (polled 0 ms)
> [   52.241009] Detected VIPT I-cache on CPU3
> [   52.241227] cacheinfo: Unable to detect cache hierarchy for CPU 3
> [   52.241481] GICv3: CPU3: found redistributor 3 region 0:0x000000003ff50000
> [   52.241849] CPU3: Booted secondary processor 0x0000000003 [0x612f0290]
> [   56.337011] psci: CPU0 killed (polled 0 ms)
> [...]
> [  121.090339] rcu-torture: Free-Block Circulation:  922 920 919 918 917 916 914 913 912 911 0
> [  125.574311] Detected VIPT I-cache on CPU0
> [  125.574557] cacheinfo: Unable to detect cache hierarchy for CPU 0
> [  125.574901] GICv3: CPU0: found redistributor 0 region 0:0x000000003fef0000
> [  125.575322] CPU0: Booted secondary processor 0x0000000000 [0x612f0290]
> [  130.176893] rcu-torture: rcu_torture_read_exit: Start of episode
> [  130.317001] psci: CPU0 killed (polled 0 ms)
> [...]
> [  225.588999] Detected VIPT I-cache on CPU0
> [  225.589224] cacheinfo: Unable to detect cache hierarchy for CPU 0
> [  225.589535] GICv3: CPU0: found redistributor 0 region 0:0x000000003fef0000
> [  225.589946] CPU0: Booted secondary processor 0x0000000000 [0x612f0290]
>
> No such error is being reported.
>
> Is there anything special in your config that would help triggering
> this with the current tip of tree?

Perhaps your config needs the options in the config fragment I mentioned above.

Thanks!

 - Joel

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-16 22:38   ` Joel Fernandes
@ 2023-01-17  0:15     ` Joel Fernandes
  2023-01-17  0:37       ` Zhouyi Zhou
  2023-01-17  4:30       ` Paul E. McKenney
  0 siblings, 2 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-17  0:15 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> Hi Zhouyi,
> 
> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> >
> [..]
> > On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > Hello,
> > > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > >
> > > This causes warnings in torture tests:
> > > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > >
> > > Full kernel log here:
> > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > >
> > > Any ideas on why this is happening and only for CPU 0 (presumably the
> > > boot CPU)? I'd personally need these warnings to go away for my tests
> > > as this causes rcutorture's tests to not cleanly pass for me. It
> > > appears remove_cpu() -> device_offline() is what returns the error.
> > >
> > I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > nohz_full mode, which prevents that CPU from going offline [1]. We have
> > discussed this topic, but there is no agreement yet on how to solve it.
> 
> But I am seeing the issue in TRACE02 config which is:
> CONFIG_NO_HZ_IDLE=y
> # CONFIG_NO_HZ_FULL is not set
> 
> So that is not NO_HZ_FULL:
> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> However, I can't seem to find the full kernel logs for that.
> 
> Also, other than the TRACE02 fail, I only see the issue with configs
> with CONFIG_NO_HZ_FULL=y
> 
> Can you try TRACE02 specifically, and see if you can reproduce the
> same issue on your setup? Meanwhile, I'll try to trace what is
> returning the -EBUSY.

How about something simple like the following? (untested)

---8<-----------------------

diff --git a/kernel/torture.c b/kernel/torture.c
index bc8fb361efc0..cd64110694c0 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
 			// PCI probe frequently disables hotplug during boot.
 			(*n_offl_attempts)--;
 			s = " (-EBUSY forgiven during boot)";
+		} else if (tick_nohz_full_running && ret == -EBUSY) {
+			(*n_offl_attempts)--;
+			s = " (-EBUSY forgiven if nohz_full is running)";
 		}
 		if (verbose)
 			pr_alert("%s" TORTURE_FLAG
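The effect of the hunk above on the attempt/success counters can be modelled
in userspace. This is a toy model under stated assumptions, not the kernel
function: the boot-time condition is not visible in the hunk and is reduced to
a boot_ended flag, the attempt counter is assumed to be incremented before the
offline call, and account_offline is a hypothetical name.

```python
import errno


def account_offline(ret, attempts, successes, boot_ended, nohz_full_running):
    """Model the counters around one offline attempt.

    Returns (attempts, successes, note). A forgiven -EBUSY is taken back out
    of the attempt count, so it does not show up as a failed attempt.
    """
    attempts += 1  # assumption: counted before the offline call is made
    note = ""
    if ret == 0:
        successes += 1
    elif ret == -errno.EBUSY and not boot_ended:
        attempts -= 1
        note = " (-EBUSY forgiven during boot)"
    elif ret == -errno.EBUSY and nohz_full_running:
        attempts -= 1
        note = " (-EBUSY forgiven if nohz_full is running)"
    return attempts, successes, note


# -EBUSY after boot with nohz_full running: forgiven, counters unchanged.
print(account_offline(-errno.EBUSY, 5, 3, True, True))
# -EBUSY after boot without nohz_full: still counted as a failed attempt.
print(account_offline(-errno.EBUSY, 5, 3, True, False))
```

Under this model a -EBUSY on a kernel where nohz_full is not running is still
counted as a failed attempt.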

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  0:15     ` Joel Fernandes
@ 2023-01-17  0:37       ` Zhouyi Zhou
  2023-01-17  1:45         ` Joel Fernandes
  2023-01-17  4:30       ` Paul E. McKenney
  1 sibling, 1 reply; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-17  0:37 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > Hi Zhouyi,
> >
> > On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > >
> > [..]
> > > On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > Hello,
> > > > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > > >
> > > > This causes warnings in torture tests:
> > > > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >
> > > > Full kernel log here:
> > > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > >
> > > > Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > boot CPU)? I'd personally need these warnings to go away for my tests
> > > > as this causes rcutorture's tests to not cleanly pass for me. It
> > > > appears remove_cpu() -> device_offline() is what returns the error.
> > > >
> > > I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > > nohz_full mode, which prevents that CPU from going offline [1]. We have
> > > discussed this topic, but there is no agreement yet on how to solve it.
> >
> > But I am seeing the issue in TRACE02 config which is:
> > CONFIG_NO_HZ_IDLE=y
> > # CONFIG_NO_HZ_FULL is not set
> >
> > So that is not NO_HZ_FULL:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > However, I can't seem to find the full kernel logs for that.
> >
> > Also, other than the TRACE02 fail, I only see the issue with configs
> > with CONFIG_NO_HZ_FULL=y
> >
> > Can you try TRACE02 specifically, and see if you can reproduce the
> > same issue on your setup? Meanwhile, I'll try to trace what is
> > returning the -EBUSY.
I am trying TRACE02 on my x86_64 machine now, using cross-compilation and
qemu-system-aarch64. My equipment is limited, but I hope I can be of
benefit to the community ;-)
>
> How about something simple like the following? (untested)
>
> ---8<-----------------------
>
> diff --git a/kernel/torture.c b/kernel/torture.c
> index bc8fb361efc0..cd64110694c0 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>                         // PCI probe frequently disables hotplug during boot.
>                         (*n_offl_attempts)--;
>                         s = " (-EBUSY forgiven during boot)";
> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> +                       (*n_offl_attempts)--;
> +                       s = " (-EBUSY forgiven if nohz_full is running)";
Fantastic fix!! With this we can fix the timekeeper CPU torture problem
without touching the timekeeping code.

Thanks
Zhouyi
>                 }
>                 if (verbose)
>                         pr_alert("%s" TORTURE_FLAG

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  0:37       ` Zhouyi Zhou
@ 2023-01-17  1:45         ` Joel Fernandes
  2023-01-17  3:15           ` Zhouyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-17  1:45 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
> On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > Hi Zhouyi,
> > >
> > > On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > >
> > > [..]
> > > > On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > > Hello,
> > > > > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > > > >
> > > > > This causes warnings in torture tests:
> > > > > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > >
> > > > > Full kernel log here:
> > > > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > > >
> > > > > Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > > boot CPU)? I'd personally need these warnings to go away for my tests
> > > > > as this causes rcutorture's tests to not cleanly pass for me. It
> > > > > appears remove_cpu() -> device_offline() is what returns the error.
> > > > >
> > > > I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > > > nohz_full mode, which prevents that CPU from going offline [1]. We have
> > > > discussed this topic, but there is no agreement yet on how to solve it.
> > >
> > > But I am seeing the issue in TRACE02 config which is:
> > > CONFIG_NO_HZ_IDLE=y
> > > # CONFIG_NO_HZ_FULL is not set
> > >
> > > So that is not NO_HZ_FULL:
> > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > However, I can't seem to find the full kernel logs for that.
> > >
> > > Also, other than the TRACE02 fail, I only see the issue with configs
> > > with CONFIG_NO_HZ_FULL=y
> > >
> > > Can you try TRACE02 specifically, and see if you can reproduce the
> > > same issue on your setup? Meanwhile, I'll try to trace what is
> > > returning the -EBUSY.
> I am trying TRACE02 on my x86_64 machine now, using cross-compilation and
> qemu-system-aarch64. My equipment is limited, but I hope I can be of
> benefit to the community ;-)

Cool, I am assuming you are trying the patch you shared which you wrote in
November. I bet you will still see the issue.

> >
> > How about something simple like the following? (untested)
> >
> > ---8<-----------------------
> >
> > diff --git a/kernel/torture.c b/kernel/torture.c
> > index bc8fb361efc0..cd64110694c0 100644
> > --- a/kernel/torture.c
> > +++ b/kernel/torture.c
> > @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> >                         // PCI probe frequently disables hotplug during boot.
> >                         (*n_offl_attempts)--;
> >                         s = " (-EBUSY forgiven during boot)";
> > +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > +                       (*n_offl_attempts)--;
> > +                       s = " (-EBUSY forgiven if nohz_full is running)";
> Fantastic fix!! With this we can fix the timekeeper CPU torture problem
> without touching the timekeeping code.

Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
you shared does not fix it either -- because TRACE02 is not a no-hz-full
test. :-(

We will need to do a bit of tracing to figure out where the -EBUSY is coming
from for TRACE02.

I wonder if we should ignore -EBUSY altogether, since as Thomas mentioned,
hotplug failure is "normal". Thoughts?

thanks,

 - Joel


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  1:45         ` Joel Fernandes
@ 2023-01-17  3:15           ` Zhouyi Zhou
  2023-01-17  4:34             ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-17  3:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Tue, Jan 17, 2023 at 9:45 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
> > On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > > Hi Zhouyi,
> > > >
> > > > On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > > >
> > > > [..]
> > > > > On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > > > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > > > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > > > > >
> > > > > > This causes warnings in torture tests:
> > > > > > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > >
> > > > > > Full kernel log here:
> > > > > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > > > >
> > > > > > Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > > > boot CPU)? I'd personally need these warnings to go away for my tests
> > > > > > as this causes rcutorture's tests to not cleanly pass for me. It
> > > > > > appears remove_cpu() -> device_offline() is what returns the error.
> > > > > >
> > > > > I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > > > > nohz_full mode, which prevents that CPU from going offline [1]. We have
> > > > > discussed this topic, but there is no agreement yet on how to solve it.
> > > >
> > > > But I am seeing the issue in TRACE02 config which is:
> > > > CONFIG_NO_HZ_IDLE=y
> > > > # CONFIG_NO_HZ_FULL is not set
> > > >
> > > > So that is not NO_HZ_FULL:
> > > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > > However, I can't seem to find the full kernel logs for that.
> > > >
> > > > Also, other than the TRACE02 fail, I only see the issue with configs
> > > > with CONFIG_NO_HZ_FULL=y
> > > >
> > > > Can you try TRACE02 specifically, and see if you can reproduce the
> > > > same issue on your setup? Meanwhile, I'll try to trace what is
> > > > returning the -EBUSY.
> > I am trying TRACE02 on my x86_64 machine now, using cross-compilation and
> > qemu-system-aarch64. My equipment is limited, but I hope I can be of
> > benefit to the community ;-)
>
> Cool, I am assuming you are trying the patch you shared which you wrote in
> November. I bet you will still see the issue.
Yes, I still see the issue with nohz_full.
>
> > >
> > > How about something simple like the following? (untested)
> > >
> > > ---8<-----------------------
> > >
> > > diff --git a/kernel/torture.c b/kernel/torture.c
> > > index bc8fb361efc0..cd64110694c0 100644
> > > --- a/kernel/torture.c
> > > +++ b/kernel/torture.c
> > > @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > >                         // PCI probe frequently disables hotplug during boot.
> > >                         (*n_offl_attempts)--;
> > >                         s = " (-EBUSY forgiven during boot)";
> > > +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > > +                       (*n_offl_attempts)--;
> > > +                       s = " (-EBUSY forgiven if nohz_full is running)";
> > Fantastic fix!! With this we can fix the timekeeper CPU torture problem
> > without touching the timekeeping code.
>
> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
> you shared does not fix it either -- because TRACE02 is not a no-hz-full
> test. :-(
>
> We will need to do a bit of tracing to figure out where the -EBUSY is coming
> from for TRACE02.
Agreed, TRACE02 is a separate issue. Unfortunately I can't reproduce the
bug with either your original Image [1] or my cross-compiled kernel
built using [2].

I guess there may be two reasons:
1) my testbed is X86_64 based.
2) the command I use to invoke qemu is not right:
2-1) the newly compiled linux-5.15.89-rc1
qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4
-serial file:/tmp/consoleJan1702.log  -kernel arch/arm64/boot/Image
-append "console=ttyAMA0 oops=panic panic_on_warn=1 panic=-1
ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
rcutorture.onoff_holdoff=1000 rcutorture.n_barrier_cbs=4
rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
test_no_idle_hz=1 verbose=1" -m 2048 -net user,hostfwd=tcp::10024-:22
-net nic
2-2) original Image [1]
qemu-system-aarch64 -machine virt   -cpu cortex-a57   -nographic -smp
4  -serial file:/tmp/consoleJan1701.log   -kernel /home/zzy/Image
-append "console=ttyAMA0  oops=panic panic_on_warn=1 panic=-1
ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
rcutorture.onoff_holdoff=30 n_barrier_cbs=4
rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
test_no_idle_hz=1 verbose=1"   -m 2048   -net
user,hostfwd=tcp::10023-:22 -net nic

As Mark can reproduce the issue using [1], there must be something
wrong with my x86_64 based environment.

Sorry not to be of help this time.

I am very happy and interested to perform further tests whenever there
are further instructions ;-)

[1] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/Image
[2] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/.config
>
> I wonder if we should ignore -EBUSY altogether, since as Thomas mentioned,
> hotplug failure is "normal". Thoughts?
This decision is too important for a beginner like me, but many
thanks for your trust in me ;-) What does Paul think about it? ;-)

Thanks
Zhouyi
>
> thanks,
>
>  - Joel
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  0:15     ` Joel Fernandes
  2023-01-17  0:37       ` Zhouyi Zhou
@ 2023-01-17  4:30       ` Paul E. McKenney
  2023-01-17  4:36         ` Joel Fernandes
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2023-01-17  4:30 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu

On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > Hi Zhouyi,
> > 
> > On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > >
> > [..]
> > > On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > Hello,
> > > > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> > > >
> > > > This causes warnings in torture tests:
> > > > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >
> > > > Full kernel log here:
> > > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > >
> > > > Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > boot CPU)? I'd personally need these warnings to go away for my tests
> > > > as this causes rcutorture's tests to not cleanly pass for me. It
> > > > appears remove_cpu() -> device_offline() is what returns the error.
> > > >
> > > I guess this probably because CPU 0 is the tick_do_timer_cpu in
> > > nohz_full mode, which prevent that cpu from
> > > going offline [1]. We have discussed this topic, but there is no
> > > agreement on how to solve it yet.
> > 
> > But I am seeing the issue in TRACE02 config which is:
> > CONFIG_NO_HZ_IDLE=y
> > # CONFIG_NO_HZ_FULL is not set
> > 
> > So that is not NO_HZ_FULL:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > However, I can't seem to find the full kernel logs for that.
> > 
> > Also, other than the TRACE02 fail, I only see the issue with configs
> > with CONFIG_NO_HZ_FULL=y
> > 
> > Can you try TRACE02 specifically, and see if you can reproduce the
> > same issue on your setup? Meanwhile, I'll try to trace what is
> > returning the -EBUSY.
> 
> How about something simple like the following? (untested)
> 
> ---8<-----------------------
> 
> diff --git a/kernel/torture.c b/kernel/torture.c
> index bc8fb361efc0..cd64110694c0 100644
> --- a/kernel/torture.c
> +++ b/kernel/torture.c
> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>  			// PCI probe frequently disables hotplug during boot.
>  			(*n_offl_attempts)--;
>  			s = " (-EBUSY forgiven during boot)";
> +		} else if (tick_nohz_full_running && ret == -EBUSY) {
> +			(*n_offl_attempts)--;
> +			s = " (-EBUSY forgiven if nohz_full is running)";

But this should be forgiven for the timekeeping CPU, not everyone,
correct?

Yes, I know that CPU-hotplug operations can fail, but in my testing
they almost never do.  This means that a new failure might well be a
real bug somewhere that needs attention.

							Thanx, Paul

>  		}
>  		if (verbose)
>  			pr_alert("%s" TORTURE_FLAG
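
For illustration only (not part of the posted patch): a minimal userspace
sketch of the narrower check asked about above, where -EBUSY is forgiven
only when the CPU being offlined is the nohz_full timekeeping CPU. The
struct offline_ctx, its fields, and offline_failure_forgiven() are
hypothetical names invented for this sketch; how kernel/torture.c would
actually learn which CPU is the timekeeper is left open here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the real code would query. */
struct offline_ctx {
	bool boot_in_progress;   /* stands in for the existing "during boot" test */
	bool nohz_full_running;  /* stands in for tick_nohz_full_running */
	int timekeeper_cpu;      /* assumed: CPU currently holding the tick duty */
};

/*
 * Return an annotation string if a failed offline attempt should be
 * forgiven (not counted against the test), or NULL if it should be
 * reported as a genuine failure.
 */
const char *offline_failure_forgiven(const struct offline_ctx *ctx,
				     int cpu, int ret)
{
	assert(ctx != NULL);

	/* Only -EBUSY is ever forgiven; any other error is always reported. */
	if (ret != -EBUSY)
		return NULL;
	if (ctx->boot_in_progress)
		return " (-EBUSY forgiven during boot)";
	/* Forgive only the timekeeping CPU, not every CPU under nohz_full. */
	if (ctx->nohz_full_running && cpu == ctx->timekeeper_cpu)
		return " (-EBUSY forgiven for nohz_full timekeeping CPU)";
	return NULL;
}
```

With this shape, an -EBUSY on any other CPU, or on any CPU when nohz_full
is not running (as in TRACE02), would still be reported as a failure.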

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  3:15           ` Zhouyi Zhou
@ 2023-01-17  4:34             ` Joel Fernandes
  2023-01-17 11:42               ` Zhouyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-17  4:34 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney



> On Jan 16, 2023, at 10:15 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> 
> On Tue, Jan 17, 2023 at 9:45 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>> 
>>> On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
>>> On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>> 
>>>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
>>>>> Hi Zhouyi,
>>>>> 
>>>>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
>>>>>> 
>>>>> [..]
>>>>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>>>> 
>>>>>>> Hello,
>>>>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
>>>>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
>>>>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
>>>>>>> 
>>>>>>> This causes warnings in torture tests:
>>>>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>>>> 
>>>>>>> Full kernel log here:
>>>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
>>>>>>> 
>>>>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
>>>>>>> boot CPU)? I'd personally need these warnings to go away for my tests
>>>>>>> as this causes rcutorture's tests to not cleanly pass for me. It
>>>>>>> appears remove_cpu() -> device_offline() is what returns the error.
>>>>>>> 
>>>>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
>>>>>> nohz_full mode, which prevent that cpu from
>>>>>> going offline [1]. We have discussed this topic, but there is no
>>>>>> agreement on how to solve it yet.
>>>>> 
>>>>> But I am seeing the issue in TRACE02 config which is:
>>>>> CONFIG_NO_HZ_IDLE=y
>>>>> # CONFIG_NO_HZ_FULL is not set
>>>>> 
>>>>> So that is not NO_HZ_FULL:
>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
>>>>> However, I can't seem to find the full kernel logs for that.
>>>>> 
>>>>> Also, other than the TRACE02 fail, I only see the issue with configs
>>>>> with CONFIG_NO_HZ_FULL=y
>>>>> 
>>>>> Can you try TRACE02 specifically, and see if you can reproduce the
>>>>> same issue on your setup? Meanwhile, I'll try to trace what is
>>>>> returning the -EBUSY.
>>> I am trying TRACE02 on my X86_64 machine using cross compile and
>>> qemu-system-aarch64 now, my equipment is limited, but hope I can be of
>>> beneficial to the community ;-)
>> 
>> Cool, I am assuming you are trying the patch you shared which you wrote in
>> November. I bet you will still see the issue.
> Yes, I still see the issue with nohz_full.
>> 
>>>> 
>>>> How about something simple like the following? (untested)
>>>> 
>>>> ---8<-----------------------
>>>> 
>>>> diff --git a/kernel/torture.c b/kernel/torture.c
>>>> index bc8fb361efc0..cd64110694c0 100644
>>>> --- a/kernel/torture.c
>>>> +++ b/kernel/torture.c
>>>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>>>>                        // PCI probe frequently disables hotplug during boot.
>>>>                        (*n_offl_attempts)--;
>>>>                        s = " (-EBUSY forgiven during boot)";
>>>> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
>>>> +                       (*n_offl_attempts)--;
>>>> +                       s = " (-EBUSY forgiven if nohz_full is running)";
>>> Fantastic fix!! thus we can fix the time keeper cpu torture problem
>>> without touch the time keeper code.
>> 
>> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
>> you shared does not fix it either -- because TRACE02 is not a no-hz-full
>> test. :-(
>> 
>> We will need to do a bit of tracing to figure out where the -EBUSY is coming
>> from for TRACE02.
> Agreed, TRACE02 is a separate issue. Unfortunately I can't reproduce the
> bug with either your original Image [1] or my cross-compiled kernel
> built using [2].
> 
> I guess there may be two reasons:
> 1) my testbed is X86_64 based.
> 2) the command I use to invoke qemu is not right:
> 2-1) the newly compiled linux-5.15.89-rc1
> qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4

Does 8 CPUs make any difference? That is my setup.

Not sure what else is different. It could be a CPU-model-specific issue,
or something. But why don't you just use the same setup you used in
November and check TRACE02? That is actually what I was asking you to
test, since you saw the same issue on that setup.

Thanks,

Joel 



> -serial file:/tmp/consoleJan1702.log  -kernel arch/arm64/boot/Image
> -append "console=ttyAMA0 oops=panic panic_on_warn=1 panic=-1
> ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
> rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
> rcutorture.onoff_holdoff=1000 rcutorture.n_barrier_cbs=4
> rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
> test_no_idle_hz=1 verbose=1" -m 2048 -net user,hostfwd=tcp::10024-:22
> -net nic
> 2-2) original Image [1]
> qemu-system-aarch64 -machine virt   -cpu cortex-a57   -nographic -smp
> 4  -serial file:/tmp/consoleJan1701.log   -kernel /home/zzy/Image
> -append "console=ttyAMA0  oops=panic panic_on_warn=1 panic=-1
> ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
> rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
> rcutorture.onoff_holdoff=30 n_barrier_cbs=4
> rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
> test_no_idle_hz=1 verbose=1"   -m 2048   -net
> user,hostfwd=tcp::10023-:22 -net nic
> 
> As Mark can reproduce the issue using [1], there must be something
> wrong with my x86_64 based environment.
> 
> Sorry not to be of help this time.
> 
> I am very happy and interested to perform further tests whenever there
> are further instructions ;-)
> 
> [1] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/Image
> [2] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/.config
>> 
>> I wonder if we should ignore -EBUSY altogether, since as Thomas mentioned,
>> hotplug failure is "normal". Thoughts?
> This decision is too important for a beginner like me, but many
> thanks for your trust in me ;-) What does Paul think about it? ;-)
> 
> Thanks
> Zhouyi
>> 
>> thanks,
>> 
>> - Joel
>> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  4:30       ` Paul E. McKenney
@ 2023-01-17  4:36         ` Joel Fernandes
  2023-01-17  4:54           ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-17  4:36 UTC (permalink / raw)
  To: paulmck
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu



> On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
>>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
>>> Hi Zhouyi,
>>> 
>>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
>>>> 
>>> [..]
>>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>>>>> 
>>>>> Hello,
>>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
>>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
>>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
>>>>> 
>>>>> This causes warnings in torture tests:
>>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> 
>>>>> Full kernel log here:
>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
>>>>> 
>>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
>>>>> boot CPU)? I'd personally need these warnings to go away for my tests
>>>>> as this causes rcutorture's tests to not cleanly pass for me. It
>>>>> appears remove_cpu() -> device_offline() is what returns the error.
>>>>> 
>>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
>>>> nohz_full mode, which prevent that cpu from
>>>> going offline [1]. We have discussed this topic, but there is no
>>>> agreement on how to solve it yet.
>>> 
>>> But I am seeing the issue in TRACE02 config which is:
>>> CONFIG_NO_HZ_IDLE=y
>>> # CONFIG_NO_HZ_FULL is not set
>>> 
>>> So that is not NO_HZ_FULL:
>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
>>> However, I can't seem to find the full kernel logs for that.
>>> 
>>> Also, other than the TRACE02 fail, I only see the issue with configs
>>> with CONFIG_NO_HZ_FULL=y
>>> 
>>> Can you try TRACE02 specifically, and see if you can reproduce the
>>> same issue on your setup? Meanwhile, I'll try to trace what is
>>> returning the -EBUSY.
>> 
>> How about something simple like the following? (untested)
>> 
>> ---8<-----------------------
>> 
>> diff --git a/kernel/torture.c b/kernel/torture.c
>> index bc8fb361efc0..cd64110694c0 100644
>> --- a/kernel/torture.c
>> +++ b/kernel/torture.c
>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>>            // PCI probe frequently disables hotplug during boot.
>>            (*n_offl_attempts)--;
>>            s = " (-EBUSY forgiven during boot)";
>> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
>> +            (*n_offl_attempts)--;
>> +            s = " (-EBUSY forgiven if nohz_full is running)";
> 
> But this should be forgiven for the timekeeping CPU, not everyone,
> correct?
> 
> Yes, I know that CPU-hotplug operations can fail, but in my testing
> they almost never do.  This means that a new failure might well be a
> real bug somewhere that needs attention.

Sure. We may need to expose some API to reveal that. 

It appeared, though, that Thomas, in the other thread about Zhouyi's
patch, was suggesting that rcutorture tolerate hotplug failures because
they are not abnormal, right?

Thanks,

 - Joel


> 
>                            Thanx, Paul
> 
>>        }
>>        if (verbose)
>>            pr_alert("%s" TORTURE_FLAG

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  4:36         ` Joel Fernandes
@ 2023-01-17  4:54           ` Paul E. McKenney
  2023-01-17 20:02             ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2023-01-17  4:54 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu

On Mon, Jan 16, 2023 at 11:36:57PM -0500, Joel Fernandes wrote:
> > On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
> >>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> >>> Hi Zhouyi,
> >>> 
> >>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> >>>> 
> >>> [..]
> >>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>> 
> >>>>> Hello,
> >>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> >>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> >>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> >>>>> 
> >>>>> This causes warnings in torture tests:
> >>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>> 
> >>>>> Full kernel log here:
> >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> >>>>> 
> >>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> >>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> >>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> >>>>> appears remove_cpu() -> device_offline() is what returns the error.
> >>>>> 
> >>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
> >>>> nohz_full mode, which prevent that cpu from
> >>>> going offline [1]. We have discussed this topic, but there is no
> >>>> agreement on how to solve it yet.
> >>> 
> >>> But I am seeing the issue in TRACE02 config which is:
> >>> CONFIG_NO_HZ_IDLE=y
> >>> # CONFIG_NO_HZ_FULL is not set
> >>> 
> >>> So that is not NO_HZ_FULL:
> >>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> >>> However, I can't seem to find the full kernel logs for that.
> >>> 
> >>> Also, other than the TRACE02 fail, I only see the issue with configs
> >>> with CONFIG_NO_HZ_FULL=y
> >>> 
> >>> Can you try TRACE02 specifically, and see if you can reproduce the
> >>> same issue on your setup? Meanwhile, I'll try to trace what is
> >>> returning the -EBUSY.
> >> 
> >> How about something simple like the following? (untested)
> >> 
> >> ---8<-----------------------
> >> 
> >> diff --git a/kernel/torture.c b/kernel/torture.c
> >> index bc8fb361efc0..cd64110694c0 100644
> >> --- a/kernel/torture.c
> >> +++ b/kernel/torture.c
> >> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> >>            // PCI probe frequently disables hotplug during boot.
> >>            (*n_offl_attempts)--;
> >>            s = " (-EBUSY forgiven during boot)";
> >> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
> >> +            (*n_offl_attempts)--;
> >> +            s = " (-EBUSY forgiven if nohz_full is running)";
> > 
> > But this should be forgiven for the timekeeping CPU, not everyone,
> > correct?
> > 
> > Yes, I know that CPU-hotplug operations can fail, but in my testing
> > they almost never do.  This means that a new failure might well be a
> > real bug somewhere that needs attention.
> 
> Sure. We may need to expose some API to reveal that. 
> 
> It appeared, though, that Thomas, in the other thread about Zhouyi's
> patch, was suggesting that rcutorture tolerate hotplug failures because
> they are not abnormal, right?

Based on my rcutorture testing experience on x86, they are not at all
normal.  The only time I have seen rcutorture CPU-hotplug failures has
been due to some bug that needed fixing.

Is there a plan to make CPU hotplug failures more frequent?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  4:34             ` Joel Fernandes
@ 2023-01-17 11:42               ` Zhouyi Zhou
  2023-01-17 19:50                 ` Joel Fernandes
  2023-01-18 10:15                 ` Zhouyi Zhou
  0 siblings, 2 replies; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-17 11:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Tue, Jan 17, 2023 at 12:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> > On Jan 16, 2023, at 10:15 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 9:45 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>
> >>> On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
> >>> On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>
> >>>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> >>>>> Hi Zhouyi,
> >>>>>
> >>>>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> >>>>>>
> >>>>> [..]
> >>>>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> >>>>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> >>>>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> >>>>>>>
> >>>>>>> This causes warnings in torture tests:
> >>>>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>>>>
> >>>>>>> Full kernel log here:
> >>>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> >>>>>>>
> >>>>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> >>>>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> >>>>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> >>>>>>> appears remove_cpu() -> device_offline() is what returns the error.
> >>>>>>>
> >>>>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
> >>>>>> nohz_full mode, which prevent that cpu from
> >>>>>> going offline [1]. We have discussed this topic, but there is no
> >>>>>> agreement on how to solve it yet.
> >>>>>
> >>>>> But I am seeing the issue in TRACE02 config which is:
> >>>>> CONFIG_NO_HZ_IDLE=y
> >>>>> # CONFIG_NO_HZ_FULL is not set
> >>>>>
> >>>>> So that is not NO_HZ_FULL:
> >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> >>>>> However, I can't seem to find the full kernel logs for that.
> >>>>>
> >>>>> Also, other than the TRACE02 fail, I only see the issue with configs
> >>>>> with CONFIG_NO_HZ_FULL=y
> >>>>>
> >>>>> Can you try TRACE02 specifically, and see if you can reproduce the
> >>>>> same issue on your setup? Meanwhile, I'll try to trace what is
> >>>>> returning the -EBUSY.
> >>> I am trying TRACE02 on my X86_64 machine using cross compile and
> >>> qemu-system-aarch64 now, my equipment is limited, but hope I can be of
> >>> beneficial to the community ;-)
> >>
> >> Cool, I am assuming you are trying the patch you shared which you wrote in
> >> November. I bet you will still see the issue.
> > > Yes, I still see the issue with nohz_full.
> >>
> >>>>
> >>>> How about something simple like the following? (untested)
> >>>>
> >>>> ---8<-----------------------
> >>>>
> >>>> diff --git a/kernel/torture.c b/kernel/torture.c
> >>>> index bc8fb361efc0..cd64110694c0 100644
> >>>> --- a/kernel/torture.c
> >>>> +++ b/kernel/torture.c
> >>>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> >>>>                        // PCI probe frequently disables hotplug during boot.
> >>>>                        (*n_offl_attempts)--;
> >>>>                        s = " (-EBUSY forgiven during boot)";
> >>>> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> >>>> +                       (*n_offl_attempts)--;
> >>>> +                       s = " (-EBUSY forgiven if nohz_full is running)";
> >>> Fantastic fix!! thus we can fix the time keeper cpu torture problem
> >>> without touch the time keeper code.
> >>
> >> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
> >> you shared does not fix it either -- because TRACE02 is not a no-hz-full
> >> test. :-(
> >>
> >> We will need to do a bit of tracing to figure out where the -EBUSY is coming
> >> from for TRACE02.
> > > Agreed, TRACE02 is a separate issue. Unfortunately I can't reproduce the
> > > bug with either your original Image [1] or my cross-compiled kernel
> > > built using [2].
> >
> > I guess there may be two reasons:
> > 1) my testbed is X86_64 based.
> > > 2) the command I use to invoke qemu is not right:
> > 2-1) the newly compiled linux-5.15.89-rc1
> > qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4
>
> Does 8 CPUs make any difference? That is my setup.
8 CPUs make no difference ;-(
>
> Not sure what else is different. It could be a CPU-model-specific issue, or something. But why don't you just use the same setup you used in November and check TRACE02? That is actually what I was asking you to test, since you saw the same issue on that setup.
I guess it may be a CPU-model-specific issue, but I can't invoke
qemu-system-aarch64 with "-machine virt,gic-version=host -cpu host"
because I don't have an aarch64 bare-metal host.

OK, I am running the same setup on linux-5.15.y as I did last November,
in the PPC VM at the Open Source Lab of Oregon State University. This will
take about 20 hours; I will report what I find after the test finishes.

Thanks
Zhouyi
>
> Thanks,
>
> Joel
>
>
>
> > -serial file:/tmp/consoleJan1702.log  -kernel arch/arm64/boot/Image
> > -append "console=ttyAMA0 oops=panic panic_on_warn=1 panic=-1
> > ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
> > rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
> > rcutorture.onoff_holdoff=1000 rcutorture.n_barrier_cbs=4
> > rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
> > test_no_idle_hz=1 verbose=1" -m 2048 -net user,hostfwd=tcp::10024-:22
> > -net nic
> > 2-2) original Image [1]
> > qemu-system-aarch64 -machine virt   -cpu cortex-a57   -nographic -smp
> > 4  -serial file:/tmp/consoleJan1701.log   -kernel /home/zzy/Image
> > -append "console=ttyAMA0  oops=panic panic_on_warn=1 panic=-1
> > ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
> > rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
> > rcutorture.onoff_holdoff=30 n_barrier_cbs=4
> > rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
> > test_no_idle_hz=1 verbose=1"   -m 2048   -net
> > user,hostfwd=tcp::10023-:22 -net nic
> >
> > As Mark can reproduce the issue using [1], there must be something
> > wrong with my x86_64 based environment.
> >
> > Sorry not to be of help this time.
> >
> > I am very happy and interested to perform further tests whenever there
> > are further instructions ;-)
> >
> > [1] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/Image
> > [2] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/.config
> >>
> >> I wonder if we should ignore -EBUSY altogether, since as Thomas mentioned,
> >> hotplug failure is "normal". Thoughts?
> > > This decision is too important for a beginner like me, but many
> > > thanks for your trust in me ;-) What does Paul think about it? ;-)
> >
> > Thanks
> > Zhouyi
> >>
> >> thanks,
> >>
> >> - Joel
> >>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17 11:42               ` Zhouyi Zhou
@ 2023-01-17 19:50                 ` Joel Fernandes
  2023-01-18 10:15                 ` Zhouyi Zhou
  1 sibling, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-17 19:50 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Tue, Jan 17, 2023 at 11:43 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
[...]
> > >>>>
> > >>>> How about something simple like the following? (untested)
> > >>>>
> > >>>> ---8<-----------------------
> > >>>>
> > >>>> diff --git a/kernel/torture.c b/kernel/torture.c
> > >>>> index bc8fb361efc0..cd64110694c0 100644
> > >>>> --- a/kernel/torture.c
> > >>>> +++ b/kernel/torture.c
> > >>>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > >>>>                        // PCI probe frequently disables hotplug during boot.
> > >>>>                        (*n_offl_attempts)--;
> > >>>>                        s = " (-EBUSY forgiven during boot)";
> > >>>> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > >>>> +                       (*n_offl_attempts)--;
> > >>>> +                       s = " (-EBUSY forgiven if nohz_full is running)";
> > >>> Fantastic fix!! thus we can fix the time keeper cpu torture problem
> > >>> without touch the time keeper code.
> > >>
> > >> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
> > >> you shared does not fix it either -- because TRACE02 is not a no-hz-full
> > >> test. :-(
> > >>
> > >> We will need to do a bit of tracing to figure out where the -EBUSY is coming
> > >> from for TRACE02.
> > > agree TRACE02 is another issue, unfortunately I can't reproduce the
> > > bug neither with your original Image [1]
> > > nor with my cross compiled kernel using [2].
> > >
> > > I guess there may be two reasons:
> > > 1) my testbed is X86_64 based.
> > > 2) the command that I invoke qemu is not right:
> > > 2-1) the newly compiled linux-5.15.89-rc1
> > > qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4
> >
> > Does 8 CPUs make any difference? That is my setup.
> 8 CPUs make no difference ;-(

Ah, it was worth a try! Hmm.

> > Not sure what else is different. It could be a CPU model specific issue, or something. But why don't you just use the same setup you used in November and check TRACE02? That is actually what I was requesting you to test, since you saw the same issue on that setup.
> I guess it may be a CPU model specific issue, while I can't invoke
> qemu-system-aarch64  with  "-machine virt,gic-version=host -cpu host"
> because I didn't have an aarch64 bare metal host.
>
> OK, I am doing the same setup on linux-5.15.y as I did last November
> in the PPC VM of Open Source Lab of Oregon State University, this will
> take about 20 hours, and report what I found after the test finishes.

Sounds good, Thanks!

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17  4:54           ` Paul E. McKenney
@ 2023-01-17 20:02             ` Joel Fernandes
  2023-01-17 20:42               ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-17 20:02 UTC (permalink / raw)
  To: paulmck
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu

On Tue, Jan 17, 2023 at 4:54 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Jan 16, 2023 at 11:36:57PM -0500, Joel Fernandes wrote:
> > > On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
> > >>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > >>> Hi Zhouyi,
> > >>>
> > >>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > >>>>
> > >>> [..]
> > >>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >>>>>
> > >>>>> Hello,
> > >>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> > >>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > >>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> > >>>>>
> > >>>>> This causes warnings in torture tests:
> > >>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > >>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > >>>>>
> > >>>>> Full kernel log here:
> > >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > >>>>>
> > >>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> > >>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> > >>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> > >>>>> appears remove_cpu() -> device_offline() is what returns the error.
> > >>>>>
> > >>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
> > >>>> nohz_full mode, which prevent that cpu from
> > >>>> going offline [1]. We have discussed this topic, but there is no
> > >>>> agreement on how to solve it yet.
> > >>>
> > >>> But I am seeing the issue in TRACE02 config which is:
> > >>> CONFIG_NO_HZ_IDLE=y
> > >>> # CONFIG_NO_HZ_FULL is not set
> > >>>
> > >>> So that is not NO_HZ_FULL:
> > >>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > >>> However, I can't seem to find the full kernel logs for that.
> > >>>
> > >>> Also, other than the TRACE02 fail, I only see the issue with configs
> > >>> with CONFIG_NO_HZ_FULL=y
> > >>>
> > >>> Can you try TRACE02 specifically, and see if you can reproduce the
> > >>> same issue on your setup? Meanwhile, I'll try to trace what is
> > >>> returning the -EBUSY.
> > >>
> > >> How about something simple like the following? (untested)
> > >>
> > >> ---8<-----------------------
> > >>
> > >> diff --git a/kernel/torture.c b/kernel/torture.c
> > >> index bc8fb361efc0..cd64110694c0 100644
> > >> --- a/kernel/torture.c
> > >> +++ b/kernel/torture.c
> > >> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > >>            // PCI probe frequently disables hotplug during boot.
> > >>            (*n_offl_attempts)--;
> > >>            s = " (-EBUSY forgiven during boot)";
> > >> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
> > >> +            (*n_offl_attempts)--;
> > >> +            s = " (-EBUSY forgiven if nohz_full is running)";
> > >
> > > But this should be forgiven for the timekeeping CPU, not everyone,
> > > correct?
> > >
> > > Yes, I know that CPU-hotplug operations can fail, but in my testing
> > > they almost never do.  This means that a new failure might well be a
> > > real bug somewhere that needs attention.
> >
> > Sure. We may need to expose some API to reveal that.
> >
> > It appeared though that Thomas in the other thread related to patch
> > from Zhouyi, was suggesting that rcutorture tolerate hotplug failure
> > though, because they are not abnormal, right?
>
> Based on my rcutorture testing experience on x86, they are not at all
> normal.  The only time I have seen rcutorture CPU-hotplug failures has
> been due to some bug that needed fixing.

I see, ok. I need to debug what is returning -EBUSY for !NO_HZ_FULL on
arm64, I will report back once I do.

Meanwhile, Marc I am wondering if you are able to reproduce the issue
on your side on TRACE02 config, like I am?

Here is the TRACE02 config fragment:
http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/ConfigFragment/*view*/

Here are instructions on how to run it (torture test parameters etc)
if you are loading the module yourself:
http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/bare-metal/*view*/

> Is there a plan to make CPU hotplug failures more frequent?

I am not aware of such a plan but I was going by "There are quite some
reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
not a fatal problem, really." in [1].

What about an rcutorture option to skip hotplug for a certain CPU id,
e.g. rcutorture.skip_hotplug_cpus="0"? It can be a last resort. But we/I
should debug this issue more before getting to that.
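
For illustration only, here is a rough userspace sketch of how such a
skip list might behave. rcutorture.skip_hotplug_cpus is just a proposal
at this point, and parse_skip_cpus() and should_skip_hotplug() are
hypothetical names, not existing kernel functions. It assumes at most 64
CPUs and a plain comma-separated list without ranges:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: turn a comma-separated CPU list such as "0" or
 * "0,3" into a bitmask. Ranges ("0-3") are not handled in this sketch.
 */
uint64_t parse_skip_cpus(const char *s)
{
	uint64_t mask = 0;

	assert(s != NULL);
	while (*s) {
		char *end;
		unsigned long cpu = strtoul(s, &end, 10);

		if (end == s)		/* no digits found: stop parsing */
			break;
		if (cpu < 64)		/* silently ignore out-of-range ids */
			mask |= UINT64_C(1) << cpu;
		if (*end != ',')	/* end of list, or unexpected input */
			break;
		s = end + 1;
	}
	return mask;
}

/* Hypothetical check an onoff loop could make before trying an offline. */
bool should_skip_hotplug(uint64_t skip_mask, unsigned int cpu)
{
	return cpu < 64 && ((skip_mask >> cpu) & 1);
}
```

With skip_hotplug_cpus="0", should_skip_hotplug() would be true for CPU 0
only, so the onoff loop would never attempt to offline it and the -EBUSY
splat would not be counted.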

Thanks,

- Joel
[1] https://lkml.org/lkml/2022/11/27/182

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17 20:02             ` Joel Fernandes
@ 2023-01-17 20:42               ` Paul E. McKenney
  2023-01-18  2:17                 ` Joel Fernandes
  0 siblings, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2023-01-17 20:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu

On Tue, Jan 17, 2023 at 08:02:24PM +0000, Joel Fernandes wrote:
> On Tue, Jan 17, 2023 at 4:54 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 11:36:57PM -0500, Joel Fernandes wrote:
> > > > On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
> > > >>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > >>> Hi Zhouyi,
> > > >>>
> > > >>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > >>>>
> > > >>> [..]
> > > >>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >>>>>
> > > >>>>> Hello,
> > > >>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > >>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > >>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> > > >>>>>
> > > >>>>> This causes warnings in torture tests:
> > > >>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >>>>>
> > > >>>>> Full kernel log here:
> > > >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > >>>>>
> > > >>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> > > >>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> > > >>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> > > >>>>> appears remove_cpu() -> device_offline() is what returns the error.
> > > >>>>>
> > > >>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
> > > >>>> nohz_full mode, which prevent that cpu from
> > > >>>> going offline [1]. We have discussed this topic, but there is no
> > > >>>> agreement on how to solve it yet.
> > > >>>
> > > >>> But I am seeing the issue in TRACE02 config which is:
> > > >>> CONFIG_NO_HZ_IDLE=y
> > > >>> # CONFIG_NO_HZ_FULL is not set
> > > >>>
> > > >>> So that is not NO_HZ_FULL:
> > > >>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > >>> However, I can't seem to find the full kernel logs for that.
> > > >>>
> > > >>> Also, other than the TRACE02 fail, I only see the issue with configs
> > > >>> with CONFIG_NO_HZ_FULL=y
> > > >>>
> > > >>> Can you try TRACE02 specifically, and see if you can reproduce the
> > > >>> same issue on your setup? Meanwhile, I'll try to trace what is
> > > >>> returning the -EBUSY.
> > > >>
> > > >> How about something simple like the following? (untested)
> > > >>
> > > >> ---8<-----------------------
> > > >>
> > > >> diff --git a/kernel/torture.c b/kernel/torture.c
> > > >> index bc8fb361efc0..cd64110694c0 100644
> > > >> --- a/kernel/torture.c
> > > >> +++ b/kernel/torture.c
> > > >> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > > >>            // PCI probe frequently disables hotplug during boot.
> > > >>            (*n_offl_attempts)--;
> > > >>            s = " (-EBUSY forgiven during boot)";
> > > >> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
> > > >> +            (*n_offl_attempts)--;
> > > >> +            s = " (-EBUSY forgiven if nohz_full is running)";
> > > >
> > > > But this should be forgiven for the timekeeping CPU, not everyone,
> > > > correct?
> > > >
> > > > Yes, I know that CPU-hotplug operations can fail, but in my testing
> > > > they almost never do.  This means that a new failure might well be a
> > > > real bug somewhere that needs attention.
> > >
> > > Sure. We may need to expose some API to reveal that.
> > >
> > > It appeared though that Thomas in the other thread related to patch
> > > from Zhouyi, was suggesting that rcutorture tolerate hotplug failure
> > > though, because they are not abnormal, right?
> >
> > Based on my rcutorture testing experience on x86, they are not at all
> > normal.  The only time I have seen rcutorture CPU-hotplug failures has
> > been due to some bug that needed fixing.
> 
> I see, ok. I need to debug what is returning -EBUSY for !NO_HZ_FULL on
> arm64, I will report back once I do.
> 
> Meanwhile, Marc I am wondering if you are able to reproduce the issue
> on your side on TRACE02 config, like I am?
> 
> Here is the TRACE02 config fragment:
> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/ConfigFragment/*view*/
> 
> Here are instructions on how to run it (torture test parameters etc)
> if you are loading the module yourself:
> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/bare-metal/*view*/

I am assuming that this is directed to someone having easy access
to ARM hardware.

> > Is there a plan to make CPU hotplug failures more frequent?
> 
> I am not aware of such a plan but I was going by "There are quite some
> reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> not a fatal problem, really." in [1].
> 
> What about an rcutorture to skip hotplug for a certain cpu id,
> rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> should debug this issue more before getting to that.

Yes, in fact there already are some checks along those lines, for example,
the torture_offline() function's check of cpu_is_hotpluggable().  So for
example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
the housekeeping CPU as !cpu_is_hotpluggable().

And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
architecture specifies a .cpu_can_disable() function.

So architectures that don't want specific CPUs to be hotpluggable
can and should so specify.
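
A simplified userspace model of the default rule described above, for
illustration only. The struct and function names here are made up for
the sketch and are not the kernel's definitions; only the behaviour
(CPU 0 is not hotpluggable unless the architecture supplies a
cpu_can_disable-style callback) follows the description:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a per-platform SMP operations table. */
struct model_smp_ops {
	bool (*cpu_can_disable)(unsigned int cpu);	/* optional callback */
};

/*
 * Model of the decision: defer to the platform callback if there is
 * one, otherwise allow every CPU except CPU 0 to be offlined.
 */
bool model_can_hotplug_cpu(const struct model_smp_ops *ops, unsigned int cpu)
{
	if (ops && ops->cpu_can_disable)
		return ops->cpu_can_disable(cpu);
	return cpu != 0;
}

/* Example callback for a platform on which every CPU may be disabled. */
bool model_all_cpus_ok(unsigned int cpu)
{
	(void)cpu;
	return true;
}
```

With no callback, the model reports CPU 0 as not hotpluggable, so a test
that honours the flag would never try to offline it.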

							Thanx, Paul

> Thanks,
> 
> - Joel
> [1] https://lkml.org/lkml/2022/11/27/182

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17 20:42               ` Paul E. McKenney
@ 2023-01-18  2:17                 ` Joel Fernandes
  2023-01-18  4:00                   ` Paul E. McKenney
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-18  2:17 UTC (permalink / raw)
  To: paulmck
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

Hi Paul,

On Tue, Jan 17, 2023 at 8:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Jan 17, 2023 at 08:02:24PM +0000, Joel Fernandes wrote:
> > On Tue, Jan 17, 2023 at 4:54 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Mon, Jan 16, 2023 at 11:36:57PM -0500, Joel Fernandes wrote:
> > > > > On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
> > > > >>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > > >>> Hi Zhouyi,
> > > > >>>
> > > > >>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > > >>>>
> > > > >>> [..]
> > > > >>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >>>>>
> > > > >>>>> Hello,
> > > > >>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > >>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > >>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> > > > >>>>>
> > > > >>>>> This causes warnings in torture tests:
> > > > >>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > >>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > >>>>>
> > > > >>>>> Full kernel log here:
> > > > >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > > >>>>>
> > > > >>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > >>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> > > > >>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> > > > >>>>> appears remove_cpu() -> device_offline() is what returns the error.
> > > > >>>>>
> > > > >>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
> > > > >>>> nohz_full mode, which prevent that cpu from
> > > > >>>> going offline [1]. We have discussed this topic, but there is no
> > > > >>>> agreement on how to solve it yet.
> > > > >>>
> > > > >>> But I am seeing the issue in TRACE02 config which is:
> > > > >>> CONFIG_NO_HZ_IDLE=y
> > > > >>> # CONFIG_NO_HZ_FULL is not set
> > > > >>>
> > > > >>> So that is not NO_HZ_FULL:
> > > > >>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > > >>> However, I can't seem to find the full kernel logs for that.
> > > > >>>
> > > > >>> Also, other than the TRACE02 fail, I only see the issue with configs
> > > > >>> with CONFIG_NO_HZ_FULL=y
> > > > >>>
> > > > >>> Can you try TRACE02 specifically, and see if you can reproduce the
> > > > >>> same issue on your setup? Meanwhile, I'll try to trace what is
> > > > >>> returning the -EBUSY.
> > > > >>
> > > > >> How about something simple like the following? (untested)
> > > > >>
> > > > >> ---8<-----------------------
> > > > >>
> > > > >> diff --git a/kernel/torture.c b/kernel/torture.c
> > > > >> index bc8fb361efc0..cd64110694c0 100644
> > > > >> --- a/kernel/torture.c
> > > > >> +++ b/kernel/torture.c
> > > > >> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > > > >>            // PCI probe frequently disables hotplug during boot.
> > > > >>            (*n_offl_attempts)--;
> > > > >>            s = " (-EBUSY forgiven during boot)";
> > > > >> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
> > > > >> +            (*n_offl_attempts)--;
> > > > >> +            s = " (-EBUSY forgiven if nohz_full is running)";
> > > > >
> > > > > But this should be forgiven for the timekeeping CPU, not everyone,
> > > > > correct?
> > > > >
> > > > > Yes, I know that CPU-hotplug operations can fail, but in my testing
> > > > > they almost never do.  This means that a new failure might well be a
> > > > > real bug somewhere that needs attention.
> > > >
> > > > Sure. We may need to expose some API to reveal that.
> > > >
> > > > It appeared though that Thomas in the other thread related to patch
> > > > from Zhouyi, was suggesting that rcutorture tolerate hotplug failure
> > > > though, because they are not abnormal, right?
> > >
> > > Based on my rcutorture testing experience on x86, they are not at all
> > > normal.  The only time I have seen rcutorture CPU-hotplug failures has
> > > been due to some bug that needed fixing.
> >
> > I see, ok. I need to debug what is returning -EBUSY for !NO_HZ_FULL on
> > arm64, I will report back once I do.
> >
> > Meanwhile, Marc I am wondering if you are able to reproduce the issue
> > on your side on TRACE02 config, like I am?
> >
> > Here is the TRACE02 config fragment:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/ConfigFragment/*view*/
> >
> > Here are instructions on how to run it (torture test parameters etc)
> > if you are loading the module yourself:
> > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/bare-metal/*view*/
>
> I am assuming that this is directed to someone having easy access
> to ARM hardware.
>
> > > Is there a plan to make CPU hotplug failures more frequent?
> >
> > I am not aware of such a plan but I was going by "There are quite some
> > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > not a fatal problem, really." in [1].
> >
> > What about an rcutorture to skip hotplug for a certain cpu id,
> > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > should debug this issue more before getting to that.
>
> Yes, in fact there already are some checks along those lines, for example,
> the torture_offline() function's check of cpu_is_hotpluggable().  So for
> example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> the housekeeping CPU as !cpu_is_hotpluggable().

I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
not seeing it). Even on x86, if you enable
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y and CONFIG_NO_HZ_FULL=y, and run
rcutorture with boot args:

nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
rcutorture.shutdown_secs=30

You will see this in the kernel logs:
[    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
[    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16

So the RCU torture test clearly thought the CPUs were hot-pluggable, even
though there was a chance for them to return -EBUSY (due to housekeeping
and whatnot). So this issue seems to be architecture independent, in that
sense.

So the 2 ways forward I see are:
- Make the torture test aware of which CPUs are 'house keeping'
- Make it possible to turn off CPU0 hotplugging on ARM64 by default
(via CONFIG or boot option).

Another option could be, forgive -EBUSY on CPU0 for
CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
housekeeping CPU?

Adding Frederic to CC as well as we are talking about
housekeeping/isolation stuff.

> And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> architecture specifies a .cpu_can_disable() function.

Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).

Thanks,

 - Joel


> So architectures that don't want specific CPUs to be hotpluggable
> can and should so specify.
>
>                                                         Thanx, Paul
>
> > Thanks,
> >
> > - Joel
> > [1] https://lkml.org/lkml/2022/11/27/182

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18  2:17                 ` Joel Fernandes
@ 2023-01-18  4:00                   ` Paul E. McKenney
  2023-01-18 16:51                     ` Will Deacon
  2023-01-18 22:37                     ` Joel Fernandes
  0 siblings, 2 replies; 34+ messages in thread
From: Paul E. McKenney @ 2023-01-18  4:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Wed, Jan 18, 2023 at 02:17:06AM +0000, Joel Fernandes wrote:
> Hi Paul,
> 
> On Tue, Jan 17, 2023 at 8:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 08:02:24PM +0000, Joel Fernandes wrote:
> > > On Tue, Jan 17, 2023 at 4:54 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 11:36:57PM -0500, Joel Fernandes wrote:
> > > > > > On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
> > > > > >>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > > > >>> Hi Zhouyi,
> > > > > >>>
> > > > > >>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > > > >>>>
> > > > > >>> [..]
> > > > > >>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > >>>>>
> > > > > >>>>> Hello,
> > > > > >>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > > >>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > > >>>>> also seeing this in 6.1-rc kernels. I believe see this only for CPU0.
> > > > > >>>>>
> > > > > >>>>> This causes warnings in torture tests:
> > > > > >>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > >>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > >>>>>
> > > > > >>>>> Full kernel log here:
> > > > > >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > > > >>>>>
> > > > > >>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > > >>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> > > > > >>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> > > > > >>>>> appears remove_cpu() -> device_offline() is what returns the error.
> > > > > >>>>>
> > > > > >>>> I guess this probably because CPU 0 is the tick_do_timer_cpu in
> > > > > >>>> nohz_full mode, which prevent that cpu from
> > > > > >>>> going offline [1]. We have discussed this topic, but there is no
> > > > > >>>> agreement on how to solve it yet.
> > > > > >>>
> > > > > >>> But I am seeing the issue in TRACE02 config which is:
> > > > > >>> CONFIG_NO_HZ_IDLE=y
> > > > > >>> # CONFIG_NO_HZ_FULL is not set
> > > > > >>>
> > > > > >>> So that is not NO_HZ_FULL:
> > > > > >>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > > > >>> However, I can't seem to find the full kernel logs for that.
> > > > > >>>
> > > > > >>> Also, other than the TRACE02 fail, I only see the issue with configs
> > > > > >>> with CONFIG_NO_HZ_FULL=y
> > > > > >>>
> > > > > >>> Can you try TRACE02 specifically, and see if you can reproduce the
> > > > > >>> same issue on your setup? Meanwhile, I'll try to trace what is
> > > > > >>> returning the -EBUSY.
> > > > > >>
> > > > > >> How about something simple like the following? (untested)
> > > > > >>
> > > > > >> ---8<-----------------------
> > > > > >>
> > > > > >> diff --git a/kernel/torture.c b/kernel/torture.c
> > > > > >> index bc8fb361efc0..cd64110694c0 100644
> > > > > >> --- a/kernel/torture.c
> > > > > >> +++ b/kernel/torture.c
> > > > > >> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > > > > >>            // PCI probe frequently disables hotplug during boot.
> > > > > >>            (*n_offl_attempts)--;
> > > > > >>            s = " (-EBUSY forgiven during boot)";
> > > > > >> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
> > > > > >> +            (*n_offl_attempts)--;
> > > > > >> +            s = " (-EBUSY forgiven if nohz_full is running)";
> > > > > >
> > > > > > But this should be forgiven for the timekeeping CPU, not everyone,
> > > > > > correct?
> > > > > >
> > > > > > Yes, I know that CPU-hotplug operations can fail, but in my testing
> > > > > > they almost never do.  This means that a new failure might well be a
> > > > > > real bug somewhere that needs attention.
> > > > >
> > > > > Sure. We may need to expose some API to reveal that.
> > > > >
> > > > > It appeared though that Thomas in the other thread related to patch
> > > > > from Zhouyi, was suggesting that rcutorture tolerate hotplug failure
> > > > > though, because they are not abnormal, right?
> > > >
> > > > Based on my rcutorture testing experience on x86, they are not at all
> > > > normal.  The only time I have seen rcutorture CPU-hotplug failures has
> > > > been due to some bug that needed fixing.
> > >
> > > I see, ok. I need to debug what is returning -EBUSY for !NO_HZ_FULL on
> > > arm64, I will report back once I do.
> > >
> > > Meanwhile, Marc I am wondering if you are able to reproduce the issue
> > > on your side on TRACE02 config, like I am?
> > >
> > > Here is the TRACE02 config fragment:
> > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/ConfigFragment/*view*/
> > >
> > > Here are instructions on how to run it (torture test parameters etc)
> > > if you are loading the module yourself:
> > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/bare-metal/*view*/
> >
> > I am assuming that this is directed to someone having easy access
> > to ARM hardware.
> >
> > > > Is there a plan to make CPU hotplug failures more frequent?
> > >
> > > I am not aware of such a plan but I was going by "There are quite some
> > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > not a fatal problem, really." in [1].
> > >
> > > What about an rcutorture to skip hotplug for a certain cpu id,
> > > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > > should debug this issue more before getting to that.
> >
> > Yes, in fact there already are some checks along those lines, for example,
> > the torture_offline() function's check of cpu_is_hotpluggable().  So for
> > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > the housekeeping CPU as !cpu_is_hotpluggable().
> 
> I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> not seeing it). Even on x86, if you enable
> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> rcutorture with boot args:
> 
> nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> rcutorture.shutdown_secs=30
> 
> You will see this in the kernel logs:
> [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> 
> So RCU torture test clearly thought the CPUs were hot-pluggable, when
> they was chance for them to return -EBUSY (due to housekeeping and
> what not). So this issue seems to be architecture independent, in that
> sense.
> 
> So the 2 ways forward I see are:
> - Make the torture test aware of which CPUs are 'house keeping'
> - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> (via CONFIG or boot option).
> 
> Another option could be, forgive -EBUSY on CPU0 for
> CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> housekeeping CPU?

I would be happier to forgive failure to offline housekeeping CPUs than
blanket forgiveness of CPU 0.  Especially given that I recently got
burned by a non-zero boot cpu.  ;-)

But wouldn't it be even better for cpu_is_hotpluggable() to know the
NO_HZ_FULL rules of the road?

> Adding Frederic to CC as well as we are talking about
> housekeeping/isolation stuff.

But as you say, perhaps Frederic has a better idea.

> > And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> > And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> > architecture specifies a .cpu_can_disable() function.
> 
> Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).

Apologies!  I will look more carefully at the pathnames next time!

But maybe arm64 needs something similar?

							Thanx, Paul

> Thanks,
> 
>  - Joel
> 
> 
> > So architectures that don't want specific CPUs to be hotpluggable
> > can and should so specify.
> >
> >                                                         Thanx, Paul
> >
> > > Thanks,
> > >
> > > - Joel
> > > [1] https://lkml.org/lkml/2022/11/27/182


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-17 11:42               ` Zhouyi Zhou
  2023-01-17 19:50                 ` Joel Fernandes
@ 2023-01-18 10:15                 ` Zhouyi Zhou
  2023-01-18 15:51                   ` Joel Fernandes
  1 sibling, 1 reply; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-18 10:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Tue, Jan 17, 2023 at 7:42 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 12:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >
> > > On Jan 16, 2023, at 10:15 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 9:45 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >>
> > >>> On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
> > >>> On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >>>>
> > >>>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > >>>>> Hi Zhouyi,
> > >>>>>
> > >>>>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > >>>>>>
> > >>>>> [..]
> > >>>>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >>>>>>>
> > >>>>>>> Hello,
> > >>>>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> > >>>>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > >>>>>>> also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > >>>>>>>
> > >>>>>>> This causes warnings in torture tests:
> > >>>>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > >>>>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > >>>>>>>
> > >>>>>>> Full kernel log here:
> > >>>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > >>>>>>>
> > >>>>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> > >>>>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> > >>>>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> > >>>>>>> appears remove_cpu() -> device_offline() is what returns the error.
> > >>>>>>>
> > >>>>>> I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > >>>>>> nohz_full mode, which prevents that CPU from going offline [1].
> > >>>>>> We have discussed this topic, but there is no agreement on how to
> > >>>>>> solve it yet.
> > >>>>>
> > >>>>> But I am seeing the issue in TRACE02 config which is:
> > >>>>> CONFIG_NO_HZ_IDLE=y
> > >>>>> # CONFIG_NO_HZ_FULL is not set
> > >>>>>
> > >>>>> So that is not NO_HZ_FULL:
> > >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > >>>>> However, I can't seem to find the full kernel logs for that.
> > >>>>>
> > >>>>> Also, other than the TRACE02 fail, I only see the issue with configs
> > >>>>> with CONFIG_NO_HZ_FULL=y
> > >>>>>
> > >>>>> Can you try TRACE02 specifically, and see if you can reproduce the
> > >>>>> same issue on your setup? Meanwhile, I'll try to trace what is
> > >>>>> returning the -EBUSY.
> > >>> I am trying TRACE02 on my x86_64 machine now, using a cross-compiled
> > >>> kernel and qemu-system-aarch64. My equipment is limited, but I hope I
> > >>> can be of benefit to the community ;-)
> > >>
> > >> Cool, I am assuming you are trying the patch you shared which you wrote in
> > >> November. I bet you will still see the issue.
> > > yes, I still see the issue with no hz full.
> > >>
> > >>>>
> > >>>> How about something simple like the following? (untested)
> > >>>>
> > >>>> ---8<-----------------------
> > >>>>
> > >>>> diff --git a/kernel/torture.c b/kernel/torture.c
> > >>>> index bc8fb361efc0..cd64110694c0 100644
> > >>>> --- a/kernel/torture.c
> > >>>> +++ b/kernel/torture.c
> > >>>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > >>>>                        // PCI probe frequently disables hotplug during boot.
> > >>>>                        (*n_offl_attempts)--;
> > >>>>                        s = " (-EBUSY forgiven during boot)";
> > >>>> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > >>>> +                       (*n_offl_attempts)--;
> > >>>> +                       s = " (-EBUSY forgiven if nohz_full is running)";
> > >>> Fantastic fix!! thus we can fix the time keeper cpu torture problem
> > >>> without touch the time keeper code.
> > >>
> > >> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
> > >> you shared does not fix it either -- because TRACE02 is not a no-hz-full
> > >> test. :-(
> > >>
> > >> We will need to do a bit of tracing to figure out where the -EBUSY is coming
> > >> from for TRACE02.
> > > Agreed, TRACE02 is another issue. Unfortunately I can't reproduce the
> > > bug, either with your original Image [1]
> > > or with my cross-compiled kernel using [2].
> > >
> > > I guess there may be two reasons:
> > > 1) my testbed is X86_64 based.
> > > 2) the command that I invoke qemu is not right:
> > > 2-1) the newly compiled linux-5.15.89-rc1
> > > qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4
> >
> > Does 8 CPUs make any difference? That is my setup.
> 8 CPUs make no difference ;-(
> >
> > Not sure what else is different. It could be a CPU-model-specific issue, or something. But why don't you just use the same setup you used in November and check TRACE02? That is actually what I was requesting you to test, since you saw the same issue on that setup.
> I guess it may be a CPU model specific issue, while I can't invoke
> qemu-system-aarch64  with  "-machine virt,gic-version=host -cpu host"
> because I didn't have an aarch64 bare metal host.
>
> OK, I am doing the same setup on linux-5.15.y as I did last November
> in the PPC VM of the Open Source Lab of Oregon State University. This will
> take about 20 hours, and I will report what I find after the test finishes.
There are some problems launching linux-5.15.y in the PPC VM of the
Open Source Lab of Oregon State University. I am still working out why, so
I can't report the test result today, I'm sorry ;-(. I will report to
you as soon as I have made any progress.

Best Regards
Thank you all for your guidance, I learned a lot in this process.
Zhouyi
>
> Thanks
> Zhouyi
> >
> > Thanks,
> >
> > Joel
> >
> >
> >
> > > -serial file:/tmp/consoleJan1702.log  -kernel arch/arm64/boot/Image
> > > -append "console=ttyAMA0 oops=panic panic_on_warn=1 panic=-1
> > > ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
> > > rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
> > > rcutorture.onoff_holdoff=1000 rcutorture.n_barrier_cbs=4
> > > rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
> > > test_no_idle_hz=1 verbose=1" -m 2048 -net user,hostfwd=tcp::10024-:22
> > > -net nic
> > > 2-2) original Image [1]
> > > qemu-system-aarch64 -machine virt   -cpu cortex-a57   -nographic -smp
> > > 4  -serial file:/tmp/consoleJan1701.log   -kernel /home/zzy/Image
> > > -append "console=ttyAMA0  oops=panic panic_on_warn=1 panic=-1
> > > ftrace_dump_on_oops=orig_cpu debug earlyprintk=serial slub_debug=UZ
> > > rcutorture.torture_type=tasks-tracing rcutorture.onoff_interval=1000
> > > rcutorture.onoff_holdoff=30 n_barrier_cbs=4
> > > rcutorture.stat_interval=15 rcutorture.shutdown_secs=1200
> > > test_no_idle_hz=1 verbose=1"   -m 2048   -net
> > > user,hostfwd=tcp::10023-:22 -net nic
> > >
> > > As Mark can reproduce the issue using [1], there must be something
> > > wrong with my x86_64 based environment.
> > >
> > > Sorry not to be of help this time.
> > >
> > > I am very happy and interested to perform further tests whenever there
> > > are further instructions ;-)
> > >
> > > [1] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/Image
> > > [2] http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/.config
> > >>
> > >> I wonder if we should ignore -EBUSY altogether, since as Thomas mentioned,
> > >> hotplug failure is "normal". Thoughts?
> > > This decision is too important for a beginner like me; however, many
> > > thanks for your trust in me ;-) What does Paul think about it? ;-)
> > >
> > > Thanks
> > > Zhouyi
> > >>
> > >> thanks,
> > >>
> > >> - Joel
> > >>


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 10:15                 ` Zhouyi Zhou
@ 2023-01-18 15:51                   ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-18 15:51 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Paul E. McKenney

On Wed, Jan 18, 2023 at 10:15 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 7:42 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 12:34 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > >
> > >
> > > > On Jan 16, 2023, at 10:15 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 9:45 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >>
> > > >>> On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
> > > >>> On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >>>>
> > > >>>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > >>>>> Hi Zhouyi,
> > > >>>>>
> > > >>>>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > >>>>>>
> > > >>>>> [..]
> > > >>>>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >>>>>>>
> > > >>>>>>> Hello,
> > > >>>>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > >>>>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > >>>>>>> also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > > >>>>>>>
> > > >>>>>>> This causes warnings in torture tests:
> > > >>>>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >>>>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >>>>>>>
> > > >>>>>>> Full kernel log here:
> > > >>>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > >>>>>>>
> > > >>>>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
> > > >>>>>>> boot CPU)? I'd personally need these warnings to go away for my tests
> > > >>>>>>> as this causes rcutorture's tests to not cleanly pass for me. It
> > > >>>>>>> appears remove_cpu() -> device_offline() is what returns the error.
> > > >>>>>>>
> > > >>>>>> I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > > >>>>>> nohz_full mode, which prevents that CPU from going offline [1].
> > > >>>>>> We have discussed this topic, but there is no agreement on how to
> > > >>>>>> solve it yet.
> > > >>>>>
> > > >>>>> But I am seeing the issue in TRACE02 config which is:
> > > >>>>> CONFIG_NO_HZ_IDLE=y
> > > >>>>> # CONFIG_NO_HZ_FULL is not set
> > > >>>>>
> > > >>>>> So that is not NO_HZ_FULL:
> > > >>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > >>>>> However, I can't seem to find the full kernel logs for that.
> > > >>>>>
> > > >>>>> Also, other than the TRACE02 fail, I only see the issue with configs
> > > >>>>> with CONFIG_NO_HZ_FULL=y
> > > >>>>>
> > > >>>>> Can you try TRACE02 specifically, and see if you can reproduce the
> > > >>>>> same issue on your setup? Meanwhile, I'll try to trace what is
> > > >>>>> returning the -EBUSY.
> > > >>> I am trying TRACE02 on my x86_64 machine now, using a cross-compiled
> > > >>> kernel and qemu-system-aarch64. My equipment is limited, but I hope I
> > > >>> can be of benefit to the community ;-)
> > > >>
> > > >> Cool, I am assuming you are trying the patch you shared which you wrote in
> > > >> November. I bet you will still see the issue.
> > > > yes, I still see the issue with no hz full.
> > > >>
> > > >>>>
> > > >>>> How about something simple like the following? (untested)
> > > >>>>
> > > >>>> ---8<-----------------------
> > > >>>>
> > > >>>> diff --git a/kernel/torture.c b/kernel/torture.c
> > > >>>> index bc8fb361efc0..cd64110694c0 100644
> > > >>>> --- a/kernel/torture.c
> > > >>>> +++ b/kernel/torture.c
> > > >>>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > > >>>>                        // PCI probe frequently disables hotplug during boot.
> > > >>>>                        (*n_offl_attempts)--;
> > > >>>>                        s = " (-EBUSY forgiven during boot)";
> > > >>>> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > > >>>> +                       (*n_offl_attempts)--;
> > > >>>> +                       s = " (-EBUSY forgiven if nohz_full is running)";
> > > >>> Fantastic fix!! thus we can fix the time keeper cpu torture problem
> > > >>> without touch the time keeper code.
> > > >>
> > > >> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
> > > >> you shared does not fix it either -- because TRACE02 is not a no-hz-full
> > > >> test. :-(
> > > >>
> > > >> We will need to do a bit of tracing to figure out where the -EBUSY is coming
> > > >> from for TRACE02.
> > > > Agreed, TRACE02 is another issue. Unfortunately I can't reproduce the
> > > > bug, either with your original Image [1]
> > > > or with my cross-compiled kernel using [2].
> > > >
> > > > I guess there may be two reasons:
> > > > 1) my testbed is X86_64 based.
> > > > 2) the command that I invoke qemu is not right:
> > > > 2-1) the newly compiled linux-5.15.89-rc1
> > > > qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4
> > >
> > > Does 8 CPUs make any difference? That is my setup.
> > 8 CPUs make no difference ;-(
> > >
> > > Not sure what else is different. It could be a CPU-model-specific issue, or something. But why don't you just use the same setup you used in November and check TRACE02? That is actually what I was requesting you to test, since you saw the same issue on that setup.
> > I guess it may be a CPU model specific issue, while I can't invoke
> > qemu-system-aarch64  with  "-machine virt,gic-version=host -cpu host"
> > because I didn't have an aarch64 bare metal host.
> >
> > OK, I am doing the same setup on linux-5.15.y as I did last November
> > in the PPC VM of the Open Source Lab of Oregon State University. This will
> > take about 20 hours, and I will report what I find after the test finishes.
> There are some problems launching linux-5.15.y in the PPC VM of the
> Open Source Lab of Oregon State University. I am still working out why, so
> I can't report the test result today, I'm sorry ;-(. I will report to
> you as soon as I have made any progress.
>

No problem! Looking forward to seeing what you see.

On my side I am not able to reproduce the TRACE02 issue any more, so
I think it was a one-off. However, I do see arm64 hotplug failures, as
expected, on all the other configs, i.e. those with CONFIG_NO_HZ_FULL=y.

Will debug / work on a solution for that a bit later today hopefully.

Thanks,

 - Joel


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18  4:00                   ` Paul E. McKenney
@ 2023-01-18 16:51                     ` Will Deacon
  2023-01-18 17:56                       ` Paul E. McKenney
  2023-01-18 22:01                       ` Joel Fernandes
  2023-01-18 22:37                     ` Joel Fernandes
  1 sibling, 2 replies; 34+ messages in thread
From: Will Deacon @ 2023-01-18 16:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Zhouyi Zhou,
	moderated list:ARM/STM32 ARCHITECTURE, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Frederic Weisbecker

On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 18, 2023 at 02:17:06AM +0000, Joel Fernandes wrote:
>
> I would be happier to forgive failure to offline housekeeping CPUs than
> blanket forgiveness of CPU 0.  Especially given that I recently got
> burned by a non-zero boot cpu.  ;-)
> 
> But wouldn't it be even better for cpu_is_hotpluggable() to know the
> NO_HZ_FULL rules of the road?
> 
> > Adding Frederic to CC as well as we are talking about
> > housekeeping/isolation stuff.
> 
> But as you say, perhaps Frederic has a better idea.
> 
> > > And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> > > And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> > > architecture specifies a .cpu_can_disable() function.
> > 
> > Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).
> 
> Apologies!  I will look more carefully at the pathnames next time!
> 
> But maybe arm64 needs something similar?

Just chiming in quickly from the arm64 side here, but there's nothing in the
architecture that precludes offlining CPU 0 and it certainly works on some
platforms, so I'd be hesitant to rule it out entirely for testing.

One reason why hotplug can fail in practice is if a trusted OS (i.e. code
running on the secure side of the fence outside of Linux's view of the
world) is resident on a core and rejects firmware requests to power it
off. The PSCI code (drivers/firmware/psci/) should detect this and return
-EPERM, although earlier in this thread there was mention of -EBUSY so it
sounds like something else...

Will


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 16:51                     ` Will Deacon
@ 2023-01-18 17:56                       ` Paul E. McKenney
  2023-01-18 22:01                       ` Joel Fernandes
  1 sibling, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2023-01-18 17:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Joel Fernandes, Zhouyi Zhou,
	moderated list:ARM/STM32 ARCHITECTURE, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Frederic Weisbecker

On Wed, Jan 18, 2023 at 04:51:22PM +0000, Will Deacon wrote:
> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 18, 2023 at 02:17:06AM +0000, Joel Fernandes wrote:
> >
> > I would be happier to forgive failure to offline housekeeping CPUs than
> > blanket forgiveness of CPU 0.  Especially given that I recently got
> > burned by a non-zero boot cpu.  ;-)
> > 
> > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > NO_HZ_FULL rules of the road?
> > 
> > > Adding Frederic to CC as well as we are talking about
> > > housekeeping/isolation stuff.
> > 
> > But as you say, perhaps Frederic has a better idea.
> > 
> > > > And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> > > > And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> > > > architecture specifies a .cpu_can_disable() function.
> > > 
> > > Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).
> > 
> > Apologies!  I will look more carefully at the pathnames next time!
> > 
> > But maybe arm64 needs something similar?
> 
> Just chiming in quickly from the arm64 side here, but there's nothing in the
> architecture that precludes offlining CPU 0 and it certainly works on some
> platforms, so I'd be hesitant to rule it out entirely for testing.
> 
> One reason why hotplug can fail in practice is if a trusted OS (i.e. code
> running on the secure side of the fence outside of Linux's view of the
> world) is resident on a core and rejects firmware requests to power it
> off. The PSCI code (drivers/firmware/psci/) should detect this and return
> -EPERM, although earlier in this thread there was mention of -EBUSY so it
> sounds like something else...

We can certainly special-case -EPERM in rcutorture.  But what should we
expect?  Would this be a random encounter with a trusted OS, or should we
expect that a given trusted OS instance would grab a given CPU long-term?
My guess is the former, but I do feel the need to ask.  ;-)
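
For illustration only, the special-casing discussed here might be modelled
in user space roughly as follows. This is a sketch under stated assumptions,
not code from kernel/torture.c: the enum, the function name
classify_offline() and its two flags are hypothetical, and whether -EPERM
should be forgiven at all depends on the answer to the question above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical verdicts; these names do not exist in kernel/torture.c. */
enum offline_verdict {
	OFFLINE_OK,		/* the CPU went offline */
	OFFLINE_FORGIVEN,	/* expected refusal, not counted as a failure */
	OFFLINE_FAILED,		/* unexpected error, should be reported */
};

/*
 * Classify the return value of one offline attempt.
 * "booting" stands in for the existing "-EBUSY forgiven during boot" window,
 * "nohz_full_running" for the kernel's tick_nohz_full_running flag.
 */
static enum offline_verdict classify_offline(int ret, bool booting,
					     bool nohz_full_running)
{
	assert(ret <= 0);	/* kernel convention: 0 or a negative errno */

	if (ret == 0)
		return OFFLINE_OK;
	if (ret == -EBUSY && (booting || nohz_full_running))
		return OFFLINE_FORGIVEN;
	if (ret == -EPERM)	/* assumed: a trusted OS refused the power-off */
		return OFFLINE_FORGIVEN;
	return OFFLINE_FAILED;
}
```

With this shape, a plain -EBUSY outside of boot and without nohz_full would
still be reported, so an unexplained failure like the TRACE02 one stays
visible.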

							Thanx, Paul


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 16:51                     ` Will Deacon
  2023-01-18 17:56                       ` Paul E. McKenney
@ 2023-01-18 22:01                       ` Joel Fernandes
  2023-01-19  9:12                         ` Mark Rutland
  1 sibling, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-18 22:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Zhouyi Zhou,
	moderated list:ARM/STM32 ARCHITECTURE, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu, Frederic Weisbecker

Hey Will,

On Wed, Jan 18, 2023 at 4:51 PM Will Deacon <will@kernel.org> wrote:
>
> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 18, 2023 at 02:17:06AM +0000, Joel Fernandes wrote:
> >
> > I would be happier to forgive failure to offline housekeeping CPUs than
> > blanket forgiveness of CPU 0.  Especially given that I recently got
> > burned by a non-zero boot cpu.  ;-)
> >
> > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > NO_HZ_FULL rules of the road?
> >
> > > Adding Frederic to CC as well as we are talking about
> > > housekeeping/isolation stuff.
> >
> > But as you say, perhaps Frederic has a better idea.
> >
> > > > And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> > > > And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> > > > architecture specifies a .cpu_can_disable() function.
> > >
> > > Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).
> >
> > Apologies!  I will look more carefully at the pathnames next time!
> >
> > But maybe arm64 needs something similar?
>
> Just chiming in quickly from the arm64 side here, but there's nothing in the
> architecture that precludes offlining CPU 0 and it certainly works on some
> platforms, so I'd be hesitant to rule it out entirely for testing.
>
> One reason why hotplug can fail in practice is if a trusted OS (i.e. code
> running on the secure side of the fence outside of Linux's view of the
> world) is resident on a core and rejects firmware requests to power it
> off. The PSCI code (drivers/firmware/psci/) should detect this and return
> -EPERM, although earlier in this thread there was mention of -EBUSY so it
> sounds like something else...

Thank you for the heads up on that. To give you context, I am
currently testing rcutorture on stable kernels 5.10, 5.15, 6.1 on my
ARM64 QC7180 board. I certainly don't want to hit the -EPERM in the
future on this or other ARM64 hardware. It would be great if
cpu_psci_cpu_can_disable() in arm64 can return false if hotplugging
causes -EPERM indefinitely. Then we do not need to make any changes.
This is similar to the idea Paul mentioned in an earlier thread where
the ARCH can disable the hotplug and make it clear the CPU removal is
off limits.

Meanwhile, I am also looking into whether the inability to offline the
housekeeping CPU (which returns -EBUSY) can be encoded somehow in the
cpu_is_hotpluggable() logic (also an idea from Paul). That does not
appear to be arch-specific code, though.

Thanks,

- Joel


* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18  4:00                   ` Paul E. McKenney
  2023-01-18 16:51                     ` Will Deacon
@ 2023-01-18 22:37                     ` Joel Fernandes
  2023-01-18 22:39                       ` Joel Fernandes
  2023-01-19 13:57                       ` Frederic Weisbecker
  1 sibling, 2 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-18 22:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
[...]
> > > > > Is there a plan to make CPU hotplug failures more frequent?
> > > >
> > > > I am not aware of such a plan but I was going by "There are quite some
> > > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > > not a fatal problem, really." in [1].
> > > >
> > > > What about an rcutorture option to skip hotplug for a certain CPU id, e.g.
> > > > rcutorture.skip_hotplug_cpus="0"? It can be a last resort. But we/I
> > > > should debug this issue more before getting to that.
> > >
> > > Yes, in fact there already are some checks along those lines, for example,
> > > the torture_offline() function's check of cpu_is_hotpluggable().  So for
> > > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > > the housekeeping CPU as !cpu_is_hotpluggable().
> > 
> > I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> > not seeing it). Even on x86, if you enable
> > CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> > rcutorture with boot args:
> > 
> > nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> > rcutorture.shutdown_secs=30
> > 
> > You will see this in the kernel logs:
> > [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > 
> > So RCU torture test clearly thought the CPUs were hot-pluggable, when
> > there was a chance for them to return -EBUSY (due to housekeeping and
> > what not). So this issue seems to be architecture independent, in that
> > sense.
> > 
> > So the 2 ways forward I see are:
> > - Make the torture test aware of which CPUs are 'house keeping'
> > - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> > (via CONFIG or boot option).
> > 
> > Another option could be, forgive -EBUSY on CPU0 for
> > CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> > housekeeping CPU?
> 
> I would be happier to forgive failure to offline housekeeping CPUs than
> blanket forgiveness of CPU 0.  Especially given that I recently got
> burned by a non-zero boot cpu.  ;-)
> 
> But wouldn't it be even better for cpu_is_hotpluggable() to know the
> NO_HZ_FULL rules of the road?

That's a great idea. I found a way to do that without having to do the
EXPORT_SYMBOL (like in Zhouyi's patch).

Would the following be acceptable (only build-tested)?

I can run more tests and submit a patch:

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 55405ebf23ab..f73bc520b70e 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
 bool cpu_is_hotpluggable(unsigned int cpu)
 {
 	struct device *dev = get_cpu_device(cpu);
-	return dev && container_of(dev, struct cpu, dev)->hotpluggable;
+	return dev && container_of(dev, struct cpu, dev)->hotpluggable
+		&& tick_nohz_cpu_hotpluggable(cpu);
 }
 EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index bfd571f18cfd..9459fef5b857 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
 				     enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
 				       enum tick_dep_bits bit);
+extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);
 
 /*
  * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases
@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
 
 static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
 static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { }
+static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }
 
 static inline void tick_dep_set(enum tick_dep_bits bit) { }
 static inline void tick_dep_clear(enum tick_dep_bits bit) { }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 9c6f661fb436..d1cc7525240e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -522,6 +522,11 @@ static int tick_nohz_cpu_down(unsigned int cpu)
 	return 0;
 }
 
+bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
+{
+	return tick_nohz_cpu_down(cpu) == 0;
+}
+
 void __init tick_nohz_init(void)
 {
 	int cpu, ret;
-- 
2.39.0.246.g2a6d74b583-goog
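
To make the intended predicate easier to follow, here is a small user-space
model of it. It is only a sketch: the globals are stand-ins for kernel state,
model_reset() is a hypothetical helper, and the body of tick_nohz_cpu_down()
reflects my understanding of kernel/time/tick-sched.c in these kernel
versions (refuse with -EBUSY when nohz_full is running and the CPU is
tick_do_timer_cpu), since the hunk above only shows its final "return 0".
Note the polarity: the helper returns true when offlining is allowed, so it
is combined with the device flag without negation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 4	/* arbitrary size for the model */

/* Stand-ins for kernel state; not the real kernel variables. */
static bool tick_nohz_full_running;
static int tick_do_timer_cpu;
static bool dev_hotpluggable[NR_CPUS];	/* models struct cpu ->hotpluggable */

/* Hypothetical helper: mark every CPU hotpluggable and set the nohz state. */
static void model_reset(bool nohz_full, int timer_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		dev_hotpluggable[cpu] = true;
	tick_nohz_full_running = nohz_full;
	tick_do_timer_cpu = timer_cpu;
}

/* The timekeeping CPU must stay online while nohz_full is running. */
static int tick_nohz_cpu_down(unsigned int cpu)
{
	if (tick_nohz_full_running && tick_do_timer_cpu == (int)cpu)
		return -EBUSY;
	return 0;
}

/* True when the tick code would not veto taking this CPU offline. */
static bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
	return tick_nohz_cpu_down(cpu) == 0;
}

/* Hotpluggable only if the device says so AND the tick code agrees. */
static bool cpu_is_hotpluggable(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	return dev_hotpluggable[cpu] && tick_nohz_cpu_hotpluggable(cpu);
}
```

In this model the torture test would simply never pick the timekeeping CPU
while nohz_full is active, and the check follows tick_do_timer_cpu rather
than assuming it is CPU 0.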



* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 22:37                     ` Joel Fernandes
@ 2023-01-18 22:39                       ` Joel Fernandes
  2023-01-19  0:15                         ` Paul E. McKenney
  2023-01-19  3:21                         ` Zhouyi Zhou
  2023-01-19 13:57                       ` Frederic Weisbecker
  1 sibling, 2 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-18 22:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> [...]
> > > > > > Is there a plan to make CPU hotplug failures more frequent?
> > > > >
> > > > > I am not aware of such a plan but I was going by "There are quite some
> > > > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > > > not a fatal problem, really." in [1].
> > > > >
> > > > > What about an rcutorture to skip hotplug for a certain cpu id,
> > > > > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > > > > should debug this issue more before getting to that.
> > > >
> > > > Yes, in fact there already are some checks along those lines, for example,
> > > > the torture_offline() function's check of cpu_is_hotpluggable().  So for
> > > > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > > > the housekeeping CPU as !cpu_is_hotpluggable().
> > >
> > > I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> > > not seeing it). Even on x86, if you enable
> > > CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> > > rcutorture with boot args:
> > >
> > > nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> > > rcutorture.shutdown_secs=30
> > >
> > > You will see this in the kernel logs:
> > > [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > >
> > > So RCU torture test clearly thought the CPUs were hot-pluggable, when
> > > they was chance for them to return -EBUSY (due to housekeeping and
> > > what not). So this issue seems to be architecture independent, in that
> > > sense.
> > >
> > > So the 2 ways forward I see are:
> > > - Make the torture test aware of which CPUs are 'house keeping'
> > > - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> > > (via CONFIG or boot option).
> > >
> > > Another option could be, forgive -EBUSY on CPU0 for
> > > CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> > > housekeeping CPU?
> >
> > I would be happier to forgive failure to offline housekeeping CPUs than
> > blanket forgiveness of CPU 0.  Especially given that I recently got
> > burned by a non-zero boot cpu.  ;-)
> >
> > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > NO_HZ_FULL rules of the road?
>
> That's a great idea. I found a way to do that without having to do the
> EXPORT_SYMBOL (like in Zhouyi's patch).
>
> Would the following be acceptable (only build-tested)?
>
> I can run more tests and submit a patch:
>
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index 55405ebf23ab..f73bc520b70e 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
>  bool cpu_is_hotpluggable(unsigned int cpu)
>  {
>         struct device *dev = get_cpu_device(cpu);
> -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> +               && !tick_nohz_cpu_hotpluggable(cpu);

Oops, I should lose that "!" in front of tick_nohz_cpu_hotpluggable(),
but otherwise this should be OK.
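
For illustration, the corrected predicate can be modelled in plain
userspace C. This is a sketch only, not the kernel code: every model_*
name below is hypothetical, and the -EBUSY rule for the timekeeping CPU
is an assumption about what tick_nohz_cpu_down() checks when NO_HZ_FULL
is active, since the diff only shows the tail of that function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-ins for kernel state; none of these are kernel symbols. */
bool model_hotpluggable[MODEL_NR_CPUS] = { true, true, true, true };
unsigned int model_tick_do_timer_cpu = 0;	/* assumed timekeeping CPU */
bool model_nohz_full_running = true;		/* assumed: nohz_full= was given */

/* Assumed behaviour: the timekeeping CPU must stay online under nohz_full. */
int model_tick_nohz_cpu_down(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	if (model_nohz_full_running && cpu == model_tick_do_timer_cpu)
		return -EBUSY;
	return 0;
}

/* Same shape as the helper in the patch: hotpluggable iff offlining is allowed. */
bool model_tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
	return model_tick_nohz_cpu_down(cpu) == 0;
}

/* Corrected predicate: the tick check is not negated. */
bool model_cpu_is_hotpluggable(unsigned int cpu)
{
	return cpu < MODEL_NR_CPUS && model_hotpluggable[cpu] &&
	       model_tick_nohz_cpu_hotpluggable(cpu);
}
```

With the stray "!" left in, the model would report the timekeeping CPU
as the only hotpluggable one, which is the opposite of the intent.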

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 22:39                       ` Joel Fernandes
@ 2023-01-19  0:15                         ` Paul E. McKenney
  2023-01-19  0:53                           ` Joel Fernandes
  2023-01-19  3:21                         ` Zhouyi Zhou
  1 sibling, 1 reply; 34+ messages in thread
From: Paul E. McKenney @ 2023-01-19  0:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Wed, Jan 18, 2023 at 10:39:28PM +0000, Joel Fernandes wrote:
> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > [...]
> > > > > > > Is there a plan to make CPU hotplug failures more frequent?
> > > > > >
> > > > > > I am not aware of such a plan but I was going by "There are quite some
> > > > > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > > > > not a fatal problem, really." in [1].
> > > > > >
> > > > > > What about an rcutorture to skip hotplug for a certain cpu id,
> > > > > > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > > > > > should debug this issue more before getting to that.
> > > > >
> > > > > Yes, in fact there already are some checks along those lines, for example,
> > > > > the torture_offline() function's check of cpu_is_hotpluggable().  So for
> > > > > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > > > > the housekeeping CPU as !cpu_is_hotpluggable().
> > > >
> > > > I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> > > > not seeing it). Even on x86, if you enable
> > > > CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> > > > rcutorture with boot args:
> > > >
> > > > nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> > > > rcutorture.shutdown_secs=30
> > > >
> > > > You will see this in the kernel logs:
> > > > [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >
> > > > So RCU torture test clearly thought the CPUs were hot-pluggable, when
> > > > they was chance for them to return -EBUSY (due to housekeeping and
> > > > what not). So this issue seems to be architecture independent, in that
> > > > sense.
> > > >
> > > > So the 2 ways forward I see are:
> > > > - Make the torture test aware of which CPUs are 'house keeping'
> > > > - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> > > > (via CONFIG or boot option).
> > > >
> > > > Another option could be, forgive -EBUSY on CPU0 for
> > > > CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> > > > housekeeping CPU?
> > >
> > > I would be happier to forgive failure to offline housekeeping CPUs than
> > > blanket forgiveness of CPU 0.  Especially given that I recently got
> > > burned by a non-zero boot cpu.  ;-)
> > >
> > > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > > NO_HZ_FULL rules of the road?
> >
> > That's a great idea. I found a way to do that without having to do the
> > EXPORT_SYMBOL (like in Zhouyi's patch).
> >
> > Would the following be acceptable (only build-tested)?
> >
> > I can run more tests and submit a patch:
> >
> > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > index 55405ebf23ab..f73bc520b70e 100644
> > --- a/drivers/base/cpu.c
> > +++ b/drivers/base/cpu.c
> > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> >  bool cpu_is_hotpluggable(unsigned int cpu)
> >  {
> >         struct device *dev = get_cpu_device(cpu);
> > -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> > +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> > +               && !tick_nohz_cpu_hotpluggable(cpu);
> 
> Oops, I should lose that "!" , but otherwise should be ok.

Looks plausible to me, but I must defer to Frederic and the various
architecture maintainers.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-19  0:15                         ` Paul E. McKenney
@ 2023-01-19  0:53                           ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-19  0:53 UTC (permalink / raw)
  To: paulmck
  Cc: Zhouyi Zhou, moderated list:ARM/STM32 ARCHITECTURE, Will Deacon,
	Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Thu, Jan 19, 2023 at 12:15 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Jan 18, 2023 at 10:39:28PM +0000, Joel Fernandes wrote:
> > On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > > [...]
> > > > > > > > Is there a plan to make CPU hotplug failures more frequent?
> > > > > > >
> > > > > > > I am not aware of such a plan but I was going by "There are quite some
> > > > > > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > > > > > not a fatal problem, really." in [1].
> > > > > > >
> > > > > > > What about an rcutorture to skip hotplug for a certain cpu id,
> > > > > > > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > > > > > > should debug this issue more before getting to that.
> > > > > >
> > > > > > Yes, in fact there already are some checks along those lines, for example,
> > > > > > the torture_offline() function's check of cpu_is_hotpluggable().  So for
> > > > > > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > > > > > the housekeeping CPU as !cpu_is_hotpluggable().
> > > > >
> > > > > I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> > > > > not seeing it). Even on x86, if you enable
> > > > > CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> > > > > rcutorture with boot args:
> > > > >
> > > > > nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> > > > > rcutorture.shutdown_secs=30
> > > > >
> > > > > You will see this in the kernel logs:
> > > > > [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > >
> > > > > So RCU torture test clearly thought the CPUs were hot-pluggable, when
> > > > > they was chance for them to return -EBUSY (due to housekeeping and
> > > > > what not). So this issue seems to be architecture independent, in that
> > > > > sense.
> > > > >
> > > > > So the 2 ways forward I see are:
> > > > > - Make the torture test aware of which CPUs are 'house keeping'
> > > > > - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> > > > > (via CONFIG or boot option).
> > > > >
> > > > > Another option could be, forgive -EBUSY on CPU0 for
> > > > > CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> > > > > housekeeping CPU?
> > > >
> > > > I would be happier to forgive failure to offline housekeeping CPUs than
> > > > blanket forgiveness of CPU 0.  Especially given that I recently got
> > > > burned by a non-zero boot cpu.  ;-)
> > > >
> > > > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > > > NO_HZ_FULL rules of the road?
> > >
> > > That's a great idea. I found a way to do that without having to do the
> > > EXPORT_SYMBOL (like in Zhouyi's patch).
> > >
> > > Would the following be acceptable (only build-tested)?
> > >
> > > I can run more tests and submit a patch:
> > >
> > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > > index 55405ebf23ab..f73bc520b70e 100644
> > > --- a/drivers/base/cpu.c
> > > +++ b/drivers/base/cpu.c
> > > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> > >  bool cpu_is_hotpluggable(unsigned int cpu)
> > >  {
> > >         struct device *dev = get_cpu_device(cpu);
> > > -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> > > +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> > > +               && !tick_nohz_cpu_hotpluggable(cpu);
> >
> > Oops, I should lose that "!" , but otherwise should be ok.
>
> Looks plausible to me, but I must defer to Frederic and the various
> architecture maintainers.

Sure, that works for me. I will wait for more comments and then send
out a patch tomorrow either way.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 22:39                       ` Joel Fernandes
  2023-01-19  0:15                         ` Paul E. McKenney
@ 2023-01-19  3:21                         ` Zhouyi Zhou
  2023-01-19  8:26                           ` Joel Fernandes
  1 sibling, 1 reply; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-19  3:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, moderated list:ARM/STM32 ARCHITECTURE,
	Will Deacon, Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Thu, Jan 19, 2023 at 6:39 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > [...]
> > > > > > > Is there a plan to make CPU hotplug failures more frequent?
> > > > > >
> > > > > > I am not aware of such a plan but I was going by "There are quite some
> > > > > > reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> > > > > > not a fatal problem, really." in [1].
> > > > > >
> > > > > > What about an rcutorture to skip hotplug for a certain cpu id,
> > > > > > rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> > > > > > should debug this issue more before getting to that.
> > > > >
> > > > > Yes, in fact there already are some checks along those lines, for example,
> > > > > the torture_offline() function's check of cpu_is_hotpluggable().  So for
> > > > > example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> > > > > the housekeeping CPU as !cpu_is_hotpluggable().
> > > >
> > > > I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> > > > not seeing it). Even on x86, if you enable
> > > > CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> > > > rcutorture with boot args:
> > > >
> > > > nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> > > > rcutorture.shutdown_secs=30
> > > >
> > > > You will see this in the kernel logs:
> > > > [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > >
> > > > So RCU torture test clearly thought the CPUs were hot-pluggable, when
> > > > they was chance for them to return -EBUSY (due to housekeeping and
> > > > what not). So this issue seems to be architecture independent, in that
> > > > sense.
> > > >
> > > > So the 2 ways forward I see are:
> > > > - Make the torture test aware of which CPUs are 'house keeping'
> > > > - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> > > > (via CONFIG or boot option).
> > > >
> > > > Another option could be, forgive -EBUSY on CPU0 for
> > > > CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> > > > housekeeping CPU?
> > >
> > > I would be happier to forgive failure to offline housekeeping CPUs than
> > > blanket forgiveness of CPU 0.  Especially given that I recently got
> > > burned by a non-zero boot cpu.  ;-)
> > >
> > > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > > NO_HZ_FULL rules of the road?
> >
> > That's a great idea. I found a way to do that without having to do the
> > EXPORT_SYMBOL (like in Zhouyi's patch).
> >
> > Would the following be acceptable (only build-tested)?
> >
> > I can run more tests and submit a patch:
> >
> > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > index 55405ebf23ab..f73bc520b70e 100644
> > --- a/drivers/base/cpu.c
> > +++ b/drivers/base/cpu.c
> > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> >  bool cpu_is_hotpluggable(unsigned int cpu)
> >  {
> >         struct device *dev = get_cpu_device(cpu);
> > -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> > +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> > +               && !tick_nohz_cpu_hotpluggable(cpu);
>
> Oops, I should lose that "!" , but otherwise should be ok.
Looks plausible to me. With your fantastic fix, I will perform a new
round of tests on the PPC VM at the Open Source Lab of Oregon State
University.

I learned a lot during this process.

Thanks
Zhouyi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-19  3:21                         ` Zhouyi Zhou
@ 2023-01-19  8:26                           ` Joel Fernandes
  2023-01-19 12:17                             ` Zhouyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Fernandes @ 2023-01-19  8:26 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: Paul E. McKenney, moderated list:ARM/STM32 ARCHITECTURE,
	Will Deacon, Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker



> On Jan 18, 2023, at 10:21 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> 
> On Thu, Jan 19, 2023 at 6:39 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>> 
>>> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>>> 
>>>> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
>>> [...]
>>>>>>>> Is there a plan to make CPU hotplug failures more frequent?
>>>>>>> 
>>>>>>> I am not aware of such a plan but I was going by "There are quite some
>>>>>>> reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
>>>>>>> not a fatal problem, really." in [1].
>>>>>>> 
>>>>>>> What about an rcutorture to skip hotplug for a certain cpu id,
>>>>>>> rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
>>>>>>> should debug this issue more before getting to that.
>>>>>> 
>>>>>> Yes, in fact there already are some checks along those lines, for example,
>>>>>> the torture_offline() function's check of cpu_is_hotpluggable().  So for
>>>>>> example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
>>>>>> the housekeeping CPU as !cpu_is_hotpluggable().
>>>>> 
>>>>> I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
>>>>> not seeing it). Even on x86, if you enable
>>>>> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
>>>>> rcutorture with boot args:
>>>>> 
>>>>> nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
>>>>> rcutorture.shutdown_secs=30
>>>>> 
>>>>> You will see this in the kernel logs:
>>>>> [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> 
>>>>> So RCU torture test clearly thought the CPUs were hot-pluggable, when
>>>>> they was chance for them to return -EBUSY (due to housekeeping and
>>>>> what not). So this issue seems to be architecture independent, in that
>>>>> sense.
>>>>> 
>>>>> So the 2 ways forward I see are:
>>>>> - Make the torture test aware of which CPUs are 'house keeping'
>>>>> - Make it possible to turn off CPU0 hotplugging on ARM64 by default
>>>>> (via CONFIG or boot option).
>>>>> 
>>>>> Another option could be, forgive -EBUSY on CPU0 for
>>>>> CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
>>>>> housekeeping CPU?
>>>> 
>>>> I would be happier to forgive failure to offline housekeeping CPUs than
>>>> blanket forgiveness of CPU 0.  Especially given that I recently got
>>>> burned by a non-zero boot cpu.  ;-)
>>>> 
>>>> But wouldn't it be even better for cpu_is_hotpluggable() to know the
>>>> NO_HZ_FULL rules of the road?
>>> 
>>> That's a great idea. I found a way to do that without having to do the
>>> EXPORT_SYMBOL (like in Zhouyi's patch).
>>> 
>>> Would the following be acceptable (only build-tested)?
>>> 
>>> I can run more tests and submit a patch:
>>> 
>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>> index 55405ebf23ab..f73bc520b70e 100644
>>> --- a/drivers/base/cpu.c
>>> +++ b/drivers/base/cpu.c
>>> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
>>> bool cpu_is_hotpluggable(unsigned int cpu)
>>> {
>>>        struct device *dev = get_cpu_device(cpu);
>>> -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
>>> +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
>>> +               && !tick_nohz_cpu_hotpluggable(cpu);
>> 
>> Oops, I should lose that "!" , but otherwise should be ok.
> Looks plausible to me, According to your fantastic fix, I will perform
> a new round of tests on the PPC VM of open source Lab of Oregon State
> University.

Thank you! And if it passes, I will add your Tested-by tag for attribution if you do not mind.

> I learned a lot during this process

Cool!!

 - Joel


> 
> Thanks
> Zhouyi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 22:01                       ` Joel Fernandes
@ 2023-01-19  9:12                         ` Mark Rutland
  0 siblings, 0 replies; 34+ messages in thread
From: Mark Rutland @ 2023-01-19  9:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Will Deacon, Paul E. McKenney, Zhouyi Zhou,
	moderated list:ARM/STM32 ARCHITECTURE, Marc Zyngier,
	Catalin Marinas, rcu, Frederic Weisbecker

Hi Joel, Will,

On Wed, Jan 18, 2023 at 10:01:07PM +0000, Joel Fernandes wrote:
> On Wed, Jan 18, 2023 at 4:51 PM Will Deacon <will@kernel.org> wrote:
> > On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 18, 2023 at 02:17:06AM +0000, Joel Fernandes wrote:
> > >
> > > I would be happier to forgive failure to offline housekeeping CPUs than
> > > blanket forgiveness of CPU 0.  Especially given that I recently got
> > > burned by a non-zero boot cpu.  ;-)
> > >
> > > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > > NO_HZ_FULL rules of the road?
> > >
> > > > Adding Frederic to CC as well as we are talking about
> > > > housekeeping/isolation stuff.
> > >
> > > But as you say, perhaps Frederic has a better idea.
> > >
> > > > > And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> > > > > And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> > > > > architecture specifies a .cpu_can_disable() function.
> > > >
> > > > Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).
> > >
> > > Apologies!  I will look more carefully at the pathnames next time!
> > >
> > > But maybe arm64 needs something similar?
> >
> > Just chiming quickly from the arm64 side here, but there's nothing in the
> > architecture that precludes offlining CPU 0 and it certainly works on some
> > platforms, so I'd be hesitant to rule it out entirely for testing.
> >
> > One reason why hotplug can fail in practice is if a trusted OS (i.e. code
> > running on the secure side of the fence outside of Linux's view of the
> > world) is resident on a core and rejects firmware requests to power it
> > off. The PSCI code (drivers/firmware/psci/) should detect this and return
> > -EPERM, although earlier in this thread there was mention of -EBUSY so it
> > sounds like something else...
> 
> Thank you for the heads up on that. To give you context, I am
> currently testing rcutorture on stable kernels 5.10, 5.15, 6.1 on my
> ARM64 QC7180 board. I certainly don't want to hit the -EPERM in the
> future on this or other ARM64 hardware. It would be great if
> cpu_psci_cpu_can_disable() in arm64 can return false if hotplugging
> causes -EPERM indefinitely. Then we do not need to make any changes.

That should already be the case, and I think we're good on that front.

A trusted OS (which blocks offlining a CPU) will always be resident on a
specific CPU (since we don't have any code to migrate trusted OSs across CPUs
as this is not standardised, and we don't have code to instantiate a trusted OS
from Linux). Where a non-migratable trusted OS is present, it's going to have
been instantiated prior to booting Linux, and therefore will be on CPU0 (or a
CPU that Linux is not using at all).

Given the above, the return value of cpu_psci_cpu_can_disable() should not
change for a given CPU, and it should only be able to return false on CPU0.
Most systems don't have a trusted OS blocking PSCI CPU_OFF, and CPU0 can be
offlined.
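
The PSCI rule described above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: the model_*
names are hypothetical and are not the kernel's symbols, and a resident
CPU of -1 is an assumed way of saying that no trusted OS blocks CPU_OFF.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NO_RESIDENT_CPU (-1)

/*
 * Hypothetical: the CPU on which a non-migratable trusted OS lives,
 * fixed before Linux boots, or MODEL_NO_RESIDENT_CPU if there is none.
 */
int model_tos_resident_cpu = MODEL_NO_RESIDENT_CPU;

/* True only for the one CPU hosting the trusted OS. */
bool model_tos_resident_on(int cpu)
{
	assert(cpu >= 0);
	return cpu == model_tos_resident_cpu;
}

/* A CPU can be offlined unless the trusted OS is resident on it. */
bool model_cpu_can_disable(int cpu)
{
	return !model_tos_resident_on(cpu);
}
```

Because the resident CPU never changes after boot in this model, the
answer for a given CPU is stable, and at most one CPU is ever refused.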

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-19  8:26                           ` Joel Fernandes
@ 2023-01-19 12:17                             ` Zhouyi Zhou
  0 siblings, 0 replies; 34+ messages in thread
From: Zhouyi Zhou @ 2023-01-19 12:17 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, moderated list:ARM/STM32 ARCHITECTURE,
	Will Deacon, Marc Zyngier, Mark Rutland, Catalin Marinas, rcu,
	Frederic Weisbecker

On Thu, Jan 19, 2023 at 4:26 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> > On Jan 18, 2023, at 10:21 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 6:39 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>
> >>> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >>>
> >>>> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> >>> [...]
> >>>>>>>> Is there a plan to make CPU hotplug failures more frequent?
> >>>>>>>
> >>>>>>> I am not aware of such a plan but I was going by "There are quite some
> >>>>>>> reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> >>>>>>> not a fatal problem, really." in [1].
> >>>>>>>
> >>>>>>> What about an rcutorture to skip hotplug for a certain cpu id,
> >>>>>>> rcutorture.skip_hotplug_cpus="0". Can be a last resort. But we/I
> >>>>>>> should debug this issue more before getting to that.
> >>>>>>
> >>>>>> Yes, in fact there already are some checks along those lines, for example,
> >>>>>> the torture_offline() function's check of cpu_is_hotpluggable().  So for
> >>>>>> example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> >>>>>> the housekeeping CPU as !cpu_is_hotpluggable().
> >>>>>
> >>>>> I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> >>>>> not seeing it). Even on x86, if you enable
> >>>>> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> >>>>> rcutorture with boot args:
> >>>>>
> >>>>> nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> >>>>> rcutorture.shutdown_secs=30
> >>>>>
> >>>>> You will see this in the kernel logs:
> >>>>> [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>> [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>>
> >>>>> So RCU torture test clearly thought the CPUs were hot-pluggable, when
> >>>>> they was chance for them to return -EBUSY (due to housekeeping and
> >>>>> what not). So this issue seems to be architecture independent, in that
> >>>>> sense.
> >>>>>
> >>>>> So the 2 ways forward I see are:
> >>>>> - Make the torture test aware of which CPUs are 'house keeping'
> >>>>> - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> >>>>> (via CONFIG or boot option).
> >>>>>
> >>>>> Another option could be, forgive -EBUSY on CPU0 for
> >>>>> CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> >>>>> housekeeping CPU?
> >>>>
> >>>> I would be happier to forgive failure to offline housekeeping CPUs than
> >>>> blanket forgiveness of CPU 0.  Especially given that I recently got
> >>>> burned by a non-zero boot cpu.  ;-)
> >>>>
> >>>> But wouldn't it be even better for cpu_is_hotpluggable() to know the
> >>>> NO_HZ_FULL rules of the road?
> >>>
> >>> That's a great idea. I found a way to do that without having to do the
> >>> EXPORT_SYMBOL (like in Zhouyi's patch).
> >>>
> >>> Would the following be acceptable (only build-tested)?
> >>>
> >>> I can run more tests and submit a patch:
> >>>
> >>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> >>> index 55405ebf23ab..f73bc520b70e 100644
> >>> --- a/drivers/base/cpu.c
> >>> +++ b/drivers/base/cpu.c
> >>> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> >>> bool cpu_is_hotpluggable(unsigned int cpu)
> >>> {
> >>>        struct device *dev = get_cpu_device(cpu);
> >>> -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> >>> +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> >>> +               && !tick_nohz_cpu_hotpluggable(cpu);
> >>
> >> Oops, I should lose that "!" , but otherwise should be ok.
> > Looks plausible to me, According to your fantastic fix, I will perform
> > a new round of tests on the PPC VM of open source Lab of Oregon State
> > University.
>
> Thank you! And if it passes, I will add your Tested-by tag for attribution if you do not mind.
Thank you very much in advance for the Tested-by tag, I appreciate it
very much ;-)
After applying 8e82c28ea2b4 ("torture: Make thread detection more robust
by using lscpu") to linux-5.15.y on the PPC64 VM, I can now proceed with
the torture test.

The test on the original linux-5.15.y still needs an hour or two to
finish. After that I will apply your fix and run another 20+ hour
torture test (it is a little slow because it runs on a virtual
machine). Thank you for your patience.

Cheers
Zhouyi
>
> > I learned a lot during this process
>
> Cool!!
>
>  - Joel
>
>
> >
> > Thanks
> > Zhouyi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-18 22:37                     ` Joel Fernandes
  2023-01-18 22:39                       ` Joel Fernandes
@ 2023-01-19 13:57                       ` Frederic Weisbecker
  2023-01-19 20:25                         ` Joel Fernandes
  1 sibling, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2023-01-19 13:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, Zhouyi Zhou,
	moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu

On Wed, Jan 18, 2023 at 10:37:08PM +0000, Joel Fernandes wrote:
> 
> That's a great idea. I found a way to do that without having to do the
> EXPORT_SYMBOL (like in Zhouyi's patch).
> 
> Would the following be acceptable (only build-tested)?
> 
> I can run more tests and submit a patch:
> 
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index 55405ebf23ab..f73bc520b70e 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
>  bool cpu_is_hotpluggable(unsigned int cpu)
>  {
>  	struct device *dev = get_cpu_device(cpu);
> -	return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> +	return dev && container_of(dev, struct cpu, dev)->hotpluggable
> +		&& !tick_nohz_cpu_hotpluggable(cpu);
>  }
>  EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);
>  
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index bfd571f18cfd..9459fef5b857 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
>  				     enum tick_dep_bits bit);
>  extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
>  				       enum tick_dep_bits bit);
> +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);
>  
>  /*
>   * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases
> @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
>  
>  static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
>  static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { }
> +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }
>  
>  static inline void tick_dep_set(enum tick_dep_bits bit) { }
>  static inline void tick_dep_clear(enum tick_dep_bits bit) { }
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 9c6f661fb436..d1cc7525240e 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -522,6 +522,11 @@ static int tick_nohz_cpu_down(unsigned int cpu)
>  	return 0;
>  }
>  
> +bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
> +{
> +	return tick_nohz_cpu_down(cpu) == 0;
> +}
> +

Can you make it the opposite? Have tick_nohz_cpu_down() call
tick_nohz_cpu_hotpluggable()? To avoid future accidents.

Thanks.


>  void __init tick_nohz_init(void)
>  {
>  	int cpu, ret;
> -- 
> 2.39.0.246.g2a6d74b583-goog
> 

* Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
  2023-01-19 13:57                       ` Frederic Weisbecker
@ 2023-01-19 20:25                         ` Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2023-01-19 20:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Zhouyi Zhou,
	moderated list:ARM/STM32 ARCHITECTURE, Will Deacon, Marc Zyngier,
	Mark Rutland, Catalin Marinas, rcu

On Thu, Jan 19, 2023 at 02:57:59PM +0100, Frederic Weisbecker wrote:
> Can you make it the opposite? Have tick_nohz_cpu_down() call
> tick_nohz_cpu_hotpluggable()? To avoid future accidents.
> 
> Thanks.

You mean move the logic of tick_nohz_cpu_down() into
tick_nohz_cpu_hotpluggable()? That won't work because
tick_nohz_cpu_hotpluggable() returns a boolean, while tick_nohz_cpu_down(cpu)
returns an integer.

I could do something like the following and that should prevent the accident
you mentioned, which I think is that someone accidentally adds some code with
side-effects to tick_nohz_cpu_down() and ends up changing the behavior of
tick_nohz_cpu_hotpluggable(). Or was there a different accident you were
referring to?

I will submit a patch like the following, then. Thanks.

---8<-----------------------

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 4c98849577d4..7af8e33735a3 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
 bool cpu_is_hotpluggable(unsigned int cpu)
 {
 	struct device *dev = get_cpu_device(cpu);
-	return dev && container_of(dev, struct cpu, dev)->hotpluggable;
+	return dev && container_of(dev, struct cpu, dev)->hotpluggable
+		&& tick_nohz_cpu_hotpluggable(cpu);
 }
 EXPORT_SYMBOL_GPL(cpu_is_hotpluggable);
 
diff --git a/include/linux/tick.h b/include/linux/tick.h
index bfd571f18cfd..9459fef5b857 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
 				     enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
 				       enum tick_dep_bits bit);
+extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu);
 
 /*
  * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases
@@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
 
 static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
 static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { }
+static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; }
 
 static inline void tick_dep_set(enum tick_dep_bits bit) { }
 static inline void tick_dep_clear(enum tick_dep_bits bit) { }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index ba2ac1469d47..6a2e52d5f0d0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -532,7 +532,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
 	tick_nohz_full_running = true;
 }
 
-static int tick_nohz_cpu_down(unsigned int cpu)
+static int tick_nohz_cpu_hotplug_ret(unsigned int cpu)
 {
 	/*
 	 * The tick_do_timer_cpu CPU handles housekeeping duty (unbound
@@ -544,6 +544,16 @@ static int tick_nohz_cpu_down(unsigned int cpu)
 	return 0;
 }
 
+static int tick_nohz_cpu_down(unsigned int cpu)
+{
+	return tick_nohz_cpu_hotplug_ret(cpu);
+}
+
+bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
+{
+	return tick_nohz_cpu_hotplug_ret(cpu) == 0;
+}
+
 void __init tick_nohz_init(void)
 {
 	int cpu, ret;
-- 
2.39.0.246.g2a6d74b583-goog
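
For reference, the inversion Frederic asked for can also be done without
a third helper: keep the check in the boolean predicate and let the
int-returning hotplug callback translate it to an errno. Below is a
minimal userspace sketch of that shape, not the kernel code itself. The
two globals are plain stand-ins for the kernel's tick_nohz_full_running
and tick_do_timer_cpu, the condition is assumed to be the one guarded by
the comment in tick_nohz_cpu_down(), and check_model() is a hypothetical
helper added only to exercise the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for kernel state; both are assumptions for this userspace model. */
static bool tick_nohz_full_running = true;	/* nohz_full is in effect */
static int tick_do_timer_cpu = 0;		/* CPU doing timekeeping duty */

/* Single source of truth: may this CPU be taken offline? */
bool tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
	/* The timekeeping CPU must stay online while nohz_full is active. */
	if (tick_nohz_full_running && tick_do_timer_cpu == (int)cpu)
		return false;
	return true;
}

/* Hotplug callback shape: translate the predicate into 0 or -EBUSY. */
int tick_nohz_cpu_down(unsigned int cpu)
{
	return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY;
}

/* Hypothetical helper: CPU 0 is refused, CPU 1 is allowed. */
void check_model(void)
{
	assert(!tick_nohz_cpu_hotpluggable(0));
	assert(tick_nohz_cpu_down(0) == -EBUSY);
	assert(tick_nohz_cpu_hotpluggable(1));
	assert(tick_nohz_cpu_down(1) == 0);
}
```

With this arrangement the bool/int mismatch is handled in one place, and
anything later added to tick_nohz_cpu_down() cannot change what
tick_nohz_cpu_hotpluggable() reports.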


Thread overview: 34+ messages
2023-01-16 17:03 arm64 torture test hotplug failures (offlining causes -EBUSY) Joel Fernandes
2023-01-16 18:03 ` Marc Zyngier
2023-01-16 22:43   ` Joel Fernandes
2023-01-16 18:32 ` Zhouyi Zhou
2023-01-16 22:38   ` Joel Fernandes
2023-01-17  0:15     ` Joel Fernandes
2023-01-17  0:37       ` Zhouyi Zhou
2023-01-17  1:45         ` Joel Fernandes
2023-01-17  3:15           ` Zhouyi Zhou
2023-01-17  4:34             ` Joel Fernandes
2023-01-17 11:42               ` Zhouyi Zhou
2023-01-17 19:50                 ` Joel Fernandes
2023-01-18 10:15                 ` Zhouyi Zhou
2023-01-18 15:51                   ` Joel Fernandes
2023-01-17  4:30       ` Paul E. McKenney
2023-01-17  4:36         ` Joel Fernandes
2023-01-17  4:54           ` Paul E. McKenney
2023-01-17 20:02             ` Joel Fernandes
2023-01-17 20:42               ` Paul E. McKenney
2023-01-18  2:17                 ` Joel Fernandes
2023-01-18  4:00                   ` Paul E. McKenney
2023-01-18 16:51                     ` Will Deacon
2023-01-18 17:56                       ` Paul E. McKenney
2023-01-18 22:01                       ` Joel Fernandes
2023-01-19  9:12                         ` Mark Rutland
2023-01-18 22:37                     ` Joel Fernandes
2023-01-18 22:39                       ` Joel Fernandes
2023-01-19  0:15                         ` Paul E. McKenney
2023-01-19  0:53                           ` Joel Fernandes
2023-01-19  3:21                         ` Zhouyi Zhou
2023-01-19  8:26                           ` Joel Fernandes
2023-01-19 12:17                             ` Zhouyi Zhou
2023-01-19 13:57                       ` Frederic Weisbecker
2023-01-19 20:25                         ` Joel Fernandes
