From: "Paul E. McKenney" <paulmck@kernel.org>
To: Qian Cai <cai@redhat.com>
Cc: Will Deacon <will@kernel.org>,
	catalin.marinas@arm.com, kernel-team@android.com,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier
Date: Thu, 5 Nov 2020 20:07:14 -0800
Message-ID: <20201106040714.GS3249@paulmck-ThinkPad-P72>
In-Reply-To: <ec2de23c04e400266fcf98dfd282da0b173a68c3.camel@redhat.com>

On Thu, Nov 05, 2020 at 09:15:24PM -0500, Qian Cai wrote:
> On Thu, 2020-11-05 at 15:28 -0800, Paul E. McKenney wrote:
> > On Thu, Nov 05, 2020 at 06:02:49PM -0500, Qian Cai wrote:
> > > On Thu, 2020-11-05 at 22:22 +0000, Will Deacon wrote:
> > > > On Fri, Oct 30, 2020 at 04:33:25PM +0000, Will Deacon wrote:
> > > > > On Wed, 28 Oct 2020 14:26:14 -0400, Qian Cai wrote:
> > > > > > > The call to rcu_cpu_starting() in secondary_start_kernel() is not early
> > > > > > > enough in the CPU-hotplug onlining process, which results in lockdep
> > > > > > > splats as follows:
> > > > > > 
> > > > > >  WARNING: suspicious RCU usage
> > > > > >  -----------------------------
> > > > > > >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > > > > > 
> > > > > > [...]
> > > > > 
> > > > > Applied to arm64 (for-next/fixes), thanks!
> > > > > 
> > > > > [1/1] arm64/smp: Move rcu_cpu_starting() earlier
> > > > >       https://git.kernel.org/arm64/c/ce3d31ad3cac
> > > > 
> > > > Hmm, this patch has caused a regression in the case that we fail to
> > > > online a CPU because it has incompatible CPU features and so we park it
> > > > in cpu_die_early(). We now get an endless spew of RCU stalls because the
> > > > core will never come online, but is being tracked by RCU. So I'm tempted
> > > > to revert this and live with the lockdep warning while we figure out a
> > > > proper fix.
> > > > 
> > > > What's the correct way to undo rcu_cpu_starting(), given that we cannot
> > > > invoke the full hotplug machinery here? Is it correct to call
> > > > rcutree_dying_cpu() on the bad CPU and then rcutree_dead_cpu() from the
> > > > CPU doing cpu_up(), or should we do something else?
> > > It looks to me that rcu_report_dead() does the opposite of
> > > rcu_cpu_starting(), so lift rcu_report_dead() out of CONFIG_HOTPLUG_CPU
> > > and use it there to rewind, Paul?
> > 
> > Yes, rcu_report_dead() should do the trick.  Presumably the earlier
> > online-time CPU-hotplug notifiers are also unwound?
> I don't think that is an issue here. cpu_die_early() sets CPU_STUCK_IN_KERNEL,
> and then __cpu_up() will see a timeout waiting for the AP to come online and
> then deal with CPU_STUCK_IN_KERNEL accordingly. Thus, something like this? I
> don't see anything in rcu_report_dead() that depends on CONFIG_HOTPLUG_CPU=y.

If this works for the ARM folks, it seems like a reasonable approach
to me.  I cannot reasonably test this because not only do I not have
an ARM system, I don't have a system on which a kernel can be built
with CONFIG_HOTPLUG_CPU=n, so I must rely on others' testing.

							Thanx, Paul

> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 09c96f57818c..10729d2d6084 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -421,6 +421,8 @@ void cpu_die_early(void)
>  
>  	update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
>  
> +	rcu_report_dead(cpu);
> +
>  	cpu_park_loop();
>  }
>  
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2a52f42f64b6..bd04b09b84b3 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4077,7 +4077,6 @@ void rcu_cpu_starting(unsigned int cpu)
>  	smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
>  }
>  
> -#ifdef CONFIG_HOTPLUG_CPU
>  /*
>   * The outgoing function has no further need of RCU, so remove it from
>   * the rcu_node tree's ->qsmaskinitnext bit masks.
> @@ -4117,6 +4116,7 @@ void rcu_report_dead(unsigned int cpu)
>  	rdp->cpu_started = false;
>  }
>  
> +#ifdef CONFIG_HOTPLUG_CPU
>  /*
>   * The outgoing CPU has just passed through the dying-idle state, and we
>   * are being invoked from the CPU that was IPIed to continue the offline
> 
