rcu.vger.kernel.org archive mirror
* RCU: rcu stall issues and an approach to the fix
@ 2021-07-22 20:08 donghai qiao
  2021-07-22 20:41 ` Paul E. McKenney
  2021-07-23  0:29 ` Boqun Feng
  0 siblings, 2 replies; 26+ messages in thread
From: donghai qiao @ 2021-07-22 20:08 UTC (permalink / raw)
  To: rcu, donghai qiao

RCU experts,

When you reply, please also keep me CC'ed.

RCU stalls might be an old problem, and they can happen quite often.
As I have observed, when the problem occurs, at least one CPU in the
system has an rdp->gp_seq that has fallen behind the others by 4
(one full grace period in the gp_seq encoding).

For example, on CPU 0, rdp->gp_seq = 0x13889d, while on the other
CPUs rdp->gp_seq = 0x1388a1.
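
For reference, a difference of 4 is exactly one grace period, assuming
the rcu_seq encoding from kernel/rcu/rcu.h (the low two bits hold the
grace-period phase, RCU_SEQ_CTR_SHIFT == 2).  A minimal userspace
sketch of the decoding:

/*
 * Userspace sketch only: decodes the gp_seq values quoted above using
 * the rcu_seq encoding from kernel/rcu/rcu.h (low 2 bits = phase).
 */
#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT  2
#define RCU_SEQ_STATE_MASK ((1UL << RCU_SEQ_CTR_SHIFT) - 1)

int main(void)
{
	unsigned long stalled = 0x13889d;	/* CPU 0 above */
	unsigned long others  = 0x1388a1;	/* the other CPUs */

	printf("stalled: gp counter %lx, phase %lu\n",
	       stalled >> RCU_SEQ_CTR_SHIFT, stalled & RCU_SEQ_STATE_MASK);
	printf("others:  gp counter %lx, phase %lu\n",
	       others >> RCU_SEQ_CTR_SHIFT, others & RCU_SEQ_STATE_MASK);
	printf("grace periods behind: %lu\n",
	       (others - stalled) >> RCU_SEQ_CTR_SHIFT);
	return 0;
}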

Because RCU stall conditions can last a long time, the callbacks on
each CPU's rdp->cblist can accumulate into the thousands.  In the
worst case this triggers a panic.

Looking into the problem further, I think it is related to the Linux
scheduler.  When the RCU core detects a stall on a CPU, rcu_gp_kthread
sends a rescheduling request via an IPI (send_IPI) to that CPU to try
to force a context switch and make some progress.  However, at least
one situation can defeat this effort: when the CPU is running a user
thread and that thread is the only runnable task in the rq, the
attempted context switch will not happen immediately.  In particular,
if the system is also configured with NOHZ_FULL for that CPU, then as
long as the user thread keeps running, the forced context switch will
never happen unless the user thread voluntarily yields the CPU.  I
think this is one of the major root causes of these RCU stall issues.
Even when NOHZ_FULL is not configured, there will be at least one tick
of delay, which can affect a realtime kernel, by the way.
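
For reference, RCU's forcing path ends up in resched_cpu(), which
(simplified from kernel/sched/core.c of that era) looks roughly like
the sketch below; resched_curr() sets TIF_NEED_RESCHED and sends the
resched IPI if the target CPU is remote:

/* Simplified sketch of resched_cpu() from kernel/sched/core.c. */
void resched_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long flags;

	raw_spin_lock_irqsave(&rq->lock, flags);
	if (cpu_online(cpu) || cpu == smp_processor_id())
		resched_curr(rq);	/* TIF_NEED_RESCHED + resched IPI */
	raw_spin_unlock_irqrestore(&rq->lock, flags);
}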

But it does not seem like a good idea to craft a fix on the scheduler
side, because that would invalidate some existing scheduling
optimizations; the current scheduler is deliberately optimized to
avoid such context switches.  So my question is: why can't the RCU
core simply report a quiescent state for the stalled CPU when it
detects that the stalled CPU is running a user thread?  The reason
seems obvious: when a CPU is running a user thread, it cannot be
inside any kernel RCU read-side critical section, so it should be safe
to close the current RCU grace period on that CPU.  This approach
would also make RCU more efficient than forcing a context switch,
which has to go through an IPI, after which the destination CPU must
wake its ksoftirqd or wait for the next scheduling cycle.
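
Roughly, the idea would look something like the sketch below.  This is
illustrative only, not a tested patch: cpu_in_userspace() is a
hypothetical placeholder for whatever state would tell us the remote
CPU is executing pure usermode code, and a real fix would need to
handle ordering against that CPU re-entering the kernel:

/* Illustrative sketch only; cpu_in_userspace() is hypothetical. */
static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
	/* ... existing checks ... */

	if (cpu_in_userspace(rdp->cpu)) {
		/*
		 * No kernel read-side critical section can run in
		 * usermode, so report the quiescent state now instead
		 * of waiting for a context switch or a tick.
		 */
		return 1;	/* caller reports the QS on rdp's behalf */
	}

	/* ... otherwise fall back to resched_cpu() and friends ... */
	return 0;
}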

If my suggested approach makes sense, I can go ahead and fix it that way.

Thanks
Donghai


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-22 20:08 RCU: rcu stall issues and an approach to the fix donghai qiao
@ 2021-07-22 20:41 ` Paul E. McKenney
  2021-07-23  0:29 ` Boqun Feng
  1 sibling, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2021-07-22 20:41 UTC (permalink / raw)
  To: donghai qiao; +Cc: rcu

On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> RCU experts,
> 
> When you reply, please also keep me CC'ed.
> 
> The problem of RCU stall might be an old problem and it can happen quite often.
> As I have observed, when the problem occurs,  at least one CPU in the system
> on which its rdp->gp_seq falls behind others by 4 (qs).
> 
> e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> rdp->gp_seq = 0x1388a1.

CPUs that are idle for long periods of time can fall back much farther.
I have seen systems with CPUs having rdp->gp_seq thousands of grace
periods behind.  So yes, this can happen when there are RCU CPU stall
warnings, but it can also happen other ways.

> Because RCU stall issues can last a long period of time, the number of callbacks
> in the list rdp->cblist of all CPUs can accumulate to thousands. In
> the worst case,
> it triggers panic.

That is quite true.

> When looking into the problem further, I'd think the problem is related to the
> Linux scheduler. When the RCU core detects the stall on a CPU, rcu_gp_kthread
> would send a rescheduling request via send_IPI to that CPU to try to force a
> context switch to make some progress. However, at least one situation can fail
> this effort, which is when the CPU is running a user thread and it is the only
> user thread in the rq, then this attempted context switching will not happen
> immediately. In particular if the system is also configured with NOHZ_FULL for
> the CPU and as long as the user thread is running, the forced context
> switch will
> never happen unless the user thread volunteers to yield the CPU. I think this
> should be one of the major root causes of these RCU stall issues. Even if
> NOHZ_FULL is not configured, there will be at least 1 tick delay which can
> affect the realtime kernel, by the way.
> 
> But it seems not a good idea to craft a fix from the scheduler side because
> this has to invalidate some existing scheduling optimizations. The current
> scheduler is deliberately optimized to avoid such context switching.  So my
> question is why the RCU core cannot effectively update qs for the stalled CPU
> when it detects that the stalled CPU is running a user thread?  The reason
> is pretty obvious because when a CPU is running a user thread, it must not
> be in any kernel read-side critical sections. So it should be safe to close
> its current RCU grace period on this CPU. Also, with this approach we can make
> RCU work more efficiently than the approach of context switch which needs to
> go through an IPI interrupt and the destination CPU needs to wake up its
> ksoftirqd or wait for the next scheduling cycle.
> 
> If my suggested approach makes sense, I can go ahead to fix it that way.

If you have not yet read through Documentation/RCU/stallwarn.rst,
please do so.  There are many potential underlying causes of RCU CPU
stall warnings, most of which are simply bugs that need to be fixed.
One common example bug is a very long in-kernel loop that is missing
a cond_resched() -- it is after all hard to provide 500-millisecond
response time when your kernel has a 21-second tight loop.
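
The usual fix looks something like the sketch below
(do_one_unit_of_work() is just a placeholder for whatever the loop is
actually doing):

/*
 * Sketch: a long-running in-kernel loop that periodically gives the
 * scheduler (and thus RCU) a chance to make progress.
 */
#include <linux/kernel.h>
#include <linux/sched.h>

static void long_running_work(unsigned long nr_units)
{
	unsigned long i;

	for (i = 0; i < nr_units; i++) {
		do_one_unit_of_work(i);	/* placeholder for the real work */
		cond_resched();		/* reschedule/quiescent-state point */
	}
}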

Now, if you are seeing RCU CPU stall warnings for no apparent reason,
let's take a look and find root cause.

							Thanx, Paul


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-22 20:08 RCU: rcu stall issues and an approach to the fix donghai qiao
  2021-07-22 20:41 ` Paul E. McKenney
@ 2021-07-23  0:29 ` Boqun Feng
  2021-07-23  3:49   ` Paul E. McKenney
  1 sibling, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2021-07-23  0:29 UTC (permalink / raw)
  To: donghai qiao; +Cc: rcu

On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> RCU experts,
> 
> When you reply, please also keep me CC'ed.
> 
> The problem of RCU stall might be an old problem and it can happen quite often.
> As I have observed, when the problem occurs,  at least one CPU in the system
> on which its rdp->gp_seq falls behind others by 4 (qs).
> 
> e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> rdp->gp_seq = 0x1388a1.
> 
> Because RCU stall issues can last a long period of time, the number of callbacks
> in the list rdp->cblist of all CPUs can accumulate to thousands. In
> the worst case,
> it triggers panic.
> 
> When looking into the problem further, I'd think the problem is related to the
> Linux scheduler. When the RCU core detects the stall on a CPU, rcu_gp_kthread
> would send a rescheduling request via send_IPI to that CPU to try to force a
> context switch to make some progress. However, at least one situation can fail
> this effort, which is when the CPU is running a user thread and it is the only
> user thread in the rq, then this attempted context switching will not happen
> immediately. In particular if the system is also configured with NOHZ_FULL for

Correct me if I'm wrong: if a CPU is solely running a user thread, how
can that CPU stall RCU?  You need to be in an RCU read-side critical
section to stall RCU.  Or is the problem you're describing here about
*recovering* from an RCU stall?

Regards,
Boqun

> the CPU and as long as the user thread is running, the forced context
> switch will
> never happen unless the user thread volunteers to yield the CPU. I think this
> should be one of the major root causes of these RCU stall issues. Even if
> NOHZ_FULL is not configured, there will be at least 1 tick delay which can
> affect the realtime kernel, by the way.
> 
> But it seems not a good idea to craft a fix from the scheduler side because
> this has to invalidate some existing scheduling optimizations. The current
> scheduler is deliberately optimized to avoid such context switching.  So my
> question is why the RCU core cannot effectively update qs for the stalled CPU
> when it detects that the stalled CPU is running a user thread?  The reason
> is pretty obvious because when a CPU is running a user thread, it must not
> be in any kernel read-side critical sections. So it should be safe to close
> its current RCU grace period on this CPU. Also, with this approach we can make
> RCU work more efficiently than the approach of context switch which needs to
> go through an IPI interrupt and the destination CPU needs to wake up its
> ksoftirqd or wait for the next scheduling cycle.
> 
> If my suggested approach makes sense, I can go ahead to fix it that way.
> 
> Thanks
> Donghai


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-23  0:29 ` Boqun Feng
@ 2021-07-23  3:49   ` Paul E. McKenney
       [not found]     ` <CAOzhvcOLFzFGZAptOTrP9Xne1-LiO8jka1sPF6+0=WiLh-cQUA@mail.gmail.com>
  2021-07-23 17:25     ` donghai qiao
  0 siblings, 2 replies; 26+ messages in thread
From: Paul E. McKenney @ 2021-07-23  3:49 UTC (permalink / raw)
  To: Boqun Feng; +Cc: donghai qiao, rcu

On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > RCU experts,
> > 
> > When you reply, please also keep me CC'ed.
> > 
> > The problem of RCU stall might be an old problem and it can happen quite often.
> > As I have observed, when the problem occurs,  at least one CPU in the system
> > on which its rdp->gp_seq falls behind others by 4 (qs).
> > 
> > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > rdp->gp_seq = 0x1388a1.
> > 
> > Because RCU stall issues can last a long period of time, the number of callbacks
> > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > the worst case,
> > it triggers panic.
> > 
> > When looking into the problem further, I'd think the problem is related to the
> > Linux scheduler. When the RCU core detects the stall on a CPU, rcu_gp_kthread
> > would send a rescheduling request via send_IPI to that CPU to try to force a
> > context switch to make some progress. However, at least one situation can fail
> > this effort, which is when the CPU is running a user thread and it is the only
> > user thread in the rq, then this attempted context switching will not happen
> > immediately. In particular if the system is also configured with NOHZ_FULL for
> 
> Correct me if I'm wrong, if a CPU is solely running a user thread, how
> can that CPU stall RCU? Because you need to be in a RCU read-side
> critical section to stall RCU. Or the problem you're talking here is
> about *recovering* from RCU stall?

Excellent point, Boqun!

Donghai, have you tried reproducing this using a kernel built with
CONFIG_RCU_EQS_DEBUG=y?

							Thanx, Paul

> Regards,
> Boqun
> 
> > the CPU and as long as the user thread is running, the forced context
> > switch will
> > never happen unless the user thread volunteers to yield the CPU. I think this
> > should be one of the major root causes of these RCU stall issues. Even if
> > NOHZ_FULL is not configured, there will be at least 1 tick delay which can
> > affect the realtime kernel, by the way.
> > 
> > But it seems not a good idea to craft a fix from the scheduler side because
> > this has to invalidate some existing scheduling optimizations. The current
> > scheduler is deliberately optimized to avoid such context switching.  So my
> > question is why the RCU core cannot effectively update qs for the stalled CPU
> > when it detects that the stalled CPU is running a user thread?  The reason
> > is pretty obvious because when a CPU is running a user thread, it must not
> > be in any kernel read-side critical sections. So it should be safe to close
> > its current RCU grace period on this CPU. Also, with this approach we can make
> > RCU work more efficiently than the approach of context switch which needs to
> > go through an IPI interrupt and the destination CPU needs to wake up its
> > ksoftirqd or wait for the next scheduling cycle.
> > 
> > If my suggested approach makes sense, I can go ahead to fix it that way.
> > 
> > Thanks
> > Donghai


* Re: RCU: rcu stall issues and an approach to the fix
       [not found]     ` <CAOzhvcOLFzFGZAptOTrP9Xne1-LiO8jka1sPF6+0=WiLh-cQUA@mail.gmail.com>
@ 2021-07-23 17:25       ` Paul E. McKenney
  2021-07-23 18:41         ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-07-23 17:25 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu

On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> wrote:
> 
> > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > RCU experts,
> > > >
> > > > When you reply, please also keep me CC'ed.
> > > >
> > > > The problem of RCU stall might be an old problem and it can happen
> > quite often.
> > > > As I have observed, when the problem occurs,  at least one CPU in the
> > system
> > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > >
> > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > rdp->gp_seq = 0x1388a1.
> > > >
> > > > Because RCU stall issues can last a long period of time, the number of
> > callbacks
> > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > the worst case,
> > > > it triggers panic.
> > > >
> > > > When looking into the problem further, I'd think the problem is
> > related to the
> > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > rcu_gp_kthread
> > > > would send a rescheduling request via send_IPI to that CPU to try to
> > force a
> > > > context switch to make some progress. However, at least one situation
> > can fail
> > > > this effort, which is when the CPU is running a user thread and it is
> > the only
> > > > user thread in the rq, then this attempted context switching will not
> > happen
> > > > immediately. In particular if the system is also configured with
> > NOHZ_FULL for
> > >
> > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > critical section to stall RCU. Or the problem you're talking here is
> > > about *recovering* from RCU stall?
> 
> In response to Boqun's question, the crashdumps I analyzed were configured
> with this :
> 
> CONFIG_PREEMPT_RCU=n
> CONFIG_PREEMPT_COUNT=n
> CONFIG_PROVE_RCU=n
> 
> Because these configurations were not enabled, the compiler generated empty
> binary code for functions rcu_read_lock() and rcu_read_unlock() which
> delimit rcu read-side critical sections. And the crashdump showed both
> functions have no binary code in the kernel module and I am pretty sure.

Agreed, that is expected behavior.

> In the first place I thought this kernel might be built the wrong way,
> but later I found other sources that said this was ok.  That's why when
> CPUs enter or leave rcu critical section, the rcu core
> is not informed.

If RCU core was informed every time that a CPU entered or left an RCU
read-side critical section, performance and scalability would be
abysmal.  So yes, this interaction is very arms-length.

> When the current grace period is closed, rcu_gp_kthread will open a new
> period for all. This will be reflected from every
> CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> progress is made. So when a cpu is running
> a user thread whilst a new period is open, it can not update its rcu unless
> a context switch occurs or upon a sched tick.
> But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> stall will happen.

Except that if a CPU is running in nohz_full mode, each transition from
kernel to user execution must invoke rcu_user_enter() and each transition
back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
allows RCU's grace-period kthread ("rcu_sched" in this configuration)
to detect even momentary nohz_full usermode execution.

You can check this in your crash dump by looking at the offending CPU's
rcu_data structure's ->dynticks field and comparing to the activities
of rcu_user_enter().

> When RCU detects that qs is stalled on a CPU, it tries to force a context
> switch to make progress on that CPU. This is
> done through a resched IPI. But this can not always succeed depending on
> the scheduler.   A while ago, this code
> process the resched IPI:
> 
> void scheduler_ipi(void)
> {
>         ...
>         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
>                 return;
>         ...
>         irq_enter();
>         sched_ttwu_pending();
>         ...
>         if (unlikely(got_nohz_idle_kick())) {
>                 this_rq()->idle_balance = 1;
>                 raise_softirq_irqoff(SCHED_SOFTIRQ);
>         }
>         irq_exit();
> }
> 
> As you can see the function returns from the first "if statement" before it
> can issue a SCHED_SOFTIRQ. Later this
> code has been changed, but similar check/optimization remains in many
> places in the scheduler. The things I try to
> fix are those that resched_cpu fails to do.

???  Current mainline has this instead:

static __always_inline void scheduler_ipi(void)
{
	/*
	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
	 * TIF_NEED_RESCHED remotely (for the first time) will also send
	 * this IPI.
	 */
	preempt_fold_need_resched();
}

Combined with the activities of resched_curr(), which is invoked
from resched_cpu(), this should force a call to the scheduler on
the return path from this IPI.
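
For reference, preempt_fold_need_resched() in current mainline is
(roughly) just the following, so the IPI handler itself does almost
nothing and the actual reschedule happens on the interrupt-return path:

/* include/linux/preempt.h (roughly, current mainline) */
static __always_inline void preempt_fold_need_resched(void)
{
	if (tif_need_resched())
		set_preempt_need_resched();
}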

So what kernel version are you using?

Recent kernels have logic to enable the tick on nohz_full CPUs that are
slow to supply RCU with a quiescent state, but this should happen only
when such CPUs are spinning in kernel mode.  Again, usermode execution
is dealt with by rcu_user_enter().

> Hope this explains it.
> Donghai
> 
> 
> > Excellent point, Boqun!
> >
> > Donghai, have you tried reproducing this using a kernel built with
> > CONFIG_RCU_EQS_DEBUG=y?
> >
> 
> I can give this configuration a try. Will let you know the results.

This should help detect any missing rcu_user_enter() or rcu_user_exit()
calls.

							Thanx, Paul

> Thanks.
> Donghai
> 
> 
> >
> >                                                         Thanx, Paul
> >
> > > Regards,
> > > Boqun
> > >
> > > > the CPU and as long as the user thread is running, the forced context
> > > > switch will
> > > > never happen unless the user thread volunteers to yield the CPU. I
> > think this
> > > > should be one of the major root causes of these RCU stall issues. Even
> > if
> > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > can
> > > > affect the realtime kernel, by the way.
> > > >
> > > > But it seems not a good idea to craft a fix from the scheduler side
> > because
> > > > this has to invalidate some existing scheduling optimizations. The
> > current
> > > > scheduler is deliberately optimized to avoid such context switching.
> > So my
> > > > question is why the RCU core cannot effectively update qs for the
> > stalled CPU
> > > > when it detects that the stalled CPU is running a user thread?  The
> > reason
> > > > is pretty obvious because when a CPU is running a user thread, it must
> > not
> > > > be in any kernel read-side critical sections. So it should be safe to
> > close
> > > > its current RCU grace period on this CPU. Also, with this approach we
> > can make
> > > > RCU work more efficiently than the approach of context switch which
> > needs to
> > > > go through an IPI interrupt and the destination CPU needs to wake up
> > its
> > > > ksoftirqd or wait for the next scheduling cycle.
> > > >
> > > > If my suggested approach makes sense, I can go ahead to fix it that
> > way.
> > > >
> > > > Thanks
> > > > Donghai
> >


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-23  3:49   ` Paul E. McKenney
       [not found]     ` <CAOzhvcOLFzFGZAptOTrP9Xne1-LiO8jka1sPF6+0=WiLh-cQUA@mail.gmail.com>
@ 2021-07-23 17:25     ` donghai qiao
  1 sibling, 0 replies; 26+ messages in thread
From: donghai qiao @ 2021-07-23 17:25 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu

On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > RCU experts,
> > >
> > > When you reply, please also keep me CC'ed.
> > >
> > > The problem of RCU stall might be an old problem and it can happen quite often.
> > > As I have observed, when the problem occurs,  at least one CPU in the system
> > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > >
> > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > rdp->gp_seq = 0x1388a1.
> > >
> > > Because RCU stall issues can last a long period of time, the number of callbacks
> > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > the worst case,
> > > it triggers panic.
> > >
> > > When looking into the problem further, I'd think the problem is related to the
> > > Linux scheduler. When the RCU core detects the stall on a CPU, rcu_gp_kthread
> > > would send a rescheduling request via send_IPI to that CPU to try to force a
> > > context switch to make some progress. However, at least one situation can fail
> > > this effort, which is when the CPU is running a user thread and it is the only
> > > user thread in the rq, then this attempted context switching will not happen
> > > immediately. In particular if the system is also configured with NOHZ_FULL for
> >
> > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > can that CPU stall RCU? Because you need to be in a RCU read-side
> > critical section to stall RCU. Or the problem you're talking here is
> > about *recovering* from RCU stall?

In response to Boqun's question, the crashdumps I analyzed were
configured with this:

CONFIG_PREEMPT_RCU=n
CONFIG_PREEMPT_COUNT=n
CONFIG_PROVE_RCU=n

Because these configurations were not enabled, the compiler generated
empty code for rcu_read_lock() and rcu_read_unlock(), which delimit
RCU read-side critical sections; the crashdump confirmed that both
functions carry no binary code in the kernel.  At first I thought this
kernel might have been built the wrong way, but later I found other
sources saying this is expected.  That is why the RCU core is not
informed when CPUs enter or leave RCU read-side critical sections.

When the current grace period ends, rcu_gp_kthread opens a new one for
all CPUs, which is reflected in every CPU's rdp->gp_seq.  Each CPU is
responsible for updating its own view of the grace period as progress
is made.  So when a CPU is running a user thread while a new grace
period is open, it cannot report its quiescent state unless a context
switch occurs or a scheduler tick arrives.  If that CPU is also
configured as NOHZ, this becomes a problem for RCU, and an RCU stall
results.

When RCU detects that the quiescent state is stalled on a CPU, it
tries to force a context switch on that CPU to make progress.  This is
done through a resched IPI, but it cannot always succeed, depending on
the scheduler.  A while ago, this code processed the resched IPI:

void scheduler_ipi(void)
{
        ...
        if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
                return;
        ...
        irq_enter();
        sched_ttwu_pending();
        ...
        if (unlikely(got_nohz_idle_kick())) {
                this_rq()->idle_balance = 1;
                raise_softirq_irqoff(SCHED_SOFTIRQ);
        }
        irq_exit();
}

As you can see, the function returns from the first "if" statement
before it can raise SCHED_SOFTIRQ.  This code has since been changed,
but similar checks/optimizations remain in many places in the
scheduler.  What I am trying to fix are the cases that resched_cpu()
fails to handle.

Hope this explains it.

>
> Excellent point, Boqun!
>
> Donghai, have you tried reproducing this using a kernel built with
> CONFIG_RCU_EQS_DEBUG=y?

I haven't tried it. I can do it and update you with the results when I am done.

Thanks!
Donghai

>
>                                                         Thanx, Paul
>
> > Regards,
> > Boqun


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-23 17:25       ` Paul E. McKenney
@ 2021-07-23 18:41         ` donghai qiao
  2021-07-23 19:06           ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-07-23 18:41 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu

On Fri, Jul 23, 2021 at 1:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> > On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> > wrote:
> >
> > > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > > RCU experts,
> > > > >
> > > > > When you reply, please also keep me CC'ed.
> > > > >
> > > > > The problem of RCU stall might be an old problem and it can happen
> > > quite often.
> > > > > As I have observed, when the problem occurs,  at least one CPU in the
> > > system
> > > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > > >
> > > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > > rdp->gp_seq = 0x1388a1.
> > > > >
> > > > > Because RCU stall issues can last a long period of time, the number of
> > > callbacks
> > > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > > the worst case,
> > > > > it triggers panic.
> > > > >
> > > > > When looking into the problem further, I'd think the problem is
> > > related to the
> > > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > > rcu_gp_kthread
> > > > > would send a rescheduling request via send_IPI to that CPU to try to
> > > force a
> > > > > context switch to make some progress. However, at least one situation
> > > can fail
> > > > > this effort, which is when the CPU is running a user thread and it is
> > > the only
> > > > > user thread in the rq, then this attempted context switching will not
> > > happen
> > > > > immediately. In particular if the system is also configured with
> > > NOHZ_FULL for
> > > >
> > > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > > critical section to stall RCU. Or the problem you're talking here is
> > > > about *recovering* from RCU stall?
> >
> > In response to Boqun's question, the crashdumps I analyzed were configured
> > with this :
> >
> > CONFIG_PREEMPT_RCU=n
> > CONFIG_PREEMPT_COUNT=n
> > CONFIG_PROVE_RCU=n
> >
> > Because these configurations were not enabled, the compiler generated empty
> > binary code for functions rcu_read_lock() and rcu_read_unlock() which
> > delimit rcu read-side critical sections. And the crashdump showed both
> > functions have no binary code in the kernel module and I am pretty sure.
>
> Agreed, that is expected behavior.
>
> > In the first place I thought this kernel might be built the wrong way,
> > but later I found other sources that said this was ok.  That's why when
> > CPUs enter or leave rcu critical section, the rcu core
> > is not informed.
>
> If RCU core was informed every time that a CPU entered or left an RCU
> read-side critical section, performance and scalability would be
> abysmal.  So yes, this interaction is very arms-length.

Thanks for confirming that.

> > When the current grace period is closed, rcu_gp_kthread will open a new
> > period for all. This will be reflected from every
> > CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> > progress is made. So when a cpu is running
> > a user thread whilst a new period is open, it can not update its rcu unless
> > a context switch occurs or upon a sched tick.
> > But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> > stall will happen.
>
> Except that if a CPU is running in nohz_full mode, each transition from
> kernel to user execution must invoke rcu_user_enter() and each transition
> back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
> allows RCU's grace-period kthread ("rcu_sched" in this configuration)
> to detect even momentary nohz_full usermode execution.

Yes, agreed.

>
> You can check this in your crash dump by looking at the offending CPU's
> rcu_data structure's ->dynticks field and comparing to the activities
> of rcu_user_enter().

On the stalled CPU, rdp->dynticks falls far behind the others.  In the
crashdump I examined, the stall happened on CPU 0, whose dynticks is
0x6eab02, while the dynticks values on the other CPUs are 0x82c192,
0x72a3b6, 0x880516, etc.

>
> > When RCU detects that qs is stalled on a CPU, it tries to force a context
> > switch to make progress on that CPU. This is
> > done through a resched IPI. But this can not always succeed depending on
> > the scheduler.   A while ago, this code
> > process the resched IPI:
> >
> > void scheduler_ipi(void)
> > {
> >         ...
> >         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> >                 return;
> >         ...
> >         irq_enter();
> >         sched_ttwu_pending();
> >         ...
> >         if (unlikely(got_nohz_idle_kick())) {
> >                 this_rq()->idle_balance = 1;
> >                 raise_softirq_irqoff(SCHED_SOFTIRQ);
> >         }
> >         irq_exit();
> > }
> >
> > As you can see the function returns from the first "if statement" before it
> > can issue a SCHED_SOFTIRQ. Later this
> > code has been changed, but similar check/optimization remains in many
> > places in the scheduler. The things I try to
> > fix are those that resched_cpu fails to do.
>
> ???  Current mainline has this instead:
>
> static __always_inline void scheduler_ipi(void)
> {
>         /*
>          * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
>          * TIF_NEED_RESCHED remotely (for the first time) will also send
>          * this IPI.
>          */
>         preempt_fold_need_resched();
> }

This function was changed about a year ago in the upstream kernel.
But it is not the only code that can fail to honor a resched request
from RCU; over the years, the scheduler has been optimized to avoid
context switches it considers unnecessary.

>
> Combined with the activities of resched_curr(), which is invoked
> from resched_cpu(), this should force a call to the scheduler on
> the return path from this IPI.
>
> So what kernel version are you using?

The crashdumps I have examined were generated from 4.18.0-269, which
is RHEL 8.4.  The problem is also reproducible on Fedora 34 and the
latest upstream kernel; however, I don't have a crashdump of that kind
right now.

>
> Recent kernels have logic to enable the tick on nohz_full CPUs that are
> slow to supply RCU with a quiescent state, but this should happen only
> when such CPUs are spinning in kernel mode.  Again, usermode execution
> is dealt with by rcu_user_enter().

That is also consistent with the CPU having been running a user thread
when the RCU stall was detected.  So I guess something should be done
for this case.

>
> > Hope this explains it.
> > Donghai
> >
> >
> > > Excellent point, Boqun!
> > >
> > > Donghai, have you tried reproducing this using a kernel built with
> > > CONFIG_RCU_EQS_DEBUG=y?
> > >
> >
> > I can give this configuration a try. Will let you know the results.
>
> This should help detect any missing rcu_user_enter() or rcu_user_exit()
> calls.

Got it.

Thanks
Donghai

>
>                                                         Thanx, Paul
>
> > Thanks.
> > Donghai
> >
> >
> > >
> > >                                                         Thanx, Paul
> > >
> > > > Regards,
> > > > Boqun
> > > >
> > > > > the CPU and as long as the user thread is running, the forced context
> > > > > switch will
> > > > > never happen unless the user thread volunteers to yield the CPU. I
> > > think this
> > > > > should be one of the major root causes of these RCU stall issues. Even
> > > if
> > > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > > can
> > > > > affect the realtime kernel, by the way.
> > > > >
> > > > > But it seems not a good idea to craft a fix from the scheduler side
> > > because
> > > > > this has to invalidate some existing scheduling optimizations. The
> > > current
> > > > > scheduler is deliberately optimized to avoid such context switching.
> > > So my
> > > > > question is why the RCU core cannot effectively update qs for the
> > > stalled CPU
> > > > > when it detects that the stalled CPU is running a user thread?  The
> > > reason
> > > > > is pretty obvious because when a CPU is running a user thread, it must
> > > not
> > > > > be in any kernel read-side critical sections. So it should be safe to
> > > close
> > > > > its current RCU grace period on this CPU. Also, with this approach we
> > > can make
> > > > > RCU work more efficiently than the approach of context switch which
> > > needs to
> > > > > go through an IPI interrupt and the destination CPU needs to wake up
> > > its
> > > > > ksoftirqd or wait for the next scheduling cycle.
> > > > >
> > > > > If my suggested approach makes sense, I can go ahead to fix it that
> > > way.
> > > > >
> > > > > Thanks
> > > > > Donghai
> > >


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-23 18:41         ` donghai qiao
@ 2021-07-23 19:06           ` Paul E. McKenney
  2021-07-24  0:01             ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-07-23 19:06 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu

On Fri, Jul 23, 2021 at 02:41:20PM -0400, donghai qiao wrote:
> On Fri, Jul 23, 2021 at 1:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> > > On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> > > wrote:
> > >
> > > > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > > > RCU experts,
> > > > > >
> > > > > > When you reply, please also keep me CC'ed.
> > > > > >
> > > > > > The problem of RCU stall might be an old problem and it can happen
> > > > quite often.
> > > > > > As I have observed, when the problem occurs,  at least one CPU in the
> > > > system
> > > > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > > > >
> > > > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > > > rdp->gp_seq = 0x1388a1.
> > > > > >
> > > > > > Because RCU stall issues can last a long period of time, the number of
> > > > callbacks
> > > > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > > > the worst case,
> > > > > > it triggers panic.
> > > > > >
> > > > > > When looking into the problem further, I'd think the problem is
> > > > related to the
> > > > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > > > rcu_gp_kthread
> > > > > > would send a rescheduling request via send_IPI to that CPU to try to
> > > > force a
> > > > > > context switch to make some progress. However, at least one situation
> > > > can fail
> > > > > > this effort, which is when the CPU is running a user thread and it is
> > > > the only
> > > > > > user thread in the rq, then this attempted context switching will not
> > > > happen
> > > > > > immediately. In particular if the system is also configured with
> > > > NOHZ_FULL for
> > > > >
> > > > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > > > critical section to stall RCU. Or the problem you're talking here is
> > > > > about *recovering* from RCU stall?
> > >
> > > In response to Boqun's question, the crashdumps I analyzed were configured
> > > with this :
> > >
> > > CONFIG_PREEMPT_RCU=n
> > > CONFIG_PREEMPT_COUNT=n
> > > CONFIG_PROVE_RCU=n
> > >
> > > Because these configurations were not enabled, the compiler generated empty
> > > binary code for functions rcu_read_lock() and rcu_read_unlock() which
> > > delimit rcu read-side critical sections. And the crashdump showed both
> > > functions have no binary code in the kernel module and I am pretty sure.
> >
> > Agreed, that is expected behavior.
> >
> > > In the first place I thought this kernel might be built the wrong way,
> > > but later I found other sources that said this was ok.  That's why when
> > > CPUs enter or leave rcu critical section, the rcu core
> > > is not informed.
> >
> > If RCU core was informed every time that a CPU entered or left an RCU
> > read-side critical section, performance and scalability would be
> > abysmal.  So yes, this interaction is very arms-length.
> 
> Thanks for confirming that.
> 
> > > When the current grace period is closed, rcu_gp_kthread will open a new
> > > period for all. This will be reflected from every
> > > CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> > > progress is made. So when a cpu is running
> > > a user thread whilst a new period is open, it can not update its rcu unless
> > > a context switch occurs or upon a sched tick.
> > > But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> > > stall will happen.
> >
> > Except that if a CPU is running in nohz_full mode, each transition from
> > kernel to user execution must invoke rcu_user_enter() and each transition
> > back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
> > allows RCU's grace-period kthread ("rcu_sched" in this configuration)
> > to detect even momentary nohz_full usermode execution.
> 
> Yes, agreed.
> 
> > You can check this in your crash dump by looking at the offending CPU's
> > rcu_data structure's ->dynticks field and comparing to the activities
> > of rcu_user_enter().
> 
> On the rcu stalled CPU, its rdp->dynticks is far behind others. In the crashdump
> I examined, stall happened on CPU 0,  its dynticks is 0x6eab02, but dynticks on
> other CPUs are 0x82c192, 0x72a3b6, 0x880516 etc..

That is expected behavior for a CPU running nohz_full user code for an
extended time period.  RCU is supposed to leave that CPU strictly alone,
after all.  ;-)

> > > When RCU detects that qs is stalled on a CPU, it tries to force a context
> > > switch to make progress on that CPU. This is
> > > done through a resched IPI. But this can not always succeed depending on
> > > the scheduler.   A while ago, this code
> > > process the resched IPI:
> > >
> > > void scheduler_ipi(void)
> > > {
> > >         ...
> > >         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > >                 return;
> > >         ...
> > >         irq_enter();
> > >         sched_ttwu_pending();
> > >         ...
> > >         if (unlikely(got_nohz_idle_kick())) {
> > >                 this_rq()->idle_balance = 1;
> > >                 raise_softirq_irqoff(SCHED_SOFTIRQ);
> > >         }
> > >         irq_exit();
> > > }
> > >
> > > As you can see the function returns from the first "if statement" before it
> > > can issue a SCHED_SOFTIRQ. Later this
> > > code has been changed, but similar check/optimization remains in many
> > > places in the scheduler. The things I try to
> > > fix are those that resched_cpu fails to do.
> >
> > ???  Current mainline has this instead:
> >
> > static __always_inline void scheduler_ipi(void)
> > {
> >         /*
> >          * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> >          * TIF_NEED_RESCHED remotely (for the first time) will also send
> >          * this IPI.
> >          */
> >         preempt_fold_need_resched();
> > }
> 
> This function was changed a year ago in the upstream kernel. But this
> is not the only
> code that fails the request of resched from rcu.  The scheduler is
> optimized to avoid
> context switching which it thinks is unnecessary over the years.

But RCU shouldn't get to the point where it would invoke resched_cpu().
Instead, it should see that CPU's rdp->dynticks value and report a
quiescent state on that CPU's behalf.  See the rcu_implicit_dynticks_qs()
function and its callers.
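
Heavily simplified, that mechanism works like the sketch below (the
real code is in kernel/rcu/tree.c and the exact names vary a bit
across kernel versions):

/*
 * Heavily simplified sketch of RCU's dyntick-based quiescent-state
 * detection from kernel/rcu/tree.c; details vary across versions.
 */

/* At grace-period start, snapshot each CPU's ->dynticks counter. */
static int dyntick_save_progress_counter(struct rcu_data *rdp)
{
	rdp->dynticks_snap = rcu_dynticks_snap(rdp);
	if (rcu_dynticks_in_eqs(rdp->dynticks_snap))
		return 1;	/* CPU idle/usermode: QS right away */
	return 0;
}

/*
 * Later, if the CPU still has not reported a QS itself, check whether
 * it has passed through (or remains in) an extended quiescent state.
 */
static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap))
		return 1;	/* report the QS on this CPU's behalf */

	/* Otherwise nudge the CPU: force the tick on, resched_cpu(). */
	return 0;
}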

> > Combined with the activities of resched_curr(), which is invoked
> > from resched_cpu(), this should force a call to the scheduler on
> > the return path from this IPI.
> >
> > So what kernel version are you using?
> 
> The crashdumps I have examined were generated from 4.18.0-269 which is rhel-8.4.
> But this problem is reproducible on fedora 34 and the latest upstream kernel,
> however, I don't have a crashdump of that kind right now.

Interesting.  Are those systems doing anything unusual?  Long-running
interrupts, CPU-hotplug operations, ... ?

> > Recent kernels have logic to enable the tick on nohz_full CPUs that are
> > slow to supply RCU with a quiescent state, but this should happen only
> > when such CPUs are spinning in kernel mode.  Again, usermode execution
> > is dealt with by rcu_user_enter().
> 
> That also reflected why the CPU was running a user thread when the RCU stall
> was detected.  So I guess something should be done for this case.

You lost me on this one.  Did the rdp->dynticks value show that the
CPU was in an extended quiescent state?

							Thanx, Paul

> > > Hope this explains it.
> > > Donghai
> > >
> > >
> > > > Excellent point, Boqun!
> > > >
> > > > Donghai, have you tried reproducing this using a kernel built with
> > > > CONFIG_RCU_EQS_DEBUG=y?
> > > >
> > >
> > > I can give this configuration a try. Will let you know the results.
> >
> > This should help detect any missing rcu_user_enter() or rcu_user_exit()
> > calls.
> 
> Got it.
> 
> Thanks
> Donghai
> 
> >
> >                                                         Thanx, Paul
> >
> > > Thanks.
> > > Donghai
> > >
> > >
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > > Regards,
> > > > > Boqun
> > > > >
> > > > > > the CPU and as long as the user thread is running, the forced context
> > > > > > switch will
> > > > > > never happen unless the user thread volunteers to yield the CPU. I
> > > > think this
> > > > > > should be one of the major root causes of these RCU stall issues. Even
> > > > if
> > > > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > > > can
> > > > > > affect the realtime kernel, by the way.
> > > > > >
> > > > > > But it seems not a good idea to craft a fix from the scheduler side
> > > > because
> > > > > > this has to invalidate some existing scheduling optimizations. The
> > > > current
> > > > > > scheduler is deliberately optimized to avoid such context switching.
> > > > So my
> > > > > > question is why the RCU core cannot effectively update qs for the
> > > > stalled CPU
> > > > > > when it detects that the stalled CPU is running a user thread?  The
> > > > reason
> > > > > > is pretty obvious because when a CPU is running a user thread, it must
> > > > not
> > > > > > be in any kernel read-side critical sections. So it should be safe to
> > > > close
> > > > > > its current RCU grace period on this CPU. Also, with this approach we
> > > > can make
> > > > > > RCU work more efficiently than the approach of context switch which
> > > > needs to
> > > > > > go through an IPI interrupt and the destination CPU needs to wake up
> > > > its
> > > > > > ksoftirqd or wait for the next scheduling cycle.
> > > > > >
> > > > > > If my suggested approach makes sense, I can go ahead to fix it that
> > > > way.
> > > > > >
> > > > > > Thanks
> > > > > > Donghai
> > > >


* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-23 19:06           ` Paul E. McKenney
@ 2021-07-24  0:01             ` donghai qiao
  2021-07-25  0:48               ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-07-24  0:01 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu

On Fri, Jul 23, 2021 at 3:06 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 23, 2021 at 02:41:20PM -0400, donghai qiao wrote:
> > On Fri, Jul 23, 2021 at 1:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> > > > On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> > > > wrote:
> > > >
> > > > > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > > > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > > > > RCU experts,
> > > > > > >
> > > > > > > When you reply, please also keep me CC'ed.
> > > > > > >
> > > > > > > The problem of RCU stall might be an old problem and it can happen
> > > > > quite often.
> > > > > > > As I have observed, when the problem occurs,  at least one CPU in the
> > > > > system
> > > > > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > > > > >
> > > > > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > > > > rdp->gp_seq = 0x1388a1.
> > > > > > >
> > > > > > > Because RCU stall issues can last a long period of time, the number of
> > > > > callbacks
> > > > > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > > > > the worst case,
> > > > > > > it triggers panic.
> > > > > > >
> > > > > > > When looking into the problem further, I'd think the problem is
> > > > > related to the
> > > > > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > > > > rcu_gp_kthread
> > > > > > > would send a rescheduling request via send_IPI to that CPU to try to
> > > > > force a
> > > > > > > context switch to make some progress. However, at least one situation
> > > > > can fail
> > > > > > > this effort, which is when the CPU is running a user thread and it is
> > > > > the only
> > > > > > > user thread in the rq, then this attempted context switching will not
> > > > > happen
> > > > > > > immediately. In particular if the system is also configured with
> > > > > NOHZ_FULL for
> > > > > >
> > > > > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > > > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > > > > critical section to stall RCU. Or the problem you're talking here is
> > > > > > about *recovering* from RCU stall?
> > > >
> > > > In response to Boqun's question, the crashdumps I analyzed were configured
> > > > with this :
> > > >
> > > > CONFIG_PREEMPT_RCU=n
> > > > CONFIG_PREEMPT_COUNT=n
> > > > CONFIG_PROVE_RCU=n
> > > >
> > > > Because these configurations were not enabled, the compiler generated empty
> > > > binary code for functions rcu_read_lock() and rcu_read_unlock() which
> > > > delimit rcu read-side critical sections. And the crashdump showed both
> > > > functions have no binary code in the kernel module and I am pretty sure.
> > >
> > > Agreed, that is expected behavior.
> > >
> > > > In the first place I thought this kernel might be built the wrong way,
> > > > but later I found other sources that said this was ok.  That's why when
> > > > CPUs enter or leave rcu critical section, the rcu core
> > > > is not informed.
> > >
> > > If RCU core was informed every time that a CPU entered or left an RCU
> > > read-side critical section, performance and scalability would be
> > > abysmal.  So yes, this interaction is very arms-length.
> >
> > Thanks for confirming that.
> >
> > > > When the current grace period is closed, rcu_gp_kthread will open a new
> > > > period for all. This will be reflected from every
> > > > CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> > > > progress is made. So when a cpu is running
> > > > a user thread whilst a new period is open, it can not update its rcu unless
> > > > a context switch occurs or upon a sched tick.
> > > > But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> > > > stall will happen.
> > >
> > > Except that if a CPU is running in nohz_full mode, each transition from
> > > kernel to user execution must invoke rcu_user_enter() and each transition
> > > back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
> > > allows RCU's grace-period kthread ("rcu_sched" in this configuration)
> > > to detect even momentary nohz_full usermode execution.
> >
> > Yes, agreed.
> >
> > > You can check this in your crash dump by looking at the offending CPU's
> > > rcu_data structure's ->dynticks field and comparing to the activities
> > > of rcu_user_enter().
> >
> > On the rcu stalled CPU, its rdp->dynticks is far behind others. In the crashdump
> > I examined, stall happened on CPU 0,  its dynticks is 0x6eab02, but dynticks on
> > other CPUs are 0x82c192, 0x72a3b6, 0x880516 etc..
>
> That is expected behavior for a CPU running nohz_full user code for an
> extended time period.  RCU is supposed to leave that CPU strictly alone,
> after all.  ;-)

Yeah, this does happen if the CPU has nohz_full enabled.  But I
believe I have also seen a similar situation on a CPU with only the
regular NOHZ (idle) mode enabled.  So shouldn't RCU close off the
grace period for this CPU rather than waiting?  That's how I am
thinking of fixing it.

>
> > > > When RCU detects that qs is stalled on a CPU, it tries to force a context
> > > > switch to make progress on that CPU. This is
> > > > done through a resched IPI. But this can not always succeed depending on
> > > > the scheduler.   A while ago, this code
> > > > process the resched IPI:
> > > >
> > > > void scheduler_ipi(void)
> > > > {
> > > >         ...
> > > >         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > > >                 return;
> > > >         ...
> > > >         irq_enter();
> > > >         sched_ttwu_pending();
> > > >         ...
> > > >         if (unlikely(got_nohz_idle_kick())) {
> > > >                 this_rq()->idle_balance = 1;
> > > >                 raise_softirq_irqoff(SCHED_SOFTIRQ);
> > > >         }
> > > >         irq_exit();
> > > > }
> > > >
> > > > As you can see the function returns from the first "if statement" before it
> > > > can issue a SCHED_SOFTIRQ. Later this
> > > > code has been changed, but similar check/optimization remains in many
> > > > places in the scheduler. The things I try to
> > > > fix are those that resched_cpu fails to do.
> > >
> > > ???  Current mainline has this instead:
> > >
> > > static __always_inline void scheduler_ipi(void)
> > > {
> > >         /*
> > >          * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> > >          * TIF_NEED_RESCHED remotely (for the first time) will also send
> > >          * this IPI.
> > >          */
> > >         preempt_fold_need_resched();
> > > }
> >
> > This function was changed a year ago in the upstream kernel. But this
> > is not the only
> > code that fails the request of resched from rcu.  The scheduler is
> > optimized to avoid
> > context switching which it thinks is unnecessary over the years.
>
> But RCU shouldn't get to the point where it would invoke resched_cpu().
> Instead, it should see that CPU's rdp->dynticks value and report a
> quiescent state on that CPU's behalf.  See the rcu_implicit_dynticks_qs()
> function and its callers.

What I was seeing in these crashdumps was that the RCU gp kthread
invoked rcu_implicit_dynticks_qs() ---> resched_cpu():

rcu_implicit_dynticks_qs()
{
        ......
        if (tick_nohz_full_cpu(rdp->cpu) && time_after(......)) {
                  WRITE_ONCE(...);
                  resched_cpu(rdp->cpu);
                  WRITE_ONCE(...);
        }

        if (time_after(...)) {
                   if (time_after(...)) {
                             resched_cpu(rdp->cpu);
                             ...
                   }
                   ...
         }
         return 0;
}

The first time, resched_cpu() was invoked from the first "if"
statement; later it was invoked from the second "if" statement.  But
because the scheduler ignored that resched request from RCU due to its
optimizations, nothing happened.

>
> > > Combined with the activities of resched_curr(), which is invoked
> > > from resched_cpu(), this should force a call to the scheduler on
> > > the return path from this IPI.
> > >
> > > So what kernel version are you using?
> >
> > The crashdumps I have examined were generated from 4.18.0-269 which is rhel-8.4.
> > But this problem is reproducible on fedora 34 and the latest upstream kernel,
> > however, I don't have a crashdump of that kind right now.
>
> Interesting.  Are those systems doing anything unusual?  Long-running
> interrupts, CPU-hotplug operations, ... ?

Only some regular test suites; nothing special was running.  I believe
there were no long-running interrupts and no CPU-hotplug operations.
I did notice that the console was printing some messages via printk(),
but the messages were not very long.
>
> > > Recent kernels have logic to enable the tick on nohz_full CPUs that are
> > > slow to supply RCU with a quiescent state, but this should happen only
> > > when such CPUs are spinning in kernel mode.  Again, usermode execution
> > > is dealt with by rcu_user_enter().
> >
> > That also reflected why the CPU was running a user thread when the RCU stall
> > was detected.  So I guess something should be done for this case.
>
> You lost me on this one.  Did the rdp->dynticks value show that the
> CPU was in an extended quiescent state?

The value of rdp->dynticks on the stalled CPU 0 was 0x6eab02, thus
rdp->dynticks & RCU_DYNTICK_CTRL_CTR != 0, so it was not in an
extended quiescent state (idle).
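
For what it's worth, with the RCU_DYNTICK_CTRL_* encoding those
kernels use, that check is just bit 1 of the counter (bit 0 is the
special-work flag, and bit 1 toggles on each EQS entry/exit, so bit 1
set means "not in an extended QS"); a trivial userspace sketch:

/*
 * Userspace sketch of the check above, using the RCU_DYNTICK_CTRL_*
 * encoding from 4.18-era kernel/rcu/tree.c.
 */
#include <stdio.h>

#define RCU_DYNTICK_CTRL_MASK 0x1
#define RCU_DYNTICK_CTRL_CTR  (RCU_DYNTICK_CTRL_MASK + 1)

int main(void)
{
	unsigned long dynticks = 0x6eab02;	/* CPU 0 in the crashdump */

	if (dynticks & RCU_DYNTICK_CTRL_CTR)
		printf("0x%lx: not in an extended quiescent state\n",
		       dynticks);
	else
		printf("0x%lx: in an extended quiescent state\n", dynticks);
	return 0;
}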

Thanks
Donghai


>
>                                                         Thanx, Paul
>
> > > > Hope this explains it.
> > > > Donghai
> > > >
> > > >
> > > > > Excellent point, Boqun!
> > > > >
> > > > > Donghai, have you tried reproducing this using a kernel built with
> > > > > CONFIG_RCU_EQS_DEBUG=y?
> > > > >
> > > >
> > > > I can give this configuration a try. Will let you know the results.
> > >
> > > This should help detect any missing rcu_user_enter() or rcu_user_exit()
> > > calls.
> >
> > Got it.
> >
> > Thanks
> > Donghai
> >
> > >
> > >                                                         Thanx, Paul
> > >
> > > > Thanks.
> > > > Donghai
> > > >
> > > >
> > > > >
> > > > >                                                         Thanx, Paul
> > > > >
> > > > > > Regards,
> > > > > > Boqun
> > > > > >
> > > > > > > the CPU and as long as the user thread is running, the forced context
> > > > > > > switch will
> > > > > > > never happen unless the user thread volunteers to yield the CPU. I
> > > > > think this
> > > > > > > should be one of the major root causes of these RCU stall issues. Even
> > > > > if
> > > > > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > > > > can
> > > > > > > affect the realtime kernel, by the way.
> > > > > > >
> > > > > > > But it seems not a good idea to craft a fix from the scheduler side
> > > > > because
> > > > > > > this has to invalidate some existing scheduling optimizations. The
> > > > > current
> > > > > > > scheduler is deliberately optimized to avoid such context switching.
> > > > > So my
> > > > > > > question is why the RCU core cannot effectively update qs for the
> > > > > stalled CPU
> > > > > > > when it detects that the stalled CPU is running a user thread?  The
> > > > > reason
> > > > > > > is pretty obvious because when a CPU is running a user thread, it must
> > > > > not
> > > > > > > be in any kernel read-side critical sections. So it should be safe to
> > > > > close
> > > > > > > its current RCU grace period on this CPU. Also, with this approach we
> > > > > can make
> > > > > > > RCU work more efficiently than the approach of context switch which
> > > > > needs to
> > > > > > > go through an IPI interrupt and the destination CPU needs to wake up
> > > > > its
> > > > > > > ksoftirqd or wait for the next scheduling cycle.
> > > > > > >
> > > > > > > If my suggested approach makes sense, I can go ahead to fix it that
> > > > > way.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Donghai
> > > > >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-24  0:01             ` donghai qiao
@ 2021-07-25  0:48               ` Paul E. McKenney
  2021-07-27 22:34                 ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-07-25  0:48 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu

On Fri, Jul 23, 2021 at 08:01:34PM -0400, donghai qiao wrote:
> On Fri, Jul 23, 2021 at 3:06 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Jul 23, 2021 at 02:41:20PM -0400, donghai qiao wrote:
> > > On Fri, Jul 23, 2021 at 1:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> > > > > On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> > > > > wrote:
> > > > >
> > > > > > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > > > > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > > > > > RCU experts,
> > > > > > > >
> > > > > > > > When you reply, please also keep me CC'ed.
> > > > > > > >
> > > > > > > > The problem of RCU stall might be an old problem and it can happen
> > > > > > quite often.
> > > > > > > > As I have observed, when the problem occurs,  at least one CPU in the
> > > > > > system
> > > > > > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > > > > > >
> > > > > > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > > > > > rdp->gp_seq = 0x1388a1.
> > > > > > > >
> > > > > > > > Because RCU stall issues can last a long period of time, the number of
> > > > > > callbacks
> > > > > > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > > > > > the worst case,
> > > > > > > > it triggers panic.
> > > > > > > >
> > > > > > > > When looking into the problem further, I'd think the problem is
> > > > > > related to the
> > > > > > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > > > > > rcu_gp_kthread
> > > > > > > > would send a rescheduling request via send_IPI to that CPU to try to
> > > > > > force a
> > > > > > > > context switch to make some progress. However, at least one situation
> > > > > > can fail
> > > > > > > > this effort, which is when the CPU is running a user thread and it is
> > > > > > the only
> > > > > > > > user thread in the rq, then this attempted context switching will not
> > > > > > happen
> > > > > > > > immediately. In particular if the system is also configured with
> > > > > > NOHZ_FULL for
> > > > > > >
> > > > > > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > > > > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > > > > > critical section to stall RCU. Or the problem you're talking here is
> > > > > > > about *recovering* from RCU stall?
> > > > >
> > > > > In response to Boqun's question, the crashdumps I analyzed were configured
> > > > > with this :
> > > > >
> > > > > CONFIG_PREEMPT_RCU=n
> > > > > CONFIG_PREEMPT_COUNT=n
> > > > > CONFIG_PROVE_RCU=n
> > > > >
> > > > > Because these configurations were not enabled, the compiler generated empty
> > > > > binary code for functions rcu_read_lock() and rcu_read_unlock() which
> > > > > delimit rcu read-side critical sections. And the crashdump showed both
> > > > > functions have no binary code in the kernel module and I am pretty sure.
> > > >
> > > > Agreed, that is expected behavior.
> > > >
> > > > > In the first place I thought this kernel might be built the wrong way,
> > > > > but later I found other sources that said this was ok.  That's why when
> > > > > CPUs enter or leave rcu critical section, the rcu core
> > > > > is not informed.
> > > >
> > > > If RCU core was informed every time that a CPU entered or left an RCU
> > > > read-side critical section, performance and scalability would be
> > > > abysmal.  So yes, this interaction is very arms-length.
> > >
> > > Thanks for confirming that.
> > >
> > > > > When the current grace period is closed, rcu_gp_kthread will open a new
> > > > > period for all. This will be reflected from every
> > > > > CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> > > > > progress is made. So when a cpu is running
> > > > > a user thread whilst a new period is open, it can not update its rcu unless
> > > > > a context switch occurs or upon a sched tick.
> > > > > But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> > > > > stall will happen.
> > > >
> > > > Except that if a CPU is running in nohz_full mode, each transition from
> > > > kernel to user execution must invoke rcu_user_enter() and each transition
> > > > back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
> > > > allows RCU's grace-period kthread ("rcu_sched" in this configuration)
> > > > to detect even momentary nohz_full usermode execution.
> > >
> > > Yes, agreed.
> > >
> > > > You can check this in your crash dump by looking at the offending CPU's
> > > > rcu_data structure's ->dynticks field and comparing to the activities
> > > > of rcu_user_enter().
> > >
> > > On the rcu stalled CPU, its rdp->dynticks is far behind others. In the crashdump
> > > I examined, stall happened on CPU 0,  its dynticks is 0x6eab02, but dynticks on
> > > other CPUs are 0x82c192, 0x72a3b6, 0x880516 etc..
> >
> > That is expected behavior for a CPU running nohz_full user code for an
> > extended time period.  RCU is supposed to leave that CPU strictly alone,
> > after all.  ;-)
> 
> Yeah, this does happen if the CPU is enabled nohz_full. But I believe
> I ever saw similar
> situation on a nohz enabled CPU as well.  So shouldn't RCU close off
> the gp for this CPU
> rather than waiting ? That's the way I am thinking of fixing it.

Yes, the same thing can happen if a CPU remains idle for a long time.

> > > > > When RCU detects that qs is stalled on a CPU, it tries to force a context
> > > > > switch to make progress on that CPU. This is
> > > > > done through a resched IPI. But this can not always succeed depending on
> > > > > the scheduler.   A while ago, this code
> > > > > process the resched IPI:
> > > > >
> > > > > void scheduler_ipi(void)
> > > > > {
> > > > >         ...
> > > > >         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > > > >                 return;
> > > > >         ...
> > > > >         irq_enter();
> > > > >         sched_ttwu_pending();
> > > > >         ...
> > > > >         if (unlikely(got_nohz_idle_kick())) {
> > > > >                 this_rq()->idle_balance = 1;
> > > > >                 raise_softirq_irqoff(SCHED_SOFTIRQ);
> > > > >         }
> > > > >         irq_exit();
> > > > > }
> > > > >
> > > > > As you can see the function returns from the first "if statement" before it
> > > > > can issue a SCHED_SOFTIRQ. Later this
> > > > > code has been changed, but similar check/optimization remains in many
> > > > > places in the scheduler. The things I try to
> > > > > fix are those that resched_cpu fails to do.
> > > >
> > > > ???  Current mainline has this instead:
> > > >
> > > > static __always_inline void scheduler_ipi(void)
> > > > {
> > > >         /*
> > > >          * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> > > >          * TIF_NEED_RESCHED remotely (for the first time) will also send
> > > >          * this IPI.
> > > >          */
> > > >         preempt_fold_need_resched();
> > > > }
> > >
> > > This function was changed a year ago in the upstream kernel. But this
> > > is not the only
> > > code that fails the request of resched from rcu.  The scheduler is
> > > optimized to avoid
> > > context switching which it thinks is unnecessary over the years.
> >
> > But RCU shouldn't get to the point where it would invoke resched_cpu().
> > Instead, it should see that CPU's rdp->dynticks value and report a
> > quiescent state on that CPU's behalf.  See the rcu_implicit_dynticks_qs()
> > function and its callers.
> 
> What I was seeing in these crashdumps was that the rcu gp_kthread
> invoked rcu_implicit_dynticks_qs() ---> resched_cpu()
> 
> rcu_implicit_dynticks_qs()
> {
>         ......
>         if (tick_nohz_full_cpu(rdp->cpu) && time_after(......) ) {
>                   WRITE_ONCE(...);
>                   resched_cpu(rdp->cpu);
>                   WRITE_ONCE(...);
>         }
> 
>         if (time_after (...) {
>                    if (time_aftr(...) {
>                              resched_cpu(rdp->cpu);
>                              ...
>                    }
>                    ...
>          }
>          return 0;
> }
> 
> The first time, resched_cpu() was invoked from the first "if
> statement",  later, it was invoked
> from the second "if statement".  But because the scheduler ignored
> that resched request from
> rcu due to optimization, nothing happened.

I agree that this can happen.  However, it should not have happened
in this case, because...

> > > > Combined with the activities of resched_curr(), which is invoked
> > > > from resched_cpu(), this should force a call to the scheduler on
> > > > the return path from this IPI.
> > > >
> > > > So what kernel version are you using?
> > >
> > > The crashdumps I have examined were generated from 4.18.0-269 which is rhel-8.4.
> > > But this problem is reproducible on fedora 34 and the latest upstream kernel,
> > > however, I don't have a crashdump of that kind right now.
> >
> > Interesting.  Are those systems doing anything unusual?  Long-running
> > interrupts, CPU-hotplug operations, ... ?
> 
> Only some regular test suites. Nothing special was running. I believe no long
> interrupts,  no CPU-hotplug. But I noticed that the console was outputting
> some message via printk(), but the message was not too long.
> >
> > > > Recent kernels have logic to enable the tick on nohz_full CPUs that are
> > > > slow to supply RCU with a quiescent state, but this should happen only
> > > > when such CPUs are spinning in kernel mode.  Again, usermode execution
> > > > is dealt with by rcu_user_enter().
> > >
> > > That also reflected why the CPU was running a user thread when the RCU stall
> > > was detected.  So I guess something should be done for this case.
> >
> > You lost me on this one.  Did the rdp->dynticks value show that the
> > CPU was in an extended quiescent state?
> 
> The value of rdp->dynticks on the stalled CPU0  was 0x6eab02,
> thus    rdp->dynticks & RCU_DYNTICK_CTRL_CTR != 0
> so it was not in an extended qs (idle).

And, assuming that this CPU was executing in nohz_full usermode, this
is a bug.  If a CPU is in nohz_full usermode, then its rdp->dynticks
field should have the RCU_DYNTICK_CTRL_CTR bit set to zero.

The usual cause of this situation is that there is a path to or from
nohz_full usermode that fails to call rcu_user_enter() or
rcu_user_exit(), respectively.  Similar problems can arise if there is
a path between nohz_full userspace and interrupt context that does not
invoke rcu_irq_enter() (from userspace to interrupt) or rcu_irq_exit()
(from interrupt back to userspace).
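
For reference, the expected pairing normally lives in the context-tracking
code.  Very roughly, and from memory, so please treat the exact shape as an
assumption:

void noinstr __context_tracking_enter(enum ctx_state state)
{
        ...
        if (__this_cpu_read(context_tracking.active)) {
                ...
                rcu_user_enter();       /* the CPU enters an RCU EQS here */
        }
        ...
}

with __context_tracking_exit() doing the mirror-image rcu_user_exit() on the
way back into the kernel, and the irq entry/exit paths supplying the
rcu_irq_enter()/rcu_irq_exit() half.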

There were some issues of this sort around the v5.8 timeframe.  Might
there be another patch that needs to be backported?  Or a patch that
was backported, but should not have been?

Is it possible to bisect this?

Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?

Either way, what should happen is that dyntick_save_progress_counter() or
rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
nohz_full user execution, and then the quiescent state will be supplied
on behalf of that CPU.
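
In other words, roughly this portion of rcu_implicit_dynticks_qs() -- a
simplified sketch of the current code with most details elided:

static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
        /*
         * If the CPU has passed through or now sits in an extended
         * quiescent state (idle or nohz_full userspace) since the
         * snapshot taken by dyntick_save_progress_counter(), report
         * the quiescent state on its behalf -- no IPI required.
         */
        if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
                rcu_gpnum_ovf(rdp->mynode, rdp);
                return 1;
        }
        ...
        /* Only a CPU seen looping in the kernel should reach the
         * tick-enabling and resched_cpu() fallbacks below. */
        ...
}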

Now it is possible that you also have a problem where resched_cpu() isn't
working.  But if so, that is a separate bug that should also be fixed.

							Thanx, Paul

> Thanks
> Donghai
> 
> 
> >
> >                                                         Thanx, Paul
> >
> > > > > Hope this explains it.
> > > > > Donghai
> > > > >
> > > > >
> > > > > > Excellent point, Boqun!
> > > > > >
> > > > > > Donghai, have you tried reproducing this using a kernel built with
> > > > > > CONFIG_RCU_EQS_DEBUG=y?
> > > > > >
> > > > >
> > > > > I can give this configuration a try. Will let you know the results.
> > > >
> > > > This should help detect any missing rcu_user_enter() or rcu_user_exit()
> > > > calls.
> > >
> > > Got it.
> > >
> > > Thanks
> > > Donghai
> > >
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > > Thanks.
> > > > > Donghai
> > > > >
> > > > >
> > > > > >
> > > > > >                                                         Thanx, Paul
> > > > > >
> > > > > > > Regards,
> > > > > > > Boqun
> > > > > > >
> > > > > > > > the CPU and as long as the user thread is running, the forced context
> > > > > > > > switch will
> > > > > > > > never happen unless the user thread volunteers to yield the CPU. I
> > > > > > think this
> > > > > > > > should be one of the major root causes of these RCU stall issues. Even
> > > > > > if
> > > > > > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > > > > > can
> > > > > > > > affect the realtime kernel, by the way.
> > > > > > > >
> > > > > > > > But it seems not a good idea to craft a fix from the scheduler side
> > > > > > because
> > > > > > > > this has to invalidate some existing scheduling optimizations. The
> > > > > > current
> > > > > > > > scheduler is deliberately optimized to avoid such context switching.
> > > > > > So my
> > > > > > > > question is why the RCU core cannot effectively update qs for the
> > > > > > stalled CPU
> > > > > > > > when it detects that the stalled CPU is running a user thread?  The
> > > > > > reason
> > > > > > > > is pretty obvious because when a CPU is running a user thread, it must
> > > > > > not
> > > > > > > > be in any kernel read-side critical sections. So it should be safe to
> > > > > > close
> > > > > > > > its current RCU grace period on this CPU. Also, with this approach we
> > > > > > can make
> > > > > > > > RCU work more efficiently than the approach of context switch which
> > > > > > needs to
> > > > > > > > go through an IPI interrupt and the destination CPU needs to wake up
> > > > > > its
> > > > > > > > ksoftirqd or wait for the next scheduling cycle.
> > > > > > > >
> > > > > > > > If my suggested approach makes sense, I can go ahead to fix it that
> > > > > > way.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Donghai
> > > > > >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-25  0:48               ` Paul E. McKenney
@ 2021-07-27 22:34                 ` donghai qiao
  2021-07-28  0:10                   ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-07-27 22:34 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu

On Sat, Jul 24, 2021 at 8:48 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 23, 2021 at 08:01:34PM -0400, donghai qiao wrote:
> > On Fri, Jul 23, 2021 at 3:06 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Fri, Jul 23, 2021 at 02:41:20PM -0400, donghai qiao wrote:
> > > > On Fri, Jul 23, 2021 at 1:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> > > > > > On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> > > > > > wrote:
> > > > > >
> > > > > > > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > > > > > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > > > > > > RCU experts,
> > > > > > > > >
> > > > > > > > > When you reply, please also keep me CC'ed.
> > > > > > > > >
> > > > > > > > > The problem of RCU stall might be an old problem and it can happen
> > > > > > > quite often.
> > > > > > > > > As I have observed, when the problem occurs,  at least one CPU in the
> > > > > > > system
> > > > > > > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > > > > > > >
> > > > > > > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > > > > > > rdp->gp_seq = 0x1388a1.
> > > > > > > > >
> > > > > > > > > Because RCU stall issues can last a long period of time, the number of
> > > > > > > callbacks
> > > > > > > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > > > > > > the worst case,
> > > > > > > > > it triggers panic.
> > > > > > > > >
> > > > > > > > > When looking into the problem further, I'd think the problem is
> > > > > > > related to the
> > > > > > > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > > > > > > rcu_gp_kthread
> > > > > > > > > would send a rescheduling request via send_IPI to that CPU to try to
> > > > > > > force a
> > > > > > > > > context switch to make some progress. However, at least one situation
> > > > > > > can fail
> > > > > > > > > this effort, which is when the CPU is running a user thread and it is
> > > > > > > the only
> > > > > > > > > user thread in the rq, then this attempted context switching will not
> > > > > > > happen
> > > > > > > > > immediately. In particular if the system is also configured with
> > > > > > > NOHZ_FULL for
> > > > > > > >
> > > > > > > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > > > > > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > > > > > > critical section to stall RCU. Or the problem you're talking here is
> > > > > > > > about *recovering* from RCU stall?
> > > > > >
> > > > > > In response to Boqun's question, the crashdumps I analyzed were configured
> > > > > > with this :
> > > > > >
> > > > > > CONFIG_PREEMPT_RCU=n
> > > > > > CONFIG_PREEMPT_COUNT=n
> > > > > > CONFIG_PROVE_RCU=n
> > > > > >
> > > > > > Because these configurations were not enabled, the compiler generated empty
> > > > > > binary code for functions rcu_read_lock() and rcu_read_unlock() which
> > > > > > delimit rcu read-side critical sections. And the crashdump showed both
> > > > > > functions have no binary code in the kernel module and I am pretty sure.
> > > > >
> > > > > Agreed, that is expected behavior.
> > > > >
> > > > > > In the first place I thought this kernel might be built the wrong way,
> > > > > > but later I found other sources that said this was ok.  That's why when
> > > > > > CPUs enter or leave rcu critical section, the rcu core
> > > > > > is not informed.
> > > > >
> > > > > If RCU core was informed every time that a CPU entered or left an RCU
> > > > > read-side critical section, performance and scalability would be
> > > > > abysmal.  So yes, this interaction is very arms-length.
> > > >
> > > > Thanks for confirming that.
> > > >
> > > > > > When the current grace period is closed, rcu_gp_kthread will open a new
> > > > > > period for all. This will be reflected from every
> > > > > > CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> > > > > > progress is made. So when a cpu is running
> > > > > > a user thread whilst a new period is open, it can not update its rcu unless
> > > > > > a context switch occurs or upon a sched tick.
> > > > > > But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> > > > > > stall will happen.
> > > > >
> > > > > Except that if a CPU is running in nohz_full mode, each transition from
> > > > > kernel to user execution must invoke rcu_user_enter() and each transition
> > > > > back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
> > > > > allows RCU's grace-period kthread ("rcu_sched" in this configuration)
> > > > > to detect even momentary nohz_full usermode execution.
> > > >
> > > > Yes, agreed.
> > > >
> > > > > You can check this in your crash dump by looking at the offending CPU's
> > > > > rcu_data structure's ->dynticks field and comparing to the activities
> > > > > of rcu_user_enter().
> > > >
> > > > On the rcu stalled CPU, its rdp->dynticks is far behind others. In the crashdump
> > > > I examined, stall happened on CPU 0,  its dynticks is 0x6eab02, but dynticks on
> > > > other CPUs are 0x82c192, 0x72a3b6, 0x880516 etc..
> > >
> > > That is expected behavior for a CPU running nohz_full user code for an
> > > extended time period.  RCU is supposed to leave that CPU strictly alone,
> > > after all.  ;-)
> >
> > Yeah, this does happen if the CPU is enabled nohz_full. But I believe
> > I ever saw similar
> > situation on a nohz enabled CPU as well.  So shouldn't RCU close off
> > the gp for this CPU
> > rather than waiting ? That's the way I am thinking of fixing it.
>
> Yes, the same thing can happen if a CPU remains idle for a long time.
>
> > > > > > When RCU detects that qs is stalled on a CPU, it tries to force a context
> > > > > > switch to make progress on that CPU. This is
> > > > > > done through a resched IPI. But this can not always succeed depending on
> > > > > > the scheduler.   A while ago, this code
> > > > > > process the resched IPI:
> > > > > >
> > > > > > void scheduler_ipi(void)
> > > > > > {
> > > > > >         ...
> > > > > >         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > > > > >                 return;
> > > > > >         ...
> > > > > >         irq_enter();
> > > > > >         sched_ttwu_pending();
> > > > > >         ...
> > > > > >         if (unlikely(got_nohz_idle_kick())) {
> > > > > >                 this_rq()->idle_balance = 1;
> > > > > >                 raise_softirq_irqoff(SCHED_SOFTIRQ);
> > > > > >         }
> > > > > >         irq_exit();
> > > > > > }
> > > > > >
> > > > > > As you can see the function returns from the first "if statement" before it
> > > > > > can issue a SCHED_SOFTIRQ. Later this
> > > > > > code has been changed, but similar check/optimization remains in many
> > > > > > places in the scheduler. The things I try to
> > > > > > fix are those that resched_cpu fails to do.
> > > > >
> > > > > ???  Current mainline has this instead:
> > > > >
> > > > > static __always_inline void scheduler_ipi(void)
> > > > > {
> > > > >         /*
> > > > >          * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> > > > >          * TIF_NEED_RESCHED remotely (for the first time) will also send
> > > > >          * this IPI.
> > > > >          */
> > > > >         preempt_fold_need_resched();
> > > > > }
> > > >
> > > > This function was changed a year ago in the upstream kernel. But this
> > > > is not the only
> > > > code that fails the request of resched from rcu.  The scheduler is
> > > > optimized to avoid
> > > > context switching which it thinks is unnecessary over the years.
> > >
> > > But RCU shouldn't get to the point where it would invoke resched_cpu().
> > > Instead, it should see that CPU's rdp->dynticks value and report a
> > > quiescent state on that CPU's behalf.  See the rcu_implicit_dynticks_qs()
> > > function and its callers.
> >
> > What I was seeing in these crashdumps was that the rcu gp_kthread
> > invoked rcu_implicit_dynticks_qs() ---> resched_cpu()
> >
> > rcu_implicit_dynticks_qs()
> > {
> >         ......
> >         if (tick_nohz_full_cpu(rdp->cpu) && time_after(......) ) {
> >                   WRITE_ONCE(...);
> >                   resched_cpu(rdp->cpu);
> >                   WRITE_ONCE(...);
> >         }
> >
> >         if (time_after (...) {
> >                    if (time_aftr(...) {
> >                              resched_cpu(rdp->cpu);
> >                              ...
> >                    }
> >                    ...
> >          }
> >          return 0;
> > }
> >
> > The first time, resched_cpu() was invoked from the first "if
> > statement",  later, it was invoked
> > from the second "if statement".  But because the scheduler ignored
> > that resched request from
> > rcu due to optimization, nothing happened.
>
> I agree that this can happen.  However, it should not have happened
> in this case, because...
>
> > > > > Combined with the activities of resched_curr(), which is invoked
> > > > > from resched_cpu(), this should force a call to the scheduler on
> > > > > the return path from this IPI.
> > > > >
> > > > > So what kernel version are you using?
> > > >
> > > > The crashdumps I have examined were generated from 4.18.0-269 which is rhel-8.4.
> > > > But this problem is reproducible on fedora 34 and the latest upstream kernel,
> > > > however, I don't have a crashdump of that kind right now.
> > >
> > > Interesting.  Are those systems doing anything unusual?  Long-running
> > > interrupts, CPU-hotplug operations, ... ?
> >
> > Only some regular test suites. Nothing special was running. I believe no long
> > interrupts,  no CPU-hotplug. But I noticed that the console was outputting
> > some message via printk(), but the message was not too long.
> > >
> > > > > Recent kernels have logic to enable the tick on nohz_full CPUs that are
> > > > > slow to supply RCU with a quiescent state, but this should happen only
> > > > > when such CPUs are spinning in kernel mode.  Again, usermode execution
> > > > > is dealt with by rcu_user_enter().
> > > >
> > > > That also reflected why the CPU was running a user thread when the RCU stall
> > > > was detected.  So I guess something should be done for this case.
> > >
> > > You lost me on this one.  Did the rdp->dynticks value show that the
> > > CPU was in an extended quiescent state?
> >
> > The value of rdp->dynticks on the stalled CPU0  was 0x6eab02,
> > thus    rdp->dynticks & RCU_DYNTICK_CTRL_CTR != 0
> > so it was not in an extended qs (idle).
>
> And, assuming that this CPU was executing in nohz_full usermode, this
> is a bug.  If a CPU is in nohz_full usermode, then its rdp->dynticks
> field should have the RCU_DYNTICK_CTRL_CTR bit set to zero.
>
> The usual cause of this situation is that there is a path to
> or run nohz_full usermode that fails to call rcu_user_enter() or
> rcu_user_exit(), respectively.  Similar problems can arise if there is
> a path between nohz_full userspace and interrupt context that does not
> invoke rcu_irq_enter() (from userspace to interrupt) or rcu_irq_exit()
> (from interrupt back to userspace).
>
Yes, that's true.
Because I am dealing with this issue in multiple kernel versions, sometimes
the configurations in these kernels may differ. Initially, the problem I
described originated on rhel-8, on which the problem occurs more often and
is a bit easier to reproduce than on others.

Regarding these dynticks* parameters, I collected the data for CPU 0 as below:
   - dynticks = 0x6eab02, which indicated the CPU was not in eqs.
   - dynticks_nesting = 1, which is its initial state, so it said the CPU
     was not in eqs either.
   - dynticks_nmi_nesting = 4000000000000004, which meant that this CPU
     had been interrupted when it was in the middle of the first interrupt.
     And this is true: the first interrupt was the sched_timer interrupt,
     and the second was an NMI when another CPU detected the RCU stall on
     CPU 0.
So it all looks consistent. If the kernel missed an rcu_user_enter or
rcu_user_exit, would these values still look the same?  But I'll
investigate that possibility seriously, as you pointed out.
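
(Spelling out how I am reading that ->dynticks_nmi_nesting value, assuming
the number above is hex and the usual v4.18/v5.x accounting in
kernel/rcu/tree.c, where process-level non-idle is DYNTICK_IRQ_NONIDLE and
each interrupt entry from a non-idle context adds 2:)

/* sketch only; the constant as defined in kernel/rcu/tree.c */
#define DYNTICK_IRQ_NONIDLE     ((LONG_MAX / 2) + 1)  /* 0x4000000000000000 on 64-bit */

/*
 * 0x4000000000000004 == DYNTICK_IRQ_NONIDLE + 2 + 2
 *      base :  process level, not idle from RCU's point of view
 *      + 2  :  the first interrupt  (the sched_timer tick)
 *      + 2  :  the nested interrupt (the NMI from the stall-warning backtrace)
 */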

> There were some issues of this sort around the v5.8 timeframe.  Might
> there be another patch that needs to be backported?  Or a patch that
> was backported, but should not have been?

Good to know that clue. I'll take a look into the log history.

>
> Is it possible to bisect this?
>
> Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?

I am building the latest 5.14 kernel with this config and will give it a try
when the machine is set up, to see how much it can help.
>
> Either way, what should happen is that dyntick_save_progress_counter() or
> rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> nohz_full user execution, and then the quiescent state will be supplied
> on behalf of that CPU.

Agreed. But the counter rdp->dynticks of the CPU can only be updated
by rcu_dynticks_eqs_enter() or rcu_dynticks_eqs_exit() when rcu_eqs_enter()
or rcu_eqs_exit() is called, which in turn depends on a context switch.
So, when a context switch never happens, the counter rdp->dynticks
never advances. That's the thing I am trying to fix here.

>
> Now it is possible that you also have a problem where resched_cpu() isn't
> working.  But if so, that is a separate bug that should also be fixed.

agreed.

BTW, I feel like there should be a more reliable and efficient way to
let the RCU core deal with gp and qs. I guess the existing code around
this area can be simplified a lot. But we can start another discussion on that.

Thanks
Donghai
>
>                                                         Thanx, Paul
>
> > Thanks
> > Donghai
> >
> >
> > >
> > >                                                         Thanx, Paul
> > >
> > > > > > Hope this explains it.
> > > > > > Donghai
> > > > > >
> > > > > >
> > > > > > > Excellent point, Boqun!
> > > > > > >
> > > > > > > Donghai, have you tried reproducing this using a kernel built with
> > > > > > > CONFIG_RCU_EQS_DEBUG=y?
> > > > > > >
> > > > > >
> > > > > > I can give this configuration a try. Will let you know the results.
> > > > >
> > > > > This should help detect any missing rcu_user_enter() or rcu_user_exit()
> > > > > calls.
> > > >
> > > > Got it.
> > > >
> > > > Thanks
> > > > Donghai
> > > >
> > > > >
> > > > >                                                         Thanx, Paul
> > > > >
> > > > > > Thanks.
> > > > > > Donghai
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >                                                         Thanx, Paul
> > > > > > >
> > > > > > > > Regards,
> > > > > > > > Boqun
> > > > > > > >
> > > > > > > > > the CPU and as long as the user thread is running, the forced context
> > > > > > > > > switch will
> > > > > > > > > never happen unless the user thread volunteers to yield the CPU. I
> > > > > > > think this
> > > > > > > > > should be one of the major root causes of these RCU stall issues. Even
> > > > > > > if
> > > > > > > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > > > > > > can
> > > > > > > > > affect the realtime kernel, by the way.
> > > > > > > > >
> > > > > > > > > But it seems not a good idea to craft a fix from the scheduler side
> > > > > > > because
> > > > > > > > > this has to invalidate some existing scheduling optimizations. The
> > > > > > > current
> > > > > > > > > scheduler is deliberately optimized to avoid such context switching.
> > > > > > > So my
> > > > > > > > > question is why the RCU core cannot effectively update qs for the
> > > > > > > stalled CPU
> > > > > > > > > when it detects that the stalled CPU is running a user thread?  The
> > > > > > > reason
> > > > > > > > > is pretty obvious because when a CPU is running a user thread, it must
> > > > > > > not
> > > > > > > > > be in any kernel read-side critical sections. So it should be safe to
> > > > > > > close
> > > > > > > > > its current RCU grace period on this CPU. Also, with this approach we
> > > > > > > can make
> > > > > > > > > RCU work more efficiently than the approach of context switch which
> > > > > > > needs to
> > > > > > > > > go through an IPI interrupt and the destination CPU needs to wake up
> > > > > > > its
> > > > > > > > > ksoftirqd or wait for the next scheduling cycle.
> > > > > > > > >
> > > > > > > > > If my suggested approach makes sense, I can go ahead to fix it that
> > > > > > > way.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Donghai
> > > > > > >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-27 22:34                 ` donghai qiao
@ 2021-07-28  0:10                   ` Paul E. McKenney
  2021-10-04 21:22                     ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-07-28  0:10 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu

On Tue, Jul 27, 2021 at 06:34:56PM -0400, donghai qiao wrote:
> On Sat, Jul 24, 2021 at 8:48 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Jul 23, 2021 at 08:01:34PM -0400, donghai qiao wrote:
> > > On Fri, Jul 23, 2021 at 3:06 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Fri, Jul 23, 2021 at 02:41:20PM -0400, donghai qiao wrote:
> > > > > On Fri, Jul 23, 2021 at 1:25 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Fri, Jul 23, 2021 at 12:20:41PM -0400, donghai qiao wrote:
> > > > > > > On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > > > > > > > > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > > > > > > > > RCU experts,
> > > > > > > > > >
> > > > > > > > > > When you reply, please also keep me CC'ed.
> > > > > > > > > >
> > > > > > > > > > The problem of RCU stall might be an old problem and it can happen
> > > > > > > > quite often.
> > > > > > > > > > As I have observed, when the problem occurs,  at least one CPU in the
> > > > > > > > system
> > > > > > > > > > on which its rdp->gp_seq falls behind others by 4 (qs).
> > > > > > > > > >
> > > > > > > > > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > > > > > > > > rdp->gp_seq = 0x1388a1.
> > > > > > > > > >
> > > > > > > > > > Because RCU stall issues can last a long period of time, the number of
> > > > > > > > callbacks
> > > > > > > > > > in the list rdp->cblist of all CPUs can accumulate to thousands. In
> > > > > > > > > > the worst case,
> > > > > > > > > > it triggers panic.
> > > > > > > > > >
> > > > > > > > > > When looking into the problem further, I'd think the problem is
> > > > > > > > related to the
> > > > > > > > > > Linux scheduler. When the RCU core detects the stall on a CPU,
> > > > > > > > rcu_gp_kthread
> > > > > > > > > > would send a rescheduling request via send_IPI to that CPU to try to
> > > > > > > > force a
> > > > > > > > > > context switch to make some progress. However, at least one situation
> > > > > > > > can fail
> > > > > > > > > > this effort, which is when the CPU is running a user thread and it is
> > > > > > > > the only
> > > > > > > > > > user thread in the rq, then this attempted context switching will not
> > > > > > > > happen
> > > > > > > > > > immediately. In particular if the system is also configured with
> > > > > > > > NOHZ_FULL for
> > > > > > > > >
> > > > > > > > > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > > > > > > > > can that CPU stall RCU? Because you need to be in a RCU read-side
> > > > > > > > > critical section to stall RCU. Or the problem you're talking here is
> > > > > > > > > about *recovering* from RCU stall?
> > > > > > >
> > > > > > > In response to Boqun's question, the crashdumps I analyzed were configured
> > > > > > > with this :
> > > > > > >
> > > > > > > CONFIG_PREEMPT_RCU=n
> > > > > > > CONFIG_PREEMPT_COUNT=n
> > > > > > > CONFIG_PROVE_RCU=n
> > > > > > >
> > > > > > > Because these configurations were not enabled, the compiler generated empty
> > > > > > > binary code for functions rcu_read_lock() and rcu_read_unlock() which
> > > > > > > delimit rcu read-side critical sections. And the crashdump showed both
> > > > > > > functions have no binary code in the kernel module and I am pretty sure.
> > > > > >
> > > > > > Agreed, that is expected behavior.
> > > > > >
> > > > > > > In the first place I thought this kernel might be built the wrong way,
> > > > > > > but later I found other sources that said this was ok.  That's why when
> > > > > > > CPUs enter or leave rcu critical section, the rcu core
> > > > > > > is not informed.
> > > > > >
> > > > > > If RCU core was informed every time that a CPU entered or left an RCU
> > > > > > read-side critical section, performance and scalability would be
> > > > > > abysmal.  So yes, this interaction is very arms-length.
> > > > >
> > > > > Thanks for confirming that.
> > > > >
> > > > > > > When the current grace period is closed, rcu_gp_kthread will open a new
> > > > > > > period for all. This will be reflected from every
> > > > > > > CPU's rdp->gp_seq. Every CPU is responsible to update its own gp when a
> > > > > > > progress is made. So when a cpu is running
> > > > > > > a user thread whilst a new period is open, it can not update its rcu unless
> > > > > > > a context switch occurs or upon a sched tick.
> > > > > > > But if a CPU is configured as NOHZ, this will be a problem to RCU, so rcu
> > > > > > > stall will happen.
> > > > > >
> > > > > > Except that if a CPU is running in nohz_full mode, each transition from
> > > > > > kernel to user execution must invoke rcu_user_enter() and each transition
> > > > > > back must invoke rcu_user_exit().  These update RCU's per-CPU state, which
> > > > > > allows RCU's grace-period kthread ("rcu_sched" in this configuration)
> > > > > > to detect even momentary nohz_full usermode execution.
> > > > >
> > > > > Yes, agreed.
> > > > >
> > > > > > You can check this in your crash dump by looking at the offending CPU's
> > > > > > rcu_data structure's ->dynticks field and comparing to the activities
> > > > > > of rcu_user_enter().
> > > > >
> > > > > On the rcu stalled CPU, its rdp->dynticks is far behind others. In the crashdump
> > > > > I examined, stall happened on CPU 0,  its dynticks is 0x6eab02, but dynticks on
> > > > > other CPUs are 0x82c192, 0x72a3b6, 0x880516 etc..
> > > >
> > > > That is expected behavior for a CPU running nohz_full user code for an
> > > > extended time period.  RCU is supposed to leave that CPU strictly alone,
> > > > after all.  ;-)
> > >
> > > Yeah, this does happen if the CPU is enabled nohz_full. But I believe
> > > I ever saw similar
> > > situation on a nohz enabled CPU as well.  So shouldn't RCU close off
> > > the gp for this CPU
> > > rather than waiting ? That's the way I am thinking of fixing it.
> >
> > Yes, the same thing can happen if a CPU remains idle for a long time.
> >
> > > > > > > When RCU detects that qs is stalled on a CPU, it tries to force a context
> > > > > > > switch to make progress on that CPU. This is
> > > > > > > done through a resched IPI. But this can not always succeed depending on
> > > > > > > the scheduler.   A while ago, this code
> > > > > > > process the resched IPI:
> > > > > > >
> > > > > > > void scheduler_ipi(void)
> > > > > > > {
> > > > > > >         ...
> > > > > > >         if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
> > > > > > >                 return;
> > > > > > >         ...
> > > > > > >         irq_enter();
> > > > > > >         sched_ttwu_pending();
> > > > > > >         ...
> > > > > > >         if (unlikely(got_nohz_idle_kick())) {
> > > > > > >                 this_rq()->idle_balance = 1;
> > > > > > >                 raise_softirq_irqoff(SCHED_SOFTIRQ);
> > > > > > >         }
> > > > > > >         irq_exit();
> > > > > > > }
> > > > > > >
> > > > > > > As you can see the function returns from the first "if statement" before it
> > > > > > > can issue a SCHED_SOFTIRQ. Later this
> > > > > > > code has been changed, but similar check/optimization remains in many
> > > > > > > places in the scheduler. The things I try to
> > > > > > > fix are those that resched_cpu fails to do.
> > > > > >
> > > > > > ???  Current mainline has this instead:
> > > > > >
> > > > > > static __always_inline void scheduler_ipi(void)
> > > > > > {
> > > > > >         /*
> > > > > >          * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
> > > > > >          * TIF_NEED_RESCHED remotely (for the first time) will also send
> > > > > >          * this IPI.
> > > > > >          */
> > > > > >         preempt_fold_need_resched();
> > > > > > }
> > > > >
> > > > > This function was changed a year ago in the upstream kernel. But this
> > > > > is not the only
> > > > > code that fails the request of resched from rcu.  The scheduler is
> > > > > optimized to avoid
> > > > > context switching which it thinks is unnecessary over the years.
> > > >
> > > > But RCU shouldn't get to the point where it would invoke resched_cpu().
> > > > Instead, it should see that CPU's rdp->dynticks value and report a
> > > > quiescent state on that CPU's behalf.  See the rcu_implicit_dynticks_qs()
> > > > function and its callers.
> > >
> > > What I was seeing in these crashdumps was that the rcu gp_kthread
> > > invoked rcu_implicit_dynticks_qs() ---> resched_cpu()
> > >
> > > rcu_implicit_dynticks_qs()
> > > {
> > >         ......
> > >         if (tick_nohz_full_cpu(rdp->cpu) && time_after(......) ) {
> > >                   WRITE_ONCE(...);
> > >                   resched_cpu(rdp->cpu);
> > >                   WRITE_ONCE(...);
> > >         }
> > >
> > >         if (time_after (...) {
> > >                    if (time_aftr(...) {
> > >                              resched_cpu(rdp->cpu);
> > >                              ...
> > >                    }
> > >                    ...
> > >          }
> > >          return 0;
> > > }
> > >
> > > The first time, resched_cpu() was invoked from the first "if
> > > statement",  later, it was invoked
> > > from the second "if statement".  But because the scheduler ignored
> > > that resched request from
> > > rcu due to optimization, nothing happened.
> >
> > I agree that this can happen.  However, it should not have happened
> > in this case, because...
> >
> > > > > > Combined with the activities of resched_curr(), which is invoked
> > > > > > from resched_cpu(), this should force a call to the scheduler on
> > > > > > the return path from this IPI.
> > > > > >
> > > > > > So what kernel version are you using?
> > > > >
> > > > > The crashdumps I have examined were generated from 4.18.0-269 which is rhel-8.4.
> > > > > But this problem is reproducible on fedora 34 and the latest upstream kernel,
> > > > > however, I don't have a crashdump of that kind right now.
> > > >
> > > > Interesting.  Are those systems doing anything unusual?  Long-running
> > > > interrupts, CPU-hotplug operations, ... ?
> > >
> > > Only some regular test suites. Nothing special was running. I believe no long
> > > interrupts,  no CPU-hotplug. But I noticed that the console was outputting
> > > some message via printk(), but the message was not too long.
> > > >
> > > > > > Recent kernels have logic to enable the tick on nohz_full CPUs that are
> > > > > > slow to supply RCU with a quiescent state, but this should happen only
> > > > > > when such CPUs are spinning in kernel mode.  Again, usermode execution
> > > > > > is dealt with by rcu_user_enter().
> > > > >
> > > > > That also reflected why the CPU was running a user thread when the RCU stall
> > > > > was detected.  So I guess something should be done for this case.
> > > >
> > > > You lost me on this one.  Did the rdp->dynticks value show that the
> > > > CPU was in an extended quiescent state?
> > >
> > > The value of rdp->dynticks on the stalled CPU0  was 0x6eab02,
> > > thus    rdp->dynticks & RCU_DYNTICK_CTRL_CTR != 0
> > > so it was not in an extended qs (idle).
> >
> > And, assuming that this CPU was executing in nohz_full usermode, this
> > is a bug.  If a CPU is in nohz_full usermode, then its rdp->dynticks
> > field should have the RCU_DYNTICK_CTRL_CTR bit set to zero.
> >
> > The usual cause of this situation is that there is a path to
> > or run nohz_full usermode that fails to call rcu_user_enter() or
> > rcu_user_exit(), respectively.  Similar problems can arise if there is
> > a path between nohz_full userspace and interrupt context that does not
> > invoke rcu_irq_enter() (from userspace to interrupt) or rcu_irq_exit()
> > (from interrupt back to userspace).
> >
> Yes, that's true.
> Because I am dealing with this issue in multiple kernel versions, sometimes
> the configurations in these kernels may different. Initially the
> problem I described
> originated to rhel-8 on which the problem occurs more often and is a bit easier
> to reproduce than others.

Understood, that does make things more difficult.

> Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
>    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
>    - dynticks_nesting = 1    which is in its initial state, so it said
> it was not in eqs either.
>    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> CPU had been
>      interrupted when it was in the middle of the first interrupt.
> And this is true: the first
>      interrupt was the sched_timer interrupt, and the second was a NMI
> when another
>     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> If the kernel missed
>     a rcu_user_enter or rcu_user_exit, would these items remain
> identical ?  But I'll
>     investigate that possibility seriously as you pointed out.

So is the initial state non-eqs because it was interrupted from kernel
mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
incorrectly equal to the value of 1?  Or something else?

> > There were some issues of this sort around the v5.8 timeframe.  Might
> > there be another patch that needs to be backported?  Or a patch that
> > was backported, but should not have been?
> 
> Good to know that clue. I'll take a look into the log history.
> 
> > Is it possible to bisect this?
> >
> > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> 
> I am building the latest 5.14 kernel with this config and give it a try when the
> machine is set up, see how much it can help.

Very good, as that will help determine whether or not the problem is
due to backporting issues.

> > Either way, what should happen is that dyntick_save_progress_counter() or
> > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > nohz_full user execution, and then the quiescent state will be supplied
> > on behalf of that CPU.
> 
> Agreed. But the counter rdp->dynticks of the CPU can only be updated
> by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> or rcu_eqs_exit() is called, which in turn depends on the context switch.
> So, when the context switch never happens, the counter rdp->dynticks
> never advances. That's the thing I try to fix here.

First, understand the problem.  Otherwise, your fix is not so likely
to actually fix anything.  ;-)

If kernel mode was interrupted, there is probably a missing cond_resched().
But in sufficiently old kernels, cond_resched() doesn't do anything for
RCU unless a context switch actually happened.  In some of those kernels,
you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
really old kernels, life is hard and you will need to do some backporting.
Or move to newer kernels.

In short, if an in-kernel code path runs for long enough without hitting
a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
that you will get is your diagnostic.
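
That is, the long-running path needs something like the following (a
generic sketch; the loop and helper names are made up):

/* hypothetical long-running kernel loop */
while (more_work_to_do()) {
        do_a_bounded_chunk_of_work();
        cond_resched();         /* gives the scheduler -- and thus RCU --
                                 * a chance to note a quiescent state */
}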

If this is a usermode process that is not in nohz_full mode, the
scheduler-clock interrupt should be enabled, and RCU will then see
the interrupt from usermode, which will supply the quiescent state.

If this is a usermode process that -is- in nohz_full mode, then, as
noted earlier, you might be missing an rcu_user_enter() or similar.

> > Now it is possible that you also have a problem where resched_cpu() isn't
> > working.  But if so, that is a separate bug that should also be fixed.
> 
> agreed.
> 
> BTW, I feel like there should be a more reliable and efficient way to
> let the RCU core deal with  gp and qs. I guess the existing code around
> area can be simplified a lot. But we can start another discussion on that.

Oh, it -definitely- could be simplified.  But can it be simplified and
still do what its users need to?  Now -that- is the question.  So
you might want to take a look at RCU's requirements.  You can find
them here:  Documentation/RCU/Design/Requirements/Requirements.rst

							Thanx, Paul

> Thanks
> Donghai
> >
> >                                                         Thanx, Paul
> >
> > > Thanks
> > > Donghai
> > >
> > >
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > > > > Hope this explains it.
> > > > > > > Donghai
> > > > > > >
> > > > > > >
> > > > > > > > Excellent point, Boqun!
> > > > > > > >
> > > > > > > > Donghai, have you tried reproducing this using a kernel built with
> > > > > > > > CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > >
> > > > > > >
> > > > > > > I can give this configuration a try. Will let you know the results.
> > > > > >
> > > > > > This should help detect any missing rcu_user_enter() or rcu_user_exit()
> > > > > > calls.
> > > > >
> > > > > Got it.
> > > > >
> > > > > Thanks
> > > > > Donghai
> > > > >
> > > > > >
> > > > > >                                                         Thanx, Paul
> > > > > >
> > > > > > > Thanks.
> > > > > > > Donghai
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >                                                         Thanx, Paul
> > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Boqun
> > > > > > > > >
> > > > > > > > > > the CPU and as long as the user thread is running, the forced context
> > > > > > > > > > switch will
> > > > > > > > > > never happen unless the user thread volunteers to yield the CPU. I
> > > > > > > > think this
> > > > > > > > > > should be one of the major root causes of these RCU stall issues. Even
> > > > > > > > if
> > > > > > > > > > NOHZ_FULL is not configured, there will be at least 1 tick delay which
> > > > > > > > can
> > > > > > > > > > affect the realtime kernel, by the way.
> > > > > > > > > >
> > > > > > > > > > But it seems not a good idea to craft a fix from the scheduler side
> > > > > > > > because
> > > > > > > > > > this has to invalidate some existing scheduling optimizations. The
> > > > > > > > current
> > > > > > > > > > scheduler is deliberately optimized to avoid such context switching.
> > > > > > > > So my
> > > > > > > > > > question is why the RCU core cannot effectively update qs for the
> > > > > > > > stalled CPU
> > > > > > > > > > when it detects that the stalled CPU is running a user thread?  The
> > > > > > > > reason
> > > > > > > > > > is pretty obvious because when a CPU is running a user thread, it must
> > > > > > > > not
> > > > > > > > > > be in any kernel read-side critical sections. So it should be safe to
> > > > > > > > close
> > > > > > > > > > its current RCU grace period on this CPU. Also, with this approach we
> > > > > > > > can make
> > > > > > > > > > RCU work more efficiently than the approach of context switch which
> > > > > > > > needs to
> > > > > > > > > > go through an IPI interrupt and the destination CPU needs to wake up
> > > > > > > > its
> > > > > > > > > > ksoftirqd or wait for the next scheduling cycle.
> > > > > > > > > >
> > > > > > > > > > If my suggested approach makes sense, I can go ahead to fix it that
> > > > > > > > way.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Donghai
> > > > > > > >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-07-28  0:10                   ` Paul E. McKenney
@ 2021-10-04 21:22                     ` donghai qiao
  2021-10-05  0:59                       ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-10-04 21:22 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu

Hello Paul,
Sorry it has been so long.
> > Because I am dealing with this issue in multiple kernel versions, sometimes
> > the configurations in these kernels may different. Initially the
> > problem I described
> > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > to reproduce than others.
>
> Understood, that does make things more difficult.
>
> > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> >    - dynticks_nesting = 1    which is in its initial state, so it said
> > it was not in eqs either.
> >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > CPU had been
> >      interrupted when it was in the middle of the first interrupt.
> > And this is true: the first
> >      interrupt was the sched_timer interrupt, and the second was a NMI
> > when another
> >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > If the kernel missed
> >     a rcu_user_enter or rcu_user_exit, would these items remain
> > identical ?  But I'll
> >     investigate that possibility seriously as you pointed out.
>
> So is the initial state non-eqs because it was interrupted from kernel
> mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> incorrectly equal to the value of 1?  Or something else?

As far as the original problem is concerned, the user thread was interrupted
by the timer, so the CPU was not working in nohz mode. But I have seen similar
problems on CPUs working in nohz mode with different configurations.
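
(Which is also why I would have expected the tick itself to report the
quiescent state in that original case -- if I am reading the
CONFIG_PREEMPT_RCU=n code right, it is roughly this path in
kernel/rcu/tree_plugin.h; please take the exact shape as an assumption on
my part:)

/* invoked from the scheduler-clock interrupt via rcu_sched_clock_irq(user) */
static void rcu_flavor_sched_clock_irq(int user)
{
        if (user || rcu_is_cpu_rrupt_from_idle()) {
                /*
                 * The interrupt arrived from usermode or from the idle
                 * loop and is not nested, so this CPU is in a quiescent
                 * state: record it.
                 */
                rcu_qs();
        }
}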

>
> > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > there be another patch that needs to be backported?  Or a patch that
> > > was backported, but should not have been?
> >
> > Good to know that clue. I'll take a look into the log history.
> >
> > > Is it possible to bisect this?
> > >
> > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> >
> > I am building the latest 5.14 kernel with this config and give it a try when the
> > machine is set up, see how much it can help.
>
> Very good, as that will help determine whether or not the problem is
> due to backporting issues.

I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and tried it for both
the latest rhel8 and a later upstream version 5.15.0-r1; it turns out no
new warning messages related to this came out. So rcu_user_enter() and
rcu_user_exit() should be paired correctly.

>
> > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > nohz_full user execution, and then the quiescent state will be supplied
> > > on behalf of that CPU.
> >
> > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > So, when the context switch never happens, the counter rdp->dynticks
> > never advances. That's the thing I try to fix here.
>
> First, understand the problem.  Otherwise, your fix is not so likely
> to actually fix anything.  ;-)
>
> If kernel mode was interrupted, there is probably a missing cond_resched().
> But in sufficiently old kernels, cond_resched() doesn't do anything for
> RCU unless a context switch actually happened.  In some of those kernels,
> you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> really old kernels, life is hard and you will need to do some backporting.
> Or move to newer kernels.
>
> In short, if an in-kernel code path runs for long enough without hitting
> a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> that you will get is your diagnostic.

Probably this is the case. With the test on 5.15.0-r1, I have seen different
scenarios; among them, the most frequent ones were caused by networking,
in which a bunch of networking threads were spinning on the same rwlock.

For instance, in one of them, the ticks_this_gp of a rcu_data could go as
large as 12166 ticks, which is 12+ seconds. The thread on this CPU was
doing networking work and finally it was spinning as a writer on a rwlock
which had been locked by 16 readers.  By the way, there were 70 writers
of this kind blocked on the same rwlock.

When examining the readers of the lock, except for the following code path,
I don't see any other obvious problems, e.g.:
 #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
 #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
 #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
 #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
 #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
#10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
#11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
#12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
#13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
#14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
#15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6

In the function ip_local_deliver_finish() in this stack, a lot of work needs
to be done by ip_protocol_deliver_rcu(). But this function is invoked from
inside an RCU read-side critical section.

static int ip_local_deliver_finish(struct net *net, struct sock *sk,
                                   struct sk_buff *skb)
{
        __skb_pull(skb, skb_network_header_len(skb));

        rcu_read_lock();
        ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
        rcu_read_unlock();

        return 0;
}

Actually there are multiple chances that this code path can hit
spinning locks starting from ip_protocol_deliver_rcu(). This kind of
usage does not look quite right to me, but I'd like to know your opinion
on this first.

Thanks
Donghai

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-04 21:22                     ` donghai qiao
@ 2021-10-05  0:59                       ` Paul E. McKenney
  2021-10-05 16:10                         ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-10-05  0:59 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu, netdev

On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> Hello Paul,
> Sorry it has been long..

On this problem, your schedule is my schedule.  At least as long as you
are not expecting instantaneous response.  ;-)

> > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > the configurations in these kernels may different. Initially the
> > > problem I described
> > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > to reproduce than others.
> >
> > Understood, that does make things more difficult.
> >
> > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > it was not in eqs either.
> > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > CPU had been
> > >      interrupted when it was in the middle of the first interrupt.
> > > And this is true: the first
> > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > when another
> > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > If the kernel missed
> > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > identical ?  But I'll
> > >     investigate that possibility seriously as you pointed out.
> >
> > So is the initial state non-eqs because it was interrupted from kernel
> > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > incorrectly equal to the value of 1?  Or something else?
> 
> As far as the original problem is concerned, the user thread was interrupted by
> the timer, so the CPU was not working in the nohz mode. But I saw the similar
> problems on CPUs working in nohz mode with different configurations.

OK.

> > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > there be another patch that needs to be backported?  Or a patch that
> > > > was backported, but should not have been?
> > >
> > > Good to know that clue. I'll take a look into the log history.
> > >
> > > > Is it possible to bisect this?
> > > >
> > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > >
> > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > machine is set up, see how much it can help.
> >
> > Very good, as that will help determine whether or not the problem is
> > due to backporting issues.
> 
> I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> turns out no new warning messages related to this came out. So,
> rcu_user_enter/rcu_user_exit() should be paired right.

OK, good.

> > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > on behalf of that CPU.
> > >
> > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > So, when the context switch never happens, the counter rdp->dynticks
> > > never advances. That's the thing I try to fix here.
> >
> > First, understand the problem.  Otherwise, your fix is not so likely
> > to actually fix anything.  ;-)
> >
> > If kernel mode was interrupted, there is probably a missing cond_resched().
> > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > RCU unless a context switch actually happened.  In some of those kernels,
> > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > really old kernels, life is hard and you will need to do some backporting.
> > Or move to newer kernels.
> >
> > In short, if an in-kernel code path runs for long enough without hitting
> > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > that you will get is your diagnostic.
> 
> Probably this is the case. With the test for 5.15.0-r1, I have seen different
> scenarios, among them the most frequent ones were caused by the networking
> in which a bunch of networking threads were spinning on the same rwlock.
> 
> For instance in one of them, the ticks_this_gp of a rcu_data could go as
> large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> doing networking work and finally it was spinning as a writer on a rwlock
> which had been locked by 16 readers.  By the way, there were 70 this
> kinds of writers were blocked on the same rwlock.

OK, a lock-contention problem.  The networking folks have fixed a
very large number of these over the years, though, so I wonder what is
special about this one so that it is just now showing up.  I have added
a networking list on CC for their thoughts.

> When examining the readers of the lock, except the following code,
> don't see any other obvious problems: e.g
>  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
>  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
>  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
>  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
>  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> 
> In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> a rcu reader side section.
> 
> static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> struct sk_buff *skb)
> {
>         __skb_pull(skb, skb_network_header_len(skb));
> 
>         rcu_read_lock();
>         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
>         rcu_read_unlock();
> 
>         return 0;
> }
> 
> Actually there are multiple chances that this code path can hit
> spinning locks starting from ip_protocol_deliver_rcu(). This kind
> usage looks not quite right. But I'd like to know your opinion on this first ?

It is perfectly legal to acquire spinlocks in RCU read-side critical
sections.  In fact, this is one of the few ways to safely acquire a
per-object lock while still maintaining good performance and
scalability.
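
For illustration, a minimal sketch of that pattern might look like the
following.  (This is only a sketch; obj, obj_lookup(), and the hit counter
are made-up names for this example, not anything from the networking stack.)

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct obj {
        spinlock_t lock;        /* per-object lock */
        unsigned long hits;
};

/* Assumed to perform an RCU-protected lookup in some hash table. */
struct obj *obj_lookup(int key);

static void obj_account_hit(int key)
{
        struct obj *o;

        rcu_read_lock();
        o = obj_lookup(key);            /* lookup protected by RCU */
        if (o) {
                spin_lock(&o->lock);    /* legal inside the RCU reader */
                o->hits++;
                spin_unlock(&o->lock);
        }
        rcu_read_unlock();
}

The RCU read-side critical section guarantees that the object cannot be
freed out from under us, while the per-object spinlock serializes the
update, so no global lock is needed on this path.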

My guess is that the thing to track down is the cause of the high contention
on that reader-writer spinlock.  Missed patches, misconfiguration, etc.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-05  0:59                       ` Paul E. McKenney
@ 2021-10-05 16:10                         ` donghai qiao
  2021-10-05 16:39                           ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-10-05 16:10 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu, netdev

On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > Hello Paul,
> > Sorry it has been long..
>
> On this problem, your schedule is my schedule.  At least as long as your
> are not expecting instantaneous response.  ;-)
>
> > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > the configurations in these kernels may different. Initially the
> > > > problem I described
> > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > to reproduce than others.
> > >
> > > Understood, that does make things more difficult.
> > >
> > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > it was not in eqs either.
> > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > CPU had been
> > > >      interrupted when it was in the middle of the first interrupt.
> > > > And this is true: the first
> > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > when another
> > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > If the kernel missed
> > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > identical ?  But I'll
> > > >     investigate that possibility seriously as you pointed out.
> > >
> > > So is the initial state non-eqs because it was interrupted from kernel
> > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > incorrectly equal to the value of 1?  Or something else?
> >
> > As far as the original problem is concerned, the user thread was interrupted by
> > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > problems on CPUs working in nohz mode with different configurations.
>
> OK.
>
> > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > was backported, but should not have been?
> > > >
> > > > Good to know that clue. I'll take a look into the log history.
> > > >
> > > > > Is it possible to bisect this?
> > > > >
> > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > >
> > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > machine is set up, see how much it can help.
> > >
> > > Very good, as that will help determine whether or not the problem is
> > > due to backporting issues.
> >
> > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > turns out no new warning messages related to this came out. So,
> > rcu_user_enter/rcu_user_exit() should be paired right.
>
> OK, good.
>
> > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > on behalf of that CPU.
> > > >
> > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > never advances. That's the thing I try to fix here.
> > >
> > > First, understand the problem.  Otherwise, your fix is not so likely
> > > to actually fix anything.  ;-)
> > >
> > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > RCU unless a context switch actually happened.  In some of those kernels,
> > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > really old kernels, life is hard and you will need to do some backporting.
> > > Or move to newer kernels.
> > >
> > > In short, if an in-kernel code path runs for long enough without hitting
> > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > that you will get is your diagnostic.
> >
> > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > scenarios, among them the most frequent ones were caused by the networking
> > in which a bunch of networking threads were spinning on the same rwlock.
> >
> > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > doing networking work and finally it was spinning as a writer on a rwlock
> > which had been locked by 16 readers.  By the way, there were 70 this
> > kinds of writers were blocked on the same rwlock.
>
> OK, a lock-contention problem.  The networking folks have fixed a
> very large number of these over the years, though, so I wonder what is
> special about this one so that it is just now showing up.  I have added
> a networking list on CC for their thoughts.

Thanks for pulling the networking folks in. If they need the coredump, I can
forward it to them.  It's definitely worth analyzing, as this contention
might be a performance issue.  We can discuss this further in this
email thread if they are fine with that, or we can discuss it in a separate
email thread with netdev@ only.

So back to my original problem: this might be one of the possibilities that
led to the RCU stall panic.  Just imagine that this type of contention
occurred and lasted long enough. When it finally came to an end, the
timer interrupt occurred, rcu_sched_clock_irq detected the RCU stall on
the CPU, and the kernel panicked.

So we definitely need to understand these networking activities here,
in particular why the readers could hold the rwlock for so long.

>
> > When examining the readers of the lock, except the following code,
> > don't see any other obvious problems: e.g
> >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> >
> > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > a rcu reader side section.
> >
> > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > struct sk_buff *skb)
> > {
> >         __skb_pull(skb, skb_network_header_len(skb));
> >
> >         rcu_read_lock();
> >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> >         rcu_read_unlock();
> >
> >         return 0;
> > }
> >
> > Actually there are multiple chances that this code path can hit
> > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > usage looks not quite right. But I'd like to know your opinion on this first ?
>
> It is perfectly legal to acquire spinlocks in RCU read-side critical
> sections.  In fact, this is one of the few ways to safely acquire a
> per-object lock while still maintaining good performance and
> scalability.

Sure, understood. But the RCU-related docs say that anything causing
the reader side to block must be avoided.

>
> My guess is that the thing to track down is the cause of the high contention
> on that reader-writer spinlock.  Missed patches, misconfiguration, etc.

Actually, the test was against a recent upstream 5.15.0-r1, but I can try
the latest r4.  Regarding the network configuration, I believe I didn't do
anything special, I just used the defaults.

Thanks
Donghai

>
>                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-05 16:10                         ` donghai qiao
@ 2021-10-05 16:39                           ` Paul E. McKenney
  2021-10-06  0:25                             ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-10-05 16:39 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu, netdev

On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > Hello Paul,
> > > Sorry it has been long..
> >
> > On this problem, your schedule is my schedule.  At least as long as your
> > are not expecting instantaneous response.  ;-)
> >
> > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > the configurations in these kernels may different. Initially the
> > > > > problem I described
> > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > to reproduce than others.
> > > >
> > > > Understood, that does make things more difficult.
> > > >
> > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > it was not in eqs either.
> > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > CPU had been
> > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > And this is true: the first
> > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > when another
> > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > If the kernel missed
> > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > identical ?  But I'll
> > > > >     investigate that possibility seriously as you pointed out.
> > > >
> > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > incorrectly equal to the value of 1?  Or something else?
> > >
> > > As far as the original problem is concerned, the user thread was interrupted by
> > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > problems on CPUs working in nohz mode with different configurations.
> >
> > OK.
> >
> > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > was backported, but should not have been?
> > > > >
> > > > > Good to know that clue. I'll take a look into the log history.
> > > > >
> > > > > > Is it possible to bisect this?
> > > > > >
> > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > >
> > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > machine is set up, see how much it can help.
> > > >
> > > > Very good, as that will help determine whether or not the problem is
> > > > due to backporting issues.
> > >
> > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > turns out no new warning messages related to this came out. So,
> > > rcu_user_enter/rcu_user_exit() should be paired right.
> >
> > OK, good.
> >
> > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > on behalf of that CPU.
> > > > >
> > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > never advances. That's the thing I try to fix here.
> > > >
> > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > to actually fix anything.  ;-)
> > > >
> > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > really old kernels, life is hard and you will need to do some backporting.
> > > > Or move to newer kernels.
> > > >
> > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > that you will get is your diagnostic.
> > >
> > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > scenarios, among them the most frequent ones were caused by the networking
> > > in which a bunch of networking threads were spinning on the same rwlock.
> > >
> > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > doing networking work and finally it was spinning as a writer on a rwlock
> > > which had been locked by 16 readers.  By the way, there were 70 this
> > > kinds of writers were blocked on the same rwlock.
> >
> > OK, a lock-contention problem.  The networking folks have fixed a
> > very large number of these over the years, though, so I wonder what is
> > special about this one so that it is just now showing up.  I have added
> > a networking list on CC for their thoughts.
> 
> Thanks for pulling the networking in. If they need the coredump, I can
> forward it to them.  It's definitely worth analyzing it as this contention
> might be a performance issue.  Or we can discuss this further in this
> email thread if they are fine, or we can discuss it over with a separate
> email thread with netdev@ only.
> 
> So back to my original problem, this might be one of the possibilities that
> led to RCU stall panic.  Just imagining this type of contention might have
> occurred and lasted long enough. When it finally came to the end, the
> timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> stall on the CPU and panic.
> 
> So definitely we need to understand these networking activities here as
> to why the readers could hold the rwlock too long.

I strongly suggest that you also continue to do your own analysis on this.
So please see below.

> > > When examining the readers of the lock, except the following code,
> > > don't see any other obvious problems: e.g
> > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > >
> > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > a rcu reader side section.
> > >
> > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > struct sk_buff *skb)
> > > {
> > >         __skb_pull(skb, skb_network_header_len(skb));
> > >
> > >         rcu_read_lock();
> > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > >         rcu_read_unlock();
> > >
> > >         return 0;
> > > }
> > >
> > > Actually there are multiple chances that this code path can hit
> > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > usage looks not quite right. But I'd like to know your opinion on this first ?
> >
> > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > sections.  In fact, this is one of the few ways to safely acquire a
> > per-object lock while still maintaining good performance and
> > scalability.
> 
> Sure, understand. But the RCU related docs said that anything causing
> the reader side to block must be avoided.

True.  But this is the Linux kernel, where "block" means something
like "invoke schedule()" or "sleep" instead of the academic-style
non-blocking-synchronization definition.  So it is perfectly legal to
acquire spinlocks within RCU read-side critical sections.

And before you complain that practitioners are not following the academic
definitions, please keep in mind that our definitions were here first.  ;-)

> > My guess is that the thing to track down is the cause of the high contention
> > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> 
> Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> the latest r4.  Regarding the network configure, I believe I didn't do anything
> special, just use the default.

Does this occur on older mainline kernels?  If not, I strongly suggest
bisecting, as this often quickly and easily finds the problem.

Bisection can also help you find the patch to be backported if a later
release fixes the bug, though things like gitk can also be helpful.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-05 16:39                           ` Paul E. McKenney
@ 2021-10-06  0:25                             ` donghai qiao
  2021-10-18 21:18                               ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-10-06  0:25 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu, netdev

On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > Hello Paul,
> > > > Sorry it has been long..
> > >
> > > On this problem, your schedule is my schedule.  At least as long as your
> > > are not expecting instantaneous response.  ;-)
> > >
> > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > the configurations in these kernels may different. Initially the
> > > > > > problem I described
> > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > to reproduce than others.
> > > > >
> > > > > Understood, that does make things more difficult.
> > > > >
> > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > it was not in eqs either.
> > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > CPU had been
> > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > And this is true: the first
> > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > when another
> > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > If the kernel missed
> > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > identical ?  But I'll
> > > > > >     investigate that possibility seriously as you pointed out.
> > > > >
> > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > incorrectly equal to the value of 1?  Or something else?
> > > >
> > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > problems on CPUs working in nohz mode with different configurations.
> > >
> > > OK.
> > >
> > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > was backported, but should not have been?
> > > > > >
> > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > >
> > > > > > > Is it possible to bisect this?
> > > > > > >
> > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > >
> > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > machine is set up, see how much it can help.
> > > > >
> > > > > Very good, as that will help determine whether or not the problem is
> > > > > due to backporting issues.
> > > >
> > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > turns out no new warning messages related to this came out. So,
> > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > >
> > > OK, good.
> > >
> > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > on behalf of that CPU.
> > > > > >
> > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > never advances. That's the thing I try to fix here.
> > > > >
> > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > to actually fix anything.  ;-)
> > > > >
> > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > Or move to newer kernels.
> > > > >
> > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > that you will get is your diagnostic.
> > > >
> > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > scenarios, among them the most frequent ones were caused by the networking
> > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > >
> > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > kinds of writers were blocked on the same rwlock.
> > >
> > > OK, a lock-contention problem.  The networking folks have fixed a
> > > very large number of these over the years, though, so I wonder what is
> > > special about this one so that it is just now showing up.  I have added
> > > a networking list on CC for their thoughts.
> >
> > Thanks for pulling the networking in. If they need the coredump, I can
> > forward it to them.  It's definitely worth analyzing it as this contention
> > might be a performance issue.  Or we can discuss this further in this
> > email thread if they are fine, or we can discuss it over with a separate
> > email thread with netdev@ only.
> >
> > So back to my original problem, this might be one of the possibilities that
> > led to RCU stall panic.  Just imagining this type of contention might have
> > occurred and lasted long enough. When it finally came to the end, the
> > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > stall on the CPU and panic.
> >
> > So definitely we need to understand these networking activities here as
> > to why the readers could hold the rwlock too long.
>
> I strongly suggest that you also continue to do your own analysis on this.
> So please see below.

This is just a brief summary of my analysis, and the stack info below is not
enough for other people to figure out anything useful. I meant that if they
are really interested, I can upload the core file. I think this is fair.

>
> > > > When examining the readers of the lock, except the following code,
> > > > don't see any other obvious problems: e.g
> > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > >
> > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > a rcu reader side section.
> > > >
> > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > struct sk_buff *skb)
> > > > {
> > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > >
> > > >         rcu_read_lock();
> > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > >         rcu_read_unlock();
> > > >
> > > >         return 0;
> > > > }
> > > >
> > > > Actually there are multiple chances that this code path can hit
> > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > >
> > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > sections.  In fact, this is one of the few ways to safely acquire a
> > > per-object lock while still maintaining good performance and
> > > scalability.
> >
> > Sure, understand. But the RCU related docs said that anything causing
> > the reader side to block must be avoided.
>
> True.  But this is the Linux kernel, where "block" means something
> like "invoke schedule()" or "sleep" instead of the academic-style
> non-blocking-synchronization definition.  So it is perfectly legal to
> acquire spinlocks within RCU read-side critical sections.
>
> And before you complain that practitioners are not following the academic
> definitions, please keep in mind that our definitions were here first.  ;-)
>
> > > My guess is that the thing to track down is the cause of the high contention
> > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> >
> > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > special, just use the default.
>
> Does this occur on older mainline kernels?  If not, I strongly suggest
> bisecting, as this often quickly and easily finds the problem.

Actually it does. But let's focus on the latest upstream and the latest rhel8;
this way, we will not have to worry about missing the needed rcu patches.
However, in rhel8 the kernel stack running on the rcu-stalled CPU is not
networking related, and I am still working on that.  So there might be
multiple root causes.

> Bisection can also help you find the patch to be backported if a later
> release fixes the bug, though things like gitk can also be helpful.

Unfortunately, this is reproducible on the latest bits.

Thanks
Donghai
>
>                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-06  0:25                             ` donghai qiao
@ 2021-10-18 21:18                               ` donghai qiao
  2021-10-18 23:46                                 ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-10-18 21:18 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu, netdev

I just want to follow up on this discussion. First off, the latest issue
I mentioned in the email of Oct 4th, which exhibited a networking symptom,
appears to be a problem in qrwlock.c. In particular, the problem is caused
by the 'if' statement in the function queued_read_lock_slowpath() below:

void queued_read_lock_slowpath(struct qrwlock *lock)
{
        /*
         * Readers come here when they cannot get the lock without waiting
         */
        if (unlikely(in_interrupt())) {
                /*
                 * Readers in interrupt context will get the lock immediately
                 * if the writer is just waiting (not holding the lock yet),
                 * so spin with ACQUIRE semantics until the lock is available
                 * without waiting in the queue.
                 */
                atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
                return;
        }
        ...
}

That 'if' statement says that if we are in an interrupt context and we are
a reader, then we are allowed to take the lock as a reader no matter
whether there are writers waiting for it or not. So, in the circumstance
where network packets steadily come in and the intervals between them are
small enough, the writers will have no chance to acquire the lock. This
should be the root cause of that case.

I have verified this by removing the 'if' and rerunning the test multiple
times. The same symptom hasn't been reproduced.  As far as RCU stalls are
concerned as a broader range of problems, this is absolutely not the only
root cause I have seen.  Actually, too many things can delay context
switching.  Do you have a long-term plan to fix this issue, or do you just
want to treat it case by case?

Secondly, back to the following code I brought up the other day. Actually
it is not as simple as a spinlock.

      rcu_read_lock();
      ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
      rcu_read_unlock();

Are you all aware of all the potential functions that ip_protocol_deliver_rcu()
will call?  As far as I can see, there is a code path from
ip_protocol_deliver_rcu() to kmem_cache_alloc() which can end up in a call to
cond_resched().  Because the operations in memory allocation are complicated,
we cannot always expect a prompt return with success.  When the system is
running out of memory, RCU cannot close the current grace period, so a great
number of callbacks will be delayed, and the freeing of the memory they hold
will be delayed as well.  This sounds like a deadlock in the resource flow.


Thanks
Donghai


On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > Hello Paul,
> > > > > Sorry it has been long..
> > > >
> > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > are not expecting instantaneous response.  ;-)
> > > >
> > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > problem I described
> > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > to reproduce than others.
> > > > > >
> > > > > > Understood, that does make things more difficult.
> > > > > >
> > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > it was not in eqs either.
> > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > CPU had been
> > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > And this is true: the first
> > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > when another
> > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > If the kernel missed
> > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > identical ?  But I'll
> > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > >
> > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > >
> > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > problems on CPUs working in nohz mode with different configurations.
> > > >
> > > > OK.
> > > >
> > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > was backported, but should not have been?
> > > > > > >
> > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > >
> > > > > > > > Is it possible to bisect this?
> > > > > > > >
> > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > >
> > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > machine is set up, see how much it can help.
> > > > > >
> > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > due to backporting issues.
> > > > >
> > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > turns out no new warning messages related to this came out. So,
> > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > >
> > > > OK, good.
> > > >
> > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > on behalf of that CPU.
> > > > > > >
> > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > never advances. That's the thing I try to fix here.
> > > > > >
> > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > to actually fix anything.  ;-)
> > > > > >
> > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > Or move to newer kernels.
> > > > > >
> > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > that you will get is your diagnostic.
> > > > >
> > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > >
> > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > kinds of writers were blocked on the same rwlock.
> > > >
> > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > very large number of these over the years, though, so I wonder what is
> > > > special about this one so that it is just now showing up.  I have added
> > > > a networking list on CC for their thoughts.
> > >
> > > Thanks for pulling the networking in. If they need the coredump, I can
> > > forward it to them.  It's definitely worth analyzing it as this contention
> > > might be a performance issue.  Or we can discuss this further in this
> > > email thread if they are fine, or we can discuss it over with a separate
> > > email thread with netdev@ only.
> > >
> > > So back to my original problem, this might be one of the possibilities that
> > > led to RCU stall panic.  Just imagining this type of contention might have
> > > occurred and lasted long enough. When it finally came to the end, the
> > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > stall on the CPU and panic.
> > >
> > > So definitely we need to understand these networking activities here as
> > > to why the readers could hold the rwlock too long.
> >
> > I strongly suggest that you also continue to do your own analysis on this.
> > So please see below.
>
> This is just a brief of my analysis and the stack info below is not enough
> for other people to figure out anything useful. I meant if they are really
> interested, I can upload the core file. I think this is fair.
>
> >
> > > > > When examining the readers of the lock, except the following code,
> > > > > don't see any other obvious problems: e.g
> > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > >
> > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > a rcu reader side section.
> > > > >
> > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > struct sk_buff *skb)
> > > > > {
> > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > >
> > > > >         rcu_read_lock();
> > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > >         rcu_read_unlock();
> > > > >
> > > > >         return 0;
> > > > > }
> > > > >
> > > > > Actually there are multiple chances that this code path can hit
> > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > >
> > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > per-object lock while still maintaining good performance and
> > > > scalability.
> > >
> > > Sure, understand. But the RCU related docs said that anything causing
> > > the reader side to block must be avoided.
> >
> > True.  But this is the Linux kernel, where "block" means something
> > like "invoke schedule()" or "sleep" instead of the academic-style
> > non-blocking-synchronization definition.  So it is perfectly legal to
> > acquire spinlocks within RCU read-side critical sections.
> >
> > And before you complain that practitioners are not following the academic
> > definitions, please keep in mind that our definitions were here first.  ;-)
> >
> > > > My guess is that the thing to track down is the cause of the high contention
> > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > >
> > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > special, just use the default.
> >
> > Does this occur on older mainline kernels?  If not, I strongly suggest
> > bisecting, as this often quickly and easily finds the problem.
>
> Actually It does. But let's focus on the latest upstream and the latest rhel8.
> This way, we will not worry about missing the needed rcu patches.
> However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> networking related, which I am still working on.  So, there might be
> multiple root causes.
>
> > Bisection can also help you find the patch to be backported if a later
> > release fixes the bug, though things like gitk can also be helpful.
>
> Unfortunately, this is reproducible on the latest bit.
>
> Thanks
> Donghai
> >
> >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-18 21:18                               ` donghai qiao
@ 2021-10-18 23:46                                 ` Paul E. McKenney
  2021-10-20 17:48                                   ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-10-18 23:46 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu, netdev

On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> I just want to follow up this discussion. First off, the latest issue
> I mentioned in the email of Oct 4th which
> exhibited a symptom of networking appeared to be a problem in
> qrwlock.c. Particularly the problem is
> caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> 
> void queued_read_lock_slowpath(struct qrwlock *lock)
> {
>         /*
>          * Readers come here when they cannot get the lock without waiting
>          */
>         if (unlikely(in_interrupt())) {
>                 /*
>                  * Readers in interrupt context will get the lock immediately
>                  * if the writer is just waiting (not holding the lock yet),
>                  * so spin with ACQUIRE semantics until the lock is available
>                  * without waiting in the queue.
>                  */
>                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
>                 return;
>         }
>         ...
> }
> 
> That 'if' statement said, if we are in an interrupt context and we are
> a reader, then
> we will be allowed to enter the lock as a reader no matter if there
> are writers waiting
> for it or not. So, in the circumstance when the network packets steadily come in
> and the intervals are relatively small enough, then the writers will
> have no chance to
> acquire the lock. This should be the root cause for that case.
> 
> I have verified it by removing the 'if' and rerun the test multiple
> times.

That could do it!

Would it make sense to keep the current check, but to also check if a
writer had been waiting for more than (say) 100ms?  The reason that I
ask is that I believe that this "if" statement is there for a reason.
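
Something like the following, as a rough sketch only (writer_waited_too_long()
is a hypothetical helper, not an existing qrwlock facility, and it would need
per-lock bookkeeping of when the oldest writer started waiting):

void queued_read_lock_slowpath(struct qrwlock *lock)
{
        if (unlikely(in_interrupt())) {
                /*
                 * Keep the interrupt-context bypass, but only while no
                 * writer has been starved past the threshold (say 100ms).
                 * The helper below is hypothetical.
                 */
                if (!writer_waited_too_long(lock)) {
                        atomic_cond_read_acquire(&lock->cnts,
                                                 !(VAL & _QW_LOCKED));
                        return;
                }
                /* Otherwise fall through and queue like any other reader. */
        }
        ...
}

Whether it is ever safe for an interrupt-context reader to join the queue is
exactly the question that the existing comment hints at, so any change along
these lines would need careful review.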

>        The same
> symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> broader range
> of problems,  this is absolutely not the only root cause I have seen.
> Actually too many
> things can delay context switching.  Do you have a long term plan to
> fix this issue,
> or just want to treat it case by case?

If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
leverage the calls to cond_resched().  If the grace period is old enough,
cond_resched() will supply a quiescent state.
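
The usual shape, as a sketch (more_work_to_do() and process_one_item() are
placeholders), is a long-running loop that leaves the RCU read-side critical
section before each cond_resched() call:

        while (more_work_to_do()) {
                rcu_read_lock();
                process_one_item();
                rcu_read_unlock();

                /* On CONFIG_PREEMPT=n, reports a quiescent state when the
                 * current grace period has been waiting long enough. */
                cond_resched();
        }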

In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
RCU forces a schedule on the holdout CPU.  As long as the CPU is not
eternally non-preemptible (for example, eternally in an interrupt
handler), the grace period will end.

But beyond a certain point, case-by-case analysis and handling is
required.

> Secondly, back to the following code I brought up that day. Actually
> it is not as simple
> as spinlock.
> 
>       rcu_read_lock();
>       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
>       rcu_read_unlock();
> 
> Are you all aware of all the potential functions that
> ip_protocol_deliver_rcu will call?
> As I can see, there is a code path from ip_protocol_deliver_rcu to
> kmem_cache_alloc
> which will end up a call to cond_resched().

Can't say that I am familiar with everything that ip_protocol_deliver_rcu() calls.
There are some tens of millions of lines of code in the kernel, and I have
but one brain.  ;-)

And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
kernel.  Except that there should not be a call to cond_resched() within
an RCU read-side critical section.  Does the code momentarily exit that
critical section via something like rcu_read_unlock(); cond_resched();
rcu_read_lock()?  Or does something prevent the code from getting there
while in an RCU read-side critical section?  (The usual trick here is
to have different GFP_ flags depending on the context.)
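
For illustration, the momentary-exit pattern looks something like the
sketch below.  The names example_deliver(), struct foo, and global_foo
are made up for the example; they are not the networking code:

struct foo {
        int val;
};
static struct foo __rcu *global_foo;    /* hypothetical RCU-protected pointer */

static void example_deliver(void)
{
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(global_foo);
        if (p)
                pr_info("fast path: %d\n", p->val);     /* must not sleep */
        rcu_read_unlock();

        cond_resched();         /* legal here: no RCU reader is held */

        rcu_read_lock();
        p = rcu_dereference(global_foo);        /* re-obtain: old p may be stale */
        if (p)
                pr_info("remaining work: %d\n", p->val);
        rcu_read_unlock();
}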

>                                              Because the operations in memory
> allocation are too complicated, we cannot alway expect a prompt return
> with success.
> When the system is running out of memory, then rcu cannot close the
> current gp, then
> great number of callbacks will be delayed and the freeing of the
> memory they held
> will be delayed as well. This sounds like a deadlock in the resource flow.

That would of course be bad.  Though I am not familiar with all of the
details of how the networking guys handle out-of-memory conditions.

The usual advice would be to fail the request, but that does not appear
to be an easy option for ip_protocol_deliver_rcu().  At this point, I
must defer to the networking folks.

							Thanx, Paul

> Thanks
> Donghai
> 
> 
> On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> >
> > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > Hello Paul,
> > > > > > Sorry it has been long..
> > > > >
> > > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > > are not expecting instantaneous response.  ;-)
> > > > >
> > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > > problem I described
> > > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > > to reproduce than others.
> > > > > > >
> > > > > > > Understood, that does make things more difficult.
> > > > > > >
> > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > it was not in eqs either.
> > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > CPU had been
> > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > And this is true: the first
> > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > when another
> > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > If the kernel missed
> > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > identical ?  But I'll
> > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > >
> > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > >
> > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > >
> > > > > OK.
> > > > >
> > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > was backported, but should not have been?
> > > > > > > >
> > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > >
> > > > > > > > > Is it possible to bisect this?
> > > > > > > > >
> > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > >
> > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > machine is set up, see how much it can help.
> > > > > > >
> > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > due to backporting issues.
> > > > > >
> > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > turns out no new warning messages related to this came out. So,
> > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > >
> > > > > OK, good.
> > > > >
> > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > on behalf of that CPU.
> > > > > > > >
> > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > >
> > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > to actually fix anything.  ;-)
> > > > > > >
> > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > Or move to newer kernels.
> > > > > > >
> > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > that you will get is your diagnostic.
> > > > > >
> > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > >
> > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > > kinds of writers were blocked on the same rwlock.
> > > > >
> > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > very large number of these over the years, though, so I wonder what is
> > > > > special about this one so that it is just now showing up.  I have added
> > > > > a networking list on CC for their thoughts.
> > > >
> > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > might be a performance issue.  Or we can discuss this further in this
> > > > email thread if they are fine, or we can discuss it over with a separate
> > > > email thread with netdev@ only.
> > > >
> > > > So back to my original problem, this might be one of the possibilities that
> > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > occurred and lasted long enough. When it finally came to the end, the
> > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > stall on the CPU and panic.
> > > >
> > > > So definitely we need to understand these networking activities here as
> > > > to why the readers could hold the rwlock too long.
> > >
> > > I strongly suggest that you also continue to do your own analysis on this.
> > > So please see below.
> >
> > This is just a brief of my analysis and the stack info below is not enough
> > for other people to figure out anything useful. I meant if they are really
> > interested, I can upload the core file. I think this is fair.
> >
> > >
> > > > > > When examining the readers of the lock, except the following code,
> > > > > > don't see any other obvious problems: e.g
> > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > >
> > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > a rcu reader side section.
> > > > > >
> > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > struct sk_buff *skb)
> > > > > > {
> > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > >
> > > > > >         rcu_read_lock();
> > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > >         rcu_read_unlock();
> > > > > >
> > > > > >         return 0;
> > > > > > }
> > > > > >
> > > > > > Actually there are multiple chances that this code path can hit
> > > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > > >
> > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > per-object lock while still maintaining good performance and
> > > > > scalability.
> > > >
> > > > Sure, understand. But the RCU related docs said that anything causing
> > > > the reader side to block must be avoided.
> > >
> > > True.  But this is the Linux kernel, where "block" means something
> > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > acquire spinlocks within RCU read-side critical sections.
> > >
> > > And before you complain that practitioners are not following the academic
> > > definitions, please keep in mind that our definitions were here first.  ;-)
> > >
> > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > >
> > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > special, just use the default.
> > >
> > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > bisecting, as this often quickly and easily finds the problem.
> >
> > Actually It does. But let's focus on the latest upstream and the latest rhel8.
> > This way, we will not worry about missing the needed rcu patches.
> > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > networking related, which I am still working on.  So, there might be
> > multiple root causes.
> >
> > > Bisection can also help you find the patch to be backported if a later
> > > release fixes the bug, though things like gitk can also be helpful.
> >
> > Unfortunately, this is reproducible on the latest bit.
> >
> > Thanks
> > Donghai
> > >
> > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-18 23:46                                 ` Paul E. McKenney
@ 2021-10-20 17:48                                   ` donghai qiao
  2021-10-20 18:37                                     ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-10-20 17:48 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu, netdev

On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > I just want to follow up this discussion. First off, the latest issue
> > I mentioned in the email of Oct 4th which
> > exhibited a symptom of networking appeared to be a problem in
> > qrwlock.c. Particularly the problem is
> > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> >
> > void queued_read_lock_slowpath(struct qrwlock *lock)
> > {
> >         /*
> >          * Readers come here when they cannot get the lock without waiting
> >          */
> >         if (unlikely(in_interrupt())) {
> >                 /*
> >                  * Readers in interrupt context will get the lock immediately
> >                  * if the writer is just waiting (not holding the lock yet),
> >                  * so spin with ACQUIRE semantics until the lock is available
> >                  * without waiting in the queue.
> >                  */
> >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> >                 return;
> >         }
> >         ...
> > }
> >
> > That 'if' statement said, if we are in an interrupt context and we are
> > a reader, then
> > we will be allowed to enter the lock as a reader no matter if there
> > are writers waiting
> > for it or not. So, in the circumstance when the network packets steadily come in
> > and the intervals are relatively small enough, then the writers will
> > have no chance to
> > acquire the lock. This should be the root cause for that case.
> >
> > I have verified it by removing the 'if' and rerun the test multiple
> > times.
>
> That could do it!
>
> Would it make sense to keep the current check, but to also check if a
> writer had been waiting for more than (say) 100ms?  The reason that I
> ask is that I believe that this "if" statement is there for a reason.

The day before, I also brought this to Waiman Long, who originally made
these changes in the qrwlock.c file.  It turns out the 'if' block was
introduced to satisfy the particular requirement of tasklist_lock
re-entering as a reader.  He said he may come up with another code
change to take care of this new write-lock starvation issue.  The idea
is to allow only the tasklist_lock callers to acquire the read lock
through the 'if' statement, while other callers would not be.

This sounds like a temporary solution unless we can think of
alternative ways to fix the tasklist_lock issue.  The principle here is
that we should not make the locking primitives more special just to
favor one particular usage or scenario.

>
> >        The same
> > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > broader range
> > of problems,  this is absolutely not the only root cause I have seen.
> > Actually too many
> > things can delay context switching.  Do you have a long term plan to
> > fix this issue,
> > or just want to treat it case by case?
>
> If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> leverage the calls to cond_resched().  If the grace period is old enough,
> cond_resched() will supply a quiescent state.

So far, all of the rcu stall types I am aware of originate on
CONFIG_PREEMPT=n kernels.  Is it possible to make rcu not rely on
context switches at all?  As we know, too many things can delay a
context switch, so it is not a very reliable mechanism when timing and
performance are crucial.

>
> In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> eternally non-preemptible (for example, eternally in an interrupt
> handler), the grace period will end.

Among the rcu stall instances I have seen so far, quite a lot of them
occurred on CPUs which were running in interrupt context or spinning on
spinlocks with interrupts disabled.  In these scenarios, forced
schedules will be delayed until those activities end.

>
> But beyond a certain point, case-by-case analysis and handling is
> required.
>
> > Secondly, back to the following code I brought up that day. Actually
> > it is not as simple
> > as spinlock.
> >
> >       rcu_read_lock();
> >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> >       rcu_read_unlock();
> >
> > Are you all aware of all the potential functions that
> > ip_protocol_deliver_rcu will call?
> > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > kmem_cache_alloc
> > which will end up a call to cond_resched().
>
> Can't say that I am familiar with everything that ip_protocol_deliver_rcu() calls.
> There are some tens of millions of lines of code in the kernel, and I have
> but one brain.  ;-)
>
> And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> kernel.  Except that there should not be a call to cond_resched() within
> an RCU read-side critical section.

With that 3-line snippet from the networking code, a call to
cond_resched() would happen within the read-side critical section when
the amount of available memory is very low.

> Does the code momentarily exit that
> critical section via something like rcu_read_unlock(); cond_resched();
> rcu_read_lock()?

As far as I can see, cond_resched would be called between a pair of
rcu_read_lock and rcu_read_unlock.


> Or does something prevent the code from getting there
> while in an RCU read-side critical section?  (The usual trick here is
> to have different GFP_ flags depending on the context.)

Once we invoke kmem_cache_alloc() or one of its variants, we cannot
really predict where the allocation will go or how long the whole
process will take in the very large area spanning the slab and virtual
memory subsystems.  There is a flag, __GFP_NOFAIL, that determines
whether or not cond_resched() should be called before a retry, but that
flag is meant to be used at the page-allocator level, not at the kmem
consumer level.  So I think there is little we can do to avoid the
resched.

>
> >                                              Because the operations in memory
> > allocation are too complicated, we cannot alway expect a prompt return
> > with success.
> > When the system is running out of memory, then rcu cannot close the
> > current gp, then
> > great number of callbacks will be delayed and the freeing of the
> > memory they held
> > will be delayed as well. This sounds like a deadlock in the resource flow.
>
> That would of course be bad.  Though I am not familiar with all of the
> details of how the networking guys handle out-of-memory conditions.
>
> The usual advice would be to fail the request, but that does not appear
> to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> must defer to the networking folks.

Thanks for the advice.

Donghai
>
>                                                         Thanx, Paul
>
> > Thanks
> > Donghai
> >
> >
> > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > >
> > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > Hello Paul,
> > > > > > > Sorry it has been long..
> > > > > >
> > > > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > > > are not expecting instantaneous response.  ;-)
> > > > > >
> > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > > > problem I described
> > > > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > > > to reproduce than others.
> > > > > > > >
> > > > > > > > Understood, that does make things more difficult.
> > > > > > > >
> > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > it was not in eqs either.
> > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > CPU had been
> > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > And this is true: the first
> > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > when another
> > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > If the kernel missed
> > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > identical ?  But I'll
> > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > >
> > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > >
> > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > was backported, but should not have been?
> > > > > > > > >
> > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > >
> > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > >
> > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > >
> > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > machine is set up, see how much it can help.
> > > > > > > >
> > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > due to backporting issues.
> > > > > > >
> > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > >
> > > > > > OK, good.
> > > > > >
> > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > on behalf of that CPU.
> > > > > > > > >
> > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > >
> > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > to actually fix anything.  ;-)
> > > > > > > >
> > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > Or move to newer kernels.
> > > > > > > >
> > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > that you will get is your diagnostic.
> > > > > > >
> > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > >
> > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > > > kinds of writers were blocked on the same rwlock.
> > > > > >
> > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > a networking list on CC for their thoughts.
> > > > >
> > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > email thread with netdev@ only.
> > > > >
> > > > > So back to my original problem, this might be one of the possibilities that
> > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > stall on the CPU and panic.
> > > > >
> > > > > So definitely we need to understand these networking activities here as
> > > > > to why the readers could hold the rwlock too long.
> > > >
> > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > So please see below.
> > >
> > > This is just a brief of my analysis and the stack info below is not enough
> > > for other people to figure out anything useful. I meant if they are really
> > > interested, I can upload the core file. I think this is fair.
> > >
> > > >
> > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > don't see any other obvious problems: e.g
> > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > >
> > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > a rcu reader side section.
> > > > > > >
> > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > struct sk_buff *skb)
> > > > > > > {
> > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > >
> > > > > > >         rcu_read_lock();
> > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > >         rcu_read_unlock();
> > > > > > >
> > > > > > >         return 0;
> > > > > > > }
> > > > > > >
> > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > > > >
> > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > per-object lock while still maintaining good performance and
> > > > > > scalability.
> > > > >
> > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > the reader side to block must be avoided.
> > > >
> > > > True.  But this is the Linux kernel, where "block" means something
> > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > acquire spinlocks within RCU read-side critical sections.
> > > >
> > > > And before you complain that practitioners are not following the academic
> > > > definitions, please keep in mind that our definitions were here first.  ;-)
> > > >
> > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > >
> > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > special, just use the default.
> > > >
> > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > bisecting, as this often quickly and easily finds the problem.
> > >
> > > Actually It does. But let's focus on the latest upstream and the latest rhel8.
> > > This way, we will not worry about missing the needed rcu patches.
> > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > networking related, which I am still working on.  So, there might be
> > > multiple root causes.
> > >
> > > > Bisection can also help you find the patch to be backported if a later
> > > > release fixes the bug, though things like gitk can also be helpful.
> > >
> > > Unfortunately, this is reproducible on the latest bit.
> > >
> > > Thanks
> > > Donghai
> > > >
> > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-20 17:48                                   ` donghai qiao
@ 2021-10-20 18:37                                     ` Paul E. McKenney
  2021-10-20 20:05                                       ` donghai qiao
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2021-10-20 18:37 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu, netdev

On Wed, Oct 20, 2021 at 01:48:15PM -0400, donghai qiao wrote:
> On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > > I just want to follow up this discussion. First off, the latest issue
> > > I mentioned in the email of Oct 4th which
> > > exhibited a symptom of networking appeared to be a problem in
> > > qrwlock.c. Particularly the problem is
> > > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> > >
> > > void queued_read_lock_slowpath(struct qrwlock *lock)
> > > {
> > >         /*
> > >          * Readers come here when they cannot get the lock without waiting
> > >          */
> > >         if (unlikely(in_interrupt())) {
> > >                 /*
> > >                  * Readers in interrupt context will get the lock immediately
> > >                  * if the writer is just waiting (not holding the lock yet),
> > >                  * so spin with ACQUIRE semantics until the lock is available
> > >                  * without waiting in the queue.
> > >                  */
> > >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> > >                 return;
> > >         }
> > >         ...
> > > }
> > >
> > > That 'if' statement said, if we are in an interrupt context and we are
> > > a reader, then
> > > we will be allowed to enter the lock as a reader no matter if there
> > > are writers waiting
> > > for it or not. So, in the circumstance when the network packets steadily come in
> > > and the intervals are relatively small enough, then the writers will
> > > have no chance to
> > > acquire the lock. This should be the root cause for that case.
> > >
> > > I have verified it by removing the 'if' and rerun the test multiple
> > > times.
> >
> > That could do it!
> >
> > Would it make sense to keep the current check, but to also check if a
> > writer had been waiting for more than (say) 100ms?  The reason that I
> > ask is that I believe that this "if" statement is there for a reason.
> 
> The day before, I also brought this to Waiman Long, who originally made
> these changes in the qrwlock.c file.  It turns out the 'if' block was
> introduced to satisfy the particular requirement of tasklist_lock
> re-entering as a reader.  He said he may come up with another code
> change to take care of this new write-lock starvation issue.  The idea
> is to allow only the tasklist_lock callers to acquire the read lock
> through the 'if' statement, while other callers would not be.
> 
> This sounds like a temporary solution unless we can think of
> alternative ways to fix the tasklist_lock issue.  The principle here is
> that we should not make the locking primitives more special just to
> favor one particular usage or scenario.

When principles meet practice, results can vary.  Still, it would be
better to have a less troublesome optimization.

> > >        The same
> > > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > > broader range
> > > of problems,  this is absolutely not the only root cause I have seen.
> > > Actually too many
> > > things can delay context switching.  Do you have a long term plan to
> > > fix this issue,
> > > or just want to treat it case by case?
> >
> > If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> > leverage the calls to cond_resched().  If the grace period is old enough,
> > cond_resched() will supply a quiescent state.
> 
> So far, all of the rcu stall types I am aware of originate on
> CONFIG_PREEMPT=n kernels.  Is it possible to make rcu not rely on
> context switches at all?  As we know, too many things can delay a
> context switch, so it is not a very reliable mechanism when timing and
> performance are crucial.

Yes, you could build with CONFIG_PREEMPT=y and RCU would not always
need to wait for an actual context switch.  But there can be
performance issues for some workloads.

But please note that cond_resched() is not necessarily a context switch.

Besides, for a great many workloads, delaying a context switch for
very long is a first-class bug anyway.  For example, many internet data
centers are said to have sub-second response-time requirements, and such
requirements cannot be met if context switches are delayed too long.

> > In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> > RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> > eternally non-preemptible (for example, eternally in an interrupt
> > handler), the grace period will end.
> 
> Among the rcu stall instances I have seen so far, quite a lot of them
> occurred on CPUs which were running in interrupt context or spinning on
> spinlocks with interrupts disabled.  In these scenarios, forced
> schedules will be delayed until those activities end.

But running for several seconds in interrupt context is not at all good.
As is spinning on a spinlock for several seconds.  These are performance
bugs in and of themselves.

More on this later in this email...

> > But beyond a certain point, case-by-case analysis and handling is
> > required.
> >
> > > Secondly, back to the following code I brought up that day. Actually
> > > it is not as simple
> > > as spinlock.
> > >
> > >       rcu_read_lock();
> > >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > >       rcu_read_unlock();
> > >
> > > Are you all aware of all the potential functions that
> > > ip_protocol_deliver_rcu will call?
> > > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > > kmem_cache_alloc
> > > which will end up a call to cond_resched().
> >
> > Can't say that I am familiar with everything that ip_protocol_deliver_rcu() calls.
> > There are some tens of millions of lines of code in the kernel, and I have
> > but one brain.  ;-)
> >
> > And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> > kernel.  Except that there should not be a call to cond_resched() within
> > an RCU read-side critical section.
> 
> With that 3-line snippet from the networking code, a call to
> cond_resched() would happen within the read-side critical section when
> the amount of available memory is very low.

That is a bug.  If you build your kernel with CONFIG_PROVE_LOCKING=y,
it will complain about a cond_resched() in an RCU read-side critical
section.  But, as you say, perhaps only when the amount of available
memory is very low.

Please do not invoke cond_resched() within an RCU read-side critical
section.  Doing so can result in random memory corruption.

> > Does the code momentarily exit that
> > critical section via something like rcu_read_unlock(); cond_resched();
> > rcu_read_lock()?
> 
> As far as I can see, cond_resched would be called between a pair of
> rcu_read_lock and rcu_read_unlock.

Again, this is a bug.  The usual fix is the GFP_ thing I noted below.

> > Or does something prevent the code from getting there
> > while in an RCU read-side critical section?  (The usual trick here is
> > to have different GFP_ flags depending on the context.)
> 
> Once we invoke kmem_cache_alloc() or one of its variants, we cannot
> really predict where the allocation will go or how long the whole
> process will take in the very large area spanning the slab and virtual
> memory subsystems.  There is a flag, __GFP_NOFAIL, that determines
> whether or not cond_resched() should be called before a retry, but that
> flag is meant to be used at the page-allocator level, not at the kmem
> consumer level.  So I think there is little we can do to avoid the
> resched.

If you are invoking the allocator within an RCU read-side critical
section, you should be using GFP_ATOMIC.  Except that doing this has
many negative consequences, so it is better to allocate outside of
the RCU read-side critical section.

The same rules apply when allocating while holding a spinlock, so
this is not just RCU placing restrictions on you.  ;-)
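
For illustration only (buf_cache and both wrapper functions are made-up
names, not anything in the networking code):

/* If you really must allocate under rcu_read_lock(), use GFP_ATOMIC
 * and be prepared for failure. */
static void *alloc_under_reader(struct kmem_cache *buf_cache)
{
        void *obj;

        rcu_read_lock();
        obj = kmem_cache_alloc(buf_cache, GFP_ATOMIC);  /* never sleeps */
        if (obj) {
                /* ... use obj while still under RCU protection ... */
        }
        rcu_read_unlock();
        return obj;     /* NULL means fail or drop the request */
}

/* Better: do the allocation before entering the reader. */
static void *alloc_before_reader(struct kmem_cache *buf_cache)
{
        void *obj = kmem_cache_alloc(buf_cache, GFP_KERNEL);    /* may sleep */

        if (!obj)
                return NULL;
        rcu_read_lock();
        /* ... the critical section itself never needs to allocate ... */
        rcu_read_unlock();
        return obj;
}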

> > >                                              Because the operations in memory
> > > allocation are too complicated, we cannot alway expect a prompt return
> > > with success.
> > > When the system is running out of memory, then rcu cannot close the
> > > current gp, then
> > > great number of callbacks will be delayed and the freeing of the
> > > memory they held
> > > will be delayed as well. This sounds like a deadlock in the resource flow.
> >
> > That would of course be bad.  Though I am not familiar with all of the
> > details of how the networking guys handle out-of-memory conditions.
> >
> > The usual advice would be to fail the request, but that does not appear
> > to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> > must defer to the networking folks.
> 
> Thanks for the advice.

Another question...  Why the endless interrupts?  Or is it just one
very long interrupt?  Last I knew (admittedly a very long time ago),
the high-rate networking drivers used things like NAPI in order to avoid
this very problem.

Or is this some sort of special case where you are trying to do something
special, for example, to achieve extremely low communications latencies?

If this is a deliberate design, and if it is endless interrupts instead
of one big long one, and if you are deliberately interrupt-storming
a particular CPU, another approach is to build the kernel with
CONFIG_NO_HZ_FULL=y, and boot with nohz_full=n, where "n" is the number of
the CPU that is to be interrupt-stormed.  If you are interrupt storming
multiple CPUs, you can specify them, for example, nohz_full=1-5,13 to
specify CPUs 1, 2, 3, 4, 5, and 13.  In recent kernels, "N" stands for
the CPU with the largest CPU number.

Then read Documentation/admin-guide/kernel-per-CPU-kthreads.rst, which is
probably a bit outdated, but a good place to start.  Follow its guidelines
(and, as needed, come up with additional ones) to ensure that CPU "n"
is not doing anything.  If you do come up with additional guidelines,
please submit a patch to kernel-per-CPU-kthreads.rst so that others can
also benefit, as you are benefiting from those before you.

Create a CPU-bound usermode application (a "while (1) continue;" loop or
similar), and run that application on CPU "n".  Then start up whatever
it is that interrupt-storms CPU "n".

Every time CPU "n" returns from interrupt, RCU will see a quiescent state,
which will prevent the interrupt storm from delaying RCU grace periods.
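
For example, something like this trivial spinner, pinned to whichever
CPU you passed to nohz_full=:

/* spin.c: build with "gcc -O2 -o spin spin.c", run as "./spin <cpu>". */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        cpu_set_t set;
        int cpu = argc > 1 ? atoi(argv[1]) : 1;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }
        for (;;)
                continue;       /* all time on this CPU is now user time */
}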

On the other hand, if this is one big long interrupt, you need to make
that interrupt end every so often.  Or move some of the work out of
interrupt context, perhaps even to usermode.

Much depends on exactly what you are trying to achieve.

							Thanx, Paul

> Donghai
> >
> >                                                         Thanx, Paul
> >
> > > Thanks
> > > Donghai
> > >
> > >
> > > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > >
> > > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > > Hello Paul,
> > > > > > > > Sorry it has been long..
> > > > > > >
> > > > > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > > > > are not expecting instantaneous response.  ;-)
> > > > > > >
> > > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > > > > problem I described
> > > > > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > > > > to reproduce than others.
> > > > > > > > >
> > > > > > > > > Understood, that does make things more difficult.
> > > > > > > > >
> > > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > > it was not in eqs either.
> > > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > > CPU had been
> > > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > > And this is true: the first
> > > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > > when another
> > > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > > If the kernel missed
> > > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > > identical ?  But I'll
> > > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > > >
> > > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > > >
> > > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > > >
> > > > > > > OK.
> > > > > > >
> > > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > > was backported, but should not have been?
> > > > > > > > > >
> > > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > > >
> > > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > > >
> > > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > > >
> > > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > > machine is set up, see how much it can help.
> > > > > > > > >
> > > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > > due to backporting issues.
> > > > > > > >
> > > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > > >
> > > > > > > OK, good.
> > > > > > >
> > > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > > on behalf of that CPU.
> > > > > > > > > >
> > > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > > >
> > > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > > to actually fix anything.  ;-)
> > > > > > > > >
> > > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > > Or move to newer kernels.
> > > > > > > > >
> > > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > > that you will get is your diagnostic.
> > > > > > > >
> > > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > > >
> > > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > > > > kinds of writers were blocked on the same rwlock.
> > > > > > >
> > > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > > a networking list on CC for their thoughts.
> > > > > >
> > > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > > email thread with netdev@ only.
> > > > > >
> > > > > > So back to my original problem, this might be one of the possibilities that
> > > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > > stall on the CPU and panic.
> > > > > >
> > > > > > So definitely we need to understand these networking activities here as
> > > > > > to why the readers could hold the rwlock too long.
> > > > >
> > > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > > So please see below.
> > > >
> > > > This is just a brief of my analysis and the stack info below is not enough
> > > > for other people to figure out anything useful. I meant if they are really
> > > > interested, I can upload the core file. I think this is fair.
> > > >
> > > > >
> > > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > > don't see any other obvious problems: e.g
> > > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > > >
> > > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > > a rcu reader side section.
> > > > > > > >
> > > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > > struct sk_buff *skb)
> > > > > > > > {
> > > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > > >
> > > > > > > >         rcu_read_lock();
> > > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > > >         rcu_read_unlock();
> > > > > > > >
> > > > > > > >         return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > > > > >
> > > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > > per-object lock while still maintaining good performance and
> > > > > > > scalability.
> > > > > >
> > > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > > the reader side to block must be avoided.
> > > > >
> > > > > True.  But this is the Linux kernel, where "block" means something
> > > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > > acquire spinlocks within RCU read-side critical sections.
> > > > >
> > > > > And before you complain that practitioners are not following the academic
> > > > > definitions, please keep in mind that our definitions were here first.  ;-)
> > > > >
> > > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > > >
> > > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > > special, just use the default.
> > > > >
> > > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > > bisecting, as this often quickly and easily finds the problem.
> > > >
> > > > Actually It does. But let's focus on the latest upstream and the latest rhel8.
> > > > This way, we will not worry about missing the needed rcu patches.
> > > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > > networking related, which I am still working on.  So, there might be
> > > > multiple root causes.
> > > >
> > > > > Bisection can also help you find the patch to be backported if a later
> > > > > release fixes the bug, though things like gitk can also be helpful.
> > > >
> > > > Unfortunately, this is reproducible on the latest bit.
> > > >
> > > > Thanks
> > > > Donghai
> > > > >
> > > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-20 18:37                                     ` Paul E. McKenney
@ 2021-10-20 20:05                                       ` donghai qiao
  2021-10-20 21:33                                         ` Paul E. McKenney
  0 siblings, 1 reply; 26+ messages in thread
From: donghai qiao @ 2021-10-20 20:05 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu, netdev

On Wed, Oct 20, 2021 at 2:37 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Oct 20, 2021 at 01:48:15PM -0400, donghai qiao wrote:
> > On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > > > I just want to follow up this discussion. First off, the latest issue
> > > > I mentioned in the email of Oct 4th which
> > > > exhibited a symptom of networking appeared to be a problem in
> > > > qrwlock.c. Particularly the problem is
> > > > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> > > >
> > > > void queued_read_lock_slowpath(struct qrwlock *lock)
> > > > {
> > > >         /*
> > > >          * Readers come here when they cannot get the lock without waiting
> > > >          */
> > > >         if (unlikely(in_interrupt())) {
> > > >                 /*
> > > >                  * Readers in interrupt context will get the lock immediately
> > > >                  * if the writer is just waiting (not holding the lock yet),
> > > >                  * so spin with ACQUIRE semantics until the lock is available
> > > >                  * without waiting in the queue.
> > > >                  */
> > > >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> > > >                 return;
> > > >         }
> > > >         ...
> > > > }
> > > >
> > > > That 'if' statement said, if we are in an interrupt context and we are
> > > > a reader, then
> > > > we will be allowed to enter the lock as a reader no matter if there
> > > > are writers waiting
> > > > for it or not. So, in the circumstance when the network packets steadily come in
> > > > and the intervals are relatively small enough, then the writers will
> > > > have no chance to
> > > > acquire the lock. This should be the root cause for that case.
> > > >
> > > > I have verified it by removing the 'if' and rerun the test multiple
> > > > times.
> > >
> > > That could do it!
> > >
> > > Would it make sense to keep the current check, but to also check if a
> > > writer had been waiting for more than (say) 100ms?  The reason that I
> > > ask is that I believe that this "if" statement is there for a reason.
> >
> > The day before I also got this to Waiman Long who initially made these
> > changes in
> > the qrwlock.c file. Turns out, the 'if' block was introduced to
> > resolve the particular
> > requirement of tasklist_lock reentering as reader.  He said he will
> > perhaps come up
> > with another code change to take care of this new write lock
> > starvation issue. The
> > idea is to only allow the tasklist_lock clients to acquire the read
> > lock through the 'if'
> > statement,  others are not.
> >
> > This sounds like a temporary solution if we cannot think of other
> > alternative ways
> > to fix the tasklist_lock issue. The principle here is that we should
> > not make the
> > locking primitives more special just in favor of a particular usage or scenario.
>
> When principles meet practice, results can vary.  Still, it would be
> better to have a less troublesome optimization.
>

This is a philosophical debate. Let's put it aside.

> > > >        The same
> > > > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > > > broader range
> > > > of problems,  this is absolutely not the only root cause I have seen.
> > > > Actually too many
> > > > things can delay context switching.  Do you have a long term plan to
> > > > fix this issue,
> > > > or just want to treat it case by case?
> > >
> > > If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> > > leverage the calls to cond_resched().  If the grace period is old enough,
> > > cond_resched() will supply a quiescent state.
> >
> > So far, all types of rcu stall I am aware of are originated to the
> > CONFIG_PREEMPT=n
> > kernel. Isn't it impossible to let rcu not rely on context switch ?
> > As we know too many
> > things can delay context switch, so it is not a quite reliable
> > mechanism if timing and
> > performance are crucial.
>
> Yes, you could build with CONFIG_PREEMPT=y and RCU would not always
> need to wait for an actual context switch.  But there can be
> performance issues for some workloads.
>
I can give this config (CONFIG_PREEMPT=y) a try when I have time.

> But please note that cond_resched() is not necessarily a context switch.
>
> Besides, for a great many workloads, delaying a context switch for
> very long is a first-class bug anyway.  For example, many internet data
> centers are said to have sub-second response-time requirements, and such
> requirements cannot be met if context switches are delayed too long.
>
Agreed.
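
Just to confirm I am reading you correctly, the pattern you mean is
roughly the following sketch (struct item and handle_one_item() are
made-up placeholders, not anything from the stacks discussed here):

#include <linux/sched.h>

/* Placeholders for illustration only. */
struct item;
void handle_one_item(struct item *ip);

static void process_many_items(struct item **items, int nr_items)
{
        int i;

        for (i = 0; i < nr_items; i++) {
                handle_one_item(items[i]);
                /*
                 * Not necessarily a context switch, but once the current
                 * grace period is old enough this reports a quiescent
                 * state, so a long loop here does not stall RCU on a
                 * CONFIG_PREEMPT=n kernel.
                 */
                cond_resched();
        }
}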

But on the other hand, if RCU relies on that, the situation could be even
worse. Simply put, when a grace period cannot end soon, RCU write-side
operations are delayed, and the callbacks queued on the rcu-stalled CPU are
delayed as well. So when free memory is already scarce, this situation
could turn into a deadlock.

> > > In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> > > RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> > > eternally non-preemptible (for example, eternally in an interrupt
> > > handler), the grace period will end.
> >
> > Among the rcu stall instances I have seen so far, quite a lot of them occurred
> > on the CPUs which were running in the interrupt context or spinning on spinlocks
> > with interrupt disabled. In these scenarios, forced schedules will be
> > delayed until
> > these activities end.
>
> But running for several seconds in interrupt context is not at all good.
> As is spinning on a spinlock for several seconds.  These are performance
> bugs in and of themselves.

Agreed.

>
> More on this later in this email...
>
> > > But beyond a certain point, case-by-case analysis and handling is
> > > required.
> > >
> > > > Secondly, back to the following code I brought up that day. Actually
> > > > it is not as simple
> > > > as spinlock.
> > > >
> > > >       rcu_read_lock();
> > > >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > >       rcu_read_unlock();
> > > >
> > > > Are you all aware of all the potential functions that
> > > > ip_protocol_deliver_rcu will call?
> > > > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > > > kmem_cache_alloc
> > > > which will end up a call to cond_resched().
> > >
> > > Can't say that I am familiar with everything that ip_protocol_deliver_rcu().
> > > There are some tens of millions of lines of code in the kernel, and I have
> > > but one brain.  ;-)
> > >
> > > And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> > > kernel.  Except that there should not be a call to cond_resched() within
> > > an RCU read-side critical section.
> >
> > with that 3 line snippet from the networking, a call to cond_resched() would
> > happen within the read-side critical section when the level of variable memory
> > is very low.
>
> That is a bug.  If you build your kernel with CONFIG_PROVE_LOCKING=y,
> it will complain about a cond_resched() in an RCU read-side critical
> section.  But, as you say, perhaps only with the level of variable memory
> is very low.

There is a typo in my previous email. I meant available (or free).
Sorry for that.

>
> Please do not invoke cond_resched() within an RCU read-side critical
> section.  Doing so can result in random memory corruption.
>
> > > Does the code momentarily exit that
> > > critical section via something like rcu_read_unlock(); cond_resched();
> > > rcu_read_lock()?
> >
> > As far as I can see, cond_resched would be called between a pair of
> > rcu_read_lock and rcu_read_unlock.
>
> Again, this is a bug.  The usual fix is the GFP_ thing I noted below.
>
> > > Or does something prevent the code from getting there
> > > while in an RCU read-side critical section?  (The usual trick here is
> > > to have different GFP_ flags depending on the context.)
> >
> > Once we invoke kmem_cache_alloc or its variants, we cannot really
> > predict where we will go and how long this whole process is going to
> > take in this very large area from kmem to the virtual memory subsystem.
> > There is a flag __GFP_NOFAIL that determines whether or not cond_resched
> > should be called before retry, but this flag should be used from page level,
> > not from the kmem consumer level.  So I think there is little we can do
> > to avoid the resched.
>
> If you are invoking the allocator within an RCU read-side critical
> section, you should be using GFP_ATOMIC.  Except that doing this has
> many negative consequences, so it is better to allocate outside of
> the RCU read-side critical section.
>
> The same rules apply when allocating while holding a spinlock, so
> this is not just RCU placing restrictions on you.  ;-)

yep, absolutely.

>
> > > >                                              Because the operations in memory
> > > > allocation are too complicated, we cannot alway expect a prompt return
> > > > with success.
> > > > When the system is running out of memory, then rcu cannot close the
> > > > current gp, then
> > > > great number of callbacks will be delayed and the freeing of the
> > > > memory they held
> > > > will be delayed as well. This sounds like a deadlock in the resource flow.
> > >
> > > That would of course be bad.  Though I am not familiar with all of the
> > > details of how the networking guys handle out-of-memory conditions.
> > >
> > > The usual advice would be to fail the request, but that does not appear
> > > to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> > > must defer to the networking folks.
> >
> > Thanks for the advice.
>
> Another question...  Why the endless interrupts?  Or is it just one
> very long interrupt?  Last I knew (admittedly a very long time ago),
> the high-rate networking drivers used things like NAPI in order to avoid
> this very problem.

These are interrupts that run long enough. The networking symptom described
in my previous email is one of them: because the rwlock favors readers in
interrupt context, the writer side stays blocked for as long as the readers
keep coming.

>
> Or is this some sort of special case where you are trying to do something
> special, for example, to achieve extremely low communications latencies?

No, nothing special I am trying to do.

>
> If this is a deliberate design, and if it is endless interrupts instead
> of one big long one, and if you are deliberately interrupt-storming
> a particular CPU, another approach is to build the kernel with
> CONFIG_NO_HZ_FULL=y, and boot with nohz_full=n, where "n" is the number of
> the CPU that is to be interrupt-stormed.  If you are interrupt storming
> multiple CPUs, you can specify them, for example, nohz_full=1-5,13 to
> specify CPUs 1, 2, 3, 4, 5, and 13.  In recent kernels, "N" stands for
> the CPU with the largest CPU number.

I did this before, and I saw RCU stalls with this kind of config as well.

>
> Then read Documentation/admin-guide/kernel-per-CPU-kthreads.rst, which is
> probably a bit outdated, but a good place to start.  Follow its guidelines
> (and, as needed, come up with additional ones) to ensure that CPU "n"
> is not doing anything.  If you do come up with additional guidelines,
> please submit a patch to kernel-per-CPU-kthreads.rst so that others can
> also benefit, as you are benefiting from those before you.

Thanks for the suggestion.
>
> Create a CPU-bound usermode application (a "while (1) continue;" loop or
> similar), and run that application on CPU "n".  Then start up whatever
> it is that interrupt-storms CPU "n".
>
> Every time CPU "n" returns from interrupt, RCU will see a quiescent state,
> which will prevent the interrupt storm from delaying RCU grace periods.
>
> On the other hand, if this is one big long interrupt, you need to make
> that interrupt end every so often.  Or move some of the work out of
> interrupt context, perhaps even to usermode.
>
> Much depends on exactly what you are trying to achieve.

Too many things can contribute to RCU stalls, so let's deal with them case
by case until there is a permanent solution.

Thanks
Donghai




>
>                                                         Thanx, Paul
>
> > Donghai
> > >
> > >                                                         Thanx, Paul
> > >
> > > > Thanks
> > > > Donghai
> > > >
> > > >
> > > > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > > > >
> > > > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > > > Hello Paul,
> > > > > > > > > Sorry it has been long..
> > > > > > > >
> > > > > > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > > > > > are not expecting instantaneous response.  ;-)
> > > > > > > >
> > > > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > > > > > problem I described
> > > > > > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > > > > > to reproduce than others.
> > > > > > > > > >
> > > > > > > > > > Understood, that does make things more difficult.
> > > > > > > > > >
> > > > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > > > it was not in eqs either.
> > > > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > > > CPU had been
> > > > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > > > And this is true: the first
> > > > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > > > when another
> > > > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > > > If the kernel missed
> > > > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > > > identical ?  But I'll
> > > > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > > > >
> > > > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > > > >
> > > > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > > > >
> > > > > > > > OK.
> > > > > > > >
> > > > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > > > was backported, but should not have been?
> > > > > > > > > > >
> > > > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > > > >
> > > > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > > > >
> > > > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > > > >
> > > > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > > > machine is set up, see how much it can help.
> > > > > > > > > >
> > > > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > > > due to backporting issues.
> > > > > > > > >
> > > > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > > > >
> > > > > > > > OK, good.
> > > > > > > >
> > > > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > > > on behalf of that CPU.
> > > > > > > > > > >
> > > > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > > > >
> > > > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > > > to actually fix anything.  ;-)
> > > > > > > > > >
> > > > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > > > Or move to newer kernels.
> > > > > > > > > >
> > > > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > > > that you will get is your diagnostic.
> > > > > > > > >
> > > > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > > > >
> > > > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > > > > > kinds of writers were blocked on the same rwlock.
> > > > > > > >
> > > > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > > > a networking list on CC for their thoughts.
> > > > > > >
> > > > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > > > email thread with netdev@ only.
> > > > > > >
> > > > > > > So back to my original problem, this might be one of the possibilities that
> > > > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > > > stall on the CPU and panic.
> > > > > > >
> > > > > > > So definitely we need to understand these networking activities here as
> > > > > > > to why the readers could hold the rwlock too long.
> > > > > >
> > > > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > > > So please see below.
> > > > >
> > > > > This is just a brief of my analysis and the stack info below is not enough
> > > > > for other people to figure out anything useful. I meant if they are really
> > > > > interested, I can upload the core file. I think this is fair.
> > > > >
> > > > > >
> > > > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > > > don't see any other obvious problems: e.g
> > > > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > > > >
> > > > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > > > a rcu reader side section.
> > > > > > > > >
> > > > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > > > struct sk_buff *skb)
> > > > > > > > > {
> > > > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > > > >
> > > > > > > > >         rcu_read_lock();
> > > > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > > > >         rcu_read_unlock();
> > > > > > > > >
> > > > > > > > >         return 0;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > > > > > >
> > > > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > > > per-object lock while still maintaining good performance and
> > > > > > > > scalability.
> > > > > > >
> > > > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > > > the reader side to block must be avoided.
> > > > > >
> > > > > > True.  But this is the Linux kernel, where "block" means something
> > > > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > > > acquire spinlocks within RCU read-side critical sections.
> > > > > >
> > > > > > And before you complain that practitioners are not following the academic
> > > > > > definitions, please keep in mind that our definitions were here first.  ;-)
> > > > > >
> > > > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > > > >
> > > > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > > > special, just use the default.
> > > > > >
> > > > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > > > bisecting, as this often quickly and easily finds the problem.
> > > > >
> > > > > Actually It does. But let's focus on the latest upstream and the latest rhel8.
> > > > > This way, we will not worry about missing the needed rcu patches.
> > > > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > > > networking related, which I am still working on.  So, there might be
> > > > > multiple root causes.
> > > > >
> > > > > > Bisection can also help you find the patch to be backported if a later
> > > > > > release fixes the bug, though things like gitk can also be helpful.
> > > > >
> > > > > Unfortunately, this is reproducible on the latest bit.
> > > > >
> > > > > Thanks
> > > > > Donghai
> > > > > >
> > > > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-20 20:05                                       ` donghai qiao
@ 2021-10-20 21:33                                         ` Paul E. McKenney
  2021-10-21  3:25                                           ` Zhouyi Zhou
  2021-10-21 16:44                                           ` donghai qiao
  0 siblings, 2 replies; 26+ messages in thread
From: Paul E. McKenney @ 2021-10-20 21:33 UTC (permalink / raw)
  To: donghai qiao; +Cc: Boqun Feng, rcu, netdev

On Wed, Oct 20, 2021 at 04:05:59PM -0400, donghai qiao wrote:
> On Wed, Oct 20, 2021 at 2:37 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Oct 20, 2021 at 01:48:15PM -0400, donghai qiao wrote:
> > > On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > > > > I just want to follow up this discussion. First off, the latest issue
> > > > > I mentioned in the email of Oct 4th which
> > > > > exhibited a symptom of networking appeared to be a problem in
> > > > > qrwlock.c. Particularly the problem is
> > > > > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> > > > >
> > > > > void queued_read_lock_slowpath(struct qrwlock *lock)
> > > > > {
> > > > >         /*
> > > > >          * Readers come here when they cannot get the lock without waiting
> > > > >          */
> > > > >         if (unlikely(in_interrupt())) {
> > > > >                 /*
> > > > >                  * Readers in interrupt context will get the lock immediately
> > > > >                  * if the writer is just waiting (not holding the lock yet),
> > > > >                  * so spin with ACQUIRE semantics until the lock is available
> > > > >                  * without waiting in the queue.
> > > > >                  */
> > > > >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> > > > >                 return;
> > > > >         }
> > > > >         ...
> > > > > }
> > > > >
> > > > > That 'if' statement said, if we are in an interrupt context and we are
> > > > > a reader, then
> > > > > we will be allowed to enter the lock as a reader no matter if there
> > > > > are writers waiting
> > > > > for it or not. So, in the circumstance when the network packets steadily come in
> > > > > and the intervals are relatively small enough, then the writers will
> > > > > have no chance to
> > > > > acquire the lock. This should be the root cause for that case.
> > > > >
> > > > > I have verified it by removing the 'if' and rerun the test multiple
> > > > > times.
> > > >
> > > > That could do it!
> > > >
> > > > Would it make sense to keep the current check, but to also check if a
> > > > writer had been waiting for more than (say) 100ms?  The reason that I
> > > > ask is that I believe that this "if" statement is there for a reason.
> > >
> > > The day before I also got this to Waiman Long who initially made these
> > > changes in
> > > the qrwlock.c file. Turns out, the 'if' block was introduced to
> > > resolve the particular
> > > requirement of tasklist_lock reentering as reader.  He said he will
> > > perhaps come up
> > > with another code change to take care of this new write lock
> > > starvation issue. The
> > > idea is to only allow the tasklist_lock clients to acquire the read
> > > lock through the 'if'
> > > statement,  others are not.
> > >
> > > This sounds like a temporary solution if we cannot think of other
> > > alternative ways
> > > to fix the tasklist_lock issue. The principle here is that we should
> > > not make the
> > > locking primitives more special just in favor of a particular usage or scenario.
> >
> > When principles meet practice, results can vary.  Still, it would be
> > better to have a less troublesome optimization.
> 
> This is a philosophical debate. Let's put it aside.

Exactly!

> > > > >        The same
> > > > > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > > > > broader range
> > > > > of problems,  this is absolutely not the only root cause I have seen.
> > > > > Actually too many
> > > > > things can delay context switching.  Do you have a long term plan to
> > > > > fix this issue,
> > > > > or just want to treat it case by case?
> > > >
> > > > If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> > > > leverage the calls to cond_resched().  If the grace period is old enough,
> > > > cond_resched() will supply a quiescent state.
> > >
> > > So far, all types of rcu stall I am aware of are originated to the
> > > CONFIG_PREEMPT=n
> > > kernel. Isn't it impossible to let rcu not rely on context switch ?
> > > As we know too many
> > > things can delay context switch, so it is not a quite reliable
> > > mechanism if timing and
> > > performance are crucial.
> >
> > Yes, you could build with CONFIG_PREEMPT=y and RCU would not always
> > need to wait for an actual context switch.  But there can be
> > performance issues for some workloads.
> >
> I can give this config (CONFIG_PREEMPT=y) a try when I have time.

Very good!

> > But please note that cond_resched() is not necessarily a context switch.
> >
> > Besides, for a great many workloads, delaying a context switch for
> > very long is a first-class bug anyway.  For example, many internet data
> > centers are said to have sub-second response-time requirements, and such
> > requirements cannot be met if context switches are delayed too long.
> >
> Agreed.
> 
> But on the other hand, if RCU relies on that, the situation could be even
> worse. Simply put, when a grace period cannot end soon, RCU write-side
> operations are delayed, and the callbacks queued on the rcu-stalled CPU are
> delayed as well. So when free memory is already scarce, this situation
> could turn into a deadlock.

Would this situation exist in the first place if a blocking form of
allocation were not being (erroneously) invoked within an RCU read-side
critical section?  Either way, that bug needs to be fixed.

> > > > In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> > > > RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> > > > eternally non-preemptible (for example, eternally in an interrupt
> > > > handler), the grace period will end.
> > >
> > > Among the rcu stall instances I have seen so far, quite a lot of them occurred
> > > on the CPUs which were running in the interrupt context or spinning on spinlocks
> > > with interrupt disabled. In these scenarios, forced schedules will be
> > > delayed until
> > > these activities end.
> >
> > But running for several seconds in interrupt context is not at all good.
> > As is spinning on a spinlock for several seconds.  These are performance
> > bugs in and of themselves.
> 
> Agreed.
> 
> >
> > More on this later in this email...
> >
> > > > But beyond a certain point, case-by-case analysis and handling is
> > > > required.
> > > >
> > > > > Secondly, back to the following code I brought up that day. Actually
> > > > > it is not as simple
> > > > > as spinlock.
> > > > >
> > > > >       rcu_read_lock();
> > > > >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > >       rcu_read_unlock();
> > > > >
> > > > > Are you all aware of all the potential functions that
> > > > > ip_protocol_deliver_rcu will call?
> > > > > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > > > > kmem_cache_alloc
> > > > > which will end up a call to cond_resched().
> > > >
> > > > Can't say that I am familiar with everything that ip_protocol_deliver_rcu().
> > > > There are some tens of millions of lines of code in the kernel, and I have
> > > > but one brain.  ;-)
> > > >
> > > > And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> > > > kernel.  Except that there should not be a call to cond_resched() within
> > > > an RCU read-side critical section.
> > >
> > > with that 3 line snippet from the networking, a call to cond_resched() would
> > > happen within the read-side critical section when the level of variable memory
> > > is very low.
> >
> > That is a bug.  If you build your kernel with CONFIG_PROVE_LOCKING=y,
> > it will complain about a cond_resched() in an RCU read-side critical
> > section.  But, as you say, perhaps only with the level of variable memory
> > is very low.
> 
> There is a typo in my previous email. I meant available (or free).
> Sorry for that.

OK, good, that was my guess.  But invoking a potentially blocking form
of a kernel memory allocator is still a bug.  And that bug needs to
be fixed.  And fixing it might clear up a large fraction of your RCU
grace-period issues.
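
To make that concrete, the usual shapes look something like the sketch
below (struct foo and use_foo() are placeholders, not names from the
networking code):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo { int x; };                  /* placeholder */
static void use_foo(struct foo *p) { }  /* placeholder */

static void good_patterns(void)
{
        struct foo *p;

        /* Option 1: do the possibly-sleeping allocation first. */
        p = kmalloc(sizeof(*p), GFP_KERNEL);
        if (!p)
                return;
        rcu_read_lock();
        use_foo(p);             /* no sleeping in here */
        rcu_read_unlock();
        kfree(p);

        /* Option 2: if it really must be inside, it must not sleep. */
        rcu_read_lock();
        p = kmalloc(sizeof(*p), GFP_ATOMIC);    /* never sleeps, may fail */
        if (p)
                use_foo(p);
        rcu_read_unlock();
        kfree(p);               /* kfree(NULL) is fine */
}

The third option, momentarily exiting the critical section around the
sleeping work, is also possible, but only if nothing obtained under the
old rcu_read_lock() is used after re-entering.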

> > Please do not invoke cond_resched() within an RCU read-side critical
> > section.  Doing so can result in random memory corruption.
> >
> > > > Does the code momentarily exit that
> > > > critical section via something like rcu_read_unlock(); cond_resched();
> > > > rcu_read_lock()?
> > >
> > > As far as I can see, cond_resched would be called between a pair of
> > > rcu_read_lock and rcu_read_unlock.
> >
> > Again, this is a bug.  The usual fix is the GFP_ thing I noted below.
> >
> > > > Or does something prevent the code from getting there
> > > > while in an RCU read-side critical section?  (The usual trick here is
> > > > to have different GFP_ flags depending on the context.)
> > >
> > > Once we invoke kmem_cache_alloc or its variants, we cannot really
> > > predict where we will go and how long this whole process is going to
> > > take in this very large area from kmem to the virtual memory subsystem.
> > > There is a flag __GFP_NOFAIL that determines whether or not cond_resched
> > > should be called before retry, but this flag should be used from page level,
> > > not from the kmem consumer level.  So I think there is little we can do
> > > to avoid the resched.
> >
> > If you are invoking the allocator within an RCU read-side critical
> > section, you should be using GFP_ATOMIC.  Except that doing this has
> > many negative consequences, so it is better to allocate outside of
> > the RCU read-side critical section.
> >
> > The same rules apply when allocating while holding a spinlock, so
> > this is not just RCU placing restrictions on you.  ;-)
> 
> yep, absolutely.
> 
> >
> > > > >                                              Because the operations in memory
> > > > > allocation are too complicated, we cannot alway expect a prompt return
> > > > > with success.
> > > > > When the system is running out of memory, then rcu cannot close the
> > > > > current gp, then
> > > > > great number of callbacks will be delayed and the freeing of the
> > > > > memory they held
> > > > > will be delayed as well. This sounds like a deadlock in the resource flow.
> > > >
> > > > That would of course be bad.  Though I am not familiar with all of the
> > > > details of how the networking guys handle out-of-memory conditions.
> > > >
> > > > The usual advice would be to fail the request, but that does not appear
> > > > to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> > > > must defer to the networking folks.
> > >
> > > Thanks for the advice.
> >
> > Another question...  Why the endless interrupts?  Or is it just one
> > very long interrupt?  Last I knew (admittedly a very long time ago),
> > the high-rate networking drivers used things like NAPI in order to avoid
> > this very problem.
> 
> These are interrupts that run long enough. The networking symptom described
> in my previous email is one of them: because the rwlock favors readers in
> interrupt context, the writer side stays blocked for as long as the readers
> keep coming.

OK, and hopefully Longman finds a way to get his optimization in some
less destructive way.
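
In case it helps, the kind of bound I was thinking of earlier would look
roughly like the sketch below.  This is purely illustrative: the
writer_wait_start field does not exist in today's qrwlock, the 100ms
number is arbitrary, and this by itself does not address the
tasklist_lock reentrancy case that the fast path was added for.

void queued_read_lock_slowpath(struct qrwlock *lock)
{
        if (unlikely(in_interrupt())) {
                /* Hypothetical jiffies stamp of the oldest waiting writer. */
                unsigned long ws = READ_ONCE(lock->writer_wait_start);

                /*
                 * Take the unfair interrupt-context fast path only while
                 * no writer has been kept waiting past the bound.
                 */
                if (!ws || time_before(jiffies, ws + msecs_to_jiffies(100))) {
                        atomic_cond_read_acquire(&lock->cnts,
                                                 !(VAL & _QW_LOCKED));
                        return;
                }
                /* Otherwise fall through and queue like everyone else. */
        }
        /* ... rest of the existing slowpath unchanged ... */
}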

> > Or is this some sort of special case where you are trying to do something
> > special, for example, to achieve extremely low communications latencies?
> 
> No, nothing special I am trying to do.

OK, good.

> > If this is a deliberate design, and if it is endless interrupts instead
> > of one big long one, and if you are deliberately interrupt-storming
> > a particular CPU, another approach is to build the kernel with
> > CONFIG_NO_HZ_FULL=y, and boot with nohz_full=n, where "n" is the number of
> > the CPU that is to be interrupt-stormed.  If you are interrupt storming
> > multiple CPUs, you can specify them, for example, nohz_full=1-5,13 to
> > specify CPUs 1, 2, 3, 4, 5, and 13.  In recent kernels, "N" stands for
> > the CPU with the largest CPU number.
> 
> I did this before, and I saw RCU stalls with this kind of config as well.

When you said you had long enough interrupts, how long were they?

If they were long enough, then yes, you would get a stall.

Suppose that some kernel code still executes on that CPU despite trying to
move things off of it.  As soon as the interrupt hits kernel execution
instead of nohz_full userspace execution, there will be no more RCU
quiescent states, and thus you can see stalls.

> > Then read Documentation/admin-guide/kernel-per-CPU-kthreads.rst, which is
> > probably a bit outdated, but a good place to start.  Follow its guidelines
> > (and, as needed, come up with additional ones) to ensure that CPU "n"
> > is not doing anything.  If you do come up with additional guidelines,
> > please submit a patch to kernel-per-CPU-kthreads.rst so that others can
> > also benefit, as you are benefiting from those before you.
> 
> Thanks for the suggestion.
> >
> > Create a CPU-bound usermode application (a "while (1) continue;" loop or
> > similar), and run that application on CPU "n".  Then start up whatever
> > it is that interrupt-storms CPU "n".
> >
> > Every time CPU "n" returns from interrupt, RCU will see a quiescent state,
> > which will prevent the interrupt storm from delaying RCU grace periods.
> >
> > On the other hand, if this is one big long interrupt, you need to make
> > that interrupt end every so often.  Or move some of the work out of
> > interrupt context, perhaps even to usermode.
> >
> > Much depends on exactly what you are trying to achieve.
> 
> Too many things can contribute to RCU stalls, so let's deal with them case
> by case until there is a permanent solution.

Getting the bugs fixed should be a good start.

							Thanx, Paul

> Thanks
> Donghai
> 
> 
> 
> 
> >
> >                                                         Thanx, Paul
> >
> > > Donghai
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > > Thanks
> > > > > Donghai
> > > > >
> > > > >
> > > > > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > > > > Hello Paul,
> > > > > > > > > > Sorry it has been long..
> > > > > > > > >
> > > > > > > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > > > > > > are not expecting instantaneous response.  ;-)
> > > > > > > > >
> > > > > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > > > > > > problem I described
> > > > > > > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > > > > > > to reproduce than others.
> > > > > > > > > > >
> > > > > > > > > > > Understood, that does make things more difficult.
> > > > > > > > > > >
> > > > > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > > > > it was not in eqs either.
> > > > > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > > > > CPU had been
> > > > > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > > > > And this is true: the first
> > > > > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > > > > when another
> > > > > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > > > > If the kernel missed
> > > > > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > > > > identical ?  But I'll
> > > > > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > > > > >
> > > > > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > > > > >
> > > > > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > > > > >
> > > > > > > > > OK.
> > > > > > > > >
> > > > > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > > > > was backported, but should not have been?
> > > > > > > > > > > >
> > > > > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > > > > >
> > > > > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > > > > >
> > > > > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > > > > machine is set up, see how much it can help.
> > > > > > > > > > >
> > > > > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > > > > due to backporting issues.
> > > > > > > > > >
> > > > > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > > > > >
> > > > > > > > > OK, good.
> > > > > > > > >
> > > > > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > > > > on behalf of that CPU.
> > > > > > > > > > > >
> > > > > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > > > > >
> > > > > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > > > > to actually fix anything.  ;-)
> > > > > > > > > > >
> > > > > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > > > > Or move to newer kernels.
> > > > > > > > > > >
> > > > > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > > > > that you will get is your diagnostic.
> > > > > > > > > >
> > > > > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > > > > >
> > > > > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > > > > > > kinds of writers were blocked on the same rwlock.
> > > > > > > > >
> > > > > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > > > > a networking list on CC for their thoughts.
> > > > > > > >
> > > > > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > > > > email thread with netdev@ only.
> > > > > > > >
> > > > > > > > So back to my original problem, this might be one of the possibilities that
> > > > > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > > > > stall on the CPU and panic.
> > > > > > > >
> > > > > > > > So definitely we need to understand these networking activities here as
> > > > > > > > to why the readers could hold the rwlock too long.
> > > > > > >
> > > > > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > > > > So please see below.
> > > > > >
> > > > > > This is just a brief of my analysis and the stack info below is not enough
> > > > > > for other people to figure out anything useful. I meant if they are really
> > > > > > interested, I can upload the core file. I think this is fair.
> > > > > >
> > > > > > >
> > > > > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > > > > don't see any other obvious problems: e.g
> > > > > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > > > > >
> > > > > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > > > > a rcu reader side section.
> > > > > > > > > >
> > > > > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > > > > struct sk_buff *skb)
> > > > > > > > > > {
> > > > > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > > > > >
> > > > > > > > > >         rcu_read_lock();
> > > > > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > > > > >         rcu_read_unlock();
> > > > > > > > > >
> > > > > > > > > >         return 0;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > > > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > > > > > > >
> > > > > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > > > > per-object lock while still maintaining good performance and
> > > > > > > > > scalability.
> > > > > > > >
> > > > > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > > > > the reader side to block must be avoided.
> > > > > > >
> > > > > > > True.  But this is the Linux kernel, where "block" means something
> > > > > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > > > > acquire spinlocks within RCU read-side critical sections.
> > > > > > >
> > > > > > > And before you complain that practitioners are not following the academic
> > > > > > > definitions, please keep in mind that our definitions were here first.  ;-)
> > > > > > >
> > > > > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > > > > >
> > > > > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > > > > special, just use the default.
> > > > > > >
> > > > > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > > > > bisecting, as this often quickly and easily finds the problem.
> > > > > >
> > > > > > Actually It does. But let's focus on the latest upstream and the latest rhel8.
> > > > > > This way, we will not worry about missing the needed rcu patches.
> > > > > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > > > > networking related, which I am still working on.  So, there might be
> > > > > > multiple root causes.
> > > > > >
> > > > > > > Bisection can also help you find the patch to be backported if a later
> > > > > > > release fixes the bug, though things like gitk can also be helpful.
> > > > > >
> > > > > > Unfortunately, this is reproducible on the latest bit.
> > > > > >
> > > > > > Thanks
> > > > > > Donghai
> > > > > > >
> > > > > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-20 21:33                                         ` Paul E. McKenney
@ 2021-10-21  3:25                                           ` Zhouyi Zhou
  2021-10-21  4:17                                             ` Paul E. McKenney
  2021-10-21 16:44                                           ` donghai qiao
  1 sibling, 1 reply; 26+ messages in thread
From: Zhouyi Zhou @ 2021-10-21  3:25 UTC (permalink / raw)
  To: paulmck; +Cc: donghai qiao, Boqun Feng, rcu, netdev

hi,
I tried to run 5.15.0-rc6+ in an x86-64 qemu-kvm virtual machine with
CONFIG_PREEMPT=n, then modprobe rcutorture, ran netstrain (an open-source
network performance stress tool) in it, and ran the following program on
the nohz_full CPUs:
int main(void)
{
    /* CPU-bound loop to keep the nohz_full CPU busy in user space. */
    volatile unsigned long l = 1;

    while (1) {
      l *= 0.3333;
      l /= 0.3333;
    }
}
It seems that nothing happens (no RCU stall is observed).
I am glad to study the topics in this email thread more thoroughly, and to
perform more tests on an x86-64 host (instead of a virtual machine) and an
aarch64 host later on ;-)
Zhouyi

On Thu, Oct 21, 2021 at 5:33 AM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Oct 20, 2021 at 04:05:59PM -0400, donghai qiao wrote:
> > On Wed, Oct 20, 2021 at 2:37 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 01:48:15PM -0400, donghai qiao wrote:
> > > > On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > > > > > I just want to follow up this discussion. First off, the latest issue
> > > > > > I mentioned in the email of Oct 4th which
> > > > > > exhibited a symptom of networking appeared to be a problem in
> > > > > > qrwlock.c. Particularly the problem is
> > > > > > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> > > > > >
> > > > > > void queued_read_lock_slowpath(struct qrwlock *lock)
> > > > > > {
> > > > > >         /*
> > > > > >          * Readers come here when they cannot get the lock without waiting
> > > > > >          */
> > > > > >         if (unlikely(in_interrupt())) {
> > > > > >                 /*
> > > > > >                  * Readers in interrupt context will get the lock immediately
> > > > > >                  * if the writer is just waiting (not holding the lock yet),
> > > > > >                  * so spin with ACQUIRE semantics until the lock is available
> > > > > >                  * without waiting in the queue.
> > > > > >                  */
> > > > > >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> > > > > >                 return;
> > > > > >         }
> > > > > >         ...
> > > > > > }
> > > > > >
> > > > > > That 'if' statement said, if we are in an interrupt context and we are
> > > > > > a reader, then
> > > > > > we will be allowed to enter the lock as a reader no matter if there
> > > > > > are writers waiting
> > > > > > for it or not. So, in the circumstance when the network packets steadily come in
> > > > > > and the intervals are relatively small enough, then the writers will
> > > > > > have no chance to
> > > > > > acquire the lock. This should be the root cause for that case.
> > > > > >
> > > > > > I have verified it by removing the 'if' and rerun the test multiple
> > > > > > times.
> > > > >
> > > > > That could do it!
> > > > >
> > > > > Would it make sense to keep the current check, but to also check if a
> > > > > writer had been waiting for more than (say) 100ms?  The reason that I
> > > > > ask is that I believe that this "if" statement is there for a reason.
> > > >
> > > > The day before I also got this to Waiman Long who initially made these
> > > > changes in
> > > > the qrwlock.c file. Turns out, the 'if' block was introduced to
> > > > resolve the particular
> > > > requirement of tasklist_lock reentering as reader.  He said he will
> > > > perhaps come up
> > > > with another code change to take care of this new write lock
> > > > starvation issue. The
> > > > idea is to only allow the tasklist_lock clients to acquire the read
> > > > lock through the 'if'
> > > > statement,  others are not.
> > > >
> > > > This sounds like a temporary solution if we cannot think of other
> > > > alternative ways
> > > > to fix the tasklist_lock issue. The principle here is that we should
> > > > not make the
> > > > locking primitives more special just in favor of a particular usage or scenario.
> > >
> > > When principles meet practice, results can vary.  Still, it would be
> > > better to have a less troublesome optimization.
> >
> > This is a philosophical debate. Let's put it aside.
>
> Exactly!
>
> > > > > >        The same
> > > > > > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > > > > > broader range
> > > > > > of problems,  this is absolutely not the only root cause I have seen.
> > > > > > Actually too many
> > > > > > things can delay context switching.  Do you have a long term plan to
> > > > > > fix this issue,
> > > > > > or just want to treat it case by case?
> > > > >
> > > > > If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> > > > > leverage the calls to cond_resched().  If the grace period is old enough,
> > > > > cond_resched() will supply a quiescent state.
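> > > > >
> > > > > For example, a long-running loop in a CONFIG_PREEMPT=n kernel is expected
> > > > > to contain something like the following (an illustrative sketch only;
> > > > > process_one() and item_list are made-up names):
> > > > >
> > > > >         struct item *ip;
> > > > >
> > > > >         list_for_each_entry(ip, &item_list, node) {
> > > > >                 process_one(ip);
> > > > >                 cond_resched();  /* reports a QS if the grace period is old enough */
> > > > >         }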
> > > >
> > > > So far, all types of rcu stall I am aware of originate from the
> > > > CONFIG_PREEMPT=n kernel. Is it possible to make rcu not rely on context
> > > > switches? As we know, too many things can delay a context switch, so it is
> > > > not a very reliable mechanism when timing and performance are crucial.
> > >
> > > Yes, you could build with CONFIG_PREEMPT=y and RCU would not always
> > > need to wait for an actual context switch.  But there can be
> > > performance issues for some workloads.
> > >
> > I can give this config (CONFIG_PREEMPT=y) a try when I have time.
>
> Very good!
>
> > > But please note that cond_resched() is not necessarily a context switch.
> > >
> > > Besides, for a great many workloads, delaying a context switch for
> > > very long is a first-class bug anyway.  For example, many internet data
> > > centers are said to have sub-second response-time requirements, and such
> > > requirements cannot be met if context switches are delayed too long.
> > >
> > Agreed.
> >
> > But on the other hand, if rcu relies on that,  the situation could be
> > even worse.
> > Simply put, when a gp cannot end soon, some rcu write-side will be delayed,
> > and the callbacks on the rcu-stalled CPU will be delayed. Thus in the case of
> > lack of free memory, this situation could form a deadlock.
>
> Would this situation exist in the first place if a blocking form of
> allocation were not being (erroneously) invoked within an RCU read-side
> critical section?  Either way, that bug needs to be fixed.
>
> > > > > In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> > > > > RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> > > > > eternally non-preemptible (for example, eternally in an interrupt
> > > > > handler), the grace period will end.
> > > >
> > > > Among the rcu stall instances I have seen so far, quite a lot of them occurred
> > > > on the CPUs which were running in the interrupt context or spinning on spinlocks
> > > > with interrupt disabled. In these scenarios, forced schedules will be
> > > > delayed until
> > > > these activities end.
> > >
> > > But running for several seconds in interrupt context is not at all good.
> > > As is spinning on a spinlock for several seconds.  These are performance
> > > bugs in and of themselves.
> >
> > Agreed.
> >
> > >
> > > More on this later in this email...
> > >
> > > > > But beyond a certain point, case-by-case analysis and handling is
> > > > > required.
> > > > >
> > > > > > Secondly, back to the following code I brought up that day. Actually
> > > > > > it is not as simple
> > > > > > as spinlock.
> > > > > >
> > > > > >       rcu_read_lock();
> > > > > >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > >       rcu_read_unlock();
> > > > > >
> > > > > > Are you all aware of all the potential functions that
> > > > > > ip_protocol_deliver_rcu will call?
> > > > > > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > > > > > kmem_cache_alloc
> > > > > > which will end up in a call to cond_resched().
> > > > >
> > > > > Can't say that I am familiar with everything that ip_protocol_deliver_rcu().
> > > > > There are some tens of millions of lines of code in the kernel, and I have
> > > > > but one brain.  ;-)
> > > > >
> > > > > And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> > > > > kernel.  Except that there should not be a call to cond_resched() within
> > > > > an RCU read-side critical section.
> > > >
> > > > with that 3 line snippet from the networking, a call to cond_resched() would
> > > > happen within the read-side critical section when the level of variable memory
> > > > is very low.
> > >
> > > That is a bug.  If you build your kernel with CONFIG_PROVE_LOCKING=y,
> > > it will complain about a cond_resched() in an RCU read-side critical
> > > section.  But, as you say, perhaps only with the level of variable memory
> > > is very low.
> >
> > There is a typo in my previous email. I meant available (or free).
> > Sorry for that.
>
> OK, good, that was my guess.  But invoking a potentially blocking form
> of a kernel memory allocator is still a bug.  And that bug needs to
> be fixed.  And fixing it might clear up a large fraction of your RCU
> grace-period issues.
>
> > > Please do not invoke cond_resched() within an RCU read-side critical
> > > section.  Doing so can result in random memory corruption.
> > >
> > > > > Does the code momentarily exit that
> > > > > critical section via something like rcu_read_unlock(); cond_resched();
> > > > > rcu_read_lock()?
> > > >
> > > > As far as I can see, cond_resched would be called between a pair of
> > > > rcu_read_lock and rcu_read_unlock.
> > >
> > > Again, this is a bug.  The usual fix is the GFP_ thing I noted below.
> > >
> > > > > Or does something prevent the code from getting there
> > > > > while in an RCU read-side critical section?  (The usual trick here is
> > > > > to have different GFP_ flags depending on the context.)
> > > >
> > > > Once we invoke kmem_cache_alloc or its variants, we cannot really
> > > > predict where we will go and how long this whole process is going to
> > > > take in this very large area from kmem to the virtual memory subsystem.
> > > > There is a flag __GFP_NOFAIL that determines whether or not cond_resched
> > > > should be called before retry, but this flag should be used from page level,
> > > > not from the kmem consumer level.  So I think there is little we can do
> > > > to avoid the resched.
> > >
> > > If you are invoking the allocator within an RCU read-side critical
> > > section, you should be using GFP_ATOMIC.  Except that doing this has
> > > many negative consequences, so it is better to allocate outside of
> > > the RCU read-side critical section.
> > >
> > > The same rules apply when allocating while holding a spinlock, so
> > > this is not just RCU placing restrictions on you.  ;-)
> >
> > yep, absolutely.
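> >
> > For the record, the pattern would be something like this minimal sketch
> > (my_cache and obj are made-up names, not the actual networking code):
> >
> >         /* Inside an RCU read-side critical section, only atomic allocation: */
> >         rcu_read_lock();
> >         obj = kmem_cache_alloc(my_cache, GFP_ATOMIC);   /* never GFP_KERNEL here */
> >         ...
> >         rcu_read_unlock();
> >
> >         /* Better: allocate before entering the critical section. */
> >         obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
> >         rcu_read_lock();
> >         ...
> >         rcu_read_unlock();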
> >
> > >
> > > > > >                                              Because the operations in memory
> > > > > > > allocation are too complicated, we cannot always expect a prompt return
> > > > > > with success.
> > > > > > When the system is running out of memory, then rcu cannot close the
> > > > > > current gp, then
> > > > > > great number of callbacks will be delayed and the freeing of the
> > > > > > memory they held
> > > > > > will be delayed as well. This sounds like a deadlock in the resource flow.
> > > > >
> > > > > That would of course be bad.  Though I am not familiar with all of the
> > > > > details of how the networking guys handle out-of-memory conditions.
> > > > >
> > > > > The usual advice would be to fail the request, but that does not appear
> > > > > to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> > > > > must defer to the networking folks.
> > > >
> > > > Thanks for the advice.
> > >
> > > Another question...  Why the endless interrupts?  Or is it just one
> > > very long interrupt?  Last I knew (admittedly a very long time ago),
> > > the high-rate networking drivers used things like NAPI in order to avoid
> > > this very problem.
> >
> > These should be long enough interrupts. The networking symptom described in
> > the previous email is one of them.  In that case, because the rwlock favors
> > readers in interrupt context, the writer side would be blocked for as long as
> > the readers keep coming.
>
> OK, and hopefully Longman finds a way to get his optimization in some
> less destructive way.
>
> > > Or is this some sort of special case where you are trying to do something
> > > special, for example, to achieve extremely low communications latencies?
> >
> > No, nothing special I am trying to do.
>
> OK, good.
>
> > > If this is a deliberate design, and if it is endless interrupts instead
> > > of one big long one, and if you are deliberately interrupt-storming
> > > a particular CPU, another approach is to build the kernel with
> > > CONFIG_NO_HZ_FULL=y, and boot with nohz_full=n, where "n" is the number of
> > > the CPU that is to be interrupt-stormed.  If you are interrupt storming
> > > multiple CPUs, you can specify them, for example, nohz_full=1-5,13 to
> > > specify CPUs 1, 2, 3, 4, 5, and 13.  In recent kernels, "N" stands for
> > > the CPU with the largest CPU number.
> >
> > I did this before, and I saw rcu stall as well with this kinda config.
>
> When you said you had long enough interrupts, how long were they?
>
> If they were long enough, then yes, you would get a stall.
>
> Suppose that some kernel code still executes on that CPU despite trying to
> move things off of it.  As soon as the interrupt hits kernel execution
> instead of nohz_full userspace execution, there will be no more RCU
> quiescent states, and thus you can see stalls.
>
> > > Then read Documentation/admin-guide/kernel-per-CPU-kthreads.rst, which is
> > > probably a bit outdated, but a good place to start.  Follow its guidelines
> > > (and, as needed, come up with additional ones) to ensure that CPU "n"
> > > is not doing anything.  If you do come up with additional guidelines,
> > > please submit a patch to kernel-per-CPU-kthreads.rst so that others can
> > > also benefit, as you are benefiting from those before you.
> >
> > Thanks for the suggestion.
> > >
> > > Create a CPU-bound usermode application (a "while (1) continue;" loop or
> > > similar), and run that application on CPU "n".  Then start up whatever
> > > it is that interrupt-storms CPU "n".
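> > >
> > > For concreteness, a minimal version of such an application (pinned with,
> > > for example, "taskset -c n ./spin") could be:
> > >
> > >         /* spin.c: keep CPU "n" executing in nohz_full userspace. */
> > >         int main(void)
> > >         {
> > >                 while (1)
> > >                         continue;
> > >                 return 0;
> > >         }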
> > >
> > > Every time CPU "n" returns from interrupt, RCU will see a quiescent state,
> > > which will prevent the interrupt storm from delaying RCU grace periods.
> > >
> > > On the other hand, if this is one big long interrupt, you need to make
> > > that interrupt end every so often.  Or move some of the work out of
> > > interrupt context, perhaps even to usermode.
> > >
> > > Much depends on exactly what you are trying to achieve.
> >
> > The things that can affect rcu stall are too many. So let's deal with
> > it case by case
> > before there is a permanent solution.
>
> Getting the bugs fixed should be a good start.
>
>                                                         Thanx, Paul
>
> > Thanks
> > Donghai
> >
> >
> >
> >
> > >
> > >                                                         Thanx, Paul
> > >
> > > > Donghai
> > > > >
> > > > >                                                         Thanx, Paul
> > > > >
> > > > > > Thanks
> > > > > > Donghai
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > > > > > Hello Paul,
> > > > > > > > > > > Sorry it has been long..
> > > > > > > > > >
> > > > > > > > > > On this problem, your schedule is my schedule.  At least as long as you
> > > > > > > > > > are not expecting an instantaneous response.  ;-)
> > > > > > > > > >
> > > > > > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > > > > > the configurations in these kernels may be different. Initially, the
> > > > > > > > > > > > > problem I described originated on rhel-8, on which the problem occurs more
> > > > > > > > > > > > > often and is a bit easier to reproduce than on others.
> > > > > > > > > > > >
> > > > > > > > > > > > Understood, that does make things more difficult.
> > > > > > > > > > > >
> > > > > > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > > > > > it was not in eqs either.
> > > > > > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > > > > > CPU had been
> > > > > > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > > > > > And this is true: the first
> > > > > > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > > > > > when another
> > > > > > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > > > > > If the kernel missed
> > > > > > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > > > > > identical ?  But I'll
> > > > > > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > > > > > >
> > > > > > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > > > > > >
> > > > > > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > > > > > >
> > > > > > > > > > OK.
> > > > > > > > > >
> > > > > > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > > > > > was backported, but should not have been?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > > > > > machine is set up, see how much it can help.
> > > > > > > > > > > >
> > > > > > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > > > > > due to backporting issues.
> > > > > > > > > > >
> > > > > > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > > > > > >
> > > > > > > > > > OK, good.
> > > > > > > > > >
> > > > > > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > > > > > on behalf of that CPU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_eqs_exit() when rcu_eqs_enter()
> > > > > > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > > > > > >
> > > > > > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > > > > > to actually fix anything.  ;-)
> > > > > > > > > > > >
> > > > > > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > > > > > Or move to newer kernels.
> > > > > > > > > > > >
> > > > > > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > > > > > that you will get is your diagnostic.
> > > > > > > > > > >
> > > > > > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > > > > > >
> > > > > > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > > > > > which had been locked by 16 readers.  By the way, there were 70 writers
> > > > > > > > > > > of this kind blocked on the same rwlock.
> > > > > > > > > >
> > > > > > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > > > > > a networking list on CC for their thoughts.
> > > > > > > > >
> > > > > > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > > > > > email thread with netdev@ only.
> > > > > > > > >
> > > > > > > > > So back to my original problem, this might be one of the possibilities that
> > > > > > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > > > > > stall on the CPU and panic.
> > > > > > > > >
> > > > > > > > > So definitely we need to understand these networking activities here as
> > > > > > > > > to why the readers could hold the rwlock too long.
> > > > > > > >
> > > > > > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > > > > > So please see below.
> > > > > > >
> > > > > > > This is just a brief of my analysis and the stack info below is not enough
> > > > > > > for other people to figure out anything useful. I meant if they are really
> > > > > > > interested, I can upload the core file. I think this is fair.
> > > > > > >
> > > > > > > >
> > > > > > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > > > > > don't see any other obvious problems: e.g
> > > > > > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > > > > > >
> > > > > > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > > > > > a rcu reader side section.
> > > > > > > > > > >
> > > > > > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > > > > > struct sk_buff *skb)
> > > > > > > > > > > {
> > > > > > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > > > > > >
> > > > > > > > > > >         rcu_read_lock();
> > > > > > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > > > > > >         rcu_read_unlock();
> > > > > > > > > > >
> > > > > > > > > > >         return 0;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > > > > > spinlocks starting from ip_protocol_deliver_rcu(). This kind of
> > > > > > > > > > > usage does not look quite right. But I'd like to know your opinion on this first.
> > > > > > > > > >
> > > > > > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > > > > > per-object lock while still maintaining good performance and
> > > > > > > > > > scalability.
> > > > > > > > >
> > > > > > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > > > > > the reader side to block must be avoided.
> > > > > > > >
> > > > > > > > True.  But this is the Linux kernel, where "block" means something
> > > > > > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > > > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > > > > > acquire spinlocks within RCU read-side critical sections.
> > > > > > > >
> > > > > > > > And before you complain that practitioners are not following the academic
> > > > > > > > definitions, please keep in mind that our definitions were here first.  ;-)
> > > > > > > >
> > > > > > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > > > > > >
> > > > > > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > > > > > special, just use the default.
> > > > > > > >
> > > > > > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > > > > > bisecting, as this often quickly and easily finds the problem.
> > > > > > >
> > > > > > > Actually it does. But let's focus on the latest upstream and the latest rhel8.
> > > > > > > This way, we will not worry about missing the needed rcu patches.
> > > > > > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > > > > > networking related, which I am still working on.  So, there might be
> > > > > > > multiple root causes.
> > > > > > >
> > > > > > > > Bisection can also help you find the patch to be backported if a later
> > > > > > > > release fixes the bug, though things like gitk can also be helpful.
> > > > > > >
> > > > > > > Unfortunately, this is reproducible on the latest bits.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Donghai
> > > > > > > >
> > > > > > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-21  3:25                                           ` Zhouyi Zhou
@ 2021-10-21  4:17                                             ` Paul E. McKenney
  0 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2021-10-21  4:17 UTC (permalink / raw)
  To: Zhouyi Zhou; +Cc: donghai qiao, Boqun Feng, rcu, netdev

On Thu, Oct 21, 2021 at 11:25:22AM +0800, Zhouyi Zhou wrote:
> hi,
> I tried to run 5.15.0-rc6+ in an x86-64 qemu-kvm virtual machine with
> CONFIG_PREEMPT=n, then modprobe rcutorture, and run netstrain (an open
> source network performance stress tool) in it, and run the following
> program on nohz_full cpus:
> int main()
> {
>     unsigned long l = 1;  /* initialize so the arithmetic below is well defined */
>
>     while (1) {
>         l *= 0.3333;
>         l /= 0.3333;
>     }
> }
> It seems nothing happens
> I am glad to study the material around this email thread more
> thoroughly, and will perform more tests on an x86-64 host (instead of a
> virtual machine) and an aarch64 host later on ;-)

Good to hear that it survived, but rcutorture was not designed to play
nice or to share with other stress tests.  ;-)

							Thanx, Paul

> Zhouyi
> 
> On Thu, Oct 21, 2021 at 5:33 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Wed, Oct 20, 2021 at 04:05:59PM -0400, donghai qiao wrote:
> > > On Wed, Oct 20, 2021 at 2:37 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 01:48:15PM -0400, donghai qiao wrote:
> > > > > On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > >
> > > > > > On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > > > > > > I just want to follow up this discussion. First off, the latest issue
> > > > > > > I mentioned in the email of Oct 4th which
> > > > > > > exhibited a symptom of networking appeared to be a problem in
> > > > > > > qrwlock.c. Particularly the problem is
> > > > > > > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> > > > > > >
> > > > > > > void queued_read_lock_slowpath(struct qrwlock *lock)
> > > > > > > {
> > > > > > >         /*
> > > > > > >          * Readers come here when they cannot get the lock without waiting
> > > > > > >          */
> > > > > > >         if (unlikely(in_interrupt())) {
> > > > > > >                 /*
> > > > > > >                  * Readers in interrupt context will get the lock immediately
> > > > > > >                  * if the writer is just waiting (not holding the lock yet),
> > > > > > >                  * so spin with ACQUIRE semantics until the lock is available
> > > > > > >                  * without waiting in the queue.
> > > > > > >                  */
> > > > > > >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> > > > > > >                 return;
> > > > > > >         }
> > > > > > >         ...
> > > > > > > }
> > > > > > >
> > > > > > > That 'if' statement says that if we are in an interrupt context and we are
> > > > > > > a reader, we will be allowed to take the lock as a reader regardless of
> > > > > > > whether there are writers waiting for it. So, when network packets keep
> > > > > > > coming in and the intervals between them are small enough, the writers
> > > > > > > will have no chance to acquire the lock. This should be the root cause
> > > > > > > for that case.
> > > > > > >
> > > > > > > I have verified it by removing the 'if' and rerunning the test multiple
> > > > > > > times.
> > > > > >
> > > > > > That could do it!
> > > > > >
> > > > > > Would it make sense to keep the current check, but to also check if a
> > > > > > writer had been waiting for more than (say) 100ms?  The reason that I
> > > > > > ask is that I believe that this "if" statement is there for a reason.
> > > > >
> > > > > The day before, I also brought this to Waiman Long, who initially made these
> > > > > changes in the qrwlock.c file. It turns out the 'if' block was introduced to
> > > > > resolve the particular requirement of tasklist_lock re-entering as a reader.
> > > > > He said he will perhaps come up with another code change to take care of this
> > > > > new write-lock starvation issue. The idea is to allow only the tasklist_lock
> > > > > clients to acquire the read lock through the 'if' statement; other callers
> > > > > would not be.
> > > > >
> > > > > This sounds like a temporary solution if we cannot think of other
> > > > > alternative ways
> > > > > to fix the tasklist_lock issue. The principle here is that we should
> > > > > not make the
> > > > > locking primitives more special just in favor of a particular usage or scenario.
> > > >
> > > > When principles meet practice, results can vary.  Still, it would be
> > > > better to have a less troublesome optimization.
> > >
> > > This is a philosophical debate. Let's put it aside.
> >
> > Exactly!
> >
> > > > > > >        The same
> > > > > > > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > > > > > > broader range
> > > > > > > of problems,  this is absolutely not the only root cause I have seen.
> > > > > > > Actually too many
> > > > > > > things can delay context switching.  Do you have a long term plan to
> > > > > > > fix this issue,
> > > > > > > or just want to treat it case by case?
> > > > > >
> > > > > > If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> > > > > > leverage the calls to cond_resched().  If the grace period is old enough,
> > > > > > cond_resched() will supply a quiescent state.
> > > > >
> > > > > So far, all types of rcu stall I am aware of originate from the
> > > > > CONFIG_PREEMPT=n kernel. Is it possible to make rcu not rely on context
> > > > > switches? As we know, too many things can delay a context switch, so it is
> > > > > not a very reliable mechanism when timing and performance are crucial.
> > > >
> > > > Yes, you could build with CONFIG_PREEMPT=y and RCU would not always
> > > > need to wait for an actual context switch.  But there can be
> > > > performance issues for some workloads.
> > > >
> > > I can give this config (CONFIG_PREEMPT=y) a try when I have time.
> >
> > Very good!
> >
> > > > But please note that cond_resched() is not necessarily a context switch.
> > > >
> > > > Besides, for a great many workloads, delaying a context switch for
> > > > very long is a first-class bug anyway.  For example, many internet data
> > > > centers are said to have sub-second response-time requirements, and such
> > > > requirements cannot be met if context switches are delayed too long.
> > > >
> > > Agreed.
> > >
> > > But on the other hand, if rcu relies on that,  the situation could be
> > > even worse.
> > > Simply put, when a gp cannot end soon, some rcu write-side will be delayed,
> > > and the callbacks on the rcu-stalled CPU will be delayed. Thus in the case of
> > > lack of free memory, this situation could form a deadlock.
> >
> > Would this situation exist in the first place if a blocking form of
> > allocation were not being (erroneously) invoked within an RCU read-side
> > critical section?  Either way, that bug needs to be fixed.
> >
> > > > > > In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> > > > > > RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> > > > > > eternally non-preemptible (for example, eternally in an interrupt
> > > > > > handler), the grace period will end.
> > > > >
> > > > > Among the rcu stall instances I have seen so far, quite a lot of them occurred
> > > > > on the CPUs which were running in the interrupt context or spinning on spinlocks
> > > > > with interrupt disabled. In these scenarios, forced schedules will be
> > > > > delayed until
> > > > > these activities end.
> > > >
> > > > But running for several seconds in interrupt context is not at all good.
> > > > As is spinning on a spinlock for several seconds.  These are performance
> > > > bugs in and of themselves.
> > >
> > > Agreed.
> > >
> > > >
> > > > More on this later in this email...
> > > >
> > > > > > But beyond a certain point, case-by-case analysis and handling is
> > > > > > required.
> > > > > >
> > > > > > > Secondly, back to the following code I brought up that day. Actually
> > > > > > > it is not as simple
> > > > > > > as spinlock.
> > > > > > >
> > > > > > >       rcu_read_lock();
> > > > > > >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > >       rcu_read_unlock();
> > > > > > >
> > > > > > > Are you all aware of all the potential functions that
> > > > > > > ip_protocol_deliver_rcu will call?
> > > > > > > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > > > > > > kmem_cache_alloc
> > > > > > > which will end up in a call to cond_resched().
> > > > > >
> > > > > > Can't say that I am familiar with everything that ip_protocol_deliver_rcu().
> > > > > > There are some tens of millions of lines of code in the kernel, and I have
> > > > > > but one brain.  ;-)
> > > > > >
> > > > > > And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> > > > > > kernel.  Except that there should not be a call to cond_resched() within
> > > > > > an RCU read-side critical section.
> > > > >
> > > > > with that 3 line snippet from the networking, a call to cond_resched() would
> > > > > happen within the read-side critical section when the level of variable memory
> > > > > is very low.
> > > >
> > > > That is a bug.  If you build your kernel with CONFIG_PROVE_LOCKING=y,
> > > > it will complain about a cond_resched() in an RCU read-side critical
> > > > section.  But, as you say, perhaps only with the level of variable memory
> > > > is very low.
> > >
> > > There is a typo in my previous email. I meant available (or free).
> > > Sorry for that.
> >
> > OK, good, that was my guess.  But invoking a potentially blocking form
> > of a kernel memory allocator is still a bug.  And that bug needs to
> > be fixed.  And fixing it might clear up a large fraction of your RCU
> > grace-period issues.
> >
> > > > Please do not invoke cond_resched() within an RCU read-side critical
> > > > section.  Doing so can result in random memory corruption.
> > > >
> > > > > > Does the code momentarily exit that
> > > > > > critical section via something like rcu_read_unlock(); cond_resched();
> > > > > > rcu_read_lock()?
> > > > >
> > > > > As far as I can see, cond_resched would be called between a pair of
> > > > > rcu_read_lock and rcu_read_unlock.
> > > >
> > > > Again, this is a bug.  The usual fix is the GFP_ thing I noted below.
> > > >
> > > > > > Or does something prevent the code from getting there
> > > > > > while in an RCU read-side critical section?  (The usual trick here is
> > > > > > to have different GFP_ flags depending on the context.)
> > > > >
> > > > > Once we invoke kmem_cache_alloc or its variants, we cannot really
> > > > > predict where we will go and how long this whole process is going to
> > > > > take in this very large area from kmem to the virtual memory subsystem.
> > > > > There is a flag __GFP_NOFAIL that determines whether or not cond_resched
> > > > > should be called before retry, but this flag should be used from page level,
> > > > > not from the kmem consumer level.  So I think there is little we can do
> > > > > to avoid the resched.
> > > >
> > > > If you are invoking the allocator within an RCU read-side critical
> > > > section, you should be using GFP_ATOMIC.  Except that doing this has
> > > > many negative consequences, so it is better to allocate outside of
> > > > the RCU read-side critical section.
> > > >
> > > > The same rules apply when allocating while holding a spinlock, so
> > > > this is not just RCU placing restrictions on you.  ;-)
> > >
> > > yep, absolutely.
> > >
> > > >
> > > > > > >                                              Because the operations in memory
> > > > > > > > allocation are too complicated, we cannot always expect a prompt return
> > > > > > > with success.
> > > > > > > When the system is running out of memory, then rcu cannot close the
> > > > > > > current gp, then
> > > > > > > great number of callbacks will be delayed and the freeing of the
> > > > > > > memory they held
> > > > > > > will be delayed as well. This sounds like a deadlock in the resource flow.
> > > > > >
> > > > > > That would of course be bad.  Though I am not familiar with all of the
> > > > > > details of how the networking guys handle out-of-memory conditions.
> > > > > >
> > > > > > The usual advice would be to fail the request, but that does not appear
> > > > > > to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> > > > > > must defer to the networking folks.
> > > > >
> > > > > Thanks for the advice.
> > > >
> > > > Another question...  Why the endless interrupts?  Or is it just one
> > > > very long interrupt?  Last I knew (admittedly a very long time ago),
> > > > the high-rate networking drivers used things like NAPI in order to avoid
> > > > this very problem.
> > >
> > > These should be long enough interrupts. The networking symptom described in
> > > the previous email is one of them.  In that case, because the rwlock favors
> > > readers in interrupt context, the writer side would be blocked for as long as
> > > the readers keep coming.
> >
> > OK, and hopefully Longman finds a way to get his optimization in some
> > less destructive way.
> >
> > > > Or is this some sort of special case where you are trying to do something
> > > > special, for example, to achieve extremely low communications latencies?
> > >
> > > No, nothing special I am trying to do.
> >
> > OK, good.
> >
> > > > If this is a deliberate design, and if it is endless interrupts instead
> > > > of one big long one, and if you are deliberately interrupt-storming
> > > > a particular CPU, another approach is to build the kernel with
> > > > CONFIG_NO_HZ_FULL=y, and boot with nohz_full=n, where "n" is the number of
> > > > the CPU that is to be interrupt-stormed.  If you are interrupt storming
> > > > multiple CPUs, you can specify them, for example, nohz_full=1-5,13 to
> > > > specify CPUs 1, 2, 3, 4, 5, and 13.  In recent kernels, "N" stands for
> > > > the CPU with the largest CPU number.
> > >
> > > I did this before, and I saw rcu stall as well with this kinda config.
> >
> > When you said you had long enough interrupts, how long were they?
> >
> > If they were long enough, then yes, you would get a stall.
> >
> > Suppose that some kernel code still executes on that CPU despite trying to
> > move things off of it.  As soon as the interrupt hits kernel execution
> > instead of nohz_full userspace execution, there will be no more RCU
> > quiescent states, and thus you can see stalls.
> >
> > > > Then read Documentation/admin-guide/kernel-per-CPU-kthreads.rst, which is
> > > > probably a bit outdated, but a good place to start.  Follow its guidelines
> > > > (and, as needed, come up with additional ones) to ensure that CPU "n"
> > > > is not doing anything.  If you do come up with additional guidelines,
> > > > please submit a patch to kernel-per-CPU-kthreads.rst so that others can
> > > > also benefit, as you are benefiting from those before you.
> > >
> > > Thanks for the suggestion.
> > > >
> > > > Create a CPU-bound usermode application (a "while (1) continue;" loop or
> > > > similar), and run that application on CPU "n".  Then start up whatever
> > > > it is that interrupt-storms CPU "n".
> > > >
> > > > Every time CPU "n" returns from interrupt, RCU will see a quiescent state,
> > > > which will prevent the interrupt storm from delaying RCU grace periods.
> > > >
> > > > On the other hand, if this is one big long interrupt, you need to make
> > > > that interrupt end every so often.  Or move some of the work out of
> > > > interrupt context, perhaps even to usermode.
> > > >
> > > > Much depends on exactly what you are trying to achieve.
> > >
> > > The things that can affect rcu stall are too many. So let's deal with
> > > it case by case
> > > before there is a permanent solution.
> >
> > Getting the bugs fixed should be a good start.
> >
> >                                                         Thanx, Paul
> >
> > > Thanks
> > > Donghai
> > >
> > >
> > >
> > >
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > > Donghai
> > > > > >
> > > > > >                                                         Thanx, Paul
> > > > > >
> > > > > > > Thanks
> > > > > > > Donghai
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > > > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > > > > > > Hello Paul,
> > > > > > > > > > > > Sorry it has been long..
> > > > > > > > > > >
> > > > > > > > > > > On this problem, your schedule is my schedule.  At least as long as you
> > > > > > > > > > > are not expecting an instantaneous response.  ;-)
> > > > > > > > > > >
> > > > > > > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > > > > > > the configurations in these kernels may be different. Initially, the
> > > > > > > > > > > > > > problem I described originated on rhel-8, on which the problem occurs more
> > > > > > > > > > > > > > often and is a bit easier to reproduce than on others.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Understood, that does make things more difficult.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > > > > > > it was not in eqs either.
> > > > > > > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > > > > > > CPU had been
> > > > > > > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > > > > > > And this is true: the first
> > > > > > > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > > > > > > when another
> > > > > > > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > > > > > > If the kernel missed
> > > > > > > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > > > > > > identical ?  But I'll
> > > > > > > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > > > > > > >
> > > > > > > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > > > > > > >
> > > > > > > > > > > OK.
> > > > > > > > > > >
> > > > > > > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > > > > > > was backported, but should not have been?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > > > > > > machine is set up, see how much it can help.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > > > > > > due to backporting issues.
> > > > > > > > > > > >
> > > > > > > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > > > > > > >
> > > > > > > > > > > OK, good.
> > > > > > > > > > >
> > > > > > > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > > > > > > on behalf of that CPU.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_eqs_exit() when rcu_eqs_enter()
> > > > > > > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > > > > > > >
> > > > > > > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > > > > > > to actually fix anything.  ;-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > > > > > > Or move to newer kernels.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > > > > > > that you will get is your diagnostic.
> > > > > > > > > > > >
> > > > > > > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > > > > > > >
> > > > > > > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > > > > > > which had been locked by 16 readers.  By the way, there were 70 writers
> > > > > > > > > > > > of this kind blocked on the same rwlock.
> > > > > > > > > > >
> > > > > > > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > > > > > > a networking list on CC for their thoughts.
> > > > > > > > > >
> > > > > > > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > > > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > > > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > > > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > > > > > > email thread with netdev@ only.
> > > > > > > > > >
> > > > > > > > > > So back to my original problem, this might be one of the possibilities that
> > > > > > > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > > > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > > > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > > > > > > stall on the CPU and panic.
> > > > > > > > > >
> > > > > > > > > > So definitely we need to understand these networking activities here as
> > > > > > > > > > to why the readers could hold the rwlock too long.
> > > > > > > > >
> > > > > > > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > > > > > > So please see below.
> > > > > > > >
> > > > > > > > This is just a brief of my analysis and the stack info below is not enough
> > > > > > > > for other people to figure out anything useful. I meant if they are really
> > > > > > > > interested, I can upload the core file. I think this is fair.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > > > > > > don't see any other obvious problems: e.g
> > > > > > > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > > > > > > >
> > > > > > > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > > > > > > a rcu reader side section.
> > > > > > > > > > > >
> > > > > > > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > > > > > > struct sk_buff *skb)
> > > > > > > > > > > > {
> > > > > > > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > > > > > > >
> > > > > > > > > > > >         rcu_read_lock();
> > > > > > > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > > > > > > >         rcu_read_unlock();
> > > > > > > > > > > >
> > > > > > > > > > > >         return 0;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > > > > > > spinlocks starting from ip_protocol_deliver_rcu(). This kind of
> > > > > > > > > > > > usage does not look quite right. But I'd like to know your opinion on this first.
> > > > > > > > > > >
> > > > > > > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > > > > > > per-object lock while still maintaining good performance and
> > > > > > > > > > > scalability.
> > > > > > > > > >
> > > > > > > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > > > > > > the reader side to block must be avoided.
> > > > > > > > >
> > > > > > > > > True.  But this is the Linux kernel, where "block" means something
> > > > > > > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > > > > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > > > > > > acquire spinlocks within RCU read-side critical sections.
> > > > > > > > >
> > > > > > > > > And before you complain that practitioners are not following the academic
> > > > > > > > > definitions, please keep in mind that our definitions were here first.  ;-)
> > > > > > > > >
> > > > > > > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > > > > > > >
> > > > > > > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > > > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > > > > > > special, just use the default.
> > > > > > > > >
> > > > > > > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > > > > > > bisecting, as this often quickly and easily finds the problem.
> > > > > > > >
> > > > > > > > Actually it does. But let's focus on the latest upstream and the latest rhel8.
> > > > > > > > This way, we will not worry about missing the needed rcu patches.
> > > > > > > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > > > > > > networking related, which I am still working on.  So, there might be
> > > > > > > > multiple root causes.
> > > > > > > >
> > > > > > > > > Bisection can also help you find the patch to be backported if a later
> > > > > > > > > release fixes the bug, though things like gitk can also be helpful.
> > > > > > > >
> > > > > > > > Unfortunately, this is reproducible on the latest bits.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Donghai
> > > > > > > > >
> > > > > > > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: RCU: rcu stall issues and an approach to the fix
  2021-10-20 21:33                                         ` Paul E. McKenney
  2021-10-21  3:25                                           ` Zhouyi Zhou
@ 2021-10-21 16:44                                           ` donghai qiao
  1 sibling, 0 replies; 26+ messages in thread
From: donghai qiao @ 2021-10-21 16:44 UTC (permalink / raw)
  To: paulmck; +Cc: Boqun Feng, rcu, netdev

On Wed, Oct 20, 2021 at 5:33 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, Oct 20, 2021 at 04:05:59PM -0400, donghai qiao wrote:
> > On Wed, Oct 20, 2021 at 2:37 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 01:48:15PM -0400, donghai qiao wrote:
> > > > On Mon, Oct 18, 2021 at 7:46 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > >
> > > > > On Mon, Oct 18, 2021 at 05:18:40PM -0400, donghai qiao wrote:
> > > > > > I just want to follow up this discussion. First off, the latest issue
> > > > > > I mentioned in the email of Oct 4th which
> > > > > > exhibited a symptom of networking appeared to be a problem in
> > > > > > qrwlock.c. Particularly the problem is
> > > > > > caused by the 'if' statement in the function queued_read_lock_slowpath() below :
> > > > > >
> > > > > > void queued_read_lock_slowpath(struct qrwlock *lock)
> > > > > > {
> > > > > >         /*
> > > > > >          * Readers come here when they cannot get the lock without waiting
> > > > > >          */
> > > > > >         if (unlikely(in_interrupt())) {
> > > > > >                 /*
> > > > > >                  * Readers in interrupt context will get the lock immediately
> > > > > >                  * if the writer is just waiting (not holding the lock yet),
> > > > > >                  * so spin with ACQUIRE semantics until the lock is available
> > > > > >                  * without waiting in the queue.
> > > > > >                  */
> > > > > >                 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
> > > > > >                 return;
> > > > > >         }
> > > > > >         ...
> > > > > > }
> > > > > >
> > > > > > That 'if' statement says that if we are in interrupt context and we are
> > > > > > a reader, we are allowed to take the lock as a reader no matter whether
> > > > > > writers are waiting for it or not. So when network packets keep coming in
> > > > > > steadily and the intervals between them are small enough, the writers will
> > > > > > never get a chance to acquire the lock. This should be the root cause for
> > > > > > that case.
> > > > > >
> > > > > > I have verified it by removing the 'if' and rerunning the test multiple
> > > > > > times.
> > > > >
> > > > > That could do it!
> > > > >
> > > > > Would it make sense to keep the current check, but to also check if a
> > > > > writer had been waiting for more than (say) 100ms?  The reason that I
> > > > > ask is that I believe that this "if" statement is there for a reason.
> > > >
> > > > The day before, I also brought this to Waiman Long, who originally made
> > > > these changes in the qrwlock.c file. It turns out the 'if' block was
> > > > introduced to satisfy the particular requirement of tasklist_lock
> > > > re-entering as a reader.  He said he will perhaps come up with another
> > > > code change to take care of this new write-lock starvation issue. The
> > > > idea is to allow only the tasklist_lock clients to acquire the read
> > > > lock through that 'if' statement, and nobody else.
> > > >
> > > > This sounds like a temporary solution if we cannot think of other
> > > > alternative ways
> > > > to fix the tasklist_lock issue. The principle here is that we should
> > > > not make the
> > > > locking primitives more special just in favor of a particular usage or scenario.
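
For illustration only, a minimal sketch of the "bound the unfairness in
time" idea mentioned above might look like the code below. This is neither
Waiman Long's patch nor a tested proposal; in particular the waiter_since
field is made up, since upstream struct qrwlock keeps no record of how long
a writer has been waiting:

void queued_read_lock_slowpath(struct qrwlock *lock)
{
        if (unlikely(in_interrupt())) {
                /*
                 * Take the unfair in-interrupt reader path only if no
                 * writer is waiting, or the writer has been waiting for
                 * less than ~100ms (hypothetical waiter_since field).
                 */
                if (!(atomic_read(&lock->cnts) & _QW_WAITING) ||
                    time_before(jiffies,
                                READ_ONCE(lock->waiter_since) +
                                msecs_to_jiffies(100))) {
                        atomic_cond_read_acquire(&lock->cnts,
                                                 !(VAL & _QW_LOCKED));
                        return;
                }
        }
        /* ... otherwise fall through to the existing queued slow path ... */
}
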
> > >
> > > When principles meet practice, results can vary.  Still, it would be
> > > better to have a less troublesome optimization.
> >
> > This is a philosophical debate. Let's put it aside.
>
> Exactly!
>
> > > > > >        The same
> > > > > > symptom hasn't been reproduced.  As far as rcu stall is concerned as a
> > > > > > broader range
> > > > > > of problems,  this is absolutely not the only root cause I have seen.
> > > > > > Actually too many
> > > > > > things can delay context switching.  Do you have a long term plan to
> > > > > > fix this issue,
> > > > > > or just want to treat it case by case?
> > > > >
> > > > > If you are running a CONFIG_PREEMPT=n kernel, then the plan has been to
> > > > > leverage the calls to cond_resched().  If the grace period is old enough,
> > > > > cond_resched() will supply a quiescent state.
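
As a purely illustrative sketch of the above (the item type and the
do_work_on() helper are made-up names), a long-running loop in a
CONFIG_PREEMPT=n kernel is expected to contain something like:

static void process_all_items(struct list_head *items)
{
        struct my_item *it;             /* hypothetical type */

        list_for_each_entry(it, items, node) {
                do_work_on(it);         /* hypothetical per-item work */
                /*
                 * May yield the CPU; if the current grace period has
                 * been waiting long enough, this also reports a
                 * quiescent state to RCU.
                 */
                cond_resched();
        }
}

so that RCU does not have to wait for an actual context switch on this CPU.
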
> > > >
> > > > So far, all the types of rcu stall I am aware of originated on
> > > > CONFIG_PREEMPT=n kernels. Is it possible to make rcu not rely on context
> > > > switches at all?  As we know, too many things can delay a context switch,
> > > > so it is not a very reliable mechanism when timing and performance are
> > > > crucial.
> > >
> > > Yes, you could build with CONFIG_PREEMPT=y and RCU would not always
> > > need to wait for an actual context switch.  But there can be
> > > performance issues for some workloads.
> > >
> > I can give this config (CONFIG_PREEMPT=y) a try when I have time.
>
> Very good!
>
> > > But please note that cond_resched() is not necessarily a context switch.
> > >
> > > Besides, for a great many workloads, delaying a context switch for
> > > very long is a first-class bug anyway.  For example, many internet data
> > > centers are said to have sub-second response-time requirements, and such
> > > requirements cannot be met if context switches are delayed too long.
> > >
> > Agreed.
> >
> > But on the other hand, if rcu relies on that,  the situation could be
> > even worse.
> > Simply put, when a gp cannot end soon, some rcu write-side will be delayed,
> > and the callbacks on the rcu-stalled CPU will be delayed. Thus in the case of
> > lack of free memory, this situation could form a deadlock.
>
> Would this situation exist in the first place if a blocking form of
> allocation were not being (erroneously) invoked within an RCU read-side
> critical section?  Either way, that bug needs to be fixed.

I cannot say yes or no to this question definitively. But I tend to think
that this situation could exist in mm without any interaction with rcu.
I have some mm-related core dumps to analyze, so I will send the results
to the alias so we can discuss whether they are directly related or only
somewhat related.

>
> > > > > In a CONFIG_PREEMPT=y kernel, when the grace period is old enough,
> > > > > RCU forces a schedule on the holdout CPU.  As long as the CPU is not
> > > > > eternally non-preemptible (for example, eternally in an interrupt
> > > > > handler), the grace period will end.
> > > >
> > > > Among the rcu stall instances I have seen so far, quite a lot of them occurred
> > > > on the CPUs which were running in the interrupt context or spinning on spinlocks
> > > > with interrupt disabled. In these scenarios, forced schedules will be
> > > > delayed until
> > > > these activities end.
> > >
> > > But running for several seconds in interrupt context is not at all good.
> > > As is spinning on a spinlock for several seconds.  These are performance
> > > bugs in and of themselves.
> >
> > Agreed.
> >
> > >
> > > More on this later in this email...
> > >
> > > > > But beyond a certain point, case-by-case analysis and handling is
> > > > > required.
> > > > >
> > > > > > Secondly, back to the following code I brought up that day. Actually
> > > > > > it is not as simple
> > > > > > as spinlock.
> > > > > >
> > > > > >       rcu_read_lock();
> > > > > >       ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > >       rcu_read_unlock();
> > > > > >
> > > > > > Are you all aware of all the potential functions that
> > > > > > ip_protocol_deliver_rcu will call?
> > > > > > As I can see, there is a code path from ip_protocol_deliver_rcu to
> > > > > > kmem_cache_alloc
> > > > > > which will end up a call to cond_resched().
> > > > >
> > > > > Can't say that I am familiar with everything that ip_protocol_deliver_rcu().
> > > > > There are some tens of millions of lines of code in the kernel, and I have
> > > > > but one brain.  ;-)
> > > > >
> > > > > And this cond_resched() should set things straight for a CONFIG_PREEMPT=n
> > > > > kernel.  Except that there should not be a call to cond_resched() within
> > > > > an RCU read-side critical section.
> > > >
> > > > with that 3 line snippet from the networking, a call to cond_resched() would
> > > > happen within the read-side critical section when the level of variable memory
> > > > is very low.
> > >
> > > That is a bug.  If you build your kernel with CONFIG_PROVE_LOCKING=y,
> > > it will complain about a cond_resched() in an RCU read-side critical
> > > section.  But, as you say, perhaps only with the level of variable memory
> > > is very low.
> >
> > There is a typo in my previous email. I meant available (or free).
> > Sorry for that.
>
> OK, good, that was my guess.  But invoking a potentially blocking form
> of a kernel memory allocator is still a bug.  And that bug needs to
> be fixed.  And fixing it might clear up a large fraction of your RCU
> grace-period issues.
>
> > > Please do not invoke cond_resched() within an RCU read-side critical
> > > section.  Doing so can result in random memory corruption.
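
Reduced to a sketch, the problematic pattern being discussed is something
like the following (my_cache is a made-up slab cache, not a reference to the
actual networking code):

        void *p;

        rcu_read_lock();
        p = kmem_cache_alloc(my_cache, GFP_KERNEL);  /* may sleep / cond_resched(): bug */
        ...
        rcu_read_unlock();

With CONFIG_PROVE_LOCKING=y this should produce a lockdep complaint, though
possibly only when the allocator actually has to sleep inside the read-side
critical section, which, as noted above, may only happen when free memory is
very low.
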
> > >
> > > > > Does the code momentarily exit that
> > > > > critical section via something like rcu_read_unlock(); cond_resched();
> > > > > rcu_read_lock()?
> > > >
> > > > As far as I can see, cond_resched would be called between a pair of
> > > > rcu_read_lock and rcu_read_unlock.
> > >
> > > Again, this is a bug.  The usual fix is the GFP_ thing I noted below.
> > >
> > > > > Or does something prevent the code from getting there
> > > > > while in an RCU read-side critical section?  (The usual trick here is
> > > > > to have different GFP_ flags depending on the context.)
> > > >
> > > > Once we invoke kmem_cache_alloc or its variants, we cannot really
> > > > predict where we will go and how long this whole process is going to
> > > > take in this very large area from kmem to the virtual memory subsystem.
> > > > There is a flag __GFP_NOFAIL that determines whether or not cond_resched
> > > > should be called before retry, but this flag should be used from page level,
> > > > not from the kmem consumer level.  So I think there is little we can do
> > > > to avoid the resched.
> > >
> > > If you are invoking the allocator within an RCU read-side critical
> > > section, you should be using GFP_ATOMIC.  Except that doing this has
> > > many negative consequences, so it is better to allocate outside of
> > > the RCU read-side critical section.
> > >
> > > The same rules apply when allocating while holding a spinlock, so
> > > this is not just RCU placing restrictions on you.  ;-)
> >
> > yep, absolutely.
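
For concreteness, a sketch of the two options (again, my_cache and
use_under_rcu() are made-up names):

        void *p;

        /* Option 1: atomic allocation inside the critical section. */
        rcu_read_lock();
        p = kmem_cache_alloc(my_cache, GFP_ATOMIC);  /* never sleeps, may fail */
        if (p)
                use_under_rcu(p);
        rcu_read_unlock();

        /* Option 2 (usually better): allocate before entering it. */
        p = kmem_cache_alloc(my_cache, GFP_KERNEL);  /* may sleep out here */
        rcu_read_lock();
        if (p)
                use_under_rcu(p);
        rcu_read_unlock();

Option 2 also avoids leaning on the GFP_ATOMIC memory reserves, which is
presumably part of the "many negative consequences" mentioned above.
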
> >
> > >
> > > > > >                                              Because the operations in memory
> > > > > > allocation are too complicated, we cannot alway expect a prompt return
> > > > > > with success.
> > > > > > When the system is running out of memory, then rcu cannot close the
> > > > > > current gp, then
> > > > > > great number of callbacks will be delayed and the freeing of the
> > > > > > memory they held
> > > > > > will be delayed as well. This sounds like a deadlock in the resource flow.
> > > > >
> > > > > That would of course be bad.  Though I am not familiar with all of the
> > > > > details of how the networking guys handle out-of-memory conditions.
> > > > >
> > > > > The usual advice would be to fail the request, but that does not appear
> > > > > to be an easy option for ip_protocol_deliver_rcu().  At this point, I
> > > > > must defer to the networking folks.
> > > >
> > > > Thanks for the advice.
> > >
> > > Another question...  Why the endless interrupts?  Or is it just one
> > > very long interrupt?  Last I knew (admittedly a very long time ago),
> > > the high-rate networking drivers used things like NAPI in order to avoid
> > > this very problem.
> >
> > These are interrupts that run long enough. The networking symptom described
> > in the previous email is one of them.  In that case, because the rwlock
> > favors readers in interrupt context, the writer side would be blocked for as
> > long as the readers keep coming.
>
> OK, and hopefully Longman finds a way to get his optimization in some
> less destructive way.
>
> > > Or is this some sort of special case where you are trying to do something
> > > special, for example, to achieve extremely low communications latencies?
> >
> > No, nothing special I am trying to do.
>
> OK, good.
>
> > > If this is a deliberate design, and if it is endless interrupts instead
> > > of one big long one, and if you are deliberately interrupt-storming
> > > a particular CPU, another approach is to build the kernel with
> > > CONFIG_NO_HZ_FULL=y, and boot with nohz_full=n, where "n" is the number of
> > > the CPU that is to be interrupt-stormed.  If you are interrupt storming
> > > multiple CPUs, you can specify them, for example, nohz_full=1-5,13 to
> > > specify CPUs 1, 2, 3, 4, 5, and 13.  In recent kernels, "N" stands for
> > > the CPU with the largest CPU number.
> >
> > I did this before, and I saw rcu stall as well with this kinda config.
>
> When you said you had long enough interrupts, how long were they?
>
> If they were long enough, then yes, you would get a stall.

The interrupts could take up to 60+ seconds. Admittedly, I lowered the deadline
for triggering the rcu stall warning in order to expose the underlying problems
more quickly.
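
(For anyone following along: the stall-warning deadline can be lowered with
the rcupdate.rcu_cpu_stall_timeout boot parameter, or at run time through
/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout; presumably something
along those lines was used here.)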

>
> Suppose that some kernel code still executes on that CPU despite trying to
> move things off of it.  As soon as the interrupt hits kernel execution
> instead of nohz_full userspace execution, there will be no more RCU
> quiescent states, and thus you can see stalls.

Yes, this could happen.
>
> > > Then read Documentation/admin-guide/kernel-per-CPU-kthreads.rst, which is
> > > probably a bit outdated, but a good place to start.  Follow its guidelines
> > > (and, as needed, come up with additional ones) to ensure that CPU "n"
> > > is not doing anything.  If you do come up with additional guidelines,
> > > please submit a patch to kernel-per-CPU-kthreads.rst so that others can
> > > also benefit, as you are benefiting from those before you.
> >
> > Thanks for the suggestion.
> > >
> > > Create a CPU-bound usermode application (a "while (1) continue;" loop or
> > > similar), and run that application on CPU "n".  Then start up whatever
> > > it is that interrupt-storms CPU "n".
> > >
> > > Every time CPU "n" returns from interrupt, RCU will see a quiescent state,
> > > which will prevent the interrupt storm from delaying RCU grace periods.
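
A minimal version of that CPU-bound usermode application could be as simple
as the program below, pinned to CPU "n" (for example with taskset -c n) on a
kernel booted with nohz_full=n:

        int main(void)
        {
                while (1)
                        continue;  /* spin in userspace; each return from
                                      interrupt is then an RCU quiescent
                                      state on this CPU */
        }
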
> > >
> > > On the other hand, if this is one big long interrupt, you need to make
> > > that interrupt end every so often.  Or move some of the work out of
> > > interrupt context, perhaps even to usermode.
> > >
> > > Much depends on exactly what you are trying to achieve.
> >
> > Too many things can lead to an rcu stall. So let's deal with them case by
> > case until there is a permanent solution.
>
> Getting the bugs fixed should be a good start.

Thanks
Donghai

>
>                                                         Thanx, Paul
>
> > Thanks
> > Donghai
> >
> >
> >
> >
> > >
> > >                                                         Thanx, Paul
> > >
> > > > Donghai
> > > > >
> > > > >                                                         Thanx, Paul
> > > > >
> > > > > > Thanks
> > > > > > Donghai
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 5, 2021 at 8:25 PM donghai qiao <donghai.w.qiao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 5, 2021 at 12:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 05, 2021 at 12:10:25PM -0400, donghai qiao wrote:
> > > > > > > > > On Mon, Oct 4, 2021 at 8:59 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Oct 04, 2021 at 05:22:52PM -0400, donghai qiao wrote:
> > > > > > > > > > > Hello Paul,
> > > > > > > > > > > Sorry it has been long..
> > > > > > > > > >
> > > > > > > > > > On this problem, your schedule is my schedule.  At least as long as your
> > > > > > > > > > are not expecting instantaneous response.  ;-)
> > > > > > > > > >
> > > > > > > > > > > > > Because I am dealing with this issue in multiple kernel versions, sometimes
> > > > > > > > > > > > > the configurations in these kernels may different. Initially the
> > > > > > > > > > > > > problem I described
> > > > > > > > > > > > > originated to rhel-8 on which the problem occurs more often and is a bit easier
> > > > > > > > > > > > > to reproduce than others.
> > > > > > > > > > > >
> > > > > > > > > > > > Understood, that does make things more difficult.
> > > > > > > > > > > >
> > > > > > > > > > > > > Regarding these dynticks* parameters, I collected the data for CPU 0 as below :
> > > > > > > > > > > > >    - dynticks = 0x6eab02    which indicated the CPU was not in eqs.
> > > > > > > > > > > > >    - dynticks_nesting = 1    which is in its initial state, so it said
> > > > > > > > > > > > > it was not in eqs either.
> > > > > > > > > > > > >    - dynticks_nmi_nesting = 4000000000000004    which meant that this
> > > > > > > > > > > > > CPU had been
> > > > > > > > > > > > >      interrupted when it was in the middle of the first interrupt.
> > > > > > > > > > > > > And this is true: the first
> > > > > > > > > > > > >      interrupt was the sched_timer interrupt, and the second was a NMI
> > > > > > > > > > > > > when another
> > > > > > > > > > > > >     CPU detected the RCU stall on CPU 0.  So it looks all identical.
> > > > > > > > > > > > > If the kernel missed
> > > > > > > > > > > > >     a rcu_user_enter or rcu_user_exit, would these items remain
> > > > > > > > > > > > > identical ?  But I'll
> > > > > > > > > > > > >     investigate that possibility seriously as you pointed out.
> > > > > > > > > > > >
> > > > > > > > > > > > So is the initial state non-eqs because it was interrupted from kernel
> > > > > > > > > > > > mode?  Or because a missing rcu_user_enter() left ->dynticks_nesting
> > > > > > > > > > > > incorrectly equal to the value of 1?  Or something else?
> > > > > > > > > > >
> > > > > > > > > > > As far as the original problem is concerned, the user thread was interrupted by
> > > > > > > > > > > the timer, so the CPU was not working in the nohz mode. But I saw the similar
> > > > > > > > > > > problems on CPUs working in nohz mode with different configurations.
> > > > > > > > > >
> > > > > > > > > > OK.
> > > > > > > > > >
> > > > > > > > > > > > > > There were some issues of this sort around the v5.8 timeframe.  Might
> > > > > > > > > > > > > > there be another patch that needs to be backported?  Or a patch that
> > > > > > > > > > > > > > was backported, but should not have been?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good to know that clue. I'll take a look into the log history.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Is it possible to bisect this?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Or, again, to run with CONFIG_RCU_EQS_DEBUG=y?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am building the latest 5.14 kernel with this config and give it a try when the
> > > > > > > > > > > > > machine is set up, see how much it can help.
> > > > > > > > > > > >
> > > > > > > > > > > > Very good, as that will help determine whether or not the problem is
> > > > > > > > > > > > due to backporting issues.
> > > > > > > > > > >
> > > > > > > > > > > I enabled CONFIG_RCU_EQS_DEBUG=y as you suggested and
> > > > > > > > > > > tried it for both the latest rhel8 and a later upstream version 5.15.0-r1,
> > > > > > > > > > > turns out no new warning messages related to this came out. So,
> > > > > > > > > > > rcu_user_enter/rcu_user_exit() should be paired right.
> > > > > > > > > >
> > > > > > > > > > OK, good.
> > > > > > > > > >
> > > > > > > > > > > > > > Either way, what should happen is that dyntick_save_progress_counter() or
> > > > > > > > > > > > > > rcu_implicit_dynticks_qs() should see the rdp->dynticks field indicating
> > > > > > > > > > > > > > nohz_full user execution, and then the quiescent state will be supplied
> > > > > > > > > > > > > > on behalf of that CPU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agreed. But the counter rdp->dynticks of the CPU can only be updated
> > > > > > > > > > > > > by rcu_dynticks_eqs_enter() or rcu_dynticks_exit() when rcu_eqs_enter()
> > > > > > > > > > > > > or rcu_eqs_exit() is called, which in turn depends on the context switch.
> > > > > > > > > > > > > So, when the context switch never happens, the counter rdp->dynticks
> > > > > > > > > > > > > never advances. That's the thing I try to fix here.
> > > > > > > > > > > >
> > > > > > > > > > > > First, understand the problem.  Otherwise, your fix is not so likely
> > > > > > > > > > > > to actually fix anything.  ;-)
> > > > > > > > > > > >
> > > > > > > > > > > > If kernel mode was interrupted, there is probably a missing cond_resched().
> > > > > > > > > > > > But in sufficiently old kernels, cond_resched() doesn't do anything for
> > > > > > > > > > > > RCU unless a context switch actually happened.  In some of those kernels,
> > > > > > > > > > > > you can use cond_resched_rcu_qs() instead to get RCU's attention.  In
> > > > > > > > > > > > really old kernels, life is hard and you will need to do some backporting.
> > > > > > > > > > > > Or move to newer kernels.
> > > > > > > > > > > >
> > > > > > > > > > > > In short, if an in-kernel code path runs for long enough without hitting
> > > > > > > > > > > > a cond_resched() or similar, that is a bug.  The RCU CPU stall warning
> > > > > > > > > > > > that you will get is your diagnostic.
> > > > > > > > > > >
> > > > > > > > > > > Probably this is the case. With the test for 5.15.0-r1, I have seen different
> > > > > > > > > > > scenarios, among them the most frequent ones were caused by the networking
> > > > > > > > > > > in which a bunch of networking threads were spinning on the same rwlock.
> > > > > > > > > > >
> > > > > > > > > > > For instance in one of them, the ticks_this_gp of a rcu_data could go as
> > > > > > > > > > > large as 12166 (ticks) which is 12+ seconds. The thread on this cpu was
> > > > > > > > > > > doing networking work and finally it was spinning as a writer on a rwlock
> > > > > > > > > > > which had been locked by 16 readers.  By the way, there were 70 this
> > > > > > > > > > > kinds of writers were blocked on the same rwlock.
> > > > > > > > > >
> > > > > > > > > > OK, a lock-contention problem.  The networking folks have fixed a
> > > > > > > > > > very large number of these over the years, though, so I wonder what is
> > > > > > > > > > special about this one so that it is just now showing up.  I have added
> > > > > > > > > > a networking list on CC for their thoughts.
> > > > > > > > >
> > > > > > > > > Thanks for pulling the networking in. If they need the coredump, I can
> > > > > > > > > forward it to them.  It's definitely worth analyzing it as this contention
> > > > > > > > > might be a performance issue.  Or we can discuss this further in this
> > > > > > > > > email thread if they are fine, or we can discuss it over with a separate
> > > > > > > > > email thread with netdev@ only.
> > > > > > > > >
> > > > > > > > > So back to my original problem, this might be one of the possibilities that
> > > > > > > > > led to RCU stall panic.  Just imagining this type of contention might have
> > > > > > > > > occurred and lasted long enough. When it finally came to the end, the
> > > > > > > > > timer interrupt occurred, therefore rcu_sched_clock_irq detected the RCU
> > > > > > > > > stall on the CPU and panic.
> > > > > > > > >
> > > > > > > > > So definitely we need to understand these networking activities here as
> > > > > > > > > to why the readers could hold the rwlock too long.
> > > > > > > >
> > > > > > > > I strongly suggest that you also continue to do your own analysis on this.
> > > > > > > > So please see below.
> > > > > > >
> > > > > > > This is just a brief of my analysis and the stack info below is not enough
> > > > > > > for other people to figure out anything useful. I meant if they are really
> > > > > > > interested, I can upload the core file. I think this is fair.
> > > > > > >
> > > > > > > >
> > > > > > > > > > > When examining the readers of the lock, except the following code,
> > > > > > > > > > > don't see any other obvious problems: e.g
> > > > > > > > > > >  #5 [ffffad3987254df8] __sock_queue_rcv_skb at ffffffffa49cd2ee
> > > > > > > > > > >  #6 [ffffad3987254e18] raw_rcv at ffffffffa4ac75c8
> > > > > > > > > > >  #7 [ffffad3987254e38] raw_local_deliver at ffffffffa4ac7819
> > > > > > > > > > >  #8 [ffffad3987254e88] ip_protocol_deliver_rcu at ffffffffa4a8dea4
> > > > > > > > > > >  #9 [ffffad3987254ea8] ip_local_deliver_finish at ffffffffa4a8e074
> > > > > > > > > > > #10 [ffffad3987254eb0] __netif_receive_skb_one_core at ffffffffa49f3057
> > > > > > > > > > > #11 [ffffad3987254ed0] process_backlog at ffffffffa49f3278
> > > > > > > > > > > #12 [ffffad3987254f08] __napi_poll at ffffffffa49f2aba
> > > > > > > > > > > #13 [ffffad3987254f30] net_rx_action at ffffffffa49f2f33
> > > > > > > > > > > #14 [ffffad3987254fa0] __softirqentry_text_start at ffffffffa50000d0
> > > > > > > > > > > #15 [ffffad3987254ff0] do_softirq at ffffffffa40e12f6
> > > > > > > > > > >
> > > > > > > > > > > In the function ip_local_deliver_finish() of this stack, a lot of the work needs
> > > > > > > > > > > to be done with ip_protocol_deliver_rcu(). But this function is invoked from
> > > > > > > > > > > a rcu reader side section.
> > > > > > > > > > >
> > > > > > > > > > > static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > > > > > > > > > > struct sk_buff *skb)
> > > > > > > > > > > {
> > > > > > > > > > >         __skb_pull(skb, skb_network_header_len(skb));
> > > > > > > > > > >
> > > > > > > > > > >         rcu_read_lock();
> > > > > > > > > > >         ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
> > > > > > > > > > >         rcu_read_unlock();
> > > > > > > > > > >
> > > > > > > > > > >         return 0;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > Actually there are multiple chances that this code path can hit
> > > > > > > > > > > spinning locks starting from ip_protocol_deliver_rcu(). This kind
> > > > > > > > > > > usage looks not quite right. But I'd like to know your opinion on this first ?
> > > > > > > > > >
> > > > > > > > > > It is perfectly legal to acquire spinlocks in RCU read-side critical
> > > > > > > > > > sections.  In fact, this is one of the few ways to safely acquire a
> > > > > > > > > > per-object lock while still maintaining good performance and
> > > > > > > > > > scalability.
> > > > > > > > >
> > > > > > > > > Sure, understand. But the RCU related docs said that anything causing
> > > > > > > > > the reader side to block must be avoided.
> > > > > > > >
> > > > > > > > True.  But this is the Linux kernel, where "block" means something
> > > > > > > > like "invoke schedule()" or "sleep" instead of the academic-style
> > > > > > > > non-blocking-synchronization definition.  So it is perfectly legal to
> > > > > > > > acquire spinlocks within RCU read-side critical sections.
> > > > > > > >
> > > > > > > > And before you complain that practitioners are not following the academic
> > > > > > > > definitions, please keep in mind that our definitions were here first.  ;-)
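
As a concrete example of the legal pattern described here, with a made-up
structure, table, and fields (struct foo, foo_table, ->lock, ->hits):

        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(foo_table[idx]);    /* RCU-protected lookup */
        if (p) {
                spin_lock(&p->lock);            /* spins, never sleeps: fine under RCU */
                p->hits++;
                spin_unlock(&p->lock);
        }
        rcu_read_unlock();

What must be avoided inside the read-side critical section is sleeping, for
example by taking a mutex or performing a GFP_KERNEL allocation.
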
> > > > > > > >
> > > > > > > > > > My guess is that the thing to track down is the cause of the high contention
> > > > > > > > > > on that reader-writer spinlock.  Missed patches, misconfiguration, etc.
> > > > > > > > >
> > > > > > > > > Actually, the test was against a recent upstream 5.15.0-r1  But I can try
> > > > > > > > > the latest r4.  Regarding the network configure, I believe I didn't do anything
> > > > > > > > > special, just use the default.
> > > > > > > >
> > > > > > > > Does this occur on older mainline kernels?  If not, I strongly suggest
> > > > > > > > bisecting, as this often quickly and easily finds the problem.
> > > > > > >
> > > > > > > Actually It does. But let's focus on the latest upstream and the latest rhel8.
> > > > > > > This way, we will not worry about missing the needed rcu patches.
> > > > > > > However, in rhel8, the kernel stack running on the rcu-stalled CPU is not
> > > > > > > networking related, which I am still working on.  So, there might be
> > > > > > > multiple root causes.
> > > > > > >
> > > > > > > > Bisection can also help you find the patch to be backported if a later
> > > > > > > > release fixes the bug, though things like gitk can also be helpful.
> > > > > > >
> > > > > > > Unfortunately, this is reproducible on the latest bit.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Donghai
> > > > > > > >
> > > > > > > >                                                         Thanx, Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-10-21 16:44 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 20:08 RCU: rcu stall issues and an approach to the fix donghai qiao
2021-07-22 20:41 ` Paul E. McKenney
2021-07-23  0:29 ` Boqun Feng
2021-07-23  3:49   ` Paul E. McKenney
     [not found]     ` <CAOzhvcOLFzFGZAptOTrP9Xne1-LiO8jka1sPF6+0=WiLh-cQUA@mail.gmail.com>
2021-07-23 17:25       ` Paul E. McKenney
2021-07-23 18:41         ` donghai qiao
2021-07-23 19:06           ` Paul E. McKenney
2021-07-24  0:01             ` donghai qiao
2021-07-25  0:48               ` Paul E. McKenney
2021-07-27 22:34                 ` donghai qiao
2021-07-28  0:10                   ` Paul E. McKenney
2021-10-04 21:22                     ` donghai qiao
2021-10-05  0:59                       ` Paul E. McKenney
2021-10-05 16:10                         ` donghai qiao
2021-10-05 16:39                           ` Paul E. McKenney
2021-10-06  0:25                             ` donghai qiao
2021-10-18 21:18                               ` donghai qiao
2021-10-18 23:46                                 ` Paul E. McKenney
2021-10-20 17:48                                   ` donghai qiao
2021-10-20 18:37                                     ` Paul E. McKenney
2021-10-20 20:05                                       ` donghai qiao
2021-10-20 21:33                                         ` Paul E. McKenney
2021-10-21  3:25                                           ` Zhouyi Zhou
2021-10-21  4:17                                             ` Paul E. McKenney
2021-10-21 16:44                                           ` donghai qiao
2021-07-23 17:25     ` donghai qiao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).