From: donghai qiao <donghai.w.qiao@gmail.com>
To: paulmck@kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>, rcu@vger.kernel.org
Subject: Re: RCU: rcu stall issues and an approach to the fix
Date: Fri, 23 Jul 2021 13:25:32 -0400	[thread overview]
Message-ID: <CAOzhvcM6bBRR7A5pb5iU5H7uBPbZ4rQz67iPn7NwFNjheFKG7w@mail.gmail.com> (raw)
In-Reply-To: <20210723034928.GE4397@paulmck-ThinkPad-P17-Gen-1>

On Thu, Jul 22, 2021 at 11:49 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Jul 23, 2021 at 08:29:53AM +0800, Boqun Feng wrote:
> > On Thu, Jul 22, 2021 at 04:08:06PM -0400, donghai qiao wrote:
> > > RCU experts,
> > >
> > > When you reply, please also keep me CC'ed.
> > >
> > > RCU stalls might be an old problem, and they can happen quite often.
> > > As I have observed, when the problem occurs, at least one CPU in the
> > > system has an rdp->gp_seq that falls behind the others' by 4.
> > >
> > > e.g.  On CPU 0, rdp->gp_seq = 0x13889d, but on other CPUs, their
> > > rdp->gp_seq = 0x1388a1.
> > >
> > > Because RCU stall issues can last a long time, the callbacks on each
> > > CPU's rdp->cblist can accumulate into the thousands. In the worst
> > > case, this triggers a panic.
> > >
> > > Looking into the problem further, I think it is related to the
> > > Linux scheduler. When the RCU core detects a stall on a CPU, rcu_gp_kthread
> > > sends a rescheduling request via send_IPI to that CPU, trying to force a
> > > context switch and make some progress. However, at least one situation can
> > > defeat this effort: when the CPU is running a user thread that is the only
> > > runnable thread in the rq, the attempted context switch will not happen
> > > immediately. In particular if the system is also configured with NOHZ_FULL for
> >
> > Correct me if I'm wrong, if a CPU is solely running a user thread, how
> > can that CPU stall RCU? Because you need to be in a RCU read-side
> > critical section to stall RCU. Or the problem you're talking here is
> > about *recovering* from RCU stall?

In response to Boqun's question, the crashdumps I analyzed were
configured with this:

CONFIG_PREEMPT_RCU=n
CONFIG_PREEMPT_COUNT=n
CONFIG_PROVE_RCU=n

With these options disabled, the compiler generates no binary code for
rcu_read_lock() and rcu_read_unlock(), the functions that delimit RCU
read-side critical sections, and the crashdumps confirmed that both
functions have no binary code in the kernel. At first I thought this
kernel might have been built the wrong way, but I later found other
sources saying this is expected. That is why the RCU core is not
informed when CPUs enter or leave an RCU read-side critical section.

When the current grace period is closed, rcu_gp_kthread opens a new
one for all CPUs, which is reflected in every CPU's rdp->gp_seq. Each
CPU is responsible for updating its own gp_seq when it makes progress,
so a CPU that is running a user thread while a new grace period is
open cannot report its quiescent state unless a context switch occurs
or a scheduler tick arrives. If that CPU is also configured as
NOHZ_FULL, neither may happen, and an RCU stall results.

When RCU detects that a quiescent state is overdue on a CPU, it tries
to force a context switch on that CPU through a resched IPI. But
depending on the scheduler, this cannot always succeed. A while ago,
this code processed the resched IPI:

void scheduler_ipi(void)
{
        ...
        /* Nothing on the wake list and no NOHZ idle kick pending:
         * bail out before doing any rescheduling work at all. */
        if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
                return;
        ...
        irq_enter();
        sched_ttwu_pending();
        ...
        if (unlikely(got_nohz_idle_kick())) {
                this_rq()->idle_balance = 1;
                raise_softirq_irqoff(SCHED_SOFTIRQ);
        }
        irq_exit();
}

As you can see, the function returns at the first "if" statement
before it can ever raise SCHED_SOFTIRQ. This code has since been
changed, but similar checks/optimizations remain in many places in the
scheduler. The things I am trying to fix are the cases that
resched_cpu fails to handle.

Hope this explains it.

>
> Excellent point, Boqun!
>
> Donghai, have you tried reproducing this using a kernel built with
> CONFIG_RCU_EQS_DEBUG=y?

I haven't tried it. I can do it and update you with the results when I am done.

Thanks!
Donghai

>
>                                                         Thanx, Paul
>
> > Regards,
> > Boqun
