* SRCU: kworker hung in synchronize_srcu
@ 2023-09-28  7:59 zhuangel570
  2023-09-28 21:39 ` Joel Fernandes
  2023-09-30 10:11 ` Neeraj upadhyay
  0 siblings, 2 replies; 28+ messages in thread
From: zhuangel570 @ 2023-09-28  7:59 UTC (permalink / raw)
  To: paulmck, rcu, josh, rostedt, mathieu.desnoyers, jiangshanlai
  Cc: like.xu.linux, linussli, foxywang

Hi,

We encountered an SRCU hang in stable tree 5.4.203 while running a
concurrent VM create/destroy test; the problem showed up after several
weeks of testing. We do not yet have a way to reproduce it, it happens
randomly, and this is the second time we have hit it this year.

We did some investigation with the crash tool, hoping to collect useful
clues and find the root cause of the issue (and from there a fix or
workaround). Our findings so far are below; feedback from SRCU experts
would be much appreciated.

[3144162.608392] INFO: task kworker/136:12:252684 blocked for more
than 122 seconds.
[3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
[3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
[3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
[3144162.631192] Call Trace:
[3144162.631202]  __schedule+0x2ee/0x660
[3144162.631206]  schedule+0x33/0xa0
[3144162.631209]  schedule_timeout+0x1c4/0x340
[3144162.631214]  ? update_load_avg+0x82/0x660
[3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
[3144162.631218]  wait_for_completion+0x119/0x180
[3144162.631220]  ? wake_up_q+0x80/0x80
[3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
[3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
[3144162.631227]  synchronize_srcu+0x5f/0xc0
[3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
[3144162.631239]  ? __schedule+0x2f6/0x660
[3144162.631243]  process_one_work+0x19a/0x3a0
[3144162.631244]  worker_thread+0x37/0x3a0
[3144162.631247]  kthread+0x117/0x140
[3144162.631247]  ? process_one_work+0x3a0/0x3a0
[3144162.631248]  ? __kthread_cancel_work+0x40/0x40
[3144162.631250]  ret_from_fork+0x1f/0x30

Test workload:
- Repeated concurrency testing, including VM creation, service probing,
  and VM destruction.
- The VMs use virtio devices, so kvm->irq_srcu is used frequently.
- x86_64 machine with 191 cores.

Crash analysis clues:
- The hung task runs on CPU 136, waiting for synchronize_srcu to complete.
- The current GP of the SRCU struct is 304 (srcu_gp_seq, srcu_gp_seq_needed).
- The SNP (node) tree of the SRCU struct has 2 levels.
- The SNPs (nodes) of the SRCU struct show all nodes at GP 304 (srcu_have_cbs).
- The SNPs (nodes) of the SRCU struct show CPU 136 belongs to group [128,143].
- The SDPs (per-cpu) of the SRCU struct show matching lock and unlock counts.
- The SDP (per-cpu) segcblist of CPU 136 has one entry in the pending list.
- That pending entry is exactly the callback queued by the hung kworker task.
- The SDP (per-cpu) of CPU 136 is waiting for GP 288, while the gp_seq of its
  segcblist is 296.
- The SDP (per-cpu) segcblist of CPU 136 has no entry in the DONE segment.
- The workqueues of the SRCU struct and of the CPU 136 SDP are idle.
- More details can be found via the references at the end of this mail.

crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
0xffffc900f100e878
        srcu_idx = 76
        srcu_gp_seq = 304
        srcu_gp_seq_needed = 304
        srcu_gp_seq_needed_exp = 304
        sda = 0x5fff9ac17800

crash> struct srcu_data 0x5fff9ac17800:136
[136]: ffffe8bee9817800
...
  srcu_cblist = {
    head = 0xffffc900e8413dd0,
    tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
0xffffc900e8413dd0, 0xffffc900e8413dd0},
    gp_seq = {0, 296, 296, 0},
    len = 1,
    len_lazy = 0,
    enabled = 1 '\001',
    offloaded = 0 '\000'
  },
  srcu_gp_seq_needed = 288,
  srcu_gp_seq_needed_exp = 0,
  srcu_cblist_invoking = false,
...
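
For reference, this is how we have been decoding these numbers, based on
our reading of kernel/rcu/rcu.h and include/linux/rcu_segcblist.h in 5.4
(quoted from memory, please check the actual source):

/* Low-order bits of a gp_seq hold the phase, the rest count GPs. */
#define RCU_SEQ_CTR_SHIFT  2
#define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)
/* So srcu_gp_seq = 304 means 304 >> 2 = 76 GPs completed, state idle. */

/* Indexes into the srcu_cblist tails[]/gp_seq[] arrays shown above: */
#define RCU_DONE_TAIL       0  /* callbacks whose GP has already ended */
#define RCU_WAIT_TAIL       1  /* callbacks waiting on the current GP  */
#define RCU_NEXT_READY_TAIL 2  /* callbacks waiting on the next GP     */
#define RCU_NEXT_TAIL       3  /* callbacks not yet assigned to any GP */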

Root cause (we cannot reproduce the issue, so this is based on crash
analysis and code analysis):
- The leaked entry on the segcblist has an incorrect gp_seq (the SDP is
  waiting for GP 288).
- The leaked entry on the segcblist appears to have been "accelerated".
- The incorrect acceleration may be caused by srcu_invoke_callbacks,
  because srcu_invoke_callbacks may use a stale GP seq when accelerating.
- Here are two reconstructed scenarios that could cause this
  "accelerate issue":


CASE1: an entry on CPU 136 that belongs to GP 288 is accelerated to GP 292
- [CPU 136] [GP 280-284] finished, but srcu_invoke_callbacks has not run yet.
- [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
  to the SDP.
- [CPU 041] [GP 284-288] starting, new synchronize_srcu request, its workload
  runs faster than CPU 136's, starts the GP, sets rcu_seq_start.
- [CPU 136] [GP 284-288] starting, calls srcu_funnel_gp_start, finds no need
  to start a GP.
- [CPU 136] [GP 280-284] finished, srcu_invoke_callbacks finally runs and
  "accelerates" the newly added entry to 292 (it should be 288).


CASE2: an entry on CPU 136 that belongs to GP 288 is accelerated to GP 296
- [CPU 136] [GP 280-284] finished, but srcu_invoke_callbacks has not run yet.
- [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
  to the SDP.
- [CPU 041] [GP 284-288] starting, new synchronize_srcu request, its workload
  runs faster than CPU 136's, starts the GP.
- [CPU 041] [GP 284-288] finished, its workload runs faster, finishes the GP.
- [CPU 136] [GP 284-288] starting, calls srcu_funnel_gp_start, finds no need
  to start a GP.
- [CPU 017] [GP 288-292] starting, new synchronize_srcu request, starts the
  GP, sets rcu_seq_start.
- [CPU 136] [GP 280-284] finished, srcu_invoke_callbacks finally runs and
  "accelerates" the newly added entry to 296 (it should be 288).


/*
 * CASE1
 * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
 * - GP0: 280-284, GP1: 284-288.
 */

/* [GP0][CPU-136] */
process_srcu {
 srcu_gp_end
}
                /* [GP1][CPU-136] */
                synchronize_srcu {
                 __call_srcu {
                  rcu_segcblist_enqueue
                                         /* [GP1][CPU-041] */
                                         synchronize_srcu {
                                          __call_srcu {
                                           srcu_funnel_gp_start
                                            srcu_gp_start
                                             rcu_seq_start
                                          }
                                         }
                /* [GP1][CPU-136] */
                  srcu_funnel_gp_start
                 }
                }
/* [GP0][CPU-136] */
srcu_invoke_callbacks {
 rcu_segcblist_accelerate
}


/*
 * CASE2
 * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
 * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
 */

/* [GP0][CPU-136] */
process_srcu {
 srcu_gp_end
}

                  /* [GP1][CPU-136] */
                  synchronize_srcu {
                   __call_srcu {
                    rcu_segcblist_enqueue
                                           /* [GP1][CPU-041] */
                                           synchronize_srcu {
                                            __call_srcu {
                                             srcu_funnel_gp_start
                                              srcu_gp_start
                                            }
                                           }
                                           process_srcu {
                                            srcu_gp_end
                                             rcu_seq_end
                                           }
                  /* [GP1][CPU-136] */
                    srcu_funnel_gp_start
                   }
                  }
/* [GP0][CPU-136] */
srcu_invoke_callbacks {
                                                      /* [GP2][CPU-017] */
                                                      synchronize_srcu {
                                                       __call_srcu {
                                                        srcu_funnel_gp_start
                                                         srcu_gp_start
                                                          rcu_seq_start
                                                       }
                                                      }
/* [GP0][CPU-136] */
 rcu_segcblist_accelerate
}


Potential fix:
We probably shouldn't accelerate the segcblist in srcu_invoke_callbacks
based on a snapshot of the current gp_seq of the SRCU struct; how about
accelerating based on sdp->srcu_gp_seq_needed instead, maybe like this:


diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 21115ffb6c44..ff8579abc924 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
work_struct *work)
        spin_lock_irq_rcu_node(sdp);
        rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
        (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-                                      rcu_seq_snap(&ssp->srcu_gp_seq));
+                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));
        sdp->srcu_cblist_invoking = false;
        more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
        spin_unlock_irq_rcu_node(sdp);


Thanks for reading!


Reference detail log of crash dump information:
https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28  7:59 SRCU: kworker hung in synchronize_srcu zhuangel570
@ 2023-09-28 21:39 ` Joel Fernandes
  2023-09-28 21:40   ` Joel Fernandes
                     ` (4 more replies)
  2023-09-30 10:11 ` Neeraj upadhyay
  1 sibling, 5 replies; 28+ messages in thread
From: Joel Fernandes @ 2023-09-28 21:39 UTC (permalink / raw)
  To: zhuangel570
  Cc: paulmck, rcu, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	like.xu.linux, linussli, foxywang

Hello,
Firstly, kudos for the detailed report and analysis. Rare failures are
hard, and your use of crash/kdump to dig deeper into the issue is
awesome..

On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
>
> Hi,
>
> We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> and destroy concurrent test, the issue happens after several weeks. Now we
> didn't have a way to reproduce this issue, the issue happens randomly, this
> is the second time we found it in this year.
>
> We did some investigation with crash tools, hope to get some useful clues
> about the issue, try to find root cause of the issue (further find a way to
> fix or workaround it), here is some of our investigation, hope to get
> feedback from SRCU experts.
>
> [3144162.608392] INFO: task kworker/136:12:252684 blocked for more
> than 122 seconds.
> [3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
> [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
> [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> [3144162.631192] Call Trace:
> [3144162.631202]  __schedule+0x2ee/0x660
> [3144162.631206]  schedule+0x33/0xa0
> [3144162.631209]  schedule_timeout+0x1c4/0x340
> [3144162.631214]  ? update_load_avg+0x82/0x660
> [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> [3144162.631218]  wait_for_completion+0x119/0x180
> [3144162.631220]  ? wake_up_q+0x80/0x80
> [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> [3144162.631227]  synchronize_srcu+0x5f/0xc0
> [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> [3144162.631239]  ? __schedule+0x2f6/0x660
> [3144162.631243]  process_one_work+0x19a/0x3a0
> [3144162.631244]  worker_thread+0x37/0x3a0
> [3144162.631247]  kthread+0x117/0x140
> [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> [3144162.631250]  ret_from_fork+0x1f/0x30
>
> Test workload:
> - Repeated concurrency testing, include Create VM, service probe, destroy VM.
> - VM using virito device, so kvm->irq_srcu is using frequently.
> - x86_64 Machine with 191 core.
>
> Crash analysis clues:
> - The hung task is run on CPU 136, waiting synchronize_srcu to be completed.
> - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> - The SNP (node) of SRCU struct has 2 levels.
> - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs).
> - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143].
> - The SDP (per-cpu) of SRCU struct show lock and unlock matches.
> - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list.
> - The pending entry is just the calling entry of hung kwoker task.
> - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296.
> - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list.
> - The work-queue of SRCU and SDP of CPU 136 is idle.
> - More details could be found via reference at the end of this mail.
>
> crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
> 0xffffc900f100e878
>         srcu_idx = 76
>         srcu_gp_seq = 304
>         srcu_gp_seq_needed = 304
>         srcu_gp_seq_needed_exp = 304
>         sda = 0x5fff9ac17800
>
> crash> struct srcu_data 0x5fff9ac17800:136
> [136]: ffffe8bee9817800
> ...
>   srcu_cblist = {
>     head = 0xffffc900e8413dd0,
>     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
> 0xffffc900e8413dd0, 0xffffc900e8413dd0},
>     gp_seq = {0, 296, 296, 0},

I am actually a bit puzzled why the DONE segment count is 0. I would
have expected it to be >= 296 in this situation.

>     len = 1,
>     len_lazy = 0,
>     enabled = 1 '\001',
>     offloaded = 0 '\000'
>   },
>   srcu_gp_seq_needed = 288,
>   srcu_gp_seq_needed_exp = 0,
>   srcu_cblist_invoking = false,
> ...
>
> Root cause (we can't reproduce the issue, so the result is base on crash
> analysis and code analysis):
> - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288).
> - The leaked entry on segcblist seems has been "accelerated".
> - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because
>   srcu_invoke_callbacks may be using incorrect GP seq doing accelerate.
> - Here are two emulated case may cause the "accelerate issue":
>
>
> CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
>   to SDP.
> - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
>   faster than CPU 136, start GP, set rcu_seq_start.
> - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
>   to start GP.
> - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
>   "accelerate" the seq of new added entry to 292 (it should be 288).

But srcu_gp_seq is at 304 right now. How does it matter that the CB is
marked for 292? It should be ready to execute anyway even at 292. Note
the meaning of "acceleration": the idea is to start conservatively and
move the callbacks forward as more accurate information becomes
available. Considering this, 292 initially should be OK IMHO (that's
just more conservative than 288)..
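
For reference, what "accelerate" does (rcu_segcblist_accelerate() in
kernel/rcu/rcu_segcblist.c, heavily abbreviated and quoted from memory,
so please check the 5.4 source) is to tag every not-yet-assigned
callback with the gp_seq value passed in:

bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
{
        int i;
        ...
        /*
         * Scan down for the newest non-empty segment whose recorded
         * ->gp_seq[] is still before "seq"; callbacks can then be
         * merged into the segment just after it.
         */
        for (i = RCU_NEXT_READY_TAIL; i > RCU_DONE_TAIL; i--)
                if (rsclp->tails[i] != rsclp->tails[i - 1] &&
                    ULONG_CMP_LT(rsclp->gp_seq[i], seq))
                        break;
        if (++i >= RCU_NEXT_TAIL)
                return false;   /* No segment available to take them. */

        /* Merge everything newer into segment i and tag it with "seq". */
        for (; i < RCU_NEXT_TAIL; i++) {
                rsclp->tails[i] = rsclp->tails[RCU_NEXT_TAIL];
                rsclp->gp_seq[i] = seq;
        }
        return true;
}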

> CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296
> - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
>   to SDP.
> - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
>   faster than CPU 136, start GP.
> - [CPU 041] [GP 284-288] finished, workload run faster, finish GP.
> - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
>   to start GP.
> - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP,
>   set rcu_seq_start.
> - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
>   "accelerate" new added entry to 296 (it should be 288).

I had the same analysis here.

> /*
>  * CASE1
>  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
>  * - GP0: 280-284, GP1: 284-288.
>  */
>
> /* [GP0][CPU-136] */
> process_srcu {
>  srcu_gp_end
> }
>                 /* [GP1][CPU-136] */
>                 synchronize_srcu {
>                  __call_srcu {
>                   rcu_segcblist_enqueue
>                                          /* [GP1][CPU-041] */
>                                          synchronize_srcu {
>                                           __call_srcu {
>                                            srcu_funnel_gp_start
>                                             srcu_gp_start
>                                              rcu_seq_start
>                                           }
>                                          }
>                 /* [GP1][CPU-136] */
>                   srcu_funnel_gp_start
>                  }
>                 }
> /* [GP0][CPU-136] */
> srcu_invoke_callbacks {
>  rcu_segcblist_accelerate
> }
>
>
> /*
>  * CASE2
>  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
>  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
>  */
>
> /* [GP0][CPU-136] */
> process_srcu {
>  srcu_gp_end
> }
>
>                   /* [GP1][CPU-136] */
>                   synchronize_srcu {
>                    __call_srcu {
>                     rcu_segcblist_enqueue
>                                            /* [GP1][CPU-041] */
>                                            synchronize_srcu {
>                                             __call_srcu {
>                                              srcu_funnel_gp_start
>                                               srcu_gp_start
>                                             }
>                                            }
>                                            process_srcu {
>                                             srcu_gp_end
>                                              rcu_seq_end
>                                            }
>                   /* [GP1][CPU-136] */
>                     srcu_funnel_gp_start
>                    }
>                   }
> /* [GP0][CPU-136] */
> srcu_invoke_callbacks {

If srcu_invoke_callbacks() was really called for the rdp, I would have
expected rcu_segcblist_advance() to advance all those pending
callbacks to 304.

I posit that probably srcu_invoke_callbacks() is not even being called
in the first place, otherwise why is the DONE segment count still 0?
DONE should not be 0, otherwise callbacks will not run. The segcblist
of the rdp seems out of date, perhaps due to srcu_invoke_callbacks() not
being called due to some other bug (or a workqueue/timer issue causing
the srcu_invoke_callbacks() to not run?).
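
For reference, this is the part of srcu_invoke_callbacks() I have in
mind (roughly as in 5.4, abbreviated and from memory, so please check
against your tree):

static void srcu_invoke_callbacks(struct work_struct *work)
{
        struct srcu_data *sdp;          /* derived from the work item */
        struct srcu_struct *ssp;
        struct rcu_cblist ready_cbs;
        ...
        rcu_cblist_init(&ready_cbs);
        spin_lock_irq_rcu_node(sdp);
        /* Everything up to the current GP should become ready here... */
        rcu_segcblist_advance(&sdp->srcu_cblist,
                              rcu_seq_current(&ssp->srcu_gp_seq));
        if (sdp->srcu_cblist_invoking ||
            !rcu_segcblist_ready_cbs(&sdp->srcu_cblist)) {
                spin_unlock_irq_rcu_node(sdp);
                return; /* ...nothing ready (or already running): bail out. */
        }
        ...
}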

I wonder if you would consider trying to reproduce it with the
rcutorture module's SRCU testing, on the chance that it also occurs
with it on your setup.

thanks,

 - Joel


>                                                       /* [GP2][CPU-017] */
>                                                       synchronize_srcu {
>                                                        __call_srcu {
>                                                         srcu_funnel_gp_start
>                                                          srcu_gp_start
>                                                           rcu_seq_start
>                                                        }
>                                                       }
> /* [GP0][CPU-136] */
>  rcu_segcblist_accelerate
> }
>
>
> Potential fix:
> We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot
> of current gp seq of SRCU struct, how about accelerate the sda base
> sdp->srcu_gp_seq_needed, maybe like this:
>
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 21115ffb6c44..ff8579abc924 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
> work_struct *work)
>         spin_lock_irq_rcu_node(sdp);
>         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
>         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));
>         sdp->srcu_cblist_invoking = false;
>         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
>         spin_unlock_irq_rcu_node(sdp);

I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO,
rcu_seq_snap() says which what's the gp_seq at which a full GP has
passed from the moment rcu_seq_snap() is called. That's precisely when
it is safe to call all those non-accelerated callbacks.

Basically, if any of the bits are set in the state portion of the
sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2)
<< RCU_SEQ_CTR_SHIFT.
Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT.
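
For reference, rcu_seq_snap() itself is tiny. Roughly (quoting
kernel/rcu/rcu.h from memory, so double-check against the 5.4 tree):

/*
 * Return the gp_seq value at which a full grace period will have
 * elapsed, relative to the value sampled at the call site.
 */
static inline unsigned long rcu_seq_snap(unsigned long *sp)
{
        unsigned long s;

        /* Round past any GP in progress, plus one more full GP. */
        s = (READ_ONCE(*sp) + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
        smp_mb(); /* Above access must not bleed into critical section. */
        return s;
}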





>
>
> Thanks for reading!
>
>
> Reference detail log of crash dump information:
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28 21:39 ` Joel Fernandes
@ 2023-09-28 21:40   ` Joel Fernandes
  2023-09-29  1:50   ` Zhouyi Zhou
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Joel Fernandes @ 2023-09-28 21:40 UTC (permalink / raw)
  To: zhuangel570
  Cc: paulmck, rcu, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	like.xu.linux, linussli, foxywang

On Thu, Sep 28, 2023 at 5:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
> Firstly, kudos to the detailed report and analysis. Rare failures are
> hard and your usage crash/kdump is awesome to dig deeper into the
> issue..
>
> On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> >
> > Hi,
> >
> > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > and destroy concurrent test, the issue happens after several weeks. Now we
> > didn't have a way to reproduce this issue, the issue happens randomly, this
> > is the second time we found it in this year.
> >
> > We did some investigation with crash tools, hope to get some useful clues
> > about the issue, try to find root cause of the issue (further find a way to
> > fix or workaround it), here is some of our investigation, hope to get
> > feedback from SRCU experts.
> >
> > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more
> > than 122 seconds.
> > [3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
> > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
> > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> > [3144162.631192] Call Trace:
> > [3144162.631202]  __schedule+0x2ee/0x660
> > [3144162.631206]  schedule+0x33/0xa0
> > [3144162.631209]  schedule_timeout+0x1c4/0x340
> > [3144162.631214]  ? update_load_avg+0x82/0x660
> > [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> > [3144162.631218]  wait_for_completion+0x119/0x180
> > [3144162.631220]  ? wake_up_q+0x80/0x80
> > [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> > [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> > [3144162.631227]  synchronize_srcu+0x5f/0xc0
> > [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> > [3144162.631239]  ? __schedule+0x2f6/0x660
> > [3144162.631243]  process_one_work+0x19a/0x3a0
> > [3144162.631244]  worker_thread+0x37/0x3a0
> > [3144162.631247]  kthread+0x117/0x140
> > [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> > [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> > [3144162.631250]  ret_from_fork+0x1f/0x30
> >
> > Test workload:
> > - Repeated concurrency testing, include Create VM, service probe, destroy VM.
> > - VM using virito device, so kvm->irq_srcu is using frequently.
> > - x86_64 Machine with 191 core.
> >
> > Crash analysis clues:
> > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed.
> > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> > - The SNP (node) of SRCU struct has 2 levels.
> > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs).
> > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143].
> > - The SDP (per-cpu) of SRCU struct show lock and unlock matches.
> > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list.
> > - The pending entry is just the calling entry of hung kwoker task.
> > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296.
> > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list.
> > - The work-queue of SRCU and SDP of CPU 136 is idle.
> > - More details could be found via reference at the end of this mail.
> >
> > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
> > 0xffffc900f100e878
> >         srcu_idx = 76
> >         srcu_gp_seq = 304
> >         srcu_gp_seq_needed = 304
> >         srcu_gp_seq_needed_exp = 304
> >         sda = 0x5fff9ac17800
> >
> > crash> struct srcu_data 0x5fff9ac17800:136
> > [136]: ffffe8bee9817800
> > ...
> >   srcu_cblist = {
> >     head = 0xffffc900e8413dd0,
> >     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
> > 0xffffc900e8413dd0, 0xffffc900e8413dd0},
> >     gp_seq = {0, 296, 296, 0},
>
> I am actually a bit puzzled why the DONE segment count is 0. I would
> have expected it to be >= 296 in this situation.
>
> >     len = 1,
> >     len_lazy = 0,
> >     enabled = 1 '\001',
> >     offloaded = 0 '\000'
> >   },
> >   srcu_gp_seq_needed = 288,
> >   srcu_gp_seq_needed_exp = 0,
> >   srcu_cblist_invoking = false,
> > ...
> >
> > Root cause (we can't reproduce the issue, so the result is base on crash
> > analysis and code analysis):
> > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288).
> > - The leaked entry on segcblist seems has been "accelerated".
> > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because
> >   srcu_invoke_callbacks may be using incorrect GP seq doing accelerate.
> > - Here are two emulated case may cause the "accelerate issue":
> >
> >
> > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP, set rcu_seq_start.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" the seq of new added entry to 292 (it should be 288).
>
> But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> marked for 292? It should be ready to execute anyway even at 292. Note
> the meaning of "acceleration", the idea is to start conservatively and
> move the callbacks forward as more accurate information is available.
> Considering this, 292 initially should be OK IMHO (that's just more
> conservative than 288)..
>
> > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP.
> > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP,
> >   set rcu_seq_start.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" new added entry to 296 (it should be 288).
>
> I had the same analysis here.
>
> > /*
> >  * CASE1
> >  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
> >  * - GP0: 280-284, GP1: 284-288.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >                 /* [GP1][CPU-136] */
> >                 synchronize_srcu {
> >                  __call_srcu {
> >                   rcu_segcblist_enqueue
> >                                          /* [GP1][CPU-041] */
> >                                          synchronize_srcu {
> >                                           __call_srcu {
> >                                            srcu_funnel_gp_start
> >                                             srcu_gp_start
> >                                              rcu_seq_start
> >                                           }
> >                                          }
> >                 /* [GP1][CPU-136] */
> >                   srcu_funnel_gp_start
> >                  }
> >                 }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > /*
> >  * CASE2
> >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >
> >                   /* [GP1][CPU-136] */
> >                   synchronize_srcu {
> >                    __call_srcu {
> >                     rcu_segcblist_enqueue
> >                                            /* [GP1][CPU-041] */
> >                                            synchronize_srcu {
> >                                             __call_srcu {
> >                                              srcu_funnel_gp_start
> >                                               srcu_gp_start
> >                                             }
> >                                            }
> >                                            process_srcu {
> >                                             srcu_gp_end
> >                                              rcu_seq_end
> >                                            }
> >                   /* [GP1][CPU-136] */
> >                     srcu_funnel_gp_start
> >                    }
> >                   }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
>
> If srcu_invoke_callbacks() was really called for the rdp, I would have
> expected rcu_segcblist_advance() to advance all those pending
> callbacks to 304.
>
> I posit that probably srcu_invoke_callbacks() is not even being called
> in the first place, otherwise why is the DONE segment count still 0?
> DONE should not be 0, otherwise callbacks will not run. The segcblist
> of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> being called due to some other bug (or a workqueue/timer issue causing
> the srcu_invoke_callbacks() to not run?).
>
> I wonder if you would consider trying to reproduce it with the
> rcutorture module's SRCU testing in the chance that it occurs with it
> on your setup.
>
> thanks,
>
>  - Joel

As a note, I had additional comments below but I accidentally signed
my name a bit earlier - fyi. Thanks. -Joel


>
>
> >                                                       /* [GP2][CPU-017] */
> >                                                       synchronize_srcu {
> >                                                        __call_srcu {
> >                                                         srcu_funnel_gp_start
> >                                                          srcu_gp_start
> >                                                           rcu_seq_start
> >                                                        }
> >                                                       }
> > /* [GP0][CPU-136] */
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > Potential fix:
> > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot
> > of current gp seq of SRCU struct, how about accelerate the sda base
> > sdp->srcu_gp_seq_needed, maybe like this:
> >
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 21115ffb6c44..ff8579abc924 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
> > work_struct *work)
> >         spin_lock_irq_rcu_node(sdp);
> >         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
> >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> > +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));
> >         sdp->srcu_cblist_invoking = false;
> >         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >         spin_unlock_irq_rcu_node(sdp);
>
> I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO,
> rcu_seq_snap() says which what's the gp_seq at which a full GP has
> passed from the moment rcu_seq_snap() is called. That's precisely when
> it is safe to call all those non-accelerated callbacks.
>
> Basically, if any of the bits are set in the state portion of the
> sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2)
> << RCU_SEQ_CTR_SHIFT.
> Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT.
>
>
>
>
>
> >
> >
> > Thanks for reading!
> >
> >
> > Reference detail log of crash dump information:
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28 21:39 ` Joel Fernandes
  2023-09-28 21:40   ` Joel Fernandes
@ 2023-09-29  1:50   ` Zhouyi Zhou
  2023-09-29 22:44   ` Frederic Weisbecker
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Zhouyi Zhou @ 2023-09-29  1:50 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: zhuangel570, paulmck, rcu, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, like.xu.linux, linussli, foxywang

On Fri, Sep 29, 2023 at 6:05 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
> Firstly, kudos to the detailed report and analysis. Rare failures are
> hard and your usage crash/kdump is awesome to dig deeper into the
> issue..
>
> On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> >
> > Hi,
> >
> > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > and destroy concurrent test, the issue happens after several weeks. Now we
> > didn't have a way to reproduce this issue, the issue happens randomly, this
> > is the second time we found it in this year.
> >
> > We did some investigation with crash tools, hope to get some useful clues
> > about the issue, try to find root cause of the issue (further find a way to
> > fix or workaround it), here is some of our investigation, hope to get
> > feedback from SRCU experts.
> >
> > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more
> > than 122 seconds.
> > [3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
> > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
> > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> > [3144162.631192] Call Trace:
> > [3144162.631202]  __schedule+0x2ee/0x660
> > [3144162.631206]  schedule+0x33/0xa0
> > [3144162.631209]  schedule_timeout+0x1c4/0x340
> > [3144162.631214]  ? update_load_avg+0x82/0x660
> > [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> > [3144162.631218]  wait_for_completion+0x119/0x180
> > [3144162.631220]  ? wake_up_q+0x80/0x80
> > [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> > [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> > [3144162.631227]  synchronize_srcu+0x5f/0xc0
> > [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> > [3144162.631239]  ? __schedule+0x2f6/0x660
> > [3144162.631243]  process_one_work+0x19a/0x3a0
> > [3144162.631244]  worker_thread+0x37/0x3a0
> > [3144162.631247]  kthread+0x117/0x140
> > [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> > [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> > [3144162.631250]  ret_from_fork+0x1f/0x30
> >
> > Test workload:
> > - Repeated concurrency testing, include Create VM, service probe, destroy VM.
> > - VM using virito device, so kvm->irq_srcu is using frequently.
> > - x86_64 Machine with 191 core.
> >
> > Crash analysis clues:
> > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed.
> > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> > - The SNP (node) of SRCU struct has 2 levels.
> > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs).
> > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143].
> > - The SDP (per-cpu) of SRCU struct show lock and unlock matches.
> > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list.
> > - The pending entry is just the calling entry of hung kwoker task.
> > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296.
> > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list.
> > - The work-queue of SRCU and SDP of CPU 136 is idle.
> > - More details could be found via reference at the end of this mail.
> >
> > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
> > 0xffffc900f100e878
> >         srcu_idx = 76
> >         srcu_gp_seq = 304
> >         srcu_gp_seq_needed = 304
> >         srcu_gp_seq_needed_exp = 304
> >         sda = 0x5fff9ac17800
> >
> > crash> struct srcu_data 0x5fff9ac17800:136
> > [136]: ffffe8bee9817800
> > ...
> >   srcu_cblist = {
> >     head = 0xffffc900e8413dd0,
> >     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
> > 0xffffc900e8413dd0, 0xffffc900e8413dd0},
> >     gp_seq = {0, 296, 296, 0},
>
> I am actually a bit puzzled why the DONE segment count is 0. I would
> have expected it to be >= 296 in this situation.
>
> >     len = 1,
> >     len_lazy = 0,
> >     enabled = 1 '\001',
> >     offloaded = 0 '\000'
> >   },
> >   srcu_gp_seq_needed = 288,
> >   srcu_gp_seq_needed_exp = 0,
> >   srcu_cblist_invoking = false,
> > ...
> >
> > Root cause (we can't reproduce the issue, so the result is base on crash
> > analysis and code analysis):
> > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288).
> > - The leaked entry on segcblist seems has been "accelerated".
> > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because
> >   srcu_invoke_callbacks may be using incorrect GP seq doing accelerate.
> > - Here are two emulated case may cause the "accelerate issue":
> >
> >
> > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP, set rcu_seq_start.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" the seq of new added entry to 292 (it should be 288).
>
> But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> marked for 292? It should be ready to execute anyway even at 292. Note
> the meaning of "acceleration", the idea is to start conservatively and
> move the callbacks forward as more accurate information is available.
> Considering this, 292 initially should be OK IMHO (that's just more
> conservative than 288)..
>
> > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP.
> > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP,
> >   set rcu_seq_start.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" new added entry to 296 (it should be 288).
>
> I had the same analysis here.
>
> > /*
> >  * CASE1
> >  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
> >  * - GP0: 280-284, GP1: 284-288.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >                 /* [GP1][CPU-136] */
> >                 synchronize_srcu {
> >                  __call_srcu {
> >                   rcu_segcblist_enqueue
> >                                          /* [GP1][CPU-041] */
> >                                          synchronize_srcu {
> >                                           __call_srcu {
> >                                            srcu_funnel_gp_start
> >                                             srcu_gp_start
> >                                              rcu_seq_start
> >                                           }
> >                                          }
> >                 /* [GP1][CPU-136] */
> >                   srcu_funnel_gp_start
> >                  }
> >                 }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > /*
> >  * CASE2
> >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >
> >                   /* [GP1][CPU-136] */
> >                   synchronize_srcu {
> >                    __call_srcu {
> >                     rcu_segcblist_enqueue
> >                                            /* [GP1][CPU-041] */
> >                                            synchronize_srcu {
> >                                             __call_srcu {
> >                                              srcu_funnel_gp_start
> >                                               srcu_gp_start
> >                                             }
> >                                            }
> >                                            process_srcu {
> >                                             srcu_gp_end
> >                                              rcu_seq_end
> >                                            }
> >                   /* [GP1][CPU-136] */
> >                     srcu_funnel_gp_start
> >                    }
> >                   }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
>
> If srcu_invoke_callbacks() was really called for the rdp, I would have
> expected rcu_segcblist_advance() to advance all those pending
> callbacks to 304.
>
> I posit that probably srcu_invoke_callbacks() is not even being called
> in the first place, otherwise why is the DONE segment count still 0?
> DONE should not be 0, otherwise callbacks will not run. The segcblist
> of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> being called due to some other bug (or a workqueue/timer issue causing
> the srcu_invoke_callbacks() to not run?).
>
> I wonder if you would consider trying to reproduce it with the
> rcutorture module's SRCU testing in the chance that it occurs with it
> on your setup.
As an enthusiast, I have an idea: could we use the ftrace mechanism to
capture a trace just before the stall, for more detailed analysis?
Recently I used the cyclictest --tracemark option to analyze the
system's ftrace output just before cyclictest's -b latency threshold
was exceeded. [1]

Thanks
Zhouyi
[1] https://man.archlinux.org/man/cyclictest.8.en
>
> thanks,
>
>  - Joel
>
>
> >                                                       /* [GP2][CPU-017] */
> >                                                       synchronize_srcu {
> >                                                        __call_srcu {
> >                                                         srcu_funnel_gp_start
> >                                                          srcu_gp_start
> >                                                           rcu_seq_start
> >                                                        }
> >                                                       }
> > /* [GP0][CPU-136] */
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > Potential fix:
> > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot
> > of current gp seq of SRCU struct, how about accelerate the sda base
> > sdp->srcu_gp_seq_needed, maybe like this:
> >
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 21115ffb6c44..ff8579abc924 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
> > work_struct *work)
> >         spin_lock_irq_rcu_node(sdp);
> >         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
> >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> > +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));
> >         sdp->srcu_cblist_invoking = false;
> >         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >         spin_unlock_irq_rcu_node(sdp);
>
> I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO,
> rcu_seq_snap() says which what's the gp_seq at which a full GP has
> passed from the moment rcu_seq_snap() is called. That's precisely when
> it is safe to call all those non-accelerated callbacks.
>
> Basically, if any of the bits are set in the state portion of the
> sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2)
> << RCU_SEQ_CTR_SHIFT.
> Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT.
>
>
>
>
>
> >
> >
> > Thanks for reading!
> >
> >
> > Reference detail log of crash dump information:
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28 21:39 ` Joel Fernandes
  2023-09-28 21:40   ` Joel Fernandes
  2023-09-29  1:50   ` Zhouyi Zhou
@ 2023-09-29 22:44   ` Frederic Weisbecker
  2023-09-30  2:45     ` Neeraj upadhyay
  2023-09-30 10:01   ` Neeraj upadhyay
  2023-10-07  8:53   ` zhuangel570
  4 siblings, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2023-09-29 22:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: zhuangel570, paulmck, rcu, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, like.xu.linux, linussli, foxywang

Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit :
> If srcu_invoke_callbacks() was really called for the rdp, I would have
> expected rcu_segcblist_advance() to advance all those pending
> callbacks to 304.
> 
> I posit that probably srcu_invoke_callbacks() is not even being called
> in the first place, otherwise why is the DONE segment count still 0?
> DONE should not be 0, otherwise callbacks will not run. The segcblist
> of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> being called due to some other bug (or a workqueue/timer issue causing
> the srcu_invoke_callbacks() to not run?).

Speaking of, I'm looking at srcu_gp_end() and the comment after releasing
srcu_gp_mutex says:

	      "/* A new grace period can start at this point.  But only one. */"

But what ensures that?

Because if we get preempted here and more than 3 grace periods elapse,
then ->srcu_have_cbs[the_same_idx] could get overridden, and then all
sorts of races may lead to srcu_invoke_callbacks() not being invoked.

I'm not asking too loudly though because I'm sure I'm missing something obvious,
I see a lot of mirages these days...


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-29 22:44   ` Frederic Weisbecker
@ 2023-09-30  2:45     ` Neeraj upadhyay
  2023-09-30  7:30       ` Frederic Weisbecker
  2023-09-30  9:10       ` Frederic Weisbecker
  0 siblings, 2 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-09-30  2:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit :
> > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > expected rcu_segcblist_advance() to advance all those pending
> > callbacks to 304.
> >
> > I posit that probably srcu_invoke_callbacks() is not even being called
> > in the first place, otherwise why is the DONE segment count still 0?
> > DONE should not be 0, otherwise callbacks will not run. The segcblist
> > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> > being called due to some other bug (or a workqueue/timer issue causing
> > the srcu_invoke_callbacks() to not run?).
>
> Speaking of, I'm looking at srcu_gp_end() and the comment after releasing
> srcu_gp_mutex says:
>
>               "/* A new grace period can start at this point.  But only one. */"
>
> But what ensures that?
>

I think ->srcu_cb_mutex ensures that.

/* Prevent more than one additional grace period. */
mutex_lock(&sup->srcu_cb_mutex);


Thanks
Neeraj

> Because if we get preempted here and more than 3 grace periods elapse, then
> ->srcu_have_cbs[the_same_idx] could get overriden and then all sorts of race
> may lead to srcu_invoke_callbacks() to not be invoked.
>
> I'm not asking too loudly though because I'm sure I'm missing something obvious,
> I see a lot of mirages these days...


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-30  2:45     ` Neeraj upadhyay
@ 2023-09-30  7:30       ` Frederic Weisbecker
  2023-09-30  9:10       ` Frederic Weisbecker
  1 sibling, 0 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2023-09-30  7:30 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Sat, Sep 30, 2023 at 08:15:06AM +0530, Neeraj upadhyay wrote:
> On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit :
> > > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > > expected rcu_segcblist_advance() to advance all those pending
> > > callbacks to 304.
> > >
> > > I posit that probably srcu_invoke_callbacks() is not even being called
> > > in the first place, otherwise why is the DONE segment count still 0?
> > > DONE should not be 0, otherwise callbacks will not run. The segcblist
> > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> > > being called due to some other bug (or a workqueue/timer issue causing
> > > the srcu_invoke_callbacks() to not run?).
> >
> > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing
> > srcu_gp_mutex says:
> >
> >               "/* A new grace period can start at this point.  But only one. */"
> >
> > But what ensures that?
> >
> 
> I think ->srcu_cb_mutex ensures that.
> 
> /* Prevent more than one additional grace period. */
> mutex_lock(&sup->srcu_cb_mutex);

Yes, but right after that it's released (before callbacks are scheduled for execution)?


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-30  2:45     ` Neeraj upadhyay
  2023-09-30  7:30       ` Frederic Weisbecker
@ 2023-09-30  9:10       ` Frederic Weisbecker
  2023-09-30  9:48         ` Neeraj upadhyay
  1 sibling, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2023-09-30  9:10 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

Le Sat, Sep 30, 2023 at 08:15:06AM +0530, Neeraj upadhyay a écrit :
> On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit :
> > > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > > expected rcu_segcblist_advance() to advance all those pending
> > > callbacks to 304.
> > >
> > > I posit that probably srcu_invoke_callbacks() is not even being called
> > > in the first place, otherwise why is the DONE segment count still 0?
> > > DONE should not be 0, otherwise callbacks will not run. The segcblist
> > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> > > being called due to some other bug (or a workqueue/timer issue causing
> > > the srcu_invoke_callbacks() to not run?).
> >
> > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing
> > srcu_gp_mutex says:
> >
> >               "/* A new grace period can start at this point.  But only one. */"
> >
> > But what ensures that?
> >
> 
> I think ->srcu_cb_mutex ensures that.
> 
> /* Prevent more than one additional grace period. */
> mutex_lock(&sup->srcu_cb_mutex);

Oh I confused ->srcu_cb_mutex with ->srcu_gp_mutex, didn't I? ;-)


* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-30  9:10       ` Frederic Weisbecker
@ 2023-09-30  9:48         ` Neeraj upadhyay
  0 siblings, 0 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-09-30  9:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Sat, Sep 30, 2023 at 2:40 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Le Sat, Sep 30, 2023 at 08:15:06AM +0530, Neeraj upadhyay a écrit :
> > On Sat, Sep 30, 2023 at 4:15 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > Le Thu, Sep 28, 2023 at 05:39:17PM -0400, Joel Fernandes a écrit :
> > > > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > > > expected rcu_segcblist_advance() to advance all those pending
> > > > callbacks to 304.
> > > >
> > > > I posit that probably srcu_invoke_callbacks() is not even being called
> > > > in the first place, otherwise why is the DONE segment count still 0?
> > > > DONE should not be 0, otherwise callbacks will not run. The segcblist
> > > > of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> > > > being called due to some other bug (or a workqueue/timer issue causing
> > > > the srcu_invoke_callbacks() to not run?).
> > >
> > > Speaking of, I'm looking at srcu_gp_end() and the comment after releasing
> > > srcu_gp_mutex says:
> > >
> > >               "/* A new grace period can start at this point.  But only one. */"
> > >
> > > But what ensures that?
> > >
> >
> > I think ->srcu_cb_mutex ensures that.
> >
> > /* Prevent more than one additional grace period. */
> > mutex_lock(&sup->srcu_cb_mutex);
>
> Oh I confused ->srcu_cb_mutex with ->srcu_gp_mutex, didn't I? ;-)

Maybe :) From your reply, I thought you meant srcu_invoke_callbacks() for one GP
might execute after multiple subsequent GPs have completed, which makes sense.


Thanks
Neeraj

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28 21:39 ` Joel Fernandes
                     ` (2 preceding siblings ...)
  2023-09-29 22:44   ` Frederic Weisbecker
@ 2023-09-30 10:01   ` Neeraj upadhyay
  2023-10-01  0:19     ` Joel Fernandes
  2023-10-07  8:53   ` zhuangel570
  4 siblings, 1 reply; 28+ messages in thread
From: Neeraj upadhyay @ 2023-09-30 10:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: zhuangel570, paulmck, rcu, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, like.xu.linux, linussli, foxywang

On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
> Firstly, kudos to the detailed report and analysis. Rare failures are
> hard and your use of crash/kdump to dig deeper into the issue is
> awesome.
>
> On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> >
> > Hi,
> >
> > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > and destroy concurrent test, the issue happens after several weeks. Now we
> > didn't have a way to reproduce this issue, the issue happens randomly, this
> > is the second time we found it in this year.
> >
> >
> >
> > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP, set rcu_seq_start.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" the seq of new added entry to 292 (it should be 288).
>
> But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> marked for 292? It should be ready to execute anyway even at 292. Note
> the meaning of "acceleration", the idea is to start conservatively and
> move the callbacks forward as more accurate information is available.
> Considering this, 292 initially should be OK IMHO (that's just more
> conservative than 288)..
>

Maybe it matters: for a CPU, the callbacks will only be scheduled
in srcu_gp_end() for the GPs for which it has updated ->srcu_data_have_cbs[idx]
and ->srcu_have_cbs[idx].
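
Roughly, the scheduling decision at the end of a grace period looks like the
sketch below (paraphrased from memory rather than quoted from srcutree.c, so
details are elided); the point is that a leaf snp only queues the per-CPU
callback work for CPUs whose bits were recorded in ->srcu_data_have_cbs[idx]
for that grace period:

        /*
         * Sketch of the callback-scheduling walk in srcu_gp_end():
         *
         *   for each leaf srcu_node snp:
         *     cbs  = (snp->srcu_have_cbs[idx] == gpseq)  - were CBs registered
         *                                                  for the GP that just
         *                                                  ended?
         *     mask = snp->srcu_data_have_cbs[idx];        - which CPUs
         *     snp->srcu_data_have_cbs[idx] = 0;
         *     if (cbs)
         *       srcu_schedule_cbs_snp(ssp, snp, mask, cbdelay);
         *
         * So if a CPU's bit is not in "mask" for the GP that just ended,
         * nothing here queues srcu_invoke_callbacks() for that CPU.
         */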

> >
> > /*
> >  * CASE2
> >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >
> >                   /* [GP1][CPU-136] */
> >                   synchronize_srcu {
> >                    __call_srcu {
> >                     rcu_segcblist_enqueue
> >                                            /* [GP1][CPU-041] */
> >                                            synchronize_srcu {
> >                                             __call_srcu {
> >                                              srcu_funnel_gp_start
> >                                               srcu_gp_start
> >                                             }
> >                                            }
> >                                            process_srcu {
> >                                             srcu_gp_end
> >                                              rcu_seq_end
> >                                            }
> >                   /* [GP1][CPU-136] */
> >                     srcu_funnel_gp_start
> >                    }
> >                   }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
>
> If srcu_invoke_callbacks() was really called for the rdp, I would have
> expected rcu_segcblist_advance() to advance all those pending
> callbacks to 304.

If the callback is in NEXT_TAIL and has not been assigned a GP num,
rcu_segcblist_advance() won't move it, and the next accelerate in
srcu_invoke_callbacks() will assign it the next gp sequence num.
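
For reference, the four segments involved (names as in
include/linux/rcu_segcblist.h) and what advance/accelerate do with them, as I
understand it:

        /*
         * Oldest to newest:
         *
         *   RCU_DONE_TAIL       - grace period already elapsed, ready to invoke
         *   RCU_WAIT_TAIL       - waiting for the currently recorded GP
         *   RCU_NEXT_READY_TAIL - waiting for the GP after that
         *   RCU_NEXT_TAIL       - not yet associated with any GP number
         *
         * rcu_segcblist_advance(rsclp, seq) moves WAIT/NEXT_READY callbacks
         * whose recorded gp_seq is already covered by "seq" into DONE; it
         * never touches NEXT.  rcu_segcblist_accelerate(rsclp, seq) is what
         * assigns "seq" to the NEXT callbacks and files them into a waiting
         * segment, when one is available.
         */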


Thanks
Neeraj

>
> I posit that probably srcu_invoke_callbacks() is not even being called
> in the first place, otherwise why is the DONE segment count still 0?
> DONE should not be 0, otherwise callbacks will not run. The segcblist
> of the rdp seems out of date perhaps due to srcu_invoke_callbacks() not
> being called due to some other bug (or a workqueue/timer issue causing
> the srcu_invoke_callbacks() to not run?).
>
> I wonder if you would consider trying to reproduce it with the
> rcutorture module's SRCU testing in the chance that it occurs with it
> on your setup.
>
> thanks,
>
>  - Joel
>
>
> >                                                       /* [GP2][CPU-017] */
> >                                                       synchronize_srcu {
> >                                                        __call_srcu {
> >                                                         srcu_funnel_gp_start
> >                                                          srcu_gp_start
> >                                                           rcu_seq_start
> >                                                        }
> >                                                       }
> > /* [GP0][CPU-136] */
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > Potential fix:
> > We shouldn't accelerate the segcblist in srcu_invoke_callbacks based on a
> > snapshot of the current gp seq of the SRCU struct; how about accelerating
> > the sda based on sdp->srcu_gp_seq_needed, maybe like this:
> >
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 21115ffb6c44..ff8579abc924 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
> > work_struct *work)
> >         spin_lock_irq_rcu_node(sdp);
> >         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
> >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> > +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));
> >         sdp->srcu_cblist_invoking = false;
> >         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >         spin_unlock_irq_rcu_node(sdp);
>
> I think "rcu_seq_snap(&ssp->srcu_gp_seq)" is correct IMHO;
> rcu_seq_snap() tells you the gp_seq at which a full GP has
> passed from the moment rcu_seq_snap() is called. That's precisely when
> it is safe to call all those non-accelerated callbacks.
>
> Basically, if any of the bits are set in the state portion of the
> sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2)
> << RCU_SEQ_CTR_SHIFT.
> Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT.
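
(For reference, the arithmetic described above boils down to something like
the sketch below.  RCU_SEQ_CTR_SHIFT is 2 in kernel/rcu/rcu.h; this is a
standalone illustration rather than the exact kernel implementation, which
also contains a memory barrier.)

        #define RCU_SEQ_CTR_SHIFT       2
        #define RCU_SEQ_STATE_MASK      ((1 << RCU_SEQ_CTR_SHIFT) - 1)

        /* Standalone sketch mirroring the rcu_seq_snap() arithmetic. */
        static unsigned long seq_snap(unsigned long gp_seq)
        {
                /*
                 * Adding 2 * RCU_SEQ_STATE_MASK + 1 and clearing the state
                 * bits yields (ctr + 2) << RCU_SEQ_CTR_SHIFT when a GP is in
                 * progress (state bits set), and (ctr + 1) << RCU_SEQ_CTR_SHIFT
                 * when idle.
                 */
                return (gp_seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
        }

        /* Example: seq_snap(288) == 292 (idle), seq_snap(289) == 296 (GP running). */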
>
>
>
>
>
> >
> >
> > Thanks for reading!
> >
> >
> > Reference detail log of crash dump information:
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28  7:59 SRCU: kworker hung in synchronize_srcu zhuangel570
  2023-09-28 21:39 ` Joel Fernandes
@ 2023-09-30 10:11 ` Neeraj upadhyay
  1 sibling, 0 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-09-30 10:11 UTC (permalink / raw)
  To: zhuangel570
  Cc: paulmck, rcu, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	like.xu.linux, linussli, foxywang

On Fri, Sep 29, 2023 at 3:04 AM zhuangel570 <zhuangel570@gmail.com> wrote:
>
> Hi,
>
> We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> and destroy concurrent test, the issue happens after several weeks. Now we
> didn't have a way to reproduce this issue, the issue happens randomly, this
> is the second time we found it in this year.
>
> We did some investigation with crash tools, hope to get some useful clues
> about the issue, try to find root cause of the issue (further find a way to
> fix or workaround it), here is some of our investigation, hope to get
> feedback from SRCU experts.
>
> [3144162.608392] INFO: task kworker/136:12:252684 blocked for more
> than 122 seconds.
> [3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
> [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
> [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> [3144162.631192] Call Trace:
> [3144162.631202]  __schedule+0x2ee/0x660
> [3144162.631206]  schedule+0x33/0xa0
> [3144162.631209]  schedule_timeout+0x1c4/0x340
> [3144162.631214]  ? update_load_avg+0x82/0x660
> [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> [3144162.631218]  wait_for_completion+0x119/0x180
> [3144162.631220]  ? wake_up_q+0x80/0x80
> [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> [3144162.631227]  synchronize_srcu+0x5f/0xc0
> [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> [3144162.631239]  ? __schedule+0x2f6/0x660
> [3144162.631243]  process_one_work+0x19a/0x3a0
> [3144162.631244]  worker_thread+0x37/0x3a0
> [3144162.631247]  kthread+0x117/0x140
> [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> [3144162.631250]  ret_from_fork+0x1f/0x30
>
> Test workload:
> - Repeated concurrency testing, include Create VM, service probe, destroy VM.
> - VM using virito device, so kvm->irq_srcu is using frequently.
> - x86_64 Machine with 191 core.
>
> Crash analysis clues:
> - The hung task is run on CPU 136, waiting synchronize_srcu to be completed.
> - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> - The SNP (node) of SRCU struct has 2 levels.
> - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs).
> - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143].
> - The SDP (per-cpu) of SRCU struct show lock and unlock matches.
> - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list.
> - The pending entry is just the calling entry of hung kwoker task.
> - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296.
> - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list.
> - The work-queue of SRCU and SDP of CPU 136 is idle.
> - More details could be found via reference at the end of this mail.
>
> crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
> 0xffffc900f100e878
>         srcu_idx = 76
>         srcu_gp_seq = 304
>         srcu_gp_seq_needed = 304
>         srcu_gp_seq_needed_exp = 304
>         sda = 0x5fff9ac17800
>
> crash> struct srcu_data 0x5fff9ac17800:136
> [136]: ffffe8bee9817800
> ...
>   srcu_cblist = {
>     head = 0xffffc900e8413dd0,
>     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
> 0xffffc900e8413dd0, 0xffffc900e8413dd0},
>     gp_seq = {0, 296, 296, 0},
>     len = 1,
>     len_lazy = 0,
>     enabled = 1 '\001',
>     offloaded = 0 '\000'
>   },
>   srcu_gp_seq_needed = 288,
>   srcu_gp_seq_needed_exp = 0,
>   srcu_cblist_invoking = false,
> ...
>
> Root cause (we can't reproduce the issue, so the result is based on crash
> analysis and code analysis):
> - The leaked entry on the segcblist has an incorrect gp_seq (the SDP is
>   waiting for 288).
> - The leaked entry on the segcblist seems to have been "accelerated".
> - The incorrect accelerate may be caused by srcu_invoke_callbacks, because
>   srcu_invoke_callbacks may be using an incorrect GP seq when doing the
>   accelerate.
> - Here are two emulated cases that may cause the "accelerate issue":
>
>
> CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
>   to SDP.
> - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
>   faster than CPU 136, start GP, set rcu_seq_start.
> - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
>   to start GP.
> - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
>   "accelerate" the seq of new added entry to 292 (it should be 288).
>
>
> CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296
> - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
>   to SDP.
> - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
>   faster than CPU 136, start GP.
> - [CPU 041] [GP 284-288] finished, workload run faster, finish GP.
> - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
>   to start GP.
> - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP,
>   set rcu_seq_start.
> - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
>   "accelerate" new added entry to 296 (it should be 288).
>
>
> /*
>  * CASE1
>  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
>  * - GP0: 280-284, GP1: 284-288.
>  */
>
> /* [GP0][CPU-136] */
> process_srcu {
>  srcu_gp_end
> }
>                 /* [GP1][CPU-136] */
>                 synchronize_srcu {
>                  __call_srcu {
>                   rcu_segcblist_enqueue
>                                          /* [GP1][CPU-041] */
>                                          synchronize_srcu {
>                                           __call_srcu {
>                                            srcu_funnel_gp_start
>                                             srcu_gp_start
>                                              rcu_seq_start
>                                           }
>                                          }
>                 /* [GP1][CPU-136] */
>                   srcu_funnel_gp_start
>                  }
>                 }
> /* [GP0][CPU-136] */
> srcu_invoke_callbacks {
>  rcu_segcblist_accelerate
> }
>
>
> /*
>  * CASE2
>  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
>  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
>  */
>
> /* [GP0][CPU-136] */
> process_srcu {
>  srcu_gp_end
> }
>
>                   /* [GP1][CPU-136] */
>                   synchronize_srcu {
>                    __call_srcu {
>                     rcu_segcblist_enqueue
>                                            /* [GP1][CPU-041] */
>                                            synchronize_srcu {
>                                             __call_srcu {
>                                              srcu_funnel_gp_start
>                                               srcu_gp_start
>                                             }
>                                            }
>                                            process_srcu {
>                                             srcu_gp_end
>                                              rcu_seq_end
>                                            }
>                   /* [GP1][CPU-136] */
>                     srcu_funnel_gp_start
>                    }
>                   }
> /* [GP0][CPU-136] */
> srcu_invoke_callbacks {
>                                                       /* [GP2][CPU-017] */
>                                                       synchronize_srcu {
>                                                        __call_srcu {
>                                                         srcu_funnel_gp_start
>                                                          srcu_gp_start
>                                                           rcu_seq_start
>                                                        }
>                                                       }
> /* [GP0][CPU-136] */
>  rcu_segcblist_accelerate
> }
>
>
> Potential fix:
> We shouldn't accelerate the segcblist in srcu_invoke_callbacks based on a
> snapshot of the current gp seq of the SRCU struct; how about accelerating
> the sda based on sdp->srcu_gp_seq_needed, maybe like this:
>
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 21115ffb6c44..ff8579abc924 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
> work_struct *work)
>         spin_lock_irq_rcu_node(sdp);
>         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
>         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));

Can you please explain why rcu_seq_snap() is needed here?
Maybe you also need to advance the cbs, to ensure that the following call to
rcu_segcblist_ready_cbs() picks up these callbacks, in case srcu_invoke_callbacks()
is called after multiple GPs have elapsed.

rcu_segcblist_advance(&sdp->srcu_cblist,
                              rcu_seq_current(&ssp->srcu_gp_seq));
...
more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);


Thanks
Neeraj

>         sdp->srcu_cblist_invoking = false;
>         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
>         spin_unlock_irq_rcu_node(sdp);
>
>
> Thanks for reading!
>
>
> Reference detail log of crash dump information:
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-30 10:01   ` Neeraj upadhyay
@ 2023-10-01  0:19     ` Joel Fernandes
  2023-10-01  2:27       ` Neeraj upadhyay
  0 siblings, 1 reply; 28+ messages in thread
From: Joel Fernandes @ 2023-10-01  0:19 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: zhuangel570, paulmck, rcu, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, like.xu.linux, linussli, foxywang

On Sat, Sep 30, 2023 at 6:01 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote:
>
> On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hello,
> > Firstly, kudos to the detailed report and analysis. Rare failures are
> > hard and your usage crash/kdump is awesome to dig deeper into the
> > issue..
> >
> > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > > and destroy concurrent test, the issue happens after several weeks. Now we
> > > didn't have a way to reproduce this issue, the issue happens randomly, this
> > > is the second time we found it in this year.
> > >
> > >
> > >
> > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> > >   to SDP.
> > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> > >   faster than CPU 136, start GP, set rcu_seq_start.
> > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> > >   to start GP.
> > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> > >   "accelerate" the seq of new added entry to 292 (it should be 288).
> >
> > But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> > marked for 292? It should be ready to execute anyway even at 292. Note
> > the meaning of "acceleration", the idea is to start conservatively and
> > move the callbacks forward as more accurate information is available.
> > Considering this, 292 initially should be OK IMHO (that's just more
> > conservative than 288)..
> >
>
> Maybe it matters, as for a CPU, the callbacks will only be scheduled
> in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx]
> and ->srcu_have_cbs[idx]

Right but if I am looking at the code correctly, nothing guarantees
that srcu_invoke_callbacks is called before srcu_gp_seq can advance.
So all callbacks that were previously queued for older grace periods
should be run whenever srcu_invoke_callbacks() eventually runs. That's
why I was thinking that part looked normal to me (segments having
older GP numbers).

> > >
> > > /*
> > >  * CASE2
> > >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> > >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> > >  */
> > >
> > > /* [GP0][CPU-136] */
> > > process_srcu {
> > >  srcu_gp_end
> > > }
> > >
> > >                   /* [GP1][CPU-136] */
> > >                   synchronize_srcu {
> > >                    __call_srcu {
> > >                     rcu_segcblist_enqueue
> > >                                            /* [GP1][CPU-041] */
> > >                                            synchronize_srcu {
> > >                                             __call_srcu {
> > >                                              srcu_funnel_gp_start
> > >                                               srcu_gp_start
> > >                                             }
> > >                                            }
> > >                                            process_srcu {
> > >                                             srcu_gp_end
> > >                                              rcu_seq_end
> > >                                            }
> > >                   /* [GP1][CPU-136] */
> > >                     srcu_funnel_gp_start
> > >                    }
> > >                   }
> > > /* [GP0][CPU-136] */
> > > srcu_invoke_callbacks {
> >
> > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > expected rcu_segcblist_advance() to advance all those pending
> > callbacks to 304.
>
> If the callback is in NEXT_TAIL and not assigned GP num,
> rcu_segcblist_advance() won't move it and next accelerate in
> srcu_invoke_callbacks() will
> assign it the next gp sequence num.

Sure, and after that it will again call srcu_schedule_cbs_sdp(), so
that should be fine and the next workqueue invocation of
srcu_invoke_callbacks() can advance at that time. Right?

if (more) {
  srcu_schedule_cbs_sdp(sdp, 0);
}

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-01  0:19     ` Joel Fernandes
@ 2023-10-01  2:27       ` Neeraj upadhyay
  2023-10-01 22:32         ` Frederic Weisbecker
  2023-10-07  9:13         ` zhuangel570
  0 siblings, 2 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-01  2:27 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: zhuangel570, paulmck, rcu, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, like.xu.linux, linussli, foxywang

On Sun, Oct 1, 2023 at 5:49 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Sat, Sep 30, 2023 at 6:01 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote:
> >
> > On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > Hello,
> > > Firstly, kudos to the detailed report and analysis. Rare failures are
> > > hard and your usage crash/kdump is awesome to dig deeper into the
> > > issue..
> > >
> > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > > > and destroy concurrent test, the issue happens after several weeks. Now we
> > > > didn't have a way to reproduce this issue, the issue happens randomly, this
> > > > is the second time we found it in this year.
> > > >
> > > >
> > > >
> > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> > > >   to SDP.
> > > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> > > >   faster than CPU 136, start GP, set rcu_seq_start.
> > > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> > > >   to start GP.
> > > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> > > >   "accelerate" the seq of new added entry to 292 (it should be 288).
> > >
> > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> > > marked for 292? It should be ready to execute anyway even at 292. Note
> > > the meaning of "acceleration", the idea is to start conservatively and
> > > move the callbacks forward as more accurate information is available.
> > > Considering this, 292 initially should be OK IMHO (that's just more
> > > conservative than 288)..
> > >
> >
> > Maybe it matters, as for a CPU, the callbacks will only be scheduled
> > in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx]
> > and ->srcu_have_cbs[idx]
>
> Right but if I am looking at the code correctly, nothing guarantees
> that srcu_invoke_callbacks is called before srcu_gp_seq can advance.
> So all callbacks that were previously queued for older grace periods
> should be run whenever srcu_invoke_callbacks() eventually runs. That's
> why I was thinking that part looked normal to me (segments having
> older GP numbers).
>
> > > >
> > > > /*
> > > >  * CASE2
> > > >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> > > >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> > > >  */
> > > >
> > > > /* [GP0][CPU-136] */
> > > > process_srcu {
> > > >  srcu_gp_end
> > > > }
> > > >
> > > >                   /* [GP1][CPU-136] */
> > > >                   synchronize_srcu {
> > > >                    __call_srcu {
> > > >                     rcu_segcblist_enqueue
> > > >                                            /* [GP1][CPU-041] */
> > > >                                            synchronize_srcu {
> > > >                                             __call_srcu {
> > > >                                              srcu_funnel_gp_start
> > > >                                               srcu_gp_start
> > > >                                             }
> > > >                                            }
> > > >                                            process_srcu {
> > > >                                             srcu_gp_end
> > > >                                              rcu_seq_end
> > > >                                            }
> > > >                   /* [GP1][CPU-136] */
> > > >                     srcu_funnel_gp_start
> > > >                    }
> > > >                   }
> > > > /* [GP0][CPU-136] */
> > > > srcu_invoke_callbacks {
> > >
> > > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > > expected rcu_segcblist_advance() to advance all those pending
> > > callbacks to 304.
> >
> > If the callback is in NEXT_TAIL and not assigned GP num,
> > rcu_segcblist_advance() won't move it and next accelerate in
> > srcu_invoke_callbacks() will
> > assign it the next gp sequence num.
>
> Sure, and after that again it will call srcu_schedule_cbs_sdp() so
> that should be fine and the next workqueue invocation
> srcu_invoke_callbacks() can advance at that time. Right?
>

But "more" only checks for CBs in DONE tail. The callbacks which have been just
accelerated are not advanced to DONE tail.

Having said the above, I am still trying to figure out which callbacks
are actually being pointed to by the NEXT tail. Given that __call_srcu()
already does an advance and accelerate, all enqueued callbacks would be
in either the WAIT tail (the callbacks for the current active GP) or the
NEXT_READY tail (the callbacks for the next GP after the current one
completes). So, they should already have a GP num assigned and
srcu_invoke_callbacks() won't accelerate them. The only case I can think
of is: if the current GP completes after we sample
rcu_seq_current(&ssp->srcu_gp_seq) for rcu_segcblist_advance() (so, WAIT
tail cbs are not moved to the DONE tail) and a new GP is started before
we take the snapshot ('s') of the next GP for rcu_segcblist_accelerate(),
then the gp num 's' is greater than the gp num of NEXT_READY_TAIL and the
callback will be put in the NEXT tail. Not sure if my understanding is
correct here. Thoughts? (A concrete example with numbers follows the
snippet below.)

__call_srcu()
        rcu_segcblist_advance(&sdp->srcu_cblist,
                              rcu_seq_current(&ssp->srcu_gp_seq));
        s = rcu_seq_snap(&ssp->srcu_gp_seq);
        (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
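
To put concrete numbers on that window (sequence numbering as used earlier in
the thread; this only illustrates the rcu_seq arithmetic and is not taken from
the crash dump):

        /*
         *   ssp->srcu_gp_seq = 285          GP 284->288 still in progress
         *   rcu_seq_current() returns 285   so advance() leaves the WAIT
         *                                   callbacks tagged 288 in place
         *   GP 284->288 completes (gp_seq = 288) and another CPU starts the
         *   next GP (gp_seq = 289)
         *   rcu_seq_snap() returns 296      the end of the GP after the one
         *                                   now running
         *
         * So 's' = 296 is newer than the 288/292 tags that WAIT/NEXT_READY
         * can be carrying at this point, which is the "'s' greater than the
         * gp num of NEXT_READY_TAIL" situation described above.
         */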

Thanks
Neeraj


> if (more) {
>   srcu_schedule_cbs_sdp(sdp, 0);
> }
>
> thanks,
>
>  - Joel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-01  2:27       ` Neeraj upadhyay
@ 2023-10-01 22:32         ` Frederic Weisbecker
  2023-10-01 22:39           ` Frederic Weisbecker
  2023-10-02  2:17           ` Neeraj upadhyay
  2023-10-07  9:13         ` zhuangel570
  1 sibling, 2 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-01 22:32 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay wrote:
> 
> But "more" only checks for CBs in DONE tail. The callbacks which have been just
> accelerated are not advanced to DONE tail.
> 
> Having said above, I am still trying to figure out, which callbacks
> are actually being pointed
> by NEXT tail. Given that __call_srcu() already does a advance and
> accelerate, all enqueued
> callbacks would be in either WAIT tail (the callbacks for current
> active GP) or NEXT_READY
> tail (the callbacks for next GP after current one completes). So, they
> should already have
> GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> Only case I can
> think of is, if current GP completes after we sample
> rcu_seq_current(&ssp->srcu_gp_seq) for
> rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> and a new GP is started
> before we take snapshot ('s') of next GP  for
> rcu_segcblist_accelerate(), then the gp num 's' is greater than the
> gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> if my understanding is correct here. Thoughts?
> 
> __call_srcu()
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_gp_seq));
>         s = rcu_seq_snap(&ssp->srcu_gp_seq);
>         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);

Good point! This looks plausible.

Does the (buggy) acceleration in srcu_invoke_callbacks() exist solely
to handle that case? Because then the acceleration could be moved before
the advance in the callback handling, something like:

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 20d7a238d675..af9d8af1d321 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
+	/*
+	 * Acceleration might fail if the preceding call to
+	 * rcu_segcblist_advance() also failed due to a prior incomplete grace
+	 * period. This should be later fixed in srcu_invoke_callbacks().
+	 */
 	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
 	if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
 		sdp->srcu_gp_seq_needed = s;
@@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	ssp = sdp->ssp;
 	rcu_cblist_init(&ready_cbs);
 	spin_lock_irq_rcu_node(sdp);
+	/*
+	 * Acceleration might have failed in srcu_gp_start_if_needed() if
+	 * the preceding call to rcu_segcblist_advance() also failed due to
+	 * a prior incomplete grace period.
+	 */
+	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
+				       sdp->srcu_gp_seq_needed);
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 	if (sdp->srcu_cblist_invoking ||
@@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	 */
 	spin_lock_irq_rcu_node(sdp);
 	rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-				       rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
 	sdp->srcu_cblist_invoking = false;
 	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
 	spin_unlock_irq_rcu_node(sdp);


> 
> Thanks
> Neeraj
> 
> 
> > if (more) {
> >   srcu_schedule_cbs_sdp(sdp, 0);
> > }
> >
> > thanks,
> >
> >  - Joel

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-01 22:32         ` Frederic Weisbecker
@ 2023-10-01 22:39           ` Frederic Weisbecker
  2023-10-02  2:21             ` Neeraj upadhyay
  2023-10-02  2:17           ` Neeraj upadhyay
  1 sibling, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-01 22:39 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 02, 2023 at 12:32:41AM +0200, Frederic Weisbecker wrote:
> On Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay wrote:
> > 
> > But "more" only checks for CBs in DONE tail. The callbacks which have been just
> > accelerated are not advanced to DONE tail.
> > 
> > Having said above, I am still trying to figure out, which callbacks
> > are actually being pointed
> > by NEXT tail. Given that __call_srcu() already does a advance and
> > accelerate, all enqueued
> > callbacks would be in either WAIT tail (the callbacks for current
> > active GP) or NEXT_READY
> > tail (the callbacks for next GP after current one completes). So, they
> > should already have
> > GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> > Only case I can
> > think of is, if current GP completes after we sample
> > rcu_seq_current(&ssp->srcu_gp_seq) for
> > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> > and a new GP is started
> > before we take snapshot ('s') of next GP  for
> > rcu_segcblist_accelerate(), then the gp num 's' is greater than the
> > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> > if my understanding is correct here. Thoughts?
> > 
> > __call_srcu()
> >         rcu_segcblist_advance(&sdp->srcu_cblist,
> >                               rcu_seq_current(&ssp->srcu_gp_seq));
> >         s = rcu_seq_snap(&ssp->srcu_gp_seq);
> >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> 
> Good point! This looks plausible.
> 
> Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely
> to handle that case? Because then the acceleration could be moved before
> the advance on callbacks handling, something like:
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 20d7a238d675..af9d8af1d321 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
>  	rcu_segcblist_advance(&sdp->srcu_cblist,
>  			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
>  	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> +	/*
> +	 * Acceleration might fail if the preceding call to
> +	 * rcu_segcblist_advance() also failed due to a prior incomplete grace
> +	 * period. This should be later fixed in srcu_invoke_callbacks().
> +	 */
>  	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
>  	if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
>  		sdp->srcu_gp_seq_needed = s;
> @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>  	ssp = sdp->ssp;
>  	rcu_cblist_init(&ready_cbs);
>  	spin_lock_irq_rcu_node(sdp);
> +	/*
> +	 * Acceleration might have failed in srcu_gp_start_if_needed() if
> +	 * the preceding call to rcu_segcblist_advance() also failed due to
> +	 * a prior incomplete grace period.
> +	 */
> +	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> +				       sdp->srcu_gp_seq_needed);
>  	rcu_segcblist_advance(&sdp->srcu_cblist,
>  			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
>  	if (sdp->srcu_cblist_invoking ||
> @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>  	 */
>  	spin_lock_irq_rcu_node(sdp);
>  	rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
> -	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -				       rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
>  	sdp->srcu_cblist_invoking = false;
>  	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
>  	spin_unlock_irq_rcu_node(sdp);

And if this works, can we then remove srcu_invoke_callbacks() self-requeue?
If queued several times before it actually fires, it will catch the latest
grace period's end. And if queued while the callback runs, it will re-run.
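
For reference (generic workqueue behaviour as I recall it, nothing
SRCU-specific): queue_work() on a work item that is already pending is a
no-op, and a work item queued again while its function is running simply runs
once more after the current invocation finishes, so a handler that always
looks at the latest state naturally catches up.  A hypothetical illustration:

        #include <linux/workqueue.h>

        /* my_handler/my_work/kick are made-up names, not from srcutree.c. */
        static void my_handler(struct work_struct *work)
        {
                /* Consume whatever state is current at execution time. */
        }
        static DECLARE_WORK(my_work, my_handler);

        static void kick(void)
        {
                queue_work(system_wq, &my_work); /* no-op if already pending */
        }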

Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant?

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-01 22:32         ` Frederic Weisbecker
  2023-10-01 22:39           ` Frederic Weisbecker
@ 2023-10-02  2:17           ` Neeraj upadhyay
  2023-10-02 10:41             ` Frederic Weisbecker
  1 sibling, 1 reply; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-02  2:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 2, 2023 at 4:02 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay wrote:
> >
> > But "more" only checks for CBs in DONE tail. The callbacks which have been just
> > accelerated are not advanced to DONE tail.
> >
> > Having said above, I am still trying to figure out, which callbacks
> > are actually being pointed
> > by NEXT tail. Given that __call_srcu() already does a advance and
> > accelerate, all enqueued
> > callbacks would be in either WAIT tail (the callbacks for current
> > active GP) or NEXT_READY
> > tail (the callbacks for next GP after current one completes). So, they
> > should already have
> > GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> > Only case I can
> > think of is, if current GP completes after we sample
> > rcu_seq_current(&ssp->srcu_gp_seq) for
> > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> > and a new GP is started
> > before we take snapshot ('s') of next GP  for
> > rcu_segcblist_accelerate(), then the gp num 's' is greater than the
> > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> > if my understanding is correct here. Thoughts?
> >
> > __call_srcu()
> >         rcu_segcblist_advance(&sdp->srcu_cblist,
> >                               rcu_seq_current(&ssp->srcu_gp_seq));
> >         s = rcu_seq_snap(&ssp->srcu_gp_seq);
> >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
>
> Good point! This looks plausible.
>
> Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely
> to handle that case? Because then the acceleration could be moved before
> the advance on callbacks handling, something like:
>

I think we might need to accelerate after advance, as the  tail pointers
(WAIT, NEXT_READY) can be non-empty and we will not be able to
accelerate (and assign GP num) until we advance WAIT tail to DONE tail?
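
A sketch of why the order matters (numbers purely illustrative, segment names
from include/linux/rcu_segcblist.h; this is my reading of the segcblist rules,
not a quote of the code):

        /*
         * before:  WAIT tagged 288 (non-empty), NEXT_READY tagged 292
         *          (non-empty), the new callback sitting untagged in NEXT,
         *          current gp_seq = 292 (288 and 292 both completed),
         *          snapshot s = 296.
         *
         * accelerate(296) first: WAIT and NEXT_READY still hold their older
         *          tags, so there is nowhere to file the NEXT callback under
         *          296 and the accelerate fails.
         *
         * advance(292) first: the 288 and 292 callbacks drop into DONE,
         *          WAIT/NEXT_READY become free, and accelerate(296) can then
         *          tag and file the NEXT callback.
         */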


Thanks
Neeraj

> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 20d7a238d675..af9d8af1d321 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
>         s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> +       /*
> +        * Acceleration might fail if the preceding call to
> +        * rcu_segcblist_advance() also failed due to a prior incomplete grace
> +        * period. This should be later fixed in srcu_invoke_callbacks().
> +        */
>         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
>         if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
>                 sdp->srcu_gp_seq_needed = s;
> @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>         ssp = sdp->ssp;
>         rcu_cblist_init(&ready_cbs);
>         spin_lock_irq_rcu_node(sdp);
> +       /*
> +        * Acceleration might have failed in srcu_gp_start_if_needed() if
> +        * the preceding call to rcu_segcblist_advance() also failed due to
> +        * a prior incomplete grace period.
> +        */
> +       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> +                                      sdp->srcu_gp_seq_needed);
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
>         if (sdp->srcu_cblist_invoking ||
> @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>          */
>         spin_lock_irq_rcu_node(sdp);
>         rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
> -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
>         sdp->srcu_cblist_invoking = false;
>         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
>         spin_unlock_irq_rcu_node(sdp);
>
>
> >
> > Thanks
> > Neeraj
> >
> >
> > > if (more) {
> > >   srcu_schedule_cbs_sdp(sdp, 0);
> > > }
> > >
> > > thanks,
> > >
> > >  - Joel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-01 22:39           ` Frederic Weisbecker
@ 2023-10-02  2:21             ` Neeraj upadhyay
  2023-10-02 11:05               ` Frederic Weisbecker
  2023-10-02 22:46               ` Frederic Weisbecker
  0 siblings, 2 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-02  2:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 2, 2023 at 4:10 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Mon, Oct 02, 2023 at 12:32:41AM +0200, Frederic Weisbecker wrote:
> > On Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay wrote:
> > >
> > > But "more" only checks for CBs in DONE tail. The callbacks which have been just
> > > accelerated are not advanced to DONE tail.
> > >
> > > Having said above, I am still trying to figure out, which callbacks
> > > are actually being pointed
> > > by NEXT tail. Given that __call_srcu() already does a advance and
> > > accelerate, all enqueued
> > > callbacks would be in either WAIT tail (the callbacks for current
> > > active GP) or NEXT_READY
> > > tail (the callbacks for next GP after current one completes). So, they
> > > should already have
> > > GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> > > Only case I can
> > > think of is, if current GP completes after we sample
> > > rcu_seq_current(&ssp->srcu_gp_seq) for
> > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> > > and a new GP is started
> > > before we take snapshot ('s') of next GP  for
> > > rcu_segcblist_accelerate(), then the gp num 's' is greater than the
> > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> > > if my understanding is correct here. Thoughts?
> > >
> > > __call_srcu()
> > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > >                               rcu_seq_current(&ssp->srcu_gp_seq));
> > >         s = rcu_seq_snap(&ssp->srcu_gp_seq);
> > >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> >
> > Good point! This looks plausible.
> >
> > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely
> > to handle that case? Because then the acceleration could be moved before
> > the advance on callbacks handling, something like:
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 20d7a238d675..af9d8af1d321 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1245,6 +1245,11 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
> >       rcu_segcblist_advance(&sdp->srcu_cblist,
> >                             rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> >       s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> > +     /*
> > +      * Acceleration might fail if the preceding call to
> > +      * rcu_segcblist_advance() also failed due to a prior incomplete grace
> > +      * period. This should be later fixed in srcu_invoke_callbacks().
> > +      */
> >       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> >       if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
> >               sdp->srcu_gp_seq_needed = s;
> > @@ -1692,6 +1697,13 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> >       ssp = sdp->ssp;
> >       rcu_cblist_init(&ready_cbs);
> >       spin_lock_irq_rcu_node(sdp);
> > +     /*
> > +      * Acceleration might have failed in srcu_gp_start_if_needed() if
> > +      * the preceding call to rcu_segcblist_advance() also failed due to
> > +      * a prior incomplete grace period.
> > +      */
> > +     (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > +                                    sdp->srcu_gp_seq_needed);
> >       rcu_segcblist_advance(&sdp->srcu_cblist,
> >                             rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> >       if (sdp->srcu_cblist_invoking ||
> > @@ -1720,8 +1732,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> >        */
> >       spin_lock_irq_rcu_node(sdp);
> >       rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
> > -     (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                    rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
> >       sdp->srcu_cblist_invoking = false;
> >       more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >       spin_unlock_irq_rcu_node(sdp);
>
> And if this works, can we then remove srcu_invoke_callbacks() self-requeue?
> If queued several times before it actually fires, it will catch the latest
> grace period's end. And if queued while the callback runs, it will re-run.
>

This makes sense, but I am not sure whether the non-wq context which link [1]
mentions needs it.

> Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant?
>
I think you mean sdp->srcu_cblist_invoking ?

There was a prior discussion on this [1], where Paul mentions a
non-wq context.



Thanks
Neeraj

[1] https://lkml.org/lkml/2020/11/19/1065

> Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02  2:17           ` Neeraj upadhyay
@ 2023-10-02 10:41             ` Frederic Weisbecker
  2023-10-02 13:22               ` Neeraj upadhyay
  0 siblings, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-02 10:41 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 02, 2023 at 07:47:55AM +0530, Neeraj upadhyay wrote:
> On Mon, Oct 2, 2023 at 4:02 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay wrote:
> > >
> > > But "more" only checks for CBs in DONE tail. The callbacks which have been just
> > > accelerated are not advanced to DONE tail.
> > >
> > > Having said above, I am still trying to figure out, which callbacks
> > > are actually being pointed
> > > by NEXT tail. Given that __call_srcu() already does a advance and
> > > accelerate, all enqueued
> > > callbacks would be in either WAIT tail (the callbacks for current
> > > active GP) or NEXT_READY
> > > tail (the callbacks for next GP after current one completes). So, they
> > > should already have
> > > GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> > > Only case I can
> > > think of is, if current GP completes after we sample
> > > rcu_seq_current(&ssp->srcu_gp_seq) for
> > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> > > and a new GP is started
> > > before we take snapshot ('s') of next GP  for
> > > rcu_segcblist_accelerate(), then the gp num 's' is greater than the
> > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> > > if my understanding is correct here. Thoughts?
> > >
> > > __call_srcu()
> > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > >                               rcu_seq_current(&ssp->srcu_gp_seq));
> > >         s = rcu_seq_snap(&ssp->srcu_gp_seq);
> > >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> >
> > Good point! This looks plausible.
> >
> > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely
> > to handle that case? Because then the acceleration could be moved before
> > the advance on callbacks handling, something like:
> >
> 
> I think we might need to accelerate after advance, as the  tail pointers
> (WAIT, NEXT_READY) can be non-empty and we will not be able to
> accelerate (and assign GP num) until we advance WAIT tail to DONE tail?

Right indeed! How about the following patch then, assuming that in SRCU
1 enqueue == 1 accelerate, and therefore it only makes sense
to accelerate at enqueue time and any other accelerate call is error prone.

I can't find a scenario where the second call below to accelerate (and thus also
the second call to advance) would fail. Please tell me if I'm missing something.
Also the role of the remaining advance in srcu_gp_start() is unclear to me...

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 20d7a238d675..5ac81ca10ec8 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp)
 	spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-				       rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
 	spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
 	WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
 	WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
@@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
+	/*
+	 * Acceleration might fail if the preceding call to
+	 * rcu_segcblist_advance() also failed due to a prior grace
+	 * period seen incomplete before rcu_seq_snap(). If so then a new
+	 * call to advance will see the completed grace period and fix
+	 * the situation.
+	 */
+	if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) {
+		rcu_segcblist_advance(&sdp->srcu_cblist,
+				      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
+		WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s));
+	}
 	if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
 		sdp->srcu_gp_seq_needed = s;
 		needgp = true;
@@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	ssp = sdp->ssp;
 	rcu_cblist_init(&ready_cbs);
 	spin_lock_irq_rcu_node(sdp);
+	WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
 	if (sdp->srcu_cblist_invoking ||
@@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	 */
 	spin_lock_irq_rcu_node(sdp);
 	rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
-				       rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
 	sdp->srcu_cblist_invoking = false;
 	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
 	spin_unlock_irq_rcu_node(sdp);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02  2:21             ` Neeraj upadhyay
@ 2023-10-02 11:05               ` Frederic Weisbecker
  2023-10-02 22:46               ` Frederic Weisbecker
  1 sibling, 0 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-02 11:05 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 02, 2023 at 07:51:10AM +0530, Neeraj upadhyay wrote:
> On Mon, Oct 2, 2023 at 4:10 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> > And if this works, can we then remove srcu_invoke_callbacks() self-requeue?
> > If queued several times before it actually fires, it will catch the latest
> > grace period's end. And if queued while the callback runs, it will re-run.
> >
> 
> This makes sense, but not sure for non-wq context which  link [1] mentions,
> whether it needs it.

I don't see a problem with it, see below.

> 
> > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant?
> >
> I think you mean sdp->srcu_cblist_invoking ?

Right!

> 
> There was a prior discussion on this [1], where Paul mentions about
> non-wq context.

I don't see that function called anywhere other than from a workqueue. So in
early boot, queued works must wait for workqueue_init() (and kthread creation
and running), which is quite late in the boot process. So it looks like SRCU
isn't really usable synchronously in early boot.

Besides, this workqueue is already re-entrant. Several srcu_invoke_callbacks()
calls can already execute concurrently, which is fine thanks to the sdp locking.

So I don't see an obvious need for this re-entrancy protection (which comes
along with the requeue thing).

Thanks.


> 
> 
> 
> Thanks
> Neeraj
> 
> [1] https://lkml.org/lkml/2020/11/19/1065
> 
> > Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02 10:41             ` Frederic Weisbecker
@ 2023-10-02 13:22               ` Neeraj upadhyay
  2023-10-02 21:09                 ` Frederic Weisbecker
  0 siblings, 1 reply; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-02 13:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 2, 2023 at 4:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Mon, Oct 02, 2023 at 07:47:55AM +0530, Neeraj upadhyay wrote:
> > On Mon, Oct 2, 2023 at 4:02 AM Frederic Weisbecker <frederic@kernel.org> wrote:
> > >
> > > On Sun, Oct 01, 2023 at 07:57:14AM +0530, Neeraj upadhyay wrote:
> > > >
> > > > But "more" only checks for CBs in DONE tail. The callbacks which have been just
> > > > accelerated are not advanced to DONE tail.
> > > >
> > > > Having said above, I am still trying to figure out, which callbacks
> > > > are actually being pointed
> > > > by NEXT tail. Given that __call_srcu() already does a advance and
> > > > accelerate, all enqueued
> > > > callbacks would be in either WAIT tail (the callbacks for current
> > > > active GP) or NEXT_READY
> > > > tail (the callbacks for next GP after current one completes). So, they
> > > > should already have
> > > > GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> > > > Only case I can
> > > > think of is, if current GP completes after we sample
> > > > rcu_seq_current(&ssp->srcu_gp_seq) for
> > > > rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> > > > and a new GP is started
> > > > before we take snapshot ('s') of next GP  for
> > > > rcu_segcblist_accelerate(), then the gp num 's'
> > > > > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> > > > if my understanding is correct here. Thoughts?
> > > >
> > > > __call_srcu()
> > > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > > >                               rcu_seq_current(&ssp->srcu_gp_seq));
> > > >         s = rcu_seq_snap(&ssp->srcu_gp_seq);
> > > >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> > >
> > > Good point! This looks plausible.
> > >
> > > Does the (buggy) acceleration in srcu_invoke_callbacks() exists solely
> > > to handle that case? Because then the acceleration could be moved before
> > > the advance on callbacks handling, something like:
> > >
> >
> > I think we might need to accelerate after advance, as the  tail pointers
> > (WAIT, NEXT_READY) can be non-empty and we will not be able to
> > accelerate (and assign GP num) until we advance WAIT tail to DONE tail?
>
> Right indeed! How about the following patch then, assuming that in SRCU:
> 1 enqueue == 1 accelerate and therefore it only makes sense
> to accelerate on enqueue time and any other accelerate call is error prone.
>

Agree.

> I can't find a scenario where the second call below to accelerate (and thus also
> the second call to advance) would fail. Please tell me if I'm missing something.

Looks good to me. Few minor comments.

> Also the role of the remaining advance in srcu_gp_start() is unclear to me...
>

As far as I understand, the advance call before accelerate is there to make
space in either the WAIT or the NEXT_READY tail for acceleration. So, it can
be removed.
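
Roughly, the segment layout being discussed looks like this (macro names as in
include/linux/rcu_segcblist.h; the one-line summaries are my reading of them,
not authoritative):

#define RCU_DONE_TAIL		0	/* GP already elapsed: ready to invoke.    */
#define RCU_WAIT_TAIL		1	/* Waiting on the current GP (gp_seq[1]).  */
#define RCU_NEXT_READY_TAIL	2	/* Assigned to the next GP (gp_seq[2]).    */
#define RCU_NEXT_TAIL		3	/* No GP number assigned yet.              */
#define RCU_CBLIST_NSEGS	4
/*
 * rcu_segcblist_advance(rsclp, seq): callbacks whose assigned GP number has
 * completed relative to seq move into DONE, emptying WAIT/NEXT_READY.
 * rcu_segcblist_accelerate(rsclp, seq): callbacks still in NEXT get assigned
 * the GP number seq, provided an earlier (now empty) segment can take them,
 * which is why an advance that frees WAIT/NEXT_READY normally precedes it.
 */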

> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 20d7a238d675..5ac81ca10ec8 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp)
>         spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));

Deletion is ok; alternatively, we could have used
WARN_ON_ONCE(rcu_segcblist_accelerate(...))
in all places other than enqueue time for a few cycles, to be on the safer side.

>         spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
>         WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
>         WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
> @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
>         s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> +       /*
> +        * Acceleration might fail if the preceding call to
> +        * rcu_segcblist_advance() also failed due to a prior grace
> +        * period seen incomplete before rcu_seq_snap(). If so then a new
> +        * call to advance will see the completed grace period and fix
> +        * the situation.
> +        */
> +       if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) {

Can we also add the check below? Here old and new are the rcu_seq_current()
values used in the 2 calls to rcu_segcblist_advance().

WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new)));


Thanks
Neeraj

> +               rcu_segcblist_advance(&sdp->srcu_cblist,
> +                                     rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> +               WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s));
> +       }
>         if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
>                 sdp->srcu_gp_seq_needed = s;
>                 needgp = true;
> @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>         ssp = sdp->ssp;
>         rcu_cblist_init(&ready_cbs);
>         spin_lock_irq_rcu_node(sdp);
> +       WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
>         if (sdp->srcu_cblist_invoking ||
> @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
>          */
>         spin_lock_irq_rcu_node(sdp);
>         rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
> -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
>         sdp->srcu_cblist_invoking = false;
>         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
>         spin_unlock_irq_rcu_node(sdp);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02 13:22               ` Neeraj upadhyay
@ 2023-10-02 21:09                 ` Frederic Weisbecker
  2023-10-03 12:00                   ` Neeraj upadhyay
  2023-10-03 12:22                   ` Frederic Weisbecker
  0 siblings, 2 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-02 21:09 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 02, 2023 at 06:52:27PM +0530, Neeraj upadhyay wrote:
> On Mon, Oct 2, 2023 at 4:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > Also the role of the remaining advance in srcu_gp_start() is unclear to me...
> >
> 
> As far as I understand, the advance call before accelerate is to make
> space in one of WAIT
> or NEXT_READY tail for acceleration. So, it can be removed.

Sounds good. Will remove that.

> 
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 20d7a238d675..5ac81ca10ec8 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp)
> >         spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
> >         rcu_segcblist_advance(&sdp->srcu_cblist,
> >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
> 
> Deletion is ok; alternatively, we could have used
> WARN_ON_ONCE(rcu_segcblist_accelerate(...))
> in all places other than enqueue time for few cycles to be on safer side.

How about WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)) ?

> 
> >         spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
> >         WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
> >         WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
> > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
> >         rcu_segcblist_advance(&sdp->srcu_cblist,
> >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> >         s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> > +       /*
> > +        * Acceleration might fail if the preceding call to
> > +        * rcu_segcblist_advance() also failed due to a prior grace
> > +        * period seen incomplete before rcu_seq_snap(). If so then a new
> > +        * call to advance will see the completed grace period and fix
> > +        * the situation.
> > +        */
> > +       if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) {
> 
> We can add below also? Here old and new are rcu_seq_current() values used in
> the 2 calls to rcu_segcblist_advance().
> 
> WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new)));

Very good point! "new" should be exactly one and a half grace periods away from
"old"; will add that.

Cooking proper patches now.

Thanks.

> 
> 
> Thanks
> Neeraj
> 
> > +               rcu_segcblist_advance(&sdp->srcu_cblist,
> > +                                     rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > +               WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s));
> > +       }
> >         if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
> >                 sdp->srcu_gp_seq_needed = s;
> >                 needgp = true;
> > @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> >         ssp = sdp->ssp;
> >         rcu_cblist_init(&ready_cbs);
> >         spin_lock_irq_rcu_node(sdp);
> > +       WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
> >         rcu_segcblist_advance(&sdp->srcu_cblist,
> >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> >         if (sdp->srcu_cblist_invoking ||
> > @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> >          */
> >         spin_lock_irq_rcu_node(sdp);
> >         rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
> > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
> >         sdp->srcu_cblist_invoking = false;
> >         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >         spin_unlock_irq_rcu_node(sdp);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02  2:21             ` Neeraj upadhyay
  2023-10-02 11:05               ` Frederic Weisbecker
@ 2023-10-02 22:46               ` Frederic Weisbecker
  2023-10-03 12:06                 ` Neeraj upadhyay
  1 sibling, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-02 22:46 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 02, 2023 at 07:51:10AM +0530, Neeraj upadhyay wrote:
> > And if this works, can we then remove srcu_invoke_callbacks() self-requeue?
> > If queued several times before it actually fires, it will catch the latest
> > grace period's end. And if queued while the callback runs, it will re-run.
> >
> 
> This makes sense, but not sure for non-wq context which  link [1] mentions,
> whether it needs it.
> 
> > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant?
> >
> I think you mean sdp->srcu_cblist_invoking ?
> 
> There was a prior discussion on this [1], where Paul mentions about
> non-wq context.

Actually I realize that if we remove that, then two srcu_invoke_callbacks()
can race in such a way that an srcu barrier callback may execute before the
callbacks it is supposed to follow. And if we must keep
sdp->srcu_cblist_invoking then we must also keep self-requeuing.

I'll add a comment about that instead of removing.
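
Roughly the interleaving I have in mind, written out as a sketch (assuming
srcu_barrier() entrains its callback onto each sdp's list):

/*
 * Without ->srcu_cblist_invoking, two srcu_invoke_callbacks() instances
 * could interleave on one sdp like this:
 *
 *   worker A: extracts ready cbs [cb1, cb2], drops the lock, starts cb1
 *   srcu_barrier(): entrains cb_barrier after cb2 on this sdp
 *   worker B: extracts ready cbs [cb_barrier], invokes it immediately
 *
 * cb_barrier then completes before cb2 has run, which breaks the ordering
 * srcu_barrier() relies on.
 */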

Thanks.


> 
> 
> 
> Thanks
> Neeraj
> 
> [1] https://lkml.org/lkml/2020/11/19/1065
> 
> > Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02 21:09                 ` Frederic Weisbecker
@ 2023-10-03 12:00                   ` Neeraj upadhyay
  2023-10-03 12:22                   ` Frederic Weisbecker
  1 sibling, 0 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-03 12:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, Paul E. McKenney, rcu, josh,
	rostedt, mathieu.desnoyers, jiangshanlai, like.xu.linux,
	linussli, foxywang

On Tue, 3 Oct, 2023, 2:39 am Frederic Weisbecker, <frederic@kernel.org> wrote:
>
> Le Mon, Oct 02, 2023 at 06:52:27PM +0530, Neeraj upadhyay a écrit :
> > On Mon, Oct 2, 2023 at 4:11 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > Also the role of the remaining advance in srcu_gp_start() is unclear to me...
> > >
> >
> > As far as I understand, the advance call before accelerate is to make
> > space in one of WAIT
> > or NEXT_READY tail for acceleration. So, it can be removed.
>
> Sounds good. Will remove that.
>
> >
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 20d7a238d675..5ac81ca10ec8 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -782,8 +782,6 @@ static void srcu_gp_start(struct srcu_struct *ssp)
> > >         spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
> > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > > -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
> >
> > Deletion is ok; alternatively, we could have used
> > WARN_ON_ONCE(rcu_segcblist_accelerate(...))
> > in all places other than enqueue time for few cycles to be on safer side.
>
> How about WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL)) ?
>

Sounds good. My initial thought in suggesting
WARN_ON_ONCE(rcu_segcblist_accelerate(...)) was to catch any unexpected
accelerations for WAIT and NEXT_READY cbs. However, as all the rcu_seq_snap()
snapshots used for acceleration are ordered, I think we do not need to check
for those tails.
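
Just to make that ordering concrete, a tiny userspace model of rcu_seq_snap()
(assuming the usual 2-bit state field in the low bits of gp_seq; this is not
the kernel's code):

#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/* Model: smallest gp_seq value at which a full GP has elapsed since the call. */
static unsigned long seq_snap(unsigned long gp_seq)
{
	return (gp_seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

int main(void)
{
	/* gp_seq moving through idle (288), scan1 (289), scan2 (290), idle (292). */
	unsigned long seqs[] = { 288, 289, 290, 292 };

	for (int i = 0; i < 4; i++)
		printf("gp_seq=%lu  snap=%lu\n", seqs[i], seq_snap(seqs[i]));
	/* Prints 292, 296, 296, 296: snapshots never move backwards as gp_seq advances. */
	return 0;
}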

> >
> > >         spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
> > >         WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
> > >         WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
> > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
> > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > >         s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> > > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> > > +       /*
> > > +        * Acceleration might fail if the preceding call to
> > > +        * rcu_segcblist_advance() also failed due to a prior grace
> > > +        * period seen incomplete before rcu_seq_snap(). If so then a new
> > > +        * call to advance will see the completed grace period and fix
> > > +        * the situation.
> > > +        */
> > > +       if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) {
> >
> > We can add below also? Here old and new are rcu_seq_current() values used in
> > the 2 calls to rcu_segcblist_advance().
> >
> > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new)));
>
> Very good point! "new" should be exactly one and a half grace period away from
> "old", will add that.
>
> Cooking proper patches now.
>

Cool!


Thanks
Neeraj

> Thanks.
>
> >
> >
> > Thanks
> > Neeraj
> >
> > > +               rcu_segcblist_advance(&sdp->srcu_cblist,
> > > +                                     rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > > +               WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s));
> > > +       }
> > >         if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
> > >                 sdp->srcu_gp_seq_needed = s;
> > >                 needgp = true;
> > > @@ -1692,6 +1701,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> > >         ssp = sdp->ssp;
> > >         rcu_cblist_init(&ready_cbs);
> > >         spin_lock_irq_rcu_node(sdp);
> > > +       WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
> > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > >         if (sdp->srcu_cblist_invoking ||
> > > @@ -1720,8 +1730,6 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> > >          */
> > >         spin_lock_irq_rcu_node(sdp);
> > >         rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
> > > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > > -                                      rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq));
> > >         sdp->srcu_cblist_invoking = false;
> > >         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> > >         spin_unlock_irq_rcu_node(sdp);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02 22:46               ` Frederic Weisbecker
@ 2023-10-03 12:06                 ` Neeraj upadhyay
  0 siblings, 0 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-03 12:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Tue, Oct 3, 2023 at 4:16 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> Le Mon, Oct 02, 2023 at 07:51:10AM +0530, Neeraj upadhyay a écrit :
> > > And if this works, can we then remove srcu_invoke_callbacks() self-requeue?
> > > If queued several times before it actually fires, it will catch the latest
> > > grace period's end. And if queued while the callback runs, it will re-run.
> > >
> >
> > This makes sense, but not sure for non-wq context which  link [1] mentions,
> > whether it needs it.
> >
> > > Also why do we have sdp->srcu_invoke_callbacks ? Is that workqueue re-entrant?
> > >
> > I think you mean sdp->srcu_cblist_invoking ?
> >
> > There was a prior discussion on this [1], where Paul mentions about
> > non-wq context.
>
> Actually I realize that if we remove that, then two srcu_invoke_callbacks()
> can race in such a way that an srcu barrier callback may execute before the

I wasn't aware that two srcu_invoke_callbacks() invocations for the same sdp can
execute in parallel. My understanding was that any newly queued work will be
handled by the same kworker once it completes the currently executing one [1].
Maybe I am missing your point here?


[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/workqueue.c?h=v6.6-rc4#n1081

> callbacks it is supposed to follow. And if we must keep
> sdp->srcu_cblist_invoking then we must also keep self-requeuing.
>
> I'll add a comment about that instead of removing.
>
> Thanks.
>
>
> >
> >
> >
> > Thanks
> > Neeraj
> >
> > [1] https://lkml.org/lkml/2020/11/19/1065
> >
> > > Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-02 21:09                 ` Frederic Weisbecker
  2023-10-03 12:00                   ` Neeraj upadhyay
@ 2023-10-03 12:22                   ` Frederic Weisbecker
  2023-10-03 18:46                     ` Neeraj upadhyay
  1 sibling, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2023-10-03 12:22 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Mon, Oct 02, 2023 at 11:09:39PM +0200, Frederic Weisbecker wrote:
> > >         spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
> > >         WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
> > >         WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
> > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
> > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > >         s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> > > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> > > +       /*
> > > +        * Acceleration might fail if the preceding call to
> > > +        * rcu_segcblist_advance() also failed due to a prior grace
> > > +        * period seen incomplete before rcu_seq_snap(). If so then a new
> > > +        * call to advance will see the completed grace period and fix
> > > +        * the situation.
> > > +        */
> > > +       if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) {
> > 
> > We can add below also? Here old and new are rcu_seq_current() values used in
> > the 2 calls to rcu_segcblist_advance().
> > 
> > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new)));
> 
> Very good point! "new" should be exactly one and a half grace period away from
> "old", will add that.
> 
> Cooking proper patches now.

Actually, here is a simpler fix below. rcu_seq_snap() can be called before
rcu_segcblist_advance() after all. The only side effect is that callback
advancing then happens _after_ the full barrier in rcu_seq_snap(). I don't see
an obvious problem with that, as that barrier only cares about:

1) Ordering accesses of the update side before call_srcu() so they don't bleed
2) Seeing all the accesses prior to the grace period of the current gp_num

The only ordering that callback advancing needs is carried by the snp locking.

I still remove the accelerations elsewhere, and the advancing in srcu_gp_start(),
in further patches. I'll also avoid advancing and acceleration in
srcu_gp_start_if_needed() if there is no callback to queue.

The point is also that this simple fix alone can be easily backported and
the rest can come as cleanups.
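
To make the window concrete, a small userspace model of the arithmetic
(assuming the usual 2-bit state field in gp_seq; the helper bodies are my own
stand-ins, not the kernel's):

#include <stdio.h>

#define RCU_SEQ_STATE_MASK	3UL

static unsigned long seq_snap(unsigned long s)  { return (s + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK; }
static unsigned long seq_end(unsigned long s)   { return (s | RCU_SEQ_STATE_MASK) + 1; }
static unsigned long seq_start(unsigned long s) { return s + 1; }

int main(void)
{
	unsigned long gp_seq = 285;		/* GP ending at 288 is in progress */
	unsigned long old = gp_seq;		/* what rcu_segcblist_advance() saw: nothing completed */
	unsigned long s_fixed = seq_snap(old);	/* snapshot taken first: (285 + 7) & ~3 = 292 */

	gp_seq = seq_end(gp_seq);		/* that GP completes: 288 */
	gp_seq = seq_start(gp_seq);		/* the next GP starts: 289 */

	unsigned long s_old = seq_snap(gp_seq);	/* old ordering: (289 + 7) & ~3 = 296 */

	printf("advance saw %lu, old ordering accelerates to %lu, new ordering to %lu\n",
	       old, s_old, s_fixed);
	/*
	 * With the old ordering, s = 296 exceeds the GP number already sitting in
	 * the NEXT_READY tail, so the just-enqueued callback stays unnumbered in
	 * the NEXT tail -- the starting point of the leak analyzed in this thread.
	 * Taking the snapshot before advancing caps s at 292 instead.
	 */
	return 0;
}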


diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 5602042856b1..8b09fb37dbf3 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1244,10 +1244,10 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
 	spin_lock_irqsave_sdp_contention(sdp, &flags);
 	if (rhp)
 		rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp);
+	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
-	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
-	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
+	WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s) && rhp);
 	if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
 		sdp->srcu_gp_seq_needed = s;
 		needgp = true;

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-03 12:22                   ` Frederic Weisbecker
@ 2023-10-03 18:46                     ` Neeraj upadhyay
  0 siblings, 0 replies; 28+ messages in thread
From: Neeraj upadhyay @ 2023-10-03 18:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, zhuangel570, paulmck, rcu, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, like.xu.linux, linussli,
	foxywang

On Tue, Oct 3, 2023 at 5:52 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Mon, Oct 02, 2023 at 11:09:39PM +0200, Frederic Weisbecker wrote:
> > > >         spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
> > > >         WRITE_ONCE(ssp->srcu_sup->srcu_gp_start, jiffies);
> > > >         WRITE_ONCE(ssp->srcu_sup->srcu_n_exp_nodelay, 0);
> > > > @@ -1245,7 +1243,18 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
> > > >         rcu_segcblist_advance(&sdp->srcu_cblist,
> > > >                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> > > >         s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> > > > -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> > > > +       /*
> > > > +        * Acceleration might fail if the preceding call to
> > > > +        * rcu_segcblist_advance() also failed due to a prior grace
> > > > +        * period seen incomplete before rcu_seq_snap(). If so then a new
> > > > +        * call to advance will see the completed grace period and fix
> > > > +        * the situation.
> > > > +        */
> > > > +       if (!rcu_segcblist_accelerate(&sdp->srcu_cblist, s)) {
> > >
> > > We can add below also? Here old and new are rcu_seq_current() values used in
> > > the 2 calls to rcu_segcblist_advance().
> > >
> > > WARN_ON_ONCE(!(rcu_seq_completed_gp(old, new) && rcu_seq_new_gp(old, new)));
> >
> > Very good point! "new" should be exactly one and a half grace period away from
> > "old", will add that.
> >
> > Cooking proper patches now.
>
> Actually this more simple fix below. rcu_seq_snap() can be called before
> rcu_segcblist_advance() after all. The only side effect is that callbacks
> advancing is then _after_ the full barrier in rcu_seq_snap(). I don't see
> an obvious problem with that as that barrier only cares about:
>
> 1) Ordering accesses of the update side before call_srcu() so they don't bleed
> 2) See all the accesses prior to the grace period of the current gp_num
>
> The only things callbacks advancing need to be ordered against are carried by
> snp locking.
>

Nice! Your analysis looks good to me!

> I still remove the accelerations elsewhere and advancing in srcu_gp_start() in
> further patches. I'll also avoid advancing and acceleration in
> srcu_gp_start_if_needed if there is no callback to queue.
>
> The point is also that this simple fix alone can be easily backported and
> the rest can come as cleanups.
>

Sounds good!

>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 5602042856b1..8b09fb37dbf3 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1244,10 +1244,10 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
>         spin_lock_irqsave_sdp_contention(sdp, &flags);
>         if (rhp)
>                 rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp);
> +       s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);

We might want to add a comment here explaining why this specific ordering of
the two srcu_gp_seq reads is required.
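
Something along these lines maybe (just a sketch of the wording, on top of
your hunk):

	/*
	 * Take the snapshot for acceleration before advancing the callbacks,
	 * so that "s" can never be based on a grace-period state newer than
	 * the one the advance/accelerate pair below observes.  Otherwise a
	 * grace period could complete and a new one start in between, leaving
	 * the just-enqueued callback stranded in the NEXT tail.
	 */
	s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
	rcu_segcblist_advance(&sdp->srcu_cblist,
			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
	WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s) && rhp);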



Thanks
Neeraj

>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
> -       s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
> -       (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);
> +       WARN_ON_ONCE(!rcu_segcblist_accelerate(&sdp->srcu_cblist, s) && rhp);
>         if (ULONG_CMP_LT(sdp->srcu_gp_seq_needed, s)) {
>                 sdp->srcu_gp_seq_needed = s;
>                 needgp = true;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-09-28 21:39 ` Joel Fernandes
                     ` (3 preceding siblings ...)
  2023-09-30 10:01   ` Neeraj upadhyay
@ 2023-10-07  8:53   ` zhuangel570
  4 siblings, 0 replies; 28+ messages in thread
From: zhuangel570 @ 2023-10-07  8:53 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: paulmck, rcu, josh, rostedt, mathieu.desnoyers, jiangshanlai,
	like.xu.linux, linussli, foxywang

On Fri, Sep 29, 2023 at 5:39 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hello,
> Firstly, kudos to the detailed report and analysis. Rare failures are
> hard and your usage crash/kdump is awesome to dig deeper into the
> issue..
>
> On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> >
> > Hi,
> >
> > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > and destroy concurrent test, the issue happens after several weeks. Now we
> > didn't have a way to reproduce this issue, the issue happens randomly, this
> > is the second time we found it in this year.
> >
> > We did some investigation with crash tools, hope to get some useful clues
> > about the issue, try to find root cause of the issue (further find a way to
> > fix or workaround it), here is some of our investigation, hope to get
> > feedback from SRCU experts.
> >
> > [3144162.608392] INFO: task kworker/136:12:252684 blocked for more
> > than 122 seconds.
> > [3144162.615986]       Tainted: G           O  K   5.4.203-1-tlinux4-0011.1 #1
> > [3144162.623053] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [3144162.631162] kworker/136:12  D    0 252684      2 0x90004000
> > [3144162.631189] Workqueue: kvm-irqfd-cleanup irqfd_shutdown [kvm]
> > [3144162.631192] Call Trace:
> > [3144162.631202]  __schedule+0x2ee/0x660
> > [3144162.631206]  schedule+0x33/0xa0
> > [3144162.631209]  schedule_timeout+0x1c4/0x340
> > [3144162.631214]  ? update_load_avg+0x82/0x660
> > [3144162.631217]  ? raw_spin_rq_lock_nested+0x1f/0x30
> > [3144162.631218]  wait_for_completion+0x119/0x180
> > [3144162.631220]  ? wake_up_q+0x80/0x80
> > [3144162.631224]  __synchronize_srcu.part.19+0x81/0xb0
> > [3144162.631226]  ? __bpf_trace_rcu_utilization+0x10/0x10
> > [3144162.631227]  synchronize_srcu+0x5f/0xc0
> > [3144162.631236]  irqfd_shutdown+0x3c/0xb0 [kvm]
> > [3144162.631239]  ? __schedule+0x2f6/0x660
> > [3144162.631243]  process_one_work+0x19a/0x3a0
> > [3144162.631244]  worker_thread+0x37/0x3a0
> > [3144162.631247]  kthread+0x117/0x140
> > [3144162.631247]  ? process_one_work+0x3a0/0x3a0
> > [3144162.631248]  ? __kthread_cancel_work+0x40/0x40
> > [3144162.631250]  ret_from_fork+0x1f/0x30
> >
> > Test workload:
> > - Repeated concurrency testing, include Create VM, service probe, destroy VM.
> > - VM using virito device, so kvm->irq_srcu is using frequently.
> > - x86_64 Machine with 191 core.
> >
> > Crash analysis clues:
> > - The hung task is run on CPU 136, waiting synchronize_srcu to be completed.
> > - The current GP of SRCU struct is 304 (srcu_gp_seq and srcu_gp_seq_needed).
> > - The SNP (node) of SRCU struct has 2 levels.
> > - The SNP (node) of SRCU struct show all SNP now in GP 304 (srcu_have_cbs).
> > - The SNP (node) of SRCU struct show CPU 136 belongs to group [128,143].
> > - The SDP (per-cpu) of SRCU struct show lock and unlock matches.
> > - The SDP (per-cpu) of CPU 136 segcblist has one entry in pending list.
> > - The pending entry is just the calling entry of hung kwoker task.
> > - The SDP (per-cpu) of CPU 136 is waiting GP 288, gp_seq of segcblist is 296.
> > - The SDP (per-cpu) of CPU 136 segcblist has no entry in DONE list.
> > - The work-queue of SRCU and SDP of CPU 136 is idle.
> > - More details could be found via reference at the end of this mail.
> >
> > crash> struct srcu_struct.srcu_idx,srcu_gp_seq,srcu_gp_seq_needed,srcu_gp_seq_needed_exp,sda
> > 0xffffc900f100e878
> >         srcu_idx = 76
> >         srcu_gp_seq = 304
> >         srcu_gp_seq_needed = 304
> >         srcu_gp_seq_needed_exp = 304
> >         sda = 0x5fff9ac17800
> >
> > crash> struct srcu_data 0x5fff9ac17800:136
> > [136]: ffffe8bee9817800
> > ...
> >   srcu_cblist = {
> >     head = 0xffffc900e8413dd0,
> >     tails = {0xffffe8bee9817848, 0xffffc900e8413dd0,
> > 0xffffc900e8413dd0, 0xffffc900e8413dd0},
> >     gp_seq = {0, 296, 296, 0},
>
> I am actually a bit puzzled why the DONE segment count is 0. I would
> have expected it to be >= 296 in this situation.

The gp_seq[] of the segcblist is only updated in rcu_segcblist_accelerate()
and rcu_segcblist_advance(), and the update logic does not touch the DONE and
NEXT tails, so I think the seq of DONE and NEXT should always be zero.
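
For reference, the crash output fields map onto the segmented list roughly like
this (a sketch of the v5.4-era layout from memory, so treat the exact types as
approximate):

struct rcu_head;

struct rcu_segcblist_sketch {
	struct rcu_head *head;		/* head = 0xffffc900e8413dd0: the stranded cb   */
	struct rcu_head **tails[4];	/* [DONE, WAIT, NEXT_READY, NEXT] tail pointers */
	unsigned long gp_seq[4];	/* gp_seq = {0, 296, 296, 0}: only WAIT and     */
					/* NEXT_READY carry GP numbers, DONE/NEXT don't */
	long len;			/* len = 1: exactly one callback still queued   */
	long len_lazy;
	unsigned char enabled;
	unsigned char offloaded;
};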

>
> >     len = 1,
> >     len_lazy = 0,
> >     enabled = 1 '\001',
> >     offloaded = 0 '\000'
> >   },
> >   srcu_gp_seq_needed = 288,
> >   srcu_gp_seq_needed_exp = 0,
> >   srcu_cblist_invoking = false,
> > ...
> >
> > Root cause (we can't reproduce the issue, so the result is base on crash
> > analysis and code analysis):
> > - The leaked entry on segcblist has incorrect gp_seq (SDP is waiting 288).
> > - The leaked entry on segcblist seems has been "accelerated".
> > - The incorrect accelerate maybe caused by srcu_invoke_callbacks, because
> >   srcu_invoke_callbacks may be using incorrect GP seq doing accelerate.
> > - Here are two emulated case may cause the "accelerate issue":
> >
> >
> > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP, set rcu_seq_start.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" the seq of new added entry to 292 (it should be 288).
>
> But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> marked for 292? It should be ready to execute anyway even at 292. Note
> the meaning of "acceleration", the idea is to start conservatively and
> move the callbacks forward as more accurate information is available.
> Considering this, 292 initially should be OK IMHO (that's just more
> conservative than 288)..

As Neeraj explained, the callbacks are managed under the SDP (per-CPU), but
their scheduling is tracked under the SNP (group of CPUs).

>
> > CASE2: entry of CPU 136 belongs to GP 288 was accelerated to GP 296
> > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> >   to SDP.
> > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> >   faster than CPU 136, start GP.
> > - [CPU 041] [GP 284-288] finished, workload run faster, finish GP.
> > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> >   to start GP.
> > - [CPU 017] [GP 288-292] starting, new synchronize_srcu request, start GP,
> >   set rcu_seq_start.
> > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> >   "accelerate" new added entry to 296 (it should be 288).
>
> I had the same analysis here.
>
> > /*
> >  * CASE1
> >  * - entry of CPU 136 expecting GP 288 was accelerated to GP 292
> >  * - GP0: 280-284, GP1: 284-288.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >                 /* [GP1][CPU-136] */
> >                 synchronize_srcu {
> >                  __call_srcu {
> >                   rcu_segcblist_enqueue
> >                                          /* [GP1][CPU-041] */
> >                                          synchronize_srcu {
> >                                           __call_srcu {
> >                                            srcu_funnel_gp_start
> >                                             srcu_gp_start
> >                                              rcu_seq_start
> >                                           }
> >                                          }
> >                 /* [GP1][CPU-136] */
> >                   srcu_funnel_gp_start
> >                  }
> >                 }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > /*
> >  * CASE2
> >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> >  */
> >
> > /* [GP0][CPU-136] */
> > process_srcu {
> >  srcu_gp_end
> > }
> >
> >                   /* [GP1][CPU-136] */
> >                   synchronize_srcu {
> >                    __call_srcu {
> >                     rcu_segcblist_enqueue
> >                                            /* [GP1][CPU-041] */
> >                                            synchronize_srcu {
> >                                             __call_srcu {
> >                                              srcu_funnel_gp_start
> >                                               srcu_gp_start
> >                                             }
> >                                            }
> >                                            process_srcu {
> >                                             srcu_gp_end
> >                                              rcu_seq_end
> >                                            }
> >                   /* [GP1][CPU-136] */
> >                     srcu_funnel_gp_start
> >                    }
> >                   }
> > /* [GP0][CPU-136] */
> > srcu_invoke_callbacks {
>
> If srcu_invoke_callbacks() was really called for the rdp, I would have
> expected rcu_segcblist_advance() to advance all those pending
> callbacks to 304.
>
> I posit that probably srcu_invoke_callbacks() is not even being called
> in the first place, otherwise why is the DONE segment count still 0?
> DONE should not be 0, otherwise callbacks will not run. The segcblist
> of the rdp seems out of date perhaps due to rcu_invoke_callbacks() not
> being called due to some other bug (or a workqueue/timer issue causing
> the srcu_invoke_callbacks() to not run?).

There are no other system error messages, and the system is running fine
except for KVM-related operations such as destroying the VM (this is
because the SRCU hang is blocking the KVM global work queue). So I'd
rather think there's a bug in SRCU, and the analysis shows there may be
a race issue.

>
> I wonder if you would consider trying to reproduce it with the
> rcutorture module's SRCU testing in the chance that it occurs with it
> on your setup.
>
> thanks,
>
>  - Joel
>
>
> >                                                       /* [GP2][CPU-017] */
> >                                                       synchronize_srcu {
> >                                                        __call_srcu {
> >                                                         srcu_funnel_gp_start
> >                                                          srcu_gp_start
> >                                                           rcu_seq_start
> >                                                        }
> >                                                       }
> > /* [GP0][CPU-136] */
> >  rcu_segcblist_accelerate
> > }
> >
> >
> > Potential fix:
> > We shouldn't accelerate the segcblist in srcu_invoke_callbacks base snapshot
> > of current gp seq of SRCU struct, how about accelerate the sda base
> > sdp->srcu_gp_seq_needed, maybe like this:
> >
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 21115ffb6c44..ff8579abc924 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1197,7 +1197,7 @@ static void srcu_invoke_callbacks(struct
> > work_struct *work)
> >         spin_lock_irq_rcu_node(sdp);
> >         rcu_segcblist_insert_count(&sdp->srcu_cblist, &ready_cbs);
> >         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
> > -                                      rcu_seq_snap(&ssp->srcu_gp_seq));
> > +                                      rcu_seq_snap(&sdp->srcu_gp_seq_needed));
> >         sdp->srcu_cblist_invoking = false;
> >         more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
> >         spin_unlock_irq_rcu_node(sdp);
>
> I think "rcu_seq_snap(&ssp->srcu_gp_seq))" is correct IMHO,
> rcu_seq_snap() says which what's the gp_seq at which a full GP has
> passed from the moment rcu_seq_snap() is called. That's precisely when
> it is safe to call all those non-accelerated callbacks.
>
> Basically, if any of the bits are set in the state portion of the
> sequence of the gp_seq number, then it returns (COUNTER_PORTION + 2)
> << RCU_SEQ_CTR_SHIFT.
> Otherwise, it returns (COUNTER_PORTION + 1) << RCU_SEQ_CTR_SHIFT.
>
>
>
>
>
> >
> >
> > Thanks for reading!
> >
> >
> > Reference detail log of crash dump information:
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/bt.f.252684.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.partial.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.raw.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.lock.log
> > https://raw.githubusercontent.com/zhuangel/misc/main/debug/srcu_hung/irq_srcu.sda.raw.log



-- 
——————————
   zhuangel570
——————————

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: SRCU: kworker hung in synchronize_srcu
  2023-10-01  2:27       ` Neeraj upadhyay
  2023-10-01 22:32         ` Frederic Weisbecker
@ 2023-10-07  9:13         ` zhuangel570
  1 sibling, 0 replies; 28+ messages in thread
From: zhuangel570 @ 2023-10-07  9:13 UTC (permalink / raw)
  To: Neeraj upadhyay
  Cc: Joel Fernandes, paulmck, rcu, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, like.xu.linux, linussli, foxywang

On Sun, Oct 1, 2023 at 10:27 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote:
>
> On Sun, Oct 1, 2023 at 5:49 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Sat, Sep 30, 2023 at 6:01 AM Neeraj upadhyay <neeraj.iitr10@gmail.com> wrote:
> > >
> > > On Fri, Sep 29, 2023 at 3:35 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > Hello,
> > > > Firstly, kudos to the detailed report and analysis. Rare failures are
> > > > hard and your usage crash/kdump is awesome to dig deeper into the
> > > > issue..
> > > >
> > > > On Thu, Sep 28, 2023 at 3:59 AM zhuangel570 <zhuangel570@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > We encounter SRCU hung issue in stable tree 5.4.203, we are running VM create
> > > > > and destroy concurrent test, the issue happens after several weeks. Now we
> > > > > didn't have a way to reproduce this issue, the issue happens randomly, this
> > > > > is the second time we found it in this year.
> > > > >
> > > > >
> > > > >
> > > > > CASE1: entry of CPU 136 belongs to GP 288 was accelerated to GP 292
> > > > > - [CPU 136] [GP 280-284] finished, yet not enter srcu_invoke_callbacks.
> > > > > - [CPU 136] [GP 284-288] starting, new synchronize_srcu request, queue entry
> > > > >   to SDP.
> > > > > - [CPU 041] [GP 284-288] starting, new synchronize_srcu request, workload run
> > > > >   faster than CPU 136, start GP, set rcu_seq_start.
> > > > > - [CPU 136] [GP 284-288] starting, call srcu_funnel_gp_start, found no need
> > > > >   to start GP.
> > > > > - [CPU 136] [GP 280-284] finished, start to run srcu_invoke_callbacks,
> > > > >   "accelerate" the seq of new added entry to 292 (it should be 288).
> > > >
> > > > But srcu_gp_seq is at 304 right now. How does it matter that the CB is
> > > > marked for 292? It should be ready to execute anyway even at 292. Note
> > > > the meaning of "acceleration", the idea is to start conservatively and
> > > > move the callbacks forward as more accurate information is available.
> > > > Considering this, 292 initially should be OK IMHO (that's just more
> > > > conservative than 288)..
> > > >
> > >
> > > Maybe it matters, as for a CPU, the callbacks will only be scheduled
> > > in srcu_gp_end() for the GPs, for which it has updated ->srcu_data_have_cbs[idx]
> > > and ->srcu_have_cbs[idx]
> >
> > Right but if I am looking at the code correctly, nothing guarantees
> > that srcu_invoke_callbacks is called before srcu_gp_seq can advance.
> > So all callbacks that were previously queued for older grace periods
> > should be run whenever srcu_invoke_callbacks() eventually runs. That's
> > why I was thinking that part looked normal to me (segments having
> > older GP numbers).
> >
> > > > >
> > > > > /*
> > > > >  * CASE2
> > > > >  * - entry of CPU 136 belongs to GP 288 was accelerated to GP 296.
> > > > >  * - GP0: 280-284, GP1: 284-288, GP2: 288-292.
> > > > >  */
> > > > >
> > > > > /* [GP0][CPU-136] */
> > > > > process_srcu {
> > > > >  srcu_gp_end
> > > > > }
> > > > >
> > > > >                   /* [GP1][CPU-136] */
> > > > >                   synchronize_srcu {
> > > > >                    __call_srcu {
> > > > >                     rcu_segcblist_enqueue
> > > > >                                            /* [GP1][CPU-041] */
> > > > >                                            synchronize_srcu {
> > > > >                                             __call_srcu {
> > > > >                                              srcu_funnel_gp_start
> > > > >                                               srcu_gp_start
> > > > >                                             }
> > > > >                                            }
> > > > >                                            process_srcu {
> > > > >                                             srcu_gp_end
> > > > >                                              rcu_seq_end
> > > > >                                            }
> > > > >                   /* [GP1][CPU-136] */
> > > > >                     srcu_funnel_gp_start
> > > > >                    }
> > > > >                   }
> > > > > /* [GP0][CPU-136] */
> > > > > srcu_invoke_callbacks {
> > > >
> > > > If srcu_invoke_callbacks() was really called for the rdp, I would have
> > > > expected rcu_segcblist_advance() to advance all those pending
> > > > callbacks to 304.
> > >
> > > If the callback is in NEXT_TAIL and not assigned GP num,
> > > rcu_segcblist_advance() won't move it and next accelerate in
> > > srcu_invoke_callbacks() will
> > > assign it the next gp sequence num.
> >
> > Sure, and after that again it will call srcu_schedule_cbs_sdp() so
> > that should be fine and the next workqueue invocation
> > srcu_invoke_callbacks() can advance at that time. Right?
> >
>
> But "more" only checks for CBs in DONE tail. The callbacks which have been just
> accelerated are not advanced to DONE tail.
>
> Having said above, I am still trying to figure out, which callbacks
> are actually being pointed
> by NEXT tail. Given that __call_srcu() already does a advance and
> accelerate, all enqueued
> callbacks would be in either WAIT tail (the callbacks for current
> active GP) or NEXT_READY
> tail (the callbacks for next GP after current one completes). So, they
> should already have
> GP num assigned and srcu_invoke_callbacks() won't accelerate them.
> Only case I can
> think of is, if current GP completes after we sample
> rcu_seq_current(&ssp->srcu_gp_seq) for
> rcu_segcblist_advance() (so, WAIT tail cbs are not moved to DONE tail)
> and a new GP is started
> before we take snapshot ('s') of next GP  for
> rcu_segcblist_accelerate(), then the gp num 's'
> > gp num of NEXT_READY_TAIL and will be put in NEXT tail. Not sure
> if my understanding is correct here. Thoughts?
>
> __call_srcu()
>         rcu_segcblist_advance(&sdp->srcu_cblist,
>                               rcu_seq_current(&ssp->srcu_gp_seq));
>         s = rcu_seq_snap(&ssp->srcu_gp_seq);
>         (void)rcu_segcblist_accelerate(&sdp->srcu_cblist, s);

Thanks for helping to confirm this!

What I wanted to explain with my CASEs (which may not be totally correct; I
reconstructed them from the hung context) is that rcu_segcblist_advance() and
rcu_segcblist_accelerate() in srcu_invoke_callbacks() may be using mismatched
GP seqs. This makes the callbacks get "accelerated" to the wrong GP, one that
has no CBS event on that SNP (CPU group), so the callback is leaked in the
segcblist and __synchronize_srcu() hangs.

>
> Thanks
> Neeraj
>
>
> > if (more) {
> >   srcu_schedule_cbs_sdp(sdp, 0);
> > }
> >
> > thanks,
> >
> >  - Joel

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-10-07  9:14 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-28  7:59 SRCU: kworker hung in synchronize_srcu zhuangel570
2023-09-28 21:39 ` Joel Fernandes
2023-09-28 21:40   ` Joel Fernandes
2023-09-29  1:50   ` Zhouyi Zhou
2023-09-29 22:44   ` Frederic Weisbecker
2023-09-30  2:45     ` Neeraj upadhyay
2023-09-30  7:30       ` Frederic Weisbecker
2023-09-30  9:10       ` Frederic Weisbecker
2023-09-30  9:48         ` Neeraj upadhyay
2023-09-30 10:01   ` Neeraj upadhyay
2023-10-01  0:19     ` Joel Fernandes
2023-10-01  2:27       ` Neeraj upadhyay
2023-10-01 22:32         ` Frederic Weisbecker
2023-10-01 22:39           ` Frederic Weisbecker
2023-10-02  2:21             ` Neeraj upadhyay
2023-10-02 11:05               ` Frederic Weisbecker
2023-10-02 22:46               ` Frederic Weisbecker
2023-10-03 12:06                 ` Neeraj upadhyay
2023-10-02  2:17           ` Neeraj upadhyay
2023-10-02 10:41             ` Frederic Weisbecker
2023-10-02 13:22               ` Neeraj upadhyay
2023-10-02 21:09                 ` Frederic Weisbecker
2023-10-03 12:00                   ` Neeraj upadhyay
2023-10-03 12:22                   ` Frederic Weisbecker
2023-10-03 18:46                     ` Neeraj upadhyay
2023-10-07  9:13         ` zhuangel570
2023-10-07  8:53   ` zhuangel570
2023-09-30 10:11 ` Neeraj upadhyay
