* [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
@ 2024-01-23  0:56 Dan Moulding
  2024-01-23  1:08 ` Song Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Dan Moulding @ 2024-01-23  0:56 UTC (permalink / raw)
  To: Song Liu
  Cc: regressions, linux-raid, linux-kernel, stable, Junxiao Bi,
	Greg Kroah-Hartman, Dan Moulding

After upgrading from 6.7.0 to 6.7.1 a couple of my systems with md
RAID-5 arrays started experiencing hangs. It starts with some
processes which write to the array getting stuck. The whole system
eventually becomes unresponsive and unclean shutdown must be performed
(poweroff and reboot don't work).

While trying to diagnose the issue, I noticed that the md0_raid5
kernel thread consumes 100% CPU after the issue occurs. No relevant
warnings or errors were found in dmesg.

On 6.7.1, I can reproduce the issue somewhat reliably by copying a
large amount of data to the array. I am unable to reproduce the issue
at all on 6.7.0. The bisection was a bit difficult since I don't have
a 100% reliable method to reproduce the problem, but with some
perseverance I eventually managed to whittle it down to commit
0de40f76d567 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d"). After reverting that commit (i.e. reapplying the reverted
commit) on top of 6.7.1 I can no longer reproduce the problem at all.
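
For reference, the bisection amounted to roughly the following,
assuming a checkout of the stable tree and bisecting between the two
release tags (the copy step and mount point below are only
illustrative):

    git bisect start v6.7.1 v6.7
    # build and boot the candidate kernel, then try to reproduce:
    cp -a /path/to/large/dataset /mnt/md0/ && sync
    # mark the result and repeat until git names the first bad commit
    git bisect bad     # writes hang and md0_raid5 spins at 100% CPU
    git bisect good    # no hang after a large copy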

Some details that might be relevant:
- Both systems are running MD RAID-5 with a journal device.
- mdadm in monitor mode is always running on both systems.
- Both systems were previously running 6.7.0 and earlier just fine.
- The older of the two systems has been running a raid5 array without
  incident for many years (kernel going back to at least 5.1) -- this
  is the first raid5 issue it has encountered.

Please let me know if there is any other helpful information that I
might be able to provide.

-- Dan

#regzbot introduced: 0de40f76d567133b871cd6ad46bb87afbce46983


* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
@ 2024-01-23  1:08 ` Song Liu
  2024-01-23  1:35 ` Dan Moulding
  2024-02-20 23:06 ` Dan Moulding
  2 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-01-23  1:08 UTC (permalink / raw)
  To: Dan Moulding
  Cc: regressions, linux-raid, linux-kernel, stable, Junxiao Bi,
	Greg Kroah-Hartman, Yu Kuai

On Mon, Jan 22, 2024 at 4:57 PM Dan Moulding <dan@danm.net> wrote:
>
> After upgrading from 6.7.0 to 6.7.1 a couple of my systems with md
> RAID-5 arrays started experiencing hangs. It starts with some
> processes which write to the array getting stuck. The whole system
> eventually becomes unresponsive and unclean shutdown must be performed
> (poweroff and reboot don't work).
>
> While trying to diagnose the issue, I noticed that the md0_raid5
> kernel thread consumes 100% CPU after the issue occurs. No relevant
> warnings or errors were found in dmesg.
>
> On 6.7.1, I can reproduce the issue somewhat reliably by copying a
> large amount of data to the array. I am unable to reproduce the issue
> at all on 6.7.0. The bisection was a bit difficult since I don't have
> a 100% reliable method to reproduce the problem, but with some
> perseverance I eventually managed to whittle it down to commit
> 0de40f76d567 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"). After reverting that commit (i.e. reapplying the reverted
> commit) on top of 6.7.1 I can no longer reproduce the problem at all.
>
> Some details that might be relevant:
> - Both systems are running MD RAID-5 with a journal device.
> - mdadm in monitor mode is always running on both systems.
> - Both systems were previously running 6.7.0 and earlier just fine.
> - The older of the two systems has been running a raid5 array without
>   incident for many years (kernel going back to at least 5.1) -- this
>   is the first raid5 issue it has encountered.
>
> Please let me know if there is any other helpful information that I
> might be able to provide.

Thanks for the report, and sorry for the problem.

We are looking into some regressions that are probably related to this.
We will fix the issue ASAP.

Song


* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
  2024-01-23  1:08 ` Song Liu
@ 2024-01-23  1:35 ` Dan Moulding
  2024-01-23  6:35   ` Song Liu
  2024-02-20 23:06 ` Dan Moulding
  2 siblings, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-01-23  1:35 UTC (permalink / raw)
  To: dan
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions, song, stable

Some additional new information: I realized after filing this report
that on the mainline there is a second commit, part of a pair, that
was supposed to go with commit 0de40f76d567. That second commit
upstream is d6e035aad6c0 ("md: bypass block throttle for superblock
update"). That commit probably also was supposed to have been
backported to stable along with the first, but was not, since it
provides what is supposed to be a replacement for the fix that has
been reverted.

So I rebuilt my kernel with the missed commit also backported instead
of just reverting the first commit (i.e. I have now built 6.7.1 with
just commit d6e035aad6c0 on top). Unfortunately, I can still reproduce
the hang after applying this second commit. So it looks
like even with that fix applied the regression is still present.
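
For anyone who wants to reproduce that build: it is just the v6.7.1
tag with the mainline commit cherry-picked on top, along these lines
(the branch and remote names are placeholders):

    git checkout -b md-test v6.7.1
    git fetch mainline             # any remote with Linus' tree, so the SHA exists locally
    git cherry-pick d6e035aad6c0   # "md: bypass block throttle for superblock update"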

Coincidentally, I see it seems this second commit was picked up for
inclusion in 6.7.2 just today. I think that needs to NOT be
done. Instead the stable series should probably revert 0de40f76d567
until the regression is successfully dealt with on master. Probably no
further changes related to this patch series should be backported
until then.

Cheers,

-- Dan


* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23  1:35 ` Dan Moulding
@ 2024-01-23  6:35   ` Song Liu
  2024-01-23 21:53     ` Dan Moulding
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-01-23  6:35 UTC (permalink / raw)
  To: Dan Moulding, Yu Kuai
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions, stable

Hi Dan,

On Mon, Jan 22, 2024 at 5:35 PM Dan Moulding <dan@danm.net> wrote:
>
> Some additional new information: I realized after filing this report
> that on the mainline there is a second commit, part of a pair, that
> was supposed to go with commit 0de40f76d567. That second commit
> upstream is d6e035aad6c0 ("md: bypass block throttle for superblock
> update"). That commit probably also was supposed to have been
> backported to stable along with the first, but was not, since it
> provides what is supposed to be a replacement for the fix that has
> been reverted.
>
> So I rebuilt my kernel with the missed commit also backported instead
> of just reverting the first commit (i.e. I have now built 6.7.1 with
> just commit d6e035aad6c0 on top). Unfortunately, I can still reproduce
> the hang after applying this second commit. So it looks
> like even with that fix applied the regression is still present.
>
> Coincidentally, I see it seems this second commit was picked up for
> inclusion in 6.7.2 just today. I think that needs to NOT be
> done. Instead the stable series should probably revert 0de40f76d567
> until the regression is successfully dealt with on master. Probably no
> further changes related to this patch series should be backported
> until then.

I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
d6e035aad6c0 + revert 0de40f76d567).

OTOH, I am not able to reproduce the issue. Could you please help
get more information:
  cat /proc/mdstat
  profile (perf, etc.) of the md thread

Thanks,
Song


* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23  6:35   ` Song Liu
@ 2024-01-23 21:53     ` Dan Moulding
  2024-01-23 22:21       ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-01-23 21:53 UTC (permalink / raw)
  To: song
  Cc: dan, gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	stable, yukuai1

> I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
> 0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
> d6e035aad6c0 + revert 0de40f76d567).

I was operating under the assumption that the two commits were
intended to exist as a pair (the one reverts the old fix, because the
next commit has what is supposed to be a better fix). But since the
regression still exists, even with both patches applied, the old fix
must be reapplied to resolve the current regression.

But, as you've requested, I have tested 6.7.1 + d6e035aad6c0 + revert
0de40f76d567 and it seems fine. So I have no issue if you think it
makes sense to accept d6e035aad6c0 on its own, even though it would
break up the pair of commits.
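
In terms of commands, that combination is just the cherry-pick from my
previous message plus one more step on the same branch:

    # re-applies "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"
    git revert 0de40f76d567

The revert of the mainline commit should apply cleanly here, since
6.7.1 carries the same change as a stable backport.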

> OTOH, I am not able to reproduce the issue. Could you please help
> get more information:
>   cat /proc/mdstat

Here is /proc/mdstat from one of the systems where I can reproduce it:

    $ cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid5 dm-0[4](J) sdc[3] sda[0] sdb[1]
          3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

    unused devices: <none>

dm-0 is an LVM logical volume which is backed by an NVMe SSD. The
others are run-of-the-mill SATA SSDs.
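
If a fuller picture of the array and journal layout would help, I can
also send the output of something like:

    mdadm --detail /dev/md0           # member roles, journal device, array state
    lsblk -o NAME,TYPE,SIZE,MODEL     # sizes/models of the underlying disks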

>  profile (perf, etc.) of the md thread

I might need a little more pointing in the direction of what exactly
to look for and under what conditions (i.e. should I run perf while
the thread is stuck in the 100% CPU loop? what kind of report should I
ask perf for?). Also, are there any debug options I could enable in
the kernel configuration that might help gather more information?
Maybe something in debugfs? I currently get absolutely no warnings or
errors in dmesg when the problem occurs.

Cheers,

-- Dan


* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23 21:53     ` Dan Moulding
@ 2024-01-23 22:21       ` Song Liu
  2024-01-23 23:58         ` Dan Moulding
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-01-23 22:21 UTC (permalink / raw)
  To: Dan Moulding
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	stable, yukuai1

Hi Dan,

On Tue, Jan 23, 2024 at 1:53 PM Dan Moulding <dan@danm.net> wrote:
>
> > I think we still want d6e035aad6c0 in 6.7.2. We may need to revert
> > 0de40f76d567 on top of that. Could you please test it out? (6.7.1 +
> > d6e035aad6c0 + revert 0de40f76d567).
>
> I was operating under the assumption that the two commits were
> intended to exist as a pair (the one reverts the old fix, because the
> next commit has what is supposed to be a better fix). But since the
> regression still exists, even with both patches applied, the old fix
> must be reapplied to resolve the current regression.
>
> But, as you've requested, I have tested 6.7.1 + d6e035aad6c0 + revert
> 0de40f76d567 and it seems fine. So I have no issue if you think it
> makes sense to accept d6e035aad6c0 on its own, even though it would
> break up the pair of commits.

Thanks for running the test!

>
> > OTOH, I am not able to reproduce the issue. Could you please help
> > get more information:
> >   cat /proc/mdstat
>
> Here is /proc/mdstat from one of the systems where I can reproduce it:
>
>     $ cat /proc/mdstat
>     Personalities : [raid6] [raid5] [raid4]
>     md0 : active raid5 dm-0[4](J) sdc[3] sda[0] sdb[1]
>           3906764800 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>
>     unused devices: <none>
>
> dm-0 is an LVM logical volume which is backed by an NVMe SSD. The
> others are run-of-the-mill SATA SSDs.
>
> >  profile (perf, etc.) of the md thread
>
> I might need a little more pointing in the direction of what exactly
> to look for and under what conditions (i.e. should I run perf while
> the thread is stuck in the 100% CPU loop? what kind of report should I
> ask perf for?). Also, are there any debug options I could enable in
> the kernel configuration that might help gather more information?
> Maybe something in debugfs? I currently get absolutely no warnings or
> errors in dmesg when the problem occurs.

It appears the md thread has hit some infinite loop, so I would like to
know what it is doing. We can probably get the information with the
perf tool, something like:

perf record -a
perf report
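
If the system-wide profile turns out to be too noisy, a call-graph
profile of just the stuck thread should also work, something like:

    # while md0_raid5 is spinning at 100% CPU:
    perf record -g -p "$(pgrep -x md0_raid5)" -- sleep 30
    perf report --stdio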

Thanks,
Song


* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23 22:21       ` Song Liu
@ 2024-01-23 23:58         ` Dan Moulding
  2024-01-25  0:01           ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-01-23 23:58 UTC (permalink / raw)
  To: song
  Cc: dan, gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	stable, yukuai1

> It appears the md thread has hit some infinite loop, so I would like to
> know what it is doing. We can probably get the information with the
> perf tool, something like:
>
> perf record -a
> perf report

Here you go!

# Total Lost Samples: 0
#
# Samples: 78K of event 'cycles'
# Event count (approx.): 83127675745
#
# Overhead  Command          Shared Object                   Symbol
# ........  ...............  ..............................  ...................................................
#
    49.31%  md0_raid5        [kernel.kallsyms]               [k] handle_stripe
    18.63%  md0_raid5        [kernel.kallsyms]               [k] ops_run_io
     6.07%  md0_raid5        [kernel.kallsyms]               [k] handle_active_stripes.isra.0
     5.50%  md0_raid5        [kernel.kallsyms]               [k] do_release_stripe
     3.09%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
     2.48%  md0_raid5        [kernel.kallsyms]               [k] r5l_write_stripe
     1.89%  md0_raid5        [kernel.kallsyms]               [k] md_wakeup_thread
     1.45%  ksmd             [kernel.kallsyms]               [k] ksm_scan_thread
     1.37%  md0_raid5        [kernel.kallsyms]               [k] stripe_is_lowprio
     0.87%  ksmd             [kernel.kallsyms]               [k] memcmp
     0.68%  ksmd             [kernel.kallsyms]               [k] xxh64
     0.56%  md0_raid5        [kernel.kallsyms]               [k] __wake_up_common
     0.52%  md0_raid5        [kernel.kallsyms]               [k] __wake_up
     0.46%  ksmd             [kernel.kallsyms]               [k] mtree_load
     0.44%  ksmd             [kernel.kallsyms]               [k] try_grab_page
     0.40%  ksmd             [kernel.kallsyms]               [k] follow_p4d_mask.constprop.0
     0.39%  md0_raid5        [kernel.kallsyms]               [k] r5l_log_disk_error
     0.37%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irq
     0.33%  md0_raid5        [kernel.kallsyms]               [k] release_stripe_list
     0.31%  md0_raid5        [kernel.kallsyms]               [k] release_inactive_stripe_list
     0.31%  ksmd             [kernel.kallsyms]               [k] get_ksm_page
     0.30%  md0_raid5        [kernel.kallsyms]               [k] __cond_resched
     0.28%  md0_raid5        [kernel.kallsyms]               [k] mutex_unlock
     0.28%  ksmd             [kernel.kallsyms]               [k] _raw_spin_lock
     0.27%  swapper          [kernel.kallsyms]               [k] intel_idle
     0.26%  md0_raid5        [kernel.kallsyms]               [k] mutex_lock
     0.24%  md0_raid5        [kernel.kallsyms]               [k] rcu_all_qs
     0.22%  md0_raid5        [kernel.kallsyms]               [k] r5c_is_writeback
     0.20%  md0_raid5        [kernel.kallsyms]               [k] __lock_text_start
     0.18%  ksmd             [kernel.kallsyms]               [k] up_read
     0.18%  ksmd             [kernel.kallsyms]               [k] down_read
     0.17%  md0_raid5        [kernel.kallsyms]               [k] raid5d
     0.15%  ksmd             [kernel.kallsyms]               [k] follow_trans_huge_pmd
     0.13%  kworker/u16:3-e  [kernel.kallsyms]               [k] ioread32
     0.13%  kworker/u16:1-e  [kernel.kallsyms]               [k] ioread32
     0.12%  ksmd             [kernel.kallsyms]               [k] follow_page_pte
     0.11%  md0_raid5        [kernel.kallsyms]               [k] r5l_flush_stripe_to_raid
     0.11%  ksmd             [kernel.kallsyms]               [k] follow_page
     0.11%  ksmd             [kernel.kallsyms]               [k] memcmp_pages
     0.10%  swapper          [kernel.kallsyms]               [k] poll_idle
     0.08%  ksmd             [kernel.kallsyms]               [k] mtree_range_walk
     0.07%  ksmd             [kernel.kallsyms]               [k] __cond_resched
     0.07%  ksmd             [kernel.kallsyms]               [k] rcu_all_qs
     0.06%  ksmd             [kernel.kallsyms]               [k] __pte_offset_map_lock
     0.04%  ksmd             [kernel.kallsyms]               [k] __pte_offset_map
     0.03%  md0_raid5        [kernel.kallsyms]               [k] llist_reverse_order
     0.03%  md0_raid5        [kernel.kallsyms]               [k] r5l_write_stripe_run
     0.02%  swapper          [kernel.kallsyms]               [k] menu_select
     0.02%  ksmd             [kernel.kallsyms]               [k] rb_insert_color
     0.02%  ksmd             [kernel.kallsyms]               [k] vm_normal_page
     0.02%  swapper          [kernel.kallsyms]               [k] cpuidle_enter_state
     0.01%  md0_raid5        [kernel.kallsyms]               [k] r5l_submit_current_io
     0.01%  ksmd             [kernel.kallsyms]               [k] vma_is_secretmem
     0.01%  swapper          [kernel.kallsyms]               [k] alx_mask_msix
     0.01%  ksmd             [kernel.kallsyms]               [k] remove_rmap_item_from_tree
     0.01%  swapper          [kernel.kallsyms]               [k] lapic_next_deadline
     0.01%  swapper          [kernel.kallsyms]               [k] read_tsc
     0.01%  ksmd             [kernel.kallsyms]               [k] mas_walk
     0.01%  swapper          [kernel.kallsyms]               [k] do_idle
     0.01%  md0_raid5        [kernel.kallsyms]               [k] perf_adjust_freq_unthr_context
     0.01%  md0_raid5        [kernel.kallsyms]               [k] lapic_next_deadline
     0.01%  swapper          [kernel.kallsyms]               [k] perf_adjust_freq_unthr_context
     0.01%  swapper          [kernel.kallsyms]               [k] __switch_to_asm
     0.01%  swapper          [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
     0.01%  swapper          [kernel.kallsyms]               [k] native_irq_return_iret
     0.01%  swapper          [kernel.kallsyms]               [k] arch_scale_freq_tick
     0.01%  kworker/u16:3-e  [kernel.kallsyms]               [k] lapic_next_deadline
     0.00%  swapper          [kernel.kallsyms]               [k] __hrtimer_next_event_base
     0.00%  ksmd             [kernel.kallsyms]               [k] calc_checksum
     0.00%  swapper          [kernel.kallsyms]               [k] psi_group_change
     0.00%  swapper          [kernel.kallsyms]               [k] timerqueue_add
     0.00%  ksmd             [kernel.kallsyms]               [k] mas_find
     0.00%  swapper          [kernel.kallsyms]               [k] __schedule
     0.00%  swapper          [kernel.kallsyms]               [k] ioread32
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] _aesni_enc4
     0.00%  swapper          [kernel.kallsyms]               [k] rb_next
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] lapic_next_deadline
     0.00%  swapper          [kernel.kallsyms]               [k] ktime_get
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] psi_group_change
     0.00%  Qt bearer threa  [kernel.kallsyms]               [k] alx_update_hw_stats
     0.00%  swapper          [kernel.kallsyms]               [k] cpuidle_enter
     0.00%  swapper          [kernel.kallsyms]               [k] update_sd_lb_stats.constprop.0
     0.00%  swapper          [kernel.kallsyms]               [k] ct_kernel_exit_state
     0.00%  swapper          [kernel.kallsyms]               [k] nr_iowait_cpu
     0.00%  swapper          [kernel.kallsyms]               [k] sched_clock_noinstr
     0.00%  swapper          [kernel.kallsyms]               [k] psi_flags_change
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_stop_idle
     0.00%  swapper          [kernel.kallsyms]               [k] __run_timers.part.0
     0.00%  swapper          [kernel.kallsyms]               [k] native_apic_msr_eoi
     0.00%  swapper          [kernel.kallsyms]               [k] __update_load_avg_se
     0.00%  md0_raid5        [kernel.kallsyms]               [k] __intel_pmu_enable_all.isra.0
     0.00%  md0_raid5        [kernel.kallsyms]               [k] update_vsyscall
     0.00%  md0_raid5        [kernel.kallsyms]               [k] arch_scale_freq_tick
     0.00%  md0_raid5        [kernel.kallsyms]               [k] read_tsc
     0.00%  md0_raid5        [kernel.kallsyms]               [k] x86_pmu_disable
     0.00%  md0_raid5        [kernel.kallsyms]               [k] __update_load_avg_cfs_rq
     0.00%  swapper          [kernel.kallsyms]               [k] rcu_sched_clock_irq
     0.00%  perf             [kernel.kallsyms]               [k] rep_movs_alternative
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_active
     0.00%  swapper          [kernel.kallsyms]               [k] newidle_balance.isra.0
     0.00%  swapper          [kernel.kallsyms]               [k] _raw_spin_lock_irq
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] aesni_xts_encrypt
     0.00%  swapper          [kernel.kallsyms]               [k] enqueue_task_fair
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_idle_stop_tick
     0.00%  swapper          [kernel.kallsyms]               [k] leave_mm
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] sched_clock_noinstr
     0.00%  plasmashell      [kernel.kallsyms]               [k] ext4fs_dirhash
     0.00%  kwin_x11         [kernel.kallsyms]               [k] ioread32
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] ioread32
     0.00%  swapper          [kernel.kallsyms]               [k] memchr_inv
     0.00%  swapper          [kernel.kallsyms]               [k] pick_next_task_fair
     0.00%  swapper          [kernel.kallsyms]               [k] tick_sched_do_timer
     0.00%  swapper          [kernel.kallsyms]               [k] ktime_get_update_offsets_now
     0.00%  swapper          [kernel.kallsyms]               [k] __update_load_avg_cfs_rq
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] psi_group_change
     0.00%  swapper          [kernel.kallsyms]               [k] ct_kernel_exit.constprop.0
     0.00%  ksmd             [kernel.kallsyms]               [k] psi_task_switch
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_next_event
     0.00%  swapper          [kernel.kallsyms]               [k] clockevents_program_event
     0.00%  swapper          [kernel.kallsyms]               [k] __sysvec_apic_timer_interrupt
     0.00%  swapper          [kernel.kallsyms]               [k] enqueue_entity
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] schedule
     0.00%  swapper          [kernel.kallsyms]               [k] get_cpu_device
     0.00%  swapper          [kernel.kallsyms]               [k] scheduler_tick
     0.00%  swapper          [kernel.kallsyms]               [k] tick_check_oneshot_broadcast_this_cpu
     0.00%  swapper          [kernel.kallsyms]               [k] switch_mm_irqs_off
     0.00%  swapper          [kernel.kallsyms]               [k] calc_load_nohz_stop
     0.00%  swapper          [kernel.kallsyms]               [k] _raw_spin_lock
     0.00%  swapper          [kernel.kallsyms]               [k] nohz_run_idle_balance
     0.00%  swapper          [kernel.kallsyms]               [k] rcu_note_context_switch
     0.00%  swapper          [kernel.kallsyms]               [k] run_timer_softirq
     0.00%  swapper          [kernel.kallsyms]               [k] kthread_is_per_cpu
     0.00%  swapper          [kernel.kallsyms]               [k] x86_pmu_disable
     0.00%  ksoftirqd/4      [kernel.kallsyms]               [k] rcu_cblist_dequeue
     0.00%  init             init                            [.] 0x0000000000008874
     0.00%  swapper          [kernel.kallsyms]               [k] ct_kernel_enter.constprop.0
     0.00%  swapper          [kernel.kallsyms]               [k] update_rq_clock.part.0
     0.00%  swapper          [kernel.kallsyms]               [k] __dequeue_entity
     0.00%  swapper          [kernel.kallsyms]               [k] ttwu_queue_wakelist
     0.00%  swapper          [kernel.kallsyms]               [k] __hrtimer_run_queues
     0.00%  swapper          [kernel.kallsyms]               [k] select_task_rq_fair
     0.00%  md0_raid5        [kernel.kallsyms]               [k] update_wall_time
     0.00%  md0_raid5        [kernel.kallsyms]               [k] ntp_tick_length
     0.00%  md0_raid5        [kernel.kallsyms]               [k] trigger_load_balance
     0.00%  md0_raid5        [kernel.kallsyms]               [k] acct_account_cputime
     0.00%  md0_raid5        [kernel.kallsyms]               [k] ktime_get
     0.00%  md0_raid5        [kernel.kallsyms]               [k] timerqueue_add
     0.00%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock
     0.00%  md0_raid5        [kernel.kallsyms]               [k] tick_do_update_jiffies64
     0.00%  md0_raid5        [kernel.kallsyms]               [k] native_irq_return_iret
     0.00%  md0_raid5        [kernel.kallsyms]               [k] ktime_get_update_offsets_now
     0.00%  swapper          [kernel.kallsyms]               [k] asm_sysvec_apic_timer_interrupt
     0.00%  swapper          [kernel.kallsyms]               [k] update_blocked_averages
     0.00%  md0_raid5        [kernel.kallsyms]               [k] error_entry
     0.00%  md0_raid5        [kernel.kallsyms]               [k] rcu_sched_clock_irq
     0.00%  md0_raid5        [kernel.kallsyms]               [k] native_apic_msr_eoi
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_highres_handler
     0.00%  md0_raid5        [kernel.kallsyms]               [k] irq_work_tick
     0.00%  ksmd             [kernel.kallsyms]               [k] __mod_timer
     0.00%  ksmd             [kernel.kallsyms]               [k] __hrtimer_run_queues
     0.00%  kwin_x11         [kernel.kallsyms]               [k] do_vfs_ioctl
     0.00%  swapper          [kernel.kallsyms]               [k] run_posix_cpu_timers
     0.00%  swapper          [kernel.kallsyms]               [k] __rdgsbase_inactive
     0.00%  ksmd             [kernel.kallsyms]               [k] hrtimer_interrupt
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] nvkm_object_search
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] hrtimer_start_range_ns
     0.00%  swapper          [kernel.kallsyms]               [k] __wrgsbase_inactive
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] clockevents_program_event
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] wq_worker_running
     0.00%  QSGRenderThread  [kernel.kallsyms]               [k] ioread32
     0.00%  swapper          [kernel.kallsyms]               [k] ct_nmi_exit
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] __hrtimer_init
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] __schedule
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] sched_clock_noinstr
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] calc_global_load_tick
     0.00%  swapper          [kernel.kallsyms]               [k] load_balance
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_start_range_ns
     0.00%  swapper          [kernel.kallsyms]               [k] irqentry_exit
     0.00%  ksmd             [kernel.kallsyms]               [k] psi_group_change
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_interrupt
     0.00%  swapper          [kernel.kallsyms]               [k] rebalance_domains
     0.00%  plasmashell      libKF5Plasma.so.5.113.0         [.] Plasma::Containment::metaObject
     0.00%  plasmashell      [kernel.kallsyms]               [k] rb_insert_color
     0.00%  swapper          [kernel.kallsyms]               [k] cpuidle_reflect
     0.00%  swapper          [kernel.kallsyms]               [k] update_cfs_group
     0.00%  dmeventd         [kernel.kallsyms]               [k] update_curr
     0.00%  plasmashell      libc.so.6                       [.] __poll
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] queued_spin_lock_slowpath
     0.00%  swapper          [kernel.kallsyms]               [k] quiet_vmstat
     0.00%  plasmashell      [kernel.kallsyms]               [k] call_filldir
     0.00%  gpm              gpm                             [.] 0x0000000000010470
     0.00%  gpm              [kernel.kallsyms]               [k] getname_flags
     0.00%  QSGRenderThread  libQt5Quick.so.5.15.11          [.] 0x0000000000199010
     0.00%  QSGRenderThread  libqxcb-glx-integration.so      [.] QXcbWindow::needsSync@plt
     0.00%  synergys         [kernel.kallsyms]               [k] do_sys_poll
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] readdir64@plt
     0.00%  swapper          [kernel.kallsyms]               [k] ct_nmi_enter
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] 0x00000000002dc2b5
     0.00%  perf             [kernel.kallsyms]               [k] ext4_journal_check_start
     0.00%  swapper          [kernel.kallsyms]               [k] timerqueue_del
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
     0.00%  swapper          [kernel.kallsyms]               [k] call_cpuidle
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] percpu_counter_add_batch
     0.00%  swapper          [kernel.kallsyms]               [k] tsc_verify_tsc_adjust
     0.00%  ksmd             [kernel.kallsyms]               [k] schedule_timeout
     0.00%  konsole          libQt5XcbQpa.so.5.15.11         [.] QKeyEvent::modifiers@plt
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] QString::fromLocal8Bit_helper
     0.00%  kwin_x11         libkwin.so.5.27.10              [.] KWin::Application::dispatchEvent
     0.00%  swapper          [kernel.kallsyms]               [k] sysvec_apic_timer_interrupt
     0.00%  migration/2      [kernel.kallsyms]               [k] update_sd_lb_stats.constprop.0
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] QArrayData::allocate
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] hrtimer_active
     0.00%  plasmashell      [kernel.kallsyms]               [k] __get_user_1
     0.00%  synergys         [kernel.kallsyms]               [k] avg_vruntime
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] 0x00000000001cdca2
     0.00%  ksmd             [kernel.kallsyms]               [k] hrtimer_active
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] __switch_to
     0.00%  ksmd             [kernel.kallsyms]               [k] nohz_balance_exit_idle.part.0
     0.00%  konsole          libharfbuzz.so.0.60830.0        [.] 0x00000000000a9aa0
     0.00%  swapper          [kernel.kallsyms]               [k] rb_erase
     0.00%  swapper          [kernel.kallsyms]               [k] activate_task
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] 0x00000000001d72f3
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_idle_retain_tick
     0.00%  konsole          libxcb.so.1.1.0                 [.] xcb_send_request64
     0.00%  swapper          [unknown]                       [.] 0000000000000000
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_update_next_event
     0.00%  kworker/7:0-eve  [kernel.kallsyms]               [k] collect_percpu_times
     0.00%  plasmashell      libQt5Qml.so.5.15.11            [.] QQmlJavaScriptExpression::clearActiveGuards
     0.00%  perf             [kernel.kallsyms]               [k] __block_commit_write
     0.00%  swapper          [kernel.kallsyms]               [k] __intel_pmu_enable_all.isra.0
     0.00%  perf             [kernel.kallsyms]               [k] affine_move_task
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_get_sleep_length
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] mpage_release_unused_pages
     0.00%  plasmashell      libQt5Qml.so.5.15.11            [.] QQmlData::isSignalConnected
     0.00%  perf             [kernel.kallsyms]               [k] mt_find
     0.00%  xembedsniproxy   [kernel.kallsyms]               [k] update_sd_lb_stats.constprop.0
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] 0x0000000000202d04
     0.00%  migration/3      [kernel.kallsyms]               [k] psi_group_change
     0.00%  swapper          [kernel.kallsyms]               [k] tick_program_event
     0.00%  swapper          [kernel.kallsyms]               [k] cpuidle_get_cpu_driver
     0.00%  swapper          [kernel.kallsyms]               [k] account_process_tick
     0.00%  Qt bearer threa  libc.so.6                       [.] 0x0000000000093948
     0.00%  swapper          [kernel.kallsyms]               [k] __flush_smp_call_function_queue
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] xts_crypt
     0.00%  swapper          [kernel.kallsyms]               [k] kmem_cache_free
     0.00%  synergys         [kernel.kallsyms]               [k] psi_group_change
     0.00%  avahi-daemon     libavahi-common.so.3.5.4        [.] avahi_unescape_label
     0.00%  migration/0      [kernel.kallsyms]               [k] __update_load_avg_se
     0.00%  swapper          [kernel.kallsyms]               [k] ct_idle_exit
     0.00%  swapper          [kernel.kallsyms]               [k] cpuidle_not_available
     0.00%  swapper          [kernel.kallsyms]               [k] error_entry
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_idle_got_tick
     0.00%  X                Xorg                            [.] 0x0000000000094069
     0.00%  swapper          [kernel.kallsyms]               [k] try_to_wake_up
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] 0x0000000000202db6
     0.00%  swapper          [kernel.kallsyms]               [k] idle_cpu
     0.00%  kwin_x11         nouveau_dri.so                  [.] 0x00000000001342e0
     0.00%  swapper          [kernel.kallsyms]               [k] irq_work_needs_cpu
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] _raw_read_lock_irqsave
     0.00%  swapper          [kernel.kallsyms]               [k] nvkm_pci_wr32
     0.00%  kwin_x11         libkwineffects.so.5.27.10       [.] KWin::WindowPaintData::brightness
     0.00%  plasmashell      libQt5Quick.so.5.15.11          [.] QTextLayout::beginLayout@plt
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] unix_destruct_scm
     0.00%  X                [kernel.kallsyms]               [k] ___slab_alloc.isra.0
     0.00%  kwin_x11         nouveau_dri.so                  [.] 0x0000000000070093
     0.00%  swapper          [kernel.kallsyms]               [k] psi_task_change
     0.00%  X                Xorg                            [.] XkbComputeDerivedState
     0.00%  swapper          [kernel.kallsyms]               [k] rb_insert_color
     0.00%  synergys         [kernel.kallsyms]               [k] newidle_balance.isra.0
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] __copy_msghdr
     0.00%  swapper          [kernel.kallsyms]               [k] __softirqentry_text_start
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] ext4_reserve_inode_write
     0.00%  konsole          libc.so.6                       [.] 0x0000000000092244
     0.00%  kwin_x11         libQt5Gui.so.5.15.11            [.] QRegion::~QRegion
     0.00%  perf             [kernel.kallsyms]               [k] __rmqueue_pcplist
     0.00%  konsole          libQt5Core.so.5.15.11           [.] 0x00000000002db957
     0.00%  ksmd             [kernel.kallsyms]               [k] mas_next_slot
     0.00%  kwin_x11         libQt5Core.so.5.15.11           [.] QtPrivate::qustrchr
     0.00%  swapper          [kernel.kallsyms]               [k] update_load_avg
     0.00%  swapper          [kernel.kallsyms]               [k] perf_pmu_nop_void
     0.00%  plasmashell      libc.so.6                       [.] 0x00000000000920bf
     0.00%  synergys         [kernel.kallsyms]               [k] sock_poll
     0.00%  QSGRenderThread  nouveau_dri.so                  [.] 0x00000000007474d0
     0.00%  kwin_x11         [kernel.kallsyms]               [k] nvkm_vmm_get_locked
     0.00%  swapper          [kernel.kallsyms]               [k] __msecs_to_jiffies
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] task_h_load
     0.00%  synergys         [kernel.kallsyms]               [k] __fget_light
     0.00%  swapper          [kernel.kallsyms]               [k] irq_work_tick
     0.00%  swapper          [kernel.kallsyms]               [k] irqentry_enter
     0.00%  kwin_x11         nouveau_dri.so                  [.] 0x0000000000745aa0
     0.00%  X                [kernel.kallsyms]               [k] do_iter_write
     0.00%  plasmashell      libQt5XcbQpa.so.5.15.11         [.] QXcbConnection::handleXcbEvent
     0.00%  QSGRenderThread  [kernel.kallsyms]               [k] nvkm_vmm_get_locked
     0.00%  QSGRenderThread  libQt5Quick.so.5.15.11          [.] QSGRenderContext::endSync
     0.00%  swapper          [kernel.kallsyms]               [k] arch_cpu_idle_enter
     0.00%  X                [kernel.kallsyms]               [k] drain_obj_stock
     0.00%  swapper          [kernel.kallsyms]               [k] calc_global_load_tick
     0.00%  Qt bearer threa  [kernel.kallsyms]               [k] macvlan_fill_info
     0.00%  X                libdrm_nouveau.so.2.0.0         [.] 0x0000000000004ee2
     0.00%  synergys         libc.so.6                       [.] __poll
     0.00%  swapper          [kernel.kallsyms]               [k] cpuidle_governor_latency_req
     0.00%  swapper          [kernel.kallsyms]               [k] _nohz_idle_balance.isra.0
     0.00%  X                Xorg                            [.] 0x000000000008207c
     0.00%  plasmashell      libglib-2.0.so.0.7800.3         [.] 0x0000000000059794
     0.00%  swapper          [kernel.kallsyms]               [k] irq_exit_rcu
     0.00%  X                [kernel.kallsyms]               [k] timestamp_truncate
     0.00%  plasmashell      libglib-2.0.so.0.7800.3         [.] 0x00000000000567c4
     0.00%  QSGRenderThread  nouveau_dri.so                  [.] 0x000000000024295e
     0.00%  X                [kernel.kallsyms]               [k] save_fpregs_to_fpstate
     0.00%  perf             [kernel.kallsyms]               [k] lru_add_fn
     0.00%  swapper          [kernel.kallsyms]               [k] rcu_preempt_deferred_qs
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_get_next_event
     0.00%  plasmashell      libc.so.6                       [.] 0x0000000000140199
     0.00%  X                [kernel.kallsyms]               [k] dequeue_task_fair
     0.00%  swapper          [kernel.kallsyms]               [k] __lock_text_start
     0.00%  swapper          [kernel.kallsyms]               [k] __remove_hrtimer
     0.00%  swapper          [kernel.kallsyms]               [k] rcu_needs_cpu
     0.00%  swapper          [kernel.kallsyms]               [k] alx_poll
     0.00%  swapper          [kernel.kallsyms]               [k] rcu_segcblist_ready_cbs
     0.00%  swapper          [kernel.kallsyms]               [k] task_tick_idle
     0.00%  swapper          [kernel.kallsyms]               [k] cr4_update_irqsoff
     0.00%  plasmashell      libQt5Quick.so.5.15.11          [.] 0x000000000020564d
     0.00%  swapper          [kernel.kallsyms]               [k] cpu_latency_qos_limit
     0.00%  swapper          [kernel.kallsyms]               [k] get_next_timer_interrupt
     0.00%  InputThread      [kernel.kallsyms]               [k] __get_user_8
     0.00%  xembedsniproxy   libQt5XcbQpa.so.5.15.11         [.] QXcbConnection::processXcbEvents
     0.00%  kwin_x11         libxkbcommon.so.0.0.0           [.] xkb_state_key_get_level
     0.00%  sudo             libc.so.6                       [.] read
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] filemap_get_folios_tag
     0.00%  InputThread      [kernel.kallsyms]               [k] ep_item_poll.isra.0
     0.00%  swapper          [kernel.kallsyms]               [k] can_stop_idle_tick
     0.00%  swapper          [kernel.kallsyms]               [k] __pick_eevdf
     0.00%  perf             [kernel.kallsyms]               [k] __fget_light
     0.00%  InputThread      [kernel.kallsyms]               [k] _copy_from_iter
     0.00%  InputThread      [kernel.kallsyms]               [k] ep_done_scan
     0.00%  swapper          [kernel.kallsyms]               [k] netlink_broadcast_filtered
     0.00%  upsd             [kernel.kallsyms]               [k] __cgroup_account_cputime
     0.00%  kworker/7:0-eve  [kernel.kallsyms]               [k] __cond_resched
     0.00%  X                [kernel.kallsyms]               [k] ww_mutex_lock_interruptible
     0.00%  swapper          [kernel.kallsyms]               [k] attach_entity_load_avg
     0.00%  plasmashell      libKF5Archive.so.5.113.0        [.] 0x000000000000ea00
     0.00%  QSGRenderThread  nouveau_dri.so                  [.] 0x000000000037f463
     0.00%  jbd2/dm-2-8      [kernel.kallsyms]               [k] _aesni_enc4
     0.00%  kwin_x11         [kernel.kallsyms]               [k] obj_cgroup_charge
     0.00%  X                nouveau_dri.so                  [.] 0x0000000000125020
     0.00%  perf             [kernel.kallsyms]               [k] fault_in_readable
     0.00%  perf             [kernel.kallsyms]               [k] should_failslab
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] xhci_ring_ep_doorbell
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] put_cpu_partial
     0.00%  swapper          [kernel.kallsyms]               [k] ___slab_alloc.isra.0
     0.00%  kwin_x11         [kernel.kallsyms]               [k] evict
     0.00%  swapper          [kernel.kallsyms]               [k] sched_clock
     0.00%  crond            libc.so.6                       [.] 0x00000000000b1330
     0.00%  swapper          [kernel.kallsyms]               [k] update_dl_rq_load_avg
     0.00%  X                libdrm_nouveau.so.2.0.0         [.] nouveau_bo_ref
     0.00%  perf             perf                            [.] 0x000000000007e2a6
     0.00%  konsole          [kernel.kallsyms]               [k] n_tty_read
     0.00%  synergys         [kernel.kallsyms]               [k] __schedule
     0.00%  swapper          [kernel.kallsyms]               [k] calc_load_nohz_start
     0.00%  swapper          [kernel.kallsyms]               [k] tick_irq_enter
     0.00%  swapper          [kernel.kallsyms]               [k] skb_release_head_state
     0.00%  swapper          [kernel.kallsyms]               [k] task_tick_mm_cid
     0.00%  swapper          [kernel.kallsyms]               [k] nohz_csd_func
     0.00%  swapper          [kernel.kallsyms]               [k] update_process_times
     0.00%  perf             [kernel.kallsyms]               [k] xas_load
     0.00%  swapper          [kernel.kallsyms]               [k] update_rt_rq_load_avg
     0.00%  synergys         [kernel.kallsyms]               [k] entry_SYSRETQ_unsafe_stack
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] 0x00000000002b9526
     0.00%  plasmashell      libc.so.6                       [.] _pthread_cleanup_push
     0.00%  plasmashell      libglib-2.0.so.0.7800.3         [.] g_mutex_lock
     0.00%  synergys         synergys                        [.] 0x000000000004dd9b
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] update_cfs_group
     0.00%  swapper          [kernel.kallsyms]               [k] sched_clock_cpu
     0.00%  kglobalaccel5    libxcb-keysyms.so.1.0.0         [.] xcb_key_symbols_get_keysym
     0.00%  synergys         [kernel.kallsyms]               [k] pipe_poll
     0.00%  swapper          [kernel.kallsyms]               [k] record_times
     0.00%  swapper          [kernel.kallsyms]               [k] cpu_startup_entry
     0.00%  plasmashell      libQt5Qml.so.5.15.11            [.] QV4::QObjectWrapper::findProperty
     0.00%  swapper          [kernel.kallsyms]               [k] finish_task_switch.isra.0
     0.00%  kwin_x11         libQt5Core.so.5.15.11           [.] qstrcmp
     0.00%  synergys         [kernel.kallsyms]               [k] dequeue_entity
     0.00%  QXcbEventQueue   libxcb.so.1.1.0                 [.] 0x000000000000f56e
     0.00%  kglobalaccel5    libc.so.6                       [.] pthread_getspecific
     0.00%  swapper          [kernel.kallsyms]               [k] ttwu_do_activate.isra.0
     0.00%  synergys         libxcb.so.1.1.0                 [.] xcb_poll_for_event
     0.00%  synergys         [kernel.kallsyms]               [k] unix_poll
     0.00%  konqueror        libQt5WebEngineCore.so.5.15.11  [.] 0x0000000002ba3914
     0.00%  rcu_sched        [kernel.kallsyms]               [k] rcu_all_qs
     0.00%  QSGRenderThread  [kernel.kallsyms]               [k] mutex_spin_on_owner
     0.00%  konqueror        libQt5WebEngineCore.so.5.15.11  [.] 0x0000000002b56bc8
     0.00%  synergys         [kernel.kallsyms]               [k] update_cfs_group
     0.00%  QSGRenderThread  [kernel.kallsyms]               [k] syscall_return_via_sysret
     0.00%  synergys         synergys                        [.] pthread_mutex_lock@plt
     0.00%  synergys         [kernel.kallsyms]               [k] __switch_to
     0.00%  at-spi2-registr  libglib-2.0.so.0.7800.3         [.] 0x0000000000056e64
     0.00%  perf             [kernel.kallsyms]               [k] __get_file_rcu
     0.00%  synergys         [kernel.kallsyms]               [k] __switch_to_asm
     0.00%  swapper          [kernel.kallsyms]               [k] local_clock_noinstr
     0.00%  perf             [kernel.kallsyms]               [k] __filemap_add_folio
     0.00%  swapper          [kernel.kallsyms]               [k] trigger_load_balance
     0.00%  swapper          [kernel.kallsyms]               [k] xhci_ring_ep_doorbell
     0.00%  synergys         [kernel.kallsyms]               [k] __rseq_handle_notify_resume
     0.00%  swapper          [kernel.kallsyms]               [k] intel_pmu_disable_all
     0.00%  kwin_x11         kwin_x11                        [.] 0x000000000008ee30
     0.00%  swapper          [kernel.kallsyms]               [k] sched_idle_set_state
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_next_event_without
     0.00%  upsmon           [kernel.kallsyms]               [k] __ip_finish_output
     0.00%  plasmashell      libQt5Core.so.5.15.11           [.] QVariant::clear
     0.00%  perf             [kernel.kallsyms]               [k] create_empty_buffers
     0.00%  perf             [kernel.kallsyms]               [k] memset_orig
     0.00%  synergys         libc.so.6                       [.] recvmsg
     0.00%  baloorunner      libQt5XcbQpa.so.5.15.11         [.] 0x0000000000065c0d
     0.00%  konsole          libc.so.6                       [.] 0x000000000013d502
     0.00%  swapper          [kernel.kallsyms]               [k] update_curr
     0.00%  QSGRenderThread  nouveau_dri.so                  [.] 0x00000000002428dc
     0.00%  synergys         [kernel.kallsyms]               [k] save_fpregs_to_fpstate
     0.00%  synergys         [kernel.kallsyms]               [k] __update_load_avg_se
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] mem_cgroup_css_rstat_flush
     0.00%  swapper          [kernel.kallsyms]               [k] ___bpf_prog_run
     0.00%  kwin_x11         libQt5Core.so.5.15.11           [.] QArrayData::deallocate
     0.00%  konqueror        libQt5Core.so.5.15.11           [.] qstrcmp
     0.00%  X                libglamoregl.so                 [.] 0x000000000000c6de
     0.00%  synergys         [kernel.kallsyms]               [k] exit_to_user_mode_prepare
     0.00%  X                [kernel.kallsyms]               [k] __kmem_cache_alloc_node
     0.00%  synergys         libc.so.6                       [.] pthread_mutex_lock
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_idle_enter
     0.00%  swapper          [kernel.kallsyms]               [k] tick_check_broadcast_expired
     0.00%  perf             [kernel.kallsyms]               [k] __fdget_pos
     0.00%  konqueror        libQt5WebEngineCore.so.5.15.11  [.] 0x0000000002b6092c
     0.00%  ksoftirqd/5      [kernel.kallsyms]               [k] load_balance
     0.00%  kglobalaccel5    ld-linux-x86-64.so.2            [.] __tls_get_addr
     0.00%  swapper          [kernel.kallsyms]               [k] perf_swevent_stop
     0.00%  Qt bearer threa  [kernel.kallsyms]               [k] inet6_fill_ifla6_attrs
     0.00%  perf             [kernel.kallsyms]               [k] copy_page_from_iter_atomic
     0.00%  swapper          [kernel.kallsyms]               [k] __call_rcu_common.constprop.0
     0.00%  swapper          [kernel.kallsyms]               [k] psi_task_switch
     0.00%  swapper          [kernel.kallsyms]               [k] menu_reflect
     0.00%  synergys         [kernel.kallsyms]               [k] __update_load_avg_cfs_rq
     0.00%  :-1              [kernel.kallsyms]               [k] proc_invalidate_siblings_dcache
     0.00%  rcu_sched        [kernel.kallsyms]               [k] dequeue_task_fair
     0.00%  swapper          [kernel.kallsyms]               [k] check_tsc_unstable
     0.00%  konsole          libQt5Core.so.5.15.11           [.] QAbstractEventDispatcherPrivate::releaseTimerId
     0.00%  konqueror        libQt5WebEngineCore.so.5.15.11  [.] 0x0000000002b836e2
     0.00%  kclockd          [kernel.kallsyms]               [k] __get_user_8
     0.00%  usbhid-ups       libc.so.6                       [.] ioctl
     0.00%  swapper          [kernel.kallsyms]               [k] perf_event_task_tick
     0.00%  swapper          [kernel.kallsyms]               [k] tun_net_xmit
     0.00%  rcu_sched        [kernel.kallsyms]               [k] enqueue_timer
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_idle_exit
     0.00%  swapper          [kernel.kallsyms]               [k] set_next_entity
     0.00%  synergys         [kernel.kallsyms]               [k] syscall_enter_from_user_mode
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_irq_exit
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] proc_do_submiturb
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] usbdev_poll
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] enqueue_to_backlog
     0.00%  ksoftirqd/5      [kernel.kallsyms]               [k] update_sd_lb_stats.constprop.0
     0.00%  kwin_x11         libkwin.so.5.27.10              [.] KWin::RenderLoopPrivate::scheduleRepaint
     0.00%  :-1              [kernel.kallsyms]               [k] wake_up_bit
     0.00%  synergys         [kernel.kallsyms]               [k] update_load_avg
     0.00%  QXcbEventQueue   libQt5Core.so.5.15.11           [.] QMutex::lock
     0.00%  synergys         [unknown]                       [.] 0000000000000000
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] record_times
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] drain_obj_stock
     0.00%  konqueror        [kernel.kallsyms]               [k] refill_stock
     0.00%  konqueror        libQt5WebEngineCore.so.5.15.11  [.] 0x0000000002bc6fff
     0.00%  perf             [kernel.kallsyms]               [k] _raw_write_lock
     0.00%  synergys         libX11.so.6.4.0                 [.] XPending
     0.00%  synergys         libc.so.6                       [.] pthread_mutex_unlock
     0.00%  synergys         synergys                        [.] poll@plt
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] schedule_hrtimeout_range_clock
     0.00%  synergys         synergys                        [.] pthread_mutex_unlock@plt
     0.00%  swapper          [kernel.kallsyms]               [k] schedule_idle
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] wq_worker_running
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __switch_to_asm
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] mem_cgroup_css_rstat_flush
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x00000000000440b0
     0.00%  synergys         [kernel.kallsyms]               [k] unix_stream_read_generic
     0.00%  usbhid-ups       libusb-1.0.so.0.3.0             [.] 0x0000000000011979
     0.00%  avahi-daemon     libavahi-core.so.7.1.0          [.] avahi_dns_packet_check_valid
     0.00%  X                Xorg                            [.] 0x00000000000d094e
     0.00%  synergys         libxcb.so.1.1.0                 [.] 0x000000000000f56c
     0.00%  swapper          [kernel.kallsyms]               [k] wakeup_preempt
     0.00%  swapper          [kernel.kallsyms]               [k] avg_vruntime
     0.00%  swapper          [kernel.kallsyms]               [k] put_prev_task_idle
     0.00%  swapper          [kernel.kallsyms]               [k] _find_next_bit
     0.00%  plasmashell      libc.so.6                       [.] malloc
     0.00%  Qt bearer threa  [kernel.kallsyms]               [k] kmem_cache_alloc_node
     0.00%  QXcbEventQueue   libQt5Core.so.5.15.11           [.] QThread::eventDispatcher
     0.00%  Qt bearer threa  [kernel.kallsyms]               [k] do_syscall_64
     0.00%  perf             [kernel.kallsyms]               [k] perf_poll
     0.00%  X                libEGL_mesa.so.0.0.0            [.] 0x0000000000018a27
     0.00%  synergys         [kernel.kallsyms]               [k] pick_next_task_fair
     0.00%  swapper          [kernel.kallsyms]               [k] enqueue_hrtimer
     0.00%  rcu_sched        [kernel.kallsyms]               [k] psi_group_change
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] vmstat_shepherd
     0.00%  perf             perf                            [.] 0x0000000000101078
     0.00%  perf             [kernel.kallsyms]               [k] lock_vma_under_rcu
     0.00%  swapper          [kernel.kallsyms]               [k] tcp_orphan_count_sum
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] _raw_spin_lock_irq
     0.00%  synergys         [kernel.kallsyms]               [k] sched_clock_noinstr
     0.00%  swapper          [kernel.kallsyms]               [k] __rb_insert_augmented
     0.00%  swapper          [kernel.kallsyms]               [k] cpuidle_select
     0.00%  QSGRenderThread  libQt5Quick.so.5.15.11          [.] QSGBatchRenderer::Renderer::buildRenderLists
     0.00%  QSGRenderThread  libQt5Quick.so.5.15.11          [.] QSGBatchRenderer::Renderer::nodeChanged
     0.00%  kwin_x11         libKF5JobWidgets.so.5.113.0     [.] 0x000000000000fa50
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] __cgroup_account_cputime
     0.00%  usbhid-ups       libc.so.6                       [.] 0x000000000013e8b0
     0.00%  konqueror        libQt5Core.so.5.15.11           [.] clock_gettime@plt
     0.00%  swapper          [kernel.kallsyms]               [k] mm_cid_get
     0.00%  gmain            [kernel.kallsyms]               [k] inode_permission
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_try_to_cancel.part.0
     0.00%  rcu_sched        [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
     0.00%  usbhid-ups       libc.so.6                       [.] 0x000000000007ad00
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] kvfree_rcu_bulk
     0.00%  synergys         [kernel.kallsyms]               [k] sockfd_lookup_light
     0.00%  synergys         libc.so.6                       [.] 0x000000000008ac00
     0.00%  swapper          [kernel.kallsyms]               [k] timerqueue_iterate_next
     0.00%  synergys         [kernel.kallsyms]               [k] __get_user_8
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] memchr_inv
     0.00%  swapper          [kernel.kallsyms]               [k] wb_timer_fn
     0.00%  perf             perf                            [.] 0x0000000000104467
     0.00%  swapper          [kernel.kallsyms]               [k] ct_idle_enter
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x0000000000043e60
     0.00%  usbhid-ups       libc.so.6                       [.] 0x00000000000826a3
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] __mod_memcg_lruvec_state
     0.00%  synergys         synergys                        [.] 0x0000000000026260
     0.00%  ksoftirqd/5      [kernel.kallsyms]               [k] kthread_should_stop
     0.00%  synergys         synergys                        [.] 0x0000000000025047
     0.00%  usbhid-ups       libc.so.6                       [.] pthread_mutex_trylock
     0.00%  synergys         libxcb.so.1.1.0                 [.] 0x0000000000010030
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] psi_avgs_work
     0.00%  synergys         [kernel.kallsyms]               [k] ____sys_recvmsg
     0.00%  kwin_x11         libglib-2.0.so.0.7800.3         [.] g_mutex_lock
     0.00%  synergys         [kernel.kallsyms]               [k] _copy_from_user
     0.00%  rcu_sched        [kernel.kallsyms]               [k] update_min_vruntime
     0.00%  kwin_x11         libQt5Gui.so.5.15.11            [.] QImageData::~QImageData
     0.00%  rcu_sched        [kernel.kallsyms]               [k] rcu_gp_kthread
     0.00%  synergys         synergys                        [.] 0x0000000000025040
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] memcpy_orig
     0.00%  synergys         [kernel.kallsyms]               [k] timerqueue_add
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_tick_stopped
     0.00%  swapper          [kernel.kallsyms]               [k] __put_task_struct
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] kfree
     0.00%  dmeventd         [kernel.kallsyms]               [k] finish_task_switch.isra.0
     0.00%  perf             [kernel.kallsyms]               [k] __rdgsbase_inactive
     0.00%  swapper          [kernel.kallsyms]               [k] irq_chip_ack_parent
     0.00%  swapper          [kernel.kallsyms]               [k] irq_enter_rcu
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] __fget_light
     0.00%  usbhid-ups       usbhid-ups                      [.] 0x000000000001e143
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __mod_timer
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x0000000000031dab
     0.00%  ksoftirqd/5      [kernel.kallsyms]               [k] __softirqentry_text_start
     0.00%  synergys         [kernel.kallsyms]               [k] ___sys_recvmsg
     0.00%  swapper          [kernel.kallsyms]               [k] error_return
     0.00%  swapper          [kernel.kallsyms]               [k] run_rebalance_domains
     0.00%  rcu_sched        [kernel.kallsyms]               [k] check_cfs_rq_runtime
     0.00%  perf             [kernel.kallsyms]               [k] do_sys_poll
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __update_load_avg_se
     0.00%  ThreadPoolForeg  libQt5WebEngineCore.so.5.15.11  [.] 0x0000000002b8cf74
     0.00%  rcu_sched        [kernel.kallsyms]               [k] rcu_implicit_dynticks_qs
     0.00%  swapper          [kernel.kallsyms]               [k] atomic_notifier_call_chain
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x00000000000476cb
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x0000000000031cd0
     0.00%  swapper          [kernel.kallsyms]               [k] llist_reverse_order
     0.00%  rcu_sched        [kernel.kallsyms]               [k] finish_task_switch.isra.0
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x00000000000441d0
     0.00%  upsmon           [kernel.kallsyms]               [k] __schedule
     0.00%  upsmon           [kernel.kallsyms]               [k] check_stack_object
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] usbdev_ioctl
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_run_queues
     0.00%  swapper          [kernel.kallsyms]               [k] i_callback
     0.00%  swapper          [kernel.kallsyms]               [k] wake_up_process
     0.00%  synergys         [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
     0.00%  X                [kernel.kallsyms]               [k] rcu_note_context_switch
     0.00%  kwin_x11         [kernel.kallsyms]               [k] __get_task_ioprio
     0.00%  kwin_x11         libkwin.so.5.27.10              [.] KWin::Workspace::findClient
     0.00%  X                [kernel.kallsyms]               [k] update_min_vruntime
     0.00%  X                libGLdispatch.so.0.0.0          [.] 0x000000000004918b
     0.00%  synergys         libX11.so.6.4.0                 [.] 0x0000000000031da8
     0.00%  konqueror        [kernel.kallsyms]               [k] unix_poll
     0.00%  konqueror        libKF5WidgetsAddons.so.5.113.0  [.] 0x0000000000075fb0
     0.00%  rcu_sched        [kernel.kallsyms]               [k] psi_task_switch
     0.00%  swapper          [kernel.kallsyms]               [k] __mod_memcg_lruvec_state
     0.00%  swapper          [kernel.kallsyms]               [k] get_nohz_timer_target
     0.00%  rcu_sched        [kernel.kallsyms]               [k] avg_vruntime
     0.00%  X                libEGL_mesa.so.0.0.0            [.] 0x0000000000018a20
     0.00%  X                [kernel.kallsyms]               [k] drm_file_get_master
     0.00%  swapper          [kernel.kallsyms]               [k] timer_clear_idle
     0.00%  ksoftirqd/5      [kernel.kallsyms]               [k] __switch_to_asm
     0.00%  kwin_x11         libQt5Core.so.5.15.11           [.] malloc@plt
     0.00%  swapper          [kernel.kallsyms]               [k] evdev_pass_values.part.0
     0.00%  synergys         libX11.so.6.4.0                 [.] xcb_connection_has_error@plt
     0.00%  swapper          [kernel.kallsyms]               [k] need_update
     0.00%  synergys         [kernel.kallsyms]               [k] __cgroup_account_cputime
     0.00%  synergys         [kernel.kallsyms]               [k] remove_wait_queue
     0.00%  swapper          [kernel.kallsyms]               [k] first_online_pgdat
     0.00%  swapper          [kernel.kallsyms]               [k] raw_spin_rq_lock_nested
     0.00%  perf             [kernel.kallsyms]               [k] remote_function
     0.00%  kwin_x11         [kernel.kallsyms]               [k] __get_file_rcu
     0.00%  :-1              [kernel.kallsyms]               [k] evict
     0.00%  X                [kernel.kallsyms]               [k] sock_poll
     0.00%  swapper          [kernel.kallsyms]               [k] arch_cpu_idle_exit
     0.00%  synergys         [kernel.kallsyms]               [k] enter_lazy_tlb
     0.00%  rcu_sched        [kernel.kallsyms]               [k] rcu_gp_cleanup
     0.00%  synergys         [kernel.kallsyms]               [k] __entry_text_start
     0.00%  swapper          [kernel.kallsyms]               [k] irq_work_run_list
     0.00%  swapper          [kernel.kallsyms]               [k] place_entity
     0.00%  perf             [kernel.kallsyms]               [k] xas_start
     0.00%  synergys         [kernel.kallsyms]               [k] copy_msghdr_from_user
     0.00%  synergys         [kernel.kallsyms]               [k] syscall_return_via_sysret
     0.00%  synergys         [kernel.kallsyms]               [k] schedule_hrtimeout_range_clock
     0.00%  synergys         [kernel.kallsyms]               [k] set_normalized_timespec64
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] desc_read
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] update_min_vruntime
     0.00%  synergys         [kernel.kallsyms]               [k] update_min_vruntime
     0.00%  :-1              [kernel.kallsyms]               [k] ___d_drop
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] strscpy
     0.00%  swapper          [kernel.kallsyms]               [k] __wake_up_common
     0.00%  swapper          [kernel.kallsyms]               [k] ep_poll_callback
     0.00%  rcu_sched        [kernel.kallsyms]               [k] update_curr
     0.00%  rcu_sched        [kernel.kallsyms]               [k] pick_next_task_idle
     0.00%  rcu_sched        [kernel.kallsyms]               [k] cpuacct_charge
     0.00%  InputThread      libinput_drv.so                 [.] 0x0000000000008e92
     0.00%  swapper          [kernel.kallsyms]               [k] __smp_call_single_queue
     0.00%  swapper          [kernel.kallsyms]               [k] reweight_entity
     0.00%  rcu_sched        [kernel.kallsyms]               [k] lock_timer_base
     0.00%  synergys         [kernel.kallsyms]               [k] put_prev_task_fair
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] dequeue_entity
     0.00%  konsole          libQt5Core.so.5.15.11           [.] 0x00000000002d4f40
     0.00%  kworker/2:1-mm_  [kernel.kallsyms]               [k] collect_percpu_times
     0.00%  synergys         libxcb.so.1.1.0                 [.] 0x000000000001004b
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_forward
     0.00%  upsmon           libc.so.6                       [.] strlen@plt
     0.00%  konqueror        libQt5Widgets.so.5.15.11        [.] QApplication::notify
     0.00%  swapper          [kernel.kallsyms]               [k] queued_spin_lock_slowpath
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] extract_entropy.constprop.0
     0.00%  swapper          [kernel.kallsyms]               [k] skb_network_protocol
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] _prb_read_valid
     0.00%  swapper          [kernel.kallsyms]               [k] enter_lazy_tlb
     0.00%  synergys         [kernel.kallsyms]               [k] dequeue_task_fair
     0.00%  synergys         [kernel.kallsyms]               [k] psi_task_switch
     0.00%  swapper          [kernel.kallsyms]               [k] flush_smp_call_function_queue
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] crypt_page_alloc
     0.00%  kworker/u16:3-e  [kernel.kallsyms]               [k] vsnprintf
     0.00%  kwin_x11         libkwin.so.5.27.10              [.] KWin::X11Window::windowEvent
     0.00%  swapper          [kernel.kallsyms]               [k] nsecs_to_jiffies
     0.00%  synergys         [kernel.kallsyms]               [k] schedule
     0.00%  rcu_sched        [kernel.kallsyms]               [k] dequeue_entity
     0.00%  synergys         [kernel.kallsyms]               [k] get_nohz_timer_target
     0.00%  synergys         [kernel.kallsyms]               [k] record_times
     0.00%  synergys         synergys                        [.] 0x000000000004dd96
     0.00%  synergys         [kernel.kallsyms]               [k] __x64_sys_poll
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __switch_to
     0.00%  kworker/u16:1-e  [kernel.kallsyms]               [k] cgroup_rstat_flush_locked
     0.00%  swapper          [kernel.kallsyms]               [k] nohz_balance_enter_idle
     0.00%  swapper          [kernel.kallsyms]               [k] __switch_to
     0.00%  avahi-daemon     [kernel.kallsyms]               [k] free_unref_page_commit
     0.00%  swapper          [kernel.kallsyms]               [k] account_idle_ticks
     0.00%  swapper          [kernel.kallsyms]               [k] perf_swevent_start
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] __rdgsbase_inactive
     0.00%  rcu_sched        [kernel.kallsyms]               [k] detach_if_pending
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] mutex_lock
     0.00%  perf             [kernel.kallsyms]               [k] fput
     0.00%  upsmon           [kernel.kallsyms]               [k] eth_type_trans
     0.00%  synergys         libX11.so.6.4.0                 [.] pthread_mutex_lock@plt
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] enqueue_timer
     0.00%  kwin_x11         KF5WindowSystemX11Plugin.so     [.] qstrcmp@plt
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] __kmem_cache_alloc_node
     0.00%  QXcbEventQueue   libc.so.6                       [.] malloc
     0.00%  kscreen_backend  libQt5XcbQpa.so.5.15.11         [.] xcb_flush@plt
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] __wake_up_common
     0.00%  avahi-daemon     [kernel.kallsyms]               [k] pipe_write
     0.00%  gmain            [kernel.kallsyms]               [k] restore_fpregs_from_fpstate
     0.00%  swapper          [kernel.kallsyms]               [k] pick_next_task_idle
     0.00%  swapper          [kernel.kallsyms]               [k] timekeeping_max_deferment
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __note_gp_changes
     0.00%  swapper          [kernel.kallsyms]               [k] ct_irq_exit
     0.00%  usbhid-ups       usbhid-ups                      [.] 0x000000000001d21b
     0.00%  gmain            libgio-2.0.so.0.7800.3          [.] g_list_free@plt
     0.00%  kworker/2:1-mm_  [kernel.kallsyms]               [k] refresh_cpu_vm_stats
     0.00%  swapper          [kernel.kallsyms]               [k] br_config_bpdu_generation
     0.00%  swapper          [kernel.kallsyms]               [k] process_timeout
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] psi_group_change
     0.00%  kwin_x11         libc.so.6                       [.] pthread_getspecific
     0.00%  swapper          [kernel.kallsyms]               [k] free_unref_page_prepare
     0.00%  X                libc.so.6                       [.] __errno_location
     0.00%  rcu_sched        [kernel.kallsyms]               [k] schedule
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] notifier_call_chain
     0.00%  dmeventd         [kernel.kallsyms]               [k] cpuacct_charge
     0.00%  synergys         [kernel.kallsyms]               [k] do_syscall_64
     0.00%  GUsbEventThread  libusb-1.0.so.0.3.0             [.] pthread_mutex_unlock@plt
     0.00%  swapper          [kernel.kallsyms]               [k] list_add_leaf_cfs_rq
     0.00%  synergys         [kernel.kallsyms]               [k] finish_task_switch.isra.0
     0.00%  synergys         libX11.so.6.4.0                 [.] _XSend@plt
     0.00%  synergys         [kernel.kallsyms]               [k] sched_clock_cpu
     0.00%  swapper          [kernel.kallsyms]               [k] find_busiest_group
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] worker_thread
     0.00%  synergys         synergys                        [.] 0x00000000000356fd
     0.00%  ksoftirqd/7      [kernel.kallsyms]               [k] update_sd_lb_stats.constprop.0
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
     0.00%  swapper          [kernel.kallsyms]               [k] __slab_free.isra.0
     0.00%  X                [kernel.kallsyms]               [k] switch_fpu_return
     0.00%  swapper          [kernel.kallsyms]               [k] hrtimer_reprogram
     0.00%  QXcbEventQueue   [kernel.kallsyms]               [k] __schedule
     0.00%  QXcbEventQueue   libxcb.so.1.1.0                 [.] pthread_mutex_lock@plt
     0.00%  swapper          [kernel.kallsyms]               [k] ipt_do_table
     0.00%  synergys         [kernel.kallsyms]               [k] __hrtimer_init
     0.00%  kworker/dying    [kernel.kallsyms]               [k] queued_spin_lock_slowpath
     0.00%  ksoftirqd/5      [kernel.kallsyms]               [k] smpboot_thread_fn
     0.00%  avahi-daemon     [kernel.kallsyms]               [k] __get_user_8
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] enqueue_timer
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] collect_percpu_times
     0.00%  synergys         libc.so.6                       [.] 0x00000000000826cd
     0.00%  swapper          [kernel.kallsyms]               [k] macvlan_forward_source
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] get_pfnblock_flags_mask
     0.00%  swapper          [kernel.kallsyms]               [k] raise_softirq
     0.00%  rcu_sched        [kernel.kallsyms]               [k] rcu_gp_init
     0.00%  kworker/0:2-eve  [kernel.kallsyms]               [k] lock_timer_base
     0.00%  perf             [kernel.kallsyms]               [k] event_function_call
     0.00%  synergys         [kernel.kallsyms]               [k] update_curr
     0.00%  swapper          [kernel.kallsyms]               [k] ip_route_input_slow
     0.00%  swapper          [kernel.kallsyms]               [k] sched_clock_tick
     0.00%  swapper          [kernel.kallsyms]               [k] __nf_conntrack_find_get.isra.0
     0.00%  perf             [kernel.kallsyms]               [k] __intel_pmu_enable_all.isra.0
     0.00%  gmain            libc.so.6                       [.] clock_gettime
     0.00%  kworker/5:2-eve  [kernel.kallsyms]               [k] psi_task_switch
     0.00%  swapper          [kernel.kallsyms]               [k] input_event_dispose
     0.00%  swapper          [kernel.kallsyms]               [k] __next_timer_interrupt
     0.00%  swapper          [kernel.kallsyms]               [k] ct_irq_enter
     0.00%  kwin_x11         libc.so.6                       [.] 0x0000000000082620
     0.00%  dmeventd         libc.so.6                       [.] 0x0000000000087dfd
     0.00%  perf             [kernel.kallsyms]               [k] perf_ctx_enable.constprop.0
     0.00%  kworker/4:2-eve  [kernel.kallsyms]               [k] fold_diff
     0.00%  rcu_sched        [kernel.kallsyms]               [k] put_prev_task_fair
     0.00%  swapper          [kernel.kallsyms]               [k] tick_nohz_get_next_hrtimer
     0.00%  usbhid-ups       [kernel.kallsyms]               [k] unix_poll
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __schedule
     0.00%  rcu_sched        [kernel.kallsyms]               [k] update_rq_clock.part.0
     0.00%  swapper          [kernel.kallsyms]               [k] put_cpu_partial
     0.00%  perf             [kernel.kallsyms]               [k] nmi_restore
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __timer_delete_sync
     0.00%  kworker/3:2-mm_  [kernel.kallsyms]               [k] lru_add_drain_per_cpu
     0.00%  swapper          [kernel.kallsyms]               [k] local_touch_nmi
     0.00%  swapper          [kernel.kallsyms]               [k] rcu_cblist_dequeue
     0.00%  swapper          [kernel.kallsyms]               [k] notifier_call_chain
     0.00%  swapper          [kernel.kallsyms]               [k] update_rq_clock
     0.00%  rcu_sched        [kernel.kallsyms]               [k] force_qs_rnp
     0.00%  swapper          [kernel.kallsyms]               [k] __mod_timer
     0.00%  swapper          [kernel.kallsyms]               [k] update_group_capacity
     0.00%  rcu_sched        [kernel.kallsyms]               [k] __lock_text_start
     0.00%  rcu_sched        [kernel.kallsyms]               [k] newidle_balance.isra.0
     0.00%  rcu_sched        [kernel.kallsyms]               [k] _raw_spin_lock
     0.00%  rcu_sched        [kernel.kallsyms]               [k] schedule_timeout
     0.00%  swapper          [kernel.kallsyms]               [k] __enqueue_entity
     0.00%  swapper          [kernel.kallsyms]               [k] put_ucounts
     0.00%  perf             [kernel.kallsyms]               [k] native_apic_msr_write

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23 23:58         ` Dan Moulding
@ 2024-01-25  0:01           ` Song Liu
  2024-01-25 16:44             ` junxiao.bi
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-01-25  0:01 UTC (permalink / raw)
  To: Dan Moulding, junxiao.bi
  Cc: gregkh, linux-kernel, linux-raid, regressions, stable, yukuai1

Thanks for the information!


On Tue, Jan 23, 2024 at 3:58 PM Dan Moulding <dan@danm.net> wrote:
>
> > This appears the md thread hit some infinite loop, so I would like to
> > know what it is doing. We can probably get the information with the
> > perf tool, something like:
> >
> > perf record -a
> > perf report
>
> Here you go!
>
> # Total Lost Samples: 0
> #
> # Samples: 78K of event 'cycles'
> # Event count (approx.): 83127675745
> #
> # Overhead  Command          Shared Object                   Symbol
> # ........  ...............  ..............................  ...................................................
> #
>     49.31%  md0_raid5        [kernel.kallsyms]               [k] handle_stripe
>     18.63%  md0_raid5        [kernel.kallsyms]               [k] ops_run_io
>      6.07%  md0_raid5        [kernel.kallsyms]               [k] handle_active_stripes.isra.0
>      5.50%  md0_raid5        [kernel.kallsyms]               [k] do_release_stripe
>      3.09%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
>      2.48%  md0_raid5        [kernel.kallsyms]               [k] r5l_write_stripe
>      1.89%  md0_raid5        [kernel.kallsyms]               [k] md_wakeup_thread
>      1.45%  ksmd             [kernel.kallsyms]               [k] ksm_scan_thread
>      1.37%  md0_raid5        [kernel.kallsyms]               [k] stripe_is_lowprio
>      0.87%  ksmd             [kernel.kallsyms]               [k] memcmp
>      0.68%  ksmd             [kernel.kallsyms]               [k] xxh64
>      0.56%  md0_raid5        [kernel.kallsyms]               [k] __wake_up_common
>      0.52%  md0_raid5        [kernel.kallsyms]               [k] __wake_up
>      0.46%  ksmd             [kernel.kallsyms]               [k] mtree_load
>      0.44%  ksmd             [kernel.kallsyms]               [k] try_grab_page
>      0.40%  ksmd             [kernel.kallsyms]               [k] follow_p4d_mask.constprop.0
>      0.39%  md0_raid5        [kernel.kallsyms]               [k] r5l_log_disk_error
>      0.37%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irq
>      0.33%  md0_raid5        [kernel.kallsyms]               [k] release_stripe_list
>      0.31%  md0_raid5        [kernel.kallsyms]               [k] release_inactive_stripe_list

It appears the thread is indeed doing something. I haven't had any
luck reproducing this on my hosts. Could you please check whether the
following change fixes the issue (without reverting 0de40f76d567)? I
will keep trying to reproduce the issue on my side.

Junxiao,

Please also help look into this.

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-25  0:01           ` Song Liu
@ 2024-01-25 16:44             ` junxiao.bi
  2024-01-25 19:40               ` Song Liu
  2024-01-25 20:31               ` Dan Moulding
  0 siblings, 2 replies; 53+ messages in thread
From: junxiao.bi @ 2024-01-25 16:44 UTC (permalink / raw)
  To: Song Liu, Dan Moulding
  Cc: gregkh, linux-kernel, linux-raid, regressions, stable, yukuai1

Hi Dan,

Thanks for the report.

Can you describe the hang in more detail? Are there any hung-task
warnings or other errors in dmesg, and are any processes stuck in D
state (if so, what are their call traces)? From the perf result it
looks like the raid thread is doing some real work; it may be issuing
I/O, since ops_run_io() took around 20% of the CPU. Please share
"iostat -xz 1" output while the workload is running; I am wondering
whether this is a performance issue with the workload.

Thanks,

Junxiao.

On 1/24/24 4:01 PM, Song Liu wrote:
> Thanks for the information!
>
>
> On Tue, Jan 23, 2024 at 3:58 PM Dan Moulding <dan@danm.net> wrote:
>>> This appears the md thread hit some infinite loop, so I would like to
>>> know what it is doing. We can probably get the information with the
>>> perf tool, something like:
>>>
>>> perf record -a
>>> perf report
>> Here you go!
>>
>> # Total Lost Samples: 0
>> #
>> # Samples: 78K of event 'cycles'
>> # Event count (approx.): 83127675745
>> #
>> # Overhead  Command          Shared Object                   Symbol
>> # ........  ...............  ..............................  ...................................................
>> #
>>      49.31%  md0_raid5        [kernel.kallsyms]               [k] handle_stripe
>>      18.63%  md0_raid5        [kernel.kallsyms]               [k] ops_run_io
>>       6.07%  md0_raid5        [kernel.kallsyms]               [k] handle_active_stripes.isra.0
>>       5.50%  md0_raid5        [kernel.kallsyms]               [k] do_release_stripe
>>       3.09%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irqsave
>>       2.48%  md0_raid5        [kernel.kallsyms]               [k] r5l_write_stripe
>>       1.89%  md0_raid5        [kernel.kallsyms]               [k] md_wakeup_thread
>>       1.45%  ksmd             [kernel.kallsyms]               [k] ksm_scan_thread
>>       1.37%  md0_raid5        [kernel.kallsyms]               [k] stripe_is_lowprio
>>       0.87%  ksmd             [kernel.kallsyms]               [k] memcmp
>>       0.68%  ksmd             [kernel.kallsyms]               [k] xxh64
>>       0.56%  md0_raid5        [kernel.kallsyms]               [k] __wake_up_common
>>       0.52%  md0_raid5        [kernel.kallsyms]               [k] __wake_up
>>       0.46%  ksmd             [kernel.kallsyms]               [k] mtree_load
>>       0.44%  ksmd             [kernel.kallsyms]               [k] try_grab_page
>>       0.40%  ksmd             [kernel.kallsyms]               [k] follow_p4d_mask.constprop.0
>>       0.39%  md0_raid5        [kernel.kallsyms]               [k] r5l_log_disk_error
>>       0.37%  md0_raid5        [kernel.kallsyms]               [k] _raw_spin_lock_irq
>>       0.33%  md0_raid5        [kernel.kallsyms]               [k] release_stripe_list
>>       0.31%  md0_raid5        [kernel.kallsyms]               [k] release_inactive_stripe_list
> It appears the thread is indeed doing something. I haven't had any
> luck reproducing this on my hosts. Could you please check whether the
> following change fixes the issue (without reverting 0de40f76d567)? I
> will keep trying to reproduce the issue on my side.
>
> Junxiao,
>
> Please also help look into this.
>
> Thanks,
> Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-25 16:44             ` junxiao.bi
@ 2024-01-25 19:40               ` Song Liu
  2024-01-25 20:31               ` Dan Moulding
  1 sibling, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-01-25 19:40 UTC (permalink / raw)
  To: junxiao.bi
  Cc: Dan Moulding, gregkh, linux-kernel, linux-raid, regressions,
	stable, yukuai1

On Thu, Jan 25, 2024 at 8:44 AM <junxiao.bi@oracle.com> wrote:
>
> Hi Dan,
>
> Thanks for the report.
>
> Can you describe the hang in more detail? Are there any hung-task
> warnings or other errors in dmesg, and are any processes stuck in D
> state (if so, what are their call traces)? From the perf result it
> looks like the raid thread is doing some real work; it may be issuing
> I/O, since ops_run_io() took around 20% of the CPU. Please share
> "iostat -xz 1" output while the workload is running; I am wondering
> whether this is a performance issue with the workload.

I am hoping to get a repro on my side. From the information shared
by Dan, the md thread is busy looping on some stripes. The issue
probably only triggers when a raid5 journal is configured.
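
(For anyone trying to set up a repro: whether an array has a write
journal can be checked from /proc/mdstat, where the journal member is
flagged with (J), or from mdadm; the md0 name here is simply the one
from Dan's report:

  cat /proc/mdstat
  mdadm --detail /dev/md0
)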

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-25 16:44             ` junxiao.bi
  2024-01-25 19:40               ` Song Liu
@ 2024-01-25 20:31               ` Dan Moulding
  2024-01-26  3:30                 ` Carlos Carvalho
                                   ` (2 more replies)
  1 sibling, 3 replies; 53+ messages in thread
From: Dan Moulding @ 2024-01-25 20:31 UTC (permalink / raw)
  To: junxiao.bi
  Cc: dan, gregkh, linux-kernel, linux-raid, regressions, song, stable,
	yukuai1

Hi Junxiao,

I first noticed this problem the next day after I had upgraded some
machines to the 6.7.1 kernel. One of the machines is a backup server.
Just a few hours after the upgrade to 6.7.1, it started running its
overnight backup jobs. Those backup jobs hung part way through. When I
tried to check on the backups in the morning, I found the server
mostly unresponsive. I could SSH in but most shell commands would just
hang. I was able to run top and see that the md0_raid5 kernel thread
was using 100% CPU. I tried to reboot the server, but it wasn't able
to shut down successfully, and eventually I had to hard reset it.

The next day, the same sequence of events occurred on that server
again when it tried to run its backup jobs. Then the following day, I
experienced another hang on a different machine, with a similar RAID-5
configuration. That time I was scp'ing a large file to a virtual
machine whose image was stored on the RAID-5 array. Part way through
the transfer scp reported that the transfer had stalled. I checked top
on that machine and found once again that the md0_raid5 kernel thread
was using 100% CPU.

Yesterday I created a fresh Fedora 39 VM for the purposes of
reproducing this problem in a different environment (the other two
machines are both Gentoo servers running v6.7 kernels straight from
the stable trees with a custom kernel configuration). I am able to
reproduce the problem on Fedora 39 running both the v6.6.13 stable
tree kernel code and the Fedora 39 6.6.13 distribution kernel.

On this Fedora 39 VM, I created a 1GiB LVM volume to use as the RAID-5
journal from space on the "boot" disk. Then I attached 3 additional
100 GiB virtual disks and created the RAID-5 from those 3 disks and
the write-journal device. I then created a new LVM volume group from
the md0 array and created one LVM logical volume named "data", using
all but 64GiB of the available VG space. I then created an ext4 file
system on the "data" volume, mounted it, and used "dd" to copy 1MiB
blocks from /dev/urandom to a file on the "data" file system, and just
let it run. Eventually "dd" hangs and top shows that md0_raid5 is
using 100% CPU.
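
For reference, a rough sketch of the commands that would produce this
layout (reconstructed from the description above and the listings
below; it is not the exact command history, and sizes are approximate):

  # 1 GiB journal LV on the "boot" disk (sda4), per the pvs/vgs output below
  pvcreate /dev/sda4
  vgcreate journal /dev/sda4
  lvcreate -L 1G -n journal journal

  # RAID-5 from the three 100 GiB disks plus the write-journal device
  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        --write-journal=/dev/journal/journal /dev/sdc /dev/sdd /dev/sde

  # LVM and ext4 on top of md0, mounted at /data for the dd run below
  pvcreate /dev/md0
  vgcreate array /dev/md0
  lvcreate -L 136G -n data array
  mkfs.ext4 /dev/array/data
  mkdir -p /data && mount /dev/array/data /data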

Here is an example command I just ran, which has hung after writing
4.1 GiB of random data to the array:

test@localhost:~$ dd if=/dev/urandom bs=1M of=/data/random.dat status=progress
4410310656 bytes (4.4 GB, 4.1 GiB) copied, 324 s, 13.6 MB/s

Top shows md0_raid5 using 100% CPU and dd in the "D" state:

top - 19:10:07 up 14 min,  1 user,  load average: 7.00, 5.93, 3.30
Tasks: 246 total,   2 running, 244 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 37.5 id, 50.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem :   1963.4 total,     81.6 free,    490.7 used,   1560.2 buff/cache     
MiB Swap:   1963.0 total,   1962.5 free,      0.5 used.   1472.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    993 root      20   0       0      0      0 R  99.9   0.0   7:19.08 md0_raid5
   1461 root      20   0       0      0      0 I   0.0   0.0   0:00.17 kworker/1+
     18 root      20   0       0      0      0 I   0.0   0.0   0:00.12 rcu_preem+
   1071 systemd+  20   0   16240   7480   6712 S   0.0   0.4   0:00.22 systemd-o+
   1136 root      20   0  504124  27960  27192 S   0.0   1.4   0:00.26 rsyslogd
      1 root      20   0   75356  27884  10456 S   0.0   1.4   0:01.48 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
...
   1417 test      20   0  222668   3120   2096 D   0.0   0.2   0:10.45 dd

The dd process stack shows this:

test@localhost:~$ sudo cat /proc/1417/stack
[<0>] do_get_write_access+0x266/0x3f0
[<0>] jbd2_journal_get_write_access+0x5f/0x80
[<0>] __ext4_journal_get_write_access+0x74/0x170
[<0>] ext4_reserve_inode_write+0x61/0xc0
[<0>] __ext4_mark_inode_dirty+0x78/0x240
[<0>] ext4_dirty_inode+0x5b/0x80
[<0>] __mark_inode_dirty+0x57/0x390
[<0>] generic_update_time+0x4e/0x60
[<0>] file_modified+0xa1/0xb0
[<0>] ext4_buffered_write_iter+0x54/0x100
[<0>] vfs_write+0x23b/0x420
[<0>] ksys_write+0x6f/0xf0
[<0>] do_syscall_64+0x5d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

I have run that dd test in the VM several times (I have to power cycle
the VM in between tests since each time it hangs it won't successfully
reboot). I also tested creating an LVM snapshot of the "data" LV while
the "dd" is running, and from the few runs I've done it seems the
problem might reproduce more easily when the LVM snapshot exists (the
snapshot acts as a write amplifier, since it performs a copy-on-write
operation when dd writes to the data LV, and perhaps that helps to
induce the problem). However, the backup server I mentioned above does
not use LVM snapshots, so I know that an LVM snapshot isn't required
to cause the problem.
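
(For reference, an LVM snapshot of the "data" LV for that experiment
can be created with something like the following; the snapshot name
and size here are placeholders, not the exact values used:

  lvcreate -s -L 8G -n data-snap array/data
)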

Below I will include a (hopefully) complete description of how this VM
is configured which might aid in efforts to reproduce the problem.

I hope this helps in understanding the nature of the problem, and may
be of assistance in diagnosing or reproducing the issue.

-- Dan


test@localhost:~$ ls -ld /sys/block/sd*
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sda -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:0/block/sda
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sdb -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:4/block/sdb
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sdc -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:3/block/sdc
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sdd -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:2/block/sdd
lrwxrwxrwx. 1 root root 0 Jan 25 19:59 /sys/block/sde -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/virtio2/host6/target6:0:0/6:0:0:1/block/sde




test@localhost:~$ sudo fdisk -l /dev/sd[a,b,c,d,e]
Disk /dev/sda: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3B52A5A1-29BD-436B-8145-EEF27D9EFC97

Device        Start      End  Sectors Size Type
/dev/sda1      2048     4095     2048   1M BIOS boot
/dev/sda2      4096  2101247  2097152   1G Linux filesystem
/dev/sda3   2101248 14678015 12576768   6G Linux LVM
/dev/sda4  14678016 16777215  2099200   1G Linux LVM
/dev/sda5  16777216 67106815 50329600  24G Linux LVM


Disk /dev/sdb: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdc: 100 GiB, 107374182400 bytes, 209715200 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdd: 100 GiB, 107374182400 bytes, 209715200 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sde: 100 GiB, 107374182400 bytes, 209715200 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes




test@localhost:~$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 dm-1[0](J) sdd[4] sde[3] sdc[1]
      209711104 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>




test@localhost:~$ sudo pvs
  PV         VG      Fmt  Attr PSize   PFree
  /dev/md0   array   lvm2 a--  199.99g    0 
  /dev/sda3  sysvg   lvm2 a--   <6.00g    0 
  /dev/sda4  journal lvm2 a--    1.00g    0 
  /dev/sda5  sysvg   lvm2 a--  <24.00g    0 
  /dev/sdb   sysvg   lvm2 a--  <32.00g    0 




test@localhost:~$ sudo vgs
  VG      #PV #LV #SN Attr   VSize   VFree 
  array     1   1   0 wz--n- 199.99g 63.99g
  journal   1   1   0 wz--n-   1.00g     0 
  sysvg     3   1   0 wz--n- <61.99g     0 




test@localhost:~$ sudo lvs
  LV      VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data    array   -wi-ao---- 136.00g                                                    
  journal journal -wi-ao----   1.00g                                                    
  root    sysvg   -wi-ao---- <61.99g                                                    




test@localhost:~$ sudo blkid
/dev/mapper/journal-journal: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="446d19c8-d56c-6938-a82f-ff8d52ba1772" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sdd: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="5ab8d465-102a-d333-b1fa-012bd73d7cf5" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sdb: UUID="Gj7d9g-LgcN-LLVl-iv37-DFZy-U0mz-s5nt3e" TYPE="LVM2_member"
/dev/md0: UUID="LcJ3i3-8Gfc-vs1g-ZNZc-8m0G-bPI3-l87W4X" TYPE="LVM2_member"
/dev/mapper/sysvg-root: LABEL="sysroot" UUID="22b0112a-6f38-41d6-921e-2492a19008f0" BLOCK_SIZE="512" TYPE="xfs"
/dev/sde: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="171616cc-ce88-94be-affe-00933b8a7a30" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sdc: UUID="3e3379f6-ef7a-6bd4-adc1-1c00e328a556" UUID_SUB="6d28b122-c1b6-973d-f8df-0834756581f0" LABEL="localhost.localdomain:0" TYPE="linux_raid_member"
/dev/sda4: UUID="ceH3kP-hljE-T6q4-W2qI-Iutm-Vf2N-Uz4omD" TYPE="LVM2_member" PARTUUID="2ed40d4b-f8b2-4c86-b8ca-61216a0c3f48"
/dev/sda2: UUID="c2192edb-0767-464b-9c3a-29d2d8e11c6e" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="effdb052-4887-4571-84df-5c5df132d702"
/dev/sda5: UUID="MEAcyI-qQwk-shwO-Y8qv-EFGa-ggpm-t6NhAV" TYPE="LVM2_member" PARTUUID="343aa231-9f62-46e2-b412-66640d153840"
/dev/sda3: UUID="yKUg0d-XqD2-5IEA-GFkd-6kDc-jVLz-cntwkj" TYPE="LVM2_member" PARTUUID="0dfa0e2d-f467-4e26-b013-9c965ed5a95c"
/dev/zram0: LABEL="zram0" UUID="5087ad0b-ec76-4de7-bbeb-7f39dd1ae318" TYPE="swap"
/dev/mapper/array-data: UUID="fcb29d49-5546-487f-9620-18afb0eeee90" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sda1: PARTUUID="93d0bf6a-463d-4a2a-862f-0a4026964d54"




test@localhost:~$ lsblk -i
NAME                MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
sda                   8:0    0   32G  0 disk  
|-sda1                8:1    0    1M  0 part  
|-sda2                8:2    0    1G  0 part  /boot
|-sda3                8:3    0    6G  0 part  
| `-sysvg-root      253:0    0   62G  0 lvm   /
|-sda4                8:4    0    1G  0 part  
| `-journal-journal 253:1    0    1G  0 lvm   
|   `-md0             9:0    0  200G  0 raid5 
|     `-array-data  253:3    0  136G  0 lvm   /data
`-sda5                8:5    0   24G  0 part  
  `-sysvg-root      253:0    0   62G  0 lvm   /
sdb                   8:16   0   32G  0 disk  
`-sysvg-root        253:0    0   62G  0 lvm   /
sdc                   8:32   0  100G  0 disk  
`-md0                 9:0    0  200G  0 raid5 
  `-array-data      253:3    0  136G  0 lvm   /data
sdd                   8:48   0  100G  0 disk  
`-md0                 9:0    0  200G  0 raid5 
  `-array-data      253:3    0  136G  0 lvm   /data
sde                   8:64   0  100G  0 disk  
`-md0                 9:0    0  200G  0 raid5 
  `-array-data      253:3    0  136G  0 lvm   /data
zram0               252:0    0  1.9G  0 disk  [SWAP]




test@localhost:~$ findmnt --ascii
TARGET                         SOURCE                 FSTYPE      OPTIONS
/                              /dev/mapper/sysvg-root xfs         rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
|-/dev                         devtmpfs               devtmpfs    rw,nosuid,seclabel,size=4096k,nr_inodes=245904,mode=755,inode64
| |-/dev/hugepages             hugetlbfs              hugetlbfs   rw,nosuid,nodev,relatime,seclabel,pagesize=2M
| |-/dev/mqueue                mqueue                 mqueue      rw,nosuid,nodev,noexec,relatime,seclabel
| |-/dev/shm                   tmpfs                  tmpfs       rw,nosuid,nodev,seclabel,inode64
| `-/dev/pts                   devpts                 devpts      rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000
|-/sys                         sysfs                  sysfs       rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/fs/selinux            selinuxfs              selinuxfs   rw,nosuid,noexec,relatime
| |-/sys/kernel/debug          debugfs                debugfs     rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/kernel/tracing        tracefs                tracefs     rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/fs/fuse/connections   fusectl                fusectl     rw,nosuid,nodev,noexec,relatime
| |-/sys/kernel/security       securityfs             securityfs  rw,nosuid,nodev,noexec,relatime
| |-/sys/fs/cgroup             cgroup2                cgroup2     rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot
| |-/sys/fs/pstore             pstore                 pstore      rw,nosuid,nodev,noexec,relatime,seclabel
| |-/sys/fs/bpf                bpf                    bpf         rw,nosuid,nodev,noexec,relatime,mode=700
| `-/sys/kernel/config         configfs               configfs    rw,nosuid,nodev,noexec,relatime
|-/proc                        proc                   proc        rw,nosuid,nodev,noexec,relatime
| `-/proc/sys/fs/binfmt_misc   systemd-1              autofs      rw,relatime,fd=34,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=8800
|   `-/proc/sys/fs/binfmt_misc binfmt_misc            binfmt_misc rw,nosuid,nodev,noexec,relatime
|-/run                         tmpfs                  tmpfs       rw,nosuid,nodev,seclabel,size=402112k,nr_inodes=819200,mode=755,inode64
| `-/run/user/1000             tmpfs                  tmpfs       rw,nosuid,nodev,relatime,seclabel,size=201056k,nr_inodes=50264,mode=700,uid=1000,gid=1000,inode64
|-/tmp                         tmpfs                  tmpfs       rw,nosuid,nodev,seclabel,nr_inodes=1048576,inode64
|-/boot                        /dev/sda2              ext4        rw,relatime,seclabel
|-/data                        /dev/mapper/array-data ext4        rw,relatime,seclabel,stripe=256
`-/var/lib/nfs/rpc_pipefs      sunrpc                 rpc_pipefs  rw,relatime




(On virtual machine host)
$ sudo virsh dumpxml raid5-test-Fedora-Server-39-x86_64
<domain type='kvm' id='48'>
  <name>raid5-test-Fedora-Server-39-x86_64</name>
  <uuid>abb4cad1-35a4-4209-9da1-01e1cf3463da</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://fedoraproject.org/fedora/38"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-q35-8.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <vmport state='off'/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='on'/>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64.qcow2' index='5'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-raid-1.qcow2' index='4'/>
      <backingStore/>
      <target dev='sdb' bus='scsi'/>
      <alias name='scsi0-0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-raid-2.qcow2' index='3'/>
      <backingStore/>
      <target dev='sdc' bus='scsi'/>
      <alias name='scsi0-0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-raid-3.qcow2' index='2'/>
      <backingStore/>
      <target dev='sdd' bus='scsi'/>
      <alias name='scsi0-0-0-3'/>
      <address type='drive' controller='0' bus='0' target='0' unit='3'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/raid5-test-Fedora-Server-39-x86_64-1.qcow2' index='1'/>
      <backingStore/>
      <target dev='sde' bus='scsi'/>
      <alias name='scsi0-0-0-4'/>
      <address type='drive' controller='0' bus='0' target='0' unit='4'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x1a'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0x1b'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0x1c'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0x1d'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <interface type='network'>
      <mac address='52:54:00:01:a7:85'/>
      <source network='default' portid='2f054bc0-bdd3-4431-9a7f-f57c84313f0d' bridge='virbr0'/>
      <target dev='vnet47'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/9'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/9'>
      <source path='/dev/pts/9'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/run/libvirt/qemu/channel/48-raid5-test-Fedora-Se/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='spice' port='5902' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
      <image compression='off'/>
    </graphics>
    <sound model='ich9'>
      <alias name='sound0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1b' function='0x0'/>
    </sound>
    <audio id='1' type='spice'/>
    <video>
      <model type='virtio' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <redirdev bus='usb' type='spicevmc'>
      <alias name='redir0'/>
      <address type='usb' bus='0' port='2'/>
    </redirdev>
    <redirdev bus='usb' type='spicevmc'>
      <alias name='redir1'/>
      <address type='usb' bus='0' port='3'/>
    </redirdev>
    <watchdog model='itco' action='reset'>
      <alias name='watchdog0'/>
    </watchdog>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='rng0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+77:+77</label>
    <imagelabel>+77:+77</imagelabel>
  </seclabel>
</domain>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-25 20:31               ` Dan Moulding
@ 2024-01-26  3:30                 ` Carlos Carvalho
  2024-01-26 15:46                   ` Dan Moulding
  2024-01-26 16:21                   ` Roman Mamedov
  2024-01-31 17:37                 ` junxiao.bi
  2024-02-06  8:07                 ` Song Liu
  2 siblings, 2 replies; 53+ messages in thread
From: Carlos Carvalho @ 2024-01-26  3:30 UTC (permalink / raw)
  To: Dan Moulding
  Cc: junxiao.bi, gregkh, linux-kernel, linux-raid, regressions, song,
	stable, yukuai1

Dan Moulding (dan@danm.net) wrote on Thu, Jan 25, 2024 at 05:31:30PM -03:
> I then created an ext4 file system on the "data" volume, mounted it, and used
> "dd" to copy 1MiB blocks from /dev/urandom to a file on the "data" file
> system, and just let it run. Eventually "dd" hangs and top shows that
> md0_raid5 is using 100% CPU.

It's known that ext4 has these symptoms with parity raid. To make sure it's a
raid problem you should try another filesystem or remount it with stripe=0.
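
(For ext4 that would be something like the following, assuming the
/data mount point from the earlier report; stripe=0 disables the
stripe-aligned allocation hints that ext4 derives from the RAID
geometry:

  mount -o remount,stripe=0 /data
)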

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-26  3:30                 ` Carlos Carvalho
@ 2024-01-26 15:46                   ` Dan Moulding
  2024-01-30 16:26                     ` Blazej Kucman
  2024-01-26 16:21                   ` Roman Mamedov
  1 sibling, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-01-26 15:46 UTC (permalink / raw)
  To: carlos
  Cc: dan, gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	song, stable, yukuai1

> It's known that ext4 has these symptoms with parity raid.

Interesting. I'm not aware of that problem. One of the systems that
hit this hang has been running with ext4 on an MD RAID-5 array with
every kernel since at least 5.1 and never had an issue until this
regression.

> To make sure it's a raid problem you should try another filesystem or
> remount it with stripe=0.

That's a good suggestion, so I switched it to use XFS. I can still
reproduce the hang. Sounds like this is probably a different problem
from the known ext4 one.
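
(Concretely, something along these lines, reusing the same "data" LV;
this is an illustrative sketch rather than the exact commands used:

  umount /data
  mkfs.xfs -f /dev/array/data
  mount /dev/array/data /data

followed by the same dd workload as before.)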

Thanks,

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-26  3:30                 ` Carlos Carvalho
  2024-01-26 15:46                   ` Dan Moulding
@ 2024-01-26 16:21                   ` Roman Mamedov
  1 sibling, 0 replies; 53+ messages in thread
From: Roman Mamedov @ 2024-01-26 16:21 UTC (permalink / raw)
  To: Carlos Carvalho
  Cc: Dan Moulding, junxiao.bi, gregkh, linux-kernel, linux-raid,
	regressions, song, stable, yukuai1

On Fri, 26 Jan 2024 00:30:46 -0300
Carlos Carvalho <carlos@fisica.ufpr.br> wrote:

> Dan Moulding (dan@danm.net) wrote on Thu, Jan 25, 2024 at 05:31:30PM -03:
> > I then created an ext4 file system on the "data" volume, mounted it, and used
> > "dd" to copy 1MiB blocks from /dev/urandom to a file on the "data" file
> > system, and just let it run. Eventually "dd" hangs and top shows that
> > md0_raid5 is using 100% CPU.
> 
> It's known that ext4 has these symptoms with parity raid. To make sure it's a
> raid problem you should try another filesystem or remount it with stripe=0.

If Ext4 wouldn't work properly on parity RAID, then it is a bug that should be
tracked down and fixed, not worked around by using a different FS. I am in
disbelief you are seriously suggesting that, and to be honest really doubt
there is any such high-profile "known" issue that stays unfixed and is just
commonly worked around.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-26 15:46                   ` Dan Moulding
@ 2024-01-30 16:26                     ` Blazej Kucman
  2024-01-30 20:21                       ` Song Liu
                                         ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Blazej Kucman @ 2024-01-30 16:26 UTC (permalink / raw)
  To: Dan Moulding
  Cc: carlos, gregkh, junxiao.bi, linux-kernel, linux-raid,
	regressions, song, stable, yukuai1

Hi,

On Fri, 26 Jan 2024 08:46:10 -0700
Dan Moulding <dan@danm.net> wrote: 
> 
> That's a good suggestion, so I switched it to use XFS. It can still
> reproduce the hang. Sounds like this is probably a different problem
> than the known ext4 one.
> 

Our daily tests targeting mdadm/md also detected a problem with
symptoms identical to those described in this thread.

The issue was detected with IMSM metadata, but it also reproduces with
native metadata.
NVMe disks under VMD controller were used.

Scenario:
1. Create raid10:
mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
--raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
--size=7864320 --run
2. Create FS
mkfs.ext4 /dev/md/r10d4s128-15_A
3. Set faulty one raid member:
mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
4. Stop raid devices:
mdadm -Ss

Expected result:
The raid stops without kernel hangs or errors.

Actual result:
The "mdadm -Ss" command hangs, and
a hung_task warning appears in the OS.

[   62.770472] md: resync of RAID array md127
[  140.893329] md: md127: resync done.
[  204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling device.
               md/raid10:md127: Operation continuing on 3 devices.
[  244.625393] INFO: task kworker/48:1:755 blocked for more than 30 seconds.
[  244.632294]       Tainted: G S                  6.8.0-rc1-20240129.intel.13479453+ #1
[  244.640157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.648105] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755   ppid:2      flags:0x00004000
[  244.657552] Workqueue: md_misc md_start_sync [md_mod]
[  244.662688] Call Trace:
[  244.665176]  <TASK>
[  244.667316]  __schedule+0x2f0/0x9c0
[  244.670868]  ? sched_clock+0x10/0x20
[  244.674510]  schedule+0x28/0x90
[  244.677703]  mddev_suspend+0x11d/0x1e0 [md_mod]
[  244.682313]  ? __update_idle_core+0x29/0xc0
[  244.686574]  ? swake_up_all+0xe0/0xe0
[  244.690302]  md_start_sync+0x3c/0x280 [md_mod]
[  244.694825]  process_scheduled_works+0x87/0x320
[  244.699427]  worker_thread+0x147/0x2a0
[  244.703237]  ? rescuer_thread+0x2d0/0x2d0
[  244.707313]  kthread+0xe5/0x120
[  244.710504]  ? kthread_complete_and_exit+0x20/0x20
[  244.715370]  ret_from_fork+0x31/0x40
[  244.719007]  ? kthread_complete_and_exit+0x20/0x20
[  244.723879]  ret_from_fork_asm+0x11/0x20
[  244.727872]  </TASK>
[  244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
[  244.736486]       Tainted: G S                  6.8.0-rc1-20240129.intel.13479453+ #1
[  244.744345] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.752293] task:mdadm           state:D stack:13512 pid:8457  tgid:8457  ppid:8276   flags:0x00000000
[  244.761736] Call Trace:
[  244.764241]  <TASK>
[  244.766389]  __schedule+0x2f0/0x9c0
[  244.773224]  schedule+0x28/0x90
[  244.779690]  stop_sync_thread+0xfa/0x170 [md_mod]
[  244.787737]  ? swake_up_all+0xe0/0xe0
[  244.794705]  do_md_stop+0x51/0x4c0 [md_mod]
[  244.802166]  md_ioctl+0x59d/0x10a0 [md_mod]
[  244.809567]  blkdev_ioctl+0x1bb/0x270
[  244.816417]  __x64_sys_ioctl+0x7a/0xb0
[  244.823720]  do_syscall_64+0x4e/0x110
[  244.830481]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[  244.838700] RIP: 0033:0x7f2c540c97cb
[  244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  244.856265] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2c540c97cb
[  244.866659] RDX: 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003
[  244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09: 00007fff4ad6a4c5
[  244.887382] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff4ad6a9c0
[  244.897723] R13: 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780
[  244.908018]  </TASK>
[  275.345375] INFO: task kworker/48:1:755 blocked for more than 60 seconds.
[  275.355363]       Tainted: G S                  6.8.0-rc1-20240129.intel.13479453+ #1
[  275.366306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  275.377334] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755   ppid:2      flags:0x00004000
[  275.389863] Workqueue: md_misc md_start_sync [md_mod]
[  275.398102] Call Trace:
[  275.403673]  <TASK>


Also reproduces with XFS FS, does not reproduce when there is no FS on
RAID.

Repository used for testing:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
Branch: master

Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
size check of V1 server-list header")

I see one merge commit touching md after the above one:
01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
git://git.kernel.dk/linux")

I hope these additional logs will help find the cause.

Thanks,
Blazej


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-30 16:26                     ` Blazej Kucman
@ 2024-01-30 20:21                       ` Song Liu
  2024-01-31  1:26                       ` Song Liu
  2024-01-31  2:41                       ` Yu Kuai
  2 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-01-30 20:21 UTC (permalink / raw)
  To: Blazej Kucman
  Cc: Dan Moulding, carlos, gregkh, junxiao.bi, linux-kernel,
	linux-raid, regressions, stable, yukuai1

Hi Blazej,

On Tue, Jan 30, 2024 at 8:27 AM Blazej Kucman
<blazej.kucman@linux.intel.com> wrote:
>
> Hi,
>
> On Fri, 26 Jan 2024 08:46:10 -0700
> Dan Moulding <dan@danm.net> wrote:
> >
> > That's a good suggestion, so I switched it to use XFS. It can still
> > reproduce the hang. Sounds like this is probably a different problem
> > than the known ext4 one.
> >
>
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
>
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
>
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> 4. Stop raid devices:
> mdadm -Ss

Thanks for the report. I can reproduce the issue locally.

The revert [1] cannot fix this one, because the revert is for raid5 (and
the repro is on raid10). I will look into this.

Thanks again!

Song


[1] https://lore.kernel.org/linux-raid/20240125082131.788600-1-song@kernel.org/


>
> Expected result:
> The raid stops without kernel hangs and errors.
>
> Actual result:
> command "mdadm -Ss" hangs,
> hung_task occurs in OS.
>
> [   62.770472] md: resync of RAID array md127
> [  140.893329] md: md127: resync done.
> [  204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling
> device. md/raid10:md127: Operation continuing on 3 devices.
> [  244.625393] INFO: task kworker/48:1:755 blocked for more than 30
> seconds. [  244.632294]       Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [  244.640157] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.648105] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755
>   ppid:2      flags:0x00004000 [  244.657552] Workqueue: md_misc
> md_start_sync [md_mod] [  244.662688] Call Trace: [  244.665176]  <TASK>
> [  244.667316]  __schedule+0x2f0/0x9c0
> [  244.670868]  ? sched_clock+0x10/0x20
> [  244.674510]  schedule+0x28/0x90
> [  244.677703]  mddev_suspend+0x11d/0x1e0 [md_mod]
> [  244.682313]  ? __update_idle_core+0x29/0xc0
> [  244.686574]  ? swake_up_all+0xe0/0xe0
> [  244.690302]  md_start_sync+0x3c/0x280 [md_mod]
> [  244.694825]  process_scheduled_works+0x87/0x320
> [  244.699427]  worker_thread+0x147/0x2a0
> [  244.703237]  ? rescuer_thread+0x2d0/0x2d0
> [  244.707313]  kthread+0xe5/0x120
> [  244.710504]  ? kthread_complete_and_exit+0x20/0x20
> [  244.715370]  ret_from_fork+0x31/0x40
> [  244.719007]  ? kthread_complete_and_exit+0x20/0x20
> [  244.723879]  ret_from_fork_asm+0x11/0x20
> [  244.727872]  </TASK>
> [  244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
> [  244.736486]       Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [  244.744345] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.752293] task:mdadm           state:D stack:13512 pid:8457
> tgid:8457  ppid:8276   flags:0x00000000 [  244.761736] Call Trace: [
> 244.764241]  <TASK> [  244.766389]  __schedule+0x2f0/0x9c0
> [  244.773224]  schedule+0x28/0x90
> [  244.779690]  stop_sync_thread+0xfa/0x170 [md_mod]
> [  244.787737]  ? swake_up_all+0xe0/0xe0
> [  244.794705]  do_md_stop+0x51/0x4c0 [md_mod]
> [  244.802166]  md_ioctl+0x59d/0x10a0 [md_mod]
> [  244.809567]  blkdev_ioctl+0x1bb/0x270
> [  244.816417]  __x64_sys_ioctl+0x7a/0xb0
> [  244.823720]  do_syscall_64+0x4e/0x110
> [  244.830481]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
> [  244.838700] RIP: 0033:0x7f2c540c97cb
> [  244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010 [  244.856265] RAX: ffffffffffffffda RBX:
> 0000000000000003 RCX: 00007f2c540c97cb [  244.866659] RDX:
> 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003 [
> 244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09:
> 00007fff4ad6a4c5 [  244.887382] R10: 0000000000000000 R11:
> 0000000000000246 R12: 00007fff4ad6a9c0 [  244.897723] R13:
> 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780 [
> 244.908018]  </TASK> [  275.345375] INFO: task kworker/48:1:755 blocked
> for more than 60 seconds. [  275.355363]       Tainted: G S
>     6.8.0-rc1-20240129.intel.13479453+ #1 [  275.366306] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 275.377334] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755
>   ppid:2      flags:0x00004000 [  275.389863] Workqueue: md_misc
> md_start_sync [md_mod] [  275.398102] Call Trace: [  275.403673]  <TASK>
>
>
> Also reproduces with XFS FS, does not reproduce when there is no FS on
> RAID.
>
> Repository used for testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> Branch: master
>
> Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
> size check of V1 server-list header")
>
> I see one merge commit touching md after the above one:
> 01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
> git://git.kernel.dk/linux")
>
> I hope these additional logs will help find the cause.
>
> Thanks,
> Blazej
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-30 16:26                     ` Blazej Kucman
  2024-01-30 20:21                       ` Song Liu
@ 2024-01-31  1:26                       ` Song Liu
  2024-01-31  2:13                         ` Yu Kuai
  2024-01-31  2:41                       ` Yu Kuai
  2 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-01-31  1:26 UTC (permalink / raw)
  To: Blazej Kucman, Yu Kuai
  Cc: Dan Moulding, carlos, gregkh, junxiao.bi, linux-kernel,
	linux-raid, regressions, stable

Update my findings so far.

On Tue, Jan 30, 2024 at 8:27 AM Blazej Kucman
<blazej.kucman@linux.intel.com> wrote:
[...]
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
>
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
>
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1

With a failed drive, md_thread calls md_check_recovery() and kicks
off mddev->sync_work, which is md_start_sync().
md_check_recovery() also sets MD_RECOVERY_RUNNING.

md_start_sync() calls mddev_suspend() and waits for
mddev->active_io to become zero.

> 4. Stop raid devices:
> mdadm -Ss

This command calls stop_sync_thread() and waits for
MD_RECOVERY_RUNNING to be cleared.

Given we need a working file system to reproduce the issue, I
suspect the problem comes from active_io.
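
To make the circular wait explicit, here is a toy userspace model
(pthreads, not kernel code; the variable names and the "leaked"
reference are my own stand-ins for MD_RECOVERY_RUNNING and the
active_io percpu_ref, so treat it as an illustration only):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int active_io = 1;         /* reference that is never dropped */
static int recovery_running = 1;  /* "MD_RECOVERY_RUNNING" already set */

/* models md_start_sync() -> mddev_suspend(): wait for active_io == 0 */
static void *start_sync_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (active_io > 0)
		pthread_cond_wait(&cond, &lock);
	recovery_running = 0;             /* never reached */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* models "mdadm -Ss" -> stop_sync_thread(): wait for the flag to clear */
static void *mdadm_stop(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (recovery_running)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
	printf("array stopped\n");        /* never printed */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, start_sync_worker, NULL);
	pthread_create(&b, NULL, mdadm_stop, NULL);
	pthread_join(a, NULL);            /* hangs, like the hung tasks above */
	pthread_join(b, NULL);
	return 0;
}

mdadm waits on the sync worker, and the sync worker waits on a
reference that is never put, which matches the two hung-task
backtraces in the report.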

Yu Kuai, I guess we missed this case in the recent refactoring.
I don't have a good idea to fix this. Please also take a look into
this.

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-31  1:26                       ` Song Liu
@ 2024-01-31  2:13                         ` Yu Kuai
  0 siblings, 0 replies; 53+ messages in thread
From: Yu Kuai @ 2024-01-31  2:13 UTC (permalink / raw)
  To: Song Liu, Blazej Kucman, Yu Kuai
  Cc: Dan Moulding, carlos, gregkh, junxiao.bi, linux-kernel,
	linux-raid, regressions, stable, yukuai (C)

Hi,

在 2024/01/31 9:26, Song Liu 写道:
>> Scenario:
>> 1. Create raid10:
>> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
>> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
>> --size=7864320 --run
>> 2. Create FS
>> mkfs.ext4 /dev/md/r10d4s128-15_A
>> 3. Set faulty one raid member:
>> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> With a failed drive, md_thread calls md_check_recovery() and kicks
> off mddev->sync_work, which is md_start_sync().
> md_check_recovery() also sets MD_RECOVERY_RUNNING.
> 
> md_start_sync() calls mddev_suspend() and waits for
> mddev->active_io to become zero.
> 
>> 4. Stop raid devices:
>> mdadm -Ss
> This command calls stop_sync_thread() and waits for
> MD_RECOVERY_RUNNING to be cleared.
> 
> Given we need a working file system to reproduce the issue, I
> suspect the problem comes from active_io.

I'll look into this, but I don't understand the root cause yet.
Who grabs the 'active_io' reference, and why isn't it released?

Thanks,
Kuai

> 
> Yu Kuai, I guess we missed this case in the recent refactoring.
> I don't have a good idea to fix this. Please also take a look into
> this.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-30 16:26                     ` Blazej Kucman
  2024-01-30 20:21                       ` Song Liu
  2024-01-31  1:26                       ` Song Liu
@ 2024-01-31  2:41                       ` Yu Kuai
  2024-01-31  4:55                         ` Song Liu
  2 siblings, 1 reply; 53+ messages in thread
From: Yu Kuai @ 2024-01-31  2:41 UTC (permalink / raw)
  To: Blazej Kucman, Dan Moulding
  Cc: carlos, gregkh, junxiao.bi, linux-kernel, linux-raid,
	regressions, song, stable, yukuai1, yukuai (C)

Hi, Blazej!

在 2024/01/31 0:26, Blazej Kucman 写道:
> Hi,
> 
> On Fri, 26 Jan 2024 08:46:10 -0700
> Dan Moulding <dan@danm.net> wrote:
>>
>> That's a good suggestion, so I switched it to use XFS. It can still
>> reproduce the hang. Sounds like this is probably a different problem
>> than the known ext4 one.
>>
> 
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
> 
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
> 
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> 4. Stop raid devices:
> mdadm -Ss
> 
> Expected result:
> The raid stops without kernel hangs and errors.
> 
> Actual result:
> command "mdadm -Ss" hangs,
> hung_task occurs in OS.

Can you test the following patch?

Thanks!
Kuai

diff --git a/drivers/md/md.c b/drivers/md/md.c
index e3a56a958b47..a8db84c200fe 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
                         rcu_read_lock();
                 }
         rcu_read_unlock();
-       if (atomic_dec_and_test(&mddev->flush_pending))
+       if (atomic_dec_and_test(&mddev->flush_pending)) {
+               /* The pair is percpu_ref_get() from md_flush_request() */
+               percpu_ref_put(&mddev->active_io);
+
                 queue_work(md_wq, &mddev->flush_work);
+       }
  }

  static void md_submit_flush_data(struct work_struct *ws)
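
In case it helps others reading along, here is a toy userspace model of
the pairing this hunk restores (my reading of the patch; the *_model
names and the plain counters are made up, they just stand in for the
percpu_ref and the real md functions):

#include <stdio.h>

struct mddev_model {
	int active_io;      /* stands in for percpu_ref mddev->active_io */
	int flush_pending;
};

/* md_flush_request() takes one active_io reference per flush request */
static void md_flush_request_model(struct mddev_model *m)
{
	m->active_io++;          /* percpu_ref_get(&mddev->active_io) */
	m->flush_pending++;
}

/* submit_flushes() must drop that reference with the last completion */
static void submit_flushes_model(struct mddev_model *m, int with_fix)
{
	if (--m->flush_pending == 0) {
		if (with_fix)
			m->active_io--;  /* the percpu_ref_put() added above */
		/* queue_work(md_wq, &mddev->flush_work) */
	}
}

int main(void)
{
	struct mddev_model m = { 0, 0 };

	md_flush_request_model(&m);
	submit_flushes_model(&m, 0);
	printf("without the put: active_io = %d (never reaches zero)\n",
	       m.active_io);

	m.active_io = 0;
	m.flush_pending = 0;
	md_flush_request_model(&m);
	submit_flushes_model(&m, 1);
	printf("with the put:    active_io = %d\n", m.active_io);
	return 0;
}

Without the put, mddev_suspend() would wait forever for active_io to
drain, which is exactly the hang in the backtraces.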

> 
> [   62.770472] md: resync of RAID array md127
> [  140.893329] md: md127: resync done.
> [  204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling
> device. md/raid10:md127: Operation continuing on 3 devices.
> [  244.625393] INFO: task kworker/48:1:755 blocked for more than 30
> seconds. [  244.632294]       Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [  244.640157] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.648105] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755
>    ppid:2      flags:0x00004000 [  244.657552] Workqueue: md_misc
> md_start_sync [md_mod] [  244.662688] Call Trace: [  244.665176]  <TASK>
> [  244.667316]  __schedule+0x2f0/0x9c0
> [  244.670868]  ? sched_clock+0x10/0x20
> [  244.674510]  schedule+0x28/0x90
> [  244.677703]  mddev_suspend+0x11d/0x1e0 [md_mod]
> [  244.682313]  ? __update_idle_core+0x29/0xc0
> [  244.686574]  ? swake_up_all+0xe0/0xe0
> [  244.690302]  md_start_sync+0x3c/0x280 [md_mod]
> [  244.694825]  process_scheduled_works+0x87/0x320
> [  244.699427]  worker_thread+0x147/0x2a0
> [  244.703237]  ? rescuer_thread+0x2d0/0x2d0
> [  244.707313]  kthread+0xe5/0x120
> [  244.710504]  ? kthread_complete_and_exit+0x20/0x20
> [  244.715370]  ret_from_fork+0x31/0x40
> [  244.719007]  ? kthread_complete_and_exit+0x20/0x20
> [  244.723879]  ret_from_fork_asm+0x11/0x20
> [  244.727872]  </TASK>
> [  244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
> [  244.736486]       Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [  244.744345] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.752293] task:mdadm           state:D stack:13512 pid:8457
> tgid:8457  ppid:8276   flags:0x00000000 [  244.761736] Call Trace: [
> 244.764241]  <TASK> [  244.766389]  __schedule+0x2f0/0x9c0
> [  244.773224]  schedule+0x28/0x90
> [  244.779690]  stop_sync_thread+0xfa/0x170 [md_mod]
> [  244.787737]  ? swake_up_all+0xe0/0xe0
> [  244.794705]  do_md_stop+0x51/0x4c0 [md_mod]
> [  244.802166]  md_ioctl+0x59d/0x10a0 [md_mod]
> [  244.809567]  blkdev_ioctl+0x1bb/0x270
> [  244.816417]  __x64_sys_ioctl+0x7a/0xb0
> [  244.823720]  do_syscall_64+0x4e/0x110
> [  244.830481]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
> [  244.838700] RIP: 0033:0x7f2c540c97cb
> [  244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010 [  244.856265] RAX: ffffffffffffffda RBX:
> 0000000000000003 RCX: 00007f2c540c97cb [  244.866659] RDX:
> 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003 [
> 244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09:
> 00007fff4ad6a4c5 [  244.887382] R10: 0000000000000000 R11:
> 0000000000000246 R12: 00007fff4ad6a9c0 [  244.897723] R13:
> 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780 [
> 244.908018]  </TASK> [  275.345375] INFO: task kworker/48:1:755 blocked
> for more than 60 seconds. [  275.355363]       Tainted: G S
>      6.8.0-rc1-20240129.intel.13479453+ #1 [  275.366306] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 275.377334] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755
>    ppid:2      flags:0x00004000 [  275.389863] Workqueue: md_misc
> md_start_sync [md_mod] [  275.398102] Call Trace: [  275.403673]  <TASK>
> 
> 
> Also reproduces with XFS FS, does not reproduce when there is no FS on
> RAID.
> 
> Repository used for testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> Branch: master
> 
> Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
> size check of V1 server-list header")
> 
> I see one merge commit touching md after the above one:
> 01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
> git://git.kernel.dk/linux")
> 
> I hope these additional logs will help find the cause.
> 
> Thanks,
> Blazej
> 
> 
> .
> 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-31  2:41                       ` Yu Kuai
@ 2024-01-31  4:55                         ` Song Liu
  2024-01-31 13:36                           ` Blazej Kucman
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-01-31  4:55 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Blazej Kucman, Dan Moulding, carlos, gregkh, junxiao.bi,
	linux-kernel, linux-raid, regressions, stable, yukuai (C)

On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi, Blazej!
>
> 在 2024/01/31 0:26, Blazej Kucman 写道:
> > Hi,
> >
> > On Fri, 26 Jan 2024 08:46:10 -0700
> > Dan Moulding <dan@danm.net> wrote:
> >>
> >> That's a good suggestion, so I switched it to use XFS. It can still
> >> reproduce the hang. Sounds like this is probably a different problem
> >> than the known ext4 one.
> >>
> >
> > Our daily tests directed at mdadm/md also detected a problem with
> > identical symptoms as described in the thread.
> >
> > Issue detected with IMSM metadata but it also reproduces with native
> > metadata.
> > NVMe disks under VMD controller were used.
> >
> > Scenario:
> > 1. Create raid10:
> > mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> > --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> > --size=7864320 --run
> > 2. Create FS
> > mkfs.ext4 /dev/md/r10d4s128-15_A
> > 3. Set faulty one raid member:
> > mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> > 4. Stop raid devices:
> > mdadm -Ss
> >
> > Expected result:
> > The raid stops without kernel hangs and errors.
> >
> > Actual result:
> > command "mdadm -Ss" hangs,
> > hung_task occurs in OS.
>
> Can you test the following patch?
>
> Thanks!
> Kuai
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e3a56a958b47..a8db84c200fe 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
>                          rcu_read_lock();
>                  }
>          rcu_read_unlock();
> -       if (atomic_dec_and_test(&mddev->flush_pending))
> +       if (atomic_dec_and_test(&mddev->flush_pending)) {
> +               /* The pair is percpu_ref_get() from md_flush_request() */
> +               percpu_ref_put(&mddev->active_io);
> +
>                  queue_work(md_wq, &mddev->flush_work);
> +       }
>   }
>
>   static void md_submit_flush_data(struct work_struct *ws)

This fixes the issue in my tests. Please submit the official patch.
Also, we should add a test in mdadm/tests to cover this case.

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-31  4:55                         ` Song Liu
@ 2024-01-31 13:36                           ` Blazej Kucman
  2024-02-01  1:39                             ` Yu Kuai
  0 siblings, 1 reply; 53+ messages in thread
From: Blazej Kucman @ 2024-01-31 13:36 UTC (permalink / raw)
  To: Song Liu
  Cc: Yu Kuai, Dan Moulding, carlos, gregkh, junxiao.bi, linux-kernel,
	linux-raid, regressions, stable, yukuai (C)

On Tue, 30 Jan 2024 20:55:39 -0800
Song Liu <song@kernel.org> wrote:

> On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <yukuai1@huaweicloud.com>
> >
> > Can you test the following patch?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index e3a56a958b47..a8db84c200fe 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct
> > *ws) rcu_read_lock();
> >                  }
> >          rcu_read_unlock();
> > -       if (atomic_dec_and_test(&mddev->flush_pending))
> > +       if (atomic_dec_and_test(&mddev->flush_pending)) {
> > +               /* The pair is percpu_ref_get() from
> > md_flush_request() */
> > +               percpu_ref_put(&mddev->active_io);
> > +
> >                  queue_work(md_wq, &mddev->flush_work);
> > +       }
> >   }
> >
> >   static void md_submit_flush_data(struct work_struct *ws)  
> 
> This fixes the issue in my tests. Please submit the official patch.
> Also, we should add a test in mdadm/tests to cover this case.
> 
> Thanks,
> Song
> 

Hi Kuai,

On my hardware, the issue also stopped reproducing with this fix.

I applied the fix on the current HEAD of the master branch in the
kernel/git/torvalds/linux.git repo.

Thanks,
Blazej




^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-25 20:31               ` Dan Moulding
  2024-01-26  3:30                 ` Carlos Carvalho
@ 2024-01-31 17:37                 ` junxiao.bi
  2024-02-06  8:07                 ` Song Liu
  2 siblings, 0 replies; 53+ messages in thread
From: junxiao.bi @ 2024-01-31 17:37 UTC (permalink / raw)
  To: Dan Moulding
  Cc: gregkh, linux-kernel, linux-raid, regressions, song, stable, yukuai1

Hi Dan,

On 1/25/24 12:31 PM, Dan Moulding wrote:
> On this Fedora 39 VM, I created a 1GiB LVM volume to use as the RAID-5
> journal from space on the "boot" disk. Then I attached 3 additional
> 100 GiB virtual disks and created the RAID-5 from those 3 disks and
> the write-journal device. I then created a new LVM volume group from
> the md0 array and created one LVM logical volume named "data", using
> all but 64GiB of the available VG space. I then created an ext4 file
> system on the "data" volume, mounted it, and used "dd" to copy 1MiB
> blocks from /dev/urandom to a file on the "data" file system, and just
> let it run. Eventually "dd" hangs and top shows that md0_raid5 is
> using 100% CPU.

I can't reproduce this issue with this test case after running it
overnight; dd keeps making progress. I can see dd is very busy, close to
100%, and it sometimes sits in D state, but just for a moment. md0_raid5
stays around 60%, never 100%.

I am wondering whether your case is a performance issue or a real hang.
If it's a hang, I suppose we should see some hung-task call traces for
dd in dmesg, assuming you didn't disable kernel.hung_task_timeout_secs.

Also, are you able to configure kdump and trigger a core dump when the
issue reproduces?

Thanks,

Junxiao.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-31 13:36                           ` Blazej Kucman
@ 2024-02-01  1:39                             ` Yu Kuai
  0 siblings, 0 replies; 53+ messages in thread
From: Yu Kuai @ 2024-02-01  1:39 UTC (permalink / raw)
  To: Blazej Kucman, Song Liu
  Cc: Yu Kuai, Dan Moulding, carlos, gregkh, junxiao.bi, linux-kernel,
	linux-raid, regressions, stable, yukuai (C)

Hi!

在 2024/01/31 21:36, Blazej Kucman 写道:
> Hi Kuai,
> 
> On my hardware issue also stopped reproducing with this fix.
> 
> I applied the fix on current HEAD of master
> branch in kernel/git/torvalds/linux.git repo.

That is great, thanks for testing!

Hi Dan, can you try this patch as well? I feel this is a different
problem than the one you reported first, because reverting 0de40f76d567
shouldn't make any difference.

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-25 20:31               ` Dan Moulding
  2024-01-26  3:30                 ` Carlos Carvalho
  2024-01-31 17:37                 ` junxiao.bi
@ 2024-02-06  8:07                 ` Song Liu
  2024-02-06 20:56                   ` Dan Moulding
  2 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-02-06  8:07 UTC (permalink / raw)
  To: Dan Moulding
  Cc: junxiao.bi, gregkh, linux-kernel, linux-raid, regressions,
	stable, yukuai1

On Thu, Jan 25, 2024 at 12:31 PM Dan Moulding <dan@danm.net> wrote:
>
> Hi Junxiao,
>
> I first noticed this problem the next day after I had upgraded some
> machines to the 6.7.1 kernel. One of the machines is a backup server.
> Just a few hours after the upgrade to 6.7.1, it started running its
> overnight backup jobs. Those backup jobs hung part way through. When I
> tried to check on the backups in the morning, I found the server
> mostly unresponsive. I could SSH in but most shell commands would just
> hang. I was able to run top and see that the md0_raid5 kernel thread
> was using 100% CPU. I tried to reboot the server, but it wasn't able
> to successfully shutdown and eventually I had to hard reset it.
>
> The next day, the same sequence of events occurred on that server
> again when it tried to run its backup jobs. Then the following day, I
> experienced another hang on a different machine, with a similar RAID-5
> configuration. That time I was scp'ing a large file to a virtual
> machine whose image was stored on the RAID-5 array. Part way through
> the transfer scp reported that the transfer had stalled. I checked top
> on that machine and found once again that the md0_raid5 kernel thread
> was using 100% CPU.
>
> Yesterday I created a fresh Fedora 39 VM for the purposes of
> reproducing this problem in a different environment (the other two
> machines are both Gentoo servers running v6.7 kernels straight from
> the stable trees with a custom kernel configuration). I am able to
> reproduce the problem on Fedora 39 running both the v6.6.13 stable
> tree kernel code and the Fedora 39 6.6.13 distribution kernel.
>
> On this Fedora 39 VM, I created a 1GiB LVM volume to use as the RAID-5
> journal from space on the "boot" disk. Then I attached 3 additional
> 100 GiB virtual disks and created the RAID-5 from those 3 disks and
> the write-journal device. I then created a new LVM volume group from
> the md0 array and created one LVM logical volume named "data", using
> all but 64GiB of the available VG space. I then created an ext4 file
> system on the "data" volume, mounted it, and used "dd" to copy 1MiB
> blocks from /dev/urandom to a file on the "data" file system, and just
> let it run. Eventually "dd" hangs and top shows that md0_raid5 is
> using 100% CPU.
>
> Here is an example command I just ran, which has hung after writing
> 4.1 GiB of random data to the array:
>
> test@localhost:~$ dd if=/dev/urandom bs=1M of=/data/random.dat status=progress
> 4410310656 bytes (4.4 GB, 4.1 GiB) copied, 324 s, 13.6 MB/s

Update on this..

I have been testing the following config on the md-6.9 branch [1].
The array works fine AFAICT.

Dan, could you please run the test on this branch
(83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?

Thanks,
Song


[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.9

[root@eth50-1 ~]# lsblk
NAME                             MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0                               11:0    1 1024M  0 rom
vda                              253:0    0   32G  0 disk
├─vda1                           253:1    0    2G  0 part  /boot
└─vda2                           253:2    0   30G  0 part  /
nvme2n1                          259:0    0   50G  0 disk
└─md0                              9:0    0  100G  0 raid5
  ├─vg--md--data-md--data-real   250:2    0   50G  0 lvm
  │ ├─vg--md--data-md--data      250:1    0   50G  0 lvm   /mnt/2
  │ └─vg--md--data-snap          250:4    0   50G  0 lvm
  └─vg--md--data-snap-cow        250:3    0   49G  0 lvm
    └─vg--md--data-snap          250:4    0   50G  0 lvm
nvme0n1                          259:1    0   50G  0 disk
└─md0                              9:0    0  100G  0 raid5
  ├─vg--md--data-md--data-real   250:2    0   50G  0 lvm
  │ ├─vg--md--data-md--data      250:1    0   50G  0 lvm   /mnt/2
  │ └─vg--md--data-snap          250:4    0   50G  0 lvm
  └─vg--md--data-snap-cow        250:3    0   49G  0 lvm
    └─vg--md--data-snap          250:4    0   50G  0 lvm
nvme1n1                          259:2    0   50G  0 disk
└─md0                              9:0    0  100G  0 raid5
  ├─vg--md--data-md--data-real   250:2    0   50G  0 lvm
  │ ├─vg--md--data-md--data      250:1    0   50G  0 lvm   /mnt/2
  │ └─vg--md--data-snap          250:4    0   50G  0 lvm
  └─vg--md--data-snap-cow        250:3    0   49G  0 lvm
    └─vg--md--data-snap          250:4    0   50G  0 lvm
nvme4n1                          259:3    0    2G  0 disk
nvme3n1                          259:4    0   50G  0 disk
└─vg--data-lv--journal           250:0    0  512M  0 lvm
  └─md0                            9:0    0  100G  0 raid5
    ├─vg--md--data-md--data-real 250:2    0   50G  0 lvm
    │ ├─vg--md--data-md--data    250:1    0   50G  0 lvm   /mnt/2
    │ └─vg--md--data-snap        250:4    0   50G  0 lvm
    └─vg--md--data-snap-cow      250:3    0   49G  0 lvm
      └─vg--md--data-snap        250:4    0   50G  0 lvm
nvme5n1                          259:5    0    2G  0 disk
nvme6n1                          259:6    0    4G  0 disk
[root@eth50-1 ~]# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 nvme2n1[4] dm-0[3](J) nvme1n1[1] nvme0n1[0]
      104790016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]

unused devices: <none>
[root@eth50-1 ~]# mount | grep /mnt/2
/dev/mapper/vg--md--data-md--data on /mnt/2 type ext4 (rw,relatime,stripe=256)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-06  8:07                 ` Song Liu
@ 2024-02-06 20:56                   ` Dan Moulding
  2024-02-06 21:34                     ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-02-06 20:56 UTC (permalink / raw)
  To: song
  Cc: dan, gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	stable, yukuai1

> Dan, could you please run the test on this branch
> (83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?

I'm sorry to report that I can still reproduce the problem running the
kernel built from the md-6.9 branch (83cbdaf61b1a).

But the only commit I see on that branch that's not in master and
touches raid5.c is this one:

    test@sysrescue:~/src/linux$ git log master..song/md-6.9 drivers/md/raid5.c
    commit 61c90765e131e63ead773b9b99167415e246a945
    Author: Yu Kuai <yukuai3@huawei.com>
    Date:   Thu Dec 28 20:55:51 2023 +0800

        md: remove redundant check of 'mddev->sync_thread'

Is that expected, or were you expecting additional fixes to be in there?

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-06 20:56                   ` Dan Moulding
@ 2024-02-06 21:34                     ` Song Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-02-06 21:34 UTC (permalink / raw)
  To: Dan Moulding
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	stable, yukuai1

On Tue, Feb 6, 2024 at 12:56 PM Dan Moulding <dan@danm.net> wrote:
>
> > Dan, could you please run the test on this branch
> > (83cbdaf61b1ab9cdaa0321eeea734bc70ca069c8)?
>
> I'm sorry to report that I can still reproduce the problem running the
> kernel built from the md-6.9 branch (83cbdaf61b1a).
>
> But the only commit I see on that branch that's not in master and
> touches raid5.c is this one:
>
>     test@sysrescue:~/src/linux$ git log master..song/md-6.9 drivers/md/raid5.c
>     commit 61c90765e131e63ead773b9b99167415e246a945
>     Author: Yu Kuai <yukuai3@huawei.com>
>     Date:   Thu Dec 28 20:55:51 2023 +0800
>
>         md: remove redundant check of 'mddev->sync_thread'
>
> Is that expected, or were you expecting additional fixes to be in there?

I don't expect that commit to fix the issue. It is expected to be merged to
master in the next merge window. I am curious why I cannot reproduce
the issue. Let me keep trying.

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
  2024-01-23  1:08 ` Song Liu
  2024-01-23  1:35 ` Dan Moulding
@ 2024-02-20 23:06 ` Dan Moulding
  2024-02-20 23:15   ` junxiao.bi
  2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
  2 siblings, 2 replies; 53+ messages in thread
From: Dan Moulding @ 2024-02-20 23:06 UTC (permalink / raw)
  To: dan
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions, song, stable

Just a friendly reminder that this regression still exists on the
mainline. It has been reverted in 6.7 stable. But I upgraded a
development system to 6.8-rc5 today and immediately hit this issue
again. Then I saw that it hasn't yet been reverted in Linus' tree.

Cheers,

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-20 23:06 ` Dan Moulding
@ 2024-02-20 23:15   ` junxiao.bi
  2024-02-21 14:50     ` Mateusz Kusiak
  2024-02-23 17:44     ` Dan Moulding
  2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 2 replies; 53+ messages in thread
From: junxiao.bi @ 2024-02-20 23:15 UTC (permalink / raw)
  To: Dan Moulding; +Cc: gregkh, linux-kernel, linux-raid, regressions, song, stable

Hi Dan,

The thing is, we can't reproduce this issue at all. If you can generate
a vmcore when the hang happens, then we can review which processes are
stuck.

Thanks,

Junxiao.

On 2/20/24 3:06 PM, Dan Moulding wrote:
> Just a friendly reminder that this regression still exists on the
> mainline. It has been reverted in 6.7 stable. But I upgraded a
> development system to 6.8-rc5 today and immediately hit this issue
> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>
> Cheers,
>
> -- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-20 23:15   ` junxiao.bi
@ 2024-02-21 14:50     ` Mateusz Kusiak
  2024-02-21 19:15       ` junxiao.bi
  2024-02-23 17:44     ` Dan Moulding
  1 sibling, 1 reply; 53+ messages in thread
From: Mateusz Kusiak @ 2024-02-21 14:50 UTC (permalink / raw)
  To: junxiao.bi, Dan Moulding
  Cc: gregkh, linux-kernel, linux-raid, regressions, song, stable

On 21.02.2024 00:15, junxiao.bi@oracle.com wrote:
>
> The thing is, we can't reproduce this issue at all. If you can generate
> a vmcore when the hang happens, then we can review which processes
> are stuck.
>
Hi,
I don't know if this will be of any help, but I ran the scenario below
with SATA and NVMe drives. For me, the issue is reproducible on NVMe
drives only.

Scenario:
1. Create R5D3 with native metadata
     # mdadm -CR /dev/md/vol -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
2. Create FS on the array
     # mkfs.ext4 /dev/md/vol -F
3. Remove single member drive via "--incremental --fail"
     # mdadm -If nvme0n1

The result is almost instant.

Thanks,
Mateusz

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-21 14:50     ` Mateusz Kusiak
@ 2024-02-21 19:15       ` junxiao.bi
  0 siblings, 0 replies; 53+ messages in thread
From: junxiao.bi @ 2024-02-21 19:15 UTC (permalink / raw)
  To: Mateusz Kusiak, Dan Moulding
  Cc: gregkh, linux-kernel, linux-raid, regressions, song, stable

On 2/21/24 6:50 AM, Mateusz Kusiak wrote:
> On 21.02.2024 00:15, junxiao.bi@oracle.com wrote:
>>
>> The thing is we can't reproduce this issue at all. If you can 
>> generate a vmcore when the hung happened, then we can review which 
>> processes are stuck.
>>
> Hi,
> don't know if that be any of help, but I run below scenario with SATA 
> and NVMe drives. For me, the issue is reproducible on NVMe drives only.
>
> Scenario:
> 1. Create R5D3 with native metadata
>     # mdadm -CR /dev/md/vol -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
> 2. Create FS on the array
>     # mkfs.ext4 /dev/md/vol -F
> 3. Remove single member drive via "--incremental --fail"
>     # mdadm -If nvme0n1
>
> The result is almost instant.

This is not the same issue that Dan reported; it looks like another
regression that Yu Kuai fixed. Can you please try this patch?

https://lore.kernel.org/lkml/95f2e08e-2daf-e298-e696-42ebfa7b9bbf@huaweicloud.com/

Thanks,

Junxiao.

>
> Thanks,
> Mateusz

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-20 23:06 ` Dan Moulding
  2024-02-20 23:15   ` junxiao.bi
@ 2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-24  2:13     ` Song Liu
  1 sibling, 1 reply; 53+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-23  8:07 UTC (permalink / raw)
  To: song
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	stable, Dan Moulding

On 21.02.24 00:06, Dan Moulding wrote:
> Just a friendly reminder that this regression still exists on the
> mainline. It has been reverted in 6.7 stable. But I upgraded a
> development system to 6.8-rc5 today and immediately hit this issue
> again. Then I saw that it hasn't yet been reverted in Linus' tree.

Song Liu, what's the status here? I'm aware that you fixed quite a
few regressions recently, but it seems like resolving this one has
stalled. Or were you able to reproduce the issue or make some progress,
and I just missed it?

And if not, what's the way forward here wrt to the release of 6.8?
Revert the culprit and try again later? Or is that not an option for one
reason or another?

Or do we assume that this is not a real issue? That it's caused by some
oddity (bit-flip in the metadata or something like that?) only to be
found in Dan's setup?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-20 23:15   ` junxiao.bi
  2024-02-21 14:50     ` Mateusz Kusiak
@ 2024-02-23 17:44     ` Dan Moulding
  2024-02-23 19:18       ` junxiao.bi
  1 sibling, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-02-23 17:44 UTC (permalink / raw)
  To: junxiao.bi
  Cc: dan, gregkh, linux-kernel, linux-raid, regressions, song, stable

Hi Junxiao,

Thanks for your time so far on this problem. It took some time,
because I've never had to generate a vmcore before, but I have one now
and it looks usable from what I've seen using crash and gdb on
it. It's a bit large, 1.1GB. How can I share it? Also, I'm assuming
you'll also need the vmlinux image that it came from? It's also a bit
big, 251MB.

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-23 17:44     ` Dan Moulding
@ 2024-02-23 19:18       ` junxiao.bi
  2024-02-23 20:22         ` Dan Moulding
  0 siblings, 1 reply; 53+ messages in thread
From: junxiao.bi @ 2024-02-23 19:18 UTC (permalink / raw)
  To: Dan Moulding; +Cc: gregkh, linux-kernel, linux-raid, regressions, song, stable

Thanks Dan.

Before we figure out how to share the vmcore, can you run the commands below from crash first:

1. ps -m | grep UN

2. foreach UN bt

3. ps -m | grep md

4. bt each md process

Thanks,

Junxiao.

On 2/23/24 9:44 AM, Dan Moulding wrote:
> Hi Junxiao,
>
> Thanks for your time so far on this problem. It took some time,
> because I've never had to generate a vmcore before, but I have one now
> and it looks usable from what I've seen using crash and gdb on
> it. It's a bit large, 1.1GB. How can I share it? Also, I'm assuming
> you'll also need the vmlinux image that it came from? It's also a bit
> big, 251MB.
>
> -- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-23 19:18       ` junxiao.bi
@ 2024-02-23 20:22         ` Dan Moulding
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Moulding @ 2024-02-23 20:22 UTC (permalink / raw)
  To: junxiao.bi
  Cc: dan, gregkh, linux-kernel, linux-raid, regressions, song, stable

> Before we know how to share vmcore, can you run below cmds from crash first:
>
> 1. ps -m | grep UN
>
> 2. foreach UN bt
>
> 3. ps -m | grep md
>
> 4. bt each md process

Sure, here you go!

----

root@localhost:/var/crash/127.0.0.1-2024-02-23-01:34:56# crash /home/test/src/linux/vmlinux vmcore

crash 8.0.4-2.fc39
Copyright (C) 2002-2022  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
Copyright (C) 2015, 2021  VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

WARNING: ORC unwinder: module orc_entry structures have changed
WARNING: cannot determine how modules are linked
WARNING: no kernel module access

      KERNEL: /home/test/src/linux/vmlinux 
    DUMPFILE: vmcore
        CPUS: 8
        DATE: Fri Feb 23 01:34:54 UTC 2024
      UPTIME: 00:41:00
LOAD AVERAGE: 6.00, 5.90, 4.80
       TASKS: 309
    NODENAME: localhost.localdomain
     RELEASE: 6.8.0-rc5
     VERSION: #1 SMP Fri Feb 23 00:22:23 UTC 2024
     MACHINE: x86_64  (2999 Mhz)
      MEMORY: 8 GB
       PANIC: "Kernel panic - not syncing: sysrq triggered crash"
         PID: 1977
     COMMAND: "bash"
        TASK: ffff888105325880  [THREAD_INFO: ffff888105325880]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)

crash> ps -m | grep UN
[0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1    COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2    COMMAND: "dd"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1    COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7    COMMAND: "kworker/u21:1"
crash> foreach UN bt
PID: 403      TASK: ffff888101351d80  CPU: 7    COMMAND: "kworker/u21:1"
 #0 [ffffc90000863840] __schedule at ffffffff81ac18ac
 #1 [ffffc900008638a0] schedule at ffffffff81ac1d82
 #2 [ffffc900008638b8] io_schedule at ffffffff81ac1e4d
 #3 [ffffc900008638c8] wait_for_in_progress at ffffffff81806224
 #4 [ffffc90000863910] do_origin at ffffffff81807265
 #5 [ffffc90000863948] __map_bio at ffffffff817ede6a
 #6 [ffffc90000863978] dm_submit_bio at ffffffff817ee31e
 #7 [ffffc900008639f0] __submit_bio at ffffffff81515ec1
 #8 [ffffc90000863a08] submit_bio_noacct_nocheck at ffffffff815162a7
 #9 [ffffc90000863a60] ext4_io_submit at ffffffff813b506b
#10 [ffffc90000863a70] ext4_do_writepages at ffffffff81399ed6
#11 [ffffc90000863b20] ext4_writepages at ffffffff8139a85d
#12 [ffffc90000863bb8] do_writepages at ffffffff81258c30
#13 [ffffc90000863c18] __writeback_single_inode at ffffffff8132348a
#14 [ffffc90000863c48] writeback_sb_inodes at ffffffff81323b62
#15 [ffffc90000863d18] __writeback_inodes_wb at ffffffff81323e17
#16 [ffffc90000863d58] wb_writeback at ffffffff8132400a
#17 [ffffc90000863dc0] wb_workfn at ffffffff8132503c
#18 [ffffc90000863e68] process_one_work at ffffffff81147b69
#19 [ffffc90000863ea8] worker_thread at ffffffff81148554
#20 [ffffc90000863ef8] kthread at ffffffff8114f8ee
#21 [ffffc90000863f30] ret_from_fork at ffffffff8108bb98
#22 [ffffc90000863f50] ret_from_fork_asm at ffffffff81000da1

PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
 #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
 #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
 #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
 #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
 #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
 #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
 #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
 #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
 #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

PID: 957      TASK: ffff88810baa0ec0  CPU: 1    COMMAND: "jbd2/dm-3-8"
 #0 [ffffc90001d47b10] __schedule at ffffffff81ac18ac
 #1 [ffffc90001d47b70] schedule at ffffffff81ac1d82
 #2 [ffffc90001d47b88] io_schedule at ffffffff81ac1e4d
 #3 [ffffc90001d47b98] wait_for_in_progress at ffffffff81806224
 #4 [ffffc90001d47be0] do_origin at ffffffff81807265
 #5 [ffffc90001d47c18] __map_bio at ffffffff817ede6a
 #6 [ffffc90001d47c48] dm_submit_bio at ffffffff817ee31e
 #7 [ffffc90001d47cc0] __submit_bio at ffffffff81515ec1
 #8 [ffffc90001d47cd8] submit_bio_noacct_nocheck at ffffffff815162a7
 #9 [ffffc90001d47d30] jbd2_journal_commit_transaction at ffffffff813d246c
#10 [ffffc90001d47e90] kjournald2 at ffffffff813d65cb
#11 [ffffc90001d47ef8] kthread at ffffffff8114f8ee
#12 [ffffc90001d47f30] ret_from_fork at ffffffff8108bb98
#13 [ffffc90001d47f50] ret_from_fork_asm at ffffffff81000da1

PID: 1835     TASK: ffff888108a28ec0  CPU: 2    COMMAND: "dd"
 #0 [ffffc90000c2fb30] __schedule at ffffffff81ac18ac
 #1 [ffffc90000c2fb90] schedule at ffffffff81ac1d82
 #2 [ffffc90000c2fba8] io_schedule at ffffffff81ac1e4d
 #3 [ffffc90000c2fbb8] bit_wait_io at ffffffff81ac2418
 #4 [ffffc90000c2fbc8] __wait_on_bit at ffffffff81ac214a
 #5 [ffffc90000c2fc10] out_of_line_wait_on_bit at ffffffff81ac22cc
 #6 [ffffc90000c2fc60] do_get_write_access at ffffffff813d0bc3
 #7 [ffffc90000c2fcb0] jbd2_journal_get_write_access at ffffffff813d0dc4
 #8 [ffffc90000c2fcd8] __ext4_journal_get_write_access at ffffffff8137c2c9
 #9 [ffffc90000c2fd18] ext4_reserve_inode_write at ffffffff813997f8
#10 [ffffc90000c2fd40] __ext4_mark_inode_dirty at ffffffff81399a38
#11 [ffffc90000c2fdc0] ext4_dirty_inode at ffffffff8139cf52
#12 [ffffc90000c2fdd8] __mark_inode_dirty at ffffffff81323284
#13 [ffffc90000c2fe10] generic_update_time at ffffffff8130de25
#14 [ffffc90000c2fe28] file_modified at ffffffff8130e23c
#15 [ffffc90000c2fe50] ext4_buffered_write_iter at ffffffff81388b6f
#16 [ffffc90000c2fe78] vfs_write at ffffffff812ee149
#17 [ffffc90000c2ff08] ksys_write at ffffffff812ee47e
#18 [ffffc90000c2ff40] do_syscall_64 at ffffffff81ab418e
#19 [ffffc90000c2ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c0006a
    RIP: 00007f14bdcacc74  RSP: 00007ffcee806498  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f14bdcacc74
    RDX: 0000000000100000  RSI: 00007f14bdaa0000  RDI: 0000000000000001
    RBP: 00007ffcee8064c0   R8: 0000000000000001   R9: 00007ffcee8a8080
    R10: 0000000000000017  R11: 0000000000000202  R12: 0000000000100000
    R13: 00007f14bdaa0000  R14: 0000000000000000  R15: 0000000000100000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

PID: 1914     TASK: ffff8881015e6740  CPU: 1    COMMAND: "kworker/1:2"
 #0 [ffffc90000d5fa58] __schedule at ffffffff81ac18ac
 #1 [ffffc90000d5fab8] schedule at ffffffff81ac1d82
 #2 [ffffc90000d5fad0] schedule_timeout at ffffffff81ac64e9
 #3 [ffffc90000d5fb18] io_schedule_timeout at ffffffff81ac15e7
 #4 [ffffc90000d5fb30] __wait_for_common at ffffffff81ac2723
 #5 [ffffc90000d5fb98] sync_io at ffffffff817f695d
 #6 [ffffc90000d5fc00] dm_io at ffffffff817f6b22
 #7 [ffffc90000d5fc80] chunk_io at ffffffff81808950
 #8 [ffffc90000d5fd38] persistent_commit_exception at ffffffff81808caa
 #9 [ffffc90000d5fd50] copy_callback at ffffffff8180601a
#10 [ffffc90000d5fd80] run_complete_job at ffffffff817f78ff
#11 [ffffc90000d5fdc8] process_jobs at ffffffff817f7c5e
#12 [ffffc90000d5fe10] do_work at ffffffff817f7eb7
#13 [ffffc90000d5fe68] process_one_work at ffffffff81147b69
#14 [ffffc90000d5fea8] worker_thread at ffffffff81148554
#15 [ffffc90000d5fef8] kthread at ffffffff8114f8ee
#16 [ffffc90000d5ff30] ret_from_fork at ffffffff8108bb98
#17 [ffffc90000d5ff50] ret_from_fork_asm at ffffffff81000da1
crash> ps -m | grep md
[0 00:00:00.129] [IN]  PID: 965      TASK: ffff88810b8de740  CPU: 4    COMMAND: "systemd-oomd"
[0 00:00:01.187] [RU]  PID: 875      TASK: ffff888108bee740  CPU: 3    COMMAND: "md0_raid5"
[0 00:00:07.128] [IN]  PID: 707      TASK: ffff88810cc31d80  CPU: 1    COMMAND: "systemd-journal"
[0 00:00:07.524] [IN]  PID: 1007     TASK: ffff88810b8dc9c0  CPU: 4    COMMAND: "systemd-logind"
[0 00:00:07.524] [IN]  PID: 1981     TASK: ffff88810521bb00  CPU: 5    COMMAND: "systemd-hostnam"
[0 00:00:07.524] [IN]  PID: 1        TASK: ffff888100158000  CPU: 0    COMMAND: "systemd"
[0 00:00:07.824] [IN]  PID: 1971     TASK: ffff88810521ac40  CPU: 2    COMMAND: "systemd-userwor"
[0 00:00:07.825] [IN]  PID: 1006     TASK: ffff8881045a0ec0  CPU: 4    COMMAND: "systemd-homed"
[0 00:00:07.830] [IN]  PID: 1970     TASK: ffff888105218000  CPU: 1    COMMAND: "systemd-userwor"
[0 00:00:10.916] [IN]  PID: 1972     TASK: ffff888105218ec0  CPU: 1    COMMAND: "systemd-userwor"
[0 00:00:36.004] [IN]  PID: 971      TASK: ffff8881089c2c40  CPU: 0    COMMAND: "systemd-userdbd"
[0 00:10:56.905] [IN]  PID: 966      TASK: ffff888105546740  CPU: 4    COMMAND: "systemd-resolve"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
[0 00:34:52.328] [IN]  PID: 1669     TASK: ffff88810521c9c0  CPU: 2    COMMAND: "systemd"
[0 00:39:21.349] [IN]  PID: 739      TASK: ffff8881089c5880  CPU: 3    COMMAND: "systemd-udevd"
[0 00:40:59.426] [ID]  PID: 74       TASK: ffff888100a68000  CPU: 6    COMMAND: "kworker/R-md"
[0 00:40:59.427] [ID]  PID: 75       TASK: ffff888100a68ec0  CPU: 7    COMMAND: "kworker/R-md_bi"
[0 00:40:59.556] [IN]  PID: 66       TASK: ffff8881003e8000  CPU: 4    COMMAND: "ksmd"
crash> bt 875
PID: 875      TASK: ffff888108bee740  CPU: 3    COMMAND: "md0_raid5"
 #0 [fffffe00000bee60] crash_nmi_callback at ffffffff810a351e
 #1 [fffffe00000bee68] nmi_handle at ffffffff81085acb
 #2 [fffffe00000beea8] default_do_nmi at ffffffff81ab59d2
 #3 [fffffe00000beed0] exc_nmi at ffffffff81ab5c9c
 #4 [fffffe00000beef0] end_repeat_nmi at ffffffff81c010f7
    [exception RIP: ops_run_io+224]
    RIP: ffffffff817c4740  RSP: ffffc90000b3fb58  RFLAGS: 00000206
    RAX: 0000000000000220  RBX: 0000000000000003  RCX: ffff88810cee7098
    RDX: ffff88812495a3d0  RSI: 0000000000000000  RDI: ffff88810cee7000
    RBP: ffff888103884000   R8: 0000000000000000   R9: ffff888103884000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000003  R14: ffff88812495a1b0  R15: ffffc90000b3fc00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffc90000b3fb58] ops_run_io at ffffffff817c4740
 #6 [ffffc90000b3fc40] handle_stripe at ffffffff817cd85d
 #7 [ffffc90000b3fd40] handle_active_stripes at ffffffff817ce82c
 #8 [ffffc90000b3fdd0] raid5d at ffffffff817cee88
 #9 [ffffc90000b3fe98] md_thread at ffffffff817db1ef
#10 [ffffc90000b3fef8] kthread at ffffffff8114f8ee
#11 [ffffc90000b3ff30] ret_from_fork at ffffffff8108bb98
#12 [ffffc90000b3ff50] ret_from_fork_asm at ffffffff81000da1
crash> bt 876
PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
 #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
 #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
 #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
 #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
 #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
 #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
 #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
 #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
 #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
crash> bt 74
PID: 74       TASK: ffff888100a68000  CPU: 6    COMMAND: "kworker/R-md"
 #0 [ffffc900002afdf8] __schedule at ffffffff81ac18ac
 #1 [ffffc900002afe58] schedule at ffffffff81ac1d82
 #2 [ffffc900002afe70] rescuer_thread at ffffffff81148138
 #3 [ffffc900002afef8] kthread at ffffffff8114f8ee
 #4 [ffffc900002aff30] ret_from_fork at ffffffff8108bb98
 #5 [ffffc900002aff50] ret_from_fork_asm at ffffffff81000da1
crash> bt 75
PID: 75       TASK: ffff888100a68ec0  CPU: 7    COMMAND: "kworker/R-md_bi"
 #0 [ffffc900002b7df8] __schedule at ffffffff81ac18ac
 #1 [ffffc900002b7e58] schedule at ffffffff81ac1d82
 #2 [ffffc900002b7e70] rescuer_thread at ffffffff81148138
 #3 [ffffc900002b7ef8] kthread at ffffffff8114f8ee
 #4 [ffffc900002b7f30] ret_from_fork at ffffffff8108bb98
 #5 [ffffc900002b7f50] ret_from_fork_asm at ffffffff81000da1
crash> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-24  2:13     ` Song Liu
  2024-03-01 20:26       ` junxiao.bi
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-02-24  2:13 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, stable, Dan Moulding

Hi,

On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> On 21.02.24 00:06, Dan Moulding wrote:
> > Just a friendly reminder that this regression still exists on the
> > mainline. It has been reverted in 6.7 stable. But I upgraded a
> > development system to 6.8-rc5 today and immediately hit this issue
> > again. Then I saw that it hasn't yet been reverted in Linus' tree.
>
> Song Liu, what's the status here? I aware that you fixed with quite a
> few regressions recently, but it seems like resolving this one is
> stalled. Or were you able to reproduce the issue or make some progress
> and I just missed it?

Sorry for the delay with this issue. I have been occupied with some
other stuff this week.

I haven't had any luck reproducing this issue. I will spend more time
looking into it next week.

>
> And if not, what's the way forward here wrt to the release of 6.8?
> Revert the culprit and try again later? Or is that not an option for one
> reason or another?

If we don't make progress with it in the next week, we will do the revert,
same as we did with stable kernels.

>
> Or do we assume that this is not a real issue? That it's caused by some
> oddity (bit-flip in the metadata or something like that?) only to be
> found in Dan's setup?

I don't think this is because of oddities. Hopefully we can get more
information about this soon.

Thanks,
Song

>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-02-24  2:13     ` Song Liu
@ 2024-03-01 20:26       ` junxiao.bi
  2024-03-01 23:12         ` Dan Moulding
                           ` (3 more replies)
  0 siblings, 4 replies; 53+ messages in thread
From: junxiao.bi @ 2024-03-01 20:26 UTC (permalink / raw)
  To: Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding

Hi Dan & Song,

I have not root-caused this yet, but I would like to share some findings
from the vmcore Dan shared. From what I can see, this doesn't look like
an md issue, but rather something wrong in the block layer or below.

1. There were multiple processes hung on IO for over 15 minutes.

crash> ps -m | grep UN
[0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1    COMMAND: "jbd2/dm-3-8"
[0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2    COMMAND: "dd"
[0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
[0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1    COMMAND: "kworker/1:2"
[0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7    COMMAND: "kworker/u21:1"

2. Let's pick md0_reclaim and take a look: it is waiting for a superblock
update to complete. We can see there were two pending superblock writes
and other pending IO on the underlying physical disks, which is what
caused these processes to hang.

crash> bt 876
PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1

crash> mddev.pending_writes,disks 0xffff888108335800
   pending_writes = {
     counter = 2  <<<<<<<<<< 2 active super block write
   },
   disks = {
     next = 0xffff88810ce85a00,
     prev = 0xffff88810ce84c00
   },
crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 0xffff88810ce85a00
ffff88810ce85a00
   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
   nr_pending = {
     counter = 0
   },
ffff8881083ace00
   kobj.name = 0xffff888100a93280 "dev-sde",
   nr_pending = {
     counter = 10 <<<<
   },
ffff8881010ad200
   kobj.name = 0xffff8881012721c8 "dev-sdc",
   nr_pending = {
     counter = 8 <<<<<
   },
ffff88810ce84c00
   kobj.name = 0xffff888100325f08 "dev-sdd",
   nr_pending = {
     counter = 2 <<<<<
   },

3. From the block layer, I can find the inflight IO for the md superblock
write, which has been pending for 955s; that matches the hang time of
"md0_reclaim".

crash> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io ffff888103b4c300
   q = 0xffff888103a00d80,
   mq_hctx = 0xffff888103c5d200,
   cmd_flags = 38913,
   rq_flags = 139408,
   start_time_ns = 1504179024146,
   bio = 0x0,
   biotail = 0xffff888120758e40,
   state = MQ_RQ_COMPLETE,
   __data_len = 0,
   flush = {
     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
     saved_end_io = 0x0
   },
   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,

crash> p tk_core.timekeeper.tkr_mono.base
$1 = 2459916243002
crash> eval 2459916243002-1504179024146
hexadecimal: de86609f28
     decimal: 955737218856  <<<<<<< IO pending time is 955s
       octal: 15720630117450
      binary: 
0000000000000000000000001101111010000110011000001001111100101000

crash> bio.bi_iter,bi_end_io 0xffff888120758e40
   bi_iter = {
     bi_sector = 8, <<<< super block offset
     bi_size = 0,
     bi_idx = 0,
     bi_bvec_done = 0
   },
   bi_end_io = 0xffffffff817dca50 <super_written>,
crash> dev -d | grep ffff888103a00d80
     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0

4. Checking the above request: even though its state is "MQ_RQ_COMPLETE",
it is still pending. That's because each md superblock write was marked
with REQ_PREFLUSH | REQ_FUA, so it is handled in 3 steps: pre_flush, data,
and post_flush. As each step completes, it is marked in "request.flush.seq";
here the value is 3, which is REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA, so the
last step, "post_flush", has not been done (a small decoding sketch follows
the dump below). Another weird thing is that
blk_flush_queue.flush_data_in_flight is still 1 even though the "data"
step is already done.

crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
   fq = 0xffff88810332e240,
crash> blk_flush_queue 0xffff88810332e240
struct blk_flush_queue {
   mq_flush_lock = {
     {
       rlock = {
         raw_lock = {
           {
             val = {
               counter = 0
             },
             {
               locked = 0 '\000',
               pending = 0 '\000'
             },
             {
               locked_pending = 0,
               tail = 0
             }
           }
         }
       }
     }
   },
   flush_pending_idx = 1,
   flush_running_idx = 1,
   rq_status = 0 '\000',
   flush_pending_since = 4296171408,
   flush_queue = {{
       next = 0xffff88810332e250,
       prev = 0xffff88810332e250
     }, {
       next = 0xffff888103b4c348, <<<< the request is in this list
       prev = 0xffff888103b4c348
     }},
   flush_data_in_flight = 1,  >>>>>> still 1
   flush_rq = 0xffff888103c2e000
}

crash> list 0xffff888103b4c348
ffff888103b4c348
ffff88810332e260

crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
   tag = -1,
   state = MQ_RQ_IDLE,
   ref = {
     counter = 0
   },
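
For reference, "flush.seq" is a bitmask of the flush steps that have
already completed. Below is a minimal stand-alone C sketch (not kernel
code) that decodes the value seen above; the REQ_FSEQ_* bit values are an
assumption on my side, taken from a reading of block/blk-flush.c rather
than from the vmcore:

#include <stdio.h>

#define REQ_FSEQ_PREFLUSH  (1 << 0)
#define REQ_FSEQ_DATA      (1 << 1)
#define REQ_FSEQ_POSTFLUSH (1 << 2)
#define REQ_FSEQ_DONE      (1 << 3)

int main(void)
{
        unsigned int seq = 3;   /* value read from the vmcore above */

        /* print which flush steps have completed for this request */
        printf("PREFLUSH done:  %s\n", (seq & REQ_FSEQ_PREFLUSH) ? "yes" : "no");
        printf("DATA done:      %s\n", (seq & REQ_FSEQ_DATA) ? "yes" : "no");
        printf("POSTFLUSH done: %s\n", (seq & REQ_FSEQ_POSTFLUSH) ? "yes" : "no");
        printf("fully done:     %s\n", (seq & REQ_FSEQ_DONE) ? "yes" : "no");
        return 0;
}

For seq = 3 it prints yes/yes/no/no, i.e. the post-flush step is the one
still outstanding.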

5. It looks like the block layer or the underlying layer (scsi/virtio-scsi)
may have some issue which leaves the IO request from the md layer stuck in
a partially completed state. I can't see how this can be related to
commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d"").


Dan,

Are you able to reproduce this using a regular scsi disk? I would like to
rule out whether this is related to virtio-scsi.

And I see the kernel version is 6.8.0-rc5 from the vmcore; is this the
official mainline v6.8-rc5 without any other patches?


Thanks,

Junxiao.

On 2/23/24 6:13 PM, Song Liu wrote:
> Hi,
>
> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>> On 21.02.24 00:06, Dan Moulding wrote:
>>> Just a friendly reminder that this regression still exists on the
>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>> development system to 6.8-rc5 today and immediately hit this issue
>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>> Song Liu, what's the status here? I aware that you fixed with quite a
>> few regressions recently, but it seems like resolving this one is
>> stalled. Or were you able to reproduce the issue or make some progress
>> and I just missed it?
> Sorry for the delay with this issue. I have been occupied with some
> other stuff this week.
>
> I haven't got luck to reproduce this issue. I will spend more time looking
> into it next week.
>
>> And if not, what's the way forward here wrt to the release of 6.8?
>> Revert the culprit and try again later? Or is that not an option for one
>> reason or another?
> If we don't make progress with it in the next week, we will do the revert,
> same as we did with stable kernels.
>
>> Or do we assume that this is not a real issue? That it's caused by some
>> oddity (bit-flip in the metadata or something like that?) only to be
>> found in Dan's setup?
> I don't think this is because of oddities. Hopefully we can get more
> information about this soon.
>
> Thanks,
> Song
>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot poke
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-01 20:26       ` junxiao.bi
@ 2024-03-01 23:12         ` Dan Moulding
  2024-03-02  0:05           ` Song Liu
  2024-03-02 16:55         ` Dan Moulding
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-03-01 23:12 UTC (permalink / raw)
  To: junxiao.bi
  Cc: dan, gregkh, linux-kernel, linux-raid, regressions, song, stable

> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> some issue which leading to the io request from md layer stayed in a
> partial complete statue. I can't see how this can be related with the
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> raid5d"")

There is no question that the above mentioned commit makes this
problem appear. While it may be that ultimately the root cause lies
outside the md/raid5 code (I'm not able to make such an assessment), I
can tell you that change is what turned it into a runtime
regression. Prior to that change, I cannot reproduce the problem. One
of my RAID-5 arrays has been running on every kernel version since
4.8, without issue. Then kernel 6.7.1 the problem appeared within
hours of running the new code and affected not just one but two
different machines with RAID-5 arrays. With that change reverted, the
problem is not reproducible. Then when I recently upgraded to 6.8-rc5
I immediately hit the problem again (because it hadn't been reverted
in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
machines without issue after reverting that commit on top of it.

It would seem a very unlikely coincidence that a careful bisection of
all changes between 6.7.0 and 6.7.1 pointed at that commit as being
the culprit, and that the change is to raid5.c, and that the hang
happens in the raid5 kernel task, if there was no connection. :)

> Are you able to reproduce using some regular scsi disk, would like to
> rule out whether this is related with virtio-scsi?

The first time I hit this problem was on two bare-metal machines, one
server and one desktop with different hardware. I then set up this
virtual machine just to reproduce the problem in a different
environment (and to see if I could reproduce it with a distribution
kernel since the other machines are running custom kernel
configurations). So I'm able to reproduce it on:

- A virtual machine
- Bare metal machines
- Custom kernel configurations with code straight from stable and mainline
- Fedora 39 distribution kernel

> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> official mainline v6.8-rc5 without any other patches?

Yes this particular vmcore was from the Fedora 39 VM I used to
reproduce the problem, but with the straight 6.8.0-rc5 mainline code
(so that you wouldn't have to worry about any possible interference
from distribution patches).

Cheers,

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-01 23:12         ` Dan Moulding
@ 2024-03-02  0:05           ` Song Liu
  2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 53+ messages in thread
From: Song Liu @ 2024-03-02  0:05 UTC (permalink / raw)
  To: Dan Moulding
  Cc: junxiao.bi, gregkh, linux-kernel, linux-raid, regressions, stable

Hi Dan and Junxiao,

On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <dan@danm.net> wrote:
>
> > 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> > some issue which leading to the io request from md layer stayed in a
> > partial complete statue. I can't see how this can be related with the
> > commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> > raid5d"")
>
> There is no question that the above mentioned commit makes this
> problem appear. While it may be that ultimately the root cause lies
> outside the md/raid5 code (I'm not able to make such an assessment), I
> can tell you that change is what turned it into a runtime
> regression. Prior to that change, I cannot reproduce the problem. One
> of my RAID-5 arrays has been running on every kernel version since
> 4.8, without issue. Then kernel 6.7.1 the problem appeared within
> hours of running the new code and affected not just one but two
> different machines with RAID-5 arrays. With that change reverted, the
> problem is not reproducible. Then when I recently upgraded to 6.8-rc5
> I immediately hit the problem again (because it hadn't been reverted
> in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
> machines without issue after reverting that commit on top of it.
>
> It would seem a very unlikely coincidence that a careful bisection of
> all changes between 6.7.0 and 6.7.1 pointed at that commit as being
> the culprit, and that the change is to raid5.c, and that the hang
> happens in the raid5 kernel task, if there was no connection. :)
>
> > Are you able to reproduce using some regular scsi disk, would like to
> > rule out whether this is related with virtio-scsi?
>
> The first time I hit this problem was on two bare-metal machines, one
> server and one desktop with different hardware. I then set up this
> virtual machine just to reproduce the problem in a different
> environment (and to see if I could reproduce it with a distribution
> kernel since the other machines are running custom kernel
> configurations). So I'm able to reproduce it on:
>
> - A virtual machine
> - Bare metal machines
> - Custom kernel configuration with straight from stable and mainline code
> - Fedora 39 distribution kernel
>
> > And I see the kernel version is 6.8.0-rc5 from vmcore, is this the
> > official mainline v6.8-rc5 without any other patches?
>
> Yes this particular vmcore was from the Fedora 39 VM I used to
> reproduce the problem, but with the straight 6.8.0-rc5 mainline code
> (so that you wouldn't have to worry about any possible interference
> from distribution patches).

Thanks to both of you for looking into the issue and running various
tests.

I also tried again to reproduce the issue, but haven't had any luck. While
I continue trying to reproduce it, I will also send the revert for the 6.8
kernel. We have been fighting multiple issues recently, so we didn't get
much time to spend on this one. Fortunately, we now have proper fixes for
most of the other issues, so we should have more time to look into this.

Thanks again,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-01 20:26       ` junxiao.bi
  2024-03-01 23:12         ` Dan Moulding
@ 2024-03-02 16:55         ` Dan Moulding
  2024-03-07  3:34         ` Yu Kuai
  2024-03-08 23:49         ` junxiao.bi
  3 siblings, 0 replies; 53+ messages in thread
From: Dan Moulding @ 2024-03-02 16:55 UTC (permalink / raw)
  To: junxiao.bi
  Cc: dan, gregkh, linux-kernel, linux-raid, regressions, song, stable, logang

> I have not root cause this yet, but would like share some findings from 
> the vmcore Dan shared. From what i can see, this doesn't look like a md 
> issue, but something wrong with block layer or below.

Below is one other thing I found that might be of interest. This is
from the original email thread [1] that was linked to in the original
issue from 2022, which the change in question reverts:

On 2022-09-02 17:46, Logan Gunthorpe wrote:
> I've made some progress on this nasty bug. I've got far enough to know it's not
> related to the blk-wbt or the block layer.
> 
> Turns out a bunch of bios are stuck queued in a blk_plug in the md_raid5 
> thread while that thread appears to be stuck in an infinite loop (so it never
> schedules or does anything to flush the plug). 
> 
> I'm still debugging to try and find out the root cause of that infinite loop, 
> but I just wanted to send an update that the previous place I was stuck at
> was not correct.
> 
> Logan

This certainly sounds like it has some similarities to what we are
seeing when that change is reverted. The md0_raid5 thread appears to be
in an infinite loop, consuming 100% CPU, but not actually doing any
work.

-- Dan

[1] https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@deltatee.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-02  0:05           ` Song Liu
@ 2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
  2024-03-06 17:13               ` Song Liu
  0 siblings, 1 reply; 53+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-03-06  8:38 UTC (permalink / raw)
  To: Song Liu, Dan Moulding
  Cc: junxiao.bi, gregkh, linux-kernel, linux-raid, regressions, stable

On 02.03.24 01:05, Song Liu wrote:
> On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <dan@danm.net> wrote:
>>
>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
>>> some issue which leading to the io request from md layer stayed in a
>>> partial complete statue. I can't see how this can be related with the
>>> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
>>> raid5d"")
>>
>> There is no question that the above mentioned commit makes this
>> problem appear. While it may be that ultimately the root cause lies
>> outside the md/raid5 code (I'm not able to make such an assessment), I
>> can tell you that change is what turned it into a runtime
>> regression. Prior to that change, I cannot reproduce the problem. One
>> of my RAID-5 arrays has been running on every kernel version since
>> 4.8, without issue. Then kernel 6.7.1 the problem appeared within
>> hours of running the new code and affected not just one but two
>> different machines with RAID-5 arrays. With that change reverted, the
>> problem is not reproducible. Then when I recently upgraded to 6.8-rc5
>> I immediately hit the problem again (because it hadn't been reverted
>> in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
>> machines without issue after reverting that commit on top of it.
> [...]
> I also tried again to reproduce the issue, but haven't got luck. While
> I will continue try to repro the issue, I will also send the revert to 6.8
> kernel.

Is that revert on its way in the meantime? I'm asking because Linus might
release 6.8 on Sunday.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
@ 2024-03-06 17:13               ` Song Liu
  0 siblings, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-03-06 17:13 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Dan Moulding, junxiao.bi, gregkh, linux-kernel, linux-raid, stable

Hi Thorsten,

On Wed, Mar 6, 2024 at 12:38 AM Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> On 02.03.24 01:05, Song Liu wrote:
> > On Fri, Mar 1, 2024 at 3:12 PM Dan Moulding <dan@danm.net> wrote:
> >>
> >>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have
> >>> some issue which leading to the io request from md layer stayed in a
> >>> partial complete statue. I can't see how this can be related with the
> >>> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
> >>> raid5d"")
> >>
> >> There is no question that the above mentioned commit makes this
> >> problem appear. While it may be that ultimately the root cause lies
> >> outside the md/raid5 code (I'm not able to make such an assessment), I
> >> can tell you that change is what turned it into a runtime
> >> regression. Prior to that change, I cannot reproduce the problem. One
> >> of my RAID-5 arrays has been running on every kernel version since
> >> 4.8, without issue. Then kernel 6.7.1 the problem appeared within
> >> hours of running the new code and affected not just one but two
> >> different machines with RAID-5 arrays. With that change reverted, the
> >> problem is not reproducible. Then when I recently upgraded to 6.8-rc5
> >> I immediately hit the problem again (because it hadn't been reverted
> >> in the mainline yet). I'm now running 6.8.0-rc5 on one of my affected
> >> machines without issue after reverting that commit on top of it.
> > [...]
> > I also tried again to reproduce the issue, but haven't got luck. While
> > I will continue try to repro the issue, I will also send the revert to 6.8
> > kernel.
>
> Is that revert on the way meanwhile? I'm asking because Linus might
> release 6.8 on Sunday.

The patch is on its way to the 6.9 kernel via a PR sent yesterday [1]. It
will land in the stable 6.8 kernel via stable backports.

Since this is not a new regression in the 6.8 kernel and Dan is the only
one experiencing this, we would rather not rush a last-minute change into
the 6.8 release.

Thanks,
Song

[1] https://lore.kernel.org/linux-raid/1C22EE73-62D9-43B0-B1A2-2D3B95F774AC@fb.com/

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-01 20:26       ` junxiao.bi
  2024-03-01 23:12         ` Dan Moulding
  2024-03-02 16:55         ` Dan Moulding
@ 2024-03-07  3:34         ` Yu Kuai
  2024-03-08 23:49         ` junxiao.bi
  3 siblings, 0 replies; 53+ messages in thread
From: Yu Kuai @ 2024-03-07  3:34 UTC (permalink / raw)
  To: junxiao.bi, Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding, yukuai (C)

Hi,

On 2024/03/02 4:26, junxiao.bi@oracle.com wrote:
> Hi Dan & Song,
> 
> I have not root cause this yet, but would like share some findings from 
> the vmcore Dan shared. From what i can see, this doesn't look like a md 
> issue, but something wrong with block layer or below.

I would like to take a look at the vmcore as well. How is Dan sharing the
vmcore? I don't see it in the thread.

Thanks,
Kuai

> 
> 1. There were multiple process hung by IO over 15mins.
> 
> crash> ps -m | grep UN
> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1 
> COMMAND: "jbd2/dm-3-8"
> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2 
> COMMAND: "dd"
> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3 
> COMMAND: "md0_reclaim"
> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1 
> COMMAND: "kworker/1:2"
> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7 
> COMMAND: "kworker/u21:1"
> 
> 2. Let pick md0_reclaim to take a look, it is waiting done super_block 
> update. We can see there were two pending superblock write and other 
> pending io for the underling physical disk, which caused these process 
> hung.
> 
> crash> bt 876
> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>   #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>   #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>   #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>   #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>   #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>   #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>   #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>   #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>   #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
> 
> crash> mddev.pending_writes,disks 0xffff888108335800
>    pending_writes = {
>      counter = 2  <<<<<<<<<< 2 active super block write
>    },
>    disks = {
>      next = 0xffff88810ce85a00,
>      prev = 0xffff88810ce84c00
>    },
> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
> 0xffff88810ce85a00
> ffff88810ce85a00
>    kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>    nr_pending = {
>      counter = 0
>    },
> ffff8881083ace00
>    kobj.name = 0xffff888100a93280 "dev-sde",
>    nr_pending = {
>      counter = 10 <<<<
>    },
> ffff8881010ad200
>    kobj.name = 0xffff8881012721c8 "dev-sdc",
>    nr_pending = {
>      counter = 8 <<<<<
>    },
> ffff88810ce84c00
>    kobj.name = 0xffff888100325f08 "dev-sdd",
>    nr_pending = {
>      counter = 2 <<<<<
>    },
> 
> 3. From block layer, i can find the inflight IO for md superblock write 
> which has been pending 955s which matches with the hung time of 
> "md0_reclaim"
> 
> crash> 
> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
> ffff888103b4c300
>    q = 0xffff888103a00d80,
>    mq_hctx = 0xffff888103c5d200,
>    cmd_flags = 38913,
>    rq_flags = 139408,
>    start_time_ns = 1504179024146,
>    bio = 0x0,
>    biotail = 0xffff888120758e40,
>    state = MQ_RQ_COMPLETE,
>    __data_len = 0,
>    flush = {
>      seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>      saved_end_io = 0x0
>    },
>    end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
> 
> crash> p tk_core.timekeeper.tkr_mono.base
> $1 = 2459916243002
> crash> eval 2459916243002-1504179024146
> hexadecimal: de86609f28
>      decimal: 955737218856  <<<<<<< IO pending time is 955s
>        octal: 15720630117450
>       binary: 
> 0000000000000000000000001101111010000110011000001001111100101000
> 
> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>    bi_iter = {
>      bi_sector = 8, <<<< super block offset
>      bi_size = 0,
>      bi_idx = 0,
>      bi_bvec_done = 0
>    },
>    bi_end_io = 0xffffffff817dca50 <super_written>,
> crash> dev -d | grep ffff888103a00d80
>      8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0
> 
> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is 
> still pending. That's because each md superblock write was marked with 
> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush, 
> data, and post_flush. Once each step complete, it will be marked in 
> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH | 
> REQ_FSEQ_DATA, so the last step "post_flush" has not be done.  Another 
> wired thing is that blk_flush_queue.flush_data_in_flight is still 1 even 
> "data" step already done.
> 
> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>    fq = 0xffff88810332e240,
> crash> blk_flush_queue 0xffff88810332e240
> struct blk_flush_queue {
>    mq_flush_lock = {
>      {
>        rlock = {
>          raw_lock = {
>            {
>              val = {
>                counter = 0
>              },
>              {
>                locked = 0 '\000',
>                pending = 0 '\000'
>              },
>              {
>                locked_pending = 0,
>                tail = 0
>              }
>            }
>          }
>        }
>      }
>    },
>    flush_pending_idx = 1,
>    flush_running_idx = 1,
>    rq_status = 0 '\000',
>    flush_pending_since = 4296171408,
>    flush_queue = {{
>        next = 0xffff88810332e250,
>        prev = 0xffff88810332e250
>      }, {
>        next = 0xffff888103b4c348, <<<< the request is in this list
>        prev = 0xffff888103b4c348
>      }},
>    flush_data_in_flight = 1,  >>>>>> still 1
>    flush_rq = 0xffff888103c2e000
> }
> 
> crash> list 0xffff888103b4c348
> ffff888103b4c348
> ffff88810332e260
> 
> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>    tag = -1,
>    state = MQ_RQ_IDLE,
>    ref = {
>      counter = 0
>    },
> 
> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have 
> some issue which leading to the io request from md layer stayed in a 
> partial complete statue. I can't see how this can be related with the 
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
> raid5d"")
> 
> 
> Dan,
> 
> Are you able to reproduce using some regular scsi disk, would like to 
> rule out whether this is related with virtio-scsi?
> 
> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
> official mainline v6.8-rc5 without any other patches?
> 
> 
> Thanks,
> 
> Junxiao.
> 
> On 2/23/24 6:13 PM, Song Liu wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>> Just a friendly reminder that this regression still exists on the
>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>> few regressions recently, but it seems like resolving this one is
>>> stalled. Or were you able to reproduce the issue or make some progress
>>> and I just missed it?
>> Sorry for the delay with this issue. I have been occupied with some
>> other stuff this week.
>>
>> I haven't got luck to reproduce this issue. I will spend more time 
>> looking
>> into it next week.
>>
>>> And if not, what's the way forward here wrt to the release of 6.8?
>>> Revert the culprit and try again later? Or is that not an option for one
>>> reason or another?
>> If we don't make progress with it in the next week, we will do the 
>> revert,
>> same as we did with stable kernels.
>>
>>> Or do we assume that this is not a real issue? That it's caused by some
>>> oddity (bit-flip in the metadata or something like that?) only to be
>>> found in Dan's setup?
>> I don't think this is because of oddities. Hopefully we can get more
>> information about this soon.
>>
>> Thanks,
>> Song
>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> -- 
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-01 20:26       ` junxiao.bi
                           ` (2 preceding siblings ...)
  2024-03-07  3:34         ` Yu Kuai
@ 2024-03-08 23:49         ` junxiao.bi
  2024-03-10  5:13           ` Dan Moulding
  2024-03-11  1:50           ` Yu Kuai
  3 siblings, 2 replies; 53+ messages in thread
From: junxiao.bi @ 2024-03-08 23:49 UTC (permalink / raw)
  To: Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding

Here is the root cause for this issue:

Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d") introduced a regression, and it was reverted through commit
bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in
raid5d""). To address the original issue that commit 5e2cf333b7bd was
fixing, commit d6e035aad6c0 ("md: bypass block throttle for superblock
update") was created. It avoids the md superblock write getting throttled
by the block layer, which is good, but the md superblock write can also
get stuck in the block layer due to block flush handling, and that is
what was happening in this regression report.

Process "md0_reclaim" got stuck while waiting IO for md superblock write 
done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, these 3 
steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed before done, the 
hung of this process is because the last step "POSTFLUSH" never done. 
And that was because of  process "md0_raid5" submitted another IO with 
REQ_FUA flag marked just before that step started. To handle that IO, 
blk_insert_flush() will be invoked and hit "REQ_FSEQ_DATA | 
REQ_FSEQ_POSTFLUSH" case where "fq->flush_data_in_flight" will be 
increased. When the IO for md superblock write was to issue "POSTFLUSH" 
step through blk_kick_flush(), it found that "fq->flush_data_in_flight" 
was not zero, so it will skip that step, that is expected, because flush 
will be triggered when "fq->flush_data_in_flight" dropped to zero.
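
As a rough illustration only (not the actual block layer code), here is a
small user-space C sketch of how the flush policy comes out for the two
IOs involved; it assumes a write-back queue without native FUA and only
loosely mirrors the decision points of blk_flush_policy():

#include <stdio.h>

#define REQ_FSEQ_PREFLUSH  (1 << 0)
#define REQ_FSEQ_DATA      (1 << 1)
#define REQ_FSEQ_POSTFLUSH (1 << 2)

/* sketch: derive which flush steps a request needs on such a queue */
static unsigned int flush_policy(int has_data, int preflush, int fua)
{
        unsigned int policy = 0;

        if (has_data)
                policy |= REQ_FSEQ_DATA;
        if (preflush)
                policy |= REQ_FSEQ_PREFLUSH;
        if (fua)        /* no native FUA: emulate it with a post-flush */
                policy |= REQ_FSEQ_POSTFLUSH;
        return policy;
}

int main(void)
{
        /* md superblock write: REQ_PREFLUSH | REQ_FUA with data */
        printf("sb write policy:   0x%x\n", flush_policy(1, 1, 1));
        /* raid5 data write: REQ_FUA only */
        printf("data write policy: 0x%x\n", flush_policy(1, 0, 1));
        return 0;
}

The superblock write needs all three steps (0x7), while the plain REQ_FUA
data write from "md0_raid5" lands in the "REQ_FSEQ_DATA |
REQ_FSEQ_POSTFLUSH" (0x6) case mentioned above.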

Unfortunately, here that inflight data IO from "md0_raid5" will never
complete, because it was added to the blk_plug list of that process, but
"md0_raid5" ran into an infinite loop due to "MD_SB_CHANGE_PENDING" and so
never got a chance to flush the blk plug until "MD_SB_CHANGE_PENDING" was
cleared. Process "md0_reclaim" was supposed to clear that flag, but it was
stuck behind "md0_raid5", so this is a deadlock.

It looks like the approach in the RFC patch that tried to resolve the
regression of commit 5e2cf333b7bd can help with this issue. Once
"md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should
release all its staged IO requests to avoid blocking others. Also, a
cond_resched() will avoid it running into a lockup.

https://www.spinics.net/lists/raid/msg75338.html

Dan, can you try the following patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index de771093b526..474462abfbdc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool from_schedule)
         if (unlikely(!rq_list_empty(plug->cached_rq)))
                 blk_mq_free_plug_rqs(plug);
  }
+EXPORT_SYMBOL(__blk_flush_plug);

  /**
   * blk_finish_plug - mark the end of a batch of submitted I/O
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..26e09cdf46a3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
                         spin_unlock_irq(&conf->device_lock);
                         md_check_recovery(mddev);
                         spin_lock_irq(&conf->device_lock);
+               } else {
+                       spin_unlock_irq(&conf->device_lock);
+                       blk_flush_plug(&plug, false);
+                       cond_resched();
+                       spin_lock_irq(&conf->device_lock);
                 }
         }
         pr_debug("%d stripes handled\n", handled);

Thanks,

Junxiao.

On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote:
> Hi Dan & Song,
>
> I have not root cause this yet, but would like share some findings 
> from the vmcore Dan shared. From what i can see, this doesn't look 
> like a md issue, but something wrong with block layer or below.
>
> 1. There were multiple process hung by IO over 15mins.
>
> crash> ps -m | grep UN
> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1    
> COMMAND: "jbd2/dm-3-8"
> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2    
> COMMAND: "dd"
> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3    
> COMMAND: "md0_reclaim"
> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1    
> COMMAND: "kworker/1:2"
> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7    
> COMMAND: "kworker/u21:1"
>
> 2. Let pick md0_reclaim to take a look, it is waiting done super_block 
> update. We can see there were two pending superblock write and other 
> pending io for the underling physical disk, which caused these process 
> hung.
>
> crash> bt 876
> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>
> crash> mddev.pending_writes,disks 0xffff888108335800
>   pending_writes = {
>     counter = 2  <<<<<<<<<< 2 active super block write
>   },
>   disks = {
>     next = 0xffff88810ce85a00,
>     prev = 0xffff88810ce84c00
>   },
> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
> 0xffff88810ce85a00
> ffff88810ce85a00
>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>   nr_pending = {
>     counter = 0
>   },
> ffff8881083ace00
>   kobj.name = 0xffff888100a93280 "dev-sde",
>   nr_pending = {
>     counter = 10 <<<<
>   },
> ffff8881010ad200
>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>   nr_pending = {
>     counter = 8 <<<<<
>   },
> ffff88810ce84c00
>   kobj.name = 0xffff888100325f08 "dev-sdd",
>   nr_pending = {
>     counter = 2 <<<<<
>   },
>
> 3. From block layer, i can find the inflight IO for md superblock 
> write which has been pending 955s which matches with the hung time of 
> "md0_reclaim"
>
> crash> 
> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
> ffff888103b4c300
>   q = 0xffff888103a00d80,
>   mq_hctx = 0xffff888103c5d200,
>   cmd_flags = 38913,
>   rq_flags = 139408,
>   start_time_ns = 1504179024146,
>   bio = 0x0,
>   biotail = 0xffff888120758e40,
>   state = MQ_RQ_COMPLETE,
>   __data_len = 0,
>   flush = {
>     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>     saved_end_io = 0x0
>   },
>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>
> crash> p tk_core.timekeeper.tkr_mono.base
> $1 = 2459916243002
> crash> eval 2459916243002-1504179024146
> hexadecimal: de86609f28
>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>       octal: 15720630117450
>      binary: 
> 0000000000000000000000001101111010000110011000001001111100101000
>
> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>   bi_iter = {
>     bi_sector = 8, <<<< super block offset
>     bi_size = 0,
>     bi_idx = 0,
>     bi_bvec_done = 0
>   },
>   bi_end_io = 0xffffffff817dca50 <super_written>,
> crash> dev -d | grep ffff888103a00d80
>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>
> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is 
> still pending. That's because each md superblock write was marked with 
> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush, 
> data, and post_flush. Once each step complete, it will be marked in 
> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH 
> |  REQ_FSEQ_DATA, so the last step "post_flush" has not be done.  
> Another wired thing is that blk_flush_queue.flush_data_in_flight is 
> still 1 even "data" step already done.
>
> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>   fq = 0xffff88810332e240,
> crash> blk_flush_queue 0xffff88810332e240
> struct blk_flush_queue {
>   mq_flush_lock = {
>     {
>       rlock = {
>         raw_lock = {
>           {
>             val = {
>               counter = 0
>             },
>             {
>               locked = 0 '\000',
>               pending = 0 '\000'
>             },
>             {
>               locked_pending = 0,
>               tail = 0
>             }
>           }
>         }
>       }
>     }
>   },
>   flush_pending_idx = 1,
>   flush_running_idx = 1,
>   rq_status = 0 '\000',
>   flush_pending_since = 4296171408,
>   flush_queue = {{
>       next = 0xffff88810332e250,
>       prev = 0xffff88810332e250
>     }, {
>       next = 0xffff888103b4c348, <<<< the request is in this list
>       prev = 0xffff888103b4c348
>     }},
>   flush_data_in_flight = 1,  >>>>>> still 1
>   flush_rq = 0xffff888103c2e000
> }
>
> crash> list 0xffff888103b4c348
> ffff888103b4c348
> ffff88810332e260
>
> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>   tag = -1,
>   state = MQ_RQ_IDLE,
>   ref = {
>     counter = 0
>   },
>
> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have 
> some issue which leading to the io request from md layer stayed in a 
> partial complete statue. I can't see how this can be related with the 
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING 
> in raid5d"")
>
>
> Dan,
>
> Are you able to reproduce using some regular scsi disk, would like to 
> rule out whether this is related with virtio-scsi?
>
> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
> official mainline v6.8-rc5 without any other patches?
>
>
> Thanks,
>
> Junxiao.
>
> On 2/23/24 6:13 PM, Song Liu wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>> Just a friendly reminder that this regression still exists on the
>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>> few regressions recently, but it seems like resolving this one is
>>> stalled. Or were you able to reproduce the issue or make some progress
>>> and I just missed it?
>> Sorry for the delay with this issue. I have been occupied with some
>> other stuff this week.
>>
>> I haven't got luck to reproduce this issue. I will spend more time 
>> looking
>> into it next week.
>>
>>> And if not, what's the way forward here wrt to the release of 6.8?
>>> Revert the culprit and try again later? Or is that not an option for 
>>> one
>>> reason or another?
>> If we don't make progress with it in the next week, we will do the 
>> revert,
>> same as we did with stable kernels.
>>
>>> Or do we assume that this is not a real issue? That it's caused by some
>>> oddity (bit-flip in the metadata or something like that?) only to be
>>> found in Dan's setup?
>> I don't think this is because of oddities. Hopefully we can get more
>> information about this soon.
>>
>> Thanks,
>> Song
>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' 
>>> hat)
>>> -- 
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>>

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-08 23:49         ` junxiao.bi
@ 2024-03-10  5:13           ` Dan Moulding
  2024-03-11  1:50           ` Yu Kuai
  1 sibling, 0 replies; 53+ messages in thread
From: Dan Moulding @ 2024-03-10  5:13 UTC (permalink / raw)
  To: junxiao.bi
  Cc: dan, gregkh, linux-kernel, linux-raid, regressions, song, stable

> Dan, can you try the following patch?
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index de771093b526..474462abfbdc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool 
> from_schedule)
>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>                  blk_mq_free_plug_rqs(plug);
>   }
> +EXPORT_SYMBOL(__blk_flush_plug);
> 
>   /**
>    * blk_finish_plug - mark the end of a batch of submitted I/O
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8497880135ee..26e09cdf46a3 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
> spin_unlock_irq(&conf->device_lock);
>                          md_check_recovery(mddev);
>                          spin_lock_irq(&conf->device_lock);
> +               } else {
> + spin_unlock_irq(&conf->device_lock);
> +                       blk_flush_plug(&plug, false);
> +                       cond_resched();
> +                       spin_lock_irq(&conf->device_lock);
>                  }
>          }
>          pr_debug("%d stripes handled\n", handled);

This patch seems to work! I can no longer reproduce the problem after
applying this.

Thanks,

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-08 23:49         ` junxiao.bi
  2024-03-10  5:13           ` Dan Moulding
@ 2024-03-11  1:50           ` Yu Kuai
  2024-03-12 22:56             ` junxiao.bi
  2024-03-14 16:12             ` Dan Moulding
  1 sibling, 2 replies; 53+ messages in thread
From: Yu Kuai @ 2024-03-11  1:50 UTC (permalink / raw)
  To: junxiao.bi, Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding, yukuai (C)

Hi,

On 2024/03/09 7:49, junxiao.bi@oracle.com wrote:
> Here is the root cause for this issue:
> 
> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
> raid5d") introduced a regression, it got reverted through commit 
> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing, 
> commit d6e035aad6c0 ("md: bypass block throttle for superblock update") 
> was created, it avoids md superblock write getting throttled by block 
> layer which is good, but md superblock write could be stuck in block 
> layer due to block flush as well, and that is what was happening in this 
> regression report.
> 
> Process "md0_reclaim" got stuck while waiting IO for md superblock write 
> done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, these 3 
> steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed before done, the 
> hung of this process is because the last step "POSTFLUSH" never done. 
> And that was because of  process "md0_raid5" submitted another IO with 
> REQ_FUA flag marked just before that step started. To handle that IO, 
> blk_insert_flush() will be invoked and hit "REQ_FSEQ_DATA | 
> REQ_FSEQ_POSTFLUSH" case where "fq->flush_data_in_flight" will be 
> increased. When the IO for md superblock write was to issue "POSTFLUSH" 
> step through blk_kick_flush(), it found that "fq->flush_data_in_flight" 
> was not zero, so it will skip that step, that is expected, because flush 
> will be triggered when "fq->flush_data_in_flight" dropped to zero.
> 
> Unfortunately here that inflight data IO from "md0_raid5" will never 
> done, because it was added into the blk_plug list of that process, but 
> "md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING" which 
> made it never had a chance to finish the blk plug until 
> "MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was supposed 
> to clear that flag but it was stuck by "md0_raid5", so this is a deadlock.
> 
> Looks like the approach in the RFC patch trying to resolve the 
> regression of commit 5e2cf333b7bd can help this issue. Once "md0_raid5" 
> starts looping due to "MD_SB_CHANGE_PENDING", it should release all its 
> staging IO requests to avoid blocking others. Also a cond_reschedule() 
> will avoid it run into lockup.

The analysis sounds good; however, it seems to me that the behaviour of
raid5d() spinning on the CPU to wait for 'MD_SB_CHANGE_PENDING' to be
cleared is not reasonable, because md_check_recovery() must hold
'reconfig_mutex' to clear the flag.

Looking at raid1/raid10, there are two different behaviours that seem to
avoid this problem as well:

1) blk_start_plug() is delayed until all failed IO is handled. This looks
reasonable because, in order to get better performance, IO should be
handled by the submitting thread as much as possible; meanwhile, the
deadlock can be triggered here.
2) If 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
the handling of failed IO; when mddev_unlock() is called, the daemon
thread will be woken up again to handle the failed IO.

How about the following patch?

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..0b2e6060f2c9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)

         md_check_recovery(mddev);

-       blk_start_plug(&plug);
         handled = 0;
         spin_lock_irq(&conf->device_lock);
         while (1) {
@@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
                 int batch_size, released;
                 unsigned int offset;

+               /*
+                * md_check_recovery() can't clear sb_flags, usually because
+                * 'reconfig_mutex' can't be grabbed; wait for mddev_unlock()
+                * to wake up raid5d().
+                */
+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                released = release_stripe_list(conf, conf->temp_inactive_list);
                 if (released)
                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
                         spin_lock_irq(&conf->device_lock);
                 }
         }
+skip:
         pr_debug("%d stripes handled\n", handled);
-
         spin_unlock_irq(&conf->device_lock);
         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
             mutex_trylock(&conf->cache_size_mutex)) {
@@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
                 mutex_unlock(&conf->cache_size_mutex);
         }

+       blk_start_plug(&plug);
         flush_deferred_bios(conf);

         r5l_flush_stripe_to_raid(conf->log);

> 
> https://www.spinics.net/lists/raid/msg75338.html
> 
> Dan, can you try the following patch?
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index de771093b526..474462abfbdc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, bool 
> from_schedule)
>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>                  blk_mq_free_plug_rqs(plug);
>   }
> +EXPORT_SYMBOL(__blk_flush_plug);
> 
>   /**
>    * blk_finish_plug - mark the end of a batch of submitted I/O
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8497880135ee..26e09cdf46a3 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
> spin_unlock_irq(&conf->device_lock);
>                          md_check_recovery(mddev);
>                          spin_lock_irq(&conf->device_lock);
> +               } else {
> + spin_unlock_irq(&conf->device_lock);
> +                       blk_flush_plug(&plug, false);
> +                       cond_resched();
> +                       spin_lock_irq(&conf->device_lock);
>                  }
>          }
>          pr_debug("%d stripes handled\n", handled);
> 
> Thanks,
> 
> Junxiao.
> 
> On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote:
>> Hi Dan & Song,
>>
>> I have not root cause this yet, but would like share some findings 
>> from the vmcore Dan shared. From what i can see, this doesn't look 
>> like a md issue, but something wrong with block layer or below.
>>
>> 1. There were multiple process hung by IO over 15mins.
>>
>> crash> ps -m | grep UN
>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1 
>> COMMAND: "jbd2/dm-3-8"
>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2 
>> COMMAND: "dd"
>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3 
>> COMMAND: "md0_reclaim"
>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1 
>> COMMAND: "kworker/1:2"
>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7 
>> COMMAND: "kworker/u21:1"
>>
>> 2. Let pick md0_reclaim to take a look, it is waiting done super_block 
>> update. We can see there were two pending superblock write and other 
>> pending io for the underling physical disk, which caused these process 
>> hung.
>>
>> crash> bt 876
>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>
>> crash> mddev.pending_writes,disks 0xffff888108335800
>>   pending_writes = {
>>     counter = 2  <<<<<<<<<< 2 active super block write
>>   },
>>   disks = {
>>     next = 0xffff88810ce85a00,
>>     prev = 0xffff88810ce84c00
>>   },
>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
>> 0xffff88810ce85a00
>> ffff88810ce85a00
>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>   nr_pending = {
>>     counter = 0
>>   },
>> ffff8881083ace00
>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>   nr_pending = {
>>     counter = 10 <<<<
>>   },
>> ffff8881010ad200
>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>   nr_pending = {
>>     counter = 8 <<<<<
>>   },
>> ffff88810ce84c00
>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>   nr_pending = {
>>     counter = 2 <<<<<
>>   },
>>
>> 3. From block layer, i can find the inflight IO for md superblock 
>> write which has been pending 955s which matches with the hung time of 
>> "md0_reclaim"
>>
>> crash> 
>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
>> ffff888103b4c300
>>   q = 0xffff888103a00d80,
>>   mq_hctx = 0xffff888103c5d200,
>>   cmd_flags = 38913,
>>   rq_flags = 139408,
>>   start_time_ns = 1504179024146,
>>   bio = 0x0,
>>   biotail = 0xffff888120758e40,
>>   state = MQ_RQ_COMPLETE,
>>   __data_len = 0,
>>   flush = {
>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>>     saved_end_io = 0x0
>>   },
>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>
>> crash> p tk_core.timekeeper.tkr_mono.base
>> $1 = 2459916243002
>> crash> eval 2459916243002-1504179024146
>> hexadecimal: de86609f28
>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>       octal: 15720630117450
>>      binary: 
>> 0000000000000000000000001101111010000110011000001001111100101000
>>
>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>   bi_iter = {
>>     bi_sector = 8, <<<< super block offset
>>     bi_size = 0,
>>     bi_idx = 0,
>>     bi_bvec_done = 0
>>   },
>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>> crash> dev -d | grep ffff888103a00d80
>>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>>
>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is 
>> still pending. That's because each md superblock write was marked with 
>> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush, 
>> data, and post_flush. Once each step complete, it will be marked in 
>> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH 
>> |  REQ_FSEQ_DATA, so the last step "post_flush" has not be done. 
>> Another wired thing is that blk_flush_queue.flush_data_in_flight is 
>> still 1 even "data" step already done.
>>
>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>   fq = 0xffff88810332e240,
>> crash> blk_flush_queue 0xffff88810332e240
>> struct blk_flush_queue {
>>   mq_flush_lock = {
>>     {
>>       rlock = {
>>         raw_lock = {
>>           {
>>             val = {
>>               counter = 0
>>             },
>>             {
>>               locked = 0 '\000',
>>               pending = 0 '\000'
>>             },
>>             {
>>               locked_pending = 0,
>>               tail = 0
>>             }
>>           }
>>         }
>>       }
>>     }
>>   },
>>   flush_pending_idx = 1,
>>   flush_running_idx = 1,
>>   rq_status = 0 '\000',
>>   flush_pending_since = 4296171408,
>>   flush_queue = {{
>>       next = 0xffff88810332e250,
>>       prev = 0xffff88810332e250
>>     }, {
>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>       prev = 0xffff888103b4c348
>>     }},
>>   flush_data_in_flight = 1,  >>>>>> still 1
>>   flush_rq = 0xffff888103c2e000
>> }
>>
>> crash> list 0xffff888103b4c348
>> ffff888103b4c348
>> ffff88810332e260
>>
>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>>   tag = -1,
>>   state = MQ_RQ_IDLE,
>>   ref = {
>>     counter = 0
>>   },
>>
>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have 
>> some issue which leading to the io request from md layer stayed in a 
>> partial complete statue. I can't see how this can be related with the 
>> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING 
>> in raid5d"")
>>
>>
>> Dan,
>>
>> Are you able to reproduce using some regular scsi disk, would like to 
>> rule out whether this is related with virtio-scsi?
>>
>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
>> official mainline v6.8-rc5 without any other patches?
>>
>>
>> Thanks,
>>
>> Junxiao.
>>
>> On 2/23/24 6:13 PM, Song Liu wrote:
>>> Hi,
>>>
>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>> Just a friendly reminder that this regression still exists on the
>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>>> few regressions recently, but it seems like resolving this one is
>>>> stalled. Or were you able to reproduce the issue or make some progress
>>>> and I just missed it?
>>> Sorry for the delay with this issue. I have been occupied with some
>>> other stuff this week.
>>>
>>> I haven't got luck to reproduce this issue. I will spend more time 
>>> looking
>>> into it next week.
>>>
>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>> Revert the culprit and try again later? Or is that not an option for 
>>>> one
>>>> reason or another?
>>> If we don't make progress with it in the next week, we will do the 
>>> revert,
>>> same as we did with stable kernels.
>>>
>>>> Or do we assume that this is not a real issue? That it's caused by some
>>>> oddity (bit-flip in the metadata or something like that?) only to be
>>>> found in Dan's setup?
>>> I don't think this is because of oddities. Hopefully we can get more
>>> information about this soon.
>>>
>>> Thanks,
>>> Song
>>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' 
>>>> hat)
>>>> -- 
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> If I did something stupid, please tell me, as explained on that page.
>>>>
>>>> #regzbot poke
>>>>
> 
> .
> 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-11  1:50           ` Yu Kuai
@ 2024-03-12 22:56             ` junxiao.bi
  2024-03-13  1:20               ` Yu Kuai
  2024-03-14 16:12             ` Dan Moulding
  1 sibling, 1 reply; 53+ messages in thread
From: junxiao.bi @ 2024-03-12 22:56 UTC (permalink / raw)
  To: Yu Kuai, Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding, yukuai (C)

On 3/10/24 6:50 PM, Yu Kuai wrote:

> Hi,
>
>> On 2024/03/09 7:49, junxiao.bi@oracle.com wrote:
>> Here is the root cause for this issue:
>>
>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>> raid5d") introduced a regression, it got reverted through commit 
>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing, 
>> commit d6e035aad6c0 ("md: bypass block throttle for superblock 
>> update") was created, it avoids md superblock write getting throttled 
>> by block layer which is good, but md superblock write could be stuck 
>> in block layer due to block flush as well, and that is what was 
>> happening in this regression report.
>>
>> Process "md0_reclaim" got stuck while waiting IO for md superblock 
>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, 
>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed 
>> before done, the hung of this process is because the last step 
>> "POSTFLUSH" never done. And that was because of  process "md0_raid5" 
>> submitted another IO with REQ_FUA flag marked just before that step 
>> started. To handle that IO, blk_insert_flush() will be invoked and 
>> hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case where 
>> "fq->flush_data_in_flight" will be increased. When the IO for md 
>> superblock write was to issue "POSTFLUSH" step through 
>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not 
>> zero, so it will skip that step, that is expected, because flush will 
>> be triggered when "fq->flush_data_in_flight" dropped to zero.
>>
>> Unfortunately here that inflight data IO from "md0_raid5" will never 
>> done, because it was added into the blk_plug list of that process, 
>> but "md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING" 
>> which made it never had a chance to finish the blk plug until 
>> "MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was 
>> supposed to clear that flag but it was stuck by "md0_raid5", so this 
>> is a deadlock.
>>
>> Looks like the approach in the RFC patch trying to resolve the 
>> regression of commit 5e2cf333b7bd can help this issue. Once 
>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should 
>> release all its staging IO requests to avoid blocking others. Also a 
>> cond_reschedule() will avoid it run into lockup.
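
For readers less familiar with the flush machinery referenced above, the
"3 steps" correspond to the flush sequence flags in block/blk-flush.c,
roughly as follows (a paraphrased sketch; see the source for the
authoritative definitions):

	enum {
		REQ_FSEQ_PREFLUSH	= (1 << 0),	/* pre-flush done */
		REQ_FSEQ_DATA		= (1 << 1),	/* data write done */
		REQ_FSEQ_POSTFLUSH	= (1 << 2),	/* post-flush done */
		REQ_FSEQ_DONE		= (1 << 3),
	};

	/*
	 * So the "seq = 3" seen in the vmcore means PREFLUSH and DATA have
	 * completed but POSTFLUSH was never issued.  The resulting
	 * dependency cycle is:
	 *
	 *   md0_reclaim waits for the superblock write (its POSTFLUSH);
	 *   POSTFLUSH waits for fq->flush_data_in_flight to drop to zero;
	 *   that inflight data IO sits on md0_raid5's blk_plug list and is
	 *   never flushed, because md0_raid5 spins on MD_SB_CHANGE_PENDING,
	 *   which only md0_reclaim's superblock update would clear.
	 */
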
>
> The analysis sounds good, however, it seems to me that the behaviour
> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
> is not reasonable, because md_check_recovery() must hold
> 'reconfig_mutex' to clear the flag.

That's the behavior from before commit 5e2cf333b7bd, which was only added in
Sep 2022, so this behavior had been part of raid5 for many years.
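
For reference, the behaviour in question is roughly the following loop
shape (a heavily simplified sketch of raid5d(), not the literal upstream
code):

	spin_lock_irq(&conf->device_lock);
	while (1) {
		released = release_stripe_list(conf, conf->temp_inactive_list);

		if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
			spin_unlock_irq(&conf->device_lock);
			md_check_recovery(mddev);
			spin_lock_irq(&conf->device_lock);
		}

		batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
						   conf->temp_inactive_list);
		if (!batch_size && !released)
			break;
		/*
		 * While MD_SB_CHANGE_PENDING is set, stripes that need the
		 * superblock written cannot complete and get re-queued, so
		 * batch_size tends to stay non-zero and raid5d keeps
		 * spinning here at 100% CPU until another context clears
		 * the flag.
		 */
	}
	spin_unlock_irq(&conf->device_lock);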


>
> Look at raid1/raid10, there are two different behaviour that seems can
> avoid this problem as well:
>
> 1) blk_start_plug() is delayed until all failed IO is handled. This look
> reasonable because in order to get better performance, IO should be
> handled by submitted thread as much as possible, and meanwhile, the
> deadlock can be triggered here.
> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
> the handling of failed IO, and when mddev_unlock() is called, daemon
> thread will be woken up again to handle failed IO.
>
> How about the following patch?
>
> Thanks,
> Kuai
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..0b2e6060f2c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>
>         md_check_recovery(mddev);
>
> -       blk_start_plug(&plug);
>         handled = 0;
>         spin_lock_irq(&conf->device_lock);
>         while (1) {
> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>                 int batch_size, released;
>                 unsigned int offset;
>
> +               /*
> +                * md_check_recovery() can't clear sb_flags, usually 
> because of
> +                * 'reconfig_mutex' can't be grabbed, wait for 
> mddev_unlock() to
> +                * wake up raid5d().
> +                */
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                 released = release_stripe_list(conf, 
> conf->temp_inactive_list);
>                 if (released)
>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>                         spin_lock_irq(&conf->device_lock);
>                 }
>         }
> +skip:
>         pr_debug("%d stripes handled\n", handled);
> -
>         spin_unlock_irq(&conf->device_lock);
>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>             mutex_trylock(&conf->cache_size_mutex)) {
> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>                 mutex_unlock(&conf->cache_size_mutex);
>         }
>
> +       blk_start_plug(&plug);
>         flush_deferred_bios(conf);
>
>         r5l_flush_stripe_to_raid(conf->log);

This patch eliminates the benefit of blk_plug; I think that will not be
good from an IO performance perspective?
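
For context, the batching a plug provides looks roughly like this (a
generic sketch, not the raid5 code path; nr_bios and bios[] are just
placeholders):

	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr_bios; i++)
		submit_bio(bios[i]);	/* held on the per-task plug list so
					 * requests can be merged/batched */
	blk_finish_plug(&plug);		/* the whole batch is flushed to the
					 * driver in one go */

With blk_start_plug() moved down as in the patch, the stripe handling done
inside the loop no longer runs under the plug; only the deferred bios and
the log flush do.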


Thanks,

Junxiao.

>
>>
>> https://www.spinics.net/lists/raid/msg75338.html
>>
>> Dan, can you try the following patch?
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index de771093b526..474462abfbdc 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, 
>> bool from_schedule)
>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>                  blk_mq_free_plug_rqs(plug);
>>   }
>> +EXPORT_SYMBOL(__blk_flush_plug);
>>
>>   /**
>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 8497880135ee..26e09cdf46a3 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>> spin_unlock_irq(&conf->device_lock);
>>                          md_check_recovery(mddev);
>> spin_lock_irq(&conf->device_lock);
>> +               } else {
>> + spin_unlock_irq(&conf->device_lock);
>> +                       blk_flush_plug(&plug, false);
>> +                       cond_resched();
>> + spin_lock_irq(&conf->device_lock);
>>                  }
>>          }
>>          pr_debug("%d stripes handled\n", handled);
>>
>> Thanks,
>>
>> Junxiao.
>>
>> On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote:
>>> Hi Dan & Song,
>>>
>>> I have not root cause this yet, but would like share some findings 
>>> from the vmcore Dan shared. From what i can see, this doesn't look 
>>> like a md issue, but something wrong with block layer or below.
>>>
>>> 1. There were multiple process hung by IO over 15mins.
>>>
>>> crash> ps -m | grep UN
>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1 
>>> COMMAND: "jbd2/dm-3-8"
>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2 
>>> COMMAND: "dd"
>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3 
>>> COMMAND: "md0_reclaim"
>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1 
>>> COMMAND: "kworker/1:2"
>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7 
>>> COMMAND: "kworker/u21:1"
>>>
>>> 2. Let pick md0_reclaim to take a look, it is waiting done 
>>> super_block update. We can see there were two pending superblock 
>>> write and other pending io for the underling physical disk, which 
>>> caused these process hung.
>>>
>>> crash> bt 876
>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>
>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>   pending_writes = {
>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>   },
>>>   disks = {
>>>     next = 0xffff88810ce85a00,
>>>     prev = 0xffff88810ce84c00
>>>   },
>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
>>> 0xffff88810ce85a00
>>> ffff88810ce85a00
>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>   nr_pending = {
>>>     counter = 0
>>>   },
>>> ffff8881083ace00
>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>   nr_pending = {
>>>     counter = 10 <<<<
>>>   },
>>> ffff8881010ad200
>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>   nr_pending = {
>>>     counter = 8 <<<<<
>>>   },
>>> ffff88810ce84c00
>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>   nr_pending = {
>>>     counter = 2 <<<<<
>>>   },
>>>
>>> 3. From block layer, i can find the inflight IO for md superblock 
>>> write which has been pending 955s which matches with the hung time 
>>> of "md0_reclaim"
>>>
>>> crash> 
>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
>>> ffff888103b4c300
>>>   q = 0xffff888103a00d80,
>>>   mq_hctx = 0xffff888103c5d200,
>>>   cmd_flags = 38913,
>>>   rq_flags = 139408,
>>>   start_time_ns = 1504179024146,
>>>   bio = 0x0,
>>>   biotail = 0xffff888120758e40,
>>>   state = MQ_RQ_COMPLETE,
>>>   __data_len = 0,
>>>   flush = {
>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>     saved_end_io = 0x0
>>>   },
>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>
>>> crash> p tk_core.timekeeper.tkr_mono.base
>>> $1 = 2459916243002
>>> crash> eval 2459916243002-1504179024146
>>> hexadecimal: de86609f28
>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>       octal: 15720630117450
>>>      binary: 
>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>
>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>   bi_iter = {
>>>     bi_sector = 8, <<<< super block offset
>>>     bi_size = 0,
>>>     bi_idx = 0,
>>>     bi_bvec_done = 0
>>>   },
>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>> crash> dev -d | grep ffff888103a00d80
>>>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>>>
>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it 
>>> is still pending. That's because each md superblock write was marked 
>>> with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: 
>>> pre_flush, data, and post_flush. Once each step complete, it will be 
>>> marked in "request.flush.seq", here the value is 3, which is 
>>> REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step "post_flush" 
>>> has not be done. Another wired thing is that 
>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step 
>>> already done.
>>>
>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>   fq = 0xffff88810332e240,
>>> crash> blk_flush_queue 0xffff88810332e240
>>> struct blk_flush_queue {
>>>   mq_flush_lock = {
>>>     {
>>>       rlock = {
>>>         raw_lock = {
>>>           {
>>>             val = {
>>>               counter = 0
>>>             },
>>>             {
>>>               locked = 0 '\000',
>>>               pending = 0 '\000'
>>>             },
>>>             {
>>>               locked_pending = 0,
>>>               tail = 0
>>>             }
>>>           }
>>>         }
>>>       }
>>>     }
>>>   },
>>>   flush_pending_idx = 1,
>>>   flush_running_idx = 1,
>>>   rq_status = 0 '\000',
>>>   flush_pending_since = 4296171408,
>>>   flush_queue = {{
>>>       next = 0xffff88810332e250,
>>>       prev = 0xffff88810332e250
>>>     }, {
>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>       prev = 0xffff888103b4c348
>>>     }},
>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>   flush_rq = 0xffff888103c2e000
>>> }
>>>
>>> crash> list 0xffff888103b4c348
>>> ffff888103b4c348
>>> ffff88810332e260
>>>
>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw 
>>> queue
>>>   tag = -1,
>>>   state = MQ_RQ_IDLE,
>>>   ref = {
>>>     counter = 0
>>>   },
>>>
>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may 
>>> have some issue which leading to the io request from md layer stayed 
>>> in a partial complete statue. I can't see how this can be related 
>>> with the commit bed9e27baf52 ("Revert "md/raid5: Wait for 
>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>
>>>
>>> Dan,
>>>
>>> Are you able to reproduce using some regular scsi disk, would like 
>>> to rule out whether this is related with virtio-scsi?
>>>
>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
>>> official mainline v6.8-rc5 without any other patches?
>>>
>>>
>>> Thanks,
>>>
>>> Junxiao.
>>>
>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>> Hi,
>>>>
>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>>>> few regressions recently, but it seems like resolving this one is
>>>>> stalled. Or were you able to reproduce the issue or make some 
>>>>> progress
>>>>> and I just missed it?
>>>> Sorry for the delay with this issue. I have been occupied with some
>>>> other stuff this week.
>>>>
>>>> I haven't got luck to reproduce this issue. I will spend more time 
>>>> looking
>>>> into it next week.
>>>>
>>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>>> Revert the culprit and try again later? Or is that not an option 
>>>>> for one
>>>>> reason or another?
>>>> If we don't make progress with it in the next week, we will do the 
>>>> revert,
>>>> same as we did with stable kernels.
>>>>
>>>>> Or do we assume that this is not a real issue? That it's caused by 
>>>>> some
>>>>> oddity (bit-flip in the metadata or something like that?) only to be
>>>>> found in Dan's setup?
>>>> I don't think this is because of oddities. Hopefully we can get more
>>>> information about this soon.
>>>>
>>>> Thanks,
>>>> Song
>>>>
>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression 
>>>>> tracker' hat)
>>>>> -- 
>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>> If I did something stupid, please tell me, as explained on that page.
>>>>>
>>>>> #regzbot poke
>>>>>
>>
>> .
>>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-12 22:56             ` junxiao.bi
@ 2024-03-13  1:20               ` Yu Kuai
  2024-03-14 18:20                 ` junxiao.bi
  0 siblings, 1 reply; 53+ messages in thread
From: Yu Kuai @ 2024-03-13  1:20 UTC (permalink / raw)
  To: junxiao.bi, Yu Kuai, Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding, yukuai (C)

Hi,

On 2024/03/13 6:56, junxiao.bi@oracle.com wrote:
> On 3/10/24 6:50 PM, Yu Kuai wrote:
> 
>> Hi,
>>
>>> On 2024/03/09 7:49, junxiao.bi@oracle.com wrote:
>>> Here is the root cause for this issue:
>>>
>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>>> raid5d") introduced a regression, it got reverted through commit 
>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing, 
>>> commit d6e035aad6c0 ("md: bypass block throttle for superblock 
>>> update") was created, it avoids md superblock write getting throttled 
>>> by block layer which is good, but md superblock write could be stuck 
>>> in block layer due to block flush as well, and that is what was 
>>> happening in this regression report.
>>>
>>> Process "md0_reclaim" got stuck while waiting IO for md superblock 
>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, 
>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed 
>>> before done, the hung of this process is because the last step 
>>> "POSTFLUSH" never done. And that was because of  process "md0_raid5" 
>>> submitted another IO with REQ_FUA flag marked just before that step 
>>> started. To handle that IO, blk_insert_flush() will be invoked and 
>>> hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case where 
>>> "fq->flush_data_in_flight" will be increased. When the IO for md 
>>> superblock write was to issue "POSTFLUSH" step through 
>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not 
>>> zero, so it will skip that step, that is expected, because flush will 
>>> be triggered when "fq->flush_data_in_flight" dropped to zero.
>>>
>>> Unfortunately here that inflight data IO from "md0_raid5" will never 
>>> done, because it was added into the blk_plug list of that process, 
>>> but "md0_raid5" run into infinite loop due to "MD_SB_CHANGE_PENDING" 
>>> which made it never had a chance to finish the blk plug until 
>>> "MD_SB_CHANGE_PENDING" was cleared. Process "md0_reclaim" was 
>>> supposed to clear that flag but it was stuck by "md0_raid5", so this 
>>> is a deadlock.
>>>
>>> Looks like the approach in the RFC patch trying to resolve the 
>>> regression of commit 5e2cf333b7bd can help this issue. Once 
>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should 
>>> release all its staging IO requests to avoid blocking others. Also a 
>>> cond_reschedule() will avoid it run into lockup.
>>
>> The analysis sounds good, however, it seems to me that the behaviour
>> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
>> is not reasonable, because md_check_recovery() must hold
>> 'reconfig_mutex' to clear the flag.
> 
> That's the behavior before commit 5e2cf333b7bd which was added into Sep 
> 2022, so this behavior has been with raid5 for many years.
> 

Yes, but the fact that it has existed for a long time doesn't mean it's
good. It is really weird to hold a spinlock to wait for a mutex.
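
To spell that out: raid5d() never literally calls mutex_lock() while
holding the spinlock, but it spins on a bit that can only be cleared by a
context holding 'reconfig_mutex', which amounts to the same thing.
Conceptually (simplified, not the literal code):

	/* raid5d(), under conf->device_lock: keeps re-checking the bit */
	while (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
		; /* retry stripes that cannot make progress */

	/* md_check_recovery(): the only path towards clearing that bit */
	if (mddev_trylock(mddev)) {		/* i.e. reconfig_mutex */
		if (mddev->sb_flags)
			md_update_sb(mddev, 0);	/* eventually clears
						 * MD_SB_CHANGE_PENDING */
		mddev_unlock(mddev);
	}
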
> 
>>
>> Look at raid1/raid10, there are two different behaviour that seems can
>> avoid this problem as well:
>>
>> 1) blk_start_plug() is delayed until all failed IO is handled. This look
>> reasonable because in order to get better performance, IO should be
>> handled by submitted thread as much as possible, and meanwhile, the
>> deadlock can be triggered here.
>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
>> the handling of failed IO, and when mddev_unlock() is called, daemon
>> thread will be woken up again to handle failed IO.
>>
>> How about the following patch?
>>
>> Thanks,
>> Kuai
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>
>>         md_check_recovery(mddev);
>>
>> -       blk_start_plug(&plug);
>>         handled = 0;
>>         spin_lock_irq(&conf->device_lock);
>>         while (1) {
>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>                 int batch_size, released;
>>                 unsigned int offset;
>>
>> +               /*
>> +                * md_check_recovery() can't clear sb_flags, usually 
>> because of
>> +                * 'reconfig_mutex' can't be grabbed, wait for 
>> mddev_unlock() to
>> +                * wake up raid5d().
>> +                */
>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>> +                       goto skip;
>> +
>>                 released = release_stripe_list(conf, 
>> conf->temp_inactive_list);
>>                 if (released)
>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>                         spin_lock_irq(&conf->device_lock);
>>                 }
>>         }
>> +skip:
>>         pr_debug("%d stripes handled\n", handled);
>> -
>>         spin_unlock_irq(&conf->device_lock);
>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>             mutex_trylock(&conf->cache_size_mutex)) {
>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>                 mutex_unlock(&conf->cache_size_mutex);
>>         }
>>
>> +       blk_start_plug(&plug);
>>         flush_deferred_bios(conf);
>>
>>         r5l_flush_stripe_to_raid(conf->log);
> 
> This patch eliminated the benefit of blk_plug, i think it will not be 
> good for IO performance perspective?

There is only one daemon thread, so IO should be handled here as little as
possible. The IO should be handled by the thread that is submitting it, and
the daemon should be left to handle the cases where IO failed or couldn't
be submitted at that time.

Thanks,
Kuai

> 
> 
> Thanks,
> 
> Junxiao.
> 
>>
>>>
>>> https://www.spinics.net/lists/raid/msg75338.html
>>>
>>> Dan, can you try the following patch?
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index de771093b526..474462abfbdc 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, 
>>> bool from_schedule)
>>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>>                  blk_mq_free_plug_rqs(plug);
>>>   }
>>> +EXPORT_SYMBOL(__blk_flush_plug);
>>>
>>>   /**
>>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index 8497880135ee..26e09cdf46a3 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>>> spin_unlock_irq(&conf->device_lock);
>>>                          md_check_recovery(mddev);
>>> spin_lock_irq(&conf->device_lock);
>>> +               } else {
>>> + spin_unlock_irq(&conf->device_lock);
>>> +                       blk_flush_plug(&plug, false);
>>> +                       cond_resched();
>>> + spin_lock_irq(&conf->device_lock);
>>>                  }
>>>          }
>>>          pr_debug("%d stripes handled\n", handled);
>>>
>>> Thanks,
>>>
>>> Junxiao.
>>>
>>> On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote:
>>>> Hi Dan & Song,
>>>>
>>>> I have not root cause this yet, but would like share some findings 
>>>> from the vmcore Dan shared. From what i can see, this doesn't look 
>>>> like a md issue, but something wrong with block layer or below.
>>>>
>>>> 1. There were multiple process hung by IO over 15mins.
>>>>
>>>> crash> ps -m | grep UN
>>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1 
>>>> COMMAND: "jbd2/dm-3-8"
>>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2 
>>>> COMMAND: "dd"
>>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3 
>>>> COMMAND: "md0_reclaim"
>>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1 
>>>> COMMAND: "kworker/1:2"
>>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7 
>>>> COMMAND: "kworker/u21:1"
>>>>
>>>> 2. Let pick md0_reclaim to take a look, it is waiting done 
>>>> super_block update. We can see there were two pending superblock 
>>>> write and other pending io for the underling physical disk, which 
>>>> caused these process hung.
>>>>
>>>> crash> bt 876
>>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>>
>>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>>   pending_writes = {
>>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>>   },
>>>>   disks = {
>>>>     next = 0xffff88810ce85a00,
>>>>     prev = 0xffff88810ce84c00
>>>>   },
>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
>>>> 0xffff88810ce85a00
>>>> ffff88810ce85a00
>>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>>   nr_pending = {
>>>>     counter = 0
>>>>   },
>>>> ffff8881083ace00
>>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>>   nr_pending = {
>>>>     counter = 10 <<<<
>>>>   },
>>>> ffff8881010ad200
>>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>>   nr_pending = {
>>>>     counter = 8 <<<<<
>>>>   },
>>>> ffff88810ce84c00
>>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>>   nr_pending = {
>>>>     counter = 2 <<<<<
>>>>   },
>>>>
>>>> 3. From block layer, i can find the inflight IO for md superblock 
>>>> write which has been pending 955s which matches with the hung time 
>>>> of "md0_reclaim"
>>>>
>>>> crash> 
>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
>>>> ffff888103b4c300
>>>>   q = 0xffff888103a00d80,
>>>>   mq_hctx = 0xffff888103c5d200,
>>>>   cmd_flags = 38913,
>>>>   rq_flags = 139408,
>>>>   start_time_ns = 1504179024146,
>>>>   bio = 0x0,
>>>>   biotail = 0xffff888120758e40,
>>>>   state = MQ_RQ_COMPLETE,
>>>>   __data_len = 0,
>>>>   flush = {
>>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>>     saved_end_io = 0x0
>>>>   },
>>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>>
>>>> crash> p tk_core.timekeeper.tkr_mono.base
>>>> $1 = 2459916243002
>>>> crash> eval 2459916243002-1504179024146
>>>> hexadecimal: de86609f28
>>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>>       octal: 15720630117450
>>>>      binary: 
>>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>>
>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>>   bi_iter = {
>>>>     bi_sector = 8, <<<< super block offset
>>>>     bi_size = 0,
>>>>     bi_idx = 0,
>>>>     bi_bvec_done = 0
>>>>   },
>>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>>> crash> dev -d | grep ffff888103a00d80
>>>>     8 ffff8881003ab000   sdd        ffff888103a00d80       0 0 0
>>>>
>>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it 
>>>> is still pending. That's because each md superblock write was marked 
>>>> with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: 
>>>> pre_flush, data, and post_flush. Once each step complete, it will be 
>>>> marked in "request.flush.seq", here the value is 3, which is 
>>>> REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step "post_flush" 
>>>> has not be done. Another wired thing is that 
>>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step 
>>>> already done.
>>>>
>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>>   fq = 0xffff88810332e240,
>>>> crash> blk_flush_queue 0xffff88810332e240
>>>> struct blk_flush_queue {
>>>>   mq_flush_lock = {
>>>>     {
>>>>       rlock = {
>>>>         raw_lock = {
>>>>           {
>>>>             val = {
>>>>               counter = 0
>>>>             },
>>>>             {
>>>>               locked = 0 '\000',
>>>>               pending = 0 '\000'
>>>>             },
>>>>             {
>>>>               locked_pending = 0,
>>>>               tail = 0
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     }
>>>>   },
>>>>   flush_pending_idx = 1,
>>>>   flush_running_idx = 1,
>>>>   rq_status = 0 '\000',
>>>>   flush_pending_since = 4296171408,
>>>>   flush_queue = {{
>>>>       next = 0xffff88810332e250,
>>>>       prev = 0xffff88810332e250
>>>>     }, {
>>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>>       prev = 0xffff888103b4c348
>>>>     }},
>>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>>   flush_rq = 0xffff888103c2e000
>>>> }
>>>>
>>>> crash> list 0xffff888103b4c348
>>>> ffff888103b4c348
>>>> ffff88810332e260
>>>>
>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw 
>>>> queue
>>>>   tag = -1,
>>>>   state = MQ_RQ_IDLE,
>>>>   ref = {
>>>>     counter = 0
>>>>   },
>>>>
>>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may 
>>>> have some issue which leading to the io request from md layer stayed 
>>>> in a partial complete statue. I can't see how this can be related 
>>>> with the commit bed9e27baf52 ("Revert "md/raid5: Wait for 
>>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>>
>>>>
>>>> Dan,
>>>>
>>>> Are you able to reproduce using some regular scsi disk, would like 
>>>> to rule out whether this is related with virtio-scsi?
>>>>
>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
>>>> official mainline v6.8-rc5 without any other patches?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Junxiao.
>>>>
>>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>>>>> few regressions recently, but it seems like resolving this one is
>>>>>> stalled. Or were you able to reproduce the issue or make some 
>>>>>> progress
>>>>>> and I just missed it?
>>>>> Sorry for the delay with this issue. I have been occupied with some
>>>>> other stuff this week.
>>>>>
>>>>> I haven't got luck to reproduce this issue. I will spend more time 
>>>>> looking
>>>>> into it next week.
>>>>>
>>>>>> And if not, what's the way forward here wrt to the release of 6.8?
>>>>>> Revert the culprit and try again later? Or is that not an option 
>>>>>> for one
>>>>>> reason or another?
>>>>> If we don't make progress with it in the next week, we will do the 
>>>>> revert,
>>>>> same as we did with stable kernels.
>>>>>
>>>>>> Or do we assume that this is not a real issue? That it's caused by 
>>>>>> some
>>>>>> oddity (bit-flip in the metadata or something like that?) only to be
>>>>>> found in Dan's setup?
>>>>> I don't think this is because of oddities. Hopefully we can get more
>>>>> information about this soon.
>>>>>
>>>>> Thanks,
>>>>> Song
>>>>>
>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression 
>>>>>> tracker' hat)
>>>>>> -- 
>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>> If I did something stupid, please tell me, as explained on that page.
>>>>>>
>>>>>> #regzbot poke
>>>>>>
>>>
>>> .
>>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-11  1:50           ` Yu Kuai
  2024-03-12 22:56             ` junxiao.bi
@ 2024-03-14 16:12             ` Dan Moulding
  2024-03-15  1:17               ` Yu Kuai
  1 sibling, 1 reply; 53+ messages in thread
From: Dan Moulding @ 2024-03-14 16:12 UTC (permalink / raw)
  To: yukuai1
  Cc: dan, gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	song, stable, yukuai3

> How about the following patch?
> 
> Thanks,
> Kuai
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..0b2e6060f2c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
> 
>          md_check_recovery(mddev);
> 
> -       blk_start_plug(&plug);
>          handled = 0;
>          spin_lock_irq(&conf->device_lock);
>          while (1) {
> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>                  int batch_size, released;
>                  unsigned int offset;
> 
> +               /*
> +                * md_check_recovery() can't clear sb_flags, usually 
> because of
> +                * 'reconfig_mutex' can't be grabbed, wait for 
> mddev_unlock() to
> +                * wake up raid5d().
> +                */
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                  released = release_stripe_list(conf, 
> conf->temp_inactive_list);
>                  if (released)
>                          clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>                          spin_lock_irq(&conf->device_lock);
>                  }
>          }
> +skip:
>          pr_debug("%d stripes handled\n", handled);
> -
>          spin_unlock_irq(&conf->device_lock);
>          if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>              mutex_trylock(&conf->cache_size_mutex)) {
> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>                  mutex_unlock(&conf->cache_size_mutex);
>          }
> 
> +       blk_start_plug(&plug);
>          flush_deferred_bios(conf);
> 
>          r5l_flush_stripe_to_raid(conf->log);

I can confirm that this patch also works. I'm unable to reproduce the
hang after applying this instead of the first patch provided by
Junxiao. So it looks like both approaches are successful in avoiding the hang.

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-13  1:20               ` Yu Kuai
@ 2024-03-14 18:20                 ` junxiao.bi
  2024-03-14 22:36                   ` Song Liu
  2024-03-15  1:30                   ` Yu Kuai
  0 siblings, 2 replies; 53+ messages in thread
From: junxiao.bi @ 2024-03-14 18:20 UTC (permalink / raw)
  To: Yu Kuai, Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding, yukuai (C)

On 3/12/24 6:20 PM, Yu Kuai wrote:

> Hi,
>
> On 2024/03/13 6:56, junxiao.bi@oracle.com wrote:
>> On 3/10/24 6:50 PM, Yu Kuai wrote:
>>
>>> Hi,
>>>
>>>> On 2024/03/09 7:49, junxiao.bi@oracle.com wrote:
>>>> Here is the root cause for this issue:
>>>>
>>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>>>> raid5d") introduced a regression, it got reverted through commit 
>>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>>>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing, 
>>>> commit d6e035aad6c0 ("md: bypass block throttle for superblock 
>>>> update") was created, it avoids md superblock write getting 
>>>> throttled by block layer which is good, but md superblock write 
>>>> could be stuck in block layer due to block flush as well, and that 
>>>> is what was happening in this regression report.
>>>>
>>>> Process "md0_reclaim" got stuck while waiting IO for md superblock 
>>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, 
>>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed 
>>>> before done, the hung of this process is because the last step 
>>>> "POSTFLUSH" never done. And that was because of  process 
>>>> "md0_raid5" submitted another IO with REQ_FUA flag marked just 
>>>> before that step started. To handle that IO, blk_insert_flush() 
>>>> will be invoked and hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case 
>>>> where "fq->flush_data_in_flight" will be increased. When the IO for 
>>>> md superblock write was to issue "POSTFLUSH" step through 
>>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not 
>>>> zero, so it will skip that step, that is expected, because flush 
>>>> will be triggered when "fq->flush_data_in_flight" dropped to zero.
>>>>
>>>> Unfortunately here that inflight data IO from "md0_raid5" will 
>>>> never done, because it was added into the blk_plug list of that 
>>>> process, but "md0_raid5" run into infinite loop due to 
>>>> "MD_SB_CHANGE_PENDING" which made it never had a chance to finish 
>>>> the blk plug until "MD_SB_CHANGE_PENDING" was cleared. Process 
>>>> "md0_reclaim" was supposed to clear that flag but it was stuck by 
>>>> "md0_raid5", so this is a deadlock.
>>>>
>>>> Looks like the approach in the RFC patch trying to resolve the 
>>>> regression of commit 5e2cf333b7bd can help this issue. Once 
>>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should 
>>>> release all its staging IO requests to avoid blocking others. Also 
>>>> a cond_reschedule() will avoid it run into lockup.
>>>
>>> The analysis sounds good, however, it seems to me that the behaviour
>>> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared
>>> is not reasonable, because md_check_recovery() must hold
>>> 'reconfig_mutex' to clear the flag.
>>
>> That's the behavior before commit 5e2cf333b7bd which was added into 
>> Sep 2022, so this behavior has been with raid5 for many years.
>>
>
> Yes, it exists for a long time doesn't mean it's good. It is really
> weird to hold spinlock to wait for a mutex.
I am confused about this: where is the code that waits for a mutex while
holding a spinlock? Wouldn't that cause a deadlock?
>>
>>>
>>> Look at raid1/raid10, there are two different behaviour that seems can
>>> avoid this problem as well:
>>>
>>> 1) blk_start_plug() is delayed until all failed IO is handled. This 
>>> look
>>> reasonable because in order to get better performance, IO should be
>>> handled by submitted thread as much as possible, and meanwhile, the
>>> deadlock can be triggered here.
>>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), 
>>> skip
>>> the handling of failed IO, and when mddev_unlock() is called, daemon
>>> thread will be woken up again to handle failed IO.
>>>
>>> How about the following patch?
>>>
>>> Thanks,
>>> Kuai
>>>
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>>
>>>         md_check_recovery(mddev);
>>>
>>> -       blk_start_plug(&plug);
>>>         handled = 0;
>>>         spin_lock_irq(&conf->device_lock);
>>>         while (1) {
>>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>>                 int batch_size, released;
>>>                 unsigned int offset;
>>>
>>> +               /*
>>> +                * md_check_recovery() can't clear sb_flags, usually 
>>> because of
>>> +                * 'reconfig_mutex' can't be grabbed, wait for 
>>> mddev_unlock() to
>>> +                * wake up raid5d().
>>> +                */
>>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>> +                       goto skip;
>>> +
>>>                 released = release_stripe_list(conf, 
>>> conf->temp_inactive_list);
>>>                 if (released)
>>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>> spin_lock_irq(&conf->device_lock);
>>>                 }
>>>         }
>>> +skip:
>>>         pr_debug("%d stripes handled\n", handled);
>>> -
>>>         spin_unlock_irq(&conf->device_lock);
>>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>>             mutex_trylock(&conf->cache_size_mutex)) {
>>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>>                 mutex_unlock(&conf->cache_size_mutex);
>>>         }
>>>
>>> +       blk_start_plug(&plug);
>>>         flush_deferred_bios(conf);
>>>
>>>         r5l_flush_stripe_to_raid(conf->log);
>>
>> This patch eliminated the benefit of blk_plug, i think it will not be 
>> good for IO performance perspective?
>
> There is only one daemon thread, so IO should not be handled here as
> much as possible. The IO should be handled by the thread that is
> submitting the IO, and let daemon to hanldle the case that IO failed or
> can't be submitted at that time.

I am not sure how much impact dropping the blk_plug will have.

Song, what's your take on this?

Thanks,

Junxiao.

>
> Thanks,
> Kuai
>
>>
>>
>> Thanks,
>>
>> Junxiao.
>>
>>>
>>>>
>>>> https://www.spinics.net/lists/raid/msg75338.html
>>>>
>>>> Dan, can you try the following patch?
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index de771093b526..474462abfbdc 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, 
>>>> bool from_schedule)
>>>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>>>                  blk_mq_free_plug_rqs(plug);
>>>>   }
>>>> +EXPORT_SYMBOL(__blk_flush_plug);
>>>>
>>>>   /**
>>>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>> index 8497880135ee..26e09cdf46a3 100644
>>>> --- a/drivers/md/raid5.c
>>>> +++ b/drivers/md/raid5.c
>>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>>>> spin_unlock_irq(&conf->device_lock);
>>>>                          md_check_recovery(mddev);
>>>> spin_lock_irq(&conf->device_lock);
>>>> +               } else {
>>>> + spin_unlock_irq(&conf->device_lock);
>>>> +                       blk_flush_plug(&plug, false);
>>>> +                       cond_resched();
>>>> + spin_lock_irq(&conf->device_lock);
>>>>                  }
>>>>          }
>>>>          pr_debug("%d stripes handled\n", handled);
>>>>
>>>> Thanks,
>>>>
>>>> Junxiao.
>>>>
>>>> On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote:
>>>>> Hi Dan & Song,
>>>>>
>>>>> I have not root cause this yet, but would like share some findings 
>>>>> from the vmcore Dan shared. From what i can see, this doesn't look 
>>>>> like a md issue, but something wrong with block layer or below.
>>>>>
>>>>> 1. There were multiple process hung by IO over 15mins.
>>>>>
>>>>> crash> ps -m | grep UN
>>>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1 
>>>>> COMMAND: "jbd2/dm-3-8"
>>>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2 
>>>>> COMMAND: "dd"
>>>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3 
>>>>> COMMAND: "md0_reclaim"
>>>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1 
>>>>> COMMAND: "kworker/1:2"
>>>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7 
>>>>> COMMAND: "kworker/u21:1"
>>>>>
>>>>> 2. Let pick md0_reclaim to take a look, it is waiting done 
>>>>> super_block update. We can see there were two pending superblock 
>>>>> write and other pending io for the underling physical disk, which 
>>>>> caused these process hung.
>>>>>
>>>>> crash> bt 876
>>>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: 
>>>>> "md0_reclaim"
>>>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>>>
>>>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>>>   pending_writes = {
>>>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>>>   },
>>>>>   disks = {
>>>>>     next = 0xffff88810ce85a00,
>>>>>     prev = 0xffff88810ce84c00
>>>>>   },
>>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
>>>>> 0xffff88810ce85a00
>>>>> ffff88810ce85a00
>>>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>>>   nr_pending = {
>>>>>     counter = 0
>>>>>   },
>>>>> ffff8881083ace00
>>>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>>>   nr_pending = {
>>>>>     counter = 10 <<<<
>>>>>   },
>>>>> ffff8881010ad200
>>>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>>>   nr_pending = {
>>>>>     counter = 8 <<<<<
>>>>>   },
>>>>> ffff88810ce84c00
>>>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>>>   nr_pending = {
>>>>>     counter = 2 <<<<<
>>>>>   },
>>>>>
>>>>> 3. From block layer, i can find the inflight IO for md superblock 
>>>>> write which has been pending 955s which matches with the hung time 
>>>>> of "md0_reclaim"
>>>>>
>>>>> crash> 
>>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
>>>>> ffff888103b4c300
>>>>>   q = 0xffff888103a00d80,
>>>>>   mq_hctx = 0xffff888103c5d200,
>>>>>   cmd_flags = 38913,
>>>>>   rq_flags = 139408,
>>>>>   start_time_ns = 1504179024146,
>>>>>   bio = 0x0,
>>>>>   biotail = 0xffff888120758e40,
>>>>>   state = MQ_RQ_COMPLETE,
>>>>>   __data_len = 0,
>>>>>   flush = {
>>>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>>>     saved_end_io = 0x0
>>>>>   },
>>>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>>>
>>>>> crash> p tk_core.timekeeper.tkr_mono.base
>>>>> $1 = 2459916243002
>>>>> crash> eval 2459916243002-1504179024146
>>>>> hexadecimal: de86609f28
>>>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>>>       octal: 15720630117450
>>>>>      binary: 
>>>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>>>
>>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>>>   bi_iter = {
>>>>>     bi_sector = 8, <<<< super block offset
>>>>>     bi_size = 0,
>>>>>     bi_idx = 0,
>>>>>     bi_bvec_done = 0
>>>>>   },
>>>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>>>> crash> dev -d | grep ffff888103a00d80
>>>>>     8 ffff8881003ab000   sdd        ffff888103a00d80 0 0 0
>>>>>
>>>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it 
>>>>> is still pending. That's because each md superblock write was 
>>>>> marked with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 
>>>>> steps: pre_flush, data, and post_flush. Once each step complete, 
>>>>> it will be marked in "request.flush.seq", here the value is 3, 
>>>>> which is REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA, so the last step 
>>>>> "post_flush" has not be done. Another wired thing is that 
>>>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step 
>>>>> already done.
>>>>>
>>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>>>   fq = 0xffff88810332e240,
>>>>> crash> blk_flush_queue 0xffff88810332e240
>>>>> struct blk_flush_queue {
>>>>>   mq_flush_lock = {
>>>>>     {
>>>>>       rlock = {
>>>>>         raw_lock = {
>>>>>           {
>>>>>             val = {
>>>>>               counter = 0
>>>>>             },
>>>>>             {
>>>>>               locked = 0 '\000',
>>>>>               pending = 0 '\000'
>>>>>             },
>>>>>             {
>>>>>               locked_pending = 0,
>>>>>               tail = 0
>>>>>             }
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   },
>>>>>   flush_pending_idx = 1,
>>>>>   flush_running_idx = 1,
>>>>>   rq_status = 0 '\000',
>>>>>   flush_pending_since = 4296171408,
>>>>>   flush_queue = {{
>>>>>       next = 0xffff88810332e250,
>>>>>       prev = 0xffff88810332e250
>>>>>     }, {
>>>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>>>       prev = 0xffff888103b4c348
>>>>>     }},
>>>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>>>   flush_rq = 0xffff888103c2e000
>>>>> }
>>>>>
>>>>> crash> list 0xffff888103b4c348
>>>>> ffff888103b4c348
>>>>> ffff88810332e260
>>>>>
>>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of 
>>>>> hw queue
>>>>>   tag = -1,
>>>>>   state = MQ_RQ_IDLE,
>>>>>   ref = {
>>>>>     counter = 0
>>>>>   },
>>>>>
>>>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may 
>>>>> have some issue which leading to the io request from md layer 
>>>>> stayed in a partial complete statue. I can't see how this can be 
>>>>> related with the commit bed9e27baf52 ("Revert "md/raid5: Wait for 
>>>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>>>
>>>>>
>>>>> Dan,
>>>>>
>>>>> Are you able to reproduce using some regular scsi disk, would like 
>>>>> to rule out whether this is related with virtio-scsi?
>>>>>
>>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
>>>>> official mainline v6.8-rc5 without any other patches?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Junxiao.
>>>>>
>>>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>>>> Song Liu, what's the status here? I'm aware that you have dealt with 
>>>>>>> quite a few regressions recently, but it seems like resolving this one 
>>>>>>> is stalled. Or were you able to reproduce the issue or make some 
>>>>>>> progress and I just missed it?
>>>>>> Sorry for the delay with this issue. I have been occupied with some
>>>>>> other stuff this week.
>>>>>>
>>>>>> I haven't had any luck reproducing this issue. I will spend more 
>>>>>> time looking into it next week.
>>>>>>
>>>>>> And if not, what's the way forward here wrt the release of 6.8?
>>>>>>> Revert the culprit and try again later? Or is that not an option 
>>>>>>> for one
>>>>>>> reason or another?
>>>>>> If we don't make progress with it in the next week, we will do 
>>>>>> the revert,
>>>>>> same as we did with stable kernels.
>>>>>>
>>>>>>> Or do we assume that this is not a real issue? That it's caused 
>>>>>>> by some
>>>>>>> oddity (bit-flip in the metadata or something like that?) only 
>>>>>>> to be
>>>>>>> found in Dan's setup?
>>>>>> I don't think this is because of oddities. Hopefully we can get more
>>>>>> information about this soon.
>>>>>>
>>>>>> Thanks,
>>>>>> Song
>>>>>>
>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression 
>>>>>>> tracker' hat)
>>>>>>> -- 
>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>> If I did something stupid, please tell me, as explained on that 
>>>>>>> page.
>>>>>>>
>>>>>>> #regzbot poke
>>>>>>>
>>>>
>>>> .
>>>>
>>>
>> .
>>
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-14 18:20                 ` junxiao.bi
@ 2024-03-14 22:36                   ` Song Liu
  2024-03-15  1:30                   ` Yu Kuai
  1 sibling, 0 replies; 53+ messages in thread
From: Song Liu @ 2024-03-14 22:36 UTC (permalink / raw)
  To: junxiao.bi
  Cc: Yu Kuai, Linux regressions mailing list, gregkh, linux-kernel,
	linux-raid, stable, Dan Moulding, yukuai (C)

On Thu, Mar 14, 2024 at 11:20 AM <junxiao.bi@oracle.com> wrote:
>
[...]
> >>
> >> This patch eliminates the benefit of blk_plug; I think it will not be
> >> good from an IO performance perspective?
> >
> > There is only one daemon thread, so IO should be handled here as little
> > as possible. The IO should be handled by the thread that is submitting
> > the IO, and the daemon should be left to handle the cases where IO failed
> > or couldn't be submitted at that time.

raid5 can have multiple threads calling handle_stripe(). See raid5_do_work().
Only chunk_aligned_read() can be handled in raid5_make_request.

>
> I am not sure how much impact dropping the blk_plug will have.
>
> Song, what's your take on this?

I think we need to evaluate the impact of (removing) blk_plug. We had
some performance regressions related to blk_plug a couple years ago.

Thanks,
Song

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-14 16:12             ` Dan Moulding
@ 2024-03-15  1:17               ` Yu Kuai
  2024-03-19 14:16                 ` Dan Moulding
  0 siblings, 1 reply; 53+ messages in thread
From: Yu Kuai @ 2024-03-15  1:17 UTC (permalink / raw)
  To: Dan Moulding, yukuai1
  Cc: gregkh, junxiao.bi, linux-kernel, linux-raid, regressions, song,
	stable, yukuai (C)

Hi,

On 2024/03/15 0:12, Dan Moulding wrote:
>> How about the following patch?
>>
>> Thanks,
>> Kuai
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>
>>           md_check_recovery(mddev);
>>
>> -       blk_start_plug(&plug);
>>           handled = 0;
>>           spin_lock_irq(&conf->device_lock);
>>           while (1) {
>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>                   int batch_size, released;
>>                   unsigned int offset;
>>
>> +               /*
>> +                * md_check_recovery() can't clear sb_flags, usually because of
>> +                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
>> +                * wake up raid5d().
>> +                */
>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>> +                       goto skip;
>> +
>>                   released = release_stripe_list(conf,
>> conf->temp_inactive_list);
>>                   if (released)
>>                           clear_bit(R5_DID_ALLOC, &conf->cache_state);
>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>                           spin_lock_irq(&conf->device_lock);
>>                   }
>>           }
>> +skip:
>>           pr_debug("%d stripes handled\n", handled);
>> -
>>           spin_unlock_irq(&conf->device_lock);
>>           if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>               mutex_trylock(&conf->cache_size_mutex)) {
>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>                   mutex_unlock(&conf->cache_size_mutex);
>>           }
>>
>> +       blk_start_plug(&plug);
>>           flush_deferred_bios(conf);
>>
>>           r5l_flush_stripe_to_raid(conf->log);
> 
> I can confirm that this patch also works. I'm unable to reproduce the
> hang after applying this instead of the first patch provided by
> Junxiao. So it looks like both ways are successful in avoiding the hang.
> 

Thanks a lot for the testing! Can you also give the following patch a try?
It removes the change to blk_plug; because Dan and Song are worried
about performance degradation, we need to verify the performance
before considering that patch.

Anyway, I think the following patch can fix this problem as well.

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..ae8665be9940 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread)
                 int batch_size, released;
                 unsigned int offset;

+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                 released = release_stripe_list(conf, 
conf->temp_inactive_list);
                 if (released)
                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread)
                         spin_lock_irq(&conf->device_lock);
                 }
         }
+skip:
         pr_debug("%d stripes handled\n", handled);

         spin_unlock_irq(&conf->device_lock);
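
For the performance verification mentioned above, one rough way to compare
sequential write throughput before and after either patch is a small
O_DIRECT writer along the lines of the sketch below. This is only an
illustration -- the file path, block size and total size are assumptions,
and a real evaluation would more likely use fio against a filesystem on the
raid5 array:

/* build: gcc -O2 -D_GNU_SOURCE seqwrite.c -o seqwrite
 * usage: ./seqwrite /path/on/the/array/test.bin   (path is an assumption)
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BUF_SZ  (1 << 20)       /* 1 MiB per write() */
#define NWRITES 4096            /* 4 GiB total */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "test.bin";
	struct timespec t0, t1;
	double secs;
	void *buf;
	int fd, i;

	if (posix_memalign(&buf, 4096, BUF_SZ))
		return 1;
	memset(buf, 0xab, BUF_SZ);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NWRITES; i++) {
		if (write(fd, buf, BUF_SZ) != BUF_SZ) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%d MiB in %.2f s -> %.1f MiB/s\n", NWRITES, secs, NWRITES / secs);
	close(fd);
	free(buf);
	return 0;
}

Running it a few times on an otherwise idle array, with and without the
patch, should show whether moving or dropping the plug changes large
sequential writes in any meaningful way.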


> -- Dan
> .
> 


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-14 18:20                 ` junxiao.bi
  2024-03-14 22:36                   ` Song Liu
@ 2024-03-15  1:30                   ` Yu Kuai
  1 sibling, 0 replies; 53+ messages in thread
From: Yu Kuai @ 2024-03-15  1:30 UTC (permalink / raw)
  To: junxiao.bi, Yu Kuai, Song Liu, Linux regressions mailing list
  Cc: gregkh, linux-kernel, linux-raid, stable, Dan Moulding, yukuai (C)

Hi,

On 2024/03/15 2:20, junxiao.bi@oracle.com wrote:
> On 3/12/24 6:20 PM, Yu Kuai wrote:
> 
>> Hi,
>>
>> On 2024/03/13 6:56, junxiao.bi@oracle.com wrote:
>>> On 3/10/24 6:50 PM, Yu Kuai wrote:
>>>
>>>> Hi,
>>>>
>>>> On 2024/03/09 7:49, junxiao.bi@oracle.com wrote:
>>>>> Here is the root cause for this issue:
>>>>>
>>>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>>>>> raid5d") introduced a regression, so it got reverted through commit 
>>>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
>>>>> raid5d""). To fix the original issue that commit 5e2cf333b7bd was 
>>>>> addressing, commit d6e035aad6c0 ("md: bypass block throttle for 
>>>>> superblock update") was created. It avoids the md superblock write 
>>>>> getting throttled by the block layer, which is good, but the md 
>>>>> superblock write can also get stuck in the block layer due to a block 
>>>>> flush, and that is what was happening in this regression report.
>>>>>
>>>>> Process "md0_reclaim" got stuck while waiting IO for md superblock 
>>>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, 
>>>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed 
>>>>> before done, the hung of this process is because the last step 
>>>>> "POSTFLUSH" never done. And that was because of  process 
>>>>> "md0_raid5" submitted another IO with REQ_FUA flag marked just 
>>>>> before that step started. To handle that IO, blk_insert_flush() 
>>>>> will be invoked and hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case 
>>>>> where "fq->flush_data_in_flight" will be increased. When the IO for 
>>>>> md superblock write was to issue "POSTFLUSH" step through 
>>>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not 
>>>>> zero, so it will skip that step, that is expected, because flush 
>>>>> will be triggered when "fq->flush_data_in_flight" dropped to zero.
>>>>>
>>>>> Unfortunately here that inflight data IO from "md0_raid5" will 
>>>>> never done, because it was added into the blk_plug list of that 
>>>>> process, but "md0_raid5" run into infinite loop due to 
>>>>> "MD_SB_CHANGE_PENDING" which made it never had a chance to finish 
>>>>> the blk plug until "MD_SB_CHANGE_PENDING" was cleared. Process 
>>>>> "md0_reclaim" was supposed to clear that flag but it was stuck by 
>>>>> "md0_raid5", so this is a deadlock.
>>>>>
>>>>> It looks like the approach in the RFC patch that tries to resolve the 
>>>>> regression from commit 5e2cf333b7bd can help with this issue. Once 
>>>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should 
>>>>> release all its staged IO requests to avoid blocking others. Also, 
>>>>> a cond_resched() will keep it from running into a lockup.
>>>>
>>>> The analysis sounds good; however, it seems to me that the behaviour
>>>> where raid5d() pins the CPU while waiting for 'MD_SB_CHANGE_PENDING' to
>>>> be cleared is not reasonable, because md_check_recovery() must hold
>>>> 'reconfig_mutex' to clear the flag.
>>>
>>> That's the behavior from before commit 5e2cf333b7bd, which was added in 
>>> Sep 2022, so this behavior has been in raid5 for many years.
>>>
>>
>> Yes, but the fact that it has existed for a long time doesn't mean it's
>> good. It is really weird to hold a spinlock to wait for a mutex.
> I am confused about this: where is the code that waits on the mutex while 
> holding the spinlock? Wouldn't that cause a deadlock?

For example, assume that another context is already holding
'reconfig_mutex', and that this can be slow; then in raid5d:

md_check_recovery
  try lock 'reconfig_mutex' failed

while (1)
  hold spin_lock
  try to issue IO, failed
  release spin_lock
  blk_flush_plug
  hold spin_lock

So, until the other context releases 'reconfig_mutex' and
md_check_recovery() then updates the superblock, raid5d() will not make 
progress; meanwhile it will waste one CPU.
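
To make that busy loop concrete, here is a minimal userspace model of the
same pattern -- not kernel code, just a pthread sketch of a daemon spinning
on a flag that can only be cleared while holding a mutex that another,
slower context owns (the names are stand-ins for MD_SB_CHANGE_PENDING and
'reconfig_mutex'; everything below is illustrative only):

/* build: gcc -O2 -pthread busyloop.c -o busyloop */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t reconfig_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int sb_change_pending = 1;   /* stand-in for MD_SB_CHANGE_PENDING */

/* stand-in for md_check_recovery(): the flag can only be cleared while
 * holding reconfig_mutex, so a failed trylock means no progress at all */
static void check_recovery(void)
{
	if (pthread_mutex_trylock(&reconfig_mutex) == 0) {
		atomic_store(&sb_change_pending, 0);
		pthread_mutex_unlock(&reconfig_mutex);
	}
}

/* stand-in for raid5d(): spins, burning a CPU, until the flag is cleared */
static void *daemon_thread(void *arg)
{
	unsigned long spins = 0;

	(void)arg;
	while (atomic_load(&sb_change_pending)) {
		check_recovery();
		spins++;
	}
	printf("flag cleared after %lu spins\n", spins);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_mutex_lock(&reconfig_mutex);    /* "other context" holds the mutex */
	pthread_create(&t, NULL, daemon_thread, NULL);
	sleep(2);                               /* the slow work done under the mutex */
	pthread_mutex_unlock(&reconfig_mutex);  /* the mddev_unlock() moment */
	pthread_join(t, NULL);
	return 0;
}

Until the mutex is dropped the daemon thread will typically report millions
of wasted iterations -- the same CPU burn raid5d() shows here, except that
in the kernel the daemon is also sitting on its blk_plug, which is what
turns the wasted CPU into the reported hang.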

Thanks,
Kuai

>>>
>>>>
>>>> Looking at raid1/raid10, there are two different behaviours that seem to
>>>> avoid this problem as well:
>>>>
>>>> 1) blk_start_plug() is delayed until all failed IO is handled. This looks
>>>> reasonable because, in order to get better performance, IO should be
>>>> handled by the submitting thread as much as possible, and meanwhile, the
>>>> deadlock can be triggered here.
>>>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), skip
>>>> the handling of failed IO, and when mddev_unlock() is called, the daemon
>>>> thread will be woken up again to handle the failed IO.
>>>>
>>>> How about the following patch?
>>>>
>>>> Thanks,
>>>> Kuai
>>>>
>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>>>> --- a/drivers/md/raid5.c
>>>> +++ b/drivers/md/raid5.c
>>>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>>>
>>>>         md_check_recovery(mddev);
>>>>
>>>> -       blk_start_plug(&plug);
>>>>         handled = 0;
>>>>         spin_lock_irq(&conf->device_lock);
>>>>         while (1) {
>>>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>>>                 int batch_size, released;
>>>>                 unsigned int offset;
>>>>
>>>> +               /*
>>>> +                * md_check_recovery() can't clear sb_flags, usually because of
>>>> +                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
>>>> +                * wake up raid5d().
>>>> +                */
>>>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>>>> +                       goto skip;
>>>> +
>>>>                 released = release_stripe_list(conf, 
>>>> conf->temp_inactive_list);
>>>>                 if (released)
>>>>                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
>>>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>>>                          spin_lock_irq(&conf->device_lock);
>>>>                 }
>>>>         }
>>>> +skip:
>>>>         pr_debug("%d stripes handled\n", handled);
>>>> -
>>>>         spin_unlock_irq(&conf->device_lock);
>>>>         if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>>>             mutex_trylock(&conf->cache_size_mutex)) {
>>>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>>>                 mutex_unlock(&conf->cache_size_mutex);
>>>>         }
>>>>
>>>> +       blk_start_plug(&plug);
>>>>         flush_deferred_bios(conf);
>>>>
>>>>         r5l_flush_stripe_to_raid(conf->log);
>>>
>>> This patch eliminates the benefit of blk_plug; I think it will not be 
>>> good from an IO performance perspective?
>>
>> There is only one daemon thread, so IO should be handled here as little
>> as possible. The IO should be handled by the thread that is submitting
>> the IO, and the daemon should be left to handle the cases where IO failed
>> or couldn't be submitted at that time.
> 
> I am not sure how much impact dropping the blk_plug will have.
> 
> Song, what's your take on this?
> 
> Thanks,
> 
> Junxiao.
> 
>>
>> Thanks,
>> Kuai
>>
>>>
>>>
>>> Thanks,
>>>
>>> Junxiao.
>>>
>>>>
>>>>>
>>>>> https://www.spinics.net/lists/raid/msg75338.html
>>>>>
>>>>> Dan, can you try the following patch?
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index de771093b526..474462abfbdc 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, 
>>>>> bool from_schedule)
>>>>>          if (unlikely(!rq_list_empty(plug->cached_rq)))
>>>>>                  blk_mq_free_plug_rqs(plug);
>>>>>   }
>>>>> +EXPORT_SYMBOL(__blk_flush_plug);
>>>>>
>>>>>   /**
>>>>>    * blk_finish_plug - mark the end of a batch of submitted I/O
>>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>>> index 8497880135ee..26e09cdf46a3 100644
>>>>> --- a/drivers/md/raid5.c
>>>>> +++ b/drivers/md/raid5.c
>>>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread)
>>>>>                          spin_unlock_irq(&conf->device_lock);
>>>>>                          md_check_recovery(mddev);
>>>>>                          spin_lock_irq(&conf->device_lock);
>>>>> +               } else {
>>>>> +                       spin_unlock_irq(&conf->device_lock);
>>>>> +                       blk_flush_plug(&plug, false);
>>>>> +                       cond_resched();
>>>>> +                       spin_lock_irq(&conf->device_lock);
>>>>>                  }
>>>>>          }
>>>>>          pr_debug("%d stripes handled\n", handled);
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Junxiao.
>>>>>
>>>>> On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote:
>>>>>> Hi Dan & Song,
>>>>>>
>>>>>> I have not root-caused this yet, but would like to share some findings 
>>>>>> from the vmcore Dan shared. From what I can see, this doesn't look 
>>>>>> like an md issue, but something wrong with the block layer or below.
>>>>>>
>>>>>> 1. There were multiple processes hung on IO for over 15 minutes.
>>>>>>
>>>>>> crash> ps -m | grep UN
>>>>>> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0 CPU: 1 
>>>>>> COMMAND: "jbd2/dm-3-8"
>>>>>> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0 CPU: 2 
>>>>>> COMMAND: "dd"
>>>>>> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00 CPU: 3 
>>>>>> COMMAND: "md0_reclaim"
>>>>>> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740 CPU: 1 
>>>>>> COMMAND: "kworker/1:2"
>>>>>> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80 CPU: 7 
>>>>>> COMMAND: "kworker/u21:1"
>>>>>>
>>>>>> 2. Let's pick md0_reclaim to take a look; it is waiting for a 
>>>>>> super_block update to complete. We can see there were two pending 
>>>>>> superblock writes and other pending IO for the underlying physical 
>>>>>> disks, which caused these processes to hang.
>>>>>>
>>>>>> crash> bt 876
>>>>>> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: 
>>>>>> "md0_reclaim"
>>>>>>  #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>>>>>>  #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>>>>>>  #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>>>>>>  #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>>>>>>  #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>>>>>>  #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>>>>>>  #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>>>>>>  #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>>>>>>  #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
>>>>>>
>>>>>> crash> mddev.pending_writes,disks 0xffff888108335800
>>>>>>   pending_writes = {
>>>>>>     counter = 2  <<<<<<<<<< 2 active super block write
>>>>>>   },
>>>>>>   disks = {
>>>>>>     next = 0xffff88810ce85a00,
>>>>>>     prev = 0xffff88810ce84c00
>>>>>>   },
>>>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
>>>>>> 0xffff88810ce85a00
>>>>>> ffff88810ce85a00
>>>>>>   kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>>>>>>   nr_pending = {
>>>>>>     counter = 0
>>>>>>   },
>>>>>> ffff8881083ace00
>>>>>>   kobj.name = 0xffff888100a93280 "dev-sde",
>>>>>>   nr_pending = {
>>>>>>     counter = 10 <<<<
>>>>>>   },
>>>>>> ffff8881010ad200
>>>>>>   kobj.name = 0xffff8881012721c8 "dev-sdc",
>>>>>>   nr_pending = {
>>>>>>     counter = 8 <<<<<
>>>>>>   },
>>>>>> ffff88810ce84c00
>>>>>>   kobj.name = 0xffff888100325f08 "dev-sdd",
>>>>>>   nr_pending = {
>>>>>>     counter = 2 <<<<<
>>>>>>   },
>>>>>>
>>>>>> 3. From the block layer, I can find the inflight IO for the md 
>>>>>> superblock write, which has been pending for 955s; that matches the 
>>>>>> hang time of "md0_reclaim".
>>>>>>
>>>>>> crash> 
>>>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
>>>>>> ffff888103b4c300
>>>>>>   q = 0xffff888103a00d80,
>>>>>>   mq_hctx = 0xffff888103c5d200,
>>>>>>   cmd_flags = 38913,
>>>>>>   rq_flags = 139408,
>>>>>>   start_time_ns = 1504179024146,
>>>>>>   bio = 0x0,
>>>>>>   biotail = 0xffff888120758e40,
>>>>>>   state = MQ_RQ_COMPLETE,
>>>>>>   __data_len = 0,
>>>>>>   flush = {
>>>>>>     seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA
>>>>>>     saved_end_io = 0x0
>>>>>>   },
>>>>>>   end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
>>>>>>
>>>>>> crash> p tk_core.timekeeper.tkr_mono.base
>>>>>> $1 = 2459916243002
>>>>>> crash> eval 2459916243002-1504179024146
>>>>>> hexadecimal: de86609f28
>>>>>>     decimal: 955737218856  <<<<<<< IO pending time is 955s
>>>>>>       octal: 15720630117450
>>>>>>      binary: 
>>>>>> 0000000000000000000000001101111010000110011000001001111100101000
>>>>>>
>>>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>>>>>>   bi_iter = {
>>>>>>     bi_sector = 8, <<<< super block offset
>>>>>>     bi_size = 0,
>>>>>>     bi_idx = 0,
>>>>>>     bi_bvec_done = 0
>>>>>>   },
>>>>>>   bi_end_io = 0xffffffff817dca50 <super_written>,
>>>>>> crash> dev -d | grep ffff888103a00d80
>>>>>>     8 ffff8881003ab000   sdd        ffff888103a00d80 0 0 0
>>>>>>
>>>>>> 4. Check the above request: even though its state is "MQ_RQ_COMPLETE", it 
>>>>>> is still pending. That's because each md superblock write was 
>>>>>> marked with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 
>>>>>> steps: pre_flush, data, and post_flush. As each step completes, 
>>>>>> it is marked in "request.flush.seq"; here the value is 3, 
>>>>>> which is REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA, so the last step 
>>>>>> "post_flush" has not been done. Another weird thing is that 
>>>>>> blk_flush_queue.flush_data_in_flight is still 1 even though the 
>>>>>> "data" step is already done.
>>>>>>
>>>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>>>>>>   fq = 0xffff88810332e240,
>>>>>> crash> blk_flush_queue 0xffff88810332e240
>>>>>> struct blk_flush_queue {
>>>>>>   mq_flush_lock = {
>>>>>>     {
>>>>>>       rlock = {
>>>>>>         raw_lock = {
>>>>>>           {
>>>>>>             val = {
>>>>>>               counter = 0
>>>>>>             },
>>>>>>             {
>>>>>>               locked = 0 '\000',
>>>>>>               pending = 0 '\000'
>>>>>>             },
>>>>>>             {
>>>>>>               locked_pending = 0,
>>>>>>               tail = 0
>>>>>>             }
>>>>>>           }
>>>>>>         }
>>>>>>       }
>>>>>>     }
>>>>>>   },
>>>>>>   flush_pending_idx = 1,
>>>>>>   flush_running_idx = 1,
>>>>>>   rq_status = 0 '\000',
>>>>>>   flush_pending_since = 4296171408,
>>>>>>   flush_queue = {{
>>>>>>       next = 0xffff88810332e250,
>>>>>>       prev = 0xffff88810332e250
>>>>>>     }, {
>>>>>>       next = 0xffff888103b4c348, <<<< the request is in this list
>>>>>>       prev = 0xffff888103b4c348
>>>>>>     }},
>>>>>>   flush_data_in_flight = 1,  >>>>>> still 1
>>>>>>   flush_rq = 0xffff888103c2e000
>>>>>> }
>>>>>>
>>>>>> crash> list 0xffff888103b4c348
>>>>>> ffff888103b4c348
>>>>>> ffff88810332e260
>>>>>>
>>>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of 
>>>>>> hw queue
>>>>>>   tag = -1,
>>>>>>   state = MQ_RQ_IDLE,
>>>>>>   ref = {
>>>>>>     counter = 0
>>>>>>   },
>>>>>>
>>>>>> 5. It looks like the block layer or the underlying layer (scsi/virtio-scsi) 
>>>>>> may have some issue which is leaving the IO request from the md layer 
>>>>>> stuck in a partially completed state. I can't see how this can be 
>>>>>> related to commit bed9e27baf52 ("Revert "md/raid5: Wait for 
>>>>>> MD_SB_CHANGE_PENDING in raid5d"")
>>>>>>
>>>>>>
>>>>>> Dan,
>>>>>>
>>>>>> Are you able to reproduce this using a regular scsi disk? I would like 
>>>>>> to rule out whether this is related to virtio-scsi.
>>>>>>
>>>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
>>>>>> official mainline v6.8-rc5 without any other patches?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Junxiao.
>>>>>>
>>>>>> On 2/23/24 6:13 PM, Song Liu wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <regressions@leemhuis.info> wrote:
>>>>>>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>>>>>>> Just a friendly reminder that this regression still exists on the
>>>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>>>>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>>>>>>> Song Liu, what's the status here? I'm aware that you have dealt with 
>>>>>>>> quite a few regressions recently, but it seems like resolving this one 
>>>>>>>> is stalled. Or were you able to reproduce the issue or make some 
>>>>>>>> progress and I just missed it?
>>>>>>> Sorry for the delay with this issue. I have been occupied with some
>>>>>>> other stuff this week.
>>>>>>>
>>>>>>> I haven't had any luck reproducing this issue. I will spend more 
>>>>>>> time looking into it next week.
>>>>>>>
>>>>>>>> And if not, what's the way forward here wrt the release of 6.8?
>>>>>>>> Revert the culprit and try again later? Or is that not an option 
>>>>>>>> for one
>>>>>>>> reason or another?
>>>>>>> If we don't make progress with it in the next week, we will do 
>>>>>>> the revert,
>>>>>>> same as we did with stable kernels.
>>>>>>>
>>>>>>>> Or do we assume that this is not a real issue? That it's caused 
>>>>>>>> by some
>>>>>>>> oddity (bit-flip in the metadata or something like that?) only 
>>>>>>>> to be
>>>>>>>> found in Dan's setup?
>>>>>>> I don't think this is because of oddities. Hopefully we can get more
>>>>>>> information about this soon.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Song
>>>>>>>
>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression 
>>>>>>>> tracker' hat)
>>>>>>>> -- 
>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>> If I did something stupid, please tell me, as explained on that 
>>>>>>>> page.
>>>>>>>>
>>>>>>>> #regzbot poke
>>>>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
  2024-03-15  1:17               ` Yu Kuai
@ 2024-03-19 14:16                 ` Dan Moulding
  0 siblings, 0 replies; 53+ messages in thread
From: Dan Moulding @ 2024-03-19 14:16 UTC (permalink / raw)
  To: yukuai1
  Cc: dan, gregkh, junxiao.bi, linux-kernel, linux-raid, regressions,
	song, stable, yukuai3

> Thanks a lot for the testing! Can you also give the following patch a try?
> It removes the change to blk_plug; because Dan and Song are worried
> about performance degradation, we need to verify the performance
> before considering that patch.
> 
> Anyway, I think the following patch can fix this problem as well.
> 
> Thanks,
> Kuai
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..ae8665be9940 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread)
>                  int batch_size, released;
>                  unsigned int offset;
> 
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                  released = release_stripe_list(conf, 
> conf->temp_inactive_list);
>                  if (released)
>                          clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread)
>                          spin_lock_irq(&conf->device_lock);
>                  }
>          }
> +skip:
>          pr_debug("%d stripes handled\n", handled);
> 
>          spin_unlock_irq(&conf->device_lock);

Yes, this patch also seems to work. I cannot reproduce the problem on
6.8-rc7 or 6.8.1 with just this one applied.

Cheers!

-- Dan

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2024-03-19 14:16 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
2024-01-23  1:08 ` Song Liu
2024-01-23  1:35 ` Dan Moulding
2024-01-23  6:35   ` Song Liu
2024-01-23 21:53     ` Dan Moulding
2024-01-23 22:21       ` Song Liu
2024-01-23 23:58         ` Dan Moulding
2024-01-25  0:01           ` Song Liu
2024-01-25 16:44             ` junxiao.bi
2024-01-25 19:40               ` Song Liu
2024-01-25 20:31               ` Dan Moulding
2024-01-26  3:30                 ` Carlos Carvalho
2024-01-26 15:46                   ` Dan Moulding
2024-01-30 16:26                     ` Blazej Kucman
2024-01-30 20:21                       ` Song Liu
2024-01-31  1:26                       ` Song Liu
2024-01-31  2:13                         ` Yu Kuai
2024-01-31  2:41                       ` Yu Kuai
2024-01-31  4:55                         ` Song Liu
2024-01-31 13:36                           ` Blazej Kucman
2024-02-01  1:39                             ` Yu Kuai
2024-01-26 16:21                   ` Roman Mamedov
2024-01-31 17:37                 ` junxiao.bi
2024-02-06  8:07                 ` Song Liu
2024-02-06 20:56                   ` Dan Moulding
2024-02-06 21:34                     ` Song Liu
2024-02-20 23:06 ` Dan Moulding
2024-02-20 23:15   ` junxiao.bi
2024-02-21 14:50     ` Mateusz Kusiak
2024-02-21 19:15       ` junxiao.bi
2024-02-23 17:44     ` Dan Moulding
2024-02-23 19:18       ` junxiao.bi
2024-02-23 20:22         ` Dan Moulding
2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-24  2:13     ` Song Liu
2024-03-01 20:26       ` junxiao.bi
2024-03-01 23:12         ` Dan Moulding
2024-03-02  0:05           ` Song Liu
2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
2024-03-06 17:13               ` Song Liu
2024-03-02 16:55         ` Dan Moulding
2024-03-07  3:34         ` Yu Kuai
2024-03-08 23:49         ` junxiao.bi
2024-03-10  5:13           ` Dan Moulding
2024-03-11  1:50           ` Yu Kuai
2024-03-12 22:56             ` junxiao.bi
2024-03-13  1:20               ` Yu Kuai
2024-03-14 18:20                 ` junxiao.bi
2024-03-14 22:36                   ` Song Liu
2024-03-15  1:30                   ` Yu Kuai
2024-03-14 16:12             ` Dan Moulding
2024-03-15  1:17               ` Yu Kuai
2024-03-19 14:16                 ` Dan Moulding

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).