All of lore.kernel.org
 help / color / mirror / Atom feed
From: Song Liu <song@kernel.org>
To: Blazej Kucman <blazej.kucman@linux.intel.com>
Cc: Dan Moulding <dan@danm.net>,
	carlos@fisica.ufpr.br, gregkh@linuxfoundation.org,
	 junxiao.bi@oracle.com, linux-kernel@vger.kernel.org,
	 linux-raid@vger.kernel.org, regressions@lists.linux.dev,
	 stable@vger.kernel.org, yukuai1@huaweicloud.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected
Date: Tue, 30 Jan 2024 12:21:16 -0800	[thread overview]
Message-ID: <CAPhsuW7jtQP+qsWYE4Nt3bA7P4SvUjGr9L0orrPY5rYWFW=jPw@mail.gmail.com> (raw)
In-Reply-To: <20240130172524.0000417b@linux.intel.com>

Hi Blazej,

On Tue, Jan 30, 2024 at 8:27 AM Blazej Kucman
<blazej.kucman@linux.intel.com> wrote:
>
> Hi,
>
> On Fri, 26 Jan 2024 08:46:10 -0700
> Dan Moulding <dan@danm.net> wrote:
> >
> > That's a good suggestion, so I switched it to use XFS. It can still
> > reproduce the hang. Sounds like this is probably a different problem
> > than the known ext4 one.
> >
>
> Our daily tests directed at mdadm/md also detected a problem with
> identical symptoms as described in the thread.
>
> Issue detected with IMSM metadata but it also reproduces with native
> metadata.
> NVMe disks under VMD controller were used.
>
> Scenario:
> 1. Create raid10:
> mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128
> --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1
> --size=7864320 --run
> 2. Create FS
> mkfs.ext4 /dev/md/r10d4s128-15_A
> 3. Set faulty one raid member:
> mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
> 4. Stop raid devies:
> mdadm -Ss

Thanks for the report. I can reproduce the issue locally.

The revert [1] cannot fix this one, because the revert is for raid5 (and
the repro is on raid10). I will look into this.

Thanks again!

Song


[1] https://lore.kernel.org/linux-raid/20240125082131.788600-1-song@kernel.org/


>
> Expected result:
> The raid stops without kernel hangs and errors.
>
> Actual result:
> command "mdadm -Ss" hangs,
> hung_task occurs in OS.
>
> [   62.770472] md: resync of RAID array md127
> [  140.893329] md: md127: resync done.
> [  204.100490] md/raid10:md127: Disk failure on nvme3n1, disabling
> device. md/raid10:md127: Operation continuing on 3 devices.
> [  244.625393] INFO: task kworker/48:1:755 blocked for more than 30
> seconds. [  244.632294]       Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [  244.640157] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.648105] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755
>   ppid:2      flags:0x00004000 [  244.657552] Workqueue: md_misc
> md_start_sync [md_mod] [  244.662688] Call Trace: [  244.665176]  <TASK>
> [  244.667316]  __schedule+0x2f0/0x9c0
> [  244.670868]  ? sched_clock+0x10/0x20
> [  244.674510]  schedule+0x28/0x90
> [  244.677703]  mddev_suspend+0x11d/0x1e0 [md_mod]
> [  244.682313]  ? __update_idle_core+0x29/0xc0
> [  244.686574]  ? swake_up_all+0xe0/0xe0
> [  244.690302]  md_start_sync+0x3c/0x280 [md_mod]
> [  244.694825]  process_scheduled_works+0x87/0x320
> [  244.699427]  worker_thread+0x147/0x2a0
> [  244.703237]  ? rescuer_thread+0x2d0/0x2d0
> [  244.707313]  kthread+0xe5/0x120
> [  244.710504]  ? kthread_complete_and_exit+0x20/0x20
> [  244.715370]  ret_from_fork+0x31/0x40
> [  244.719007]  ? kthread_complete_and_exit+0x20/0x20
> [  244.723879]  ret_from_fork_asm+0x11/0x20
> [  244.727872]  </TASK>
> [  244.730117] INFO: task mdadm:8457 blocked for more than 30 seconds.
> [  244.736486]       Tainted: G S
> 6.8.0-rc1-20240129.intel.13479453+ #1 [  244.744345] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 244.752293] task:mdadm           state:D stack:13512 pid:8457
> tgid:8457  ppid:8276   flags:0x00000000 [  244.761736] Call Trace: [
> 244.764241]  <TASK> [  244.766389]  __schedule+0x2f0/0x9c0
> [  244.773224]  schedule+0x28/0x90
> [  244.779690]  stop_sync_thread+0xfa/0x170 [md_mod]
> [  244.787737]  ? swake_up_all+0xe0/0xe0
> [  244.794705]  do_md_stop+0x51/0x4c0 [md_mod]
> [  244.802166]  md_ioctl+0x59d/0x10a0 [md_mod]
> [  244.809567]  blkdev_ioctl+0x1bb/0x270
> [  244.816417]  __x64_sys_ioctl+0x7a/0xb0
> [  244.823720]  do_syscall_64+0x4e/0x110
> [  244.830481]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
> [  244.838700] RIP: 0033:0x7f2c540c97cb
> [  244.845457] RSP: 002b:00007fff4ad6a8f8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010 [  244.856265] RAX: ffffffffffffffda RBX:
> 0000000000000003 RCX: 00007f2c540c97cb [  244.866659] RDX:
> 0000000000000000 RSI: 0000000000000932 RDI: 0000000000000003 [
> 244.877031] RBP: 0000000000000019 R08: 0000000000200000 R09:
> 00007fff4ad6a4c5 [  244.887382] R10: 0000000000000000 R11:
> 0000000000000246 R12: 00007fff4ad6a9c0 [  244.897723] R13:
> 00007fff4ad6a9a0 R14: 000055724d0990e0 R15: 000055724efaa780 [
> 244.908018]  </TASK> [  275.345375] INFO: task kworker/48:1:755 blocked
> for more than 60 seconds. [  275.355363]       Tainted: G S
>     6.8.0-rc1-20240129.intel.13479453+ #1 [  275.366306] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message. [
> 275.377334] task:kworker/48:1    state:D stack:14592 pid:755   tgid:755
>   ppid:2      flags:0x00004000 [  275.389863] Workqueue: md_misc
> md_start_sync [md_mod] [  275.398102] Call Trace: [  275.403673]  <TASK>
>
>
> Also reproduces with XFS FS, does not reproduce when there is no FS on
> RAID.
>
> Repository used for testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
> Branch: master
>
> Last working build: kernel branch HEAD: acc657692aed ("keys, dns: Fix
> size check of V1 server-list header")
>
> I see one merge commit touching md after the above one:
> 01d550f0fcc0 ("Merge tag 'for-6.8/block-2024-01-08' of
> git://git.kernel.dk/linux")
>
> I hope these additional logs will help find the cause.
>
> Thanks,
> Blazej
>

  reply	other threads:[~2024-01-30 20:21 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-23  0:56 [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected Dan Moulding
2024-01-23  1:08 ` Song Liu
2024-01-23  1:35 ` Dan Moulding
2024-01-23  6:35   ` Song Liu
2024-01-23 21:53     ` Dan Moulding
2024-01-23 22:21       ` Song Liu
2024-01-23 23:58         ` Dan Moulding
2024-01-25  0:01           ` Song Liu
2024-01-25 16:44             ` junxiao.bi
2024-01-25 19:40               ` Song Liu
2024-01-25 20:31               ` Dan Moulding
2024-01-26  3:30                 ` Carlos Carvalho
2024-01-26 15:46                   ` Dan Moulding
2024-01-30 16:26                     ` Blazej Kucman
2024-01-30 20:21                       ` Song Liu [this message]
2024-01-31  1:26                       ` Song Liu
2024-01-31  2:13                         ` Yu Kuai
2024-01-31  2:41                       ` Yu Kuai
2024-01-31  4:55                         ` Song Liu
2024-01-31 13:36                           ` Blazej Kucman
2024-02-01  1:39                             ` Yu Kuai
2024-01-26 16:21                   ` Roman Mamedov
2024-01-31 17:37                 ` junxiao.bi
2024-02-06  8:07                 ` Song Liu
2024-02-06 20:56                   ` Dan Moulding
2024-02-06 21:34                     ` Song Liu
2024-02-20 23:06 ` Dan Moulding
2024-02-20 23:15   ` junxiao.bi
2024-02-21 14:50     ` Mateusz Kusiak
2024-02-21 19:15       ` junxiao.bi
2024-02-23 17:44     ` Dan Moulding
2024-02-23 19:18       ` junxiao.bi
2024-02-23 20:22         ` Dan Moulding
2024-02-23  8:07   ` Linux regression tracking (Thorsten Leemhuis)
2024-02-24  2:13     ` Song Liu
2024-02-25 17:46       ` Thomas B. Clark
2024-02-26  1:17         ` Thomas B. Clark
2024-02-26 17:35           ` Song Liu
2024-03-01 20:26       ` junxiao.bi
2024-03-01 23:12         ` Dan Moulding
2024-03-02  0:05           ` Song Liu
2024-03-06  8:38             ` Linux regression tracking (Thorsten Leemhuis)
2024-03-06 17:13               ` Song Liu
2024-03-02 16:55         ` Dan Moulding
2024-03-07  3:34         ` Yu Kuai
2024-03-08 23:49         ` junxiao.bi
2024-03-10  5:13           ` Dan Moulding
2024-03-11  1:50           ` Yu Kuai
2024-03-12 22:56             ` junxiao.bi
2024-03-13  1:20               ` Yu Kuai
2024-03-14 18:20                 ` junxiao.bi
2024-03-14 22:36                   ` Song Liu
2024-03-15  1:30                   ` Yu Kuai
2024-03-14 16:12             ` Dan Moulding
2024-03-15  1:17               ` Yu Kuai
2024-03-19 14:16                 ` Dan Moulding

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPhsuW7jtQP+qsWYE4Nt3bA7P4SvUjGr9L0orrPY5rYWFW=jPw@mail.gmail.com' \
    --to=song@kernel.org \
    --cc=blazej.kucman@linux.intel.com \
    --cc=carlos@fisica.ufpr.br \
    --cc=dan@danm.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=junxiao.bi@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=regressions@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=yukuai1@huaweicloud.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.