* md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
@ 2020-11-28 12:25 Donald Buczek
  2020-11-30  2:06 ` Guoqing Jiang
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2020-11-28 12:25 UTC (permalink / raw)
To: Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the mdcheck cron job trying to pause the ongoing check operation in the morning with

echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
UID        PID  PPID  C STIME TTY          TIME CMD
root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
[<0>] kthread_stop+0x6e/0x150
[<0>] md_unregister_thread+0x3e/0x70
[<0>] md_reap_sync_thread+0x1f/0x1e0
[<0>] action_store+0x141/0x2b0
[<0>] md_attr_store+0x71/0xb0
[<0>] kernfs_fop_write+0x113/0x1a0
[<0>] vfs_write+0xbc/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note that md0 has been paused successfully just before.

2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
[ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2376 root      20   0       0      0      0 R 100.0   0.0   1387:38 md1_raid6

Also, all processes doing I/O to the md device block.

This is /proc/mdstat:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk
md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process:

https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea what might be going on?

Best
  Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek @ 2020-11-30 2:06 ` Guoqing Jiang 2020-12-01 9:29 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2020-11-30 2:06 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 11/28/20 13:25, Donald Buczek wrote: > Dear Linux mdraid people, > > we are using raid6 on several servers. Occasionally we had failures, > where a mdX_raid6 process seems to go into a busy loop and all I/O to > the md device blocks. We've seen this on various kernel versions. > > The last time this happened (in this case with Linux 5.10.0-rc4), I took > some data. > > The triggering event seems to be the md_check cron job trying to pause > the ongoing check operation in the morning with > > echo idle > /sys/devices/virtual/block/md1/md/sync_action > > This doesn't complete. Here's /proc/stack of this process: > > root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333 > UID PID PPID C STIME TTY TIME CMD > root 23333 23331 0 02:00 ? 00:00:00 /bin/bash > /usr/bin/mdcheck --continue --duration 06:00 > root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat > /proc/23333/stack > [<0>] kthread_stop+0x6e/0x150 > [<0>] md_unregister_thread+0x3e/0x70 > [<0>] md_reap_sync_thread+0x1f/0x1e0 > [<0>] action_store+0x141/0x2b0 > [<0>] md_attr_store+0x71/0xb0 > [<0>] kernfs_fop_write+0x113/0x1a0 > [<0>] vfs_write+0xbc/0x250 > [<0>] ksys_write+0xa1/0xe0 > [<0>] do_syscall_64+0x33/0x40 > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Note, that md0 has been paused successfully just before. What is the personality of md0? Is it also raid6? 
> > 2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD > (/usr/bin/mdcheck --continue --duration "06:00") > 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking > /dev/md0 from 10623180920 > 2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: > data-check of RAID array md0 > 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking > /dev/md1 from 11582849320 > 2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: > data-check of RAID array md1 > 2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: > md0: data-check interrupted. > 2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at > 13351127680 > 2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: > md1: data-check interrupted. > [ nothing related following.... ] > > After that, we see md1_raid6 in a busy loop: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > 2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 > md1_raid6 Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again. > > Also, all processes doing I/O do the md device block. > > This is /proc/mdstat: > > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] > [multipath] > md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] > sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [==================>..] 
check = 94.0%
> (7350290348/7813894144) finish=57189.3min speed=135K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>

So the RECOVERY_CHECK flag should be set; not sure if the simple change helps, but you may give it a try.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f..e2697d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
 			md_update_sb(mddev, 0);
 
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
 			/* resync/recovery still happening */
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;

Thanks,
Guoqing

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2020-11-30 2:06 ` Guoqing Jiang @ 2020-12-01 9:29 ` Donald Buczek 2020-12-02 17:28 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2020-12-01 9:29 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Am 30.11.20 um 03:06 schrieb Guoqing Jiang: > > > On 11/28/20 13:25, Donald Buczek wrote: >> Dear Linux mdraid people, >> >> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions. >> >> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data. >> >> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with >> >> echo idle > /sys/devices/virtual/block/md1/md/sync_action >> >> This doesn't complete. Here's /proc/stack of this process: >> >> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333 >> UID PID PPID C STIME TTY TIME CMD >> root 23333 23331 0 02:00 ? 00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00 >> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack >> [<0>] kthread_stop+0x6e/0x150 >> [<0>] md_unregister_thread+0x3e/0x70 >> [<0>] md_reap_sync_thread+0x1f/0x1e0 >> [<0>] action_store+0x141/0x2b0 >> [<0>] md_attr_store+0x71/0xb0 >> [<0>] kernfs_fop_write+0x113/0x1a0 >> [<0>] vfs_write+0xbc/0x250 >> [<0>] ksys_write+0xa1/0xe0 >> [<0>] do_syscall_64+0x33/0x40 >> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> Note, that md0 has been paused successfully just before. > > What is the personality of md0? Is it also raid6? Yes. 
> >> >> 2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00") >> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920 >> 2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0 >> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320 >> 2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1 >> 2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted. >> 2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680 >> 2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted. >> [ nothing related following.... ] >> >> After that, we see md1_raid6 in a busy loop: >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 md1_raid6 > > Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again. > >> >> Also, all processes doing I/O do the md device block. >> >> This is /proc/mdstat: >> >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [==================>..] 
check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> bitmap: 0/59 pages [0KB], 65536KB chunk >> > > So the RECOVERY_CHECK flag should be set, not sure if the simple changes > helps, but you may give it a try. Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often. Best Donald > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 98bac4f..e2697d0 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev) > md_update_sb(mddev, 0); > > if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && > - !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { > + (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) || > + test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) { > /* resync/recovery still happening */ > clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > goto unlock; > > Thanks, > Guoqing -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-01  9:29 ` Donald Buczek
@ 2020-12-02 17:28 ` Donald Buczek
  2020-12-03  1:55 ` Guoqing Jiang
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2020-12-02 17:28 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;
 		}
+		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
+			/* resync/recovery still happening */
+			pr_info("md: XXX BUGFIX applied\n");
+			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			goto unlock;
+		}
 		if (mddev->sync_thread) {
 			md_reap_sync_thread(mddev);
 			goto unlock;

With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk
md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk
md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
root       134     2  0 Nov30 ?        00:00:00 [md]
root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
[<0>] md_bitmap_cond_end_sync+0x12d/0x170
[<0>] raid5_sync_request+0x24b/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
Donald On 01.12.20 10:29, Donald Buczek wrote: > Am 30.11.20 um 03:06 schrieb Guoqing Jiang: >> >> >> On 11/28/20 13:25, Donald Buczek wrote: >>> Dear Linux mdraid people, >>> >>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions. >>> >>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data. >>> >>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with >>> >>> echo idle > /sys/devices/virtual/block/md1/md/sync_action >>> >>> This doesn't complete. Here's /proc/stack of this process: >>> >>> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333 >>> UID PID PPID C STIME TTY TIME CMD >>> root 23333 23331 0 02:00 ? 00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00 >>> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack >>> [<0>] kthread_stop+0x6e/0x150 >>> [<0>] md_unregister_thread+0x3e/0x70 >>> [<0>] md_reap_sync_thread+0x1f/0x1e0 >>> [<0>] action_store+0x141/0x2b0 >>> [<0>] md_attr_store+0x71/0xb0 >>> [<0>] kernfs_fop_write+0x113/0x1a0 >>> [<0>] vfs_write+0xbc/0x250 >>> [<0>] ksys_write+0xa1/0xe0 >>> [<0>] do_syscall_64+0x33/0x40 >>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> >>> Note, that md0 has been paused successfully just before. >> >> What is the personality of md0? Is it also raid6? > > Yes. 
> >> >>> >>> 2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00") >>> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920 >>> 2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0 >>> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320 >>> 2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1 >>> 2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted. >>> 2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680 >>> 2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted. >>> [ nothing related following.... ] >>> >>> After that, we see md1_raid6 in a busy loop: >>> >>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>> 2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 md1_raid6 >> >> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again. >> >>> >>> Also, all processes doing I/O do the md device block. >>> >>> This is /proc/mdstat: >>> >>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >>> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >>> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >>> [==================>..] 
check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec >>> bitmap: 0/59 pages [0KB], 65536KB chunk >>> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >>> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >>> bitmap: 0/59 pages [0KB], 65536KB chunk >>> >> >> So the RECOVERY_CHECK flag should be set, not sure if the simple changes >> helps, but you may give it a try. > > Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often. > > Best > Donald > >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 98bac4f..e2697d0 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev) >> md_update_sb(mddev, 0); >> >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >> - !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { >> + (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) || >> + test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) { >> /* resync/recovery still happening */ >> clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); >> goto unlock; >> >> Thanks, >> Guoqing > -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-02 17:28 ` Donald Buczek
@ 2020-12-03  1:55 ` Guoqing Jiang
  2020-12-03 11:42 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Guoqing Jiang @ 2020-12-03 1:55 UTC (permalink / raw)
To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> unfortunately the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  			goto unlock;
>  		}

I think you can add the check of RECOVERY_CHECK in the above part instead of adding a new part.

> +		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> +		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
> +			/* resync/recovery still happening */
> +			pr_info("md: XXX BUGFIX applied\n");
> +			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +			goto unlock;
> +		}
>  		if (mddev->sync_thread) {
>  			md_reap_sync_thread(mddev);
>  			goto unlock;
>
> With pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) has locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1
> is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

What is the stack of the process? I guess it is the same as the stack of 23333 in your previous mail, but just to confirm.
> > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] > [multipath] > md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] > sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [=>...................] check = 8.5% (666852112/7813894144) > finish=1271.2min speed=93701K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] > sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] > sdu[2] sdt[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [>....................] check = 0.2% (19510348/7813894144) > finish=253779.6min speed=511K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > > after 1 minute: > > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] > [multipath] > md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] > sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [=>...................] check = 8.6% (674914560/7813894144) > finish=941.1min speed=126418K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] > sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] > sdu[2] sdt[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [>....................] check = 0.2% (19510348/7813894144) > finish=256805.0min speed=505K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > > A data point, I didn't mention in my previous mail, is that the > mdX_resync thread is not gone when the problem occurs: > > buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md > root 134 2 0 Nov30 ? 00:00:00 [md] > root 1321 2 27 Nov30 ? 
12:57:14 [md0_raid6]
> root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
> root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
> root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
> buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
> [<0>] md_bitmap_cond_end_sync+0x12d/0x170
> [<0>] raid5_sync_request+0x24b/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x1f/0x30
>
> I guess, md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);`
> in md-bitmap.c.
>

Could be; if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks between md0 and md1 to find the difference.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-03  1:55 ` Guoqing Jiang
@ 2020-12-03 11:42 ` Donald Buczek
  2020-12-21 12:33 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2020-12-03 11:42 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 12/3/20 2:55 AM, Guoqing Jiang wrote:
> Hi Donald,
>
> On 12/2/20 18:28, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:
>>
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>  			goto unlock;
>>  		}
>
> I think you can add the check of RECOVERY_CHECK in above part instead of add a new part.

I understand that. I did it this way to get a log only when the changed expression makes a difference (original block not triggering and using goto, and modified block triggering), even if both of us make some stupid error with boolean expressions.

>
>> +		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> +		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>> +		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>> +			/* resync/recovery still happening */
>> +			pr_info("md: XXX BUGFIX applied\n");
>> +			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>> +			goto unlock;
>> +		}
>>  		if (mddev->sync_thread) {
>>  			md_reap_sync_thread(mddev);
>>  			goto unlock;
>>
>> With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:
>
> Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

I think it's just random.
There is a slight chance for a deadlock depending on I/O to the device or other system load when stopping the resync. In the last experiment, both md0 and md1 where changed from "check" to "idle" for about 400 times before one of it blocked. OTOH - and I don't understand this at all - it looks like md0 hung again after reboot to a 5.4 kernel at the very first "check" to "idle" transition which seems to contradict the "slight" chance and "once after 400 attempts". > What is the stack of the process? I guess it is same as the stack of 23333 in your previous mail, but just to confirm. Of md0_raid6 thread ? Sorry, didn't take the data before reboot this time, and concentrated on the non-terminating md0_resync thread. >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [=>...................] check = 8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [>....................] check = 0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> >> after 1 minute: >> >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [=>...................] 
check = 8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [>....................] check = 0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> >> A data point, I didn't mention in my previous mail, is that the mdX_resync thread is not gone when the problem occurs: >> >> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md >> root 134 2 0 Nov30 ? 00:00:00 [md] >> root 1321 2 27 Nov30 ? 12:57:14 [md0_raid6] >> root 1454 2 26 Nov30 ? 12:37:23 [md1_raid6] >> root 5845 2 0 16:20 ? 00:00:30 [md0_resync] >> root 5855 2 13 16:20 ? 00:14:11 [md1_resync] >> buczek 9880 9072 0 18:05 pts/0 00:00:00 grep -F [md >> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack >> [<0>] md_bitmap_cond_end_sync+0x12d/0x170 >> [<0>] raid5_sync_request+0x24b/0x390 >> [<0>] md_do_sync+0xb41/0x1030 >> [<0>] md_thread+0x122/0x160 >> [<0>] kthread+0x118/0x130 >> [<0>] ret_from_fork+0x1f/0x30 >> >> I guess, md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c. >> > > Could be, if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe. Ok. Maybe I find a way to add some logging there. > I'd suggest to compare the stacks between md0 and md1 to find the difference. Compare stack of md0_raid6 and md1_raid6? Okay, but when one md is deadlocked, the other is operating just normally. I/O can be done and you can even start and stop a resync on it. So I'd expect the stack of the operating raid to be varying. Thanks. 
Donald > > Thanks, > Guoqing -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2020-12-03 11:42 ` Donald Buczek @ 2020-12-21 12:33 ` Donald Buczek 2021-01-19 11:30 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2020-12-21 12:33 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoging, I think now that this is not an issue for md. I've driven a system into that situation again and have clear indication, that this is a problem of the member block device driver. With md0 in the described errornous state (md0_raid6 busy looping, echo idle > .../sync_action blocked, no progress in mdstat) and md1 operating normally: root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done /sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0 /sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0 /sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0 /sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0 /sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0 
/sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best
  Donald
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-21 12:33 ` Donald Buczek
@ 2021-01-19 11:30 ` Donald Buczek
  2021-01-20 16:33 ` Guoqing Jiang
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2021-01-19 11:30 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear md-raid people,

I've reported a problem in this thread in December:

"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions."

It was clear that this is related to "mdcheck" running, because we found that the command which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I reported that the problem might be caused by a failure of the underlying block device driver, because I had found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread.
[ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]

- The sync thread ("md1_resync") does not finish, because it's blocked at

[<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
[<0>] raid5_sync_request+0x364/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

[1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735

- yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE)

- We have lots of active stripes:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
27534

- handle_stripe() doesn't make progress:

echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

[171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0

The sector numbers repeat after some time.
We have only the following variants of stripe state and "check":

state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
write-pending

gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939

- The system can manually be freed from the deadlock:

When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

I, of course, don't fully understand it yet. Any ideas?

I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with its first disks)

Donald

$L_002: (struct mddev *) 0xffff8898a48e3000 : { private = (struct r5conf *) 0xffff88988695a800 /* --> $L_004 */ pers = (struct md_personality *) 0xffffffff826ee280 <raid6_personality> unit = 9437185 md_minor = 1 disks = {next = 0xffff889886ee8c00, prev = 0xffff889891506000} flags = 0 sb_flags = 6 suspended = 0 active_io = { counter=4 } /* atomic_t */ ro = 0 sysfs_active = 0 gendisk = (struct gendisk *) 0xffff8898bc357800 kobj = { name = (const char *) 0xffffffff8220cef8 "md" entry = {next = 0xffff8898a48e3058, prev = 0xffff8898a48e3058} /* empty */ parent = (struct kobject *) 0xffff8898bc357868 kset = 0x0 ktype = (struct kobj_type *) 0xffffffff826ef140 <md_ktype> sd = (struct kernfs_node *) 0xffff8898b5f65980 kref = { refcount = { refs = { counter=17 } /* atomic_t */ } } state_initialized = 1 state_in_sysfs = 1 state_add_uevent_sent = 0 state_remove_uevent_sent = 0 uevent_suppress = 0 } hold_active = 0
major_version = 1 minor_version = 2 patch_version = 0 persistent = 1 external = 0 metadata_type = [ (8) [0] : 0 '\000' [1] : 0 '\000' [2] : 0 '\000' [3] : 0 '\000' [4] : 0 '\000' [5] : 0 '\000' [6] : 0 '\000' [7] : 0 '\000' [8] : 0 '\000' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' ] chunk_sectors = 1024 ctime = 1607680540 utime = 1610853337 level = 6 layout = 2 clevel = [ (8) [0] : 114 'r' [1] : 97 'a' [2] : 105 'i' [3] : 100 'd' [4] : 54 '6' [5] : 0 '\000' [6] : 0 '\000' [7] : 0 '\000' [8] : 0 '\000' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' ] raid_disks = 16 max_disks = 1920 dev_sectors = 15627788288 array_sectors = 218789036032 external_size = 0 events = 32409 can_decrease_events = 0 uuid = [ (8) [0] : -49 '\317' [1] : 69 'E' [2] : 119 'w' [3] : 62 '>' [4] : -113 '\217' [5] : -107 '\225' [6] : 9 '\t' [7] : -90 '\246' [8] : -2 '\376' [9] : 84 'T' [10] : -12 '\364' [11] : -31 '\341' [12] : -110 '\222' [13] : 24 '\030' [14] : -15 '\361' [15] : 109 'm' ] reshape_position = 18446744073709551615 delta_disks = 0 new_level = 6 new_layout = 2 new_chunk_sectors = 1024 reshape_backwards = 0 thread = (struct md_thread *) 0xffff889887bde080 sync_thread = 0x0 last_sync_action = (char *) 0xffffffff8225e246 "check" curr_resync = 4947256648 curr_resync_completed = 4947016640 resync_mark = 4415850173 resync_mark_cnt = 66402936 curr_mark_cnt = 75262560 resync_max_sectors = 15627788288 resync_mismatches = { counter = 0 } suspend_lo = 0 suspend_hi = 0 sync_speed_min = 0 sync_speed_max = 0 parallel_resync = 0 ok_start_degraded = 0 recovery = 203 recovery_disabled = 16 in_sync = 0 open_mutex = { owner = { counter=0 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff8898a48e31d8, prev = 0xffff8898a48e31d8} /* empty */ } reconfig_mutex = { owner = { 
counter=-131247629823744 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff8898a48e31f8, prev = 0xffff8898a48e31f8} /* empty */ } active = { counter=2 } /* atomic_t */ openers = { counter=1 } /* atomic_t */ changed = 0 degraded = 0 recovery_active = { counter=181360 } /* atomic_t */ recovery_wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3228, prev = 0xffff8898a48e3228} /* empty */ } recovery_cp = 18446744073709551615 resync_min = 4871994088 resync_max = 18446744073709551615 sysfs_state = (struct kernfs_node *) 0xffff8898b5f65380 sysfs_action = (struct kernfs_node *) 0xffff889887d87600 sysfs_completed = (struct kernfs_node *) 0xffff8898ac6f4e80 sysfs_degraded = (struct kernfs_node *) 0xffff8898ac6f4b80 sysfs_level = (struct kernfs_node *) 0xffff8898b5f65700 del_work = { data = { counter=128 } /* atomic_t */ entry = {next = 0xffff8898a48e3280, prev = 0xffff8898a48e3280} /* empty */ func = (work_func_t) 0xffffffff819ad430 <md_start_sync> } lock = {counter=0} /* spinlock_t */ sb_wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffffc9000e4a7a68, prev = 0xffffc9000cf97d30} } pending_writes = { counter=0 } /* atomic_t */ safemode = 0 safemode_delay = 201 safemode_timer = { entry = { next = (struct hlist_node *) 0xdead000000000122 pprev = 0x0 } expires = 4415857239 function = (void (*)(struct timer_list *)) 0xffffffff819a6a40 <md_safemode_timeout> flags = 314572820 } writes_pending = { percpu_count_ptr = 105896157983096 data = (struct percpu_ref_data *) 0xffff8898b340fc80 } sync_checkers = 0 queue = (struct request_queue *) 0xffff8898b65847c8 bitmap = (struct bitmap *) 0xffff889887d84400 bitmap_info = { file = 0x0 offset = 8 space = 0 default_offset = 2 default_space = 6 mutex = { owner = { counter=0 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff8898a48e3350, prev = 
0xffff8898a48e3350} /* empty */ } chunksize = 67108864 daemon_sleep = 5000 max_write_behind = 0 external = 0 nodes = 0 cluster_name = [ (8) [0] : 0 '\000' [1] : 0 '\000' [2] : 0 '\000' [3] : 0 '\000' [4] : 0 '\000' [5] : 0 '\000' [6] : 0 '\000' [7] : 0 '\000' [8] : 0 '\000' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' [17] : 0 '\000' [18] : 0 '\000' [19] : 0 '\000' [20] : 0 '\000' [21] : 0 '\000' [22] : 0 '\000' [23] : 0 '\000' [24] : 0 '\000' [25] : 0 '\000' [26] : 0 '\000' [27] : 0 '\000' [28] : 0 '\000' [29] : 0 '\000' [30] : 0 '\000' [31] : 0 '\000' [32] : 0 '\000' [33] : 0 '\000' [34] : 0 '\000' [35] : 0 '\000' [36] : 0 '\000' [37] : 0 '\000' [38] : 0 '\000' [39] : 0 '\000' [40] : 0 '\000' [41] : 0 '\000' [42] : 0 '\000' [43] : 0 '\000' [44] : 0 '\000' [45] : 0 '\000' [46] : 0 '\000' [47] : 0 '\000' [48] : 0 '\000' [49] : 0 '\000' [50] : 0 '\000' [51] : 0 '\000' [52] : 0 '\000' [53] : 0 '\000' [54] : 0 '\000' [55] : 0 '\000' [56] : 0 '\000' [57] : 0 '\000' [58] : 0 '\000' [59] : 0 '\000' [60] : 0 '\000' [61] : 0 '\000' [62] : 0 '\000' [63] : 0 '\000' ] } max_corr_read_errors = { counter=20 } /* atomic_t */ all_mddevs = {next = 0xffffffff826eef10, prev = 0xffff88810a551bc8} to_remove = 0x0 bio_set = { bio_slab = (struct kmem_cache *) 0xffff889886fbdc00 front_pad = 0 bio_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae300 pool_data = (void *) 0xffff889886fbdc00 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3428, prev = 0xffff8898a48e3428} /* empty */ } } bvec_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae2e0 pool_data = (void *) 0xffff8881078ecb40 alloc = (mempool_alloc_t *) 0xffffffff811c8540 
<mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3470, prev = 0xffff8898a48e3470} /* empty */ } } bio_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bvec_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } rescue_lock = {counter=0} /* spinlock_t */ rescue_list = { head = 0x0 tail = 0x0 } rescue_work = { data = { counter=68719476704 } /* atomic_t */ entry = {next = 0xffff8898a48e3530, prev = 0xffff8898a48e3530} /* empty */ func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue> } rescue_workqueue = 0x0 } sync_set = { bio_slab = (struct kmem_cache *) 0xffff889886fbdc00 front_pad = 0 bio_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae2c0 pool_data = (void *) 0xffff889886fbdc00 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3598, prev = 0xffff8898a48e3598} /* empty */ } } bvec_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae2a0 pool_data = (void *) 0xffff8881078ecb40 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e35e0, prev = 0xffff8898a48e35e0} /* empty */ } } bio_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 
0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bvec_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } rescue_lock = {counter=0} /* spinlock_t */ rescue_list = { head = 0x0 tail = 0x0 } rescue_work = { data = { counter=68719476704 } /* atomic_t */ entry = {next = 0xffff8898a48e36a0, prev = 0xffff8898a48e36a0} /* empty */ func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue> } rescue_workqueue = 0x0 } md_io_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b236af60 pool_data = (void *) 0x28 <fixed_percpu_data+40> alloc = (mempool_alloc_t *) 0xffffffff811c8b50 <mempool_kmalloc> free = (mempool_free_t *) 0xffffffff811c84b0 <mempool_kfree> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e36f8, prev = 0xffff8898a48e36f8} /* empty */ } } flush_bio = 0x0 flush_pending = { counter=0 } /* atomic_t */ start_flush = 121249416090188 last_flush = 121249416090188 flush_work = { data = { counter=2176 } /* atomic_t */ entry = {next = 0xffff8898a48e3730, prev = 0xffff8898a48e3730} /* empty */ func = (work_func_t) 0xffffffff819aca80 <md_submit_flush_data> } event_work = { data = { counter=0 } /* atomic_t */ entry = {next = 0x0000000000000000, prev = 0x0000000000000000} func = 0x0 } serial_info_pool = 0x0 sync_super = 0x0 cluster_info = 0x0 good_device_nr = 0 noio_flag = 0 has_superblocks = true_21_ fail_last_dev = false_21_ serialize_policy = false_21_ } $L_004: (struct r5conf *) 0xffff88988695a800 : { stripe_hashtbl = (struct hlist_head *) 0xffff8898a1cb9000 /* --> $L_006 */ hash_locks = [ (23) [0] : {counter=0} /* spinlock_t */ [1] : {counter=0} /* spinlock_t */ [2] : {counter=0} /* spinlock_t */ [3] : 
{counter=0} /* spinlock_t */ [4] : {counter=0} /* spinlock_t */ [5] : {counter=0} /* spinlock_t */ [6] : {counter=0} /* spinlock_t */ [7] : {counter=0} /* spinlock_t */ ] mddev = (struct mddev *) 0xffff8898a48e3000 chunk_sectors = 1024 level = 6 algorithm = 2 rmw_level = 1 max_degraded = 2 raid_disks = 16 max_nr_stripes = 29816 min_nr_stripes = 256 reshape_progress = 18446744073709551615 reshape_safe = 0 previous_raid_disks = 16 prev_chunk_sectors = 1024 prev_algo = 2 generation = 0 gen_lock = { seqcount = { sequence = 0 } } reshape_checkpoint = 0 min_offset_diff = 0 handle_list = {next = 0xffff888a7893c890, prev = 0xffff8889de3cc7d0} loprio_list = {next = 0xffff88988695a898, prev = 0xffff88988695a898} /* empty */ hold_list = {next = 0xffff88988695a8a8, prev = 0xffff88988695a8a8} /* empty */ delayed_list = {next = 0xffff88988695a8b8, prev = 0xffff88988695a8b8} /* empty */ bitmap_list = {next = 0xffff88988695a8c8, prev = 0xffff88988695a8c8} /* empty */ retry_read_aligned = 0x0 retry_read_offset = 0 retry_read_aligned_list = 0x0 preread_active_stripes = { counter=0 } /* atomic_t */ active_aligned_reads = { counter=0 } /* atomic_t */ pending_full_writes = { counter=0 } /* atomic_t */ bypass_count = 0 bypass_threshold = 1 skip_copy = 0 last_hold = (struct list_head *) 0xffff88a299ea0010 reshape_stripes = { counter=0 } /* atomic_t */ active_name = 0 cache_name = [ (2) [0] : [ (8) [0] : 114 'r' [1] : 97 'a' [2] : 105 'i' [3] : 100 'd' [4] : 54 '6' [5] : 45 '-' [6] : 109 'm' [7] : 100 'd' [8] : 49 '1' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' [17] : 0 '\000' [18] : 0 '\000' [19] : 0 '\000' [20] : 0 '\000' [21] : 0 '\000' [22] : 0 '\000' [23] : 0 '\000' [24] : 0 '\000' [25] : 0 '\000' [26] : 0 '\000' [27] : 0 '\000' [28] : 0 '\000' [29] : 0 '\000' [30] : 0 '\000' [31] : 0 '\000' ] [1] : [ (8) [0] : 114 'r' [1] : 97 'a' [2] : 105 'i' [3] : 100 'd' [4] : 54 '6' [5] : 45 '-' [6] : 109 'm' [7] 
: 100 'd' [8] : 49 '1' [9] : 45 '-' [10] : 97 'a' [11] : 108 'l' [12] : 116 't' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' [17] : 0 '\000' [18] : 0 '\000' [19] : 0 '\000' [20] : 0 '\000' [21] : 0 '\000' [22] : 0 '\000' [23] : 0 '\000' [24] : 0 '\000' [25] : 0 '\000' [26] : 0 '\000' [27] : 0 '\000' [28] : 0 '\000' [29] : 0 '\000' [30] : 0 '\000' [31] : 0 '\000' ] ] slab_cache = (struct kmem_cache *) 0xffff8881078eca80 cache_size_mutex = { owner = { counter=0 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff88988695a970, prev = 0xffff88988695a970} /* empty */ } seq_flush = 5031591 seq_write = 5031591 quiesce = 0 fullsync = 0 recovery_disabled = 15 percpu = (struct raid5_percpu *) 0x604fdee147e8 scribble_disks = 16 scribble_sectors = 1024 node = { next = 0x0 pprev = (struct hlist_node **) 0xffff88810a5b89a8 } active_stripes = { counter=27534 } /* atomic_t */ inactive_list = [ (3) [0] : {next = 0xffff88990fd76710, prev = 0xffff8889b7f72550} [1] : {next = 0xffff88959af002d0, prev = 0xffff889958cd4590} [2] : {next = 0xffff88a60a070150, prev = 0xffff88a37d3d0850} [3] : {next = 0xffff889e211e0690, prev = 0xffff8893a4000110} [4] : {next = 0xffff888c2139a350, prev = 0xffff88901a188410} [5] : {next = 0xffff88a83413e3d0, prev = 0xffff88a2090bc350} [6] : {next = 0xffff88a720d54310, prev = 0xffff888a0da6c5d0} [7] : {next = 0xffff888a0412c650, prev = 0xffff889250596550} ] r5c_cached_full_stripes = { counter=0 } /* atomic_t */ r5c_full_stripe_list = {next = 0xffff88988695aa48, prev = 0xffff88988695aa48} /* empty */ r5c_cached_partial_stripes = { counter=0 } /* atomic_t */ r5c_partial_stripe_list = {next = 0xffff88988695aa60, prev = 0xffff88988695aa60} /* empty */ r5c_flushing_full_stripes = { counter=0 } /* atomic_t */ r5c_flushing_partial_stripes = { counter=0 } /* atomic_t */ empty_inactive_list_nr = { counter=0 } /* atomic_t */ released_stripes = { first = 0x0 } 
wait_for_quiescent = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff88988695aa90, prev = 0xffff88988695aa90} /* empty */ } wait_for_stripe = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffffc9000e3f7c78, prev = 0xffffc9000e3f7c78} } wait_for_overlap = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff88988695aac0, prev = 0xffff88988695aac0} /* empty */ } cache_state = 3 shrinker = { count_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198bc60 <raid5_cache_count> scan_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198de90 <raid5_cache_scan> batch = 128 seeks = 128 flags = 0 list = {next = 0xffff8898867e8c30, prev = 0xffffffffa0e14520} id = 0 nr_deferred = (atomic_long_t *) 0xffff88988d8955c0 } pool_size = 16 device_lock = {counter=0} /* spinlock_t */ disks = (struct disk_info *) 0xffff8898b1df3800 bio_split = { bio_slab = (struct kmem_cache *) 0xffff889886fbdc00 front_pad = 0 bio_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae280 pool_data = (void *) 0xffff889886fbdc00 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff88988695ab70, prev = 0xffff88988695ab70} /* empty */ } } bvec_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bio_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bvec_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 
pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } rescue_lock = {counter=0} /* spinlock_t */ rescue_list = { head = 0x0 tail = 0x0 } rescue_work = { data = { counter=68719476704 } /* atomic_t */ entry = {next = 0xffff88988695ac78, prev = 0xffff88988695ac78} /* empty */ func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue> } rescue_workqueue = 0x0 } thread = 0x0 temp_inactive_list = [ (3) [0] : {next = 0xffff88988695aca0, prev = 0xffff88988695aca0} /* empty */ [1] : {next = 0xffff88988695acb0, prev = 0xffff88988695acb0} /* empty */ [2] : {next = 0xffff88988695acc0, prev = 0xffff88988695acc0} /* empty */ [3] : {next = 0xffff88988695acd0, prev = 0xffff88988695acd0} /* empty */ [4] : {next = 0xffff88988695ace0, prev = 0xffff88988695ace0} /* empty */ [5] : {next = 0xffff88988695acf0, prev = 0xffff88988695acf0} /* empty */ [6] : {next = 0xffff88988695ad00, prev = 0xffff88988695ad00} /* empty */ [7] : {next = 0xffff88988695ad10, prev = 0xffff88988695ad10} /* empty */ ] worker_groups = 0x0 group_cnt = 0 worker_cnt_per_group = 0 log = 0x0 log_private = 0x0 pending_bios_lock = {counter=0} /* spinlock_t */ batch_bio_dispatch = true_21_ pending_data = (struct r5pending_data *) 0xffff88988fae0000 free_list = {next = 0xffff88988fae4fd8, prev = 0xffff88988fae0000} pending_list = {next = 0xffff88988695ad60, prev = 0xffff88988695ad60} /* empty */ pending_data_cnt = 0 next_pending_data = 0x0 } set $sh = container_of ((void *)0xffff88a57f1e6550, struct stripe_head, lru) (gdb) dump *$sh $L_010: (struct stripe_head *) 0xffff88a57f1e6540 : { hash = { next = (struct hlist_node *) 0xffff888768f7c680 pprev = (struct hlist_node **) 0xffff8892f0520080 } lru = {next = 0xffff88945e034510, prev = 0xffff8891d8f04210} release_list = { next = (struct llist_node *) 0xffff88945e034520 } raid_conf = (struct r5conf *) 0xffff88988695a800 generation = 0 sector = 4947068032 pd_idx = 14 qd_idx = 15 
ddf_layout = 0 hash_lock_index = 0 state = 8242 # 0x2032 HANDLE + INSYNC + REPLACED + STARTED count = { counter=0 } /* atomic_t */ bm_seq = 5030861 disks = 16 overwrite_disks = 0 check_state = check_state_idle_5_ reconstruct_state = reconstruct_state_idle_5_ stripe_lock = {counter=0} /* spinlock_t */ cpu = 17 group = 0x0 batch_head = 0x0 # no batch batch_lock = {counter=0} /* spinlock_t */ batch_list = {next = 0xffff88a57f1e65c8, prev = 0xffff88a57f1e65c8} /* empty */ /* anon union*/ { log_io = 0x0 ppl_io = 0x0 } log_list = {next = 0xffff88a57f1e65e0, prev = 0xffff88a57f1e65e0} /* empty */ log_start = 18446744073709551615 r5c = {next = 0xffff88a57f1e65f8, prev = 0xffff88a57f1e65f8} /* empty */ ppl_page = 0x0 ops = { target = 0 target2 = 0 zero_sum_result = 0_5_ } (gdb) dump $sh->dev[0] $L_011: (struct r5dev *) 0xffff88a57f1e6620 : { req = { bi_next = 0x0 bi_disk = 0x0 bi_opf = 0 bi_flags = 0 bi_ioprio = 0 bi_write_hint = 0 bi_status = 0 '\000' bi_partno = 0 '\000' __bi_remaining = { counter=1 } /* atomic_t */ bi_iter = { bi_sector = 0 bi_size = 0 bi_idx = 0 bi_bvec_done = 0 } bi_end_io = 0x0 bi_private = 0x0 bi_blkg = 0x0 bi_issue = { value = 0 } /* anon union*/ { bi_integrity = 0x0 } bi_vcnt = 0 bi_max_vecs = 1 __bi_cnt = { counter=1 } /* atomic_t */ bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6710 bi_pool = 0x0 bi_inline_vecs = [ ] } rreq = { bi_next = 0x0 bi_disk = 0x0 bi_opf = 0 bi_flags = 0 bi_ioprio = 0 bi_write_hint = 0 bi_status = 0 '\000' bi_partno = 0 '\000' __bi_remaining = { counter=1 } /* atomic_t */ bi_iter = { bi_sector = 0 bi_size = 0 bi_idx = 0 bi_bvec_done = 0 } bi_end_io = 0x0 bi_private = 0x0 bi_blkg = 0x0 bi_issue = { value = 0 } /* anon union*/ { bi_integrity = 0x0 } bi_vcnt = 0 bi_max_vecs = 1 __bi_cnt = { counter=1 } /* atomic_t */ bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6720 bi_pool = 0x0 bi_inline_vecs = [ ] } vec = { bv_page = (struct page *) 0xffffea008ffccc80 bv_len = 4096 bv_offset = 0 } rvec = { bv_page = 0x0 bv_len = 0 
bv_offset = 0 } page = (struct page *) 0xffffea008ffccc80 orig_page = (struct page *) 0xffffea008ffccc80 offset = 0 toread = 0x0 read = 0x0 towrite = 0x0 written = 0x0 sector = 69258950784 flags = 17 # 0x11 : R5_UPTODATE + R5_INSYNC log_checksum = 0 write_hint = 0 }

$sh->dev[1] .. $sh->dev[15] omitted. All the same, just different page buffers and sector numbers. $sh->dev[14] and $sh->dev[15] have sector = 0.
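[Editorial note] The figures in the message above are already enough to see why the sync thread blocks in raid5_get_active_stripe: a caller may only take a fresh stripe_head while the number of active stripes is below 3/4 of the cache size. A minimal sketch of that gate, assuming the 3/4 test as quoted from drivers/md/raid5.c and plugging in the observed numbers (everything else here is illustrative, not kernel code):

```python
# Sketch of the allocation gate in raid5_get_active_stripe().
# Constants come from the data in the message above; the 3/4 test
# mirrors: atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
MAX_NR_STRIPES = 29816   # r5conf.max_nr_stripes from the gdb dump
ACTIVE_STRIPES = 27534   # /sys/block/md1/md/stripe_cache_active

def may_allocate_stripe(active, max_nr):
    # The sync thread may only take a new stripe while active stripes
    # are below 3/4 of the cache size (C integer division).
    return active < max_nr * 3 // 4

print(MAX_NR_STRIPES * 3 // 4)                              # 22362
print(may_allocate_stripe(ACTIVE_STRIPES, MAX_NR_STRIPES))  # False
```

With 27534 active stripes against a threshold of 22362, the condition stays false, so raid5_get_active_stripe waits and the sync thread cannot be stopped.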
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-19 11:30 ` Donald Buczek @ 2021-01-20 16:33 ` Guoqing Jiang 2021-01-23 13:04 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-20 16:33 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 1/19/21 12:30, Donald Buczek wrote: > Dear md-raid people, > > I've reported a problem in this thread in December: > > "We are using raid6 on several servers. Occasionally we had failures, > where a mdX_raid6 process seems to go into a busy loop and all I/O to > the md device blocks. We've seen this on various kernel versions." It > was clear, that this is related to "mdcheck" running, because we found, > that the command, which should stop the scrubbing in the morning (`echo > idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked. > > On 12/21/20, I've reported, that the problem might be caused by a > failure of the underlying block device driver, because I've found > "inflight" counters of the block devices not being zero. However, this > is not the case. We were able to run into the mdX_raid6 looping > condition a few times again, but the non-zero inflight counters have not > been observed again. > > I was able to collect a lot of additional information from a blocked > system. > > - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command > is waiting at kthread_stop to stop the sync thread. 
> [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because it's blocked at
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE)

The resync thread is blocked since it can't get a sh from the inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.

> - We have lots of active stripes:
>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait till it becomes true; either increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes

Since the "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could allocate new stripes in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.

Another thing you can try is to increase the number of sh manually by writing a new number to stripe_cache_size, if there is enough memory.

2.
Or decrease active_stripes

This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>
> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:

because of the MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be called by md_reap_sync_thread. But md_reap_sync_thread is stuck unregistering the sync_thread (it was blocked in raid5_get_active_stripe).

Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false.
I would assume released is '0' since

> released_stripes = {
> first = 0x0
> }

And __get_priority_stripe fetches a sh from conf->handle_list and deletes it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe puts the sh back on handle_list. So batch_size can't be '0', and handle_active_stripes repeats the process in the busy loop. This is my best guess to explain the busy loop situation.

> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check if there are bad blocks on the underlying devices (sd*)?

> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked processes can make progress.

> I, of coruse, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with it first disks)

These data did help the investigation!

Thanks,
Guoqing
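[Editorial note] Guoqing's "best guess" above can be illustrated with a toy model. This is deliberately not kernel code — just a sketch, under the stated assumption, of the queueing behaviour he describes: while MD_SB_CHANGE_PENDING is set, handle_stripe bails out early with STRIPE_HANDLE still set, do_release_stripe re-queues the stripe onto handle_list, and raid5d's exit test "if (!batch_size && !released) break;" never fires.

```python
from collections import deque

# Toy model (assumed simplification, not kernel code) of the guessed
# raid5d busy loop: handled stripes are re-queued instead of released
# while a superblock update is pending, so the handle list never drains.

def raid5d_iterations(n_stripes, sb_change_pending, max_rounds=1000):
    handle_list = deque(range(n_stripes))
    rounds = 0
    while rounds < max_rounds:
        # __get_priority_stripe: take up to a batch of stripes off the list.
        batch = [handle_list.popleft() for _ in range(min(8, len(handle_list)))]
        for sh in batch:
            if sb_change_pending:
                # handle_stripe bails out, STRIPE_HANDLE stays set,
                # do_release_stripe re-queues: no progress is made.
                handle_list.append(sh)
        if not batch:          # models "if (!batch_size && !released) break;"
            break
        rounds += 1
    return rounds, len(handle_list)

# With the flag set, the loop only stops at our artificial cap:
print(raid5d_iterations(32, True))    # (1000, 32)
# Once the flag is cleared (echo active > .../array_state), it drains:
print(raid5d_iterations(32, False))   # (4, 0)
```

The second call mirrors the observed recovery: once SB_CHANGE_PENDING is cleared via array_state, the stripes complete and active_stripes can fall below the allocation threshold again.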
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-20 16:33 ` Guoqing Jiang
@ 2021-01-23 13:04 ` Donald Buczek
  2021-01-25  8:54 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2021-01-23 13:04 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 20.01.21 17:33, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/19/21 12:30, Donald Buczek wrote:
>> Dear md-raid people,
>>
>> I've reported a problem in this thread in December:
>>
>> "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear that this is related to "mdcheck" running, because we found that the command which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>>
>> On 12/21/20, I've reported that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.
>>
>> I was able to collect a lot of additional information from a blocked system.
>>
>> - The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread.
[ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>>
>> - The sync thread ("md1_resync") does not finish, because it's blocked at
>>
>> [<0>] raid5_get_active_stripe+0x4c4/0x660   # [1]
>> [<0>] raid5_sync_request+0x364/0x390
>> [<0>] md_do_sync+0xb41/0x1030
>> [<0>] md_thread+0x122/0x160
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>>
>> - yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
>
> The resync thread is blocked since it can't get sh from the inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.
>
>> - We have lots of active stripes:
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
>> 27534
>
> There are too many active stripes, so this is false:
>
> atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
>
> so raid5_get_active_stripe has to wait till it becomes true; either increase max_nr_stripes or decrease active_stripes.
>
> 1. Increase max_nr_stripes
> Since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could allocate a new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.
>
> Another thing is you can try to increase the number of sh manually by writing a new number to stripe_cache_size if there is enough memory.
>
> 2.
Or decrease active_stripes
> This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set.
>
>> - handle_stripe() doesn't make progress:
>>
>> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>>
>> In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>>
>> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>>
>> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>>
>> state=0x2031 cnt=1, check:0  # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
>> state=0x2029 cnt=1, check:4  # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
>> state=0x2009 cnt=1, check:0  # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>>
>> - We have MD_SB_CHANGE_PENDING set:
>
> That is because of the MD_SB_CHANGE_PENDING flag, do_release_stripe can't call atomic_dec(&conf->active_stripes).
>
> Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be called by md_reap_sync_thread. But md_reap_sync_thread is stuck unregistering the sync thread (it was blocked in raid5_get_active_stripe).
>
> Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false.
I would assume released is '0' since
>
>> released_stripes = {
>>   first = 0x0
>> }
>
> And __get_priority_stripe fetches sh from conf->handle_list and deletes it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe puts the sh back on handle_list. So batch_size can't be '0', and handle_active_stripes repeats the process in the busy loop. This is my best guess to explain the busy-loop situation.
>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
>> write-pending
>>
>> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
>
> Since rdev_set_badblocks could set them, could you check if there are bad blocks on the underlying devices (sd*)?

The problem is, we are still not able to run a system into the deadlocked state by a repeatable procedure, so I was not yet able to explicitly check for bad blocks. However, on the production systems which locked up and on the test systems which we managed to lock up, there was always some file-system activity to the affected devices aside from mdcheck itself. This alone would explain any clean->write-pending transitions and might be a required condition for the problem to happen. I'm currently trying to explicitly exercise this path (by write/fsync()/sleep), with no result yet.

Btw: When you monitor md/stripe_cache_active during an ongoing "check" on an otherwise idle system, you see a variety of values up to a certain maximum value, which is visible most of the time. But this maximum value seems to continuously increase. Also, when you start the check at higher blocks (via md/sync_min) right away, the maximum value seems to be higher. Can this be explained? A bigger seek gap between superblocks and active data area? But should this increase the number of active stripes?
Best
  Donald

>> So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>>
>> - The system can manually be freed from the deadlock:
>>
>> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
>
> Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked process can make progress.
>
>> I, of course, don't fully understand it yet. Any ideas?
>>
>> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with its first disks)
>
> These data did help the investigation!
>
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-23 13:04 ` Donald Buczek
@ 2021-01-25  8:54 ` Donald Buczek
  2021-01-25 21:32 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2021-01-25  8:54 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

a colleague of mine was able to produce the issue inside a vm and was able to find a procedure to run the vm into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`.

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex.
(https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.

- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.

Does this make sense?

Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash

(
    while true; do
        echo "start check"
        echo check > /sys/block/md3/md/sync_action
        sleep 10
        echo "stop check"
        echo idle > /sys/block/md3/md/sync_action
        sleep 2
    done
) &

(
    while true; do
        dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
        sync /mnt/raid_md3/bigfile
        rm /mnt/raid_md3/bigfile
        sleep .1
    done
) &

start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
    array_state=$(cat array_state)
    if [ "$array_state" = write-pending ]; then
        wp_count=$(($wp_count+1))
    else
        wp_count=0
    fi
    echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
    if [ $wp_count -ge 3 ]; then
        kill -- -$$
        exit
    fi
    sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with five minutes being typical.
The output ends like this:

309 check 6283872 / 8378368 active-idle 4144
310 check 6283872 / 8378368 active 1702
311 check 6807528 / 8378368 active 4152
312 check 7331184 / 8378368 clean 3021
stop check
313 check 7331184 / 8378368 write-pending 3905
314 check 7331184 / 8378368 write-pending 3905
315 check 7331184 / 8378368 write-pending 3905
Terminated

If I add

    pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get

[ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6

If I add

    pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);

at the head of md_check_recovery, I get:

[  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006

More debug lines in md_check_recovery confirm the control flow (`if (mddev_trylock(mddev))` block not taken).

What approach would you suggest to fix this?

Best
  Donald

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-25 8:54 ` Donald Buczek @ 2021-01-25 21:32 ` Donald Buczek 2021-01-26 0:44 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-25 21:32 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 25.01.21 09:54, Donald Buczek wrote: > Dear Guoqing, > > a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem. > > My current theory of what is happening is: > > - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308) > > - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action` > > - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero. > > - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping. > > - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. 
However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared. > > - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate. > > Does this make sense? > > Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output: > > ``` > #! /bin/bash > > ( > while true; do > echo "start check" > echo check > /sys/block/md3/md/sync_action > sleep 10 > echo "stop check" > echo idle > /sys/block/md3/md/sync_action > sleep 2 > done > ) & > > ( > while true; do > dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none > sync /mnt/raid_md3/bigfile > rm /mnt/raid_md3/bigfile > sleep .1 > done > ) & > > start="$(date +%s)" > cd /sys/block/md3/md > wp_count=0 > while true; do > array_state=$(cat array_state) > if [ "$array_state" = write-pending ]; then > wp_count=$(($wp_count+1)) > else > wp_count=0 > fi > echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active) > if [ $wp_count -ge 3 ]; then > kill -- -$$ > exit > fi > sleep 1 > done > ``` > > The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. 
The output ends like this: > > 309 check 6283872 / 8378368 active-idle 4144 > 310 check 6283872 / 8378368 active 1702 > 311 check 6807528 / 8378368 active 4152 > 312 check 7331184 / 8378368 clean 3021 > stop check > 313 check 7331184 / 8378368 write-pending 3905 > 314 check 7331184 / 8378368 write-pending 3905 > 315 check 7331184 / 8378368 write-pending 3905 > Terminated > > If I add > > pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags); > > in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get > > [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > > If I add > > pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags); > > at the head of md_check_recovery, I get: > > [ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > > More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken ) > > What approach would you suggest to fix this? I naively tried the following patch and it seems to fix the problem. 
The test procedure didn't trigger the deadlock in 10 hours.

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    mddev_lock(mddev) == 0) {
+			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 			flush_workqueue(md_misc_wq);
 			if (mddev->sync_thread) {
 				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 				md_reap_sync_thread(mddev);
 			}
+			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 			mddev_unlock(mddev);
 		}
 	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-25 21:32 ` Donald Buczek @ 2021-01-26 0:44 ` Guoqing Jiang 2021-01-26 9:50 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-26 0:44 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid [-- Attachment #1: Type: text/plain, Size: 8583 bytes --] Hi Donald, On 1/25/21 22:32, Donald Buczek wrote: > > > On 25.01.21 09:54, Donald Buczek wrote: >> Dear Guoqing, >> >> a colleague of mine was able to produce the issue inside a vm and were >> able to find a procedure to run the vm into the issue within minutes >> (not unreliably after hours on a physical system as before). This of >> course helped to pinpoint the problem. >> >> My current theory of what is happening is: >> >> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by >> md_write_start() when file-system I/O wants to do a write and the >> array transitions from "clean" to "active". >> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308) >> >> - Before raid5d gets to write the superblock (its busy processing >> active stripes because of the sync activity) , userspace wants to >> pause the check by `echo idle > /sys/block/mdX/md/sync_action` >> >> - action_store() takes the reconfig_mutex before trying to stop the >> sync thread. >> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump >> of struct mddev of email 1/19/21 confirms reconf_mutex non-zero. >> >> - raid5d is running in its main loop. >> raid5d()->handle_active_stripes() returns a positive batch size ( >> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 >> ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't >> process any stripe because of MD_SB_CHANGE_PENDING. >> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 >> ). This is the reason, raid5d is busy looping. 
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>>
>> Does this make sense?

Yes, exactly! That was my thought too; the scenario is twisted.

The resync thread is blocked because there are too many active stripes; raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means active stripes can't be decreased; and the echo idle command can't make progress given the resync thread is blocked, while the command still holds the reconfig_mutex, which keeps raid5d in the busy loop and unable to clear the SB_CHANGE_PENDING flag.

And raid5 could suffer from the same issue, I think.

>>
>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #!
/bin/bash >> >> ( >> while true; do >> echo "start check" >> echo check > /sys/block/md3/md/sync_action >> sleep 10 >> echo "stop check" >> echo idle > /sys/block/md3/md/sync_action >> sleep 2 >> done >> ) & >> >> ( >> while true; do >> dd bs=1k count=$((5*1024*1024)) if=/dev/zero >> of=/mnt/raid_md3/bigfile status=none >> sync /mnt/raid_md3/bigfile >> rm /mnt/raid_md3/bigfile >> sleep .1 >> done >> ) & >> >> start="$(date +%s)" >> cd /sys/block/md3/md >> wp_count=0 >> while true; do >> array_state=$(cat array_state) >> if [ "$array_state" = write-pending ]; then >> wp_count=$(($wp_count+1)) >> else >> wp_count=0 >> fi >> echo $(($(date +%s)-$start)) $(cat sync_action) $(cat >> sync_completed) $array_state $(cat stripe_cache_active) >> if [ $wp_count -ge 3 ]; then >> kill -- -$$ >> exit >> fi >> sleep 1 >> done >> ``` >> >> The time, this needs to trigger the issue, varies from under a minute >> to one hour with 5 minute being typical. The output ends like this: >> >> 309 check 6283872 / 8378368 active-idle 4144 >> 310 check 6283872 / 8378368 active 1702 >> 311 check 6807528 / 8378368 active 4152 >> 312 check 7331184 / 8378368 clean 3021 >> stop check >> 313 check 7331184 / 8378368 write-pending 3905 >> 314 check 7331184 / 8378368 write-pending 3905 >> 315 check 7331184 / 8378368 write-pending 3905 >> Terminated >> >> If I add >> >> pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", >> batch_size, released, mddev->sb_flags); >> >> in raid5d after the call to handle_active_stripes and enable the debug >> location after the deadlock occurred, I get >> >> [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> >> If I add >> >> pr_debug("XXX 1 %s:%d mddev->flags %08lx 
mddev->sb_flags >> %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags); >> >> at the head of md_check_recovery, I get: >> >> [ 789.555462] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555477] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555491] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555505] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555520] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> >> More debug lines in md_check_recovery confirm the control flow ( `if >> (mddev_trylock(mddev))` block not taken ) >> That is great that you have a reproducer now! >> What approach would you suggest to fix this? > > I naively tried the following patch and it seems to fix the problem. The > test procedure didn't trigger the deadlock in 10 hours. > > D. 
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>  		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>  		    mddev_lock(mddev) == 0) {
> +			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>  			flush_workqueue(md_misc_wq);
>  			if (mddev->sync_thread) {
>  				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>  				md_reap_sync_thread(mddev);
>  			}
> +			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>  			mddev_unlock(mddev);
>  		}
>  	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock issue, but I am not sure if it is the right way, given we only set ALLOW_SB_UPDATE in suspend, which makes sense since the I/O will be quiesced, but the write-idle action can't guarantee a similar thing. I prefer to make the resync thread not wait forever here:

	wait_event_lock_irq(
		conf->wait_for_stripe,
		!list_empty(conf->inactive_list + hash) &&
		(atomic_read(&conf->active_stripes)
		 < (conf->max_nr_stripes * 3 / 4)
		 || !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
		*(conf->hash_locks + hash));

So, could you please try the attached?
Thanks,
Guoqing

[-- Attachment #2: raid5-proposal --]
[-- Type: text/plain, Size: 5862 bytes --]

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4337ae0..378ce5c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
 {
 	struct stripe_head *sh;
 
-	sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
+	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
 	if (!sh)
 		return NULL;  /* no more stripe available */
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8888973..33a2a22 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
 }
 
 struct stripe_head *
-raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
+raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
 			int previous, int noblock, int noquiesce)
 {
 	struct stripe_head *sh;
+	struct mddev *mddev = conf->mddev;
 	int hash = stripe_hash_locks_hash(conf, sector);
 	int inc_empty_inactive_list_flag;
 
@@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 				(atomic_read(&conf->active_stripes)
 				 < (conf->max_nr_stripes * 3 / 4)
 				 || !test_bit(R5_INACTIVE_BLOCKED,
-					      &conf->cache_state)),
+					      &conf->cache_state)
+				 || (test_bit(MD_RECOVERY_INTR,
+					      &mddev->recovery)
+				     && sync_req)),
 				*(conf->hash_locks + hash));
+			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
+			    && sync_req)
+				break;
 			clear_bit(R5_INACTIVE_BLOCKED,
 				  &conf->cache_state);
 		} else {
@@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
 			sector_t bn = raid5_compute_blocknr(sh, i, 1);
 			sector_t s = raid5_compute_sector(conf, bn, 0,
 							  &dd_idx, NULL);
-			sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
+			sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
 				 * have been requested.  When later blocks
@@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
 	/* Finish reconstruct operations initiated by the expansion process */
 	if (sh->reconstruct_state == reconstruct_state_result) {
 		struct stripe_head *sh_src
-			= raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
+			= raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
 		if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
 			/* sh cannot be written until sh_src has been read.
 			 * so arrange for sh to be delayed a little
@@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		DEFINE_WAIT(w);
 		int d;
 	again:
-		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
+		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
 		prepare_to_wait(&conf->wait_for_overlap, &w,
 				TASK_UNINTERRUPTIBLE);
 		set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
@@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			 (unsigned long long)new_sector,
 			 (unsigned long long)logical_sector);
 
-		sh = raid5_get_active_stripe(conf, new_sector, previous,
+		sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
 				       (bi->bi_opf & REQ_RAHEAD), 0);
 		if (sh) {
 			if (unlikely(previous)) {
@@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
 		int j;
 		int skipped_disk = 0;
-		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
+		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
 		set_bit(STRIPE_EXPANDING, &sh->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
@@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
-		sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
+		sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
 		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
@@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 
 	md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
 
-	sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
+	sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
 	if (sh == NULL) {
-		sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
+		sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
+		if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+			*skipped = 1;
+			return 0;
+		}
+
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access
 		 */
@@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
 			/* already done this stripe */
 			continue;
 
-		sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
+		sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 5c05acf..d9eab46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
 				     int previous, int *dd_idx,
 				     struct stripe_head *sh);
 extern struct stripe_head *
-raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
+raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
 			int previous, int noblock, int noquiesce);
 extern int raid5_calc_degraded(struct r5conf *conf);
 extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 0:44 ` Guoqing Jiang @ 2021-01-26 9:50 ` Donald Buczek 2021-01-26 11:14 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-26 9:50 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoqing, On 26.01.21 01:44, Guoqing Jiang wrote: > Hi Donald, > > On 1/25/21 22:32, Donald Buczek wrote: >> >> >> On 25.01.21 09:54, Donald Buczek wrote: >>> Dear Guoqing, >>> >>> a colleague of mine was able to produce the issue inside a vm and to find a procedure which runs the vm into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem. >>> >>> My current theory of what is happening is: >>> >>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308) >>> >>> - Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action` >>> >>> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) A dump of struct mddev from the email of 1/19/21 confirms reconfig_mutex is non-zero. >>> >>> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy looping.
>>> >>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared. >>> >>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate. >>> >>> Does this make sense? > > Yes, exactly! That was my thought too, the scenario is twisted. > > The resync thread is blocked because there are too many active stripes; raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means the number of active stripes can't decrease; and the echo idle cmd can't make progress given the resync thread is blocked, while the cmd still holds the reconfig_mutex, which keeps raid5d in the busy loop and unable to clear the SB_CHANGE_PENDING flag. > > And raid5 could suffer from the same issue, I think. > >>> >>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output: >>> >>> ``` >>> #!
/bin/bash >>> >>> ( >>> while true; do >>> echo "start check" >>> echo check > /sys/block/md3/md/sync_action >>> sleep 10 >>> echo "stop check" >>> echo idle > /sys/block/md3/md/sync_action >>> sleep 2 >>> done >>> ) & >>> >>> ( >>> while true; do >>> dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none >>> sync /mnt/raid_md3/bigfile >>> rm /mnt/raid_md3/bigfile >>> sleep .1 >>> done >>> ) & >>> >>> start="$(date +%s)" >>> cd /sys/block/md3/md >>> wp_count=0 >>> while true; do >>> array_state=$(cat array_state) >>> if [ "$array_state" = write-pending ]; then >>> wp_count=$(($wp_count+1)) >>> else >>> wp_count=0 >>> fi >>> echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active) >>> if [ $wp_count -ge 3 ]; then >>> kill -- -$$ >>> exit >>> fi >>> sleep 1 >>> done >>> ``` >>> >>> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this: >>> >>> 309 check 6283872 / 8378368 active-idle 4144 >>> 310 check 6283872 / 8378368 active 1702 >>> 311 check 6807528 / 8378368 active 4152 >>> 312 check 7331184 / 8378368 clean 3021 >>> stop check >>> 313 check 7331184 / 8378368 write-pending 3905 >>> 314 check 7331184 / 8378368 write-pending 3905 >>> 315 check 7331184 / 8378368 write-pending 3905 >>> Terminated >>> >>> If I add >>> >>> pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags); >>> >>> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get >>> >>> [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> >>> If I add >>> >>> 
pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags); >>> >>> at the head of md_check_recovery, I get: >>> >>> [ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> >>> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken ) >>> > > That is great that you have a reproducer now! > >>> What approach would you suggest to fix this? >> >> I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours. >> >> D. 
>> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 2d21c298ffa7..f40429843906 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len) >> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >> mddev_lock(mddev) == 0) { >> + set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >> flush_workqueue(md_misc_wq); >> if (mddev->sync_thread) { >> set_bit(MD_RECOVERY_INTR, &mddev->recovery); >> md_reap_sync_thread(mddev); >> } >> + clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >> mddev_unlock(mddev); >> } >> } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) > > Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the similar thing. Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage. > I prefer to make resync thread not wait forever here. > > wait_event_lock_irq( > conf->wait_for_stripe, > !list_empty(conf->inactive_list + hash) && > (atomic_read(&conf->active_stripes) > < (conf->max_nr_stripes * 3 / 4) > || !test_bit(R5_INACTIVE_BLOCKED, > &conf->cache_state) > *(conf->hash_locks + hash)); > > So, could you please try the attached? 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c > index 4337ae0..378ce5c 100644 > --- a/drivers/md/raid5-cache.c > +++ b/drivers/md/raid5-cache.c > @@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe( > { > struct stripe_head *sh; > > - sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0); > + sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0); > if (!sh) > return NULL; /* no more stripe available */ > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index 8888973..33a2a22 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf) > } > > struct stripe_head * > -raid5_get_active_stripe(struct r5conf *conf, sector_t sector, > +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req, > int previous, int noblock, int noquiesce) > { > struct stripe_head *sh; > + struct mddev *mddev = conf->mddev; > int hash = stripe_hash_locks_hash(conf, sector); > int inc_empty_inactive_list_flag; > > @@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector, > (atomic_read(&conf->active_stripes) > < (conf->max_nr_stripes * 3 / 4) > || !test_bit(R5_INACTIVE_BLOCKED, > - &conf->cache_state)), > + &conf->cache_state) > + || (test_bit(MD_RECOVERY_INTR, > + &mddev->recovery) > + && sync_req)), > *(conf->hash_locks + hash)); > + if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) > + && sync_req) > + break; > clear_bit(R5_INACTIVE_BLOCKED, > &conf->cache_state); > } else { > @@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh) > sector_t bn = raid5_compute_blocknr(sh, i, 1); > sector_t s = raid5_compute_sector(conf, bn, 0, > &dd_idx, NULL); > - sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1); > + sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1); > if (sh2 == NULL) > /* so far only the early blocks of this stripe > * have been requested. 
When later blocks > @@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh) > /* Finish reconstruct operations initiated by the expansion process */ > if (sh->reconstruct_state == reconstruct_state_result) { > struct stripe_head *sh_src > - = raid5_get_active_stripe(conf, sh->sector, 1, 1, 1); > + = raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1); > if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) { > /* sh cannot be written until sh_src has been read. > * so arrange for sh to be delayed a little > @@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi) > DEFINE_WAIT(w); > int d; > again: > - sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0); > + sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0); > prepare_to_wait(&conf->wait_for_overlap, &w, > TASK_UNINTERRUPTIBLE); > set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags); > @@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) > (unsigned long long)new_sector, > (unsigned long long)logical_sector); > > - sh = raid5_get_active_stripe(conf, new_sector, previous, > + sh = raid5_get_active_stripe(conf, new_sector, previous, 0, Mistake here (fourth argument added instead of third) > (bi->bi_opf & REQ_RAHEAD), 0); > if (sh) { > if (unlikely(previous)) { > @@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk > for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) { > int j; > int skipped_disk = 0; > - sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1); > + sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1); > set_bit(STRIPE_EXPANDING, &sh->state); > atomic_inc(&conf->reshape_stripes); > /* If any of this stripe is beyond the end of the old > @@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk > if (last_sector >= mddev->dev_sectors) > last_sector = mddev->dev_sectors - 
1; > while (first_sector <= last_sector) { > - sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1); > + sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1); > set_bit(STRIPE_EXPAND_SOURCE, &sh->state); > set_bit(STRIPE_HANDLE, &sh->state); > raid5_release_stripe(sh); > @@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n > > md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false); > > - sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0); > + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0); > if (sh == NULL) { > - sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0); > + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0); > + if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { > + *skipped = 1; > + return 0; > + } > + > /* make sure we don't swamp the stripe cache if someone else > * is trying to get access > */ > @@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio, > /* already done this stripe */ > continue; > > - sh = raid5_get_active_stripe(conf, sector, 0, 1, 1); > + sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1); > > if (!sh) { > /* failed to get a stripe - must wait */ > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h > index 5c05acf..d9eab46 100644 > --- a/drivers/md/raid5.h > +++ b/drivers/md/raid5.h > @@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector, > int previous, int *dd_idx, > struct stripe_head *sh); > extern struct stripe_head * > -raid5_get_active_stripe(struct r5conf *conf, sector_t sector, > +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req, > int previous, int noblock, int noquiesce); > extern int raid5_calc_degraded(struct r5conf *conf); > extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode); Unfortunately, this patch did not fix the issue. 
root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
[<0>] raid5_get_active_stripe+0x1e7/0x6b0
[<0>] raid5_sync_request+0x2a7/0x3d0
[<0>] md_do_sync.cold+0x3ee/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

which is (according to objdump -d -Sl drivers/md/raid5.o) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore the number of active stripes is not decreasing? Who is expected to wake up the conf->wait_for_stripe waiters?

Best
  Donald

>
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
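Donald's question points at the heart of the problem: every party is waiting for another one. The circular wait pieced together in this thread can be modeled as a tiny waits-for graph (the node labels below are descriptive, not kernel identifiers; this is a sketch of the analysis, not of the code):

```python
# descriptive labels for the four waiting parties (not kernel identifiers)
waits_for = {
    "echo idle (holds reconfig_mutex)": "sync thread (md_do_sync)",
    "sync thread (md_do_sync)": "stripe cache (active stripes must drop)",
    "stripe cache (active stripes must drop)":
        "raid5d (must write superblock, clear MD_SB_CHANGE_PENDING)",
    "raid5d (must write superblock, clear MD_SB_CHANGE_PENDING)":
        "echo idle (holds reconfig_mutex)",
}

def find_cycle(graph, start):
    """Follow waits-for edges from start until a node repeats."""
    seen = []
    node = start
    while node not in seen:
        seen.append(node)
        node = graph[node]
    return seen[seen.index(node):]

cycle = find_cycle(waits_for, "sync thread (md_do_sync)")
assert len(cycle) == 4   # all four parties form one closed cycle
```

Since the chain is closed, breaking any one edge (releasing the mutex, bounding the wait, or allowing the superblock write) is enough to resolve the deadlock, which is what the proposals in the rest of the thread try in different places.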
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 9:50 ` Donald Buczek @ 2021-01-26 11:14 ` Guoqing Jiang 2021-01-26 12:58 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-26 11:14 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 1/26/21 10:50, Donald Buczek wrote: [...] >>> >>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>> index 2d21c298ffa7..f40429843906 100644 >>> --- a/drivers/md/md.c >>> +++ b/drivers/md/md.c >>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char >>> *page, size_t len) >>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); >>> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >>> mddev_lock(mddev) == 0) { >>> + set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>> flush_workqueue(md_misc_wq); >>> if (mddev->sync_thread) { >>> set_bit(MD_RECOVERY_INTR, &mddev->recovery); >>> md_reap_sync_thread(mddev); >>> } >>> + clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>> mddev_unlock(mddev); >>> } >>> } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >> >> Yes, it could break the deadlock issue, but I am not sure if it is the >> right way given we only set ALLOW_SB_UPDATE in suspend which makes >> sense since the io will be quiesced, but write idle action can't >> guarantee the similar thing. > > Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of > reconfig_mutex promises not to make any changes which would exclude > superblocks from being written" might make it easier to accept the usage. I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that. > >> I prefer to make resync thread not wait forever here. >> [...] 
>> >> - sh = raid5_get_active_stripe(conf, new_sector, previous, >> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0, > > > Mistake here (fourth argument added instead of third) Thanks for checking. [...] > Unfortunately, this patch did not fix the issue. > > root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack > [<0>] raid5_get_active_stripe+0x1e7/0x6b0 > [<0>] raid5_sync_request+0x2a7/0x3d0 > [<0>] md_do_sync.cold+0x3ee/0x97c > [<0>] md_thread+0xab/0x160 > [<0>] kthread+0x11b/0x140 > [<0>] ret_from_fork+0x22/0x30 > > which is ( according to objdump -d -Sl drivers/md/raid5.o ) at > https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735 > > Isn't it still the case that the superblocks are not written, therefore > stripes are not processed, therefore number of active stripes are not > decreasing? Who is expected to wake up conf->wait_for_stripe waiters? Hmm, how about wake the waiter up in the while loop of raid5d? @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) md_check_recovery(mddev); spin_lock_irq(&conf->device_lock); } + + if ((atomic_read(&conf->active_stripes) + < (conf->max_nr_stripes * 3 / 4) || + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) + wake_up(&conf->wait_for_stripe); } pr_debug("%d stripes handled\n", handled); If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like. wait_event_lock_irq(conf->wait_for_stripe, (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) || /* the previous condition */, *(conf->hash_locks + hash)); Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 11:14 ` Guoqing Jiang @ 2021-01-26 12:58 ` Donald Buczek 2021-01-26 14:06 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-26 12:58 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 26.01.21 12:14, Guoqing Jiang wrote: > Hi Donald, > > On 1/26/21 10:50, Donald Buczek wrote: > [...] > >>>> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>>> index 2d21c298ffa7..f40429843906 100644 >>>> --- a/drivers/md/md.c >>>> +++ b/drivers/md/md.c >>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len) >>>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); >>>> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >>>> mddev_lock(mddev) == 0) { >>>> + set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>>> flush_workqueue(md_misc_wq); >>>> if (mddev->sync_thread) { >>>> set_bit(MD_RECOVERY_INTR, &mddev->recovery); >>>> md_reap_sync_thread(mddev); >>>> } >>>> + clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>>> mddev_unlock(mddev); >>>> } >>>> } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >>> >>> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the similar thing. >> >> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage. > > I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that. > >> >>> I prefer to make resync thread not wait forever here. >>> > > [...] 
> >>> >>> - sh = raid5_get_active_stripe(conf, new_sector, previous, >>> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0, >> >> >> Mistake here (fourth argument added instead of third) > > Thanks for checking. > > [...] > >> Unfortunately, this patch did not fix the issue. >> >> root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack >> [<0>] raid5_get_active_stripe+0x1e7/0x6b0 >> [<0>] raid5_sync_request+0x2a7/0x3d0 >> [<0>] md_do_sync.cold+0x3ee/0x97c >> [<0>] md_thread+0xab/0x160 >> [<0>] kthread+0x11b/0x140 >> [<0>] ret_from_fork+0x22/0x30 >> >> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735 >> >> Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore number of active stripes are not decreasing? Who is expected to wake up conf->wait_for_stripe waiters? > > Hmm, how about wake the waiter up in the while loop of raid5d? > > @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) > md_check_recovery(mddev); > spin_lock_irq(&conf->device_lock); > } > + > + if ((atomic_read(&conf->active_stripes) > + < (conf->max_nr_stripes * 3 / 4) || > + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) > + wake_up(&conf->wait_for_stripe); > } > pr_debug("%d stripes handled\n", handled); Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at root@sloth:~# cat /proc/$(pgrep md3_resync)/stack [<0>] md_do_sync.cold+0x8ec/0x97c [<0>] md_thread+0xab/0x160 [<0>] kthread+0x11b/0x140 [<0>] ret_from_fork+0x22/0x30 instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg. Best Donald > If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like. 
> > wait_event_lock_irq(conf->wait_for_stripe, > (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) || > /* the previous condition */, > *(conf->hash_locks + hash)); > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 12:58 ` Donald Buczek @ 2021-01-26 14:06 ` Guoqing Jiang 2021-01-26 16:05 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-26 14:06 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 1/26/21 13:58, Donald Buczek wrote: > > >> Hmm, how about wake the waiter up in the while loop of raid5d? >> >> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >> md_check_recovery(mddev); >> spin_lock_irq(&conf->device_lock); >> } >> + >> + if ((atomic_read(&conf->active_stripes) >> + < (conf->max_nr_stripes * 3 / 4) || >> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >> + wake_up(&conf->wait_for_stripe); >> } >> pr_debug("%d stripes handled\n", handled); > > Hmm... With this patch on top of your other one, we still have the basic > symptoms (md3_raid6 busy looping), but the sync thread is now hanging at > > root@sloth:~# cat /proc/$(pgrep md3_resync)/stack > [<0>] md_do_sync.cold+0x8ec/0x97c > [<0>] md_thread+0xab/0x160 > [<0>] kthread+0x11b/0x140 > [<0>] ret_from_fork+0x22/0x30 > > instead, which is > https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync. Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted. wait_event(mddev->recovery_wait, test_bit(MD_RECOVERY_INTR,&mddev->recovery) || !atomic_read(&mddev->recovery_active)); > And, unlike before, "md: md3: data-check interrupted." from the pr_info > two lines above appears in dmesg. Yes, that is intentional since MD_RECOVERY_INTR is set by write idle. Anyway, will try the script and investigate more about the issue. 
Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 14:06 ` Guoqing Jiang @ 2021-01-26 16:05 ` Donald Buczek 2021-02-02 15:42 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-26 16:05 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoqing, On 26.01.21 15:06, Guoqing Jiang wrote: > > > On 1/26/21 13:58, Donald Buczek wrote: >> >> >>> Hmm, how about wake the waiter up in the while loop of raid5d? >>> >>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >>> md_check_recovery(mddev); >>> spin_lock_irq(&conf->device_lock); >>> } >>> + >>> + if ((atomic_read(&conf->active_stripes) >>> + < (conf->max_nr_stripes * 3 / 4) || >>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >>> + wake_up(&conf->wait_for_stripe); >>> } >>> pr_debug("%d stripes handled\n", handled); >> >> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at >> >> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >> [<0>] md_do_sync.cold+0x8ec/0x97c >> [<0>] md_thread+0xab/0x160 >> [<0>] kthread+0x11b/0x140 >> [<0>] ret_from_fork+0x22/0x30 >> >> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 > > Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync. > > Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted. > > wait_event(mddev->recovery_wait, > test_bit(MD_RECOVERY_INTR,&mddev->recovery) || > !atomic_read(&mddev->recovery_active)); With that added, md3_resync goes into a loop, too. Not 100% busy, though. 
root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

Best
  Donald

>
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
>
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
>
> Anyway, will try the script and investigate more about the issue.
>
> Thanks,
> Guoqing

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 49+ messages in thread
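The sleep/wake cycle Donald observes here is the normal contract of wait_event-style primitives: a wakeup is only a hint, and the waiter re-checks its predicate and goes back to sleep while the predicate is still false. A minimal Python analogue (illustrative only, using a condition variable in place of the kernel wait queue):

```python
import threading

cond = threading.Condition()
state = {"ok": False}

def waiter():
    # like wait_event_lock_irq(): a wakeup is only a hint; the predicate is
    # re-checked after every wakeup and the thread re-sleeps while it is false
    with cond:
        while not state["ok"]:
            cond.wait(timeout=0.5)

t = threading.Thread(target=waiter)
t.start()

with cond:
    cond.notify_all()    # wakeup while the predicate is still false: no progress

with cond:
    state["ok"] = True   # only changing the condition lets the waiter return
    cond.notify_all()

t.join(timeout=5)
assert state["ok"] and not t.is_alive()
```

This is why waking conf->wait_for_stripe from raid5d is not enough by itself: unless the waiter's condition (a free stripe, or an escape clause such as MD_RECOVERY_INTR) actually becomes true, the woken thread immediately sleeps again, which matches the alternating 0x1e7/0x1ef offsets in the stack.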
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 16:05 ` Donald Buczek @ 2021-02-02 15:42 ` Guoqing Jiang 2021-02-08 11:38 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-02-02 15:42 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 1/26/21 17:05, Donald Buczek wrote: > Dear Guoqing, > > On 26.01.21 15:06, Guoqing Jiang wrote: >> >> >> On 1/26/21 13:58, Donald Buczek wrote: >>> >>> >>>> Hmm, how about wake the waiter up in the while loop of raid5d? >>>> >>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >>>> md_check_recovery(mddev); >>>> spin_lock_irq(&conf->device_lock); >>>> } >>>> + >>>> + if ((atomic_read(&conf->active_stripes) >>>> + < (conf->max_nr_stripes * 3 / 4) || >>>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >>>> + wake_up(&conf->wait_for_stripe); >>>> } >>>> pr_debug("%d stripes handled\n", handled); >>> >>> Hmm... With this patch on top of your other one, we still have the >>> basic symptoms (md3_raid6 busy looping), but the sync thread is now >>> hanging at >>> >>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >>> [<0>] md_do_sync.cold+0x8ec/0x97c >>> [<0>] md_thread+0xab/0x160 >>> [<0>] kthread+0x11b/0x140 >>> [<0>] ret_from_fork+0x22/0x30 >>> >>> instead, which is >>> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 >> >> Not sure why recovery_active is not zero, because it is set 0 before >> blk_start_plug, and raid5_sync_request returns 0 and skipped is also >> set to 1. Perhaps handle_stripe calls md_done_sync. >> >> Could you double check the value of recovery_active? Or just don't >> wait if resync thread is interrupted. >> >> wait_event(mddev->recovery_wait, >> test_bit(MD_RECOVERY_INTR,&mddev->recovery) || >> !atomic_read(&mddev->recovery_active)); > > With that added, md3_resync goes into a loop, too. Not 100% busy, though. 
> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>
> [<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> [<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
> [<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x11b/0x140
> [<0>] ret_from_fork+0x22/0x30
>
> Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

I don't know why the condition was not true after the change, since RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
	(test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
	/* the previous condition */,
	*(conf->hash_locks + hash));

BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there is only a limited number of active stripes, but I feel it is ugly to check the sh number from action_store (kind of a layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from the current situation (this is what we tried, though it doesn't work so far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the safety (but maybe I am wrong).

4. given that the write IO keeps coming from the upper layer, which decreases the available stripes, maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe at the end of make_request.

5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this.

	/* resync has finished, collect result */
	mddev_unlock(mddev);
	md_unregister_thread(&mddev->sync_thread);
	mddev_lock(mddev);

My suggestion would be to try 2 + 4 together, since the reproducer triggers both sync io and write io.
Or try 5. Perhaps there is a better way which I just can't find.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
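Option 5, giving up reconfig_mutex around the unregister, can be sketched with ordinary threading primitives (a toy model with invented names; the real code has far more state to protect, and releasing the mutex opens a window for another holder):

```python
import threading

reconfig_mutex = threading.Lock()   # stand-in for mddev->reconfig_mutex
finished = threading.Event()

def sync_thread():
    # the sync thread needs the mutex briefly on its way out, modeling the
    # "resync has finished, collect result" path in md_do_sync()
    with reconfig_mutex:
        finished.set()

t = threading.Thread(target=sync_thread)
reconfig_mutex.acquire()            # action_store() holds reconfig_mutex
t.start()

# joining while still holding the mutex would deadlock;
# option 5 gives the mutex up around the join
reconfig_mutex.release()            # mddev_unlock(mddev)
t.join(timeout=5)                   # md_unregister_thread(&mddev->sync_thread)
reconfig_mutex.acquire()            # mddev_lock(mddev)
reconfig_mutex.release()

assert finished.is_set() and not t.is_alive()
```

The unlocked window is also the weakness raised in the follow-up: while the mutex is free, another userspace operation can take it and reconfigure the array under the unregistering thread.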
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-02 15:42 ` Guoqing Jiang @ 2021-02-08 11:38 ` Donald Buczek 2021-02-08 14:53 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-02-08 11:38 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 02.02.21 16:42, Guoqing Jiang wrote: > Hi Donald, > > On 1/26/21 17:05, Donald Buczek wrote: >> Dear Guoqing, >> >> On 26.01.21 15:06, Guoqing Jiang wrote: >>> >>> >>> On 1/26/21 13:58, Donald Buczek wrote: >>>> >>>> >>>>> Hmm, how about wake the waiter up in the while loop of raid5d? >>>>> >>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >>>>> md_check_recovery(mddev); >>>>> spin_lock_irq(&conf->device_lock); >>>>> } >>>>> + >>>>> + if ((atomic_read(&conf->active_stripes) >>>>> + < (conf->max_nr_stripes * 3 / 4) || >>>>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >>>>> + wake_up(&conf->wait_for_stripe); >>>>> } >>>>> pr_debug("%d stripes handled\n", handled); >>>> >>>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at >>>> >>>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >>>> [<0>] md_do_sync.cold+0x8ec/0x97c >>>> [<0>] md_thread+0xab/0x160 >>>> [<0>] kthread+0x11b/0x140 >>>> [<0>] ret_from_fork+0x22/0x30 >>>> >>>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 >>> >>> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync. >>> >>> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted. 
>>> >>> wait_event(mddev->recovery_wait, >>> test_bit(MD_RECOVERY_INTR,&mddev->recovery) || >>> !atomic_read(&mddev->recovery_active)); >> With that added, md3_resync goes into a loop, too. Not 100% busy, though. >> >> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >> >> [<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735 >> [<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274 >> [<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883 >> [<0>] md_thread+0xab/0x160 >> [<0>] kthread+0x11b/0x140 >> [<0>] ret_from_fork+0x22/0x30 >> >> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again. > I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request. > > wait_event_lock_irq(conf->wait_for_stripe, > (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) || > /* the previous condition */, > *(conf->hash_locks + hash)); Not knowing the system by heart, I'd probably need more than a full day to analyze that. Please let me know if this is needed. > BTW, I think there some some possible ways: > > 1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation). My first thought is that it is either needed to hold the mutex or it is not. Why should this depend on the number of available stripes? And wouldn't we get into the same situation, if another reconfiguration from userspace takes the mutex again? > 2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far). > > 3.
set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong). > > 4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe > at the end of make_request. > > 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this. > > /* resync has finished, collect result */ > mddev_unlock(mddev); > md_unregister_thread(&mddev->sync_thread); > mddev_lock(mddev); As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex? > My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find. I just get a bit suspicious that there might be something wrong with the abstraction, if it is needed to add special code here and here and here. But I'm afraid that, aside from testing, I can't be of too much help here, as I don't know all the components or details of the system that well. Best Donald > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-08 11:38 ` Donald Buczek @ 2021-02-08 14:53 ` Guoqing Jiang 2021-02-08 18:41 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-02-08 14:53 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 2/8/21 12:38, Donald Buczek wrote: >> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, >> like this. >> >> /* resync has finished, collect result */ >> mddev_unlock(mddev); >> md_unregister_thread(&mddev->sync_thread); >> mddev_lock(mddev); > > As above: While we wait for the sync thread to terminate, wouldn't it be > a problem, if another user space operation takes the mutex? I don't think other places can be blocked while holding the mutex, otherwise these places can cause a potential deadlock. Please try the above two-line change. And perhaps others have a better idea. Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-08 14:53 ` Guoqing Jiang @ 2021-02-08 18:41 ` Donald Buczek 2021-02-09 0:46 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-02-08 18:41 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoqing, On 08.02.21 15:53, Guoqing Jiang wrote: > > > On 2/8/21 12:38, Donald Buczek wrote: >>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this. >>> >>> /* resync has finished, collect result */ >>> mddev_unlock(mddev); >>> md_unregister_thread(&mddev->sync_thread); >>> mddev_lock(mddev); >> >> As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex? > > I don't think other places can be blocked while hold mutex, otherwise these places can cause potential deadlock. Please try above two lines change. And perhaps others have better idea. Yes, this works. No deadlock after >11000 seconds, (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) Best Donald > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-08 18:41 ` Donald Buczek @ 2021-02-09 0:46 ` Guoqing Jiang 2021-02-09 9:24 ` Donald Buczek 2023-03-14 13:25 ` Marc Smith 0 siblings, 2 replies; 49+ messages in thread From: Guoqing Jiang @ 2021-02-09 0:46 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 2/8/21 19:41, Donald Buczek wrote: > Dear Guoqing, > > On 08.02.21 15:53, Guoqing Jiang wrote: >> >> >> On 2/8/21 12:38, Donald Buczek wrote: >>>> 5. maybe don't hold reconfig_mutex when try to unregister >>>> sync_thread, like this. >>>> >>>> /* resync has finished, collect result */ >>>> mddev_unlock(mddev); >>>> md_unregister_thread(&mddev->sync_thread); >>>> mddev_lock(mddev); >>> >>> As above: While we wait for the sync thread to terminate, wouldn't it >>> be a problem, if another user space operation takes the mutex? >> >> I don't think other places can be blocked while hold mutex, otherwise >> these places can cause potential deadlock. Please try above two lines >> change. And perhaps others have better idea. > > Yes, this works. No deadlock after >11000 seconds, > > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) Great. I will send a formal patch with your reported-by and tested-by. Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-09 0:46 ` Guoqing Jiang @ 2021-02-09 9:24 ` Donald Buczek 2023-03-14 13:25 ` Marc Smith 1 sibling, 0 replies; 49+ messages in thread From: Donald Buczek @ 2021-02-09 9:24 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 09.02.21 01:46, Guoqing Jiang wrote: > Great. I will send a formal patch with your reported-by and tested-by. Yes, that's fine. Thanks a lot for your help! Donald > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-09 0:46 ` Guoqing Jiang 2021-02-09 9:24 ` Donald Buczek @ 2023-03-14 13:25 ` Marc Smith 2023-03-14 13:55 ` Guoqing Jiang 2023-03-15 7:52 ` Donald Buczek 1 sibling, 2 replies; 49+ messages in thread From: Marc Smith @ 2023-03-14 13:25 UTC (permalink / raw) To: Guoqing Jiang Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang <guoqing.jiang@cloud.ionos.com> wrote: > > Hi Donald, > > On 2/8/21 19:41, Donald Buczek wrote: > > Dear Guoqing, > > > > On 08.02.21 15:53, Guoqing Jiang wrote: > >> > >> > >> On 2/8/21 12:38, Donald Buczek wrote: > >>>> 5. maybe don't hold reconfig_mutex when try to unregister > >>>> sync_thread, like this. > >>>> > >>>> /* resync has finished, collect result */ > >>>> mddev_unlock(mddev); > >>>> md_unregister_thread(&mddev->sync_thread); > >>>> mddev_lock(mddev); > >>> > >>> As above: While we wait for the sync thread to terminate, wouldn't it > >>> be a problem, if another user space operation takes the mutex? > >> > >> I don't think other places can be blocked while hold mutex, otherwise > >> these places can cause potential deadlock. Please try above two lines > >> change. And perhaps others have better idea. > > > > Yes, this works. No deadlock after >11000 seconds, > > > > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > Great. I will send a formal patch with your reported-by and tested-by. 
> > Thanks, > Guoqing I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 of the patches that supposedly resolve this were applied to the stable kernels, however, one was omitted due to a regression: md: don't unregister sync_thread with reconfig_mutex held (upstream commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) I don't see any follow-up on the thread from June 8th 2022 asking for this patch to be dropped from all stable kernels since it caused a regression. The patch doesn't appear to be present in the current mainline kernel (6.3-rc2) either. So I assume this issue is still present there, or it was resolved differently and I just can't find the commit/patch. I can induce the issue by using Donald's script above which will eventually result in hangs: ... 147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds. [147948.504624] Tainted: P OE 5.4.229-esos.prod #1 [147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [147948.504625] md_test_2.sh D 0 68033 1 0x00000004 [147948.504627] Call Trace: [147948.504634] __schedule+0x4ab/0x4f3 [147948.504637] ? usleep_range+0x7a/0x7a [147948.504638] schedule+0x67/0x81 [147948.504639] schedule_timeout+0x2c/0xe5 [147948.504643] ? do_raw_spin_lock+0x2b/0x52 [147948.504644] __wait_for_common+0xc4/0x13a [147948.504647] ? wake_up_q+0x40/0x40 [147948.504649] kthread_stop+0x9a/0x117 [147948.504653] md_unregister_thread+0x43/0x4d [147948.504655] md_reap_sync_thread+0x1c/0x1d5 [147948.504657] action_store+0xc9/0x284 [147948.504658] md_attr_store+0x9f/0xb8 [147948.504661] kernfs_fop_write+0x10a/0x14c [147948.504664] vfs_write+0xa0/0xdd [147948.504666] ksys_write+0x71/0xba [147948.504668] do_syscall_64+0x52/0x60 [147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 ... [147948.504748] INFO: task md120_resync:135315 blocked for more than 122 seconds. 
[147948.504749] Tainted: P OE 5.4.229-esos.prod #1 [147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [147948.504749] md120_resync D 0 135315 2 0x80004000 [147948.504750] Call Trace: [147948.504752] __schedule+0x4ab/0x4f3 [147948.504754] ? printk+0x53/0x6a [147948.504755] schedule+0x67/0x81 [147948.504756] md_do_sync+0xae7/0xdd9 [147948.504758] ? remove_wait_queue+0x41/0x41 [147948.504759] md_thread+0x128/0x151 [147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d [147948.504762] ? md_start_sync+0xdc/0xdc [147948.504763] kthread+0xe4/0xe9 [147948.504764] ? kthread_flush_worker+0x70/0x70 [147948.504765] ret_from_fork+0x35/0x40 ... This happens on 'raid6' MD RAID arrays that initially have sync_action==resync. Any guidance would be greatly appreciated. --Marc ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-14 13:25 ` Marc Smith @ 2023-03-14 13:55 ` Guoqing Jiang 2023-03-14 14:45 ` Marc Smith 2023-03-15 3:02 ` Yu Kuai 2023-03-15 7:52 ` Donald Buczek 1 sibling, 2 replies; 49+ messages in thread From: Guoqing Jiang @ 2023-03-14 13:55 UTC (permalink / raw) To: Marc Smith Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 3/14/23 21:25, Marc Smith wrote: > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > <guoqing.jiang@cloud.ionos.com> wrote: >> Hi Donald, >> >> On 2/8/21 19:41, Donald Buczek wrote: >>> Dear Guoqing, >>> >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>>> >>>> On 2/8/21 12:38, Donald Buczek wrote: >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister >>>>>> sync_thread, like this. >>>>>> >>>>>> /* resync has finished, collect result */ >>>>>> mddev_unlock(mddev); >>>>>> md_unregister_thread(&mddev->sync_thread); >>>>>> mddev_lock(mddev); >>>>> As above: While we wait for the sync thread to terminate, wouldn't it >>>>> be a problem, if another user space operation takes the mutex? >>>> I don't think other places can be blocked while hold mutex, otherwise >>>> these places can cause potential deadlock. Please try above two lines >>>> change. And perhaps others have better idea. >>> Yes, this works. No deadlock after >11000 seconds, >>> >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) >> Great. I will send a formal patch with your reported-by and tested-by. 
>> >> Thanks, >> Guoqing > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > of the patches that supposedly resolve this were applied to the stable > kernels, however, one was omitted due to a regression: > md: don't unregister sync_thread with reconfig_mutex held (upstream > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > I don't see any follow-up on the thread from June 8th 2022 asking for > this patch to be dropped from all stable kernels since it caused a > regression. > > The patch doesn't appear to be present in the current mainline kernel > (6.3-rc2) either. So I assume this issue is still present there, or it > was resolved differently and I just can't find the commit/patch. It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap sync_thread in action_store". Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-14 13:55 ` Guoqing Jiang @ 2023-03-14 14:45 ` Marc Smith 2023-03-16 15:25 ` Marc Smith 2023-03-15 3:02 ` Yu Kuai 1 sibling, 1 reply; 49+ messages in thread From: Marc Smith @ 2023-03-14 14:45 UTC (permalink / raw) To: Guoqing Jiang Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > On 3/14/23 21:25, Marc Smith wrote: > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > <guoqing.jiang@cloud.ionos.com> wrote: > >> Hi Donald, > >> > >> On 2/8/21 19:41, Donald Buczek wrote: > >>> Dear Guoqing, > >>> > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > >>>> > >>>> On 2/8/21 12:38, Donald Buczek wrote: > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > >>>>>> sync_thread, like this. > >>>>>> > >>>>>> /* resync has finished, collect result */ > >>>>>> mddev_unlock(mddev); > >>>>>> md_unregister_thread(&mddev->sync_thread); > >>>>>> mddev_lock(mddev); > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > >>>>> be a problem, if another user space operation takes the mutex? > >>>> I don't think other places can be blocked while hold mutex, otherwise > >>>> these places can cause potential deadlock. Please try above two lines > >>>> change. And perhaps others have better idea. > >>> Yes, this works. No deadlock after >11000 seconds, > >>> > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > >> Great. I will send a formal patch with your reported-by and tested-by. 
> >> > >> Thanks, > >> Guoqing > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > of the patches that supposedly resolve this were applied to the stable > > kernels, however, one was omitted due to a regression: > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > this patch to be dropped from all stable kernels since it caused a > > regression. > > > > The patch doesn't appear to be present in the current mainline kernel > > (6.3-rc2) either. So I assume this issue is still present there, or it > > was resolved differently and I just can't find the commit/patch. > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > sync_thread in action_store". Okay, let me try applying that patch... it does not appear to be present in my 5.4.229 kernel source. Thanks. --Marc > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-14 14:45 ` Marc Smith @ 2023-03-16 15:25 ` Marc Smith 2023-03-29 0:01 ` Song Liu 0 siblings, 1 reply; 49+ messages in thread From: Marc Smith @ 2023-03-16 15:25 UTC (permalink / raw) To: Guoqing Jiang Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote: > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > > > > > On 3/14/23 21:25, Marc Smith wrote: > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > > <guoqing.jiang@cloud.ionos.com> wrote: > > >> Hi Donald, > > >> > > >> On 2/8/21 19:41, Donald Buczek wrote: > > >>> Dear Guoqing, > > >>> > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > > >>>> > > >>>> On 2/8/21 12:38, Donald Buczek wrote: > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > > >>>>>> sync_thread, like this. > > >>>>>> > > >>>>>> /* resync has finished, collect result */ > > >>>>>> mddev_unlock(mddev); > > >>>>>> md_unregister_thread(&mddev->sync_thread); > > >>>>>> mddev_lock(mddev); > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > > >>>>> be a problem, if another user space operation takes the mutex? > > >>>> I don't think other places can be blocked while hold mutex, otherwise > > >>>> these places can cause potential deadlock. Please try above two lines > > >>>> change. And perhaps others have better idea. > > >>> Yes, this works. No deadlock after >11000 seconds, > > >>> > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > >> Great. I will send a formal patch with your reported-by and tested-by. 
> > >> > > >> Thanks, > > >> Guoqing > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > > of the patches that supposedly resolve this were applied to the stable > > > kernels, however, one was omitted due to a regression: > > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > > this patch to be dropped from all stable kernels since it caused a > > > regression. > > > > > > The patch doesn't appear to be present in the current mainline kernel > > > (6.3-rc2) either. So I assume this issue is still present there, or it > > > was resolved differently and I just can't find the commit/patch. > > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > > sync_thread in action_store". > > Okay, let me try applying that patch... it does not appear to be > present in my 5.4.229 kernel source. Thanks. Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap sync_thread in action_store"' patch on top of vanilla 5.4.229 source appears to fix the problem for me -- I can't reproduce the issue with the script, and it's been running for >24 hours now. (Previously I was able to induce the issue within a matter of minutes.) > > --Marc > > > > > > Thanks, > > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-16 15:25 ` Marc Smith @ 2023-03-29 0:01 ` Song Liu 2023-08-22 21:16 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Song Liu @ 2023-03-29 0:01 UTC (permalink / raw) To: Marc Smith, Yu Kuai Cc: Guoqing Jiang, Donald Buczek, linux-raid, Linux Kernel Mailing List, it+raid On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <msmith626@gmail.com> wrote: > > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote: > > > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > > > > > > > > > On 3/14/23 21:25, Marc Smith wrote: > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > > > <guoqing.jiang@cloud.ionos.com> wrote: > > > >> Hi Donald, > > > >> > > > >> On 2/8/21 19:41, Donald Buczek wrote: > > > >>> Dear Guoqing, > > > >>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > > > >>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > > > >>>>>> sync_thread, like this. > > > >>>>>> > > > >>>>>> /* resync has finished, collect result */ > > > >>>>>> mddev_unlock(mddev); > > > >>>>>> md_unregister_thread(&mddev->sync_thread); > > > >>>>>> mddev_lock(mddev); > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > > > >>>>> be a problem, if another user space operation takes the mutex? > > > >>>> I don't think other places can be blocked while hold mutex, otherwise > > > >>>> these places can cause potential deadlock. Please try above two lines > > > >>>> change. And perhaps others have better idea. > > > >>> Yes, this works. No deadlock after >11000 seconds, > > > >>> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > > >> Great. I will send a formal patch with your reported-by and tested-by. 
> > > >> > > >> Thanks, > > >> Guoqing > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > > of the patches that supposedly resolve this were applied to the stable > > > kernels, however, one was omitted due to a regression: > > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > > this patch to be dropped from all stable kernels since it caused a > > > regression. > > > > > > The patch doesn't appear to be present in the current mainline kernel > > > (6.3-rc2) either. So I assume this issue is still present there, or it > > > was resolved differently and I just can't find the commit/patch. > > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > > > sync_thread in action_store". > > > > Okay, let me try applying that patch... it does not appear to be > > > present in my 5.4.229 kernel source. Thanks. > > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap > > sync_thread in action_store"' patch on top of vanilla 5.4.229 source > > appears to fix the problem for me -- I can't reproduce the issue with > > the script, and it's been running for >24 hours now. (Previously I was > > able to induce the issue within a matter of minutes.) Hi Marc, Could you please run your reproducer on the md-tmp branch? https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp This contains a different version of the fix by Yu Kuai. Thanks, Song ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-29 0:01 ` Song Liu @ 2023-08-22 21:16 ` Dragan Stancevic 2023-08-23 1:22 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-08-22 21:16 UTC (permalink / raw) To: song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yukuai1 On Tue, 3/28/23 17:01 Song Liu wrote: > On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <msmith626@gmail.com> wrote: > > > > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote: > > > > > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > > > > > > > > > On 3/14/23 21:25, Marc Smith wrote: > > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > > > > <guoqing.jiang@cloud.ionos.com> wrote: > > > > >> Hi Donald, > > > > >> > > > > >> On 2/8/21 19:41, Donald Buczek wrote: > > > > >>> Dear Guoqing, > > > > >>> > > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > > > > >>>> > > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: > > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > > > > >>>>>> sync_thread, like this. > > > > >>>>>> > > > > >>>>>> /* resync has finished, collect result */ > > > > >>>>>> mddev_unlock(mddev); > > > > >>>>>> md_unregister_thread(&mddev->sync_thread); > > > > >>>>>> mddev_lock(mddev); > > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > > > > >>>>> be a problem, if another user space operation takes the mutex? > > > > >>>> I don't think other places can be blocked while hold mutex, otherwise > > > > >>>> these places can cause potential deadlock. Please try above two lines > > > > >>>> change. And perhaps others have better idea. > > > > >>> Yes, this works. No deadlock after >11000 seconds, > > > > >>> > > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > > > >> Great. I will send a formal patch with your reported-by and tested-by. > > > > >> > > > > >> Thanks, > > > > >> Guoqing > > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > > > > of the patches that supposedly resolve this were applied to the stable > > > > > kernels, however, one was omitted due to a regression: > > > > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > > > > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > > > > this patch to be dropped from all stable kernels since it caused a > > > > > regression. > > > > > > > > > > The patch doesn't appear to be present in the current mainline kernel > > > > > (6.3-rc2) either. So I assume this issue is still present there, or it > > > > > was resolved differently and I just can't find the commit/patch. > > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > > > > sync_thread in action_store". > > > Okay, let me try applying that patch... it does not appear to be > > > present in my 5.4.229 kernel source. Thanks. > > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap > > sync_thread in action_store"' patch on top of vanilla 5.4.229 source > > appears to fix the problem for me -- I can't reproduce the issue with > > the script, and it's been running for >24 hours now. (Previously I was > > able to induce the issue within a matter of minutes.) > > Hi Marc, > > Could you please run your reproducer on the md-tmp branch? > > https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp > > This contains a different version of the fix by Yu Kuai. > > Thanks, > Song
Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me. But I did pull the changes from the md-tmp branch you are referring to: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp I was not totally clear on which change exactly to pull, but I pulled the following changes: 2023-03-28 md: enhance checking in md_check_recovery() md-tmp Yu Kuai 1 -7/+15 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai 1 -1/+1 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai 2 -0/+8 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu Kuai 1 -16/+45 I used to be able to reproduce the lockup within minutes, but with those changes the test system has been running for more than 120 hours. When you said a "different fix", can you confirm that I grabbed the right changes and that I need all 5 of them? And a second question: has this fix been submitted upstream yet? If so, in which kernel version? Thank you ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-22 21:16 ` Dragan Stancevic @ 2023-08-23 1:22 ` Yu Kuai 2023-08-23 15:33 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-08-23 1:22 UTC (permalink / raw) To: Dragan Stancevic, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yukuai1, yukuai (C), yangerkun Hi, 在 2023/08/23 5:16, Dragan Stancevic 写道: > On Tue, 3/28/23 17:01 Song Liu wrote: >> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith <msmith626@gmail.com> >> wr= >> ote: >> > >> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >> <msmith626@gmail.com>= >> wrote: >> > > >> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >> <guoqing.jiang@li= >> nux.dev> wrote: >> > > > >> > > > >> > > > >> > > > On 3/14/23 21:25, Marc Smith wrote: >> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >> > > > >> Hi Donald, >> > > > >> >> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >> > > > >>> Dear Guoqing, >> > > > >>> >> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >> > > > >>>> >> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister >> > > > >>>>>> sync_thread, like this. >> > > > >>>>>> >> > > > >>>>>> /* resync has finished, collect result */ >> > > > >>>>>> mddev_unlock(mddev); >> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >> > > > >>>>>> mddev_lock(mddev); >> > > > >>>>> As above: While we wait for the sync thread to terminate, >> would= >> n't it >> > > > >>>>> be a problem, if another user space operation takes the mutex? >> > > > >>>> I don't think other places can be blocked while hold mutex, >> othe= >> rwise >> > > > >>>> these places can cause potential deadlock. Please try above >> two = >> lines >> > > > >>>> change. And perhaps others have better idea. >> > > > >>> Yes, this works. 
No deadlock after >11000 seconds, >> > > > >>> >> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, >> 434, 12= >> 65, >> > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) >> > > > >> Great. I will send a formal patch with your reported-by and >> tested= >> -by. >> > > > >> >> > > > >> Thanks, >> > > > >> Guoqing >> > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks >> like 1/= >> 2 >> > > > > of the patches that supposedly resolve this were applied to the >> sta= >> ble >> > > > > kernels, however, one was omitted due to a regression: >> > > > > md: don't unregister sync_thread with reconfig_mutex held >> (upstream >> > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) >> > > > > >> > > > > I don't see any follow-up on the thread from June 8th 2022 >> asking f= >> or >> > > > > this patch to be dropped from all stable kernels since it caused a >> > > > > regression. >> > > > > >> > > > > The patch doesn't appear to be present in the current mainline >> kern= >> el >> > > > > (6.3-rc2) either. So I assume this issue is still present >> there, or= >> it >> > > > > was resolved differently and I just can't find the commit/patch. >> > > > >> > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before >> rea= >> p >> > > > sync_thread in action_store". >> > > >> > > Okay, let me try applying that patch... it does not appear to be >> > > present in my 5.4.229 kernel source. Thanks. >> > >> > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap >> > sync_thread in action_store"' patch on top of vanilla 5.4.229 source >> > appears to fix the problem for me -- I can't reproduce the issue with >> > the script, and it's been running for >24 hours now. (Previously I was >> > able to induce the issue within a matter of minutes.) >> >> Hi Marc, >> >> Could you please run your reproducer on the md-tmp branch? 
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp >> >> This contains a different version of the fix by Yu Kuai. >> >> Thanks, >> Song >> > > Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change > "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me. > > But I did pull the changes from the md-tmp branch you are referring to: > https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp > > I was not totally clear on which changes exactly to pull, but I pulled > the following changes: > 2023-03-28 md: enhance checking in md_check_recovery() Yu Kuai 1 -7/+15 > 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai 1 -1/+1 > 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22 > 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai 2 -0/+8 > 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu Kuai 1 -16/+45 > > I used to be able to reproduce the lockup within minutes, but with those > changes the test system has been running for more than 120 hours. > > When you said a "different fix", can you confirm that I grabbed the right > changes and that I need all 5 of them? Yes, you grabbed the right changes, and these patches are merged to linux-next as well. > > And second question was, has this fix been submitted upstream yet? > If so which kernel version? This fix is currently in linux-next, and will be applied to v6.6-rc1 soon. Thanks, Kuai > > Thank you > ^ permalink raw reply [flat|nested] 49+ messages in thread
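The reproducer traded back and forth in this thread boils down to toggling the array's sync_action between "check" and "idle" under load until the transition hangs. A minimal sketch of that loop follows; the sysfs path and cycle count are placeholders, and when no writable md array is present it falls back to a scratch file so the loop itself stays runnable (in that mode it obviously cannot trigger the bug):

```shell
#!/bin/sh
# Toggle "check"/"idle" on an md array to race sync-thread teardown.
# MD_SYSFS is a placeholder -- point it at e.g. /sys/block/md1/md on a
# real test machine (as root, ideally with I/O running on the array).
MD_SYSFS=${MD_SYSFS:-/sys/block/md1/md}
if [ ! -w "$MD_SYSFS/sync_action" ]; then
    # No writable md array here: use a scratch file so the loop logic
    # can still be exercised.
    MD_SYSFS=$(mktemp -d)
    : > "$MD_SYSFS/sync_action"
fi

i=0
while [ "$i" -lt 20 ]; do
    echo check > "$MD_SYSFS/sync_action"   # start a data-check
    sleep 0.1                              # let the sync thread get going
    echo idle > "$MD_SYSFS/sync_action"    # on affected kernels this write can hang
    i=$((i + 1))
done
echo "completed $i check/idle cycles"
```

On an affected kernel the `echo idle` write blocks in action_store() and the mdX_raid6 thread spins at 100% CPU, matching the reports above; on a fixed kernel the loop completes.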
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-23 1:22 ` Yu Kuai @ 2023-08-23 15:33 ` Dragan Stancevic 2023-08-24 1:18 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-08-23 15:33 UTC (permalink / raw) To: Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yukuai (C), yangerkun Hi Kuai- On 8/22/23 20:22, Yu Kuai wrote: > Hi, > > 在 2023/08/23 5:16, Dragan Stancevic 写道: >> On Tue, 3/28/23 17:01 Song Liu wrote: >>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith <msmith626@gmail.com> >>> wr= >>> ote: >>> > >>> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >>> <msmith626@gmail.com>= >>> wrote: >>> > > >>> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >>> <guoqing.jiang@li= >>> nux.dev> wrote: >>> > > > >>> > > > >>> > > > >>> > > > On 3/14/23 21:25, Marc Smith wrote: >>> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >>> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >>> > > > >> Hi Donald, >>> > > > >> >>> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >>> > > > >>> Dear Guoqing, >>> > > > >>> >>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>> > > > >>>> >>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister >>> > > > >>>>>> sync_thread, like this. >>> > > > >>>>>> >>> > > > >>>>>> /* resync has finished, collect result */ >>> > > > >>>>>> mddev_unlock(mddev); >>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >>> > > > >>>>>> mddev_lock(mddev); >>> > > > >>>>> As above: While we wait for the sync thread to terminate, >>> would= >>> n't it >>> > > > >>>>> be a problem, if another user space operation takes the >>> mutex? >>> > > > >>>> I don't think other places can be blocked while hold mutex, >>> othe= >>> rwise >>> > > > >>>> these places can cause potential deadlock. Please try above >>> two = >>> lines >>> > > > >>>> change. 
And perhaps others have better idea. >>> [...]
>>> >>> Hi Marc, >>> >>> Could you please run your reproducer on the md-tmp branch? >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp >>> >>> This contains a different version of the fix by Yu Kuai. >>> >>> Thanks, >>> Song >>> >> >> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. >> The change >> "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue >> for me. >> >> But I did pull the changes from the md-tmp branch you are refering: >> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp >> >> I was not totally clear on which change exactly to pull, but I pulled >> the following changes: >> 2023-03-28 md: enhance checking in md_check_recovery()md-tmp Yu >> Kuai 1 -7/+15 >> 2023-03-28 md: wake up 'resync_wait' at last in >> md_reap_sync_thread() Yu Kuai 1 -1/+1 >> 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22 >> 2023-03-28 md: add a mutex to synchronize idle and frozen in >> action_store() Yu Kuai 2 -0/+8 >> 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu >> Kuai 1 -16/+45 >> >> I used to be able to reproduce the lockup within minutes, but with those >> changes the test system has been running for more than 120 hours. >> >> When you said a "different fix", can you confirm that I grabbed the right >> changes and that I need all 5 of them. > > Yes, you grabbed the right changes, and these patches is merged to > linux-next as well. >> >> And second question was, has this fix been submitted upstream yet? >> If so which kernel version? > > This fix is currently in linux-next, and will be applied to v6.6-rc1 > soon. Thank you, that is great news. I'd like to see this change backported to 5.10 and 6.1, do you have any plans of backporting to any of the previous kernels? If not, I would like to try to get your changes into 5.10 and 6.1 if Greg will accept them. 
Four out of five of your changes were a straight cherry-pick into 5.10, one needed a minor conflict resolution. But I can definitely confirm that your changes fix the lockup issue on 5.10. I am now switching to 6.1 and will test the changes there too. Thanks -- Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-23 15:33 ` Dragan Stancevic @ 2023-08-24 1:18 ` Yu Kuai 2023-08-28 20:32 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-08-24 1:18 UTC (permalink / raw) To: Dragan Stancevic, Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C) Hi, 在 2023/08/23 23:33, Dragan Stancevic 写道: > Hi Kuai- > > On 8/22/23 20:22, Yu Kuai wrote: >> Hi, >> >> 在 2023/08/23 5:16, Dragan Stancevic 写道: >>> On Tue, 3/28/23 17:01 Song Liu wrote: >>>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith >>>> <msmith626@gmail.com> >>>> wr= >>>> ote: >>>> > >>>> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >>>> <msmith626@gmail.com>= >>>> wrote: >>>> > > >>>> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >>>> <guoqing.jiang@li= >>>> nux.dev> wrote: >>>> > > > >>>> > > > >>>> > > > >>>> > > > On 3/14/23 21:25, Marc Smith wrote: >>>> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >>>> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >>>> > > > >> Hi Donald, >>>> > > > >> >>>> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >>>> > > > >>> Dear Guoqing, >>>> > > > >>> >>>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>>> > > > >>>> >>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >>>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to >>>> unregister >>>> > > > >>>>>> sync_thread, like this. >>>> > > > >>>>>> >>>> > > > >>>>>> /* resync has finished, collect result */ >>>> > > > >>>>>> mddev_unlock(mddev); >>>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >>>> > > > >>>>>> mddev_lock(mddev); >>>> > > > >>>>> As above: While we wait for the sync thread to terminate, >>>> would= >>>> n't it >>>> > > > >>>>> be a problem, if another user space operation takes >>>> the mutex? 
>>>> [...]
>>> If so which kernel version? >> >> This fix is currently in linux-next, and will be applied to v6.6-rc1 >> soon. > > Thank you, that is great news. I'd like to see this change backported to > 5.10 and 6.1, do you have any plans of backporting to any of the > previous kernels? > > If not, I would like to try to get your changes into 5.10 and 6.1 if > Greg will accept them. > I don't have plans yet, so feel free to do this, I guess these patches won't be picked automatically due to the conflict. Feel free to ask if you meet any problems. Thanks, Kuai > > Four out of five of your changes were a straight cherry-pick into 5.10, > one needed a minor conflict resolution. But I can definitely confirm > that your changes fix the lockup issue on 5.10 > > I am now switching to 6.1 and will test the changes there too. > > > Thanks > > > -- > Peace can only come as a natural consequence > of universal enlightenment -Dr. Nikola Tesla > > > . > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-24 1:18 ` Yu Kuai @ 2023-08-28 20:32 ` Dragan Stancevic 2023-08-30 1:36 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-08-28 20:32 UTC (permalink / raw) To: Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C) Hi Kuai, On 8/23/23 20:18, Yu Kuai wrote: > Hi, > > 在 2023/08/23 23:33, Dragan Stancevic 写道: >> Hi Kuai- >> >> On 8/22/23 20:22, Yu Kuai wrote: >>> Hi, >>> >>> 在 2023/08/23 5:16, Dragan Stancevic 写道: >>>> On Tue, 3/28/23 17:01 Song Liu wrote: >>>>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith >>>>> <msmith626@gmail.com> >>>>> wr= >>>>> ote: >>>>> > >>>>> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >>>>> <msmith626@gmail.com>= >>>>> wrote: >>>>> > > >>>>> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >>>>> <guoqing.jiang@li= >>>>> nux.dev> wrote: >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > On 3/14/23 21:25, Marc Smith wrote: >>>>> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >>>>> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >>>>> > > > >> Hi Donald, >>>>> > > > >> >>>>> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >>>>> > > > >>> Dear Guoqing, >>>>> > > > >>> >>>>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>>>> > > > >>>> >>>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >>>>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to >>>>> unregister >>>>> > > > >>>>>> sync_thread, like this. >>>>> > > > >>>>>> >>>>> > > > >>>>>> /* resync has finished, collect result */ >>>>> > > > >>>>>> mddev_unlock(mddev); >>>>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >>>>> > > > >>>>>> mddev_lock(mddev); >>>>> > > > >>>>> As above: While we wait for the sync thread to >>>>> terminate, >>>>> would= >>>>> n't it >>>>> > > > >>>>> be a problem, if another user space operation takes >>>>> the mutex? 
>>>>> [...]
>>>> >>>> And second question was, has this fix been submitted upstream yet? >>>> If so which kernel version? >>> >>> This fix is currently in linux-next, and will be applied to v6.6-rc1 >>> soon. >> >> Thank you, that is great news. I'd like to see this change backported >> to 5.10 and 6.1, do you have any plans of backporting to any of the >> previous kernels? >> >> If not, I would like to try to get your changes into 5.10 and 6.1 if >> Greg will accept them. >> > > I don't have plans yet, so feel free to do this, I guess these patches > won't be picked automatically due to the conflict. Feel free to ask if > you meet any problems. Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it. It seems that 6.1 has some other code that prevents this from happening. On 5.10 I can reproduce it within minutes to an hour. > > Thanks, > Kuai > >> >> Four out of five of your changes were a straight cherry-pick into >> 5.10, one needed a minor conflict resolution. But I can definitely >> confirm that your changes fix the lockup issue on 5.10 >> >> I am now switching to 6.1 and will test the changes there too. >> >> >> Thanks >> >> >> -- >> Peace can only come as a natural consequence >> of universal enlightenment -Dr. Nikola Tesla >> >> >> . >> > -- Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-28 20:32 ` Dragan Stancevic @ 2023-08-30 1:36 ` Yu Kuai 2023-09-05 3:50 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-08-30 1:36 UTC (permalink / raw) To: Dragan Stancevic, Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C) Hi, On 2023/08/29 4:32, Dragan Stancevic wrote: > Just a followup on 6.1 testing. I tried reproducing this problem for 5 > days with 6.1.42 kernel without your patches and I was not able to > reproduce it. > > It seems that 6.1 has some other code that prevents this from happening. > I see that there are lots of patches for raid456 between 5.10 and 6.1; however, I remember that I used to reproduce the deadlock after 6.1, and it's true that it's not easy to reproduce, see below: https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ My guess is that 6.1 is harder to reproduce than 5.10 due to some changes inside raid456. By the way, raid10 had a similar deadlock, and it can be fixed the same way, so it makes sense to backport these patches. https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com Thanks, Kuai > On 5.10 I can reproduce it within minutes to an hour. > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-30 1:36 ` Yu Kuai @ 2023-09-05 3:50 ` Yu Kuai 2023-09-05 13:54 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-09-05 3:50 UTC (permalink / raw) To: Yu Kuai, Dragan Stancevic, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun Hi, On 2023/08/30 9:36, Yu Kuai wrote: > Hi, > > On 2023/08/29 4:32, Dragan Stancevic wrote: > >> Just a followup on 6.1 testing. I tried reproducing this problem for 5 >> days with 6.1.42 kernel without your patches and I was not able to >> reproduce it. oops, I forgot that you need to backport this patch first to reproduce this problem: https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/ The patch fixes the deadlock as well, but it introduces some regressions. Thanks, Kuai >> >> It seems that 6.1 has some other code that prevents this from happening. >> > > I see that there are lots of patches for raid456 between 5.10 and 6.1; > however, I remember that I used to reproduce the deadlock after 6.1, and > it's true that it's not easy to reproduce, see below: > > https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ > > > My guess is that 6.1 is harder to reproduce than 5.10 due to some > changes inside raid456. > > By the way, raid10 had a similar deadlock, and it can be fixed the same > way, so it makes sense to backport these patches. > > https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com > > Thanks, > Kuai > > >> On 5.10 I can reproduce it within minutes to an hour. >> > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-09-05 3:50 ` Yu Kuai @ 2023-09-05 13:54 ` Dragan Stancevic 2023-09-13 9:08 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-09-05 13:54 UTC (permalink / raw) To: Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun On 9/4/23 22:50, Yu Kuai wrote: > Hi, > > 在 2023/08/30 9:36, Yu Kuai 写道: >> Hi, >> >> 在 2023/08/29 4:32, Dragan Stancevic 写道: >> >>> Just a followup on 6.1 testing. I tried reproducing this problem for >>> 5 days with 6.1.42 kernel without your patches and I was not able to >>> reproduce it. > > oops, I forgot that you need to backport this patch first to reporduce > this problem: > > https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/ > > The patch fix the deadlock as well, but it introduce some regressions. Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :) > > Thanks, > Kuai > >>> >>> It seems that 6.1 has some other code that prevents this from happening. >>> >> >> I see that there are lots of patches for raid456 between 5.10 and 6.1, >> however, I remember that I used to reporduce the deadlock after 6.1, and >> it's true it's not easy to reporduce, see below: >> >> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ >> >> My guess is that 6.1 is harder to reporduce than 5.10 due to some >> changes inside raid456. >> >> By the way, raid10 had a similiar deadlock, and can be fixed the same >> way, so it make sense to backport these patches. >> >> https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com >> >> Thanks, >> Kuai >> >> >>> On 5.10 I can reproduce it within minutes to an hour. >>> >> >> . 
>> > -- Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-09-05 13:54 ` Dragan Stancevic @ 2023-09-13 9:08 ` Donald Buczek 2023-09-13 14:16 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2023-09-13 9:08 UTC (permalink / raw) To: Dragan Stancevic, Yu Kuai, song Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun On 9/5/23 3:54 PM, Dragan Stancevic wrote: > On 9/4/23 22:50, Yu Kuai wrote: >> Hi, >> >> On 2023/08/30 9:36, Yu Kuai wrote: >>> Hi, >>> >>> On 2023/08/29 4:32, Dragan Stancevic wrote: >>> >>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it. >> >> oops, I forgot that you need to backport this patch first to reproduce >> this problem: >> >> https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/ >> >> The patch fixes the deadlock as well, but it introduces some regressions. We've just got an unplanned lock-up on a "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we used the patch I originally proposed with multiple kernel versions [1]. But this no longer seems to be valid, or maybe it's even destructive in combination with the other changes. But I totally lost track of the further development. As I understand it, there are patches queued up in mainline which should fix the problem and might go into 6.1, too, but they have not landed there yet? Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I could probably fix my problem and also test those patches for you on production systems with a load which tends to run into that problem easily?
Thanks Donald [1]: https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de/ > Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :) > > > >> >> Thanks, >> Kuai >> >>>> >>>> It seems that 6.1 has some other code that prevents this from happening. >>>> >>> >>> I see that there are lots of patches for raid456 between 5.10 and 6.1, >>> however, I remember that I used to reporduce the deadlock after 6.1, and >>> it's true it's not easy to reporduce, see below: >>> >>> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ >>> >>> My guess is that 6.1 is harder to reporduce than 5.10 due to some >>> changes inside raid456. >>> >>> By the way, raid10 had a similiar deadlock, and can be fixed the same >>> way, so it make sense to backport these patches. >>> >>> https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com >>> >>> Thanks, >>> Kuai >>> >>> >>>> On 5.10 I can reproduce it within minutes to an hour. >>>> >>> >>> . >>> >> > -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Dragan Stancevic @ 2023-09-13 14:16 UTC
To: Donald Buczek, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

Hi Donald-

On 9/13/23 04:08, Donald Buczek wrote:
> On 9/5/23 3:54 PM, Dragan Stancevic wrote:
>> On 9/4/23 22:50, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2023/08/30 9:36, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2023/08/29 4:32, Dragan Stancevic wrote:
>>>>
>>>>> Just a followup on 6.1 testing. I tried reproducing this problem for
>>>>> 5 days with the 6.1.42 kernel without your patches and I was not able
>>>>> to reproduce it.
>>>
>>> Oops, I forgot that you need to backport this patch first to reproduce
>>> this problem:
>>>
>>> https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/
>>>
>>> The patch fixes the deadlock as well, but it introduces some
>>> regressions.
>
> We've just got an unplanned lock-up on a "check" to "idle" transition
> with 6.1.52 after a few hours on a backup server. For the last 2 1/2
> years we used the patch I originally proposed with multiple kernel
> versions [1]. But this no longer seems to be valid, or maybe it's even
> destructive in combination with the other changes.
>
> But I totally lost track of the further development. As I understood it,
> there are patches queued up in mainline which should fix the problem and
> which might go into 6.1, too, but have not landed there yet?
>
> Can anyone give me exact references to the patches I'd need to apply to
> 6.1.52, so that I could probably fix my problem and also test those
> patches for you on production systems with a load which tends to run
> into that problem easily?

Here is a list of changes for 6.1:

e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
f71209b1f21c md: enhance checking in md_check_recovery()
753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

You can get them from the following tree:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

> [1]: https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de/
>
>> Ha, jinx :) [...]

--
Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla
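[Editor's note] Dragan's list above is in git-log order (newest first), so a backport would apply it bottom-up. A minimal sketch that only prints the commands instead of running them — the remote name "linux-next" and the `-x` flag choice are assumptions, not from the mail:

```shell
# Sketch only: print (do not run) the commands that would backport the
# 6.1 list above onto a v6.1.52 branch. Assumption: the list is
# newest-first, so we cherry-pick it in reverse (oldest first).
print_backport_cmds() {
    echo "git fetch linux-next"
    for c in \
        a865b96c513b 64e5e09afc14 6f56f0c4f124 130443d60b1b \
        753260ed0b46 f71209b1f21c e5e9b9cb71a0
    do
        echo "git cherry-pick -x $c"
    done
}

print_backport_cmds
```

Piping the output through `sh` (after reviewing it) would perform the actual backport.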
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-14 6:03 UTC
To: Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

On 9/13/23 16:16, Dragan Stancevic wrote:
> Hi Donald-
>
> On 9/13/23 04:08, Donald Buczek wrote:
>> [...]
>
> Here is a list of changes for 6.1:
>
> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
> f71209b1f21c md: enhance checking in md_check_recovery()
> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

Thanks!

I've put these patches on v6.1.52. A few hours ago I started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes. It went through ~400 iterations till now. No lock-ups so far.

LGTM!

Donald

buczek@done:~$ dmesg|grep "data-check of RAID array"|wc
    393    2820   18864
buczek@done:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md2 : active raid6 sdc[0] sdo[15] sdn[14] sdm[13] sdl[12] sdk[11] sdj[10] sdi[9] sdh[8] sdg[7] sdf[6] sde[5] sdd[4] sdr[3] sdq[2] sdp[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=========>...........]  check = 47.1% (3681799128/7813894144) finish=671.8min speed=102496K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

md1 : active raid6 sdaa[0] sdz[15] sdy[14] sdx[13] sdw[12] sdv[11] sdu[10] sdt[16] sds[8] sdah[7] sdag[17] sdaf[5] sdae[4] sdad[3] sdac[2] sdab[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=======>.............]  check = 38.5% (3009484896/7813894144) finish=811.0min speed=98720K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdai[0] sdax[15] sdaw[16] sdav[13] sdau[12] sdat[11] sdas[10] sdar[9] sdaq[8] sdap[7] sdao[6] sdan[17] sdam[4] sdal[3] sdak[2] sdaj[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [========>............]  check = 42.3% (3311789940/7813894144) finish=911.9min speed=82272K/sec
      bitmap: 6/59 pages [24KB], 65536KB chunk

unused devices: <none>

> You can get them from the following tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
>> [...]

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
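[Editor's note] The stress loop Donald describes (idle->check->idle on three arrays every 6 minutes) can be sketched roughly as below. The device names and cycle length come from the mail; the `SYSFS`/`CYCLE_SECONDS` overrides are additions so the functions can be exercised outside a real `/sys`:

```shell
# Sketch of the idle->check->idle stress loop described above.
# Assumptions: devices md0 md1 md2, 6-minute cycle; SYSFS and
# CYCLE_SECONDS are overridable purely for testing purposes.
SYSFS="${SYSFS:-/sys/devices/virtual/block}"
DEVICES="${DEVICES:-md0 md1 md2}"

set_sync_action() {   # $1 = md device, $2 = "check" or "idle"
    echo "$2" > "$SYSFS/$1/md/sync_action"
}

stress_cycle() {      # start a check on every device, wait, stop it again
    for dev in $DEVICES; do set_sync_action "$dev" check; done
    sleep "${CYCLE_SECONDS:-360}"
    for dev in $DEVICES; do set_sync_action "$dev" idle; done
}
```

Running `while :; do stress_cycle; done` as root reproduces the transition pattern; note that writing "idle" interrupts a check that is still running, which is exactly the transition under suspicion here.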
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-17 8:55 UTC
To: Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

On 9/14/23 08:03, Donald Buczek wrote:
> On 9/13/23 16:16, Dragan Stancevic wrote:
>> Hi Donald-
>> [...]
>> Here is a list of changes for 6.1:
>> [...]
>
> Thanks!
>
> I've put these patches on v6.1.52. A few hours ago I started a script
> which transitions the three md-devices of a very active backup server
> through idle->check->idle every 6 minutes. It went through ~400
> iterations till now. No lock-ups so far.

Oh dear, looks like the deadlock problem is _not_ fixed with these patches. We've had a lockup again after ~3 days of operation. Again, the `echo idle > $sys/md/sync_action` is hanging:

# # /proc/70554/task/70554: mdcheck.safe : /bin/bash /usr/bin/mdcheck.safe --continue --duration 06:00
# cat /proc/70554/task/70554/stack

[<0>] action_store+0x17f/0x390
[<0>] md_attr_store+0x83/0xf0
[<0>] kernfs_fop_write_iter+0x117/0x1b0
[<0>] vfs_write+0x2ce/0x400
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

And everything else going to that specific raid (md0) is dead, too. No task is busy-looping.

So as it looks now, we can't go from 5.15.x to 6.1.x as we would like to do. These patches don't fix the problem, and our own patch no longer works with 6.1.

Unfortunately, this happened on a production system which I need to reboot and which is not available for further analysis. We'd need to reproduce the problem on a dedicated machine to really work on it.

Here's some more possibly interesting procfs output and some examples of tasks.

/sys/devices/virtual/block/md0/inflight : 0 3936

# /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdae[0] sdad[15] sdac[14] sdab[13] sdaa[12] sdz[11] sdy[10] sdx[9] sdw[8] sdv[7] sdu[6] sdt[5] sds[4] sdah[3] sdag[2] sdaf[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdc[0] sdr[17] sdq[16] sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [===>.................]  check = 15.9% (1242830396/7813894144) finish=14788.4min speed=7405K/sec
      bitmap: 53/59 pages [212KB], 65536KB chunk

unused devices: <none>

# # /proc/66024/task/66024: md0_resync :
# cat /proc/66024/task/66024/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_sync_request+0x38b/0x3b0
[<0>] md_do_sync.cold+0x40c/0x985
[<0>] md_thread+0xb1/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/939/task/939: md0_raid6 :
# cat /proc/939/task/939/stack

[<0>] md_thread+0x12d/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/1228/task/1228: xfsaild/md0 :
# cat /proc/1228/task/1228/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] _xfs_buf_ioapply+0x26c/0x420
[<0>] __xfs_buf_submit+0x64/0x1d0
[<0>] xfs_buf_delwri_submit_buffers+0xc5/0x1e0
[<0>] xfsaild+0x2a0/0x880
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49747/task/49747: kworker/24:2+xfs-inodegc/md0 :
# cat /proc/49747/task/49747/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_btree_read_buf_block.constprop.0+0x9a/0xd0
[<0>] xfs_btree_lookup_get_block+0x97/0x170
[<0>] xfs_btree_lookup+0xc4/0x4a0
[<0>] xfs_difree_finobt+0x62/0x250
[<0>] xfs_difree+0x130/0x1c0
[<0>] xfs_ifree+0x86/0x510
[<0>] xfs_inactive_ifree.isra.0+0xa2/0x1c0
[<0>] xfs_inactive+0xf8/0x170
[<0>] xfs_inodegc_worker+0x90/0x140
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49844/task/49844: kworker/30:3+xfs-sync/md0 :
# cat /proc/49844/task/49844/stack

[<0>] __flush_workqueue+0x10e/0x390
[<0>] xlog_cil_push_now.isra.0+0x25/0x90
[<0>] xlog_cil_force_seq+0x7c/0x240
[<0>] xfs_log_force+0x83/0x240
[<0>] xfs_log_worker+0x3b/0xd0
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/52646/task/52646: kworker/u263:2+xfs-cil/md0 :
# cat /proc/52646/task/52646/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] xlog_state_release_iclog+0xf6/0x1d0
[<0>] xlog_write_get_more_iclog_space+0x79/0xf0
[<0>] xlog_write+0x334/0x3b0
[<0>] xlog_cil_push_work+0x501/0x740
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/52753/task/52753: rm : rm -rf /project/pbackup_gone/data/C8029/home_Cyang/home_Cyang:202306011248:C3019.BEING_DELETED
# cat /proc/52753/task/52753/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_read_agi+0x98/0x140
[<0>] xfs_iunlink+0x63/0x1f0
[<0>] xfs_remove+0x280/0x3a0
[<0>] xfs_vn_unlink+0x53/0xa0
[<0>] vfs_rmdir.part.0+0x5e/0x1e0
[<0>] do_rmdir+0x15c/0x1c0
[<0>] __x64_sys_unlinkat+0x4b/0x60
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

Best

Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-24 14:35 UTC
To: Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

On 9/17/23 10:55, Donald Buczek wrote:
> On 9/14/23 08:03, Donald Buczek wrote:
>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>> Hi Donald-
>>> [...]
>>> Here is a list of changes for 6.1:
>>> [...]
>>
>> Thanks!
>>
>> I've put these patches on v6.1.52. A few hours ago I started a script
>> which transitions the three md-devices of a very active backup server
>> through idle->check->idle every 6 minutes. It went through ~400
>> iterations till now. No lock-ups so far.
>
> Oh dear, looks like the deadlock problem is _not_ fixed with these
> patches.

Some more info after another incident:

- We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches)
- The symptoms are not exactly the same as with the original year-old problem. Differences:
  - mdX_raid6 is NOT busy-looping
  - /sys/devices/virtual/block/mdX/md/array_state says "active", not "write pending"
  - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
  - After hours in the deadlock state, the system resumed operation when a script of mine read(!) lots of sysfs files.
- But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs, as does all I/O on the raid.

The fact that we didn't hit the problem for many months on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131.

We'll try to reproduce the problem on a test machine for analysis, but this may take time (vacation imminent, for one...).

But it's not like these patches caused the problem. And maybe they _did_ fix the original problem, as we didn't hit that one.

Best

Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Yu Kuai @ 2023-09-25 1:11 UTC
To: Donald Buczek, Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C)

Hi,

On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> [...]
>>
>> Oh dear, looks like the deadlock problem is _not_ fixed with these
>> patches.
>
> Some more info after another incident:
>
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
>   of the above patches)
> - The symptoms are not exactly the same as with the original year-old
>   problem. Differences:
>   - mdX_raid6 is NOT busy-looping
>   - /sys/devices/virtual/block/mdX/md/array_state says "active", not
>     "write pending"
>   - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
>     not resolve the deadlock
>   - After hours in the deadlock state, the system resumed operation when
>     a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle >
>   /sys/devices/virtual/block/mdX/md/sync_action` hangs, as does all I/O
>   on the raid.
>
> The fact that we didn't hit the problem for many months on 5.15.94 might
> hint that it was introduced between 5.15.94 and 5.15.131.
>
> We'll try to reproduce the problem on a test machine for analysis, but
> this may take time (vacation imminent, for one...).
>
> But it's not like these patches caused the problem. And maybe they _did_
> fix the original problem, as we didn't hit that one.

Sorry for the late reply, yes, this looks like a different problem. I'm pretty confident that the original problem is fixed, since echo idle/frozen no longer holds the 'reconfig_mutex' lock to wait for the sync_thread to be done.

I'll check patches between 5.15.94 and 5.15.131.

Thanks,
Kuai

> Best
>
> Donald
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-25 9:11 UTC
To: Yu Kuai, Dragan Stancevic, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C)

On 9/25/23 03:11, Yu Kuai wrote:
> Hi,
>
> On 2023/09/24 22:35, Donald Buczek wrote:
>> [...]
>>
>> But it's not like these patches caused the problem. And maybe they
>> _did_ fix the original problem, as we didn't hit that one.
>
> Sorry for the late reply, yes, this looks like a different problem. I'm
> pretty confident that the original problem is fixed, since echo
> idle/frozen no longer holds the 'reconfig_mutex' lock to wait for the
> sync_thread to be done.
>
> I'll check patches between 5.15.94 and 5.15.131.

We've got another event today. Some more information to save you work. I'm sorry, this comes dripping in, but as I said, currently we can't reproduce it and hit it on production machines only, where we have limited time to analyze:

* In the last two events, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` was not even executing. This is not a trigger, but was a random victim when it happened the first time. This deceived me into believing this was some variation of the old problem.

* It's not filesystem-related: yesterday `blkid -o value -s LABEL /dev/md1` was hanging, too, and today, for example, `df`.

* /sys/devices/virtual/block/md0/inflight today was (frozen at) "2 579".

* iotop showed no disk activity (on the raid) at all. Only a single member device had activity from time to time (usually after ~30 seconds, but sometimes after a few seconds) with usually 1-4 tps, but sometimes more, max 136 tps.

* As I said, I use a script to take a snapshot of various /sys and /proc information, and running this script resolved the deadlock twice.

* The recorded stack traces of mdX_raid6 of the hanging raid recorded in the two events were

[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

and

[<0>] md_super_wait+0x72/0xa0
[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

But note that these probably were taken after the previous commands in the script had already unfrozen the system. Today I've manually looked at the stack while the system was still frozen, and it was just

[<0>] md_thread+0x122/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

* Because I knew that my script seems to unblock the system, I've run it slowly line by line to see what actually unfreezes the system. There is one loop which takes "comm", "cmdline" and "stack" of all threads:

    for task in /proc/*/task/*; do
        echo "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
        cmd cat $task/stack
    done

I've added a few "read"s to single-step it. Unfortunately, when it came to the 64 nfsd threads, I got a bit impatient and hit "return" faster than I should have, and when the unfreeze happened, I couldn't say exactly where it was triggered. But it must have been somewhere in this tail:

# # /proc/1299/task/1299: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/13/task/13: ksoftirqd/0

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/130/task/130: cpuhp/22

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/1300/task/1300: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

## (3 more repetitions of other nfsd threads with exactly the same stack skipped here) ##

So it appears that possibly a `cat /proc/PID/stack` of a "ksoftirqd" or (maybe) a "cpuhp" thread unblocks the system. "nfsd" seems unlikely, as there shouldn't have been, and wasn't, anything NFS-mounted from this system.

Conclusion: This is probably not related to mdraid at all and might be a problem of the block layer or some infrastructure subsystem. Do you agree?

Best

Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
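[Editor's note] Donald's snapshot loop can be written out as a self-contained sketch. The `PROC` override, the `[0-9]*` pid filter, and the `2>/dev/null` guards are additions for illustration (the original uses a `cmd` wrapper that is not shown in the mail); reading /proc/PID/stack requires root:

```shell
# Dump comm, cmdline and kernel stack of every thread, in the style of
# the snapshot loop quoted above. PROC defaults to /proc but can be
# pointed at a fake tree for testing.
PROC="${PROC:-/proc}"

dump_task_stacks() {
    for task in "$PROC"/[0-9]*/task/*; do
        [ -d "$task" ] || continue
        printf '# # %s: %s : %s\n' "$task" \
            "$(cat "$task/comm" 2>/dev/null)" \
            "$(xargs -0 echo < "$task/cmdline" 2>/dev/null)"
        cat "$task/stack" 2>/dev/null
    done
}
```

Run as `dump_task_stacks > snapshot.txt` (as root) to capture one snapshot of all kernel stacks.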
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Yu Kuai @ 2023-09-25 9:32 UTC
To: Donald Buczek, Yu Kuai, Dragan Stancevic, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C)

Hi,

On 2023/09/25 17:11, Donald Buczek wrote:
> On 9/25/23 03:11, Yu Kuai wrote:
>> [...]
>>
>> Sorry for the late reply, yes, this looks like a different problem. I'm
>> pretty confident that the original problem is fixed, since echo
>> idle/frozen no longer holds the 'reconfig_mutex' lock to wait for the
>> sync_thread to be done.
>>
>> I'll check patches between 5.15.94 and 5.15.131.
>
> We've got another event today. Some more information to save you work.
> I'm sorry, this comes dripping in, but as I said, currently we can't
> reproduce it and hit it on production machines only, where we have
> limited time to analyze:

There is a way to clarify whether IO is stuck in the underlying disks. Once the problem is triggered and there is no disk activity:

cat /sys/kernel/debug/block/[disk]/hctx*/sched_tags | grep busy
cat /sys/kernel/debug/block/[disk]/hctx*/tags | grep busy

If busy is not 0, IO is stuck in the underlying disk and this problem is not related to raid; otherwise raid doesn't issue any IO to the underlying disks, and the problem is related to raid.
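[Editor's note] Yu Kuai's check can be wrapped in a small helper that sums the busy counts for one disk. The `busy=` line format of the blk-mq debugfs `tags`/`sched_tags` files is assumed here, and `DEBUGFS` is overridable only so the helper can be tested outside a real debugfs:

```shell
# Sum the blk-mq "busy=" counters for one disk, per the suggestion above:
# a non-zero total means requests are stuck in the underlying device.
DEBUGFS="${DEBUGFS:-/sys/kernel/debug/block}"

busy_total() {   # $1 = disk name, e.g. sdc
    cat "$DEBUGFS/$1"/hctx*/tags "$DEBUGFS/$1"/hctx*/sched_tags 2>/dev/null \
        | sed -n 's/^busy=//p' \
        | awk '{ s += $1 } END { print s + 0 }'
}
```

Typical use on a hung array would be something like `for d in sdc sdd sde; do echo "$d $(busy_total $d)"; done` (as root, with debugfs mounted).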
> [...]
>
> * The recorded stack traces of mdX_raid6 of the hanging raid recorded
> in the two events were
>
> [<0>] md_bitmap_unplug.part.0+0xce/0x100
> [<0>] raid5d+0xe4/0x5a0
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> and
>
> [<0>] md_super_wait+0x72/0xa0
> [<0>] md_bitmap_unplug.part.0+0xce/0x100
> [<0>] raid5d+0xe4/0x5a0
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30

The stacks above show that raid issued bitmap IO to the underlying disks and is waiting for that IO to be done. Unless bitmap IO is broken in raid, this problem should not be related to raid; the debugfs counters above can help to clarify this.

Thanks,
Kuai

> [...]
>
> Conclusion: This is probably not related to mdraid at all and might be
> a problem of the block layer or some infrastructure subsystem. Do you
> agree?
>
> Best
>
> Donald
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-14 13:55 ` Guoqing Jiang
  2023-03-14 14:45 ` Marc Smith
@ 2023-03-15  3:02 ` Yu Kuai
  2023-03-15  9:30 ` Guoqing Jiang
  1 sibling, 1 reply; 49+ messages in thread
From: Yu Kuai @ 2023-03-15 3:02 UTC (permalink / raw)
To: Guoqing Jiang, Marc Smith
Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid, yukuai (C)

On 2023/03/14 21:55, Guoqing Jiang wrote:
>
>
> On 3/14/23 21:25, Marc Smith wrote:
>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> <guoqing.jiang@cloud.ionos.com> wrote:
>>> Hi Donald,
>>>
>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>> Dear Guoqing,
>>>>
>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>
>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>> sync_thread, like this.
>>>>>>>
>>>>>>>          /* resync has finished, collect result */
>>>>>>>          mddev_unlock(mddev);
>>>>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>>>>          mddev_lock(mddev);
>>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>>> be a problem, if another user space operation takes the mutex?
>>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>>> these places can cause potential deadlock. Please try above two lines
>>>>> change. And perhaps others have better idea.
>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>
>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>> Great. I will send a formal patch with your reported-by and tested-by.
>>> >>> Thanks, >>> Guoqing >> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 >> of the patches that supposedly resolve this were applied to the stable >> kernels, however, one was omitted due to a regression: >> md: don't unregister sync_thread with reconfig_mutex held (upstream >> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) Hi, Guoqing, Just borrow this thread to discuss, I think this commit might have problem in some corner cases: t1: t2: action_store mddev_lock if (mddev->sync_thread) mddev_unlock md_unregister_thread md_check_recovery set_bit(MD_RECOVERY_RUNNING, &mddev->recovery) queue_work(md_misc_wq, &mddev->del_work) mddev_lock_nointr md_reap_sync_thread // clear running mddev_lock t3: md_start_sync // running is not set Our test report a problem that can be cause by this in theory, by we can't be sure for now... We thought about how to fix this, instead of calling md_register_thread() here to wait for sync_thread to be done synchronisely, we do this asynchronously like what md_set_readonly() and do_md_stop() does. What do you think? Thanks, Kuai >> >> I don't see any follow-up on the thread from June 8th 2022 asking for >> this patch to be dropped from all stable kernels since it caused a >> regression. >> >> The patch doesn't appear to be present in the current mainline kernel >> (6.3-rc2) either. So I assume this issue is still present there, or it >> was resolved differently and I just can't find the commit/patch. > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > sync_thread in action_store". > > Thanks, > Guoqing > . > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-15  3:02 ` Yu Kuai
@ 2023-03-15  9:30 ` Guoqing Jiang
  2023-03-15  9:53 ` Yu Kuai
  0 siblings, 1 reply; 49+ messages in thread
From: Guoqing Jiang @ 2023-03-15 9:30 UTC (permalink / raw)
To: Yu Kuai, Marc Smith
Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

On 3/15/23 11:02, Yu Kuai wrote:
>
>
> On 2023/03/14 21:55, Guoqing Jiang wrote:
>>
>>
>> On 3/14/23 21:25, Marc Smith wrote:
>>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>> <guoqing.jiang@cloud.ionos.com> wrote:
>>>> Hi Donald,
>>>>
>>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>>> Dear Guoqing,
>>>>>
>>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>>
>>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>>> sync_thread, like this.
>>>>>>>>
>>>>>>>>          /* resync has finished, collect result */
>>>>>>>>          mddev_unlock(mddev);
>>>>>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>>>>>          mddev_lock(mddev);
>>>>>>> As above: While we wait for the sync thread to terminate,
>>>>>>> wouldn't it
>>>>>>> be a problem, if another user space operation takes the mutex?
>>>>>> I don't think other places can be blocked while hold mutex,
>>>>>> otherwise
>>>>>> these places can cause potential deadlock. Please try above two
>>>>>> lines
>>>>>> change. And perhaps others have better idea.
>>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>>
>>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>> >>>> Thanks, >>>> Guoqing >>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 >>> of the patches that supposedly resolve this were applied to the stable >>> kernels, however, one was omitted due to a regression: >>> md: don't unregister sync_thread with reconfig_mutex held (upstream >>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > Hi, Guoqing, > > Just borrow this thread to discuss, I think this commit might have > problem in some corner cases: > > t1: t2: > action_store > mddev_lock > if (mddev->sync_thread) > mddev_unlock > md_unregister_thread > md_check_recovery > set_bit(MD_RECOVERY_RUNNING, &mddev->recovery) > queue_work(md_misc_wq, &mddev->del_work) > mddev_lock_nointr > md_reap_sync_thread > // clear running > mddev_lock > > t3: > md_start_sync > // running is not set What does 'running' mean? MD_RECOVERY_RUNNING? > Our test report a problem that can be cause by this in theory, by we > can't be sure for now... I guess you tried to describe racy between action_store -> md_register_thread and md_start_sync -> md_register_thread Didn't you already fix them in the series? [PATCH -next 0/5] md: fix uaf for sync_thread Sorry, I didn't follow the problem and also your series, I might try your test with latest mainline kernel if the test is available somewhere. > We thought about how to fix this, instead of calling > md_register_thread() here to wait for sync_thread to be done > synchronisely, IMO, md_register_thread just create and wake a thread, not sure why it waits for sync_thread. > we do this asynchronously like what md_set_readonly() and > do_md_stop() does. Still, I don't have clear picture about the problem, so I can't judge it. Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-15  9:30 ` Guoqing Jiang
@ 2023-03-15  9:53 ` Yu Kuai
  0 siblings, 0 replies; 49+ messages in thread
From: Yu Kuai @ 2023-03-15 9:53 UTC (permalink / raw)
To: Guoqing Jiang, Marc Smith
Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid, yukuai (C)

Hi,

On 2023/03/15 17:30, Guoqing Jiang wrote:
>
>> Just borrow this thread to discuss, I think this commit might have
>> problem in some corner cases:
>>
>> t1:                              t2:
>> action_store
>>  mddev_lock
>>  if (mddev->sync_thread)
>>   mddev_unlock
>>   md_unregister_thread
>>                                  md_check_recovery
>>                                   set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>>                                   queue_work(md_misc_wq, &mddev->del_work)
>>   mddev_lock_nointr
>>   md_reap_sync_thread
>>   // clear running
>>  mddev_lock
>>
>> t3:
>> md_start_sync
>> // running is not set
>
> What does 'running' mean? MD_RECOVERY_RUNNING?
>
>> Our test report a problem that can be cause by this in theory, by we
>> can't be sure for now...
>
> I guess you tried to describe racy between
>
> action_store -> md_register_thread
>
> and
>
> md_start_sync -> md_register_thread
>
> Didn't you already fix them in the series?
>
> [PATCH -next 0/5] md: fix uaf for sync_thread
>
> Sorry, I didn't follow the problem and also your series, I might try your
> test with latest mainline kernel if the test is available somewhere.
>
>> We thought about how to fix this, instead of calling
>> md_register_thread() here to wait for sync_thread to be done
>> synchronisely,
>
> IMO, md_register_thread just create and wake a thread, not sure why it
> waits for sync_thread.
>
>> we do this asynchronously like what md_set_readonly() and do_md_stop()
>> does.
>
> Still, I don't have clear picture about the problem, so I can't judge it.
>

Sorry that I didn't explain the problem clearly.
Let me explain the problem we met first:

1) raid10d is waiting for sync_thread to stop:

 raid10d
  md_unregister_thread
   kthread_stop

2) sync_thread is waiting for io to finish:

 md_do_sync
  wait_event(... atomic_read(&mddev->recovery_active) == 0)

3) io is waiting for raid10d to finish (online crash found 2 io in conf->retry_list)

Additional information from online crash:

mddev->recovery = 29, // DONE, RUNNING, INTR are set

PID: 138293  TASK: ffff0000de89a900  CPU: 7  COMMAND: "md0_resync"
 #0 [ffffa00107c178a0] __switch_to at ffffa0010001d75c
 #1 [ffffa00107c178d0] __schedule at ffffa001017c7f14
 #2 [ffffa00107c179f0] schedule at ffffa001017c880c
 #3 [ffffa00107c17a20] md_do_sync at ffffa0010129cdb4
 #4 [ffffa00107c17d50] md_thread at ffffa00101290d9c
 #5 [ffffa00107c17e50] kthread at ffffa00100187a74

PID: 138294  TASK: ffff0000eba13d80  CPU: 5  COMMAND: "md0_resync"
 #0 [ffffa00107e47a60] __switch_to at ffffa0010001d75c
 #1 [ffffa00107e47a90] __schedule at ffffa001017c7f14
 #2 [ffffa00107e47bb0] schedule at ffffa001017c880c
 #3 [ffffa00107e47be0] schedule_timeout at ffffa001017d1298
 #4 [ffffa00107e47d50] md_thread at ffffa00101290ee8
 #5 [ffffa00107e47e50] kthread at ffffa00100187a74

// there are two sync_threads for md0

I believe the root cause is that two sync_threads exist for the same mddev, and this is how I think that is possible:

t1:                              t2:
action_store
 mddev_lock
 if (mddev->sync_thread)
  mddev_unlock
  md_unregister_thread
  // first sync_thread is done
                                 md_check_recovery
                                  set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                                  queue_work(md_misc_wq, &mddev->del_work)
  mddev_lock_nointr
  md_reap_sync_thread
  // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:
md_start_sync
// second sync_thread is registered

t4:
md_check_recovery
 queue_work(md_misc_wq, &mddev->del_work)
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started

This is just a guess; I can't reproduce the problem yet.

Please let me know if you have any questions.

Thanks,
Kuai

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-14 13:25 ` Marc Smith
  2023-03-14 13:55 ` Guoqing Jiang
@ 2023-03-15  7:52 ` Donald Buczek
  1 sibling, 0 replies; 49+ messages in thread
From: Donald Buczek @ 2023-03-15 7:52 UTC (permalink / raw)
To: Marc Smith, Guoqing Jiang
Cc: Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi,

I can just comment that the simple patch I proposed at

https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de/

works for us, with several different kernel versions and currently 195 raid6 jbods on 105 systems, going through several "idle->sync->idle" transitions each month, for over two years now. So if you suffer from the problem and are able to add patches to the kernel you use, you might give it a try.

Best

Donald

On 3/14/23 14:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> <guoqing.jiang@cloud.ionos.com> wrote:
>>
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>
>>>>
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>>          /* resync has finished, collect result */
>>>>>>          mddev_unlock(mddev);
>>>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>>>          mddev_lock(mddev);
>>>>>
>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>> be a problem, if another user space operation takes the mutex?
>>>>
>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> these places can cause potential deadlock. Please try above two lines
>>>> change. And perhaps others have better idea.
>>>
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>
>> Great.
>> I will send a formal patch with your reported-by and tested-by.
>>
>> Thanks,
>> Guoqing
>
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the stable
> kernels, however, one was omitted due to a regression:
> md: don't unregister sync_thread with reconfig_mutex held (upstream
> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking for
> this patch to be dropped from all stable kernels since it caused a
> regression.
>
> The patch doesn't appear to be present in the current mainline kernel
> (6.3-rc2) either. So I assume this issue is still present there, or it
> was resolved differently and I just can't find the commit/patch.
>
> I can induce the issue by using Donald's script above which will
> eventually result in hangs:
> ...
> [147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
> [147948.504624]       Tainted: P           OE     5.4.229-esos.prod #1
> [147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147948.504625] md_test_2.sh    D    0 68033      1 0x00000004
> [147948.504627] Call Trace:
> [147948.504634]  __schedule+0x4ab/0x4f3
> [147948.504637]  ? usleep_range+0x7a/0x7a
> [147948.504638]  schedule+0x67/0x81
> [147948.504639]  schedule_timeout+0x2c/0xe5
> [147948.504643]  ? do_raw_spin_lock+0x2b/0x52
> [147948.504644]  __wait_for_common+0xc4/0x13a
> [147948.504647]  ? wake_up_q+0x40/0x40
> [147948.504649]  kthread_stop+0x9a/0x117
> [147948.504653]  md_unregister_thread+0x43/0x4d
> [147948.504655]  md_reap_sync_thread+0x1c/0x1d5
> [147948.504657]  action_store+0xc9/0x284
> [147948.504658]  md_attr_store+0x9f/0xb8
> [147948.504661]  kernfs_fop_write+0x10a/0x14c
> [147948.504664]  vfs_write+0xa0/0xdd
> [147948.504666]  ksys_write+0x71/0xba
> [147948.504668]  do_syscall_64+0x52/0x60
> [147948.504671]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> ...
> [147948.504748] INFO: task md120_resync:135315 blocked for more than 122 seconds.
> [147948.504749]       Tainted: P           OE     5.4.229-esos.prod #1
> [147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147948.504749] md120_resync    D    0 135315      2 0x80004000
> [147948.504750] Call Trace:
> [147948.504752]  __schedule+0x4ab/0x4f3
> [147948.504754]  ? printk+0x53/0x6a
> [147948.504755]  schedule+0x67/0x81
> [147948.504756]  md_do_sync+0xae7/0xdd9
> [147948.504758]  ? remove_wait_queue+0x41/0x41
> [147948.504759]  md_thread+0x128/0x151
> [147948.504761]  ? _raw_spin_lock_irqsave+0x31/0x5d
> [147948.504762]  ? md_start_sync+0xdc/0xdc
> [147948.504763]  kthread+0xe4/0xe9
> [147948.504764]  ? kthread_flush_worker+0x70/0x70
> [147948.504765]  ret_from_fork+0x35/0x40
> ...
>
> This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.
>
> Any guidance would be greatly appreciated.
>
> --Marc

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 49+ messages in thread
end of thread, other threads:[~2023-09-25 9:33 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek
2020-11-30  2:06 ` Guoqing Jiang
2020-12-01  9:29 ` Donald Buczek
2020-12-02 17:28 ` Donald Buczek
2020-12-03  1:55 ` Guoqing Jiang
2020-12-03 11:42 ` Donald Buczek
2020-12-21 12:33 ` Donald Buczek
2021-01-19 11:30 ` Donald Buczek
2021-01-20 16:33 ` Guoqing Jiang
2021-01-23 13:04 ` Donald Buczek
2021-01-25  8:54 ` Donald Buczek
2021-01-25 21:32 ` Donald Buczek
2021-01-26  0:44 ` Guoqing Jiang
2021-01-26  9:50 ` Donald Buczek
2021-01-26 11:14 ` Guoqing Jiang
2021-01-26 12:58 ` Donald Buczek
2021-01-26 14:06 ` Guoqing Jiang
2021-01-26 16:05 ` Donald Buczek
2021-02-02 15:42 ` Guoqing Jiang
2021-02-08 11:38 ` Donald Buczek
2021-02-08 14:53 ` Guoqing Jiang
2021-02-08 18:41 ` Donald Buczek
2021-02-09  0:46 ` Guoqing Jiang
2021-02-09  9:24 ` Donald Buczek
2023-03-14 13:25 ` Marc Smith
2023-03-14 13:55 ` Guoqing Jiang
2023-03-14 14:45 ` Marc Smith
2023-03-16 15:25 ` Marc Smith
2023-03-29  0:01 ` Song Liu
2023-08-22 21:16 ` Dragan Stancevic
2023-08-23  1:22 ` Yu Kuai
2023-08-23 15:33 ` Dragan Stancevic
2023-08-24  1:18 ` Yu Kuai
2023-08-28 20:32 ` Dragan Stancevic
2023-08-30  1:36 ` Yu Kuai
2023-09-05  3:50 ` Yu Kuai
2023-09-05 13:54 ` Dragan Stancevic
2023-09-13  9:08 ` Donald Buczek
2023-09-13 14:16 ` Dragan Stancevic
2023-09-14  6:03 ` Donald Buczek
2023-09-17  8:55 ` Donald Buczek
2023-09-24 14:35 ` Donald Buczek
2023-09-25  1:11 ` Yu Kuai
2023-09-25  9:11 ` Donald Buczek
2023-09-25  9:32 ` Yu Kuai
2023-03-15  3:02 ` Yu Kuai
2023-03-15  9:30 ` Guoqing Jiang
2023-03-15  9:53 ` Yu Kuai
2023-03-15  7:52 ` Donald Buczek