* md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
@ 2020-11-28 12:25 Donald Buczek
  2020-11-30  2:06 ` Guoqing Jiang
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2020-11-28 12:25 UTC (permalink / raw)
To: Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the mdcheck cron job trying to pause the ongoing check operation in the morning with

echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
UID        PID  PPID  C STIME TTY          TIME CMD
root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
[<0>] kthread_stop+0x6e/0x150
[<0>] md_unregister_thread+0x3e/0x70
[<0>] md_reap_sync_thread+0x1f/0x1e0
[<0>] action_store+0x141/0x2b0
[<0>] md_attr_store+0x71/0xb0
[<0>] kernfs_fop_write+0x113/0x1a0
[<0>] vfs_write+0xbc/0x250
[<0>] ksys_write+0xa1/0xe0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note that md0 has been paused successfully just before.

2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
[ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 2376 root      20   0       0      0      0 R 100.0   0.0   1387:38 md1_raid6

Also, all processes doing I/O to the md device block.

This is /proc/mdstat:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk
md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process:

https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea what might be going on?

Best
  Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek @ 2020-11-30 2:06 ` Guoqing Jiang 2020-12-01 9:29 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2020-11-30 2:06 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 11/28/20 13:25, Donald Buczek wrote: > Dear Linux mdraid people, > > we are using raid6 on several servers. Occasionally we had failures, > where a mdX_raid6 process seems to go into a busy loop and all I/O to > the md device blocks. We've seen this on various kernel versions. > > The last time this happened (in this case with Linux 5.10.0-rc4), I took > some data. > > The triggering event seems to be the md_check cron job trying to pause > the ongoing check operation in the morning with > > echo idle > /sys/devices/virtual/block/md1/md/sync_action > > This doesn't complete. Here's /proc/stack of this process: > > root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333 > UID PID PPID C STIME TTY TIME CMD > root 23333 23331 0 02:00 ? 00:00:00 /bin/bash > /usr/bin/mdcheck --continue --duration 06:00 > root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat > /proc/23333/stack > [<0>] kthread_stop+0x6e/0x150 > [<0>] md_unregister_thread+0x3e/0x70 > [<0>] md_reap_sync_thread+0x1f/0x1e0 > [<0>] action_store+0x141/0x2b0 > [<0>] md_attr_store+0x71/0xb0 > [<0>] kernfs_fop_write+0x113/0x1a0 > [<0>] vfs_write+0xbc/0x250 > [<0>] ksys_write+0xa1/0xe0 > [<0>] do_syscall_64+0x33/0x40 > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Note, that md0 has been paused successfully just before. What is the personality of md0? Is it also raid6? 
> > 2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD > (/usr/bin/mdcheck --continue --duration "06:00") > 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking > /dev/md0 from 10623180920 > 2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: > data-check of RAID array md0 > 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking > /dev/md1 from 11582849320 > 2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: > data-check of RAID array md1 > 2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: > md0: data-check interrupted. > 2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at > 13351127680 > 2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: > md1: data-check interrupted. > [ nothing related following.... ] > > After that, we see md1_raid6 in a busy loop: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > 2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 > md1_raid6 Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again. > > Also, all processes doing I/O do the md device block. > > This is /proc/mdstat: > > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] > [multipath] > md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] > sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [==================>..] 
check = 94.0%
> (7350290348/7813894144) finish=57189.3min speed=135K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12]
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3]
> sdu[2] sdt[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>

So the RECOVERY_CHECK flag should be set; not sure if the simple change helps, but you may give it a try.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f..e2697d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
 			md_update_sb(mddev, 0);
 
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
 			/* resync/recovery still happening */
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;

Thanks,
Guoqing

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2020-11-30 2:06 ` Guoqing Jiang @ 2020-12-01 9:29 ` Donald Buczek 2020-12-02 17:28 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2020-12-01 9:29 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Am 30.11.20 um 03:06 schrieb Guoqing Jiang: > > > On 11/28/20 13:25, Donald Buczek wrote: >> Dear Linux mdraid people, >> >> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions. >> >> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data. >> >> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with >> >> echo idle > /sys/devices/virtual/block/md1/md/sync_action >> >> This doesn't complete. Here's /proc/stack of this process: >> >> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333 >> UID PID PPID C STIME TTY TIME CMD >> root 23333 23331 0 02:00 ? 00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00 >> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack >> [<0>] kthread_stop+0x6e/0x150 >> [<0>] md_unregister_thread+0x3e/0x70 >> [<0>] md_reap_sync_thread+0x1f/0x1e0 >> [<0>] action_store+0x141/0x2b0 >> [<0>] md_attr_store+0x71/0xb0 >> [<0>] kernfs_fop_write+0x113/0x1a0 >> [<0>] vfs_write+0xbc/0x250 >> [<0>] ksys_write+0xa1/0xe0 >> [<0>] do_syscall_64+0x33/0x40 >> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> Note, that md0 has been paused successfully just before. > > What is the personality of md0? Is it also raid6? Yes. 
> >> >> 2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00") >> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920 >> 2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0 >> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320 >> 2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1 >> 2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted. >> 2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680 >> 2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted. >> [ nothing related following.... ] >> >> After that, we see md1_raid6 in a busy loop: >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 md1_raid6 > > Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again. > >> >> Also, all processes doing I/O do the md device block. >> >> This is /proc/mdstat: >> >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [==================>..] 
check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> bitmap: 0/59 pages [0KB], 65536KB chunk >> > > So the RECOVERY_CHECK flag should be set, not sure if the simple changes > helps, but you may give it a try. Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often. Best Donald > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 98bac4f..e2697d0 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev) > md_update_sb(mddev, 0); > > if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && > - !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { > + (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) || > + test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) { > /* resync/recovery still happening */ > clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); > goto unlock; > > Thanks, > Guoqing -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-01  9:29 ` Donald Buczek
@ 2020-12-02 17:28 ` Donald Buczek
  2020-12-03  1:55 ` Guoqing Jiang
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2020-12-02 17:28 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			goto unlock;
 		}
+		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
+		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
+			/* resync/recovery still happening */
+			pr_info("md: XXX BUGFIX applied\n");
+			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			goto unlock;
+		}
 		if (mddev->sync_thread) {
 			md_reap_sync_thread(mddev);
 			goto unlock;

With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk
md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk
md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
root       134     2  0 Nov30 ?        00:00:00 [md]
root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
[<0>] md_bitmap_cond_end_sync+0x12d/0x170
[<0>] raid5_sync_request+0x24b/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
Donald On 01.12.20 10:29, Donald Buczek wrote: > Am 30.11.20 um 03:06 schrieb Guoqing Jiang: >> >> >> On 11/28/20 13:25, Donald Buczek wrote: >>> Dear Linux mdraid people, >>> >>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions. >>> >>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data. >>> >>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with >>> >>> echo idle > /sys/devices/virtual/block/md1/md/sync_action >>> >>> This doesn't complete. Here's /proc/stack of this process: >>> >>> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333 >>> UID PID PPID C STIME TTY TIME CMD >>> root 23333 23331 0 02:00 ? 00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00 >>> root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack >>> [<0>] kthread_stop+0x6e/0x150 >>> [<0>] md_unregister_thread+0x3e/0x70 >>> [<0>] md_reap_sync_thread+0x1f/0x1e0 >>> [<0>] action_store+0x141/0x2b0 >>> [<0>] md_attr_store+0x71/0xb0 >>> [<0>] kernfs_fop_write+0x113/0x1a0 >>> [<0>] vfs_write+0xbc/0x250 >>> [<0>] ksys_write+0xa1/0xe0 >>> [<0>] do_syscall_64+0x33/0x40 >>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> >>> Note, that md0 has been paused successfully just before. >> >> What is the personality of md0? Is it also raid6? > > Yes. 
> >> >>> >>> 2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00") >>> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920 >>> 2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0 >>> 2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320 >>> 2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1 >>> 2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted. >>> 2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680 >>> 2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted. >>> [ nothing related following.... ] >>> >>> After that, we see md1_raid6 in a busy loop: >>> >>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >>> 2376 root 20 0 0 0 0 R 100.0 0.0 1387:38 md1_raid6 >> >> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again. >> >>> >>> Also, all processes doing I/O do the md device block. >>> >>> This is /proc/mdstat: >>> >>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >>> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >>> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >>> [==================>..] 
check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec >>> bitmap: 0/59 pages [0KB], 65536KB chunk >>> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >>> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >>> bitmap: 0/59 pages [0KB], 65536KB chunk >>> >> >> So the RECOVERY_CHECK flag should be set, not sure if the simple changes >> helps, but you may give it a try. > > Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often. > > Best > Donald > >> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 98bac4f..e2697d0 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev) >> md_update_sb(mddev, 0); >> >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >> - !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { >> + (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) || >> + test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) { >> /* resync/recovery still happening */ >> clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); >> goto unlock; >> >> Thanks, >> Guoqing > -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-02 17:28 ` Donald Buczek
@ 2020-12-03  1:55 ` Guoqing Jiang
  2020-12-03 11:42 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Guoqing Jiang @ 2020-12-03 1:55 UTC (permalink / raw)
To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> unfortunately the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>  			goto unlock;
>  		}

I think you can add the check of RECOVERY_CHECK in the above part instead of adding a new part.

> +		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> +		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
> +			/* resync/recovery still happening */
> +			pr_info("md: XXX BUGFIX applied\n");
> +			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +			goto unlock;
> +		}
>  		if (mddev->sync_thread) {
>  			md_reap_sync_thread(mddev);
>  			goto unlock;
>
> With pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) has locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1
> is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

What is the stack of the process? I guess it is the same as the stack of 23333 in your previous mail, but just to confirm.
> > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] > [multipath] > md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] > sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [=>...................] check = 8.5% (666852112/7813894144) > finish=1271.2min speed=93701K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] > sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] > sdu[2] sdt[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [>....................] check = 0.2% (19510348/7813894144) > finish=253779.6min speed=511K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > > after 1 minute: > > Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] > [multipath] > md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] > sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [=>...................] check = 8.6% (674914560/7813894144) > finish=941.1min speed=126418K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] > sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] > sdu[2] sdt[1] > 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm > 2 [16/16] [UUUUUUUUUUUUUUUU] > [>....................] check = 0.2% (19510348/7813894144) > finish=256805.0min speed=505K/sec > bitmap: 0/59 pages [0KB], 65536KB chunk > > A data point, I didn't mention in my previous mail, is that the > mdX_resync thread is not gone when the problem occurs: > > buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md > root 134 2 0 Nov30 ? 00:00:00 [md] > root 1321 2 27 Nov30 ? 
12:57:14 [md0_raid6]
> root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
> root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
> root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
> buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
> [<0>] md_bitmap_cond_end_sync+0x12d/0x170
> [<0>] raid5_sync_request+0x24b/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x1f/0x30
>
> I guess, md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);`
> in md-bitmap.c.
>

Could be; if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks between md0 and md1 to find the difference.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-03  1:55 ` Guoqing Jiang
@ 2020-12-03 11:42 ` Donald Buczek
  2020-12-21 12:33 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2020-12-03 11:42 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 12/3/20 2:55 AM, Guoqing Jiang wrote:
> Hi Donald,
>
> On 12/2/20 18:28, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:
>>
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>>  			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>  			goto unlock;
>>  		}
>
> I think you can add the check of RECOVERY_CHECK in above part instead of add a new part.

I understand that. I did it this way to get a log only when the changed expression makes a difference (original block not triggering and using goto, and modified block triggering), even if both of us make some stupid error with boolean expressions.

>
>> +		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> +		    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>> +		     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>> +			/* resync/recovery still happening */
>> +			pr_info("md: XXX BUGFIX applied\n");
>> +			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>> +			goto unlock;
>> +		}
>>  		if (mddev->sync_thread) {
>>  			md_reap_sync_thread(mddev);
>>  			goto unlock;
>>
>> With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:
>
> Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

I think it's just random.
There is a slight chance for a deadlock depending on I/O to the device or other system load when stopping the resync. In the last experiment, both md0 and md1 where changed from "check" to "idle" for about 400 times before one of it blocked. OTOH - and I don't understand this at all - it looks like md0 hung again after reboot to a 5.4 kernel at the very first "check" to "idle" transition which seems to contradict the "slight" chance and "once after 400 attempts". > What is the stack of the process? I guess it is same as the stack of 23333 in your previous mail, but just to confirm. Of md0_raid6 thread ? Sorry, didn't take the data before reboot this time, and concentrated on the non-terminating md0_resync thread. >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [=>...................] check = 8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [>....................] check = 0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> >> after 1 minute: >> >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] >> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [=>...................] 
check = 8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1] >> 109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU] >> [>....................] check = 0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec >> bitmap: 0/59 pages [0KB], 65536KB chunk >> >> A data point, I didn't mention in my previous mail, is that the mdX_resync thread is not gone when the problem occurs: >> >> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md >> root 134 2 0 Nov30 ? 00:00:00 [md] >> root 1321 2 27 Nov30 ? 12:57:14 [md0_raid6] >> root 1454 2 26 Nov30 ? 12:37:23 [md1_raid6] >> root 5845 2 0 16:20 ? 00:00:30 [md0_resync] >> root 5855 2 13 16:20 ? 00:14:11 [md1_resync] >> buczek 9880 9072 0 18:05 pts/0 00:00:00 grep -F [md >> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack >> [<0>] md_bitmap_cond_end_sync+0x12d/0x170 >> [<0>] raid5_sync_request+0x24b/0x390 >> [<0>] md_do_sync+0xb41/0x1030 >> [<0>] md_thread+0x122/0x160 >> [<0>] kthread+0x118/0x130 >> [<0>] ret_from_fork+0x1f/0x30 >> >> I guess, md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c. >> > > Could be, if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe. Ok. Maybe I find a way to add some logging there. > I'd suggest to compare the stacks between md0 and md1 to find the difference. Compare stack of md0_raid6 and md1_raid6? Okay, but when one md is deadlocked, the other is operating just normally. I/O can be done and you can even start and stop a resync on it. So I'd expect the stack of the operating raid to be varying. Thanks. 
Donald > > Thanks, > Guoqing -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2020-12-03 11:42 ` Donald Buczek @ 2020-12-21 12:33 ` Donald Buczek 2021-01-19 11:30 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2020-12-21 12:33 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoging, I think now that this is not an issue for md. I've driven a system into that situation again and have clear indication, that this is a problem of the member block device driver. With md0 in the described errornous state (md0_raid6 busy looping, echo idle > .../sync_action blocked, no progress in mdstat) and md1 operating normally: root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done /sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0 /sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0 /sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0 /sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0 /sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0 /sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0 
/sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
/sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best
  Donald
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-21 12:33 ` Donald Buczek
@ 2021-01-19 11:30 ` Donald Buczek
  2021-01-20 16:33 ` Guoqing Jiang
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2021-01-19 11:30 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear md-raid people,

I've reported a problem in this thread in December:

"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions."

It was clear that this is related to "mdcheck" running, because we found that the command which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I reported that the problem might be caused by a failure of the underlying block device driver, because I had found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread.
[ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]

- The sync thread ("md1_resync") does not finish, because it's blocked at

[<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
[<0>] raid5_sync_request+0x364/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

[1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735

- yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE)

- We have lots of active stripes:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
27534

- handle_stripe() doesn't make progress:

echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

[171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
[171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0

The sector numbers repeat after some time.
We have only the following variants of stripe state and "check":

state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
write-pending

gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939

- The system can manually be freed from the deadlock:

When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

I, of course, don't fully understand it yet. Any ideas?

I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with its first disks)

Donald

$L_002: (struct mddev *) 0xffff8898a48e3000 : { private = (struct r5conf *) 0xffff88988695a800 /* --> $L_004 */ pers = (struct md_personality *) 0xffffffff826ee280 <raid6_personality> unit = 9437185 md_minor = 1 disks = {next = 0xffff889886ee8c00, prev = 0xffff889891506000} flags = 0 sb_flags = 6 suspended = 0 active_io = { counter=4 } /* atomic_t */ ro = 0 sysfs_active = 0 gendisk = (struct gendisk *) 0xffff8898bc357800 kobj = { name = (const char *) 0xffffffff8220cef8 "md" entry = {next = 0xffff8898a48e3058, prev = 0xffff8898a48e3058} /* empty */ parent = (struct kobject *) 0xffff8898bc357868 kset = 0x0 ktype = (struct kobj_type *) 0xffffffff826ef140 <md_ktype> sd = (struct kernfs_node *) 0xffff8898b5f65980 kref = { refcount = { refs = { counter=17 } /* atomic_t */ } } state_initialized = 1 state_in_sysfs = 1 state_add_uevent_sent = 0 state_remove_uevent_sent = 0 uevent_suppress = 0 } hold_active = 0
major_version = 1 minor_version = 2 patch_version = 0 persistent = 1 external = 0 metadata_type = [ (8) [0] : 0 '\000' [1] : 0 '\000' [2] : 0 '\000' [3] : 0 '\000' [4] : 0 '\000' [5] : 0 '\000' [6] : 0 '\000' [7] : 0 '\000' [8] : 0 '\000' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' ] chunk_sectors = 1024 ctime = 1607680540 utime = 1610853337 level = 6 layout = 2 clevel = [ (8) [0] : 114 'r' [1] : 97 'a' [2] : 105 'i' [3] : 100 'd' [4] : 54 '6' [5] : 0 '\000' [6] : 0 '\000' [7] : 0 '\000' [8] : 0 '\000' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' ] raid_disks = 16 max_disks = 1920 dev_sectors = 15627788288 array_sectors = 218789036032 external_size = 0 events = 32409 can_decrease_events = 0 uuid = [ (8) [0] : -49 '\317' [1] : 69 'E' [2] : 119 'w' [3] : 62 '>' [4] : -113 '\217' [5] : -107 '\225' [6] : 9 '\t' [7] : -90 '\246' [8] : -2 '\376' [9] : 84 'T' [10] : -12 '\364' [11] : -31 '\341' [12] : -110 '\222' [13] : 24 '\030' [14] : -15 '\361' [15] : 109 'm' ] reshape_position = 18446744073709551615 delta_disks = 0 new_level = 6 new_layout = 2 new_chunk_sectors = 1024 reshape_backwards = 0 thread = (struct md_thread *) 0xffff889887bde080 sync_thread = 0x0 last_sync_action = (char *) 0xffffffff8225e246 "check" curr_resync = 4947256648 curr_resync_completed = 4947016640 resync_mark = 4415850173 resync_mark_cnt = 66402936 curr_mark_cnt = 75262560 resync_max_sectors = 15627788288 resync_mismatches = { counter = 0 } suspend_lo = 0 suspend_hi = 0 sync_speed_min = 0 sync_speed_max = 0 parallel_resync = 0 ok_start_degraded = 0 recovery = 203 recovery_disabled = 16 in_sync = 0 open_mutex = { owner = { counter=0 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff8898a48e31d8, prev = 0xffff8898a48e31d8} /* empty */ } reconfig_mutex = { owner = { 
counter=-131247629823744 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff8898a48e31f8, prev = 0xffff8898a48e31f8} /* empty */ } active = { counter=2 } /* atomic_t */ openers = { counter=1 } /* atomic_t */ changed = 0 degraded = 0 recovery_active = { counter=181360 } /* atomic_t */ recovery_wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3228, prev = 0xffff8898a48e3228} /* empty */ } recovery_cp = 18446744073709551615 resync_min = 4871994088 resync_max = 18446744073709551615 sysfs_state = (struct kernfs_node *) 0xffff8898b5f65380 sysfs_action = (struct kernfs_node *) 0xffff889887d87600 sysfs_completed = (struct kernfs_node *) 0xffff8898ac6f4e80 sysfs_degraded = (struct kernfs_node *) 0xffff8898ac6f4b80 sysfs_level = (struct kernfs_node *) 0xffff8898b5f65700 del_work = { data = { counter=128 } /* atomic_t */ entry = {next = 0xffff8898a48e3280, prev = 0xffff8898a48e3280} /* empty */ func = (work_func_t) 0xffffffff819ad430 <md_start_sync> } lock = {counter=0} /* spinlock_t */ sb_wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffffc9000e4a7a68, prev = 0xffffc9000cf97d30} } pending_writes = { counter=0 } /* atomic_t */ safemode = 0 safemode_delay = 201 safemode_timer = { entry = { next = (struct hlist_node *) 0xdead000000000122 pprev = 0x0 } expires = 4415857239 function = (void (*)(struct timer_list *)) 0xffffffff819a6a40 <md_safemode_timeout> flags = 314572820 } writes_pending = { percpu_count_ptr = 105896157983096 data = (struct percpu_ref_data *) 0xffff8898b340fc80 } sync_checkers = 0 queue = (struct request_queue *) 0xffff8898b65847c8 bitmap = (struct bitmap *) 0xffff889887d84400 bitmap_info = { file = 0x0 offset = 8 space = 0 default_offset = 2 default_space = 6 mutex = { owner = { counter=0 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff8898a48e3350, prev = 
0xffff8898a48e3350} /* empty */ } chunksize = 67108864 daemon_sleep = 5000 max_write_behind = 0 external = 0 nodes = 0 cluster_name = [ (8) [0] : 0 '\000' [1] : 0 '\000' [2] : 0 '\000' [3] : 0 '\000' [4] : 0 '\000' [5] : 0 '\000' [6] : 0 '\000' [7] : 0 '\000' [8] : 0 '\000' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' [17] : 0 '\000' [18] : 0 '\000' [19] : 0 '\000' [20] : 0 '\000' [21] : 0 '\000' [22] : 0 '\000' [23] : 0 '\000' [24] : 0 '\000' [25] : 0 '\000' [26] : 0 '\000' [27] : 0 '\000' [28] : 0 '\000' [29] : 0 '\000' [30] : 0 '\000' [31] : 0 '\000' [32] : 0 '\000' [33] : 0 '\000' [34] : 0 '\000' [35] : 0 '\000' [36] : 0 '\000' [37] : 0 '\000' [38] : 0 '\000' [39] : 0 '\000' [40] : 0 '\000' [41] : 0 '\000' [42] : 0 '\000' [43] : 0 '\000' [44] : 0 '\000' [45] : 0 '\000' [46] : 0 '\000' [47] : 0 '\000' [48] : 0 '\000' [49] : 0 '\000' [50] : 0 '\000' [51] : 0 '\000' [52] : 0 '\000' [53] : 0 '\000' [54] : 0 '\000' [55] : 0 '\000' [56] : 0 '\000' [57] : 0 '\000' [58] : 0 '\000' [59] : 0 '\000' [60] : 0 '\000' [61] : 0 '\000' [62] : 0 '\000' [63] : 0 '\000' ] } max_corr_read_errors = { counter=20 } /* atomic_t */ all_mddevs = {next = 0xffffffff826eef10, prev = 0xffff88810a551bc8} to_remove = 0x0 bio_set = { bio_slab = (struct kmem_cache *) 0xffff889886fbdc00 front_pad = 0 bio_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae300 pool_data = (void *) 0xffff889886fbdc00 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3428, prev = 0xffff8898a48e3428} /* empty */ } } bvec_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae2e0 pool_data = (void *) 0xffff8881078ecb40 alloc = (mempool_alloc_t *) 0xffffffff811c8540 
<mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3470, prev = 0xffff8898a48e3470} /* empty */ } } bio_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bvec_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } rescue_lock = {counter=0} /* spinlock_t */ rescue_list = { head = 0x0 tail = 0x0 } rescue_work = { data = { counter=68719476704 } /* atomic_t */ entry = {next = 0xffff8898a48e3530, prev = 0xffff8898a48e3530} /* empty */ func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue> } rescue_workqueue = 0x0 } sync_set = { bio_slab = (struct kmem_cache *) 0xffff889886fbdc00 front_pad = 0 bio_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae2c0 pool_data = (void *) 0xffff889886fbdc00 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e3598, prev = 0xffff8898a48e3598} /* empty */ } } bvec_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae2a0 pool_data = (void *) 0xffff8881078ecb40 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e35e0, prev = 0xffff8898a48e35e0} /* empty */ } } bio_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 
0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bvec_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } rescue_lock = {counter=0} /* spinlock_t */ rescue_list = { head = 0x0 tail = 0x0 } rescue_work = { data = { counter=68719476704 } /* atomic_t */ entry = {next = 0xffff8898a48e36a0, prev = 0xffff8898a48e36a0} /* empty */ func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue> } rescue_workqueue = 0x0 } md_io_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b236af60 pool_data = (void *) 0x28 <fixed_percpu_data+40> alloc = (mempool_alloc_t *) 0xffffffff811c8b50 <mempool_kmalloc> free = (mempool_free_t *) 0xffffffff811c84b0 <mempool_kfree> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff8898a48e36f8, prev = 0xffff8898a48e36f8} /* empty */ } } flush_bio = 0x0 flush_pending = { counter=0 } /* atomic_t */ start_flush = 121249416090188 last_flush = 121249416090188 flush_work = { data = { counter=2176 } /* atomic_t */ entry = {next = 0xffff8898a48e3730, prev = 0xffff8898a48e3730} /* empty */ func = (work_func_t) 0xffffffff819aca80 <md_submit_flush_data> } event_work = { data = { counter=0 } /* atomic_t */ entry = {next = 0x0000000000000000, prev = 0x0000000000000000} func = 0x0 } serial_info_pool = 0x0 sync_super = 0x0 cluster_info = 0x0 good_device_nr = 0 noio_flag = 0 has_superblocks = true_21_ fail_last_dev = false_21_ serialize_policy = false_21_ } $L_004: (struct r5conf *) 0xffff88988695a800 : { stripe_hashtbl = (struct hlist_head *) 0xffff8898a1cb9000 /* --> $L_006 */ hash_locks = [ (23) [0] : {counter=0} /* spinlock_t */ [1] : {counter=0} /* spinlock_t */ [2] : {counter=0} /* spinlock_t */ [3] : 
{counter=0} /* spinlock_t */ [4] : {counter=0} /* spinlock_t */ [5] : {counter=0} /* spinlock_t */ [6] : {counter=0} /* spinlock_t */ [7] : {counter=0} /* spinlock_t */ ] mddev = (struct mddev *) 0xffff8898a48e3000 chunk_sectors = 1024 level = 6 algorithm = 2 rmw_level = 1 max_degraded = 2 raid_disks = 16 max_nr_stripes = 29816 min_nr_stripes = 256 reshape_progress = 18446744073709551615 reshape_safe = 0 previous_raid_disks = 16 prev_chunk_sectors = 1024 prev_algo = 2 generation = 0 gen_lock = { seqcount = { sequence = 0 } } reshape_checkpoint = 0 min_offset_diff = 0 handle_list = {next = 0xffff888a7893c890, prev = 0xffff8889de3cc7d0} loprio_list = {next = 0xffff88988695a898, prev = 0xffff88988695a898} /* empty */ hold_list = {next = 0xffff88988695a8a8, prev = 0xffff88988695a8a8} /* empty */ delayed_list = {next = 0xffff88988695a8b8, prev = 0xffff88988695a8b8} /* empty */ bitmap_list = {next = 0xffff88988695a8c8, prev = 0xffff88988695a8c8} /* empty */ retry_read_aligned = 0x0 retry_read_offset = 0 retry_read_aligned_list = 0x0 preread_active_stripes = { counter=0 } /* atomic_t */ active_aligned_reads = { counter=0 } /* atomic_t */ pending_full_writes = { counter=0 } /* atomic_t */ bypass_count = 0 bypass_threshold = 1 skip_copy = 0 last_hold = (struct list_head *) 0xffff88a299ea0010 reshape_stripes = { counter=0 } /* atomic_t */ active_name = 0 cache_name = [ (2) [0] : [ (8) [0] : 114 'r' [1] : 97 'a' [2] : 105 'i' [3] : 100 'd' [4] : 54 '6' [5] : 45 '-' [6] : 109 'm' [7] : 100 'd' [8] : 49 '1' [9] : 0 '\000' [10] : 0 '\000' [11] : 0 '\000' [12] : 0 '\000' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' [17] : 0 '\000' [18] : 0 '\000' [19] : 0 '\000' [20] : 0 '\000' [21] : 0 '\000' [22] : 0 '\000' [23] : 0 '\000' [24] : 0 '\000' [25] : 0 '\000' [26] : 0 '\000' [27] : 0 '\000' [28] : 0 '\000' [29] : 0 '\000' [30] : 0 '\000' [31] : 0 '\000' ] [1] : [ (8) [0] : 114 'r' [1] : 97 'a' [2] : 105 'i' [3] : 100 'd' [4] : 54 '6' [5] : 45 '-' [6] : 109 'm' [7] 
: 100 'd' [8] : 49 '1' [9] : 45 '-' [10] : 97 'a' [11] : 108 'l' [12] : 116 't' [13] : 0 '\000' [14] : 0 '\000' [15] : 0 '\000' [16] : 0 '\000' [17] : 0 '\000' [18] : 0 '\000' [19] : 0 '\000' [20] : 0 '\000' [21] : 0 '\000' [22] : 0 '\000' [23] : 0 '\000' [24] : 0 '\000' [25] : 0 '\000' [26] : 0 '\000' [27] : 0 '\000' [28] : 0 '\000' [29] : 0 '\000' [30] : 0 '\000' [31] : 0 '\000' ] ] slab_cache = (struct kmem_cache *) 0xffff8881078eca80 cache_size_mutex = { owner = { counter=0 } /* atomic_t */ wait_lock = {counter=0} /* spinlock_t */ osq = { tail = { counter=0 } /* atomic_t */ } wait_list = {next = 0xffff88988695a970, prev = 0xffff88988695a970} /* empty */ } seq_flush = 5031591 seq_write = 5031591 quiesce = 0 fullsync = 0 recovery_disabled = 15 percpu = (struct raid5_percpu *) 0x604fdee147e8 scribble_disks = 16 scribble_sectors = 1024 node = { next = 0x0 pprev = (struct hlist_node **) 0xffff88810a5b89a8 } active_stripes = { counter=27534 } /* atomic_t */ inactive_list = [ (3) [0] : {next = 0xffff88990fd76710, prev = 0xffff8889b7f72550} [1] : {next = 0xffff88959af002d0, prev = 0xffff889958cd4590} [2] : {next = 0xffff88a60a070150, prev = 0xffff88a37d3d0850} [3] : {next = 0xffff889e211e0690, prev = 0xffff8893a4000110} [4] : {next = 0xffff888c2139a350, prev = 0xffff88901a188410} [5] : {next = 0xffff88a83413e3d0, prev = 0xffff88a2090bc350} [6] : {next = 0xffff88a720d54310, prev = 0xffff888a0da6c5d0} [7] : {next = 0xffff888a0412c650, prev = 0xffff889250596550} ] r5c_cached_full_stripes = { counter=0 } /* atomic_t */ r5c_full_stripe_list = {next = 0xffff88988695aa48, prev = 0xffff88988695aa48} /* empty */ r5c_cached_partial_stripes = { counter=0 } /* atomic_t */ r5c_partial_stripe_list = {next = 0xffff88988695aa60, prev = 0xffff88988695aa60} /* empty */ r5c_flushing_full_stripes = { counter=0 } /* atomic_t */ r5c_flushing_partial_stripes = { counter=0 } /* atomic_t */ empty_inactive_list_nr = { counter=0 } /* atomic_t */ released_stripes = { first = 0x0 } 
wait_for_quiescent = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff88988695aa90, prev = 0xffff88988695aa90} /* empty */ } wait_for_stripe = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffffc9000e3f7c78, prev = 0xffffc9000e3f7c78} } wait_for_overlap = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff88988695aac0, prev = 0xffff88988695aac0} /* empty */ } cache_state = 3 shrinker = { count_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198bc60 <raid5_cache_count> scan_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198de90 <raid5_cache_scan> batch = 128 seeks = 128 flags = 0 list = {next = 0xffff8898867e8c30, prev = 0xffffffffa0e14520} id = 0 nr_deferred = (atomic_long_t *) 0xffff88988d8955c0 } pool_size = 16 device_lock = {counter=0} /* spinlock_t */ disks = (struct disk_info *) 0xffff8898b1df3800 bio_split = { bio_slab = (struct kmem_cache *) 0xffff889886fbdc00 front_pad = 0 bio_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 2 curr_nr = 2 elements = (void **) 0xffff8898b4fae280 pool_data = (void *) 0xffff889886fbdc00 alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab> free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab> wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0xffff88988695ab70, prev = 0xffff88988695ab70} /* empty */ } } bvec_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bio_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } bvec_integrity_pool = { lock = {counter=0} /* spinlock_t */ min_nr = 0 curr_nr = 0 elements = 0x0 
pool_data = 0x0 alloc = 0x0 free = 0x0 wait = { lock = {counter=0} /* spinlock_t */ head = {next = 0x0000000000000000, prev = 0x0000000000000000} } } rescue_lock = {counter=0} /* spinlock_t */ rescue_list = { head = 0x0 tail = 0x0 } rescue_work = { data = { counter=68719476704 } /* atomic_t */ entry = {next = 0xffff88988695ac78, prev = 0xffff88988695ac78} /* empty */ func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue> } rescue_workqueue = 0x0 } thread = 0x0 temp_inactive_list = [ (3) [0] : {next = 0xffff88988695aca0, prev = 0xffff88988695aca0} /* empty */ [1] : {next = 0xffff88988695acb0, prev = 0xffff88988695acb0} /* empty */ [2] : {next = 0xffff88988695acc0, prev = 0xffff88988695acc0} /* empty */ [3] : {next = 0xffff88988695acd0, prev = 0xffff88988695acd0} /* empty */ [4] : {next = 0xffff88988695ace0, prev = 0xffff88988695ace0} /* empty */ [5] : {next = 0xffff88988695acf0, prev = 0xffff88988695acf0} /* empty */ [6] : {next = 0xffff88988695ad00, prev = 0xffff88988695ad00} /* empty */ [7] : {next = 0xffff88988695ad10, prev = 0xffff88988695ad10} /* empty */ ] worker_groups = 0x0 group_cnt = 0 worker_cnt_per_group = 0 log = 0x0 log_private = 0x0 pending_bios_lock = {counter=0} /* spinlock_t */ batch_bio_dispatch = true_21_ pending_data = (struct r5pending_data *) 0xffff88988fae0000 free_list = {next = 0xffff88988fae4fd8, prev = 0xffff88988fae0000} pending_list = {next = 0xffff88988695ad60, prev = 0xffff88988695ad60} /* empty */ pending_data_cnt = 0 next_pending_data = 0x0 } set $sh = container_of ((void *)0xffff88a57f1e6550, struct stripe_head, lru) (gdb) dump *$sh $L_010: (struct stripe_head *) 0xffff88a57f1e6540 : { hash = { next = (struct hlist_node *) 0xffff888768f7c680 pprev = (struct hlist_node **) 0xffff8892f0520080 } lru = {next = 0xffff88945e034510, prev = 0xffff8891d8f04210} release_list = { next = (struct llist_node *) 0xffff88945e034520 } raid_conf = (struct r5conf *) 0xffff88988695a800 generation = 0 sector = 4947068032 pd_idx = 14 qd_idx = 15 
ddf_layout = 0 hash_lock_index = 0 state = 8242 # 0x2032 HANDLE + INSYNC + REPLACED + STARTED count = { counter=0 } /* atomic_t */ bm_seq = 5030861 disks = 16 overwrite_disks = 0 check_state = check_state_idle_5_ reconstruct_state = reconstruct_state_idle_5_ stripe_lock = {counter=0} /* spinlock_t */ cpu = 17 group = 0x0 batch_head = 0x0 # no batch batch_lock = {counter=0} /* spinlock_t */ batch_list = {next = 0xffff88a57f1e65c8, prev = 0xffff88a57f1e65c8} /* empty */ /* anon union*/ { log_io = 0x0 ppl_io = 0x0 } log_list = {next = 0xffff88a57f1e65e0, prev = 0xffff88a57f1e65e0} /* empty */ log_start = 18446744073709551615 r5c = {next = 0xffff88a57f1e65f8, prev = 0xffff88a57f1e65f8} /* empty */ ppl_page = 0x0 ops = { target = 0 target2 = 0 zero_sum_result = 0_5_ } (gdb) dump $sh->dev[0] $L_011: (struct r5dev *) 0xffff88a57f1e6620 : { req = { bi_next = 0x0 bi_disk = 0x0 bi_opf = 0 bi_flags = 0 bi_ioprio = 0 bi_write_hint = 0 bi_status = 0 '\000' bi_partno = 0 '\000' __bi_remaining = { counter=1 } /* atomic_t */ bi_iter = { bi_sector = 0 bi_size = 0 bi_idx = 0 bi_bvec_done = 0 } bi_end_io = 0x0 bi_private = 0x0 bi_blkg = 0x0 bi_issue = { value = 0 } /* anon union*/ { bi_integrity = 0x0 } bi_vcnt = 0 bi_max_vecs = 1 __bi_cnt = { counter=1 } /* atomic_t */ bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6710 bi_pool = 0x0 bi_inline_vecs = [ ] } rreq = { bi_next = 0x0 bi_disk = 0x0 bi_opf = 0 bi_flags = 0 bi_ioprio = 0 bi_write_hint = 0 bi_status = 0 '\000' bi_partno = 0 '\000' __bi_remaining = { counter=1 } /* atomic_t */ bi_iter = { bi_sector = 0 bi_size = 0 bi_idx = 0 bi_bvec_done = 0 } bi_end_io = 0x0 bi_private = 0x0 bi_blkg = 0x0 bi_issue = { value = 0 } /* anon union*/ { bi_integrity = 0x0 } bi_vcnt = 0 bi_max_vecs = 1 __bi_cnt = { counter=1 } /* atomic_t */ bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6720 bi_pool = 0x0 bi_inline_vecs = [ ] } vec = { bv_page = (struct page *) 0xffffea008ffccc80 bv_len = 4096 bv_offset = 0 } rvec = { bv_page = 0x0 bv_len = 0 
bv_offset = 0 } page = (struct page *) 0xffffea008ffccc80 orig_page = (struct page *) 0xffffea008ffccc80 offset = 0 toread = 0x0 read = 0x0 towrite = 0x0 written = 0x0 sector = 69258950784 flags = 17 # 0x11 : R5_UPTODATE + R5_INSYNC log_checksum = 0 write_hint = 0 }

$sh->dev[1] .. $sh->dev[15] omitted. All the same, just different page buffers and sector numbers. $sh->dev[14] and $sh->dev[15] have sector = 0.
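[Editorial note] The figures in the message above are already enough to see why the sync thread blocks in raid5_get_active_stripe: a caller may only take a fresh stripe_head while the number of active stripes is below 3/4 of the cache size. A minimal sketch of that gate, assuming the 3/4 test as quoted from drivers/md/raid5.c and plugging in the observed numbers (everything else here is illustrative, not kernel code):

```python
# Sketch of the allocation gate in raid5_get_active_stripe().
# Constants come from the data in the message above; the 3/4 test
# mirrors: atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
MAX_NR_STRIPES = 29816   # r5conf.max_nr_stripes from the gdb dump
ACTIVE_STRIPES = 27534   # /sys/block/md1/md/stripe_cache_active

def may_allocate_stripe(active, max_nr):
    # The sync thread may only take a new stripe while active stripes
    # are below 3/4 of the cache size (C integer division).
    return active < max_nr * 3 // 4

print(MAX_NR_STRIPES * 3 // 4)                              # 22362
print(may_allocate_stripe(ACTIVE_STRIPES, MAX_NR_STRIPES))  # False
```

With 27534 active stripes against a threshold of 22362, the condition stays false, so raid5_get_active_stripe waits and the sync thread cannot be stopped.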
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-19 11:30 ` Donald Buczek @ 2021-01-20 16:33 ` Guoqing Jiang 2021-01-23 13:04 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-20 16:33 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 1/19/21 12:30, Donald Buczek wrote: > Dear md-raid people, > > I've reported a problem in this thread in December: > > "We are using raid6 on several servers. Occasionally we had failures, > where a mdX_raid6 process seems to go into a busy loop and all I/O to > the md device blocks. We've seen this on various kernel versions." It > was clear, that this is related to "mdcheck" running, because we found, > that the command, which should stop the scrubbing in the morning (`echo > idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked. > > On 12/21/20, I've reported, that the problem might be caused by a > failure of the underlying block device driver, because I've found > "inflight" counters of the block devices not being zero. However, this > is not the case. We were able to run into the mdX_raid6 looping > condition a few times again, but the non-zero inflight counters have not > been observed again. > > I was able to collect a lot of additional information from a blocked > system. > > - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command > is waiting at kthread_stop to stop the sync thread. 
> [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because it's blocked at
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660 # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE)

The resync thread is blocked since it can't get a sh from the inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.

> - We have lots of active stripes:
>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait till it becomes true; either increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes

Since the "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could allocate new stripes in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.

Another thing you can try is to increase the number of sh manually by writing a new number to stripe_cache_size, if there is enough memory.

2.
Or decrease active_stripes

This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>
> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE +INSYNC+REPLACED+IO_STARTED, check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING +REPLACED+IO_STARTED, check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING +IO_STARTED, check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:

because of the MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be called by md_reap_sync_thread. But md_reap_sync_thread is stuck unregistering the sync_thread (it was blocked in raid5_get_active_stripe).

Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false.
I would assume released is '0' since

> released_stripes = {
> first = 0x0
> }

And __get_priority_stripe fetches a sh from conf->handle_list and deletes it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe puts the sh back on handle_list. So batch_size can't be '0', and handle_active_stripes repeats the process in the busy loop. This is my best guess to explain the busy loop situation.

> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check if there are bad blocks on the underlying devices (sd*)?

> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked processes can make progress.

> I, of coruse, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with it first disks)

These data did help the investigation!

Thanks,
Guoqing
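[Editorial note] Guoqing's "best guess" above can be illustrated with a toy model. This is deliberately not kernel code — just a sketch, under the stated assumption, of the queueing behaviour he describes: while MD_SB_CHANGE_PENDING is set, handle_stripe bails out early with STRIPE_HANDLE still set, do_release_stripe re-queues the stripe onto handle_list, and raid5d's exit test "if (!batch_size && !released) break;" never fires.

```python
from collections import deque

# Toy model (assumed simplification, not kernel code) of the guessed
# raid5d busy loop: handled stripes are re-queued instead of released
# while a superblock update is pending, so the handle list never drains.

def raid5d_iterations(n_stripes, sb_change_pending, max_rounds=1000):
    handle_list = deque(range(n_stripes))
    rounds = 0
    while rounds < max_rounds:
        # __get_priority_stripe: take up to a batch of stripes off the list.
        batch = [handle_list.popleft() for _ in range(min(8, len(handle_list)))]
        for sh in batch:
            if sb_change_pending:
                # handle_stripe bails out, STRIPE_HANDLE stays set,
                # do_release_stripe re-queues: no progress is made.
                handle_list.append(sh)
        if not batch:          # models "if (!batch_size && !released) break;"
            break
        rounds += 1
    return rounds, len(handle_list)

# With the flag set, the loop only stops at our artificial cap:
print(raid5d_iterations(32, True))    # (1000, 32)
# Once the flag is cleared (echo active > .../array_state), it drains:
print(raid5d_iterations(32, False))   # (4, 0)
```

The second call mirrors the observed recovery: once SB_CHANGE_PENDING is cleared via array_state, the stripes complete and active_stripes can fall below the allocation threshold again.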
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-20 16:33 ` Guoqing Jiang
@ 2021-01-23 13:04 ` Donald Buczek
  2021-01-25  8:54 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2021-01-23 13:04 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 20.01.21 17:33, Guoqing Jiang wrote:
> Hi Donald,
>
> On 1/19/21 12:30, Donald Buczek wrote:
>> Dear md-raid people,
>>
>> I've reported a problem in this thread in December:
>>
>> "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear that this is related to "mdcheck" running, because we found that the command which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>>
>> On 12/21/20, I've reported that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.
>>
>> I was able to collect a lot of additional information from a blocked system.
>>
>> - The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread.
[ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>>
>> - The sync thread ("md1_resync") does not finish, because it's blocked at
>>
>> [<0>] raid5_get_active_stripe+0x4c4/0x660   # [1]
>> [<0>] raid5_sync_request+0x364/0x390
>> [<0>] md_do_sync+0xb41/0x1030
>> [<0>] md_thread+0x122/0x160
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>>
>> - yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
>
> The resync thread is blocked since it can't get sh from the inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.
>
>> - We have lots of active stripes:
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
>> 27534
>
> There are too many active stripes, so this is false:
>
> atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
>
> so raid5_get_active_stripe has to wait till it becomes true; either increase max_nr_stripes or decrease active_stripes.
>
> 1. Increase max_nr_stripes
> Since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could allocate a new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.
>
> Another thing is you can try to increase the number of sh manually by writing a new number to stripe_cache_size if there is enough memory.
>
> 2.
Or decrease active_stripes
> This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set.
>
>> - handle_stripe() doesn't make progress:
>>
>> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>>
>> In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>>
>> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>>
>> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>>
>> state=0x2031 cnt=1, check:0  # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
>> state=0x2029 cnt=1, check:4  # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
>> state=0x2009 cnt=1, check:0  # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>>
>> - We have MD_SB_CHANGE_PENDING set:
>
> That is because of the MD_SB_CHANGE_PENDING flag, do_release_stripe can't call atomic_dec(&conf->active_stripes).
>
> Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be called by md_reap_sync_thread. But md_reap_sync_thread is stuck unregistering the sync thread (it was blocked in raid5_get_active_stripe).
>
> Still, I don't understand well why mdX_raid6 is in a busy loop; maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false.
I would assume released is '0' since
>
>> released_stripes = {
>>   first = 0x0
>> }
>
> And __get_priority_stripe fetches sh from conf->handle_list and deletes it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe puts the sh back on handle_list. So batch_size can't be '0', and handle_active_stripes repeats the process in the busy loop. This is my best guess to explain the busy-loop situation.
>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
>> write-pending
>>
>> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
>
> Since rdev_set_badblocks could set them, could you check if there are bad blocks on the underlying devices (sd*)?

The problem is, we are still not able to run a system into the deadlocked state by a repeatable procedure, so I was not yet able to explicitly check for bad blocks. However, on the production systems which locked up and on the test systems which we managed to lock up, there was always some file-system activity to the affected devices aside from mdcheck itself. This alone would explain any clean->write-pending transitions and might be a required condition for the problem to happen. I'm currently trying to explicitly exercise this path (by write/fsync()/sleep), with no result yet.

Btw: When you monitor md/stripe_cache_active during an ongoing "check" on an otherwise idle system, you see a variety of values up to a certain maximum value, which is visible most of the time. But this maximum value seems to continuously increase. Also, when you start the check at higher blocks (via md/sync_min) right away, the maximum value seems to be higher. Can this be explained? A bigger seek gap between superblocks and active data area? But should this increase the number of active stripes?
Best
  Donald

>> So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>>
>> - The system can manually be freed from the deadlock:
>>
>> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
>
> Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked process can make progress.
>
>> I, of course, don't fully understand it yet. Any ideas?
>>
>> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with its first disks)
>
> These data did help the investigation!
>
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-23 13:04 ` Donald Buczek
@ 2021-01-25  8:54 ` Donald Buczek
  2021-01-25 21:32 ` Donald Buczek
  0 siblings, 1 reply; 49+ messages in thread
From: Donald Buczek @ 2021-01-25  8:54 UTC (permalink / raw)
To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

a colleague of mine was able to produce the issue inside a vm and was able to find a procedure to run the vm into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`.

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex.
(https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.

- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.

Does this make sense?

Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash

(
    while true; do
        echo "start check"
        echo check > /sys/block/md3/md/sync_action
        sleep 10
        echo "stop check"
        echo idle > /sys/block/md3/md/sync_action
        sleep 2
    done
) &

(
    while true; do
        dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
        sync /mnt/raid_md3/bigfile
        rm /mnt/raid_md3/bigfile
        sleep .1
    done
) &

start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
    array_state=$(cat array_state)
    if [ "$array_state" = write-pending ]; then
        wp_count=$(($wp_count+1))
    else
        wp_count=0
    fi
    echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
    if [ $wp_count -ge 3 ]; then
        kill -- -$$
        exit
    fi
    sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with five minutes being typical.
The output ends like this:

309 check 6283872 / 8378368 active-idle 4144
310 check 6283872 / 8378368 active 1702
311 check 6807528 / 8378368 active 4152
312 check 7331184 / 8378368 clean 3021
stop check
313 check 7331184 / 8378368 write-pending 3905
314 check 7331184 / 8378368 write-pending 3905
315 check 7331184 / 8378368 write-pending 3905
Terminated

If I add

    pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get

[ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6

If I add

    pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);

at the head of md_check_recovery, I get:

[  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006

More debug lines in md_check_recovery confirm the control flow (`if (mddev_trylock(mddev))` block not taken).

What approach would you suggest to fix this?

Best
  Donald

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-25 8:54 ` Donald Buczek @ 2021-01-25 21:32 ` Donald Buczek 2021-01-26 0:44 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-25 21:32 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 25.01.21 09:54, Donald Buczek wrote: > Dear Guoqing, > > a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem. > > My current theory of what is happening is: > > - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308) > > - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action` > > - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero. > > - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping. > > - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. 
However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared. > > - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate. > > Does this make sense? > > Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output: > > ``` > #! /bin/bash > > ( > while true; do > echo "start check" > echo check > /sys/block/md3/md/sync_action > sleep 10 > echo "stop check" > echo idle > /sys/block/md3/md/sync_action > sleep 2 > done > ) & > > ( > while true; do > dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none > sync /mnt/raid_md3/bigfile > rm /mnt/raid_md3/bigfile > sleep .1 > done > ) & > > start="$(date +%s)" > cd /sys/block/md3/md > wp_count=0 > while true; do > array_state=$(cat array_state) > if [ "$array_state" = write-pending ]; then > wp_count=$(($wp_count+1)) > else > wp_count=0 > fi > echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active) > if [ $wp_count -ge 3 ]; then > kill -- -$$ > exit > fi > sleep 1 > done > ``` > > The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. 
The output ends like this: > > 309 check 6283872 / 8378368 active-idle 4144 > 310 check 6283872 / 8378368 active 1702 > 311 check 6807528 / 8378368 active 4152 > 312 check 7331184 / 8378368 clean 3021 > stop check > 313 check 7331184 / 8378368 write-pending 3905 > 314 check 7331184 / 8378368 write-pending 3905 > 315 check 7331184 / 8378368 write-pending 3905 > Terminated > > If I add > > pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags); > > in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get > > [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 > > If I add > > pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags); > > at the head of md_check_recovery, I get: > > [ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > [ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 > > More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken ) > > What approach would you suggest to fix this? I naively tried the following patch and it seems to fix the problem. 
The test procedure didn't trigger the deadlock in 10 hours.

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    mddev_lock(mddev) == 0) {
+			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 			flush_workqueue(md_misc_wq);
 			if (mddev->sync_thread) {
 				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 				md_reap_sync_thread(mddev);
 			}
+			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 			mddev_unlock(mddev);
 		}
 	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-25 21:32 ` Donald Buczek @ 2021-01-26 0:44 ` Guoqing Jiang 2021-01-26 9:50 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-26 0:44 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid [-- Attachment #1: Type: text/plain, Size: 8583 bytes --] Hi Donald, On 1/25/21 22:32, Donald Buczek wrote: > > > On 25.01.21 09:54, Donald Buczek wrote: >> Dear Guoqing, >> >> a colleague of mine was able to produce the issue inside a vm and were >> able to find a procedure to run the vm into the issue within minutes >> (not unreliably after hours on a physical system as before). This of >> course helped to pinpoint the problem. >> >> My current theory of what is happening is: >> >> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by >> md_write_start() when file-system I/O wants to do a write and the >> array transitions from "clean" to "active". >> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308) >> >> - Before raid5d gets to write the superblock (its busy processing >> active stripes because of the sync activity) , userspace wants to >> pause the check by `echo idle > /sys/block/mdX/md/sync_action` >> >> - action_store() takes the reconfig_mutex before trying to stop the >> sync thread. >> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump >> of struct mddev of email 1/19/21 confirms reconf_mutex non-zero. >> >> - raid5d is running in its main loop. >> raid5d()->handle_active_stripes() returns a positive batch size ( >> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 >> ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't >> process any stripe because of MD_SB_CHANGE_PENDING. >> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 >> ). This is the reason, raid5d is busy looping. 
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>>
>> Does this make sense?

Yes, exactly! That was my thought too; the scenario is twisted.

The resync thread is blocked because there are too many active stripes; raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means active stripes can't be decreased; and the echo idle command can't make progress given the resync thread is blocked, while the command still holds the reconfig_mutex, which keeps raid5d in the busy loop and unable to clear the SB_CHANGE_PENDING flag.

And raid5 could suffer from the same issue, I think.

>>
>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #!
/bin/bash >> >> ( >> while true; do >> echo "start check" >> echo check > /sys/block/md3/md/sync_action >> sleep 10 >> echo "stop check" >> echo idle > /sys/block/md3/md/sync_action >> sleep 2 >> done >> ) & >> >> ( >> while true; do >> dd bs=1k count=$((5*1024*1024)) if=/dev/zero >> of=/mnt/raid_md3/bigfile status=none >> sync /mnt/raid_md3/bigfile >> rm /mnt/raid_md3/bigfile >> sleep .1 >> done >> ) & >> >> start="$(date +%s)" >> cd /sys/block/md3/md >> wp_count=0 >> while true; do >> array_state=$(cat array_state) >> if [ "$array_state" = write-pending ]; then >> wp_count=$(($wp_count+1)) >> else >> wp_count=0 >> fi >> echo $(($(date +%s)-$start)) $(cat sync_action) $(cat >> sync_completed) $array_state $(cat stripe_cache_active) >> if [ $wp_count -ge 3 ]; then >> kill -- -$$ >> exit >> fi >> sleep 1 >> done >> ``` >> >> The time, this needs to trigger the issue, varies from under a minute >> to one hour with 5 minute being typical. The output ends like this: >> >> 309 check 6283872 / 8378368 active-idle 4144 >> 310 check 6283872 / 8378368 active 1702 >> 311 check 6807528 / 8378368 active 4152 >> 312 check 7331184 / 8378368 clean 3021 >> stop check >> 313 check 7331184 / 8378368 write-pending 3905 >> 314 check 7331184 / 8378368 write-pending 3905 >> 315 check 7331184 / 8378368 write-pending 3905 >> Terminated >> >> If I add >> >> pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", >> batch_size, released, mddev->sb_flags); >> >> in raid5d after the call to handle_active_stripes and enable the debug >> location after the deadlock occurred, I get >> >> [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 >> mdddev->sb_flags 6 >> >> If I add >> >> pr_debug("XXX 1 %s:%d mddev->flags %08lx 
mddev->sb_flags >> %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags); >> >> at the head of md_check_recovery, I get: >> >> [ 789.555462] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555477] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555491] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555505] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> [ 789.555520] [1191] md_check_recovery:8970: XXX 1 >> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >> >> More debug lines in md_check_recovery confirm the control flow ( `if >> (mddev_trylock(mddev))` block not taken ) >> That is great that you have a reproducer now! >> What approach would you suggest to fix this? > > I naively tried the following patch and it seems to fix the problem. The > test procedure didn't trigger the deadlock in 10 hours. > > D. 
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>  		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>  		    mddev_lock(mddev) == 0) {
> +			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>  			flush_workqueue(md_misc_wq);
>  			if (mddev->sync_thread) {
>  				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>  				md_reap_sync_thread(mddev);
>  			}
> +			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>  			mddev_unlock(mddev);
>  		}
>  	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock issue, but I am not sure if it is the right way, given we only set ALLOW_SB_UPDATE in suspend, which makes sense since the I/O will be quiesced, but the write-idle action can't guarantee a similar thing. I prefer to make the resync thread not wait forever here:

	wait_event_lock_irq(
		conf->wait_for_stripe,
		!list_empty(conf->inactive_list + hash) &&
		(atomic_read(&conf->active_stripes)
		 < (conf->max_nr_stripes * 3 / 4)
		 || !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
		*(conf->hash_locks + hash));

So, could you please try the attached?
Thanks,
Guoqing

[-- Attachment #2: raid5-proposal --]
[-- Type: text/plain, Size: 5862 bytes --]

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4337ae0..378ce5c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
 {
 	struct stripe_head *sh;
 
-	sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
+	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
 	if (!sh)
 		return NULL;  /* no more stripe available */
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8888973..33a2a22 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
 }
 
 struct stripe_head *
-raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
+raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
 			int previous, int noblock, int noquiesce)
 {
 	struct stripe_head *sh;
+	struct mddev *mddev = conf->mddev;
 	int hash = stripe_hash_locks_hash(conf, sector);
 	int inc_empty_inactive_list_flag;
 
@@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 				(atomic_read(&conf->active_stripes)
 				 < (conf->max_nr_stripes * 3 / 4)
 				 || !test_bit(R5_INACTIVE_BLOCKED,
-					      &conf->cache_state)),
+					      &conf->cache_state)
+				 || (test_bit(MD_RECOVERY_INTR,
+					      &mddev->recovery)
+				     && sync_req)),
 				*(conf->hash_locks + hash));
+			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
+			    && sync_req)
+				break;
 			clear_bit(R5_INACTIVE_BLOCKED,
 				  &conf->cache_state);
 		} else {
@@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
 			sector_t bn = raid5_compute_blocknr(sh, i, 1);
 			sector_t s = raid5_compute_sector(conf, bn, 0,
 							  &dd_idx, NULL);
-			sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
+			sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
 				 * have been requested.  When later blocks
@@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
 	/* Finish reconstruct operations initiated by the expansion process */
 	if (sh->reconstruct_state == reconstruct_state_result) {
 		struct stripe_head *sh_src
-			= raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
+			= raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
 		if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
 			/* sh cannot be written until sh_src has been read.
 			 * so arrange for sh to be delayed a little
@@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		DEFINE_WAIT(w);
 		int d;
 	again:
-		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
+		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
 		prepare_to_wait(&conf->wait_for_overlap, &w,
 				TASK_UNINTERRUPTIBLE);
 		set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
@@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			 (unsigned long long)new_sector,
 			 (unsigned long long)logical_sector);
 
-		sh = raid5_get_active_stripe(conf, new_sector, previous,
+		sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
 				       (bi->bi_opf & REQ_RAHEAD), 0);
 		if (sh) {
 			if (unlikely(previous)) {
@@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
 		int j;
 		int skipped_disk = 0;
-		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
+		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
 		set_bit(STRIPE_EXPANDING, &sh->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
@@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
-		sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
+		sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
 		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
@@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 
 	md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
 
-	sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
+	sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
 	if (sh == NULL) {
-		sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
+		sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
+		if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+			*skipped = 1;
+			return 0;
+		}
+
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access
 		 */
@@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
 			/* already done this stripe */
 			continue;
 
-		sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
+		sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 5c05acf..d9eab46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
 				     int previous, int *dd_idx,
 				     struct stripe_head *sh);
 extern struct stripe_head *
-raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
+raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
 			int previous, int noblock, int noquiesce);
 extern int raid5_calc_degraded(struct r5conf *conf);
 extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 0:44 ` Guoqing Jiang @ 2021-01-26 9:50 ` Donald Buczek 2021-01-26 11:14 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-26 9:50 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoqing, On 26.01.21 01:44, Guoqing Jiang wrote: > Hi Donald, > > On 1/25/21 22:32, Donald Buczek wrote: >> >> >> On 25.01.21 09:54, Donald Buczek wrote: >>> Dear Guoqing, >>> >>> a colleague of mine was able to produce the issue inside a vm and to find a procedure which runs the vm into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem. >>> >>> My current theory of what is happening is: >>> >>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308) >>> >>> - Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action` >>> >>> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) A dump of struct mddev from the email of 1/19/21 confirms reconfig_mutex is non-zero. >>> >>> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is the reason raid5d is busy looping.
>>> >>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared. >>> >>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate. >>> >>> Does this make sense? > > Yes, exactly! That was my thought too, the scenario is twisted. > > The resync thread is blocked because there are too many active stripes; raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means the number of active stripes can't decrease; and the echo idle cmd can't make progress given the resync thread is blocked, while the cmd still holds the reconfig_mutex, which keeps raid5d in the busy loop and unable to clear the SB_CHANGE_PENDING flag. > > And raid5 could suffer from the same issue, I think. > >>> >>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output: >>> >>> ``` >>> #!
/bin/bash >>> >>> ( >>> while true; do >>> echo "start check" >>> echo check > /sys/block/md3/md/sync_action >>> sleep 10 >>> echo "stop check" >>> echo idle > /sys/block/md3/md/sync_action >>> sleep 2 >>> done >>> ) & >>> >>> ( >>> while true; do >>> dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none >>> sync /mnt/raid_md3/bigfile >>> rm /mnt/raid_md3/bigfile >>> sleep .1 >>> done >>> ) & >>> >>> start="$(date +%s)" >>> cd /sys/block/md3/md >>> wp_count=0 >>> while true; do >>> array_state=$(cat array_state) >>> if [ "$array_state" = write-pending ]; then >>> wp_count=$(($wp_count+1)) >>> else >>> wp_count=0 >>> fi >>> echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active) >>> if [ $wp_count -ge 3 ]; then >>> kill -- -$$ >>> exit >>> fi >>> sleep 1 >>> done >>> ``` >>> >>> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this: >>> >>> 309 check 6283872 / 8378368 active-idle 4144 >>> 310 check 6283872 / 8378368 active 1702 >>> 311 check 6807528 / 8378368 active 4152 >>> 312 check 7331184 / 8378368 clean 3021 >>> stop check >>> 313 check 7331184 / 8378368 write-pending 3905 >>> 314 check 7331184 / 8378368 write-pending 3905 >>> 315 check 7331184 / 8378368 write-pending 3905 >>> Terminated >>> >>> If I add >>> >>> pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags); >>> >>> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get >>> >>> [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6 >>> >>> If I add >>> >>> 
pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags); >>> >>> at the head of md_check_recovery, I get: >>> >>> [ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> [ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006 >>> >>> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken ) >>> > > That is great that you have a reproducer now! > >>> What approach would you suggest to fix this? >> >> I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours. >> >> D. 
>> >> diff --git a/drivers/md/md.c b/drivers/md/md.c >> index 2d21c298ffa7..f40429843906 100644 >> --- a/drivers/md/md.c >> +++ b/drivers/md/md.c >> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len) >> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); >> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >> mddev_lock(mddev) == 0) { >> + set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >> flush_workqueue(md_misc_wq); >> if (mddev->sync_thread) { >> set_bit(MD_RECOVERY_INTR, &mddev->recovery); >> md_reap_sync_thread(mddev); >> } >> + clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >> mddev_unlock(mddev); >> } >> } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) > > Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the similar thing. Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage. > I prefer to make resync thread not wait forever here. > > wait_event_lock_irq( > conf->wait_for_stripe, > !list_empty(conf->inactive_list + hash) && > (atomic_read(&conf->active_stripes) > < (conf->max_nr_stripes * 3 / 4) > || !test_bit(R5_INACTIVE_BLOCKED, > &conf->cache_state) > *(conf->hash_locks + hash)); > > So, could you please try the attached? 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c > index 4337ae0..378ce5c 100644 > --- a/drivers/md/raid5-cache.c > +++ b/drivers/md/raid5-cache.c > @@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe( > { > struct stripe_head *sh; > > - sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0); > + sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0); > if (!sh) > return NULL; /* no more stripe available */ > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index 8888973..33a2a22 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf) > } > > struct stripe_head * > -raid5_get_active_stripe(struct r5conf *conf, sector_t sector, > +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req, > int previous, int noblock, int noquiesce) > { > struct stripe_head *sh; > + struct mddev *mddev = conf->mddev; > int hash = stripe_hash_locks_hash(conf, sector); > int inc_empty_inactive_list_flag; > > @@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector, > (atomic_read(&conf->active_stripes) > < (conf->max_nr_stripes * 3 / 4) > || !test_bit(R5_INACTIVE_BLOCKED, > - &conf->cache_state)), > + &conf->cache_state) > + || (test_bit(MD_RECOVERY_INTR, > + &mddev->recovery) > + && sync_req)), > *(conf->hash_locks + hash)); > + if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) > + && sync_req) > + break; > clear_bit(R5_INACTIVE_BLOCKED, > &conf->cache_state); > } else { > @@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh) > sector_t bn = raid5_compute_blocknr(sh, i, 1); > sector_t s = raid5_compute_sector(conf, bn, 0, > &dd_idx, NULL); > - sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1); > + sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1); > if (sh2 == NULL) > /* so far only the early blocks of this stripe > * have been requested. 
When later blocks > @@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh) > /* Finish reconstruct operations initiated by the expansion process */ > if (sh->reconstruct_state == reconstruct_state_result) { > struct stripe_head *sh_src > - = raid5_get_active_stripe(conf, sh->sector, 1, 1, 1); > + = raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1); > if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) { > /* sh cannot be written until sh_src has been read. > * so arrange for sh to be delayed a little > @@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi) > DEFINE_WAIT(w); > int d; > again: > - sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0); > + sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0); > prepare_to_wait(&conf->wait_for_overlap, &w, > TASK_UNINTERRUPTIBLE); > set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags); > @@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) > (unsigned long long)new_sector, > (unsigned long long)logical_sector); > > - sh = raid5_get_active_stripe(conf, new_sector, previous, > + sh = raid5_get_active_stripe(conf, new_sector, previous, 0, Mistake here (fourth argument added instead of third) > (bi->bi_opf & REQ_RAHEAD), 0); > if (sh) { > if (unlikely(previous)) { > @@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk > for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) { > int j; > int skipped_disk = 0; > - sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1); > + sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1); > set_bit(STRIPE_EXPANDING, &sh->state); > atomic_inc(&conf->reshape_stripes); > /* If any of this stripe is beyond the end of the old > @@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk > if (last_sector >= mddev->dev_sectors) > last_sector = mddev->dev_sectors - 
1; > while (first_sector <= last_sector) { > - sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1); > + sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1); > set_bit(STRIPE_EXPAND_SOURCE, &sh->state); > set_bit(STRIPE_HANDLE, &sh->state); > raid5_release_stripe(sh); > @@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n > > md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false); > > - sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0); > + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0); > if (sh == NULL) { > - sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0); > + sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0); > + if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { > + *skipped = 1; > + return 0; > + } > + > /* make sure we don't swamp the stripe cache if someone else > * is trying to get access > */ > @@ -6334,7 +6346,7 @@ static int retry_aligned_read(struct r5conf *conf, struct bio *raid_bio, > /* already done this stripe */ > continue; > > - sh = raid5_get_active_stripe(conf, sector, 0, 1, 1); > + sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1); > > if (!sh) { > /* failed to get a stripe - must wait */ > diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h > index 5c05acf..d9eab46 100644 > --- a/drivers/md/raid5.h > +++ b/drivers/md/raid5.h > @@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector, > int previous, int *dd_idx, > struct stripe_head *sh); > extern struct stripe_head * > -raid5_get_active_stripe(struct r5conf *conf, sector_t sector, > +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req, > int previous, int noblock, int noquiesce); > extern int raid5_calc_degraded(struct r5conf *conf); > extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode); Unfortunately, this patch did not fix the issue. 
root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
[<0>] raid5_get_active_stripe+0x1e7/0x6b0
[<0>] raid5_sync_request+0x2a7/0x3d0
[<0>] md_do_sync.cold+0x3ee/0x97c
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

which is (according to objdump -d -Sl drivers/md/raid5.o) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore the number of active stripes is not decreasing? Who is expected to wake up the conf->wait_for_stripe waiters?

Best
  Donald

>
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
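Donald's question points at the heart of the problem: every party is waiting for another one. The circular wait pieced together in this thread can be modeled as a tiny waits-for graph (the node labels below are descriptive, not kernel identifiers; this is a sketch of the analysis, not of the code):

```python
# descriptive labels for the four waiting parties (not kernel identifiers)
waits_for = {
    "echo idle (holds reconfig_mutex)": "sync thread (md_do_sync)",
    "sync thread (md_do_sync)": "stripe cache (active stripes must drop)",
    "stripe cache (active stripes must drop)":
        "raid5d (must write superblock, clear MD_SB_CHANGE_PENDING)",
    "raid5d (must write superblock, clear MD_SB_CHANGE_PENDING)":
        "echo idle (holds reconfig_mutex)",
}

def find_cycle(graph, start):
    """Follow waits-for edges from start until a node repeats."""
    seen = []
    node = start
    while node not in seen:
        seen.append(node)
        node = graph[node]
    return seen[seen.index(node):]

cycle = find_cycle(waits_for, "sync thread (md_do_sync)")
assert len(cycle) == 4   # all four parties form one closed cycle
```

Since the chain is closed, breaking any one edge (releasing the mutex, bounding the wait, or allowing the superblock write) is enough to resolve the deadlock, which is what the proposals in the rest of the thread try in different places.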
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 9:50 ` Donald Buczek @ 2021-01-26 11:14 ` Guoqing Jiang 2021-01-26 12:58 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-26 11:14 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 1/26/21 10:50, Donald Buczek wrote: [...] >>> >>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>> index 2d21c298ffa7..f40429843906 100644 >>> --- a/drivers/md/md.c >>> +++ b/drivers/md/md.c >>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char >>> *page, size_t len) >>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); >>> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >>> mddev_lock(mddev) == 0) { >>> + set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>> flush_workqueue(md_misc_wq); >>> if (mddev->sync_thread) { >>> set_bit(MD_RECOVERY_INTR, &mddev->recovery); >>> md_reap_sync_thread(mddev); >>> } >>> + clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>> mddev_unlock(mddev); >>> } >>> } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >> >> Yes, it could break the deadlock issue, but I am not sure if it is the >> right way given we only set ALLOW_SB_UPDATE in suspend which makes >> sense since the io will be quiesced, but write idle action can't >> guarantee the similar thing. > > Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of > reconfig_mutex promises not to make any changes which would exclude > superblocks from being written" might make it easier to accept the usage. I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that. > >> I prefer to make resync thread not wait forever here. >> [...] 
>> >> - sh = raid5_get_active_stripe(conf, new_sector, previous, >> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0, > > > Mistake here (fourth argument added instead of third) Thanks for checking. [...] > Unfortunately, this patch did not fix the issue. > > root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack > [<0>] raid5_get_active_stripe+0x1e7/0x6b0 > [<0>] raid5_sync_request+0x2a7/0x3d0 > [<0>] md_do_sync.cold+0x3ee/0x97c > [<0>] md_thread+0xab/0x160 > [<0>] kthread+0x11b/0x140 > [<0>] ret_from_fork+0x22/0x30 > > which is ( according to objdump -d -Sl drivers/md/raid5.o ) at > https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735 > > Isn't it still the case that the superblocks are not written, therefore > stripes are not processed, therefore number of active stripes are not > decreasing? Who is expected to wake up conf->wait_for_stripe waiters? Hmm, how about wake the waiter up in the while loop of raid5d? @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) md_check_recovery(mddev); spin_lock_irq(&conf->device_lock); } + + if ((atomic_read(&conf->active_stripes) + < (conf->max_nr_stripes * 3 / 4) || + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) + wake_up(&conf->wait_for_stripe); } pr_debug("%d stripes handled\n", handled); If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like. wait_event_lock_irq(conf->wait_for_stripe, (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) || /* the previous condition */, *(conf->hash_locks + hash)); Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 11:14 ` Guoqing Jiang @ 2021-01-26 12:58 ` Donald Buczek 2021-01-26 14:06 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-26 12:58 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 26.01.21 12:14, Guoqing Jiang wrote: > Hi Donald, > > On 1/26/21 10:50, Donald Buczek wrote: > [...] > >>>> >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c >>>> index 2d21c298ffa7..f40429843906 100644 >>>> --- a/drivers/md/md.c >>>> +++ b/drivers/md/md.c >>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len) >>>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); >>>> if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && >>>> mddev_lock(mddev) == 0) { >>>> + set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>>> flush_workqueue(md_misc_wq); >>>> if (mddev->sync_thread) { >>>> set_bit(MD_RECOVERY_INTR, &mddev->recovery); >>>> md_reap_sync_thread(mddev); >>>> } >>>> + clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags); >>>> mddev_unlock(mddev); >>>> } >>>> } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) >>> >>> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the similar thing. >> >> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage. > > I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that. > >> >>> I prefer to make resync thread not wait forever here. >>> > > [...] 
> >>> >>> - sh = raid5_get_active_stripe(conf, new_sector, previous, >>> + sh = raid5_get_active_stripe(conf, new_sector, previous, 0, >> >> >> Mistake here (fourth argument added instead of third) > > Thanks for checking. > > [...] > >> Unfortunately, this patch did not fix the issue. >> >> root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack >> [<0>] raid5_get_active_stripe+0x1e7/0x6b0 >> [<0>] raid5_sync_request+0x2a7/0x3d0 >> [<0>] md_do_sync.cold+0x3ee/0x97c >> [<0>] md_thread+0xab/0x160 >> [<0>] kthread+0x11b/0x140 >> [<0>] ret_from_fork+0x22/0x30 >> >> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735 >> >> Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore number of active stripes are not decreasing? Who is expected to wake up conf->wait_for_stripe waiters? > > Hmm, how about wake the waiter up in the while loop of raid5d? > > @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) > md_check_recovery(mddev); > spin_lock_irq(&conf->device_lock); > } > + > + if ((atomic_read(&conf->active_stripes) > + < (conf->max_nr_stripes * 3 / 4) || > + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) > + wake_up(&conf->wait_for_stripe); > } > pr_debug("%d stripes handled\n", handled); Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at root@sloth:~# cat /proc/$(pgrep md3_resync)/stack [<0>] md_do_sync.cold+0x8ec/0x97c [<0>] md_thread+0xab/0x160 [<0>] kthread+0x11b/0x140 [<0>] ret_from_fork+0x22/0x30 instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg. Best Donald > If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like. 
> > wait_event_lock_irq(conf->wait_for_stripe, > (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) || > /* the previous condition */, > *(conf->hash_locks + hash)); > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 12:58 ` Donald Buczek @ 2021-01-26 14:06 ` Guoqing Jiang 2021-01-26 16:05 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-01-26 14:06 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 1/26/21 13:58, Donald Buczek wrote: > > >> Hmm, how about wake the waiter up in the while loop of raid5d? >> >> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >> md_check_recovery(mddev); >> spin_lock_irq(&conf->device_lock); >> } >> + >> + if ((atomic_read(&conf->active_stripes) >> + < (conf->max_nr_stripes * 3 / 4) || >> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >> + wake_up(&conf->wait_for_stripe); >> } >> pr_debug("%d stripes handled\n", handled); > > Hmm... With this patch on top of your other one, we still have the basic > symptoms (md3_raid6 busy looping), but the sync thread is now hanging at > > root@sloth:~# cat /proc/$(pgrep md3_resync)/stack > [<0>] md_do_sync.cold+0x8ec/0x97c > [<0>] md_thread+0xab/0x160 > [<0>] kthread+0x11b/0x140 > [<0>] ret_from_fork+0x22/0x30 > > instead, which is > https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync. Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted. wait_event(mddev->recovery_wait, test_bit(MD_RECOVERY_INTR,&mddev->recovery) || !atomic_read(&mddev->recovery_active)); > And, unlike before, "md: md3: data-check interrupted." from the pr_info > two lines above appears in dmesg. Yes, that is intentional since MD_RECOVERY_INTR is set by write idle. Anyway, will try the script and investigate more about the issue. 
Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 14:06 ` Guoqing Jiang @ 2021-01-26 16:05 ` Donald Buczek 2021-02-02 15:42 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-01-26 16:05 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoqing, On 26.01.21 15:06, Guoqing Jiang wrote: > > > On 1/26/21 13:58, Donald Buczek wrote: >> >> >>> Hmm, how about wake the waiter up in the while loop of raid5d? >>> >>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >>> md_check_recovery(mddev); >>> spin_lock_irq(&conf->device_lock); >>> } >>> + >>> + if ((atomic_read(&conf->active_stripes) >>> + < (conf->max_nr_stripes * 3 / 4) || >>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >>> + wake_up(&conf->wait_for_stripe); >>> } >>> pr_debug("%d stripes handled\n", handled); >> >> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at >> >> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >> [<0>] md_do_sync.cold+0x8ec/0x97c >> [<0>] md_thread+0xab/0x160 >> [<0>] kthread+0x11b/0x140 >> [<0>] ret_from_fork+0x22/0x30 >> >> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 > > Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync. > > Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted. > > wait_event(mddev->recovery_wait, > test_bit(MD_RECOVERY_INTR,&mddev->recovery) || > !atomic_read(&mddev->recovery_active)); With that added, md3_resync goes into a loop, too. Not 100% busy, though. 
root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

Best
  Donald

>
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
>
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
>
> Anyway, will try the script and investigate more about the issue.
>
> Thanks,
> Guoqing

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 49+ messages in thread
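The sleep/wake cycle Donald observes here is the normal contract of wait_event-style primitives: a wakeup is only a hint, and the waiter re-checks its predicate and goes back to sleep while the predicate is still false. A minimal Python analogue (illustrative only, using a condition variable in place of the kernel wait queue):

```python
import threading

cond = threading.Condition()
state = {"ok": False}

def waiter():
    # like wait_event_lock_irq(): a wakeup is only a hint; the predicate is
    # re-checked after every wakeup and the thread re-sleeps while it is false
    with cond:
        while not state["ok"]:
            cond.wait(timeout=0.5)

t = threading.Thread(target=waiter)
t.start()

with cond:
    cond.notify_all()    # wakeup while the predicate is still false: no progress

with cond:
    state["ok"] = True   # only changing the condition lets the waiter return
    cond.notify_all()

t.join(timeout=5)
assert state["ok"] and not t.is_alive()
```

This is why waking conf->wait_for_stripe from raid5d is not enough by itself: unless the waiter's condition (a free stripe, or an escape clause such as MD_RECOVERY_INTR) actually becomes true, the woken thread immediately sleeps again, which matches the alternating 0x1e7/0x1ef offsets in the stack.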
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-01-26 16:05 ` Donald Buczek @ 2021-02-02 15:42 ` Guoqing Jiang 2021-02-08 11:38 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-02-02 15:42 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 1/26/21 17:05, Donald Buczek wrote: > Dear Guoqing, > > On 26.01.21 15:06, Guoqing Jiang wrote: >> >> >> On 1/26/21 13:58, Donald Buczek wrote: >>> >>> >>>> Hmm, how about wake the waiter up in the while loop of raid5d? >>>> >>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >>>> md_check_recovery(mddev); >>>> spin_lock_irq(&conf->device_lock); >>>> } >>>> + >>>> + if ((atomic_read(&conf->active_stripes) >>>> + < (conf->max_nr_stripes * 3 / 4) || >>>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >>>> + wake_up(&conf->wait_for_stripe); >>>> } >>>> pr_debug("%d stripes handled\n", handled); >>> >>> Hmm... With this patch on top of your other one, we still have the >>> basic symptoms (md3_raid6 busy looping), but the sync thread is now >>> hanging at >>> >>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >>> [<0>] md_do_sync.cold+0x8ec/0x97c >>> [<0>] md_thread+0xab/0x160 >>> [<0>] kthread+0x11b/0x140 >>> [<0>] ret_from_fork+0x22/0x30 >>> >>> instead, which is >>> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 >> >> Not sure why recovery_active is not zero, because it is set 0 before >> blk_start_plug, and raid5_sync_request returns 0 and skipped is also >> set to 1. Perhaps handle_stripe calls md_done_sync. >> >> Could you double check the value of recovery_active? Or just don't >> wait if resync thread is interrupted. >> >> wait_event(mddev->recovery_wait, >> test_bit(MD_RECOVERY_INTR,&mddev->recovery) || >> !atomic_read(&mddev->recovery_active)); > > With that added, md3_resync goes into a loop, too. Not 100% busy, though. 
> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>
> [<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> [<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
> [<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x11b/0x140
> [<0>] ret_from_fork+0x22/0x30
>
> Sometimes the top of the stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

I don't know why the condition was not true after the change, since RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
	(test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
	/* the previous condition */,
	*(conf->hash_locks + hash));

BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there is only a limited number of active stripes, but I feel it is ugly to check the sh number from action_store (kind of a layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from the current situation (this is what we tried, though it doesn't work so far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the safety (but maybe I am wrong).

4. given that the write IO keeps coming from the upper layer, which decreases the available stripes, maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe at the end of make_request.

5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this.

	/* resync has finished, collect result */
	mddev_unlock(mddev);
	md_unregister_thread(&mddev->sync_thread);
	mddev_lock(mddev);

My suggestion would be to try 2 + 4 together, since the reproducer triggers both sync io and write io.
Or try 5. Perhaps there is a better way which I just can't find.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 49+ messages in thread
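Option 5, giving up reconfig_mutex around the unregister, can be sketched with ordinary threading primitives (a toy model with invented names; the real code has far more state to protect, and releasing the mutex opens a window for another holder):

```python
import threading

reconfig_mutex = threading.Lock()   # stand-in for mddev->reconfig_mutex
finished = threading.Event()

def sync_thread():
    # the sync thread needs the mutex briefly on its way out, modeling the
    # "resync has finished, collect result" path in md_do_sync()
    with reconfig_mutex:
        finished.set()

t = threading.Thread(target=sync_thread)
reconfig_mutex.acquire()            # action_store() holds reconfig_mutex
t.start()

# joining while still holding the mutex would deadlock;
# option 5 gives the mutex up around the join
reconfig_mutex.release()            # mddev_unlock(mddev)
t.join(timeout=5)                   # md_unregister_thread(&mddev->sync_thread)
reconfig_mutex.acquire()            # mddev_lock(mddev)
reconfig_mutex.release()

assert finished.is_set() and not t.is_alive()
```

The unlocked window is also the weakness raised in the follow-up: while the mutex is free, another userspace operation can take it and reconfigure the array under the unregistering thread.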
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-02 15:42 ` Guoqing Jiang @ 2021-02-08 11:38 ` Donald Buczek 2021-02-08 14:53 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-02-08 11:38 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 02.02.21 16:42, Guoqing Jiang wrote: > Hi Donald, > > On 1/26/21 17:05, Donald Buczek wrote: >> Dear Guoqing, >> >> On 26.01.21 15:06, Guoqing Jiang wrote: >>> >>> >>> On 1/26/21 13:58, Donald Buczek wrote: >>>> >>>> >>>>> Hmm, how about wake the waiter up in the while loop of raid5d? >>>>> >>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) >>>>> md_check_recovery(mddev); >>>>> spin_lock_irq(&conf->device_lock); >>>>> } >>>>> + >>>>> + if ((atomic_read(&conf->active_stripes) >>>>> + < (conf->max_nr_stripes * 3 / 4) || >>>>> + (test_bit(MD_RECOVERY_INTR, &mddev->recovery)))) >>>>> + wake_up(&conf->wait_for_stripe); >>>>> } >>>>> pr_debug("%d stripes handled\n", handled); >>>> >>>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at >>>> >>>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >>>> [<0>] md_do_sync.cold+0x8ec/0x97c >>>> [<0>] md_thread+0xab/0x160 >>>> [<0>] kthread+0x11b/0x140 >>>> [<0>] ret_from_fork+0x22/0x30 >>>> >>>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963 >>> >>> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync. >>> >>> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted. 
>>> >>> wait_event(mddev->recovery_wait, >>> test_bit(MD_RECOVERY_INTR,&mddev->recovery) || >>> !atomic_read(&mddev->recovery_active)); >> With that added, md3_resync goes into a loop, too. Not 100% busy, though. >> >> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack >> >> [<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735 >> [<0>] raid5_sync_request+0x2a7/0x3d0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274 >> [<0>] md_do_sync.cold+0x3ee/0x97c # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883 >> [<0>] md_thread+0xab/0x160 >> [<0>] kthread+0x11b/0x140 >> [<0>] ret_from_fork+0x22/0x30 >> >> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again. > I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request. > > wait_event_lock_irq(conf->wait_for_stripe, > (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) || > /* the previous condition */, > *(conf->hash_locks + hash)); Not knowing the system by heart, I'd probably need more than a full day to analyze that. Please let me know if this is needed. > BTW, I think there some some possible ways: > > 1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation). My first thought is that it is either needed to hold the mutex or it is not. Why should this depend on the number of available stripes? And wouldn't we get into the same situation, if another reconfiguration from userspace takes the mutex again? > 2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far). > > 3.
set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong). > > 4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe > at the end of make_request. > > 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this. > > /* resync has finished, collect result */ > mddev_unlock(mddev); > md_unregister_thread(&mddev->sync_thread); > mddev_lock(mddev); As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex? > My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find. I just get a bit suspicious that there might be something wrong with the abstraction, if it is needed to add special code here and here and here. But I'm afraid that, aside from testing, I can't be of too much help here, as I don't know all the components or details of the system that well. Best Donald > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-08 11:38 ` Donald Buczek @ 2021-02-08 14:53 ` Guoqing Jiang 2021-02-08 18:41 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Guoqing Jiang @ 2021-02-08 14:53 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 2/8/21 12:38, Donald Buczek wrote: >> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, >> like this. >> >> /* resync has finished, collect result */ >> mddev_unlock(mddev); >> md_unregister_thread(&mddev->sync_thread); >> mddev_lock(mddev); > > As above: While we wait for the sync thread to terminate, wouldn't it be > a problem, if another user space operation takes the mutex? I don't think other places can be blocked while holding the mutex, otherwise these places can cause a potential deadlock. Please try the above two-line change. And perhaps others have a better idea. Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-08 14:53 ` Guoqing Jiang @ 2021-02-08 18:41 ` Donald Buczek 2021-02-09 0:46 ` Guoqing Jiang 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2021-02-08 18:41 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Dear Guoqing, On 08.02.21 15:53, Guoqing Jiang wrote: > > > On 2/8/21 12:38, Donald Buczek wrote: >>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this. >>> >>> /* resync has finished, collect result */ >>> mddev_unlock(mddev); >>> md_unregister_thread(&mddev->sync_thread); >>> mddev_lock(mddev); >> >> As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex? > > I don't think other places can be blocked while hold mutex, otherwise these places can cause potential deadlock. Please try above two lines change. And perhaps others have better idea. Yes, this works. No deadlock after >11000 seconds, (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) Best Donald > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-08 18:41 ` Donald Buczek @ 2021-02-09 0:46 ` Guoqing Jiang 2021-02-09 9:24 ` Donald Buczek 2023-03-14 13:25 ` Marc Smith 0 siblings, 2 replies; 49+ messages in thread From: Guoqing Jiang @ 2021-02-09 0:46 UTC (permalink / raw) To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid Hi Donald, On 2/8/21 19:41, Donald Buczek wrote: > Dear Guoqing, > > On 08.02.21 15:53, Guoqing Jiang wrote: >> >> >> On 2/8/21 12:38, Donald Buczek wrote: >>>> 5. maybe don't hold reconfig_mutex when try to unregister >>>> sync_thread, like this. >>>> >>>> /* resync has finished, collect result */ >>>> mddev_unlock(mddev); >>>> md_unregister_thread(&mddev->sync_thread); >>>> mddev_lock(mddev); >>> >>> As above: While we wait for the sync thread to terminate, wouldn't it >>> be a problem, if another user space operation takes the mutex? >> >> I don't think other places can be blocked while hold mutex, otherwise >> these places can cause potential deadlock. Please try above two lines >> change. And perhaps others have better idea. > > Yes, this works. No deadlock after >11000 seconds, > > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) Great. I will send a formal patch with your reported-by and tested-by. Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-09 0:46 ` Guoqing Jiang @ 2021-02-09 9:24 ` Donald Buczek 2023-03-14 13:25 ` Marc Smith 1 sibling, 0 replies; 49+ messages in thread From: Donald Buczek @ 2021-02-09 9:24 UTC (permalink / raw) To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 09.02.21 01:46, Guoqing Jiang wrote: > Great. I will send a formal patch with your reported-by and tested-by. Yes, that's fine. Thanks a lot for your help! Donald > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2021-02-09 0:46 ` Guoqing Jiang 2021-02-09 9:24 ` Donald Buczek @ 2023-03-14 13:25 ` Marc Smith 2023-03-14 13:55 ` Guoqing Jiang 2023-03-15 7:52 ` Donald Buczek 1 sibling, 2 replies; 49+ messages in thread From: Marc Smith @ 2023-03-14 13:25 UTC (permalink / raw) To: Guoqing Jiang Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang <guoqing.jiang@cloud.ionos.com> wrote: > > Hi Donald, > > On 2/8/21 19:41, Donald Buczek wrote: > > Dear Guoqing, > > > > On 08.02.21 15:53, Guoqing Jiang wrote: > >> > >> > >> On 2/8/21 12:38, Donald Buczek wrote: > >>>> 5. maybe don't hold reconfig_mutex when try to unregister > >>>> sync_thread, like this. > >>>> > >>>> /* resync has finished, collect result */ > >>>> mddev_unlock(mddev); > >>>> md_unregister_thread(&mddev->sync_thread); > >>>> mddev_lock(mddev); > >>> > >>> As above: While we wait for the sync thread to terminate, wouldn't it > >>> be a problem, if another user space operation takes the mutex? > >> > >> I don't think other places can be blocked while hold mutex, otherwise > >> these places can cause potential deadlock. Please try above two lines > >> change. And perhaps others have better idea. > > > > Yes, this works. No deadlock after >11000 seconds, > > > > (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > Great. I will send a formal patch with your reported-by and tested-by. 
> > Thanks, > Guoqing I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 of the patches that supposedly resolve this were applied to the stable kernels, however, one was omitted due to a regression: md: don't unregister sync_thread with reconfig_mutex held (upstream commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) I don't see any follow-up on the thread from June 8th 2022 asking for this patch to be dropped from all stable kernels since it caused a regression. The patch doesn't appear to be present in the current mainline kernel (6.3-rc2) either. So I assume this issue is still present there, or it was resolved differently and I just can't find the commit/patch. I can induce the issue by using Donald's script above which will eventually result in hangs: ... 147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds. [147948.504624] Tainted: P OE 5.4.229-esos.prod #1 [147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [147948.504625] md_test_2.sh D 0 68033 1 0x00000004 [147948.504627] Call Trace: [147948.504634] __schedule+0x4ab/0x4f3 [147948.504637] ? usleep_range+0x7a/0x7a [147948.504638] schedule+0x67/0x81 [147948.504639] schedule_timeout+0x2c/0xe5 [147948.504643] ? do_raw_spin_lock+0x2b/0x52 [147948.504644] __wait_for_common+0xc4/0x13a [147948.504647] ? wake_up_q+0x40/0x40 [147948.504649] kthread_stop+0x9a/0x117 [147948.504653] md_unregister_thread+0x43/0x4d [147948.504655] md_reap_sync_thread+0x1c/0x1d5 [147948.504657] action_store+0xc9/0x284 [147948.504658] md_attr_store+0x9f/0xb8 [147948.504661] kernfs_fop_write+0x10a/0x14c [147948.504664] vfs_write+0xa0/0xdd [147948.504666] ksys_write+0x71/0xba [147948.504668] do_syscall_64+0x52/0x60 [147948.504671] entry_SYSCALL_64_after_hwframe+0x5c/0xc1 ... [147948.504748] INFO: task md120_resync:135315 blocked for more than 122 seconds. 
[147948.504749] Tainted: P OE 5.4.229-esos.prod #1 [147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [147948.504749] md120_resync D 0 135315 2 0x80004000 [147948.504750] Call Trace: [147948.504752] __schedule+0x4ab/0x4f3 [147948.504754] ? printk+0x53/0x6a [147948.504755] schedule+0x67/0x81 [147948.504756] md_do_sync+0xae7/0xdd9 [147948.504758] ? remove_wait_queue+0x41/0x41 [147948.504759] md_thread+0x128/0x151 [147948.504761] ? _raw_spin_lock_irqsave+0x31/0x5d [147948.504762] ? md_start_sync+0xdc/0xdc [147948.504763] kthread+0xe4/0xe9 [147948.504764] ? kthread_flush_worker+0x70/0x70 [147948.504765] ret_from_fork+0x35/0x40 ... This happens on 'raid6' MD RAID arrays that initially have sync_action==resync. Any guidance would be greatly appreciated. --Marc ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-14 13:25 ` Marc Smith @ 2023-03-14 13:55 ` Guoqing Jiang 2023-03-14 14:45 ` Marc Smith 2023-03-15 3:02 ` Yu Kuai 2023-03-15 7:52 ` Donald Buczek 1 sibling, 2 replies; 49+ messages in thread From: Guoqing Jiang @ 2023-03-14 13:55 UTC (permalink / raw) To: Marc Smith Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On 3/14/23 21:25, Marc Smith wrote: > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > <guoqing.jiang@cloud.ionos.com> wrote: >> Hi Donald, >> >> On 2/8/21 19:41, Donald Buczek wrote: >>> Dear Guoqing, >>> >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>>> >>>> On 2/8/21 12:38, Donald Buczek wrote: >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister >>>>>> sync_thread, like this. >>>>>> >>>>>> /* resync has finished, collect result */ >>>>>> mddev_unlock(mddev); >>>>>> md_unregister_thread(&mddev->sync_thread); >>>>>> mddev_lock(mddev); >>>>> As above: While we wait for the sync thread to terminate, wouldn't it >>>>> be a problem, if another user space operation takes the mutex? >>>> I don't think other places can be blocked while hold mutex, otherwise >>>> these places can cause potential deadlock. Please try above two lines >>>> change. And perhaps others have better idea. >>> Yes, this works. No deadlock after >11000 seconds, >>> >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) >> Great. I will send a formal patch with your reported-by and tested-by. 
>> >> Thanks, >> Guoqing > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > of the patches that supposedly resolve this were applied to the stable > kernels, however, one was omitted due to a regression: > md: don't unregister sync_thread with reconfig_mutex held (upstream > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > I don't see any follow-up on the thread from June 8th 2022 asking for > this patch to be dropped from all stable kernels since it caused a > regression. > > The patch doesn't appear to be present in the current mainline kernel > (6.3-rc2) either. So I assume this issue is still present there, or it > was resolved differently and I just can't find the commit/patch. It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap sync_thread in action_store". Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-14 13:55 ` Guoqing Jiang @ 2023-03-14 14:45 ` Marc Smith 2023-03-16 15:25 ` Marc Smith 2023-03-15 3:02 ` Yu Kuai 1 sibling, 1 reply; 49+ messages in thread From: Marc Smith @ 2023-03-14 14:45 UTC (permalink / raw) To: Guoqing Jiang Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > On 3/14/23 21:25, Marc Smith wrote: > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > <guoqing.jiang@cloud.ionos.com> wrote: > >> Hi Donald, > >> > >> On 2/8/21 19:41, Donald Buczek wrote: > >>> Dear Guoqing, > >>> > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > >>>> > >>>> On 2/8/21 12:38, Donald Buczek wrote: > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > >>>>>> sync_thread, like this. > >>>>>> > >>>>>> /* resync has finished, collect result */ > >>>>>> mddev_unlock(mddev); > >>>>>> md_unregister_thread(&mddev->sync_thread); > >>>>>> mddev_lock(mddev); > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > >>>>> be a problem, if another user space operation takes the mutex? > >>>> I don't think other places can be blocked while hold mutex, otherwise > >>>> these places can cause potential deadlock. Please try above two lines > >>>> change. And perhaps others have better idea. > >>> Yes, this works. No deadlock after >11000 seconds, > >>> > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > >> Great. I will send a formal patch with your reported-by and tested-by. 
> >> > >> Thanks, > >> Guoqing > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > of the patches that supposedly resolve this were applied to the stable > > kernels, however, one was omitted due to a regression: > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > this patch to be dropped from all stable kernels since it caused a > > regression. > > > > The patch doesn't appear to be present in the current mainline kernel > > (6.3-rc2) either. So I assume this issue is still present there, or it > > was resolved differently and I just can't find the commit/patch. > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > sync_thread in action_store". Okay, let me try applying that patch... it does not appear to be present in my 5.4.229 kernel source. Thanks. --Marc > > Thanks, > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-14 14:45 ` Marc Smith @ 2023-03-16 15:25 ` Marc Smith 2023-03-29 0:01 ` Song Liu 0 siblings, 1 reply; 49+ messages in thread From: Marc Smith @ 2023-03-16 15:25 UTC (permalink / raw) To: Guoqing Jiang Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote: > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > > > > > On 3/14/23 21:25, Marc Smith wrote: > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > > <guoqing.jiang@cloud.ionos.com> wrote: > > >> Hi Donald, > > >> > > >> On 2/8/21 19:41, Donald Buczek wrote: > > >>> Dear Guoqing, > > >>> > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > > >>>> > > >>>> On 2/8/21 12:38, Donald Buczek wrote: > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > > >>>>>> sync_thread, like this. > > >>>>>> > > >>>>>> /* resync has finished, collect result */ > > >>>>>> mddev_unlock(mddev); > > >>>>>> md_unregister_thread(&mddev->sync_thread); > > >>>>>> mddev_lock(mddev); > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > > >>>>> be a problem, if another user space operation takes the mutex? > > >>>> I don't think other places can be blocked while hold mutex, otherwise > > >>>> these places can cause potential deadlock. Please try above two lines > > >>>> change. And perhaps others have better idea. > > >>> Yes, this works. No deadlock after >11000 seconds, > > >>> > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > >> Great. I will send a formal patch with your reported-by and tested-by. 
> > >> > > >> Thanks, > > >> Guoqing > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > > of the patches that supposedly resolve this were applied to the stable > > > kernels, however, one was omitted due to a regression: > > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > > this patch to be dropped from all stable kernels since it caused a > > > regression. > > > > > > The patch doesn't appear to be present in the current mainline kernel > > > (6.3-rc2) either. So I assume this issue is still present there, or it > > > was resolved differently and I just can't find the commit/patch. > > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > > sync_thread in action_store". > > Okay, let me try applying that patch... it does not appear to be > present in my 5.4.229 kernel source. Thanks. Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap sync_thread in action_store"' patch on top of vanilla 5.4.229 source appears to fix the problem for me -- I can't reproduce the issue with the script, and it's been running for >24 hours now. (Previously I was able to induce the issue within a matter of minutes.) > > --Marc > > > > > > Thanks, > > Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-16 15:25 ` Marc Smith @ 2023-03-29 0:01 ` Song Liu 2023-08-22 21:16 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Song Liu @ 2023-03-29 0:01 UTC (permalink / raw) To: Marc Smith, Yu Kuai Cc: Guoqing Jiang, Donald Buczek, linux-raid, Linux Kernel Mailing List, it+raid On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <msmith626@gmail.com> wrote: > > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote: > > > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > > > > > > > > > On 3/14/23 21:25, Marc Smith wrote: > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > > > <guoqing.jiang@cloud.ionos.com> wrote: > > > >> Hi Donald, > > > >> > > > >> On 2/8/21 19:41, Donald Buczek wrote: > > > >>> Dear Guoqing, > > > >>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > > > >>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > > > >>>>>> sync_thread, like this. > > > >>>>>> > > > >>>>>> /* resync has finished, collect result */ > > > >>>>>> mddev_unlock(mddev); > > > >>>>>> md_unregister_thread(&mddev->sync_thread); > > > >>>>>> mddev_lock(mddev); > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > > > >>>>> be a problem, if another user space operation takes the mutex? > > > >>>> I don't think other places can be blocked while hold mutex, otherwise > > > >>>> these places can cause potential deadlock. Please try above two lines > > > >>>> change. And perhaps others have better idea. > > > >>> Yes, this works. No deadlock after >11000 seconds, > > > >>> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > > >> Great. I will send a formal patch with your reported-by and tested-by. 
> > > >> > > >> Thanks, > > >> Guoqing > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > > of the patches that supposedly resolve this were applied to the stable > > > kernels, however, one was omitted due to a regression: > > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > > this patch to be dropped from all stable kernels since it caused a > > > regression. > > > > > > The patch doesn't appear to be present in the current mainline kernel > > > (6.3-rc2) either. So I assume this issue is still present there, or it > > > was resolved differently and I just can't find the commit/patch. > > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > > > sync_thread in action_store". > > > > Okay, let me try applying that patch... it does not appear to be > > > present in my 5.4.229 kernel source. Thanks. > > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap > > sync_thread in action_store"' patch on top of vanilla 5.4.229 source > > appears to fix the problem for me -- I can't reproduce the issue with > > the script, and it's been running for >24 hours now. (Previously I was > > able to induce the issue within a matter of minutes.) Hi Marc, Could you please run your reproducer on the md-tmp branch? https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp This contains a different version of the fix by Yu Kuai. Thanks, Song ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-03-29 0:01 ` Song Liu @ 2023-08-22 21:16 ` Dragan Stancevic 2023-08-23 1:22 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-08-22 21:16 UTC (permalink / raw) To: song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yukuai1 On Tue, 3/28/23 17:01 Song Liu wrote: > On Thu, Mar 16, 2023 at 8:25 AM Marc Smith <msmith626@gmail.com> wrote: > > > > On Tue, Mar 14, 2023 at 10:45 AM Marc Smith <msmith626@gmail.com> wrote: > > > > > > On Tue, Mar 14, 2023 at 9:55 AM Guoqing Jiang <guoqing.jiang@linux.dev> wrote: > > > > > > > > > > > > On 3/14/23 21:25, Marc Smith wrote: > > > > > On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang > > > > > <guoqing.jiang@cloud.ionos.com> wrote: > > > > >> Hi Donald, > > > > >> > > > > >> On 2/8/21 19:41, Donald Buczek wrote: > > > > >>> Dear Guoqing, > > > > >>> > > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: > > > > >>>> > > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: > > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister > > > > >>>>>> sync_thread, like this. > > > > >>>>>> > > > > >>>>>> /* resync has finished, collect result */ > > > > >>>>>> mddev_unlock(mddev); > > > > >>>>>> md_unregister_thread(&mddev->sync_thread); > > > > >>>>>> mddev_lock(mddev); > > > > >>>>> As above: While we wait for the sync thread to terminate, wouldn't it > > > > >>>>> be a problem, if another user space operation takes the mutex? > > > > >>>> I don't think other places can be blocked while hold mutex, otherwise > > > > >>>> these places can cause potential deadlock. Please try above two lines > > > > >>>> change. And perhaps others have better idea. > > > > >>> Yes, this works. No deadlock after >11000 seconds, > > > > >>> > > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, > > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) > > > > >> Great. I will send a formal patch with your reported-by and tested-by. > > > > >> > > > > >> Thanks, > > > > >> Guoqing > > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 > > > > > of the patches that supposedly resolve this were applied to the stable > > > > > kernels, however, one was omitted due to a regression: > > > > > md: don't unregister sync_thread with reconfig_mutex held (upstream > > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > > > > > > > > > > I don't see any follow-up on the thread from June 8th 2022 asking for > > > > > this patch to be dropped from all stable kernels since it caused a > > > > > regression. > > > > > > > > > > The patch doesn't appear to be present in the current mainline kernel > > > > > (6.3-rc2) either. So I assume this issue is still present there, or it > > > > > was resolved differently and I just can't find the commit/patch. > > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > > > > sync_thread in action_store". > > > Okay, let me try applying that patch... it does not appear to be > > > present in my 5.4.229 kernel source. Thanks. > > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap > > sync_thread in action_store"' patch on top of vanilla 5.4.229 source > > appears to fix the problem for me -- I can't reproduce the issue with > > the script, and it's been running for >24 hours now. (Previously I was > > able to induce the issue within a matter of minutes.) > > Hi Marc, > > Could you please run your reproducer on the md-tmp branch? > > https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp > > This contains a different version of the fix by Yu Kuai. > > Thanks, > Song
Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me. But I did pull the changes from the md-tmp branch you are referring to: https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp I was not totally clear on which change exactly to pull, but I pulled the following changes: 2023-03-28 md: enhance checking in md_check_recovery() md-tmp Yu Kuai 1 -7/+15 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai 1 -1/+1 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai 2 -0/+8 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu Kuai 1 -16/+45 I used to be able to reproduce the lockup within minutes, but with those changes the test system has been running for more than 120 hours. When you said a "different fix", can you confirm that I grabbed the right changes and that I need all 5 of them? And a second question: has this fix been submitted upstream yet? If so, in which kernel version? Thank you ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-22 21:16 ` Dragan Stancevic @ 2023-08-23 1:22 ` Yu Kuai 2023-08-23 15:33 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-08-23 1:22 UTC (permalink / raw) To: Dragan Stancevic, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yukuai1, yukuai (C), yangerkun Hi, 在 2023/08/23 5:16, Dragan Stancevic 写道: > On Tue, 3/28/23 17:01 Song Liu wrote: >> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith <msmith626@gmail.com> >> wr= >> ote: >> > >> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >> <msmith626@gmail.com>= >> wrote: >> > > >> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >> <guoqing.jiang@li= >> nux.dev> wrote: >> > > > >> > > > >> > > > >> > > > On 3/14/23 21:25, Marc Smith wrote: >> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >> > > > >> Hi Donald, >> > > > >> >> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >> > > > >>> Dear Guoqing, >> > > > >>> >> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >> > > > >>>> >> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister >> > > > >>>>>> sync_thread, like this. >> > > > >>>>>> >> > > > >>>>>> /* resync has finished, collect result */ >> > > > >>>>>> mddev_unlock(mddev); >> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >> > > > >>>>>> mddev_lock(mddev); >> > > > >>>>> As above: While we wait for the sync thread to terminate, >> would= >> n't it >> > > > >>>>> be a problem, if another user space operation takes the mutex? >> > > > >>>> I don't think other places can be blocked while hold mutex, >> othe= >> rwise >> > > > >>>> these places can cause potential deadlock. Please try above >> two = >> lines >> > > > >>>> change. And perhaps others have better idea. >> > > > >>> Yes, this works. 
No deadlock after >11000 seconds, >> > > > >>> >> > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37, >> 434, 12= >> 65, >> > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 ) >> > > > >> Great. I will send a formal patch with your reported-by and >> tested= >> -by. >> > > > >> >> > > > >> Thanks, >> > > > >> Guoqing >> > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks >> like 1/= >> 2 >> > > > > of the patches that supposedly resolve this were applied to the >> sta= >> ble >> > > > > kernels, however, one was omitted due to a regression: >> > > > > md: don't unregister sync_thread with reconfig_mutex held >> (upstream >> > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) >> > > > > >> > > > > I don't see any follow-up on the thread from June 8th 2022 >> asking f= >> or >> > > > > this patch to be dropped from all stable kernels since it caused a >> > > > > regression. >> > > > > >> > > > > The patch doesn't appear to be present in the current mainline >> kern= >> el >> > > > > (6.3-rc2) either. So I assume this issue is still present >> there, or= >> it >> > > > > was resolved differently and I just can't find the commit/patch. >> > > > >> > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before >> rea= >> p >> > > > sync_thread in action_store". >> > > >> > > Okay, let me try applying that patch... it does not appear to be >> > > present in my 5.4.229 kernel source. Thanks. >> > >> > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap >> > sync_thread in action_store"' patch on top of vanilla 5.4.229 source >> > appears to fix the problem for me -- I can't reproduce the issue with >> > the script, and it's been running for >24 hours now. (Previously I was >> > able to induce the issue within a matter of minutes.) >> >> Hi Marc, >> >> Could you please run your reproducer on the md-tmp branch? 
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp >> >> This contains a different version of the fix by Yu Kuai. >> >> Thanks, >> Song >> > > Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change > "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me. > > But I did pull the changes from the md-tmp branch you are referring to: > https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-tmp > > I was not totally clear on which changes exactly to pull, but I pulled > the following changes: > 2023-03-28 md: enhance checking in md_check_recovery() Yu Kuai 1 -7/+15 > 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread() Yu Kuai 1 -1/+1 > 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22 > 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store() Yu Kuai 2 -0/+8 > 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu Kuai 1 -16/+45 > > I used to be able to reproduce the lockup within minutes, but with those > changes the test system has been running for more than 120 hours. > > When you said a "different fix", can you confirm that I grabbed the right > changes and that I need all 5 of them? Yes, you grabbed the right changes, and these patches are merged to linux-next as well. > > And second question was, has this fix been submitted upstream yet? > If so which kernel version? This fix is currently in linux-next, and will be applied to v6.6-rc1 soon. Thanks, Kuai > > Thank you > ^ permalink raw reply [flat|nested] 49+ messages in thread
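The reproducer traded back and forth in this thread boils down to toggling the array's sync_action between "check" and "idle" under load until the transition hangs. A minimal sketch of that loop follows; the sysfs path and cycle count are placeholders, and when no writable md array is present it falls back to a scratch file so the loop itself stays runnable (in that mode it obviously cannot trigger the bug):

```shell
#!/bin/sh
# Toggle "check"/"idle" on an md array to race sync-thread teardown.
# MD_SYSFS is a placeholder -- point it at e.g. /sys/block/md1/md on a
# real test machine (as root, ideally with I/O running on the array).
MD_SYSFS=${MD_SYSFS:-/sys/block/md1/md}
if [ ! -w "$MD_SYSFS/sync_action" ]; then
    # No writable md array here: use a scratch file so the loop logic
    # can still be exercised.
    MD_SYSFS=$(mktemp -d)
    : > "$MD_SYSFS/sync_action"
fi

i=0
while [ "$i" -lt 20 ]; do
    echo check > "$MD_SYSFS/sync_action"   # start a data-check
    sleep 0.1                              # let the sync thread get going
    echo idle > "$MD_SYSFS/sync_action"    # on affected kernels this write can hang
    i=$((i + 1))
done
echo "completed $i check/idle cycles"
```

On an affected kernel the `echo idle` write blocks in action_store() and the mdX_raid6 thread spins at 100% CPU, matching the reports above; on a fixed kernel the loop completes.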
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-23 1:22 ` Yu Kuai @ 2023-08-23 15:33 ` Dragan Stancevic 2023-08-24 1:18 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-08-23 15:33 UTC (permalink / raw) To: Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yukuai (C), yangerkun Hi Kuai- On 8/22/23 20:22, Yu Kuai wrote: > Hi, > > 在 2023/08/23 5:16, Dragan Stancevic 写道: >> On Tue, 3/28/23 17:01 Song Liu wrote: >>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith <msmith626@gmail.com> >>> wr= >>> ote: >>> > >>> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >>> <msmith626@gmail.com>= >>> wrote: >>> > > >>> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >>> <guoqing.jiang@li= >>> nux.dev> wrote: >>> > > > >>> > > > >>> > > > >>> > > > On 3/14/23 21:25, Marc Smith wrote: >>> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >>> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >>> > > > >> Hi Donald, >>> > > > >> >>> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >>> > > > >>> Dear Guoqing, >>> > > > >>> >>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>> > > > >>>> >>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister >>> > > > >>>>>> sync_thread, like this. >>> > > > >>>>>> >>> > > > >>>>>> /* resync has finished, collect result */ >>> > > > >>>>>> mddev_unlock(mddev); >>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >>> > > > >>>>>> mddev_lock(mddev); >>> > > > >>>>> As above: While we wait for the sync thread to terminate, >>> would= >>> n't it >>> > > > >>>>> be a problem, if another user space operation takes the >>> mutex? >>> > > > >>>> I don't think other places can be blocked while hold mutex, >>> othe= >>> rwise >>> > > > >>>> these places can cause potential deadlock. Please try above >>> two = >>> lines >>> > > > >>>> change. 
And perhaps others have better idea. >>> [...]
>>> >>> Hi Marc, >>> >>> Could you please run your reproducer on the md-tmp branch? >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp >>> >>> This contains a different version of the fix by Yu Kuai. >>> >>> Thanks, >>> Song >>> >> >> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. >> The change >> "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue >> for me. >> >> But I did pull the changes from the md-tmp branch you are refering: >> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp >> >> I was not totally clear on which change exactly to pull, but I pulled >> the following changes: >> 2023-03-28 md: enhance checking in md_check_recovery()md-tmp Yu >> Kuai 1 -7/+15 >> 2023-03-28 md: wake up 'resync_wait' at last in >> md_reap_sync_thread() Yu Kuai 1 -1/+1 >> 2023-03-28 md: refactor idle/frozen_sync_thread() Yu Kuai 2 -4/+22 >> 2023-03-28 md: add a mutex to synchronize idle and frozen in >> action_store() Yu Kuai 2 -0/+8 >> 2023-03-28 md: refactor action_store() for 'idle' and 'frozen' Yu >> Kuai 1 -16/+45 >> >> I used to be able to reproduce the lockup within minutes, but with those >> changes the test system has been running for more than 120 hours. >> >> When you said a "different fix", can you confirm that I grabbed the right >> changes and that I need all 5 of them. > > Yes, you grabbed the right changes, and these patches is merged to > linux-next as well. >> >> And second question was, has this fix been submitted upstream yet? >> If so which kernel version? > > This fix is currently in linux-next, and will be applied to v6.6-rc1 > soon. Thank you, that is great news. I'd like to see this change backported to 5.10 and 6.1, do you have any plans of backporting to any of the previous kernels? If not, I would like to try to get your changes into 5.10 and 6.1 if Greg will accept them. 
Four out of five of your changes were a straight cherry-pick into 5.10, one needed a minor conflict resolution. But I can definitely confirm that your changes fix the lockup issue on 5.10. I am now switching to 6.1 and will test the changes there too. Thanks -- Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-23 15:33 ` Dragan Stancevic @ 2023-08-24 1:18 ` Yu Kuai 2023-08-28 20:32 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-08-24 1:18 UTC (permalink / raw) To: Dragan Stancevic, Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C) Hi, 在 2023/08/23 23:33, Dragan Stancevic 写道: > Hi Kuai- > > On 8/22/23 20:22, Yu Kuai wrote: >> Hi, >> >> 在 2023/08/23 5:16, Dragan Stancevic 写道: >>> On Tue, 3/28/23 17:01 Song Liu wrote: >>>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith >>>> <msmith626@gmail.com> >>>> wr= >>>> ote: >>>> > >>>> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >>>> <msmith626@gmail.com>= >>>> wrote: >>>> > > >>>> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >>>> <guoqing.jiang@li= >>>> nux.dev> wrote: >>>> > > > >>>> > > > >>>> > > > >>>> > > > On 3/14/23 21:25, Marc Smith wrote: >>>> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >>>> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >>>> > > > >> Hi Donald, >>>> > > > >> >>>> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >>>> > > > >>> Dear Guoqing, >>>> > > > >>> >>>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>>> > > > >>>> >>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >>>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to >>>> unregister >>>> > > > >>>>>> sync_thread, like this. >>>> > > > >>>>>> >>>> > > > >>>>>> /* resync has finished, collect result */ >>>> > > > >>>>>> mddev_unlock(mddev); >>>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >>>> > > > >>>>>> mddev_lock(mddev); >>>> > > > >>>>> As above: While we wait for the sync thread to terminate, >>>> would= >>>> n't it >>>> > > > >>>>> be a problem, if another user space operation takes >>>> the mutex? 
>>>> [...]
>>> If so which kernel version? >> >> This fix is currently in linux-next, and will be applied to v6.6-rc1 >> soon. > > Thank you, that is great news. I'd like to see this change backported to > 5.10 and 6.1, do you have any plans of backporting to any of the > previous kernels? > > If not, I would like to try to get your changes into 5.10 and 6.1 if > Greg will accept them. > I don't have plans yet, so feel free to do this, I guess these patches won't be picked automatically due to the conflict. Feel free to ask if you meet any problems. Thanks, Kuai > > Four out of five of your changes were a straight cherry-pick into 5.10, > one needed a minor conflict resolution. But I can definitely confirm > that your changes fix the lockup issue on 5.10 > > I am now switching to 6.1 and will test the changes there too. > > > Thanks > > > -- > Peace can only come as a natural consequence > of universal enlightenment -Dr. Nikola Tesla > > > . > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-24 1:18 ` Yu Kuai @ 2023-08-28 20:32 ` Dragan Stancevic 2023-08-30 1:36 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-08-28 20:32 UTC (permalink / raw) To: Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C) Hi Kuai, On 8/23/23 20:18, Yu Kuai wrote: > Hi, > > 在 2023/08/23 23:33, Dragan Stancevic 写道: >> Hi Kuai- >> >> On 8/22/23 20:22, Yu Kuai wrote: >>> Hi, >>> >>> 在 2023/08/23 5:16, Dragan Stancevic 写道: >>>> On Tue, 3/28/23 17:01 Song Liu wrote: >>>>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith >>>>> <msmith626@gmail.com> >>>>> wr= >>>>> ote: >>>>> > >>>>> > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith >>>>> <msmith626@gmail.com>= >>>>> wrote: >>>>> > > >>>>> > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang >>>>> <guoqing.jiang@li= >>>>> nux.dev> wrote: >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > On 3/14/23 21:25, Marc Smith wrote: >>>>> > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang >>>>> > > > > <guoqing.jiang@cloud.ionos.com> wrote: >>>>> > > > >> Hi Donald, >>>>> > > > >> >>>>> > > > >> On 2/8/21 19:41, Donald Buczek wrote: >>>>> > > > >>> Dear Guoqing, >>>>> > > > >>> >>>>> > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote: >>>>> > > > >>>> >>>>> > > > >>>> On 2/8/21 12:38, Donald Buczek wrote: >>>>> > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to >>>>> unregister >>>>> > > > >>>>>> sync_thread, like this. >>>>> > > > >>>>>> >>>>> > > > >>>>>> /* resync has finished, collect result */ >>>>> > > > >>>>>> mddev_unlock(mddev); >>>>> > > > >>>>>> md_unregister_thread(&mddev->sync_thread); >>>>> > > > >>>>>> mddev_lock(mddev); >>>>> > > > >>>>> As above: While we wait for the sync thread to >>>>> terminate, >>>>> would= >>>>> n't it >>>>> > > > >>>>> be a problem, if another user space operation takes >>>>> the mutex? 
>>>>> [...]
>>>> >>>> And second question was, has this fix been submitted upstream yet? >>>> If so which kernel version? >>> >>> This fix is currently in linux-next, and will be applied to v6.6-rc1 >>> soon. >> >> Thank you, that is great news. I'd like to see this change backported >> to 5.10 and 6.1, do you have any plans of backporting to any of the >> previous kernels? >> >> If not, I would like to try to get your changes into 5.10 and 6.1 if >> Greg will accept them. >> > > I don't have plans yet, so feel free to do this, I guess these patches > won't be picked automatically due to the conflict. Feel free to ask if > you meet any problems. Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it. It seems that 6.1 has some other code that prevents this from happening. On 5.10 I can reproduce it within minutes to an hour. > > Thanks, > Kuai > >> >> Four out of five of your changes were a straight cherry-pick into >> 5.10, one needed a minor conflict resolution. But I can definitely >> confirm that your changes fix the lockup issue on 5.10 >> >> I am now switching to 6.1 and will test the changes there too. >> >> >> Thanks >> >> >> -- >> Peace can only come as a natural consequence >> of universal enlightenment -Dr. Nikola Tesla >> >> >> . >> > -- Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-28 20:32 ` Dragan Stancevic @ 2023-08-30 1:36 ` Yu Kuai 2023-09-05 3:50 ` Yu Kuai 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-08-30 1:36 UTC (permalink / raw) To: Dragan Stancevic, Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C) Hi, On 2023/08/29 4:32, Dragan Stancevic wrote: > Just a followup on 6.1 testing. I tried reproducing this problem for 5 > days with 6.1.42 kernel without your patches and I was not able to > reproduce it. > > It seems that 6.1 has some other code that prevents this from happening. > I see that there are lots of patches for raid456 between 5.10 and 6.1; however, I remember that I used to reproduce the deadlock after 6.1, and it's true that it's not easy to reproduce, see below: https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ My guess is that 6.1 is harder to reproduce than 5.10 due to some changes inside raid456. By the way, raid10 had a similar deadlock, and it can be fixed the same way, so it makes sense to backport these patches. https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com Thanks, Kuai > On 5.10 I can reproduce it within minutes to an hour. > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-08-30 1:36 ` Yu Kuai @ 2023-09-05 3:50 ` Yu Kuai 2023-09-05 13:54 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Yu Kuai @ 2023-09-05 3:50 UTC (permalink / raw) To: Yu Kuai, Dragan Stancevic, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun Hi, On 2023/08/30 9:36, Yu Kuai wrote: > Hi, > > On 2023/08/29 4:32, Dragan Stancevic wrote: > >> Just a followup on 6.1 testing. I tried reproducing this problem for 5 >> days with 6.1.42 kernel without your patches and I was not able to >> reproduce it. oops, I forgot that you need to backport this patch first to reproduce this problem: https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/ The patch fixes the deadlock as well, but it introduces some regressions. Thanks, Kuai >> >> It seems that 6.1 has some other code that prevents this from happening. >> > > I see that there are lots of patches for raid456 between 5.10 and 6.1; > however, I remember that I used to reproduce the deadlock after 6.1, and > it's true that it's not easy to reproduce, see below: > > https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ > > > My guess is that 6.1 is harder to reproduce than 5.10 due to some > changes inside raid456. > > By the way, raid10 had a similar deadlock, and it can be fixed the same > way, so it makes sense to backport these patches. > > https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com > > Thanks, > Kuai > > >> On 5.10 I can reproduce it within minutes to an hour. >> > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-09-05 3:50 ` Yu Kuai @ 2023-09-05 13:54 ` Dragan Stancevic 2023-09-13 9:08 ` Donald Buczek 0 siblings, 1 reply; 49+ messages in thread From: Dragan Stancevic @ 2023-09-05 13:54 UTC (permalink / raw) To: Yu Kuai, song Cc: buczek, guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun On 9/4/23 22:50, Yu Kuai wrote: > Hi, > > 在 2023/08/30 9:36, Yu Kuai 写道: >> Hi, >> >> 在 2023/08/29 4:32, Dragan Stancevic 写道: >> >>> Just a followup on 6.1 testing. I tried reproducing this problem for >>> 5 days with 6.1.42 kernel without your patches and I was not able to >>> reproduce it. > > oops, I forgot that you need to backport this patch first to reporduce > this problem: > > https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/ > > The patch fix the deadlock as well, but it introduce some regressions. Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :) > > Thanks, > Kuai > >>> >>> It seems that 6.1 has some other code that prevents this from happening. >>> >> >> I see that there are lots of patches for raid456 between 5.10 and 6.1, >> however, I remember that I used to reporduce the deadlock after 6.1, and >> it's true it's not easy to reporduce, see below: >> >> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ >> >> My guess is that 6.1 is harder to reporduce than 5.10 due to some >> changes inside raid456. >> >> By the way, raid10 had a similiar deadlock, and can be fixed the same >> way, so it make sense to backport these patches. >> >> https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com >> >> Thanks, >> Kuai >> >> >>> On 5.10 I can reproduce it within minutes to an hour. >>> >> >> . 
>> > -- Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition 2023-09-05 13:54 ` Dragan Stancevic @ 2023-09-13 9:08 ` Donald Buczek 2023-09-13 14:16 ` Dragan Stancevic 0 siblings, 1 reply; 49+ messages in thread From: Donald Buczek @ 2023-09-13 9:08 UTC (permalink / raw) To: Dragan Stancevic, Yu Kuai, song Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun On 9/5/23 3:54 PM, Dragan Stancevic wrote: > On 9/4/23 22:50, Yu Kuai wrote: >> Hi, >> >> On 2023/08/30 9:36, Yu Kuai wrote: >>> Hi, >>> >>> On 2023/08/29 4:32, Dragan Stancevic wrote: >>> >>>> Just a followup on 6.1 testing. I tried reproducing this problem for 5 days with 6.1.42 kernel without your patches and I was not able to reproduce it. >> >> oops, I forgot that you need to backport this patch first to reproduce >> this problem: >> >> https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/ >> >> The patch fixes the deadlock as well, but it introduces some regressions. We've just got an unplanned lock-up on a "check" to "idle" transition with 6.1.52 after a few hours on a backup server. For the last 2 1/2 years we used the patch I originally proposed with multiple kernel versions [1]. But this no longer seems to be valid, or maybe it's even destructive in combination with the other changes. But I totally lost track of the further development. As I understand it, there are patches queued up in mainline which should fix the problem and might go into 6.1, too, but they have not landed there yet? Can anyone give me exact references to the patches I'd need to apply to 6.1.52, so that I could probably fix my problem and also test those patches for you on production systems with a load which tends to run into that problem easily?
Thanks Donald [1]: https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de/ > Ha, jinx :) I was about to email you that I isolated that change with the testing over the weekend that made it more difficult to reproduce in 6.1 and that the original change must be reverted :) > > > >> >> Thanks, >> Kuai >> >>>> >>>> It seems that 6.1 has some other code that prevents this from happening. >>>> >>> >>> I see that there are lots of patches for raid456 between 5.10 and 6.1, >>> however, I remember that I used to reporduce the deadlock after 6.1, and >>> it's true it's not easy to reporduce, see below: >>> >>> https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/ >>> >>> My guess is that 6.1 is harder to reporduce than 5.10 due to some >>> changes inside raid456. >>> >>> By the way, raid10 had a similiar deadlock, and can be fixed the same >>> way, so it make sense to backport these patches. >>> >>> https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com >>> >>> Thanks, >>> Kuai >>> >>> >>>> On 5.10 I can reproduce it within minutes to an hour. >>>> >>> >>> . >>> >> > -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Dragan Stancevic @ 2023-09-13 14:16 UTC
To: Donald Buczek, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

Hi Donald-

On 9/13/23 04:08, Donald Buczek wrote:
> On 9/5/23 3:54 PM, Dragan Stancevic wrote:
>> On 9/4/23 22:50, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2023/08/30 9:36, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2023/08/29 4:32, Dragan Stancevic wrote:
>>>>
>>>>> Just a followup on 6.1 testing. I tried reproducing this problem for
>>>>> 5 days with the 6.1.42 kernel without your patches and I was not able
>>>>> to reproduce it.
>>>
>>> Oops, I forgot that you need to backport this patch first to reproduce
>>> this problem:
>>>
>>> https://lore.kernel.org/all/20230529132037.2124527-2-yukuai1@huaweicloud.com/
>>>
>>> The patch fixes the deadlock as well, but it introduces some
>>> regressions.
>
> We've just got an unplanned lock-up on a "check" to "idle" transition
> with 6.1.52 after a few hours on a backup server. For the last 2 1/2
> years we used the patch I originally proposed with multiple kernel
> versions [1]. But this no longer seems to be valid, or maybe it's even
> destructive in combination with the other changes.
>
> But I totally lost track of the further development. As I understood it,
> there are patches queued up in mainline which should fix the problem and
> which might go into 6.1, too, but have not landed there yet?
>
> Can anyone give me exact references to the patches I'd need to apply to
> 6.1.52, so that I could probably fix my problem and also test those
> patches for you on production systems with a load which tends to run
> into that problem easily?

Here is a list of changes for 6.1:

e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
f71209b1f21c md: enhance checking in md_check_recovery()
753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

You can get them from the following tree:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

> [1]: https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de/
>
>> Ha, jinx :) [...]

--
Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla
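[Editor's note] Dragan's list above is in git-log order (newest first), so a backport would apply it bottom-up. A minimal sketch that only prints the commands instead of running them — the remote name "linux-next" and the `-x` flag choice are assumptions, not from the mail:

```shell
# Sketch only: print (do not run) the commands that would backport the
# 6.1 list above onto a v6.1.52 branch. Assumption: the list is
# newest-first, so we cherry-pick it in reverse (oldest first).
print_backport_cmds() {
    echo "git fetch linux-next"
    for c in \
        a865b96c513b 64e5e09afc14 6f56f0c4f124 130443d60b1b \
        753260ed0b46 f71209b1f21c e5e9b9cb71a0
    do
        echo "git cherry-pick -x $c"
    done
}

print_backport_cmds
```

Piping the output through `sh` (after reviewing it) would perform the actual backport.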
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-14 6:03 UTC
To: Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

On 9/13/23 16:16, Dragan Stancevic wrote:
> Hi Donald-
>
> On 9/13/23 04:08, Donald Buczek wrote:
>> [...]
>
> Here is a list of changes for 6.1:
>
> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
> f71209b1f21c md: enhance checking in md_check_recovery()
> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"

Thanks!

I've put these patches on v6.1.52. A few hours ago I started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes. It went through ~400 iterations till now. No lock-ups so far.

LGTM!

Donald

buczek@done:~$ dmesg|grep "data-check of RAID array"|wc
    393    2820   18864
buczek@done:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md2 : active raid6 sdc[0] sdo[15] sdn[14] sdm[13] sdl[12] sdk[11] sdj[10] sdi[9] sdh[8] sdg[7] sdf[6] sde[5] sdd[4] sdr[3] sdq[2] sdp[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=========>...........]  check = 47.1% (3681799128/7813894144) finish=671.8min speed=102496K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

md1 : active raid6 sdaa[0] sdz[15] sdy[14] sdx[13] sdw[12] sdv[11] sdu[10] sdt[16] sds[8] sdah[7] sdag[17] sdaf[5] sdae[4] sdad[3] sdac[2] sdab[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [=======>.............]  check = 38.5% (3009484896/7813894144) finish=811.0min speed=98720K/sec
      bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdai[0] sdax[15] sdaw[16] sdav[13] sdau[12] sdat[11] sdas[10] sdar[9] sdaq[8] sdap[7] sdao[6] sdan[17] sdam[4] sdal[3] sdak[2] sdaj[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [========>............]  check = 42.3% (3311789940/7813894144) finish=911.9min speed=82272K/sec
      bitmap: 6/59 pages [24KB], 65536KB chunk

unused devices: <none>

> You can get them from the following tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
>> [...]

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
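[Editor's note] The stress loop Donald describes (idle->check->idle on three arrays every 6 minutes) can be sketched roughly as below. The device names and cycle length come from the mail; the `SYSFS`/`CYCLE_SECONDS` overrides are additions so the functions can be exercised outside a real `/sys`:

```shell
# Sketch of the idle->check->idle stress loop described above.
# Assumptions: devices md0 md1 md2, 6-minute cycle; SYSFS and
# CYCLE_SECONDS are overridable purely for testing purposes.
SYSFS="${SYSFS:-/sys/devices/virtual/block}"
DEVICES="${DEVICES:-md0 md1 md2}"

set_sync_action() {   # $1 = md device, $2 = "check" or "idle"
    echo "$2" > "$SYSFS/$1/md/sync_action"
}

stress_cycle() {      # start a check on every device, wait, stop it again
    for dev in $DEVICES; do set_sync_action "$dev" check; done
    sleep "${CYCLE_SECONDS:-360}"
    for dev in $DEVICES; do set_sync_action "$dev" idle; done
}
```

Running `while :; do stress_cycle; done` as root reproduces the transition pattern; note that writing "idle" interrupts a check that is still running, which is exactly the transition under suspicion here.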
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-17 8:55 UTC
To: Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

On 9/14/23 08:03, Donald Buczek wrote:
> On 9/13/23 16:16, Dragan Stancevic wrote:
>> Hi Donald-
>> [...]
>> Here is a list of changes for 6.1:
>> [...]
>
> Thanks!
>
> I've put these patches on v6.1.52. A few hours ago I started a script
> which transitions the three md-devices of a very active backup server
> through idle->check->idle every 6 minutes. It went through ~400
> iterations till now. No lock-ups so far.

Oh dear, looks like the deadlock problem is _not_ fixed with these patches. We've had a lockup again after ~3 days of operation. Again, the `echo idle > $sys/md/sync_action` is hanging:

# # /proc/70554/task/70554: mdcheck.safe : /bin/bash /usr/bin/mdcheck.safe --continue --duration 06:00
# cat /proc/70554/task/70554/stack

[<0>] action_store+0x17f/0x390
[<0>] md_attr_store+0x83/0xf0
[<0>] kernfs_fop_write_iter+0x117/0x1b0
[<0>] vfs_write+0x2ce/0x400
[<0>] ksys_write+0x5f/0xe0
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

And everything else going to that specific raid (md0) is dead, too. No task is busy-looping.

So as it looks now, we can't go from 5.15.x to 6.1.x as we would like to do. These patches don't fix the problem, and our own patch no longer works with 6.1.

Unfortunately, this happened on a production system which I need to reboot and which is not available for further analysis. We'd need to reproduce the problem on a dedicated machine to really work on it.

Here's some more possibly interesting procfs output and some examples of tasks.

/sys/devices/virtual/block/md0/inflight : 0 3936

# /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
md1 : active raid6 sdae[0] sdad[15] sdac[14] sdab[13] sdaa[12] sdz[11] sdy[10] sdx[9] sdw[8] sdv[7] sdu[6] sdt[5] sds[4] sdah[3] sdag[2] sdaf[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdc[0] sdr[17] sdq[16] sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1]
      109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      [===>.................]  check = 15.9% (1242830396/7813894144) finish=14788.4min speed=7405K/sec
      bitmap: 53/59 pages [212KB], 65536KB chunk

unused devices: <none>

# # /proc/66024/task/66024: md0_resync :
# cat /proc/66024/task/66024/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_sync_request+0x38b/0x3b0
[<0>] md_do_sync.cold+0x40c/0x985
[<0>] md_thread+0xb1/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/939/task/939: md0_raid6 :
# cat /proc/939/task/939/stack

[<0>] md_thread+0x12d/0x160
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/1228/task/1228: xfsaild/md0 :
# cat /proc/1228/task/1228/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] _xfs_buf_ioapply+0x26c/0x420
[<0>] __xfs_buf_submit+0x64/0x1d0
[<0>] xfs_buf_delwri_submit_buffers+0xc5/0x1e0
[<0>] xfsaild+0x2a0/0x880
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49747/task/49747: kworker/24:2+xfs-inodegc/md0 :
# cat /proc/49747/task/49747/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_btree_read_buf_block.constprop.0+0x9a/0xd0
[<0>] xfs_btree_lookup_get_block+0x97/0x170
[<0>] xfs_btree_lookup+0xc4/0x4a0
[<0>] xfs_difree_finobt+0x62/0x250
[<0>] xfs_difree+0x130/0x1c0
[<0>] xfs_ifree+0x86/0x510
[<0>] xfs_inactive_ifree.isra.0+0xa2/0x1c0
[<0>] xfs_inactive+0xf8/0x170
[<0>] xfs_inodegc_worker+0x90/0x140
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/49844/task/49844: kworker/30:3+xfs-sync/md0 :
# cat /proc/49844/task/49844/stack

[<0>] __flush_workqueue+0x10e/0x390
[<0>] xlog_cil_push_now.isra.0+0x25/0x90
[<0>] xlog_cil_force_seq+0x7c/0x240
[<0>] xfs_log_force+0x83/0x240
[<0>] xfs_log_worker+0x3b/0xd0
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/52646/task/52646: kworker/u263:2+xfs-cil/md0 :
# cat /proc/52646/task/52646/stack

[<0>] raid5_get_active_stripe+0x20f/0x4d0
[<0>] raid5_make_request+0x24c/0x1170
[<0>] md_handle_request+0x131/0x220
[<0>] __submit_bio+0x89/0x130
[<0>] submit_bio_noacct_nocheck+0x160/0x360
[<0>] xlog_state_release_iclog+0xf6/0x1d0
[<0>] xlog_write_get_more_iclog_space+0x79/0xf0
[<0>] xlog_write+0x334/0x3b0
[<0>] xlog_cil_push_work+0x501/0x740
[<0>] process_one_work+0x1c7/0x3c0
[<0>] worker_thread+0x4d/0x3c0
[<0>] kthread+0xe7/0x110
[<0>] ret_from_fork+0x22/0x30

# # /proc/52753/task/52753: rm : rm -rf /project/pbackup_gone/data/C8029/home_Cyang/home_Cyang:202306011248:C3019.BEING_DELETED
# cat /proc/52753/task/52753/stack

[<0>] xfs_buf_lock+0x35/0xf0
[<0>] xfs_buf_find_lock+0x45/0xf0
[<0>] xfs_buf_get_map+0x17d/0xa60
[<0>] xfs_buf_read_map+0x52/0x280
[<0>] xfs_trans_read_buf_map+0x115/0x350
[<0>] xfs_read_agi+0x98/0x140
[<0>] xfs_iunlink+0x63/0x1f0
[<0>] xfs_remove+0x280/0x3a0
[<0>] xfs_vn_unlink+0x53/0xa0
[<0>] vfs_rmdir.part.0+0x5e/0x1e0
[<0>] do_rmdir+0x15c/0x1c0
[<0>] __x64_sys_unlinkat+0x4b/0x60
[<0>] do_syscall_64+0x43/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x64/0xce

Best

Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-24 14:35 UTC
To: Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun

On 9/17/23 10:55, Donald Buczek wrote:
> On 9/14/23 08:03, Donald Buczek wrote:
>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>> Hi Donald-
>>> [...]
>>> Here is a list of changes for 6.1:
>>> [...]
>>
>> Thanks!
>>
>> I've put these patches on v6.1.52. A few hours ago I started a script
>> which transitions the three md-devices of a very active backup server
>> through idle->check->idle every 6 minutes. It went through ~400
>> iterations till now. No lock-ups so far.
>
> Oh dear, looks like the deadlock problem is _not_ fixed with these
> patches.

Some more info after another incident:

- We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches)
- The symptoms are not exactly the same as with the original year-old problem. Differences:
  - mdX_raid6 is NOT busy-looping
  - /sys/devices/virtual/block/mdX/md/array_state says "active", not "write pending"
  - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
  - After hours in the deadlock state, the system resumed operation when a script of mine read(!) lots of sysfs files.
- But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs, as does all I/O on the raid.

The fact that we didn't hit the problem for many months on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131.

We'll try to reproduce the problem on a test machine for analysis, but this may take time (vacation imminent, for one...).

But it's not like these patches caused the problem. And maybe they _did_ fix the original problem, as we didn't hit that one.

Best

Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Yu Kuai @ 2023-09-25 1:11 UTC
To: Donald Buczek, Dragan Stancevic, Yu Kuai, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C)

Hi,

On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> [...]
>>
>> Oh dear, looks like the deadlock problem is _not_ fixed with these
>> patches.
>
> Some more info after another incident:
>
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
>   of the above patches)
> - The symptoms are not exactly the same as with the original year-old
>   problem. Differences:
>   - mdX_raid6 is NOT busy-looping
>   - /sys/devices/virtual/block/mdX/md/array_state says "active", not
>     "write pending"
>   - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
>     not resolve the deadlock
>   - After hours in the deadlock state, the system resumed operation when
>     a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle >
>   /sys/devices/virtual/block/mdX/md/sync_action` hangs, as does all I/O
>   on the raid.
>
> The fact that we didn't hit the problem for many months on 5.15.94 might
> hint that it was introduced between 5.15.94 and 5.15.131.
>
> We'll try to reproduce the problem on a test machine for analysis, but
> this may take time (vacation imminent, for one...).
>
> But it's not like these patches caused the problem. And maybe they _did_
> fix the original problem, as we didn't hit that one.

Sorry for the late reply, yes, this looks like a different problem. I'm pretty confident that the original problem is fixed, since echo idle/frozen no longer holds the 'reconfig_mutex' lock to wait for the sync_thread to be done.

I'll check patches between 5.15.94 and 5.15.131.

Thanks,
Kuai

> Best
>
> Donald
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek @ 2023-09-25 9:11 UTC
To: Yu Kuai, Dragan Stancevic, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C)

On 9/25/23 03:11, Yu Kuai wrote:
> Hi,
>
> On 2023/09/24 22:35, Donald Buczek wrote:
>> [...]
>>
>> But it's not like these patches caused the problem. And maybe they
>> _did_ fix the original problem, as we didn't hit that one.
>
> Sorry for the late reply, yes, this looks like a different problem. I'm
> pretty confident that the original problem is fixed, since echo
> idle/frozen no longer holds the 'reconfig_mutex' lock to wait for the
> sync_thread to be done.
>
> I'll check patches between 5.15.94 and 5.15.131.

We've got another event today. Some more information to save you work. I'm sorry, this comes dripping in, but as I said, currently we can't reproduce it and hit it on production machines only, where we have limited time to analyze:

* In the last two events, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` was not even executing. This is not a trigger, but was a random victim when it happened the first time. This deceived me into believing this was some variation of the old problem.

* It's not filesystem-related: yesterday `blkid -o value -s LABEL /dev/md1` was hanging, too, and today, for example, `df`.

* /sys/devices/virtual/block/md0/inflight today was (frozen at) "2 579".

* iotop showed no disk activity (on the raid) at all. Only a single member device had activity from time to time (usually after ~30 seconds, but sometimes after a few seconds) with usually 1-4 tps, but sometimes more, max 136 tps.

* As I said, I use a script to take a snapshot of various /sys and /proc information, and running this script resolved the deadlock twice.

* The recorded stack traces of mdX_raid6 of the hanging raid recorded in the two events were

[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

and

[<0>] md_super_wait+0x72/0xa0
[<0>] md_bitmap_unplug.part.0+0xce/0x100
[<0>] raid5d+0xe4/0x5a0
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

But note that these probably were taken after the previous commands in the script had already unfrozen the system. Today I've manually looked at the stack while the system was still frozen, and it was just

[<0>] md_thread+0x122/0x160
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

* Because I knew that my script seems to unblock the system, I've run it slowly line by line to see what actually unfreezes the system. There is one loop which takes "comm", "cmdline" and "stack" of all threads:

    for task in /proc/*/task/*; do
        echo "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
        cmd cat $task/stack
    done

I've added a few "read"s to single-step it. Unfortunately, when it came to the 64 nfsd threads, I got a bit impatient and hit "return" faster than I should have, and when the unfreeze happened, I couldn't say exactly where it was triggered. But it must have been somewhere in this tail:

# # /proc/1299/task/1299: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/13/task/13: ksoftirqd/0

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/130/task/130: cpuhp/22

[<0>] smpboot_thread_fn+0xf3/0x140
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

# # /proc/1300/task/1300: nfsd

[<0>] svc_recv+0x7a7/0x8c0 [sunrpc]
[<0>] nfsd+0xd6/0x140 [nfsd]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x22/0x30

## (3 more repetitions of other nfsd threads with exactly the same stack skipped here) ##

So it appears that possibly a `cat /proc/PID/stack` of a "ksoftirqd" or (maybe) a "cpuhp" thread unblocks the system. "nfsd" seems unlikely, as there shouldn't have been, and wasn't, anything NFS-mounted from this system.

Conclusion: This is probably not related to mdraid at all and might be a problem of the block layer or some infrastructure subsystem. Do you agree?

Best

Donald

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
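[Editor's note] Donald's snapshot loop can be written out as a self-contained sketch. The `PROC` override, the `[0-9]*` pid filter, and the `2>/dev/null` guards are additions for illustration (the original uses a `cmd` wrapper that is not shown in the mail); reading /proc/PID/stack requires root:

```shell
# Dump comm, cmdline and kernel stack of every thread, in the style of
# the snapshot loop quoted above. PROC defaults to /proc but can be
# pointed at a fake tree for testing.
PROC="${PROC:-/proc}"

dump_task_stacks() {
    for task in "$PROC"/[0-9]*/task/*; do
        [ -d "$task" ] || continue
        printf '# # %s: %s : %s\n' "$task" \
            "$(cat "$task/comm" 2>/dev/null)" \
            "$(xargs -0 echo < "$task/cmdline" 2>/dev/null)"
        cat "$task/stack" 2>/dev/null
    done
}
```

Run as `dump_task_stacks > snapshot.txt` (as root) to capture one snapshot of all kernel stacks.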
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Yu Kuai @ 2023-09-25 9:32 UTC
To: Donald Buczek, Yu Kuai, Dragan Stancevic, song
Cc: guoqing.jiang, it+raid, linux-kernel, linux-raid, msmith626, yangerkun, yukuai (C)

Hi,

On 2023/09/25 17:11, Donald Buczek wrote:
> On 9/25/23 03:11, Yu Kuai wrote:
>> [...]
>>
>> Sorry for the late reply, yes, this looks like a different problem. I'm
>> pretty confident that the original problem is fixed, since echo
>> idle/frozen no longer holds the 'reconfig_mutex' lock to wait for the
>> sync_thread to be done.
>>
>> I'll check patches between 5.15.94 and 5.15.131.
>
> We've got another event today. Some more information to save you work.
> I'm sorry, this comes dripping in, but as I said, currently we can't
> reproduce it and hit it on production machines only, where we have
> limited time to analyze:

There is a way to clarify whether IO is stuck in the underlying disks. Once the problem is triggered and there is no disk activity:

cat /sys/kernel/debug/block/[disk]/hctx*/sched_tags | grep busy
cat /sys/kernel/debug/block/[disk]/hctx*/tags | grep busy

If busy is not 0, IO is stuck in the underlying disk and this problem is not related to raid; otherwise raid doesn't issue any IO to the underlying disks, and the problem is related to raid.
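[Editor's note] Yu Kuai's check can be wrapped in a small helper that sums the busy counts for one disk. The `busy=` line format of the blk-mq debugfs `tags`/`sched_tags` files is assumed here, and `DEBUGFS` is overridable only so the helper can be tested outside a real debugfs:

```shell
# Sum the blk-mq "busy=" counters for one disk, per the suggestion above:
# a non-zero total means requests are stuck in the underlying device.
DEBUGFS="${DEBUGFS:-/sys/kernel/debug/block}"

busy_total() {   # $1 = disk name, e.g. sdc
    cat "$DEBUGFS/$1"/hctx*/tags "$DEBUGFS/$1"/hctx*/sched_tags 2>/dev/null \
        | sed -n 's/^busy=//p' \
        | awk '{ s += $1 } END { print s + 0 }'
}
```

Typical use on a hung array would be something like `for d in sdc sdd sde; do echo "$d $(busy_total $d)"; done` (as root, with debugfs mounted).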
> [...]
>
> * The recorded stack traces of mdX_raid6 of the hanging raid recorded
> in the two events were
>
> [<0>] md_bitmap_unplug.part.0+0xce/0x100
> [<0>] raid5d+0xe4/0x5a0
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30
>
> and
>
> [<0>] md_super_wait+0x72/0xa0
> [<0>] md_bitmap_unplug.part.0+0xce/0x100
> [<0>] raid5d+0xe4/0x5a0
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x127/0x150
> [<0>] ret_from_fork+0x22/0x30

The stacks above show that raid issued bitmap IO to the underlying disks and is waiting for that IO to be done. Unless bitmap IO is broken in raid, this problem should not be related to raid; the debugfs counters above can help to clarify this.

Thanks,
Kuai

> [...]
>
> Conclusion: This is probably not related to mdraid at all and might be
> a problem of the block layer or some infrastructure subsystem. Do you
> agree?
>
> Best
>
> Donald
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-14 13:55 ` Guoqing Jiang
  2023-03-14 14:45 ` Marc Smith
@ 2023-03-15  3:02 ` Yu Kuai
  2023-03-15  9:30 ` Guoqing Jiang
  1 sibling, 1 reply; 49+ messages in thread
From: Yu Kuai @ 2023-03-15 3:02 UTC (permalink / raw)
To: Guoqing Jiang, Marc Smith
Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid, yukuai (C)

On 2023/03/14 21:55, Guoqing Jiang wrote:
>
>
> On 3/14/23 21:25, Marc Smith wrote:
>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> <guoqing.jiang@cloud.ionos.com> wrote:
>>> Hi Donald,
>>>
>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>> Dear Guoqing,
>>>>
>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>
>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>> sync_thread, like this.
>>>>>>>
>>>>>>>          /* resync has finished, collect result */
>>>>>>>          mddev_unlock(mddev);
>>>>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>>>>          mddev_lock(mddev);
>>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>>> be a problem, if another user space operation takes the mutex?
>>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>>> these places can cause potential deadlock. Please try above two lines
>>>>> change. And perhaps others have better idea.
>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>
>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>> Great. I will send a formal patch with your reported-by and tested-by.
>>> >>> Thanks, >>> Guoqing >> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 >> of the patches that supposedly resolve this were applied to the stable >> kernels, however, one was omitted due to a regression: >> md: don't unregister sync_thread with reconfig_mutex held (upstream >> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) Hi, Guoqing, Just borrow this thread to discuss, I think this commit might have problem in some corner cases: t1: t2: action_store mddev_lock if (mddev->sync_thread) mddev_unlock md_unregister_thread md_check_recovery set_bit(MD_RECOVERY_RUNNING, &mddev->recovery) queue_work(md_misc_wq, &mddev->del_work) mddev_lock_nointr md_reap_sync_thread // clear running mddev_lock t3: md_start_sync // running is not set Our test report a problem that can be cause by this in theory, by we can't be sure for now... We thought about how to fix this, instead of calling md_register_thread() here to wait for sync_thread to be done synchronisely, we do this asynchronously like what md_set_readonly() and do_md_stop() does. What do you think? Thanks, Kuai >> >> I don't see any follow-up on the thread from June 8th 2022 asking for >> this patch to be dropped from all stable kernels since it caused a >> regression. >> >> The patch doesn't appear to be present in the current mainline kernel >> (6.3-rc2) either. So I assume this issue is still present there, or it >> was resolved differently and I just can't find the commit/patch. > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap > sync_thread in action_store". > > Thanks, > Guoqing > . > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-15  3:02 ` Yu Kuai
@ 2023-03-15  9:30 ` Guoqing Jiang
  2023-03-15  9:53 ` Yu Kuai
  0 siblings, 1 reply; 49+ messages in thread
From: Guoqing Jiang @ 2023-03-15 9:30 UTC (permalink / raw)
To: Yu Kuai, Marc Smith
Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

On 3/15/23 11:02, Yu Kuai wrote:
>
>
> On 2023/03/14 21:55, Guoqing Jiang wrote:
>>
>>
>> On 3/14/23 21:25, Marc Smith wrote:
>>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>> <guoqing.jiang@cloud.ionos.com> wrote:
>>>> Hi Donald,
>>>>
>>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>>> Dear Guoqing,
>>>>>
>>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>>
>>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>>> sync_thread, like this.
>>>>>>>>
>>>>>>>>          /* resync has finished, collect result */
>>>>>>>>          mddev_unlock(mddev);
>>>>>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>>>>>          mddev_lock(mddev);
>>>>>>> As above: While we wait for the sync thread to terminate,
>>>>>>> wouldn't it
>>>>>>> be a problem, if another user space operation takes the mutex?
>>>>>> I don't think other places can be blocked while hold mutex,
>>>>>> otherwise
>>>>>> these places can cause potential deadlock. Please try above two
>>>>>> lines
>>>>>> change. And perhaps others have better idea.
>>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>>
>>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>> >>>> Thanks, >>>> Guoqing >>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2 >>> of the patches that supposedly resolve this were applied to the stable >>> kernels, however, one was omitted due to a regression: >>> md: don't unregister sync_thread with reconfig_mutex held (upstream >>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934) > Hi, Guoqing, > > Just borrow this thread to discuss, I think this commit might have > problem in some corner cases: > > t1: t2: > action_store > mddev_lock > if (mddev->sync_thread) > mddev_unlock > md_unregister_thread > md_check_recovery > set_bit(MD_RECOVERY_RUNNING, &mddev->recovery) > queue_work(md_misc_wq, &mddev->del_work) > mddev_lock_nointr > md_reap_sync_thread > // clear running > mddev_lock > > t3: > md_start_sync > // running is not set What does 'running' mean? MD_RECOVERY_RUNNING? > Our test report a problem that can be cause by this in theory, by we > can't be sure for now... I guess you tried to describe racy between action_store -> md_register_thread and md_start_sync -> md_register_thread Didn't you already fix them in the series? [PATCH -next 0/5] md: fix uaf for sync_thread Sorry, I didn't follow the problem and also your series, I might try your test with latest mainline kernel if the test is available somewhere. > We thought about how to fix this, instead of calling > md_register_thread() here to wait for sync_thread to be done > synchronisely, IMO, md_register_thread just create and wake a thread, not sure why it waits for sync_thread. > we do this asynchronously like what md_set_readonly() and > do_md_stop() does. Still, I don't have clear picture about the problem, so I can't judge it. Thanks, Guoqing ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-15  9:30 ` Guoqing Jiang
@ 2023-03-15  9:53 ` Yu Kuai
  0 siblings, 0 replies; 49+ messages in thread
From: Yu Kuai @ 2023-03-15 9:53 UTC (permalink / raw)
To: Guoqing Jiang, Marc Smith
Cc: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid, yukuai (C)

Hi,

On 2023/03/15 17:30, Guoqing Jiang wrote:
>
>> Just borrow this thread to discuss, I think this commit might have
>> problem in some corner cases:
>>
>> t1:                              t2:
>> action_store
>>  mddev_lock
>>  if (mddev->sync_thread)
>>   mddev_unlock
>>   md_unregister_thread
>>                                  md_check_recovery
>>                                   set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>>                                   queue_work(md_misc_wq, &mddev->del_work)
>>   mddev_lock_nointr
>>   md_reap_sync_thread
>>   // clear running
>>  mddev_lock
>>
>> t3:
>> md_start_sync
>> // running is not set
>
> What does 'running' mean? MD_RECOVERY_RUNNING?
>
>> Our test report a problem that can be cause by this in theory, by we
>> can't be sure for now...
>
> I guess you tried to describe racy between
>
> action_store -> md_register_thread
>
> and
>
> md_start_sync -> md_register_thread
>
> Didn't you already fix them in the series?
>
> [PATCH -next 0/5] md: fix uaf for sync_thread
>
> Sorry, I didn't follow the problem and also your series, I might try your
> test with latest mainline kernel if the test is available somewhere.
>
>> We thought about how to fix this, instead of calling
>> md_register_thread() here to wait for sync_thread to be done
>> synchronisely,
>
> IMO, md_register_thread just create and wake a thread, not sure why it
> waits for sync_thread.
>
>> we do this asynchronously like what md_set_readonly() and do_md_stop()
>> does.
>
> Still, I don't have clear picture about the problem, so I can't judge it.
>

Sorry that I didn't explain the problem clearly.
Let me explain the problem we met first:

1) raid10d is waiting for sync_thread to stop:

 raid10d
  md_unregister_thread
   kthread_stop

2) sync_thread is waiting for io to finish:

 md_do_sync
  wait_event(... atomic_read(&mddev->recovery_active) == 0)

3) io is waiting for raid10d to finish (online crash found 2 io in conf->retry_list)

Additional information from online crash:

mddev->recovery = 29, // DONE, RUNNING, INTR are set

PID: 138293  TASK: ffff0000de89a900  CPU: 7  COMMAND: "md0_resync"
 #0 [ffffa00107c178a0] __switch_to at ffffa0010001d75c
 #1 [ffffa00107c178d0] __schedule at ffffa001017c7f14
 #2 [ffffa00107c179f0] schedule at ffffa001017c880c
 #3 [ffffa00107c17a20] md_do_sync at ffffa0010129cdb4
 #4 [ffffa00107c17d50] md_thread at ffffa00101290d9c
 #5 [ffffa00107c17e50] kthread at ffffa00100187a74

PID: 138294  TASK: ffff0000eba13d80  CPU: 5  COMMAND: "md0_resync"
 #0 [ffffa00107e47a60] __switch_to at ffffa0010001d75c
 #1 [ffffa00107e47a90] __schedule at ffffa001017c7f14
 #2 [ffffa00107e47bb0] schedule at ffffa001017c880c
 #3 [ffffa00107e47be0] schedule_timeout at ffffa001017d1298
 #4 [ffffa00107e47d50] md_thread at ffffa00101290ee8
 #5 [ffffa00107e47e50] kthread at ffffa00100187a74

// there are two sync_threads for md0

I believe the root cause is that two sync_threads exist for the same mddev, and this is how I think that is possible:

t1:                              t2:
action_store
 mddev_lock
 if (mddev->sync_thread)
  mddev_unlock
  md_unregister_thread
  // first sync_thread is done
                                 md_check_recovery
                                  set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
                                  queue_work(md_misc_wq, &mddev->del_work)
  mddev_lock_nointr
  md_reap_sync_thread
  // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:
md_start_sync
// second sync_thread is registered

t4:
md_check_recovery
 queue_work(md_misc_wq, &mddev->del_work)
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started

This is just a guess; I can't reproduce the problem yet.

Please let me know if you have any questions.

Thanks,
Kuai

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2023-03-14 13:25 ` Marc Smith
  2023-03-14 13:55 ` Guoqing Jiang
@ 2023-03-15  7:52 ` Donald Buczek
  1 sibling, 0 replies; 49+ messages in thread
From: Donald Buczek @ 2023-03-15 7:52 UTC (permalink / raw)
To: Marc Smith, Guoqing Jiang
Cc: Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi,

I can just comment that the simple patch I proposed at

https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de/

works for us, with several different kernel versions and currently 195 raid6 jbods on 105 systems, going through several "idle->sync->idle" transitions each month, for over two years now. So if you suffer from the problem and are able to add patches to the kernel you use, you might give it a try.

Best

Donald

On 3/14/23 14:25, Marc Smith wrote:
> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
> <guoqing.jiang@cloud.ionos.com> wrote:
>>
>> Hi Donald,
>>
>> On 2/8/21 19:41, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>
>>>>
>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>> sync_thread, like this.
>>>>>>
>>>>>>          /* resync has finished, collect result */
>>>>>>          mddev_unlock(mddev);
>>>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>>>          mddev_lock(mddev);
>>>>>
>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>> be a problem, if another user space operation takes the mutex?
>>>>
>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>> these places can cause potential deadlock. Please try above two lines
>>>> change. And perhaps others have better idea.
>>>
>>> Yes, this works. No deadlock after >11000 seconds,
>>>
>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>
>> Great.
>> I will send a formal patch with your reported-by and tested-by.
>>
>> Thanks,
>> Guoqing
>
> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
> of the patches that supposedly resolve this were applied to the stable
> kernels, however, one was omitted due to a regression:
> md: don't unregister sync_thread with reconfig_mutex held (upstream
> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>
> I don't see any follow-up on the thread from June 8th 2022 asking for
> this patch to be dropped from all stable kernels since it caused a
> regression.
>
> The patch doesn't appear to be present in the current mainline kernel
> (6.3-rc2) either. So I assume this issue is still present there, or it
> was resolved differently and I just can't find the commit/patch.
>
> I can induce the issue by using Donald's script above which will
> eventually result in hangs:
> ...
> [147948.504621] INFO: task md_test_2.sh:68033 blocked for more than 122 seconds.
> [147948.504624]       Tainted: P           OE     5.4.229-esos.prod #1
> [147948.504624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147948.504625] md_test_2.sh    D    0 68033      1 0x00000004
> [147948.504627] Call Trace:
> [147948.504634]  __schedule+0x4ab/0x4f3
> [147948.504637]  ? usleep_range+0x7a/0x7a
> [147948.504638]  schedule+0x67/0x81
> [147948.504639]  schedule_timeout+0x2c/0xe5
> [147948.504643]  ? do_raw_spin_lock+0x2b/0x52
> [147948.504644]  __wait_for_common+0xc4/0x13a
> [147948.504647]  ? wake_up_q+0x40/0x40
> [147948.504649]  kthread_stop+0x9a/0x117
> [147948.504653]  md_unregister_thread+0x43/0x4d
> [147948.504655]  md_reap_sync_thread+0x1c/0x1d5
> [147948.504657]  action_store+0xc9/0x284
> [147948.504658]  md_attr_store+0x9f/0xb8
> [147948.504661]  kernfs_fop_write+0x10a/0x14c
> [147948.504664]  vfs_write+0xa0/0xdd
> [147948.504666]  ksys_write+0x71/0xba
> [147948.504668]  do_syscall_64+0x52/0x60
> [147948.504671]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> ...
> [147948.504748] INFO: task md120_resync:135315 blocked for more than 122 seconds.
> [147948.504749]       Tainted: P           OE     5.4.229-esos.prod #1
> [147948.504749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [147948.504749] md120_resync    D    0 135315      2 0x80004000
> [147948.504750] Call Trace:
> [147948.504752]  __schedule+0x4ab/0x4f3
> [147948.504754]  ? printk+0x53/0x6a
> [147948.504755]  schedule+0x67/0x81
> [147948.504756]  md_do_sync+0xae7/0xdd9
> [147948.504758]  ? remove_wait_queue+0x41/0x41
> [147948.504759]  md_thread+0x128/0x151
> [147948.504761]  ? _raw_spin_lock_irqsave+0x31/0x5d
> [147948.504762]  ? md_start_sync+0xdc/0xdc
> [147948.504763]  kthread+0xe4/0xe9
> [147948.504764]  ? kthread_flush_worker+0x70/0x70
> [147948.504765]  ret_from_fork+0x35/0x40
> ...
>
> This happens on 'raid6' MD RAID arrays that initially have sync_action==resync.
>
> Any guidance would be greatly appreciated.
>
> --Marc

--
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 49+ messages in thread
end of thread, other threads:[~2023-09-25 9:33 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek
2020-11-30  2:06 ` Guoqing Jiang
2020-12-01  9:29 ` Donald Buczek
2020-12-02 17:28 ` Donald Buczek
2020-12-03  1:55 ` Guoqing Jiang
2020-12-03 11:42 ` Donald Buczek
2020-12-21 12:33 ` Donald Buczek
2021-01-19 11:30 ` Donald Buczek
2021-01-20 16:33 ` Guoqing Jiang
2021-01-23 13:04 ` Donald Buczek
2021-01-25  8:54 ` Donald Buczek
2021-01-25 21:32 ` Donald Buczek
2021-01-26  0:44 ` Guoqing Jiang
2021-01-26  9:50 ` Donald Buczek
2021-01-26 11:14 ` Guoqing Jiang
2021-01-26 12:58 ` Donald Buczek
2021-01-26 14:06 ` Guoqing Jiang
2021-01-26 16:05 ` Donald Buczek
2021-02-02 15:42 ` Guoqing Jiang
2021-02-08 11:38 ` Donald Buczek
2021-02-08 14:53 ` Guoqing Jiang
2021-02-08 18:41 ` Donald Buczek
2021-02-09  0:46 ` Guoqing Jiang
2021-02-09  9:24 ` Donald Buczek
2023-03-14 13:25 ` Marc Smith
2023-03-14 13:55 ` Guoqing Jiang
2023-03-14 14:45 ` Marc Smith
2023-03-16 15:25 ` Marc Smith
2023-03-29  0:01 ` Song Liu
2023-08-22 21:16 ` Dragan Stancevic
2023-08-23  1:22 ` Yu Kuai
2023-08-23 15:33 ` Dragan Stancevic
2023-08-24  1:18 ` Yu Kuai
2023-08-28 20:32 ` Dragan Stancevic
2023-08-30  1:36 ` Yu Kuai
2023-09-05  3:50 ` Yu Kuai
2023-09-05 13:54 ` Dragan Stancevic
2023-09-13  9:08 ` Donald Buczek
2023-09-13 14:16 ` Dragan Stancevic
2023-09-14  6:03 ` Donald Buczek
2023-09-17  8:55 ` Donald Buczek
2023-09-24 14:35 ` Donald Buczek
2023-09-25  1:11 ` Yu Kuai
2023-09-25  9:11 ` Donald Buczek
2023-09-25  9:32 ` Yu Kuai
2023-03-15  3:02 ` Yu Kuai
2023-03-15  9:30 ` Guoqing Jiang
2023-03-15  9:53 ` Yu Kuai
2023-03-15  7:52 ` Donald Buczek