Linux-Raid Archives on lore.kernel.org
* md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
@ 2020-11-28 12:25 Donald Buczek
  2020-11-30  2:06 ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2020-11-28 12:25 UTC (permalink / raw)
  To: Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Linux mdraid people,

we are using raid6 on several servers. Occasionally we have had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.

The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.

The triggering event seems to be the mdcheck cron job trying to pause the ongoing check operation in the morning with

     echo idle > /sys/devices/virtual/block/md1/md/sync_action

This doesn't complete. Here's /proc/<pid>/stack of this process:

     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
     UID        PID  PPID  C STIME TTY          TIME CMD
     root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
     root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
     [<0>] kthread_stop+0x6e/0x150
     [<0>] md_unregister_thread+0x3e/0x70
     [<0>] md_reap_sync_thread+0x1f/0x1e0
     [<0>] action_store+0x141/0x2b0
     [<0>] md_attr_store+0x71/0xb0
     [<0>] kernfs_fop_write+0x113/0x1a0
     [<0>] vfs_write+0xbc/0x250
     [<0>] ksys_write+0xa1/0xe0
     [<0>] do_syscall_64+0x33/0x40
     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note that md0 had been paused successfully just before.

     2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
     2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
     2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
     2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
     
     2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
     2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
     2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
     [ nothing related following.... ]

After that, we see md1_raid6 in a busy loop:

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6

Also, all processes doing I/O to the md device block.

This is /proc/mdstat:

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           bitmap: 0/59 pages [0KB], 65536KB chunk

There doesn't seem to be any further progress.

I've taken a function_graph trace of the looping md1_raid6 process: https://owww.molgen.mpg.de/~buczek/2020-11-27_trace.txt (30 MB)

Maybe this helps to get an idea of what might be going on?

Best
   Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek
@ 2020-11-30  2:06 ` Guoqing Jiang
  2020-12-01  9:29   ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2020-11-30  2:06 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid



On 11/28/20 13:25, Donald Buczek wrote:
> Dear Linux mdraid people,
> 
> we are using raid6 on several servers. Occasionally we had failures, 
> where a mdX_raid6 process seems to go into a busy loop and all I/O to 
> the md device blocks. We've seen this on various kernel versions.
> 
> The last time this happened (in this case with Linux 5.10.0-rc4), I took 
> some data.
> 
> The triggering event seems to be the md_check cron job trying to pause 
> the ongoing check operation in the morning with
> 
>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
> 
> This doesn't complete. Here's /proc/stack of this process:
> 
>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>      UID        PID  PPID  C STIME TTY          TIME CMD
>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash 
> /usr/bin/mdcheck --continue --duration 06:00
>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat 
> /proc/23333/stack
>      [<0>] kthread_stop+0x6e/0x150
>      [<0>] md_unregister_thread+0x3e/0x70
>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>      [<0>] action_store+0x141/0x2b0
>      [<0>] md_attr_store+0x71/0xb0
>      [<0>] kernfs_fop_write+0x113/0x1a0
>      [<0>] vfs_write+0xbc/0x250
>      [<0>] ksys_write+0xa1/0xe0
>      [<0>] do_syscall_64+0x33/0x40
>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Note, that md0 has been paused successfully just before.

What is the personality of md0? Is it also raid6?

> 
>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD 
> (/usr/bin/mdcheck --continue --duration "06:00")
>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking 
> /dev/md0 from 10623180920
>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: 
> data-check of RAID array md0
>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking 
> /dev/md1 from 11582849320
>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: 
> data-check of RAID array md1
>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: 
> md0: data-check interrupted.
>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 
> 13351127680
>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: 
> md1: data-check interrupted.
>      [ nothing related following.... ]
> 
> After that, we see md1_raid6 in a busy loop:
> 
>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ 
> COMMAND
>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 
> md1_raid6

It seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was being triggered again and again.

> 
> Also, all processes doing I/O do the md device block.
> 
> This is /proc/mdstat:
> 
>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] 
> [multipath]
>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] 
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>            [==================>..]  check = 94.0% 
> (7350290348/7813894144) finish=57189.3min speed=135K/sec
>            bitmap: 0/59 pages [0KB], 65536KB chunk
>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] 
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] 
> sdu[2] sdt[1]
>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>            bitmap: 0/59 pages [0KB], 65536KB chunk
> 

So the RECOVERY_CHECK flag should be set. I am not sure if this simple change
helps, but you may give it a try.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f..e2697d0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
                         md_update_sb(mddev, 0);

                 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
-                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
+                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
+                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
                         /* resync/recovery still happening */
                         clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                         goto unlock;

Thanks,
Guoqing

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-11-30  2:06 ` Guoqing Jiang
@ 2020-12-01  9:29   ` Donald Buczek
  2020-12-02 17:28     ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2020-12-01  9:29 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

On 30.11.20 03:06, Guoqing Jiang wrote:
> 
> 
> On 11/28/20 13:25, Donald Buczek wrote:
>> Dear Linux mdraid people,
>>
>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>
>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>
>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>
>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>
>> This doesn't complete. Here's /proc/stack of this process:
>>
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>      [<0>] kthread_stop+0x6e/0x150
>>      [<0>] md_unregister_thread+0x3e/0x70
>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>      [<0>] action_store+0x141/0x2b0
>>      [<0>] md_attr_store+0x71/0xb0
>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>      [<0>] vfs_write+0xbc/0x250
>>      [<0>] ksys_write+0xa1/0xe0
>>      [<0>] do_syscall_64+0x33/0x40
>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Note, that md0 has been paused successfully just before.
> 
> What is the personality of md0? Is it also raid6?

Yes.

> 
>>
>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>      [ nothing related following.... ]
>>
>> After that, we see md1_raid6 in a busy loop:
>>
>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
> 
> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
> 
>>
>> Also, all processes doing I/O do the md device block.
>>
>> This is /proc/mdstat:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
> 
> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
> helps, but you may give it a try.

Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.

Best
   Donald

> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 98bac4f..e2697d0 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>                          md_update_sb(mddev, 0);
> 
>                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>                          /* resync/recovery still happening */
>                          clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>                          goto unlock;
> 
> Thanks,
> Guoqing

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-01  9:29   ` Donald Buczek
@ 2020-12-02 17:28     ` Donald Buczek
  2020-12-03  1:55       ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2020-12-02 17:28 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:

     --- a/drivers/md/md.c
     +++ b/drivers/md/md.c
     @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
                             clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                             goto unlock;
                     }
     +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
     +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
     +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
     +                       /* resync/recovery still happening */
     +                       pr_info("md: XXX BUGFIX applied\n");
     +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
     +                       goto unlock;
     +               }
                     if (mddev->sync_thread) {
                             md_reap_sync_thread(mddev);
                             goto unlock;

By pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action`, while the check of md1 is still ongoing:

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk

after 1 minute:

     Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
     md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk
     
     md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
           109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
           [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
           bitmap: 0/59 pages [0KB], 65536KB chunk

A data point I didn't mention in my previous mail is that the mdX_resync thread is not gone when the problem occurs:

     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
     root       134     2  0 Nov30 ?        00:00:00 [md]
     root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
     root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
     root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
     root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
     buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
     
     buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
     [<0>] md_bitmap_cond_end_sync+0x12d/0x170
     [<0>] raid5_sync_request+0x24b/0x390
     [<0>] md_do_sync+0xb41/0x1030
     [<0>] md_thread+0x122/0x160
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x1f/0x30

I guess md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait, atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
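
For reference, my understanding of the two sides of that handshake, paraphrased from md-bitmap.c and md.c (not the literal kernel code, so take it with a grain of salt):

     /* md-bitmap.c, md_bitmap_cond_end_sync(): the resync thread parks here
      * until all resync I/O it has issued so far has been accounted for */
     wait_event(bitmap->mddev->recovery_wait,
                atomic_read(&bitmap->mddev->recovery_active) == 0);

     /* md.c, md_done_sync(): called from the personality (i.e. via the
      * mdX_raid6 thread's stripe handling) as resync I/O completes; this
      * is what brings recovery_active back down and wakes the waiter above */
     void md_done_sync(struct mddev *mddev, int blocks, int ok)
     {
             /* another "blocks" sectors have completed */
             atomic_sub(blocks, &mddev->recovery_active);
             wake_up(&mddev->recovery_wait);
             /* (error handling omitted) */
     }

So if the raid6 thread stops completing stripes, recovery_active never reaches zero and md1_resync waits forever.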

Donald


On 01.12.20 10:29, Donald Buczek wrote:
> On 30.11.20 03:06, Guoqing Jiang wrote:
>>
>>
>> On 11/28/20 13:25, Donald Buczek wrote:
>>> Dear Linux mdraid people,
>>>
>>> we are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions.
>>>
>>> The last time this happened (in this case with Linux 5.10.0-rc4), I took some data.
>>>
>>> The triggering event seems to be the md_check cron job trying to pause the ongoing check operation in the morning with
>>>
>>>      echo idle > /sys/devices/virtual/block/md1/md/sync_action
>>>
>>> This doesn't complete. Here's /proc/stack of this process:
>>>
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# ps -fp 23333
>>>      UID        PID  PPID  C STIME TTY          TIME CMD
>>>      root     23333 23331  0 02:00 ?        00:00:00 /bin/bash /usr/bin/mdcheck --continue --duration 06:00
>>>      root@done:~/linux_problems/mdX_raid6_looping/2020-11-27# cat /proc/23333/stack
>>>      [<0>] kthread_stop+0x6e/0x150
>>>      [<0>] md_unregister_thread+0x3e/0x70
>>>      [<0>] md_reap_sync_thread+0x1f/0x1e0
>>>      [<0>] action_store+0x141/0x2b0
>>>      [<0>] md_attr_store+0x71/0xb0
>>>      [<0>] kernfs_fop_write+0x113/0x1a0
>>>      [<0>] vfs_write+0xbc/0x250
>>>      [<0>] ksys_write+0xa1/0xe0
>>>      [<0>] do_syscall_64+0x33/0x40
>>>      [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Note, that md0 has been paused successfully just before.
>>
>> What is the personality of md0? Is it also raid6?
> 
> Yes.
> 
>>
>>>
>>>      2020-11-27T02:00:01+01:00 done CROND[23333]: (root) CMD (/usr/bin/mdcheck --continue --duration "06:00")
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md0 from 10623180920
>>>      2020-11-27T02:00:01.382994+01:00 done kernel: [378596.606381] md: data-check of RAID array md0
>>>      2020-11-27T02:00:01+01:00 done root: mdcheck continue checking /dev/md1 from 11582849320
>>>      2020-11-27T02:00:01.437999+01:00 done kernel: [378596.661559] md: data-check of RAID array md1
>>>      2020-11-27T06:00:01.842003+01:00 done kernel: [392996.625147] md: md0: data-check interrupted.
>>>      2020-11-27T06:00:02+01:00 done root: pause checking /dev/md0 at 13351127680
>>>      2020-11-27T06:00:02.338989+01:00 done kernel: [392997.122520] md: md1: data-check interrupted.
>>>      [ nothing related following.... ]
>>>
>>> After that, we see md1_raid6 in a busy loop:
>>>
>>>      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>>      2376 root     20   0       0      0      0 R 100.0  0.0   1387:38 md1_raid6
>>
>> Seems the reap sync thread was trying to stop md1_raid6 while md1_raid6 was triggered again and again.
>>
>>>
>>> Also, all processes doing I/O do the md device block.
>>>
>>> This is /proc/mdstat:
>>>
>>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            [==================>..]  check = 94.0% (7350290348/7813894144) finish=57189.3min speed=135K/sec
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>>
>>
>> So the RECOVERY_CHECK flag should be set, not sure if the simple changes
>> helps, but you may give it a try.
> 
> Thanks. I've copied the original condition block to execute before the modified one and added some logging to it to see, if the change actually triggers. I will pause and unpause the check frequently on a busy machine to get this code executed more often.
> 
> Best
>    Donald
> 
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 98bac4f..e2697d0 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -9300,7 +9300,8 @@ void md_check_recovery(struct mddev *mddev)
>>                          md_update_sb(mddev, 0);
>>
>>                  if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>> -                   !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
>> +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>> +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>                          /* resync/recovery still happening */
>>                          clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                          goto unlock;
>>
>> Thanks,
>> Guoqing
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-02 17:28     ` Donald Buczek
@ 2020-12-03  1:55       ` Guoqing Jiang
  2020-12-03 11:42         ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2020-12-03  1:55 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
> 
> unfortunately the patch didn't fix the problem (unless I messed it up 
> with my logging). This is what I used:
> 
>      --- a/drivers/md/md.c
>      +++ b/drivers/md/md.c
>      @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>                              clear_bit(MD_RECOVERY_NEEDED, 
> &mddev->recovery);
>                              goto unlock;
>                      }

I think you can add the check of RECOVERY_CHECK to the above part instead of
adding a new part.

>      +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>      +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>      +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>      +                       /* resync/recovery still happening */
>      +                       pr_info("md: XXX BUGFIX applied\n");
>      +                       clear_bit(MD_RECOVERY_NEEDED, 
> &mddev->recovery);
>      +                       goto unlock;
>      +               }
>                      if (mddev->sync_thread) {
>                              md_reap_sync_thread(mddev);
>                              goto unlock;


> 
> With pausing and continuing the check four times an hour, I could 
> trigger the problem after about 48 hours. This time, the other device 
> (md0) has locked up on `echo idle > 
> /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1 
> is still ongoing:

Without the patch, md0 was good while md1 was locked. So the patch
switches the status of the two arrays, which is a little weird ...

What is the stack of the process? I guess it is the same as the stack of
23333 in your previous mail, but just to confirm.

> 
>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] 
> [multipath]
>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] 
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>            [=>...................]  check =  8.5% (666852112/7813894144) 
> finish=1271.2min speed=93701K/sec
>            bitmap: 0/59 pages [0KB], 65536KB chunk
>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] 
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] 
> sdu[2] sdt[1]
>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>            [>....................]  check =  0.2% (19510348/7813894144) 
> finish=253779.6min speed=511K/sec
>            bitmap: 0/59 pages [0KB], 65536KB chunk
> 
> after 1 minute:
> 
>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] 
> [multipath]
>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] 
> sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>            [=>...................]  check =  8.6% (674914560/7813894144) 
> finish=941.1min speed=126418K/sec
>            bitmap: 0/59 pages [0KB], 65536KB chunk
>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] 
> sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] 
> sdu[2] sdt[1]
>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 
> 2 [16/16] [UUUUUUUUUUUUUUUU]
>            [>....................]  check =  0.2% (19510348/7813894144) 
> finish=256805.0min speed=505K/sec
>            bitmap: 0/59 pages [0KB], 65536KB chunk
> 
> A data point, I didn't mention in my previous mail, is that the 
> mdX_resync thread is not gone when the problem occurs:
> 
>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
>      root       134     2  0 Nov30 ?        00:00:00 [md]
>      root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>      root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>      root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>      root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>      buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat 
> /proc/5845/stack
>      [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>      [<0>] raid5_sync_request+0x24b/0x390
>      [<0>] md_do_sync+0xb41/0x1030
>      [<0>] md_thread+0x122/0x160
>      [<0>] kthread+0x118/0x130
>      [<0>] ret_from_fork+0x1f/0x30
> 
> I guess, md_bitmap_cond_end_sync+0x12d is the 
> `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) 
> == 0);` in md-bitmap.c.
> 

Could be; if so, then I think md_done_sync was not triggered by the path
md0_raid6 -> ... -> handle_stripe.

I'd suggest comparing the stacks of md0 and md1 to find the difference.

Thanks,
Guoqing

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-03  1:55       ` Guoqing Jiang
@ 2020-12-03 11:42         ` Donald Buczek
  2020-12-21 12:33           ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2020-12-03 11:42 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 12/3/20 2:55 AM, Guoqing Jiang wrote:
> Hi Donald,
> 
> On 12/2/20 18:28, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used:
>>
>>      --- a/drivers/md/md.c
>>      +++ b/drivers/md/md.c
>>      @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>>                              clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>                              goto unlock;
>>                      }
> 
> I think you can add the check of RECOVERY_CHECK in above part instead of add a new part.

I understand that. I did it this way to get a log only when the changed expression makes a difference (the original block not triggering its goto while the modified block does trigger), and to be safe even if either of us made some stupid error with the boolean expressions.

> 
>>      +               if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>      +                   (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
>>      +                    test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
>>      +                       /* resync/recovery still happening */
>>      +                       pr_info("md: XXX BUGFIX applied\n");
>>      +                       clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>>      +                       goto unlock;
>>      +               }
>>                      if (mddev->sync_thread) {
>>                              md_reap_sync_thread(mddev);
>>                              goto unlock;
> 
> 
>>
>> With pausing and continuing the check four times an hour, I could trigger the problem after about 48 hours. This time, the other device (md0) has locked up on `echo idle > /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1 is still ongoing:
> 
> Without the patch, md0 was good while md1 was locked. So the patch switches the status of the two arrays, a little weird ...

I think it's just random. There is a slight chance of a deadlock, depending on I/O to the device or other system load when stopping the resync. In the last experiment, both md0 and md1 were changed from "check" to "idle" about 400 times before one of them blocked.

OTOH - and I don't understand this at all - it looks like md0 hung again after a reboot to a 5.4 kernel at the very first "check" to "idle" transition, which seems to contradict the "slight" chance and "once after 400 attempts".

> What is the stack of the process? I guess it is same as the stack of 23333 in your previous mail, but just to confirm.

Of the md0_raid6 thread? Sorry, I didn't take that data before the reboot this time, and concentrated on the non-terminating md0_resync thread.

>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> after 1 minute:
>>
>>      Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
>>      md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>      md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>>            109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>>            [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
>>            bitmap: 0/59 pages [0KB], 65536KB chunk
>>
>> A data point, I didn't mention in my previous mail, is that the mdX_resync thread is not gone when the problem occurs:
>>
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
>>      root       134     2  0 Nov30 ?        00:00:00 [md]
>>      root      1321     2 27 Nov30 ?        12:57:14 [md0_raid6]
>>      root      1454     2 26 Nov30 ?        12:37:23 [md1_raid6]
>>      root      5845     2  0 16:20 ?        00:00:30 [md0_resync]
>>      root      5855     2 13 16:20 ?        00:14:11 [md1_resync]
>>      buczek    9880  9072  0 18:05 pts/0    00:00:00 grep -F [md
>>      buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
>>      [<0>] md_bitmap_cond_end_sync+0x12d/0x170
>>      [<0>] raid5_sync_request+0x24b/0x390
>>      [<0>] md_do_sync+0xb41/0x1030
>>      [<0>] md_thread+0x122/0x160
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x1f/0x30
>>
>> I guess, md_bitmap_cond_end_sync+0x12d is the `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active) == 0);` in md-bitmap.c.
>>
> 
> Could be, if so, then I think md_done_sync was not triggered by the path md0_raid6 -> ... -> handle_stripe.

Ok. Maybe I can find a way to add some logging there.

> I'd suggest to compare the stacks between md0 and md1 to find the difference.

Compare the stacks of md0_raid6 and md1_raid6?

Okay, but when one md is deadlocked, the other is operating just normally. I/O can be done and you can even start and stop a resync on it. So I'd expect the stack of the operating raid to vary.

Thanks.

   Donald

> 
> Thanks,
> Guoqing


-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-03 11:42         ` Donald Buczek
@ 2020-12-21 12:33           ` Donald Buczek
  2021-01-19 11:30             ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2020-12-21 12:33 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

I now think that this is not an issue for md. I've driven a system into that situation again and have a clear indication that this is a problem of the member block device driver.

With md0 in the described erroneous state (md0_raid6 busy looping, echo idle > .../sync_action blocked, no progress in mdstat) and md1 operating normally:

     root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done
     /sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0
     /sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best
   Donald

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2020-12-21 12:33           ` Donald Buczek
@ 2021-01-19 11:30             ` Donald Buczek
  2021-01-20 16:33               ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-19 11:30 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear md-raid people,

I've reported a problem in this thread in December:

"We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.

On 12/21/20, I reported that the problem might be caused by a failure of the underlying block device driver, because I had found the "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.

I was able to collect a lot of additional information from a blocked system.

- The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
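
  For context, the chain behind that write, paraphrased from md.c (not the literal code): action_store("idle") calls md_reap_sync_thread, which does roughly

  /* paraphrased from drivers/md/md.c */
  void md_reap_sync_thread(struct mddev *mddev)
  {
          /* resync has finished (or is being aborted), collect the result */
          md_unregister_thread(&mddev->sync_thread);
          /* ... */
  }

  void md_unregister_thread(struct md_thread **threadp)
  {
          /* ... */
          kthread_stop(thread->tsk);  /* returns only after the thread function
                                         (md_thread -> md_do_sync) has exited */
  }

  so the write to sync_action can only return once the md1_resync thread has actually exited - and the next point shows why that never happens.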

- The sync thread ("md1_resync") does not finish, because it's blocked at

[<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
[<0>] raid5_sync_request+0x364/0x390
[<0>] md_do_sync+0xb41/0x1030
[<0>] md_thread+0x122/0x160
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

[1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735

- Yes, gdb confirms that `conf->cache_state` is 0x03 (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE)
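
  For reference, the wait in [1] looks roughly like this (paraphrased from raid5_get_active_stripe(); details differ between kernel versions):

  /* paraphrased from drivers/md/raid5.c, raid5_get_active_stripe() */
  if (!sh) {
          set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
          wait_event_lock_irq(conf->wait_for_stripe,
                  !list_empty(conf->inactive_list + hash) &&
                  (atomic_read(&conf->active_stripes)
                          < (conf->max_nr_stripes * 3 / 4) ||
                   !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
                  *(conf->hash_locks + hash));
          clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
  }

  If I read that right, with R5_INACTIVE_BLOCKED set the resync thread only wakes up again once stripes are returned to the inactive list and the number of active stripes drops below 3/4 of max_nr_stripes (29816 in the dump below), which matches the next data point: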

- We have lots of active stripes:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
27534

- handle_stripe() doesn't make progress:

echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control

In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:

[171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
                 , check:4, reconstruct:0
[171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
                 , check:4, reconstruct:0
[171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
                 , check:4, reconstruct:0

The sector numbers repeat after some time. We have only the following variants of stripe state and "check":

state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, check_state_idle

- We have MD_SB_CHANGE_PENDING set:

root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
write-pending

gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
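
The code there reads roughly like this (paraphrased from handle_stripe()):

/* paraphrased from drivers/md/raid5.c, handle_stripe() */
if (s.handle_bad_blocks ||
    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
        set_bit(STRIPE_HANDLE, &sh->state);
        goto finish;
}

If I read this right, as long as MD_SB_CHANGE_PENDING stays set, each stripe is just marked STRIPE_HANDLE again and requeued instead of being completed, which would explain both the repeating sector numbers in the debug output above and the busy-looping mdX_raid6 thread.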

- The system can manually be freed from the deadlock:

When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue, probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
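
That would fit; the "active" branch there roughly does (paraphrased, not the literal code):

/* paraphrased from drivers/md/md.c, array_state_store(), "active" case */
if (mddev->pers) {
        err = restart_array(mddev);
        if (!err) {
                clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
                wake_up(&mddev->sb_wait);
        }
}

i.e. it clears MD_SB_CHANGE_PENDING by hand and wakes sb_wait, after which handle_stripe can make progress again.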

I, of course, don't fully understand it yet. Any ideas?

I append some data from a hanging raid... (mddev, r5conf, and a sample stripe_head from the handle_list with its first disks)

Donald


$L_002: (struct mddev *) 0xffff8898a48e3000 : {
     private = (struct r5conf *) 0xffff88988695a800 /* --> $L_004 */
     pers = (struct md_personality *) 0xffffffff826ee280 <raid6_personality>
     unit = 9437185
     md_minor = 1
     disks = {next = 0xffff889886ee8c00, prev = 0xffff889891506000}
     flags = 0
     sb_flags = 6
     suspended = 0
     active_io = { counter=4 }  /* atomic_t */
     ro = 0
     sysfs_active = 0
     gendisk = (struct gendisk *) 0xffff8898bc357800
     kobj = {
         name = (const char *) 0xffffffff8220cef8 "md"
         entry = {next = 0xffff8898a48e3058, prev = 0xffff8898a48e3058}    /* empty */
         parent = (struct kobject *) 0xffff8898bc357868
         kset = 0x0
         ktype = (struct kobj_type *) 0xffffffff826ef140 <md_ktype>
         sd = (struct kernfs_node *) 0xffff8898b5f65980
         kref = {
             refcount = {
                 refs = { counter=17 }  /* atomic_t */
             }
         }
         state_initialized = 1
         state_in_sysfs = 1
         state_add_uevent_sent = 0
         state_remove_uevent_sent = 0
         uevent_suppress = 0
     }
     hold_active = 0
     major_version = 1
     minor_version = 2
     patch_version = 0
     persistent = 1
     external = 0
     metadata_type = [
(8)
         [0] : 0 '\000'
         [1] : 0 '\000'
         [2] : 0 '\000'
         [3] : 0 '\000'
         [4] : 0 '\000'
         [5] : 0 '\000'
         [6] : 0 '\000'
         [7] : 0 '\000'
         [8] : 0 '\000'
         [9] : 0 '\000'
         [10] : 0 '\000'
         [11] : 0 '\000'
         [12] : 0 '\000'
         [13] : 0 '\000'
         [14] : 0 '\000'
         [15] : 0 '\000'
         [16] : 0 '\000'
     ]
     chunk_sectors = 1024
     ctime = 1607680540
     utime = 1610853337
     level = 6
     layout = 2
     clevel = [
(8)
         [0] : 114 'r'
         [1] : 97 'a'
         [2] : 105 'i'
         [3] : 100 'd'
         [4] : 54 '6'
         [5] : 0 '\000'
         [6] : 0 '\000'
         [7] : 0 '\000'
         [8] : 0 '\000'
         [9] : 0 '\000'
         [10] : 0 '\000'
         [11] : 0 '\000'
         [12] : 0 '\000'
         [13] : 0 '\000'
         [14] : 0 '\000'
         [15] : 0 '\000'
     ]
     raid_disks = 16
     max_disks = 1920
     dev_sectors = 15627788288
     array_sectors = 218789036032
     external_size = 0
     events = 32409
     can_decrease_events = 0
     uuid = [
(8)
         [0] : -49 '\317'
         [1] : 69 'E'
         [2] : 119 'w'
         [3] : 62 '>'
         [4] : -113 '\217'
         [5] : -107 '\225'
         [6] : 9 '\t'
         [7] : -90 '\246'
         [8] : -2 '\376'
         [9] : 84 'T'
         [10] : -12 '\364'
         [11] : -31 '\341'
         [12] : -110 '\222'
         [13] : 24 '\030'
         [14] : -15 '\361'
         [15] : 109 'm'
     ]
     reshape_position = 18446744073709551615
     delta_disks = 0
     new_level = 6
     new_layout = 2
     new_chunk_sectors = 1024
     reshape_backwards = 0
     thread = (struct md_thread *) 0xffff889887bde080
     sync_thread = 0x0
     last_sync_action = (char *) 0xffffffff8225e246 "check"
     curr_resync = 4947256648
     curr_resync_completed = 4947016640
     resync_mark = 4415850173
     resync_mark_cnt = 66402936
     curr_mark_cnt = 75262560
     resync_max_sectors = 15627788288
     resync_mismatches = {
         counter = 0
     }
     suspend_lo = 0
     suspend_hi = 0
     sync_speed_min = 0
     sync_speed_max = 0
     parallel_resync = 0
     ok_start_degraded = 0
     recovery = 203
     recovery_disabled = 16
     in_sync = 0
     open_mutex = {
         owner = { counter=0 }  /* atomic_t */
         wait_lock = {counter=0} /* spinlock_t */
         osq = {
             tail = { counter=0 }  /* atomic_t */
         }
         wait_list = {next = 0xffff8898a48e31d8, prev = 0xffff8898a48e31d8}    /* empty */
     }
     reconfig_mutex = {
         owner = { counter=-131247629823744 }  /* atomic_t */
         wait_lock = {counter=0} /* spinlock_t */
         osq = {
             tail = { counter=0 }  /* atomic_t */
         }
         wait_list = {next = 0xffff8898a48e31f8, prev = 0xffff8898a48e31f8}    /* empty */
     }
     active = { counter=2 }  /* atomic_t */
     openers = { counter=1 }  /* atomic_t */
     changed = 0
     degraded = 0
     recovery_active = { counter=181360 }  /* atomic_t */
     recovery_wait = {
         lock = {counter=0} /* spinlock_t */
         head = {next = 0xffff8898a48e3228, prev = 0xffff8898a48e3228}    /* empty */
     }
     recovery_cp = 18446744073709551615
     resync_min = 4871994088
     resync_max = 18446744073709551615
     sysfs_state = (struct kernfs_node *) 0xffff8898b5f65380
     sysfs_action = (struct kernfs_node *) 0xffff889887d87600
     sysfs_completed = (struct kernfs_node *) 0xffff8898ac6f4e80
     sysfs_degraded = (struct kernfs_node *) 0xffff8898ac6f4b80
     sysfs_level = (struct kernfs_node *) 0xffff8898b5f65700
     del_work = {
         data = { counter=128 }  /* atomic_t */
         entry = {next = 0xffff8898a48e3280, prev = 0xffff8898a48e3280}    /* empty */
         func = (work_func_t) 0xffffffff819ad430 <md_start_sync>
     }
     lock = {counter=0} /* spinlock_t */
     sb_wait = {
         lock = {counter=0} /* spinlock_t */
         head = {next = 0xffffc9000e4a7a68, prev = 0xffffc9000cf97d30}
     }
     pending_writes = { counter=0 }  /* atomic_t */
     safemode = 0
     safemode_delay = 201
     safemode_timer = {
         entry = {
             next = (struct hlist_node *) 0xdead000000000122
             pprev = 0x0
         }
         expires = 4415857239
         function = (void (*)(struct timer_list *)) 0xffffffff819a6a40 <md_safemode_timeout>
         flags = 314572820
     }
     writes_pending = {
         percpu_count_ptr = 105896157983096
         data = (struct percpu_ref_data *) 0xffff8898b340fc80
     }
     sync_checkers = 0
     queue = (struct request_queue *) 0xffff8898b65847c8
     bitmap = (struct bitmap *) 0xffff889887d84400
     bitmap_info = {
         file = 0x0
         offset = 8
         space = 0
         default_offset = 2
         default_space = 6
         mutex = {
             owner = { counter=0 }  /* atomic_t */
             wait_lock = {counter=0} /* spinlock_t */
             osq = {
                 tail = { counter=0 }  /* atomic_t */
             }
             wait_list = {next = 0xffff8898a48e3350, prev = 0xffff8898a48e3350}    /* empty */
         }
         chunksize = 67108864
         daemon_sleep = 5000
         max_write_behind = 0
         external = 0
         nodes = 0
         cluster_name = [
(8)
             [0] : 0 '\000'
             [1] : 0 '\000'
             [2] : 0 '\000'
             [3] : 0 '\000'
             [4] : 0 '\000'
             [5] : 0 '\000'
             [6] : 0 '\000'
             [7] : 0 '\000'
             [8] : 0 '\000'
             [9] : 0 '\000'
             [10] : 0 '\000'
             [11] : 0 '\000'
             [12] : 0 '\000'
             [13] : 0 '\000'
             [14] : 0 '\000'
             [15] : 0 '\000'
             [16] : 0 '\000'
             [17] : 0 '\000'
             [18] : 0 '\000'
             [19] : 0 '\000'
             [20] : 0 '\000'
             [21] : 0 '\000'
             [22] : 0 '\000'
             [23] : 0 '\000'
             [24] : 0 '\000'
             [25] : 0 '\000'
             [26] : 0 '\000'
             [27] : 0 '\000'
             [28] : 0 '\000'
             [29] : 0 '\000'
             [30] : 0 '\000'
             [31] : 0 '\000'
             [32] : 0 '\000'
             [33] : 0 '\000'
             [34] : 0 '\000'
             [35] : 0 '\000'
             [36] : 0 '\000'
             [37] : 0 '\000'
             [38] : 0 '\000'
             [39] : 0 '\000'
             [40] : 0 '\000'
             [41] : 0 '\000'
             [42] : 0 '\000'
             [43] : 0 '\000'
             [44] : 0 '\000'
             [45] : 0 '\000'
             [46] : 0 '\000'
             [47] : 0 '\000'
             [48] : 0 '\000'
             [49] : 0 '\000'
             [50] : 0 '\000'
             [51] : 0 '\000'
             [52] : 0 '\000'
             [53] : 0 '\000'
             [54] : 0 '\000'
             [55] : 0 '\000'
             [56] : 0 '\000'
             [57] : 0 '\000'
             [58] : 0 '\000'
             [59] : 0 '\000'
             [60] : 0 '\000'
             [61] : 0 '\000'
             [62] : 0 '\000'
             [63] : 0 '\000'
         ]
     }
     max_corr_read_errors = { counter=20 }  /* atomic_t */
     all_mddevs = {next = 0xffffffff826eef10, prev = 0xffff88810a551bc8}
     to_remove = 0x0
     bio_set = {
         bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
         front_pad = 0
         bio_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 2
             curr_nr = 2
             elements = (void **) 0xffff8898b4fae300
             pool_data = (void *) 0xffff889886fbdc00
             alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
             free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0xffff8898a48e3428, prev = 0xffff8898a48e3428}    /* empty */
             }
         }
         bvec_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 2
             curr_nr = 2
             elements = (void **) 0xffff8898b4fae2e0
             pool_data = (void *) 0xffff8881078ecb40
             alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
             free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0xffff8898a48e3470, prev = 0xffff8898a48e3470}    /* empty */
             }
         }
         bio_integrity_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         bvec_integrity_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         rescue_lock = {counter=0} /* spinlock_t */
         rescue_list = {
             head = 0x0
             tail = 0x0
         }
         rescue_work = {
             data = { counter=68719476704 }  /* atomic_t */
             entry = {next = 0xffff8898a48e3530, prev = 0xffff8898a48e3530}    /* empty */
             func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
         }
         rescue_workqueue = 0x0
     }
     sync_set = {
         bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
         front_pad = 0
         bio_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 2
             curr_nr = 2
             elements = (void **) 0xffff8898b4fae2c0
             pool_data = (void *) 0xffff889886fbdc00
             alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
             free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0xffff8898a48e3598, prev = 0xffff8898a48e3598}    /* empty */
             }
         }
         bvec_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 2
             curr_nr = 2
             elements = (void **) 0xffff8898b4fae2a0
             pool_data = (void *) 0xffff8881078ecb40
             alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
             free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0xffff8898a48e35e0, prev = 0xffff8898a48e35e0}    /* empty */
             }
         }
         bio_integrity_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         bvec_integrity_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         rescue_lock = {counter=0} /* spinlock_t */
         rescue_list = {
             head = 0x0
             tail = 0x0
         }
         rescue_work = {
             data = { counter=68719476704 }  /* atomic_t */
             entry = {next = 0xffff8898a48e36a0, prev = 0xffff8898a48e36a0}    /* empty */
             func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
         }
         rescue_workqueue = 0x0
     }
     md_io_pool = {
         lock = {counter=0} /* spinlock_t */
         min_nr = 2
         curr_nr = 2
         elements = (void **) 0xffff8898b236af60
         pool_data = (void *) 0x28 <fixed_percpu_data+40>
         alloc = (mempool_alloc_t *) 0xffffffff811c8b50 <mempool_kmalloc>
         free = (mempool_free_t *) 0xffffffff811c84b0 <mempool_kfree>
         wait = {
             lock = {counter=0} /* spinlock_t */
             head = {next = 0xffff8898a48e36f8, prev = 0xffff8898a48e36f8}    /* empty */
         }
     }
     flush_bio = 0x0
     flush_pending = { counter=0 }  /* atomic_t */
     start_flush = 121249416090188
     last_flush = 121249416090188
     flush_work = {
         data = { counter=2176 }  /* atomic_t */
         entry = {next = 0xffff8898a48e3730, prev = 0xffff8898a48e3730}    /* empty */
         func = (work_func_t) 0xffffffff819aca80 <md_submit_flush_data>
     }
     event_work = {
         data = { counter=0 }  /* atomic_t */
         entry = {next = 0x0000000000000000, prev = 0x0000000000000000}
         func = 0x0
     }
     serial_info_pool = 0x0
     sync_super = 0x0
     cluster_info = 0x0
     good_device_nr = 0
     noio_flag = 0
     has_superblocks = true_21_
     fail_last_dev = false_21_
     serialize_policy = false_21_
}

$L_004: (struct r5conf *) 0xffff88988695a800 : {
     stripe_hashtbl = (struct hlist_head *) 0xffff8898a1cb9000 /* --> $L_006 */
     hash_locks = [
(23)
         [0] : {counter=0} /* spinlock_t */
         [1] : {counter=0} /* spinlock_t */
         [2] : {counter=0} /* spinlock_t */
         [3] : {counter=0} /* spinlock_t */
         [4] : {counter=0} /* spinlock_t */
         [5] : {counter=0} /* spinlock_t */
         [6] : {counter=0} /* spinlock_t */
         [7] : {counter=0} /* spinlock_t */
     ]
     mddev = (struct mddev *) 0xffff8898a48e3000
     chunk_sectors = 1024
     level = 6
     algorithm = 2
     rmw_level = 1
     max_degraded = 2
     raid_disks = 16
     max_nr_stripes = 29816
     min_nr_stripes = 256
     reshape_progress = 18446744073709551615
     reshape_safe = 0
     previous_raid_disks = 16
     prev_chunk_sectors = 1024
     prev_algo = 2
     generation = 0
     gen_lock = {
         seqcount = {
             sequence = 0
         }
     }
     reshape_checkpoint = 0
     min_offset_diff = 0
     handle_list = {next = 0xffff888a7893c890, prev = 0xffff8889de3cc7d0}
     loprio_list = {next = 0xffff88988695a898, prev = 0xffff88988695a898}    /* empty */
     hold_list = {next = 0xffff88988695a8a8, prev = 0xffff88988695a8a8}    /* empty */
     delayed_list = {next = 0xffff88988695a8b8, prev = 0xffff88988695a8b8}    /* empty */
     bitmap_list = {next = 0xffff88988695a8c8, prev = 0xffff88988695a8c8}    /* empty */
     retry_read_aligned = 0x0
     retry_read_offset = 0
     retry_read_aligned_list = 0x0
     preread_active_stripes = { counter=0 }  /* atomic_t */
     active_aligned_reads = { counter=0 }  /* atomic_t */
     pending_full_writes = { counter=0 }  /* atomic_t */
     bypass_count = 0
     bypass_threshold = 1
     skip_copy = 0
     last_hold = (struct list_head *) 0xffff88a299ea0010
     reshape_stripes = { counter=0 }  /* atomic_t */
     active_name = 0
     cache_name = [
(2)
         [0] : [
(8)
             [0] : 114 'r'
             [1] : 97 'a'
             [2] : 105 'i'
             [3] : 100 'd'
             [4] : 54 '6'
             [5] : 45 '-'
             [6] : 109 'm'
             [7] : 100 'd'
             [8] : 49 '1'
             [9] : 0 '\000'
             [10] : 0 '\000'
             [11] : 0 '\000'
             [12] : 0 '\000'
             [13] : 0 '\000'
             [14] : 0 '\000'
             [15] : 0 '\000'
             [16] : 0 '\000'
             [17] : 0 '\000'
             [18] : 0 '\000'
             [19] : 0 '\000'
             [20] : 0 '\000'
             [21] : 0 '\000'
             [22] : 0 '\000'
             [23] : 0 '\000'
             [24] : 0 '\000'
             [25] : 0 '\000'
             [26] : 0 '\000'
             [27] : 0 '\000'
             [28] : 0 '\000'
             [29] : 0 '\000'
             [30] : 0 '\000'
             [31] : 0 '\000'
         ]
         [1] : [
(8)
             [0] : 114 'r'
             [1] : 97 'a'
             [2] : 105 'i'
             [3] : 100 'd'
             [4] : 54 '6'
             [5] : 45 '-'
             [6] : 109 'm'
             [7] : 100 'd'
             [8] : 49 '1'
             [9] : 45 '-'
             [10] : 97 'a'
             [11] : 108 'l'
             [12] : 116 't'
             [13] : 0 '\000'
             [14] : 0 '\000'
             [15] : 0 '\000'
             [16] : 0 '\000'
             [17] : 0 '\000'
             [18] : 0 '\000'
             [19] : 0 '\000'
             [20] : 0 '\000'
             [21] : 0 '\000'
             [22] : 0 '\000'
             [23] : 0 '\000'
             [24] : 0 '\000'
             [25] : 0 '\000'
             [26] : 0 '\000'
             [27] : 0 '\000'
             [28] : 0 '\000'
             [29] : 0 '\000'
             [30] : 0 '\000'
             [31] : 0 '\000'
         ]
     ]
     slab_cache = (struct kmem_cache *) 0xffff8881078eca80
     cache_size_mutex = {
         owner = { counter=0 }  /* atomic_t */
         wait_lock = {counter=0} /* spinlock_t */
         osq = {
             tail = { counter=0 }  /* atomic_t */
         }
         wait_list = {next = 0xffff88988695a970, prev = 0xffff88988695a970}    /* empty */
     }
     seq_flush = 5031591
     seq_write = 5031591
     quiesce = 0
     fullsync = 0
     recovery_disabled = 15
     percpu = (struct raid5_percpu *) 0x604fdee147e8
     scribble_disks = 16
     scribble_sectors = 1024
     node = {
         next = 0x0
         pprev = (struct hlist_node **) 0xffff88810a5b89a8
     }
     active_stripes = { counter=27534 }  /* atomic_t */
     inactive_list = [
(3)
         [0] : {next = 0xffff88990fd76710, prev = 0xffff8889b7f72550}
         [1] : {next = 0xffff88959af002d0, prev = 0xffff889958cd4590}
         [2] : {next = 0xffff88a60a070150, prev = 0xffff88a37d3d0850}
         [3] : {next = 0xffff889e211e0690, prev = 0xffff8893a4000110}
         [4] : {next = 0xffff888c2139a350, prev = 0xffff88901a188410}
         [5] : {next = 0xffff88a83413e3d0, prev = 0xffff88a2090bc350}
         [6] : {next = 0xffff88a720d54310, prev = 0xffff888a0da6c5d0}
         [7] : {next = 0xffff888a0412c650, prev = 0xffff889250596550}
     ]
     r5c_cached_full_stripes = { counter=0 }  /* atomic_t */
     r5c_full_stripe_list = {next = 0xffff88988695aa48, prev = 0xffff88988695aa48}    /* empty */
     r5c_cached_partial_stripes = { counter=0 }  /* atomic_t */
     r5c_partial_stripe_list = {next = 0xffff88988695aa60, prev = 0xffff88988695aa60}    /* empty */
     r5c_flushing_full_stripes = { counter=0 }  /* atomic_t */
     r5c_flushing_partial_stripes = { counter=0 }  /* atomic_t */
     empty_inactive_list_nr = { counter=0 }  /* atomic_t */
     released_stripes = {
         first = 0x0
     }
     wait_for_quiescent = {
         lock = {counter=0} /* spinlock_t */
         head = {next = 0xffff88988695aa90, prev = 0xffff88988695aa90}    /* empty */
     }
     wait_for_stripe = {
         lock = {counter=0} /* spinlock_t */
         head = {next = 0xffffc9000e3f7c78, prev = 0xffffc9000e3f7c78}
     }
     wait_for_overlap = {
         lock = {counter=0} /* spinlock_t */
         head = {next = 0xffff88988695aac0, prev = 0xffff88988695aac0}    /* empty */
     }
     cache_state = 3
     shrinker = {
         count_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198bc60 <raid5_cache_count>
         scan_objects = (long unsigned int (*)(struct shrinker *, struct shrink_control *)) 0xffffffff8198de90 <raid5_cache_scan>
         batch = 128
         seeks = 128
         flags = 0
         list = {next = 0xffff8898867e8c30, prev = 0xffffffffa0e14520}
         id = 0
         nr_deferred = (atomic_long_t *) 0xffff88988d8955c0
     }
     pool_size = 16
     device_lock = {counter=0} /* spinlock_t */
     disks = (struct disk_info *) 0xffff8898b1df3800
     bio_split = {
         bio_slab = (struct kmem_cache *) 0xffff889886fbdc00
         front_pad = 0
         bio_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 2
             curr_nr = 2
             elements = (void **) 0xffff8898b4fae280
             pool_data = (void *) 0xffff889886fbdc00
             alloc = (mempool_alloc_t *) 0xffffffff811c8540 <mempool_alloc_slab>
             free = (mempool_free_t *) 0xffffffff811c8560 <mempool_free_slab>
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0xffff88988695ab70, prev = 0xffff88988695ab70}    /* empty */
             }
         }
         bvec_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         bio_integrity_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         bvec_integrity_pool = {
             lock = {counter=0} /* spinlock_t */
             min_nr = 0
             curr_nr = 0
             elements = 0x0
             pool_data = 0x0
             alloc = 0x0
             free = 0x0
             wait = {
                 lock = {counter=0} /* spinlock_t */
                 head = {next = 0x0000000000000000, prev = 0x0000000000000000}
             }
         }
         rescue_lock = {counter=0} /* spinlock_t */
         rescue_list = {
             head = 0x0
             tail = 0x0
         }
         rescue_work = {
             data = { counter=68719476704 }  /* atomic_t */
             entry = {next = 0xffff88988695ac78, prev = 0xffff88988695ac78}    /* empty */
             func = (work_func_t) 0xffffffff81430230 <bio_alloc_rescue>
         }
         rescue_workqueue = 0x0
     }
     thread = 0x0
     temp_inactive_list = [
(3)
         [0] : {next = 0xffff88988695aca0, prev = 0xffff88988695aca0}    /* empty */
         [1] : {next = 0xffff88988695acb0, prev = 0xffff88988695acb0}    /* empty */
         [2] : {next = 0xffff88988695acc0, prev = 0xffff88988695acc0}    /* empty */
         [3] : {next = 0xffff88988695acd0, prev = 0xffff88988695acd0}    /* empty */
         [4] : {next = 0xffff88988695ace0, prev = 0xffff88988695ace0}    /* empty */
         [5] : {next = 0xffff88988695acf0, prev = 0xffff88988695acf0}    /* empty */
         [6] : {next = 0xffff88988695ad00, prev = 0xffff88988695ad00}    /* empty */
         [7] : {next = 0xffff88988695ad10, prev = 0xffff88988695ad10}    /* empty */
     ]
     worker_groups = 0x0
     group_cnt = 0
     worker_cnt_per_group = 0
     log = 0x0
     log_private = 0x0
     pending_bios_lock = {counter=0} /* spinlock_t */
     batch_bio_dispatch = true_21_
     pending_data = (struct r5pending_data *) 0xffff88988fae0000
     free_list = {next = 0xffff88988fae4fd8, prev = 0xffff88988fae0000}
     pending_list = {next = 0xffff88988695ad60, prev = 0xffff88988695ad60}    /* empty */
     pending_data_cnt = 0
     next_pending_data = 0x0
}


set $sh = container_of ((void *)0xffff88a57f1e6550, struct stripe_head, lru)

(gdb) dump *$sh

$L_010: (struct stripe_head *) 0xffff88a57f1e6540 : {
     hash = {
         next = (struct hlist_node *) 0xffff888768f7c680
         pprev = (struct hlist_node **) 0xffff8892f0520080
     }
     lru = {next = 0xffff88945e034510, prev = 0xffff8891d8f04210}
     release_list = {
         next = (struct llist_node *) 0xffff88945e034520
     }
     raid_conf = (struct r5conf *) 0xffff88988695a800
     generation = 0
     sector = 4947068032
     pd_idx = 14
     qd_idx = 15
     ddf_layout = 0
     hash_lock_index = 0
     state = 8242                                           # 0x2032  HANDLE + INSYNC + REPLACED + STARTED
     count = { counter=0 }  /* atomic_t */
     bm_seq = 5030861
     disks = 16
     overwrite_disks = 0
     check_state = check_state_idle_5_
     reconstruct_state = reconstruct_state_idle_5_
     stripe_lock = {counter=0} /* spinlock_t */
     cpu = 17
     group = 0x0
     batch_head = 0x0                                               # no batch
     batch_lock = {counter=0} /* spinlock_t */
     batch_list = {next = 0xffff88a57f1e65c8, prev = 0xffff88a57f1e65c8}    /* empty */
     /* anon union*/ {
         log_io = 0x0
         ppl_io = 0x0
     }
     log_list = {next = 0xffff88a57f1e65e0, prev = 0xffff88a57f1e65e0}    /* empty */
     log_start = 18446744073709551615
     r5c = {next = 0xffff88a57f1e65f8, prev = 0xffff88a57f1e65f8}    /* empty */
     ppl_page = 0x0
     ops = {
         target = 0
         target2 = 0
         zero_sum_result = 0_5_
     }


(gdb) dump $sh->dev[0]

$L_011: (struct r5dev *) 0xffff88a57f1e6620 : {
     req = {
         bi_next = 0x0
         bi_disk = 0x0
         bi_opf = 0
         bi_flags = 0
         bi_ioprio = 0
         bi_write_hint = 0
         bi_status = 0 '\000'
         bi_partno = 0 '\000'
         __bi_remaining = { counter=1 }  /* atomic_t */
         bi_iter = {
             bi_sector = 0
             bi_size = 0
             bi_idx = 0
             bi_bvec_done = 0
         }
         bi_end_io = 0x0
         bi_private = 0x0
         bi_blkg = 0x0
         bi_issue = {
             value = 0
         }
         /* anon union*/ {
             bi_integrity = 0x0
         }
         bi_vcnt = 0
         bi_max_vecs = 1
         __bi_cnt = { counter=1 }  /* atomic_t */
         bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6710
         bi_pool = 0x0
         bi_inline_vecs = [
         ]
     }
     rreq = {
         bi_next = 0x0
         bi_disk = 0x0
         bi_opf = 0
         bi_flags = 0
         bi_ioprio = 0
         bi_write_hint = 0
         bi_status = 0 '\000'
         bi_partno = 0 '\000'
         __bi_remaining = { counter=1 }  /* atomic_t */
         bi_iter = {
             bi_sector = 0
             bi_size = 0
             bi_idx = 0
             bi_bvec_done = 0
         }
         bi_end_io = 0x0
         bi_private = 0x0
         bi_blkg = 0x0
         bi_issue = {
             value = 0
         }
         /* anon union*/ {
             bi_integrity = 0x0
         }
         bi_vcnt = 0
         bi_max_vecs = 1
         __bi_cnt = { counter=1 }  /* atomic_t */
         bi_io_vec = (struct bio_vec *) 0xffff88a57f1e6720
         bi_pool = 0x0
         bi_inline_vecs = [
         ]
     }
     vec = {
         bv_page = (struct page *) 0xffffea008ffccc80
         bv_len = 4096
         bv_offset = 0
     }
     rvec = {
         bv_page = 0x0
         bv_len = 0
         bv_offset = 0
     }
     page = (struct page *) 0xffffea008ffccc80
     orig_page = (struct page *) 0xffffea008ffccc80
     offset = 0
     toread = 0x0
     read = 0x0
     towrite = 0x0
     written = 0x0
     sector = 69258950784
     flags = 17                                                         # 0x11  : R5_UPTODATE + R5_INSYNC
     log_checksum = 0
     write_hint = 0
}

$sh->dev[1] .. $sh->dev[15] omitted. All the same, just different page buffers and sector numbers.

$sh->dev[14] and $sh->dev[15] have sector = 0.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-19 11:30             ` Donald Buczek
@ 2021-01-20 16:33               ` Guoqing Jiang
  2021-01-23 13:04                 ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-01-20 16:33 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
> 
> I've reported a problem in this thread in December:
> 
> "We are using raid6 on several servers. Occasionally we had failures, 
> where a mdX_raid6 process seems to go into a busy loop and all I/O to 
> the md device blocks. We've seen this on various kernel versions." It 
> was clear, that this is related to "mdcheck" running, because we found, 
> that the command, which should stop the scrubbing in the morning (`echo 
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
> 
> On 12/21/20, I've reported, that the problem might be caused by a 
> failure of the underlying block device driver, because I've found 
> "inflight" counters of the block devices not being zero. However, this 
> is not the case. We were able to run into the mdX_raid6 looping 
> condition a few times again, but the non-zero inflight counters have not 
> been observed again.
> 
> I was able to collect a lot of additional information from a blocked 
> system.
> 
> - The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command 
> is waiting at kthread_stop to stop the sync thread. [ 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
> 
> - The sync thread ("md1_resync") does not finish, because its blocked at
> 
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
> 
> - yes, gdb confirms that `conf->cache_state` is 0x03 ( 
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

The resync thread is blocked since it can't get a sh from the inactive 
list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set; that is why 
`echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop 
the sync thread.

> 
> - We have lots of active stripes:
> 
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat 
> /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait until it becomes true; either 
max_nr_stripes has to increase or active_stripes has to decrease.

1. Increase max_nr_stripes
since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE 
is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could 
alloc new stripe in grow_one_stripe and increase max_nr_stripes. So 
please check the memory usage of your system.

Another thing is you can try to increase the number of sh manually by 
write new number to stripe_cache_size if there is enough memory.

2. Or decrease active_stripes
This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
> 
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
> 
> In dmesg, we see the debug output from 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 
> but never from 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
> 
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> 
> The sector numbers repeat after some time. We have only the following 
> variants of stripe state and "check":
> 
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, 
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, 
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, 
> check_state_idle
> 
> - We have MD_SB_CHANGE_PENDING set:

because the MD_SB_CHANGE_PENDING flag is set. So do_release_stripe can't 
call atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared by md_update_sb, which could be 
called by md_reap_sync_thread. But md_reap_sync_thread is stuck 
unregistering the sync_thread (which is blocked in 
raid5_get_active_stripe).
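
For reference, the relevant logic in do_release_stripe looks roughly
like this (a simplified sketch of drivers/md/raid5.c, not the verbatim
code; journal and worker-group handling omitted):

	if (test_bit(STRIPE_HANDLE, &sh->state)) {
		/* sh still needs handling: it goes back onto one of the
		 * busy lists (delayed_list, bitmap_list or handle_list)
		 * and raid5d is woken again; active_stripes is NOT
		 * decremented on this path */
		list_add_tail(&sh->lru, &conf->handle_list);
		md_wakeup_thread(conf->mddev->thread);
	} else {
		/* only stripes that need no further handling leave the
		 * active set and move towards the inactive list, which
		 * is what the resync thread is waiting for */
		atomic_dec(&conf->active_stripes);
		list_add_tail(&sh->lru, temp_inactive_list);
	}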


Still, I don't understand well why mdX_raid6 is in a busy loop; maybe 
raid5d can't break from the while(1) loop because "if (!batch_size && 
!released)" is false. I would assume released is '0' since
 >      released_stripes = {
 >          first = 0x0
 >      }
And __get_priority_stripe fetches the sh from conf->handle_list and
deletes it from handle_list, handle_stripe marks the state of the sh 
with STRIPE_HANDLE, then do_release_stripe puts the sh back onto 
handle_list. So batch_size can't be '0', and handle_active_stripes 
repeats the process over and over. This is my best guess to explain 
the busy loop situation.
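
Roughly, the loop I mean is this (a simplified sketch of raid5d() and
of the early bail-out in handle_stripe() from drivers/md/raid5.c, not
the verbatim code):

	/* handle_stripe(): gives up while a superblock write is pending,
	 * but leaves STRIPE_HANDLE set, so do_release_stripe() puts the
	 * sh right back onto handle_list */
	if (s.handle_bad_blocks ||
	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
		set_bit(STRIPE_HANDLE, &sh->state);
		goto finish;
	}

	/* raid5d(): as long as handle_active_stripes() keeps finding
	 * stripes on handle_list, batch_size stays > 0 and the loop
	 * never breaks */
	while (1) {
		released = release_stripe_list(conf, conf->temp_inactive_list);
		...
		batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
						   conf->temp_inactive_list);
		if (!batch_size && !released)
			break;
		handled += batch_size;
		...
	}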

> 
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat 
> /sys/block/md1/md/array_state
> write-pending
> 
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check whether there 
are bad blocks on the underlying devices (sd*)?

> 
> So it can be assumed that handle_stripe breaks out at 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
> 
> - The system can manually be freed from the deadlock:
> 
> When `echo active > /sys/block/md1/md/array_state` is used, the 
> scrubbing and other I/O continue. Probably because of 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems clearing SB_CHANGE_PENDING did the trick, so the blocked 
processes can make progress.

> 
> I, of course, don't fully understand it yet. Any ideas?
> 
> I append some data from a hanging raid... (mddev, r5conf and a sample 
> stripe_head from handle_list with it first disks)

These data did help the investigation!

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-20 16:33               ` Guoqing Jiang
@ 2021-01-23 13:04                 ` Donald Buczek
  2021-01-25  8:54                   ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-23 13:04 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 20.01.21 17:33, Guoqing Jiang wrote:
> Hi Donald,
> 
> On 1/19/21 12:30, Donald Buczek wrote:
>> Dear md-raid people,
>>
>> I've reported a problem in this thread in December:
>>
>> "We are using raid6 on several servers. Occasionally we had failures, where a mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear, that this is related to "mdcheck" running, because we found, that the command, which should stop the scrubbing in the morning (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
>>
>> On 12/21/20, I've reported, that the problem might be caused by a failure of the underlying block device driver, because I've found "inflight" counters of the block devices not being zero. However, this is not the case. We were able to run into the mdX_raid6 looping condition a few times again, but the non-zero inflight counters have not been observed again.
>>
>> I was able to collect a lot of additional information from a blocked system.
>>
>> - The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command is waiting at kthread_stop to stop the sync thread. [ https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>>
>> - The sync thread ("md1_resync") does not finish, because its blocked at
>>
>> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
>> [<0>] raid5_sync_request+0x364/0x390
>> [<0>] md_do_sync+0xb41/0x1030
>> [<0>] md_thread+0x122/0x160
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>>
>> - yes, gdb confirms that `conf->cache_state` is 0x03 ( R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )
> 
> The resync thread is blocked since it can't get sh from inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set, that is why `echo idle > /sys/devices/virtual/block/md1/md/sync_action` can't stop sync thread.
> 
>>
>> - We have lots of active stripes:
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
>> 27534
> 
> There are too many active stripes, so this is false:
> 
> atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)
> 
> so raid5_get_active_stripe has to wait till it becomes true, either increase max_nr_stripes or decrease active_stripes.
> 
> 1. Increase max_nr_stripes
> since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could alloc new stripe in grow_one_stripe and increase max_nr_stripes. So please check the memory usage of your system.
> 
> Another thing is you can try to increase the number of sh manually by write new number to stripe_cache_size if there is enough memory.
> 
> 2. Or decrease active_stripes
> This is suppose to be done by do_release_stripe, but STRIPE_HANDLE is set
> 
>> - handle_stripe() doesn't make progress:
>>
>> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>>
>> In dmesg, we see the debug output from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 but never from https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>>
>> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>>                  , check:4, reconstruct:0
>>
>> The sector numbers repeat after some time. We have only the following variants of stripe state and "check":
>>
>> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
>> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
>> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>>
>> - We have MD_SB_CHANGE_PENDING set:
> 
> because MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call atomic_dec(&conf->active_stripes).
> 
> Normally, SB_CHANGE_PENDING is cleared from md_update_sb which could be called by md_reap_sync_thread. But md_reap_sync_thread stuck with unregister sync_thread (it was blocked in raid5_get_active_stripe).
> 
> 
> Still I don't understand well why mdX_raid6 in a busy loop, maybe raid5d can't break from the while(1) loop because "if (!batch_size && !released)" is false. I would assume released is '0' since
>  >      released_stripes = {
>  >          first = 0x0
>  >      }
> And __get_priority_stripe fetches sh from conf->handle_list and delete
> it from handle_list, handle_stripe marks the state of the sh with STRIPE_HANDLE, then do_release_stripe put the sh back to handle_list.
> So batch_size can't be '0', and handle_active_stripes in the loop
> repeats the process in the busy loop. This is my best guess to explain the busy loop situation.
> 
>>
>> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
>> write-pending
>>
>> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)
> 
> since rdev_set_badblocks could set them, could you check if there is badblock of underlying device (sd*)?

The problem is that we are still not able to run a system into the deadlocked state with a repeatable procedure, so I was not yet able to explicitly check for bad blocks.

However, on the production systems which locked up and on the test systems which we managed to lock up, there was always some file system activity on the affected devices aside from mdcheck itself. This alone would explain any clean->write-pending transitions and might be a required condition for the problem to happen. I'm currently trying to exercise this path explicitly (by write/fsync()/sleep), with no result yet.

Btw: When you monitor md/stripe_cache_active during an ongoing "check" on an otherwise idle system, you see a variety of values up to a certain maximum value, which is visible most of the time. But this maximum value seems to increase continuously. Also, when you start the check at higher blocks (via md/sync_min) right away, the maximum value seems to be higher. Can this be explained? A bigger seek gap between the superblocks and the active data area? But should that increase the number of active stripes?

Best
   Donald

>> So it can be assumed that handle_stripe breaks out at https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>>
>> - The system can manually be freed from the deadlock:
>>
>> When `echo active > /sys/block/md1/md/array_state` is used, the scrubbing and other I/O continue. Probably because of https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520
> 
> Hmm, seems clear SB_CHANGE_PENDING made the trick, so the blocked process can make progress.
> 
>>
>> I, of course, don't fully understand it yet. Any ideas?
>>
>> I append some data from a hanging raid... (mddev, r5conf and a sample stripe_head from handle_list with it first disks)
> 
> These data did help for investigation!
> 
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-23 13:04                 ` Donald Buczek
@ 2021-01-25  8:54                   ` Donald Buczek
  2021-01-25 21:32                     ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-25  8:54 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

a colleague of mine was able to reproduce the issue inside a vm and was able to find a procedure that runs the vm into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem.

My current theory of what is happening is:

- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)

- Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`

- action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.

- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason raid5d is busy looping.

- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared (see the sketch below).

- (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
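
To illustrate the md_check_recovery() point, its superblock-update gating looks roughly like this (a simplified sketch based on the v5.4 sources, not the verbatim code):

```
void md_check_recovery(struct mddev *mddev)
{
	if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
		/* the thread that called mddev_suspend() holds the
		 * reconfig_mutex for us, so the superblock may be written */
		md_update_sb(mddev, 0);
		...
	}
	...
	if (mddev_trylock(mddev)) {
		...
		if (mddev->sb_flags)
			md_update_sb(mddev, 0);
		...
		mddev_unlock(mddev);
	}
	/* in our situation neither branch is taken: MD_ALLOW_SB_UPDATE is
	 * not set and action_store() holds the reconfig_mutex, so
	 * MD_SB_CHANGE_PENDING stays set */
}
```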

Does this make sense?

Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:

```
#! /bin/bash

(
         while true; do
                 echo "start check"
                 echo check > /sys/block/md3/md/sync_action
                 sleep 10
                 echo "stop check"
                 echo idle > /sys/block/md3/md/sync_action
                 sleep 2
         done
) &

(
         while true; do
                 dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
                 sync /mnt/raid_md3/bigfile
                 rm /mnt/raid_md3/bigfile
                 sleep .1
         done
) &

start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
         array_state=$(cat array_state)
         if [ "$array_state" = write-pending ]; then
                 wp_count=$(($wp_count+1))
         else
                 wp_count=0
         fi
         echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
         if [ $wp_count -ge 3 ]; then
                 kill -- -$$
                 exit
         fi
         sleep 1
done
```

The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical. The output ends like this:

     309 check 6283872 / 8378368 active-idle 4144
     310 check 6283872 / 8378368 active 1702
     311 check 6807528 / 8378368 active 4152
     312 check 7331184 / 8378368 clean 3021
     stop check
     313 check 7331184 / 8378368 write-pending 3905
     314 check 7331184 / 8378368 write-pending 3905
     315 check 7331184 / 8378368 write-pending 3905
     Terminated

If I add

     pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);

in raid5d after the call to handle_active_stripes and enable that debug location after the deadlock has occurred, I get

     [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
     [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
     [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
     [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6

If I add

     pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);

at the head of md_check_recovery, I get:

     [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
     [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
     [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
     [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
     [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006

More debug lines in md_check_recovery confirm the control flow (the `if (mddev_trylock(mddev))` block is not taken).

What approach would you suggest to fix this?

Best
   Donald



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-25  8:54                   ` Donald Buczek
@ 2021-01-25 21:32                     ` Donald Buczek
  2021-01-26  0:44                       ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-25 21:32 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid



On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
> 
> a colleague of mine was able to produce the issue inside a vm and were able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system as before). This of course helped to pinpoint the problem.
> 
> My current theory of what is happening is:
> 
> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
> 
> - Before raid5d gets to write the superblock (its busy processing active stripes because of the sync activity) , userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
> 
> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
> 
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
> 
> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock, if a change is pending. However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Both is not true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
> 
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
> 
> Does this make sense?
> 
> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
> 
> ```
> #! /bin/bash
> 
> (
>          while true; do
>                  echo "start check"
>                  echo check > /sys/block/md3/md/sync_action
>                  sleep 10
>                  echo "stop check"
>                  echo idle > /sys/block/md3/md/sync_action
>                  sleep 2
>          done
> ) &
> 
> (
>          while true; do
>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>                  sync /mnt/raid_md3/bigfile
>                  rm /mnt/raid_md3/bigfile
>                  sleep .1
>          done
> ) &
> 
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>          array_state=$(cat array_state)
>          if [ "$array_state" = write-pending ]; then
>                  wp_count=$(($wp_count+1))
>          else
>                  wp_count=0
>          fi
>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>          if [ $wp_count -ge 3 ]; then
>                  kill -- -$$
>                  exit
>          fi
>          sleep 1
> done
> ```
> 
> The time, this needs to trigger the issue, varies from under a minute to one hour with 5 minute being typical. The output ends like this:
> 
>      309 check 6283872 / 8378368 active-idle 4144
>      310 check 6283872 / 8378368 active 1702
>      311 check 6807528 / 8378368 active 4152
>      312 check 7331184 / 8378368 clean 3021
>      stop check
>      313 check 7331184 / 8378368 write-pending 3905
>      314 check 7331184 / 8378368 write-pending 3905
>      315 check 7331184 / 8378368 write-pending 3905
>      Terminated
> 
> If I add
> 
>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
> 
> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
> 
>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> 
> If I add
> 
>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
> 
> at the head of md_check_recovery, I get:
> 
>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> 
> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
> 
> What approach would you suggest to fix this?

I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
  		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
  		    mddev_lock(mddev) == 0) {
+			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
  			flush_workqueue(md_misc_wq);
  			if (mddev->sync_thread) {
  				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
  				md_reap_sync_thread(mddev);
  			}
+			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
  			mddev_unlock(mddev);
  		}
  	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-- 
2.30.0

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-25 21:32                     ` Donald Buczek
@ 2021-01-26  0:44                       ` Guoqing Jiang
  2021-01-26  9:50                         ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-01-26  0:44 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid


[-- Attachment #1: Type: text/plain, Size: 8583 bytes --]

Hi Donald,

On 1/25/21 22:32, Donald Buczek wrote:
> 
> 
> On 25.01.21 09:54, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> a colleague of mine was able to produce the issue inside a vm and were 
>> able to find a procedure to run the vm into the issue within minutes 
>> (not unreliably after hours on a physical system as before). This of 
>> course helped to pinpoint the problem.
>>
>> My current theory of what is happening is:
>>
>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by 
>> md_write_start() when file-system I/O wants to do a write and the 
>> array transitions from "clean" to "active". 
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>
>> - Before raid5d gets to write the superblock (its busy processing 
>> active stripes because of the sync activity) , userspace wants to 
>> pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>
>> - action_store() takes the reconfig_mutex before trying to stop the 
>> sync thread. 
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) Dump 
>> of struct mddev of email 1/19/21 confirms reconf_mutex non-zero.
>>
>> - raid5d is running in its main loop. 
>> raid5d()->handle_active_stripes() returns a positive batch size ( 
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 
>> ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't 
>> process any stripe because of MD_SB_CHANGE_PENDING. 
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 
>> ). This is the reason, raid5d is busy looping.
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One 
>> of its duties is to write the superblock, if a change is pending. 
>> However to do so, it needs either MD_ALLOW_SB_UPDATE or must be able 
>> to take the reconfig_mutex. 
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 
>> , 
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) 
>> Both is not true, so the superblock is not written and 
>> MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number 
>> of active stripes to go down and doesn't terminate. The userspace 
>> thread is waiting for the sync thread to terminate.
>>
>> Does this make sense?

Yes, exactly! That was my thought too, the scenario is twisted.

Then the resync thread is blocked because there are too many active 
stripes; raid5d is in a busy loop since SB_CHANGE_PENDING is set, which 
means active stripes can't be decreased. And `echo idle` can't make 
progress given that the resync thread is blocked, while the command 
still holds the reconfig_mutex, which keeps raid5d in the busy loop and 
unable to clear the SB_CHANGE_PENDING flag.

And raid5 could suffer from the same issue I think.

>>
>> Just for reference, I add the procedure which triggers the issue on 
>> the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #! /bin/bash
>>
>> (
>>          while true; do
>>                  echo "start check"
>>                  echo check > /sys/block/md3/md/sync_action
>>                  sleep 10
>>                  echo "stop check"
>>                  echo idle > /sys/block/md3/md/sync_action
>>                  sleep 2
>>          done
>> ) &
>>
>> (
>>          while true; do
>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero 
>> of=/mnt/raid_md3/bigfile status=none
>>                  sync /mnt/raid_md3/bigfile
>>                  rm /mnt/raid_md3/bigfile
>>                  sleep .1
>>          done
>> ) &
>>
>> start="$(date +%s)"
>> cd /sys/block/md3/md
>> wp_count=0
>> while true; do
>>          array_state=$(cat array_state)
>>          if [ "$array_state" = write-pending ]; then
>>                  wp_count=$(($wp_count+1))
>>          else
>>                  wp_count=0
>>          fi
>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat 
>> sync_completed) $array_state $(cat stripe_cache_active)
>>          if [ $wp_count -ge 3 ]; then
>>                  kill -- -$$
>>                  exit
>>          fi
>>          sleep 1
>> done
>> ```
>>
>> The time, this needs to trigger the issue, varies from under a minute 
>> to one hour with 5 minute being typical. The output ends like this:
>>
>>      309 check 6283872 / 8378368 active-idle 4144
>>      310 check 6283872 / 8378368 active 1702
>>      311 check 6807528 / 8378368 active 4152
>>      312 check 7331184 / 8378368 clean 3021
>>      stop check
>>      313 check 7331184 / 8378368 write-pending 3905
>>      314 check 7331184 / 8378368 write-pending 3905
>>      315 check 7331184 / 8378368 write-pending 3905
>>      Terminated
>>
>> If I add
>>
>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", 
>> batch_size, released, mddev->sb_flags);
>>
>> in raid5d after the call to handle_active_stripes and enable the debug 
>> location after the deadlock occurred, I get
>>
>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 
>> mdddev->sb_flags 6
>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 
>> mdddev->sb_flags 6
>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 
>> mdddev->sb_flags 6
>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 
>> mdddev->sb_flags 6
>>
>> If I add
>>
>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags 
>> %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>
>> at the head of md_check_recovery, I get:
>>
>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 
>> drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>
>> More debug lines in md_check_recovery confirm the control flow ( `if 
>> (mddev_trylock(mddev))` block not taken )
>>

It is great that you have a reproducer now!

>> What approach would you suggest to fix this?
> 
> I naively tried the following patch and it seems to fix the problem. The 
> test procedure didn't trigger the deadlock in 10 hours.
> 
> D.
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char 
> *page, size_t len)
>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>               mddev_lock(mddev) == 0) {
> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>               flush_workqueue(md_misc_wq);
>               if (mddev->sync_thread) {
>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>                   md_reap_sync_thread(mddev);
>               }
> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>               mddev_unlock(mddev);
>           }
>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock issue, but I am not sure if it is the 
right way, given that we only set ALLOW_SB_UPDATE in suspend, which 
makes sense since the I/O will be quiesced, but writing "idle" to 
sync_action can't guarantee a similar thing. I prefer to make the 
resync thread not wait forever here.

wait_event_lock_irq(
	conf->wait_for_stripe,
	!list_empty(conf->inactive_list + hash) &&
	(atomic_read(&conf->active_stripes)
	 < (conf->max_nr_stripes * 3 / 4)
	 || !test_bit(R5_INACTIVE_BLOCKED,
		      &conf->cache_state)),
	*(conf->hash_locks + hash));

So, could you please try the attached?

Thanks,
Guoqing

[-- Attachment #2: raid5-proposal --]
[-- Type: text/plain, Size: 5862 bytes --]

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4337ae0..378ce5c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
 {
 	struct stripe_head *sh;
 
-	sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
+	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
 	if (!sh)
 		return NULL;  /* no more stripe available */
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8888973..33a2a22 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
 }
 
 struct stripe_head *
-raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
+raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
 			int previous, int noblock, int noquiesce)
 {
 	struct stripe_head *sh;
+	struct mddev *mddev = conf->mddev;
 	int hash = stripe_hash_locks_hash(conf, sector);
 	int inc_empty_inactive_list_flag;
 
@@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 					(atomic_read(&conf->active_stripes)
 					 < (conf->max_nr_stripes * 3 / 4)
 					 || !test_bit(R5_INACTIVE_BLOCKED,
-						      &conf->cache_state)),
+						      &conf->cache_state)
+					 || (test_bit(MD_RECOVERY_INTR,
+						      &mddev->recovery)
+					     && sync_req)),
 					*(conf->hash_locks + hash));
+				if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
+				    && sync_req)
+					break;
 				clear_bit(R5_INACTIVE_BLOCKED,
 					  &conf->cache_state);
 			} else {
@@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
 			sector_t bn = raid5_compute_blocknr(sh, i, 1);
 			sector_t s = raid5_compute_sector(conf, bn, 0,
 							  &dd_idx, NULL);
-			sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
+			sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
 				 * have been requested.  When later blocks
@@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
 	/* Finish reconstruct operations initiated by the expansion process */
 	if (sh->reconstruct_state == reconstruct_state_result) {
 		struct stripe_head *sh_src
-			= raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
+			= raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
 		if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
 			/* sh cannot be written until sh_src has been read.
 			 * so arrange for sh to be delayed a little
@@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 		DEFINE_WAIT(w);
 		int d;
 	again:
-		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
+		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
 		prepare_to_wait(&conf->wait_for_overlap, &w,
 				TASK_UNINTERRUPTIBLE);
 		set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
@@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			(unsigned long long)new_sector,
 			(unsigned long long)logical_sector);
 
-		sh = raid5_get_active_stripe(conf, new_sector, previous,
+		sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
 				       (bi->bi_opf & REQ_RAHEAD), 0);
 		if (sh) {
 			if (unlikely(previous)) {
@@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
 		int j;
 		int skipped_disk = 0;
-		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
+		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
 		set_bit(STRIPE_EXPANDING, &sh->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
@@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
-		sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
+		sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
 		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
@@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
 
 	md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
 
-	sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
+	sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
 	if (sh == NULL) {
-		sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
+		sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
+		if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
+			*skipped = 1;
+			return 0;
+		}
+
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access
 		 */
@@ -6334,7 +6346,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
 			/* already done this stripe */
 			continue;
 
-		sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
+		sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 5c05acf..d9eab46 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
 				     int previous, int *dd_idx,
 				     struct stripe_head *sh);
 extern struct stripe_head *
-raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
+raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
 			int previous, int noblock, int noquiesce);
 extern int raid5_calc_degraded(struct r5conf *conf);
 extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-26  0:44                       ` Guoqing Jiang
@ 2021-01-26  9:50                         ` Donald Buczek
  2021-01-26 11:14                           ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-26  9:50 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 26.01.21 01:44, Guoqing Jiang wrote:
> Hi Donald,
> 
> On 1/25/21 22:32, Donald Buczek wrote:
>>
>>
>> On 25.01.21 09:54, Donald Buczek wrote:
>>> Dear Guoqing,
>>>
>>> a colleague of mine was able to produce the issue inside a vm and was able to find a procedure to run the vm into the issue within minutes (not unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem.
>>>
>>> My current theory of what is happening is:
>>>
>>> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>>
>>> - Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check by `echo idle > /sys/block/mdX/md/sync_action`
>>>
>>> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.
>>>
>>> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size ( https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329 ) although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729 ). This is the reason, raid5d is busy looping.
>>>
>>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it needs either MD_ALLOW_SB_UPDATE or must be able to take the reconfig_mutex. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006) Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared.
>>>
>>> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>>>
>>> Does this make sense?
> 
> Yes, exactly! That was my thought too, the scenario is twisted.
> 
> Then the resync thread is blocked because there are too many active stripes: raid5d is in a busy loop since SB_CHANGE_PENDING is set, which means active stripes can't be decreased. And the echo idle cmd can't make progress given the resync thread is blocked, while the cmd still holds reconfig_mutex, which keeps raid5d in the busy loop and unable to clear the SB_CHANGE_PENDING flag.
> 
> And raid5 could suffer from the same issue I think.
> 
>>>
>>> Just for reference, I add the procedure which triggers the issue on the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>>
>>> ```
>>> #! /bin/bash
>>>
>>> (
>>>          while true; do
>>>                  echo "start check"
>>>                  echo check > /sys/block/md3/md/sync_action
>>>                  sleep 10
>>>                  echo "stop check"
>>>                  echo idle > /sys/block/md3/md/sync_action
>>>                  sleep 2
>>>          done
>>> ) &
>>>
>>> (
>>>          while true; do
>>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>>>                  sync /mnt/raid_md3/bigfile
>>>                  rm /mnt/raid_md3/bigfile
>>>                  sleep .1
>>>          done
>>> ) &
>>>
>>> start="$(date +%s)"
>>> cd /sys/block/md3/md
>>> wp_count=0
>>> while true; do
>>>          array_state=$(cat array_state)
>>>          if [ "$array_state" = write-pending ]; then
>>>                  wp_count=$(($wp_count+1))
>>>          else
>>>                  wp_count=0
>>>          fi
>>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>>>          if [ $wp_count -ge 3 ]; then
>>>                  kill -- -$$
>>>                  exit
>>>          fi
>>>          sleep 1
>>> done
>>> ```
>>>
>>> The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical. The output ends like this:
>>>
>>>      309 check 6283872 / 8378368 active-idle 4144
>>>      310 check 6283872 / 8378368 active 1702
>>>      311 check 6807528 / 8378368 active 4152
>>>      312 check 7331184 / 8378368 clean 3021
>>>      stop check
>>>      313 check 7331184 / 8378368 write-pending 3905
>>>      314 check 7331184 / 8378368 write-pending 3905
>>>      315 check 7331184 / 8378368 write-pending 3905
>>>      Terminated
>>>
>>> If I add
>>>
>>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>>>
>>> in raid5d after the call to handle_active_stripes and enable the debug location after the deadlock occurred, I get
>>>
>>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>>
>>> If I add
>>>
>>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>>
>>> at the head of md_check_recovery, I get:
>>>
>>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>>
>>> More debug lines in md_check_recovery confirm the control flow ( `if (mddev_trylock(mddev))` block not taken )
>>>
> 
> That is great that you have a reproducer now!
> 
>>> What approach would you suggest to fix this?
>>
>> I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.
>>
>> D.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 2d21c298ffa7..f40429843906 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>               mddev_lock(mddev) == 0) {
>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               flush_workqueue(md_misc_wq);
>>               if (mddev->sync_thread) {
>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>                   md_reap_sync_thread(mddev);
>>               }
>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>               mddev_unlock(mddev);
>>           }
>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
> 
> Yes, it could break the deadlock issue, but I am not sure it is the right way, given that we only set ALLOW_SB_UPDATE in suspend, which makes sense since the IO will be quiesced; the write idle action can't guarantee a similar thing.

Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.
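
A minimal sketch of how such a comment could look (assuming the flag keeps its current name and place in enum mddev_flags in drivers/md/md.h; just an illustration, not a patch):

    MD_ALLOW_SB_UPDATE,  /* The holder of reconfig_mutex promises not to
                          * make any change that would prevent the
                          * superblock from being written, so
                          * md_check_recovery() may update it without
                          * taking reconfig_mutex itself.
                          */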

> I prefer to make resync thread not wait forever here.
> 
> wait_event_lock_irq(
>      conf->wait_for_stripe,
>      !list_empty(conf->inactive_list + hash) &&
>      (atomic_read(&conf->active_stripes)
>      < (conf->max_nr_stripes * 3 / 4)
>      || !test_bit(R5_INACTIVE_BLOCKED,
>               &conf->cache_state)),
>      *(conf->hash_locks + hash));
> 
> So, could you please try the attached?
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 4337ae0..378ce5c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -1931,7 +1931,7 @@ r5c_recovery_alloc_stripe(
>  {
>  	struct stripe_head *sh;
>  
> -	sh = raid5_get_active_stripe(conf, stripe_sect, 0, noblock, 0);
> +	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, noblock, 0);
>  	if (!sh)
>  		return NULL;  /* no more stripe available */
>  
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8888973..33a2a22 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -700,10 +700,11 @@ static int has_failed(struct r5conf *conf)
>  }
>  
>  struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
>  			int previous, int noblock, int noquiesce)
>  {
>  	struct stripe_head *sh;
> +	struct mddev *mddev = conf->mddev;
>  	int hash = stripe_hash_locks_hash(conf, sector);
>  	int inc_empty_inactive_list_flag;
>  
> @@ -738,8 +739,14 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
>  					(atomic_read(&conf->active_stripes)
>  					 < (conf->max_nr_stripes * 3 / 4)
>  					 || !test_bit(R5_INACTIVE_BLOCKED,
> -						      &conf->cache_state)),
> +						      &conf->cache_state)
> +					 || (test_bit(MD_RECOVERY_INTR,
> +						      &mddev->recovery)
> +					     && sync_req)),
>  					*(conf->hash_locks + hash));
> +				if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)
> +				    && sync_req)
> +					break;
>  				clear_bit(R5_INACTIVE_BLOCKED,
>  					  &conf->cache_state);
>  			} else {
> @@ -4527,7 +4534,7 @@ static void handle_stripe_expansion(struct r5conf *conf, struct stripe_head *sh)
>  			sector_t bn = raid5_compute_blocknr(sh, i, 1);
>  			sector_t s = raid5_compute_sector(conf, bn, 0,
>  							  &dd_idx, NULL);
> -			sh2 = raid5_get_active_stripe(conf, s, 0, 1, 1);
> +			sh2 = raid5_get_active_stripe(conf, s, 0, 0, 1, 1);
>  			if (sh2 == NULL)
>  				/* so far only the early blocks of this stripe
>  				 * have been requested.  When later blocks
> @@ -5164,7 +5171,7 @@ static void handle_stripe(struct stripe_head *sh)
>  	/* Finish reconstruct operations initiated by the expansion process */
>  	if (sh->reconstruct_state == reconstruct_state_result) {
>  		struct stripe_head *sh_src
> -			= raid5_get_active_stripe(conf, sh->sector, 1, 1, 1);
> +			= raid5_get_active_stripe(conf, sh->sector, 0, 1, 1, 1);
>  		if (sh_src && test_bit(STRIPE_EXPAND_SOURCE, &sh_src->state)) {
>  			/* sh cannot be written until sh_src has been read.
>  			 * so arrange for sh to be delayed a little
> @@ -5705,7 +5712,7 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>  		DEFINE_WAIT(w);
>  		int d;
>  	again:
> -		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0);
> +		sh = raid5_get_active_stripe(conf, logical_sector, 0, 0, 0, 0);
>  		prepare_to_wait(&conf->wait_for_overlap, &w,
>  				TASK_UNINTERRUPTIBLE);
>  		set_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags);
> @@ -5861,7 +5868,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>  			(unsigned long long)new_sector,
>  			(unsigned long long)logical_sector);
>  
> -		sh = raid5_get_active_stripe(conf, new_sector, previous,
> +		sh = raid5_get_active_stripe(conf, new_sector, previous, 0,


Mistake here (fourth argument added instead of third)

>  				       (bi->bi_opf & REQ_RAHEAD), 0);
>  		if (sh) {
>  			if (unlikely(previous)) {
> @@ -6100,7 +6107,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>  	for (i = 0; i < reshape_sectors; i += RAID5_STRIPE_SECTORS(conf)) {
>  		int j;
>  		int skipped_disk = 0;
> -		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 1);
> +		sh = raid5_get_active_stripe(conf, stripe_addr+i, 0, 0, 0, 1);
>  		set_bit(STRIPE_EXPANDING, &sh->state);
>  		atomic_inc(&conf->reshape_stripes);
>  		/* If any of this stripe is beyond the end of the old
> @@ -6149,7 +6156,7 @@ static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr, int *sk
>  	if (last_sector >= mddev->dev_sectors)
>  		last_sector = mddev->dev_sectors - 1;
>  	while (first_sector <= last_sector) {
> -		sh = raid5_get_active_stripe(conf, first_sector, 1, 0, 1);
> +		sh = raid5_get_active_stripe(conf, first_sector, 0, 1, 0, 1);
>  		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		raid5_release_stripe(sh);
> @@ -6269,9 +6276,14 @@ static inline sector_t raid5_sync_request(struct mddev *mddev, sector_t sector_n
>  
>  	md_bitmap_cond_end_sync(mddev->bitmap, sector_nr, false);
>  
> -	sh = raid5_get_active_stripe(conf, sector_nr, 0, 1, 0);
> +	sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 1, 0);
>  	if (sh == NULL) {
> -		sh = raid5_get_active_stripe(conf, sector_nr, 0, 0, 0);
> +		sh = raid5_get_active_stripe(conf, sector_nr, 1, 0, 0, 0);
> +		if (!sh && test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> +			*skipped = 1;
> +			return 0;
> +		}
> +
>  		/* make sure we don't swamp the stripe cache if someone else
>  		 * is trying to get access
>  		 */
> @@ -6334,7 +6346,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio,
>  			/* already done this stripe */
>  			continue;
>  
> -		sh = raid5_get_active_stripe(conf, sector, 0, 1, 1);
> +		sh = raid5_get_active_stripe(conf, sector, 0, 0, 1, 1);
>  
>  		if (!sh) {
>  			/* failed to get a stripe - must wait */
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 5c05acf..d9eab46 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -806,7 +806,7 @@ extern sector_t raid5_compute_sector(struct r5conf *conf, sector_t r_sector,
>  				     int previous, int *dd_idx,
>  				     struct stripe_head *sh);
>  extern struct stripe_head *
> -raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
> +raid5_get_active_stripe(struct r5conf *conf, sector_t sector, int sync_req,
>  			int previous, int noblock, int noquiesce);
>  extern int raid5_calc_degraded(struct r5conf *conf);
>  extern int r5c_journal_mode_set(struct mddev *mddev, int journal_mode);

Unfortunately, this patch did not fix the issue.

     root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
     [<0>] raid5_get_active_stripe+0x1e7/0x6b0
     [<0>] raid5_sync_request+0x2a7/0x3d0
     [<0>] md_do_sync.cold+0x3ee/0x97c
     [<0>] md_thread+0xab/0x160
     [<0>] kthread+0x11b/0x140
     [<0>] ret_from_fork+0x22/0x30

which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
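
As an aside, the in-tree scripts/faddr2line helper should give the same mapping, assuming the object file was built with debug info:

    ./scripts/faddr2line drivers/md/raid5.o raid5_get_active_stripe+0x1e7/0x6b0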

Isn't it still the case that the superblocks are not written, therefore stripes are not processed, and therefore the number of active stripes is not decreasing? Who is expected to wake up the conf->wait_for_stripe waiters?

Best
   Donald

> 
> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-26  9:50                         ` Donald Buczek
@ 2021-01-26 11:14                           ` Guoqing Jiang
  2021-01-26 12:58                             ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-01-26 11:14 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 1/26/21 10:50, Donald Buczek wrote:
[...]

>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 2d21c298ffa7..f40429843906 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char 
>>> *page, size_t len)
>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>               mddev_lock(mddev) == 0) {
>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               flush_workqueue(md_misc_wq);
>>>               if (mddev->sync_thread) {
>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>                   md_reap_sync_thread(mddev);
>>>               }
>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>               mddev_unlock(mddev);
>>>           }
>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>
>> Yes, it could break the deadlock issue, but I am not sure if it is the 
>> right way given we only set ALLOW_SB_UPDATE in suspend which makes 
>> sense since the io will be quiesced, but write idle action can't 
>> guarantee the  similar thing.
> 
> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of 
> reconfig_mutex promises not to make any changes which would exclude 
> superblocks from being written" might make it easier to accept the usage.

I am not sure it is safe to set the flag here, since write idle can't 
prevent IO from the fs, while mddev_suspend can guarantee that.

> 
>> I prefer to make resync thread not wait forever here.
>>

[...]

>>
>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
> 
> 
> Mistake here (fourth argument added instead of third)

Thanks for checking.

[...]

> Unfortunately, this patch did not fix the issue.
> 
>      root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>      [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>      [<0>] raid5_sync_request+0x2a7/0x3d0
>      [<0>] md_do_sync.cold+0x3ee/0x97c
>      [<0>] md_thread+0xab/0x160
>      [<0>] kthread+0x11b/0x140
>      [<0>] ret_from_fork+0x22/0x30
> 
> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at 
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> 
> Isn't it still the case that the superblocks are not written, therefore 
> stripes are not processed, therefore number of active stripes are not 
> decreasing? Who is expected to wake up conf->wait_for_stripe waiters?

Hmm, how about wake the waiter up in the while loop of raid5d?

@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                         md_check_recovery(mddev);
                         spin_lock_irq(&conf->device_lock);
                 }
+
+               if ((atomic_read(&conf->active_stripes)
+                    < (conf->max_nr_stripes * 3 / 4) ||
+                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
+                       wake_up(&conf->wait_for_stripe);
         }
         pr_debug("%d stripes handled\n", handled);

If the issue still appears, then I will change the waiter to break out 
just if MD_RECOVERY_INTR is set, something like this:

wait_event_lock_irq(conf->wait_for_stripe,
	(test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
	 /* the previous condition */,
	*(conf->hash_locks + hash));
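
Spelled out with the original condition from raid5_get_active_stripe()
(assuming "the previous condition" means the unpatched check; shown only
for clarity):

wait_event_lock_irq(conf->wait_for_stripe,
	(test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
	(!list_empty(conf->inactive_list + hash) &&
	 (atomic_read(&conf->active_stripes)
	  < (conf->max_nr_stripes * 3 / 4)
	  || !test_bit(R5_INACTIVE_BLOCKED,
		       &conf->cache_state))),
	*(conf->hash_locks + hash));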

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-26 11:14                           ` Guoqing Jiang
@ 2021-01-26 12:58                             ` Donald Buczek
  2021-01-26 14:06                               ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-26 12:58 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid



On 26.01.21 12:14, Guoqing Jiang wrote:
> Hi Donald,
> 
> On 1/26/21 10:50, Donald Buczek wrote:
> [...]
> 
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 2d21c298ffa7..f40429843906 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>>>>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>>>>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>>>>               mddev_lock(mddev) == 0) {
>>>> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               flush_workqueue(md_misc_wq);
>>>>               if (mddev->sync_thread) {
>>>>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>>>                   md_reap_sync_thread(mddev);
>>>>               }
>>>> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>>>>               mddev_unlock(mddev);
>>>>           }
>>>>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>>>
>>> Yes, it could break the deadlock issue, but I am not sure if it is the right way given we only set ALLOW_SB_UPDATE in suspend which makes sense since the io will be quiesced, but write idle action can't guarantee the  similar thing.
>>
>> Thinking (and documenting) MD_ALLOW_SB_UPDATE as "the holder of reconfig_mutex promises not to make any changes which would exclude superblocks from being written" might make it easier to accept the usage.
> 
> I am not sure it is safe to set the flag here since write idle can't prevent IO from fs while mddev_suspend can guarantee that.
> 
>>
>>> I prefer to make resync thread not wait forever here.
>>>
> 
> [...]
> 
>>>
>>> -        sh = raid5_get_active_stripe(conf, new_sector, previous,
>>> +        sh = raid5_get_active_stripe(conf, new_sector, previous, 0,
>>
>>
>> Mistake here (fourth argument added instead of third)
> 
> Thanks for checking.
> 
> [...]
> 
>> Unfortunately, this patch did not fix the issue.
>>
>>      root@sloth:~/linux# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] raid5_get_active_stripe+0x1e7/0x6b0
>>      [<0>] raid5_sync_request+0x2a7/0x3d0
>>      [<0>] md_do_sync.cold+0x3ee/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> which is ( according to objdump -d -Sl drivers/md/raid5.o ) at https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>>
>> Isn't it still the case that the superblocks are not written, therefore stripes are not processed, therefore number of active stripes are not decreasing? Who is expected to wake up conf->wait_for_stripe waiters?
> 
> Hmm, how about wake the waiter up in the while loop of raid5d?
> 
> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>                          md_check_recovery(mddev);
>                          spin_lock_irq(&conf->device_lock);
>                  }
> +
> +               if ((atomic_read(&conf->active_stripes)
> +                    < (conf->max_nr_stripes * 3 / 4) ||
> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
> +                       wake_up(&conf->wait_for_stripe);
>          }
>          pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

     root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
     [<0>] md_do_sync.cold+0x8ec/0x97c
     [<0>] md_thread+0xab/0x160
     [<0>] kthread+0x11b/0x140
     [<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.

Best
   Donald

> If the issue still appears then I will change the waiter to break just if MD_RECOVERY_INTR is set, something like.
> 
> wait_event_lock_irq(conf->wait_for_stripe,
>      (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>       /* the previous condition */,
>      *(conf->hash_locks + hash));
> 
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-26 12:58                             ` Donald Buczek
@ 2021-01-26 14:06                               ` Guoqing Jiang
  2021-01-26 16:05                                 ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-01-26 14:06 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid



On 1/26/21 13:58, Donald Buczek wrote:
> 
> 
>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>
>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>                          md_check_recovery(mddev);
>>                          spin_lock_irq(&conf->device_lock);
>>                  }
>> +
>> +               if ((atomic_read(&conf->active_stripes)
>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>> +                       wake_up(&conf->wait_for_stripe);
>>          }
>>          pr_debug("%d stripes handled\n", handled);
> 
> Hmm... With this patch on top of your other one, we still have the basic 
> symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
> 
>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>      [<0>] md_do_sync.cold+0x8ec/0x97c
>      [<0>] md_thread+0xab/0x160
>      [<0>] kthread+0x11b/0x140
>      [<0>] ret_from_fork+0x22/0x30
> 
> instead, which is 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

Not sure why recovery_active is not zero, because it is set to 0 before 
blk_start_plug, and raid5_sync_request returns 0 and skipped is also set 
to 1. Perhaps handle_stripe calls md_done_sync.

Could you double check the value of recovery_active? Or just don't wait 
if resync thread is interrupted.

wait_event(mddev->recovery_wait,
	   test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
	   !atomic_read(&mddev->recovery_active));
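
To double check the value, a throwaway debug line right before that wait
could be something like (hypothetical instrumentation only, not part of
the proposed fix):

pr_debug("%s: recovery_active=%d recovery=%lx\n", mdname(mddev),
	 atomic_read(&mddev->recovery_active), mddev->recovery);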

> And, unlike before, "md: md3: data-check interrupted." from the pr_info 
> two lines above appears in dmesg.

Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.

Anyway, will try the script and investigate more about the issue.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-26 14:06                               ` Guoqing Jiang
@ 2021-01-26 16:05                                 ` Donald Buczek
  2021-02-02 15:42                                   ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-01-26 16:05 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 26.01.21 15:06, Guoqing Jiang wrote:
> 
> 
> On 1/26/21 13:58, Donald Buczek wrote:
>>
>>
>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>
>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>                          md_check_recovery(mddev);
>>>                          spin_lock_irq(&conf->device_lock);
>>>                  }
>>> +
>>> +               if ((atomic_read(&conf->active_stripes)
>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>> +                       wake_up(&conf->wait_for_stripe);
>>>          }
>>>          pr_debug("%d stripes handled\n", handled);
>>
>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>
>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>      [<0>] md_thread+0xab/0x160
>>      [<0>] kthread+0x11b/0x140
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
> 
> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
> 
> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
> 
> wait_event(mddev->recovery_wait,
>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>         !atomic_read(&mddev->recovery_active));

With that added, md3_resync goes into a loop, too. Not 100% busy, though.

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes the top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, it's woken, but the conditions don't match, so it sleeps again.

Best
   Donald

> 
>> And, unlike before, "md: md3: data-check interrupted." from the pr_info two lines above appears in dmesg.
> 
> Yes, that is intentional since MD_RECOVERY_INTR is set by write idle.
> 
> Anyway, will try the script and investigate more about the issue.
> 
> Thanks,
> Guoqing

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-01-26 16:05                                 ` Donald Buczek
@ 2021-02-02 15:42                                   ` Guoqing Jiang
  2021-02-08 11:38                                     ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-02-02 15:42 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 1/26/21 17:05, Donald Buczek wrote:
> Dear Guoqing,
> 
> On 26.01.21 15:06, Guoqing Jiang wrote:
>>
>>
>> On 1/26/21 13:58, Donald Buczek wrote:
>>>
>>>
>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>
>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>                          md_check_recovery(mddev);
>>>>                          spin_lock_irq(&conf->device_lock);
>>>>                  }
>>>> +
>>>> +               if ((atomic_read(&conf->active_stripes)
>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>          }
>>>>          pr_debug("%d stripes handled\n", handled);
>>>
>>> Hmm... With this patch on top of your other one, we still have the 
>>> basic symptoms (md3_raid6 busy looping), but the sync thread is now 
>>> hanging at
>>>
>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>      [<0>] md_thread+0xab/0x160
>>>      [<0>] kthread+0x11b/0x140
>>>      [<0>] ret_from_fork+0x22/0x30
>>>
>>> instead, which is 
>>> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>
>> Not sure why recovery_active is not zero, because it is set 0 before 
>> blk_start_plug, and raid5_sync_request returns 0 and skipped is also 
>> set to 1. Perhaps handle_stripe calls md_done_sync.
>>
>> Could you double check the value of recovery_active? Or just don't 
>> wait if resync thread is interrupted.
>>
>> wait_event(mddev->recovery_wait,
>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>         !atomic_read(&mddev->recovery_active));
> 
> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
> 
> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
> 
> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  # 
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
> [<0>] raid5_sync_request+0x2a7/0x3d0       # 
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
> [<0>] md_do_sync.cold+0x3ee/0x97c          # 
> https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
> [<0>] md_thread+0xab/0x160
> [<0>] kthread+0x11b/0x140
> [<0>] ret_from_fork+0x22/0x30
> 
> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of 
> raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, 
> but the conditions don't match so its sleeps again.

I don't know why the condition was not true after the change, since 
RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
     (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
      /* the previous condition */,
     *(conf->hash_locks + hash));


BTW, I think there are some possible ways:

1. let "echo idle" give up the reconfig_mutex if there are limited 
number of active stripe, but I feel it is ugly to check sh number from 
action_store (kind of layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe able to quit from 
the current situation (this is what we tried, though it doesn't work so far).

3. set MD_ALLOW_SB_UPDATE as you said, though I am not sure about the 
safety (but maybe I am wrong).

4. given that the write IO keeps coming from the upper layer, which 
decreases the available stripes, maybe we need to call grow_one_stripe 
at the beginning of raid5_make_request for this case, then call 
drop_one_stripe at the end of make_request (a rough sketch follows below).

5. maybe don't hold reconfig_mutex when trying to unregister 
sync_thread, like this:

         /* resync has finished, collect result */
         mddev_unlock(mddev);
         md_unregister_thread(&mddev->sync_thread);
         mddev_lock(mddev);



My suggestion would be to try 2 + 4 together, since the reproducer 
triggers both sync IO and write IO. Or try 5. Perhaps there is a better 
way which I just can't find.
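
For option 4, what I have in mind is roughly the following (untested
sketch; grow_one_stripe()/drop_one_stripe() are the existing helpers in
raid5.c, and error paths are ignored here):

static bool raid5_make_request(struct mddev *mddev, struct bio *bi)
{
	struct r5conf *conf = mddev->private;

	/* reserve one extra stripe so the background sync cannot
	 * starve the stripe cache completely */
	grow_one_stripe(conf, GFP_NOIO);

	/* ... existing raid5_make_request() body ... */

	/* give the reservation back before returning */
	drop_one_stripe(conf);
	return true;
}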

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-02-02 15:42                                   ` Guoqing Jiang
@ 2021-02-08 11:38                                     ` Donald Buczek
  2021-02-08 14:53                                       ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-02-08 11:38 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

On 02.02.21 16:42, Guoqing Jiang wrote:
> Hi Donald,
> 
> On 1/26/21 17:05, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> On 26.01.21 15:06, Guoqing Jiang wrote:
>>>
>>>
>>> On 1/26/21 13:58, Donald Buczek wrote:
>>>>
>>>>
>>>>> Hmm, how about wake the waiter up in the while loop of raid5d?
>>>>>
>>>>> @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
>>>>>                          md_check_recovery(mddev);
>>>>>                          spin_lock_irq(&conf->device_lock);
>>>>>                  }
>>>>> +
>>>>> +               if ((atomic_read(&conf->active_stripes)
>>>>> +                    < (conf->max_nr_stripes * 3 / 4) ||
>>>>> +                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
>>>>> +                       wake_up(&conf->wait_for_stripe);
>>>>>          }
>>>>>          pr_debug("%d stripes handled\n", handled);
>>>>
>>>> Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at
>>>>
>>>>      root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>>>      [<0>] md_do_sync.cold+0x8ec/0x97c
>>>>      [<0>] md_thread+0xab/0x160
>>>>      [<0>] kthread+0x11b/0x140
>>>>      [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963
>>>
>>> Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.
>>>
>>> Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.
>>>
>>> wait_event(mddev->recovery_wait,
>>>         test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
>>>         !atomic_read(&mddev->recovery_active));
>>
>> With that added, md3_resync goes into a loop, too. Not 100% busy, though.
>>
>> root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
>>
>> [<0>] raid5_get_active_stripe+0x1e7/0x6b0  # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
>> [<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
>> [<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
>> [<0>] md_thread+0xab/0x160
>> [<0>] kthread+0x11b/0x140
>> [<0>] ret_from_fork+0x22/0x30
>>
>> Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again.
> 
> I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request.
> 
> wait_event_lock_irq(conf->wait_for_stripe,
>      (test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
>       /* the previous condition */,
>      *(conf->hash_locks + hash));

Not knowing the system by heart, I'd probably need more than a full day to analyze that. Please let me know if this is needed.

> BTW, I think there some some possible ways:
> 
> 1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation).

My first thought is that either it is needed to hold the mutex or it is not. Why should this depend on the number of available stripes?

And wouldn't we get into the same situation if another reconfiguration from userspace takes the mutex again?

> 2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far).
> 
> 3. set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong).
> 
> 4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe
> at the end of make_request.
> 
> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
> 
>          /* resync has finished, collect result */
>          mddev_unlock(mddev);
>          md_unregister_thread(&mddev->sync_thread);
>          mddev_lock(mddev);

As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?

> My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find.

If special code needs to be added here and here and here, I just get a bit suspicious that there might be something wrong with the abstraction.

But I'm afraid that, aside from testing, I can't be of too much help here, as I don't know all the components or details of the system that well.

Best
   Donald

> Thanks,
> Guoqing


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-02-08 11:38                                     ` Donald Buczek
@ 2021-02-08 14:53                                       ` Guoqing Jiang
  2021-02-08 18:41                                         ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-02-08 14:53 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid



On 2/8/21 12:38, Donald Buczek wrote:
>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, 
>> like this.
>>
>>          /* resync has finished, collect result */
>>          mddev_unlock(mddev);
>>          md_unregister_thread(&mddev->sync_thread);
>>          mddev_lock(mddev);
> 
> As above: While we wait for the sync thread to terminate, wouldn't it be 
> a problem, if another user space operation takes the mutex?

I don't think other places can be blocked while holding the mutex, 
otherwise those places could cause a potential deadlock. Please try the 
above two-line change. And perhaps others have a better idea.
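
To be clear about where the two lines go: they would land in
md_reap_sync_thread(), where the "/* resync has finished, collect
result */" comment lives, roughly like this (untested sketch; the
return value of mddev_lock() is ignored here, as in the snippet above):

void md_reap_sync_thread(struct mddev *mddev)
{
	/* resync has finished, collect result */
	mddev_unlock(mddev);
	md_unregister_thread(&mddev->sync_thread);
	mddev_lock(mddev);

	/* ... rest of md_reap_sync_thread() unchanged ... */
}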

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-02-08 14:53                                       ` Guoqing Jiang
@ 2021-02-08 18:41                                         ` Donald Buczek
  2021-02-09  0:46                                           ` Guoqing Jiang
  0 siblings, 1 reply; 24+ messages in thread
From: Donald Buczek @ 2021-02-08 18:41 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Dear Guoqing,

On 08.02.21 15:53, Guoqing Jiang wrote:
> 
> 
> On 2/8/21 12:38, Donald Buczek wrote:
>>> 5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.
>>>
>>>          /* resync has finished, collect result */
>>>          mddev_unlock(mddev);
>>>          md_unregister_thread(&mddev->sync_thread);
>>>          mddev_lock(mddev);
>>
>> As above: While we wait for the sync thread to terminate, wouldn't it be a problem, if another user space operation takes the mutex?
> 
> I don't think other places can be blocked while hold mutex, otherwise these places can cause potential deadlock. Please try above two lines change. And perhaps others have better idea.

Yes, this works. No deadlock after >11000 seconds,

(Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Best
   Donald
> 
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-02-08 18:41                                         ` Donald Buczek
@ 2021-02-09  0:46                                           ` Guoqing Jiang
  2021-02-09  9:24                                             ` Donald Buczek
  0 siblings, 1 reply; 24+ messages in thread
From: Guoqing Jiang @ 2021-02-09  0:46 UTC (permalink / raw)
  To: Donald Buczek, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

Hi Donald,

On 2/8/21 19:41, Donald Buczek wrote:
> Dear Guoqing,
> 
> On 08.02.21 15:53, Guoqing Jiang wrote:
>>
>>
>> On 2/8/21 12:38, Donald Buczek wrote:
>>>> 5. maybe don't hold reconfig_mutex when try to unregister 
>>>> sync_thread, like this.
>>>>
>>>>          /* resync has finished, collect result */
>>>>          mddev_unlock(mddev);
>>>>          md_unregister_thread(&mddev->sync_thread);
>>>>          mddev_lock(mddev);
>>>
>>> As above: While we wait for the sync thread to terminate, wouldn't it 
>>> be a problem, if another user space operation takes the mutex?
>>
>> I don't think other places can be blocked while hold mutex, otherwise 
>> these places can cause potential deadlock. Please try above two lines 
>> change. And perhaps others have better idea.
> 
> Yes, this works. No deadlock after >11000 seconds,
> 
> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265, 
> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )

Great. I will send a formal patch with your reported-by and tested-by.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
  2021-02-09  0:46                                           ` Guoqing Jiang
@ 2021-02-09  9:24                                             ` Donald Buczek
  0 siblings, 0 replies; 24+ messages in thread
From: Donald Buczek @ 2021-02-09  9:24 UTC (permalink / raw)
  To: Guoqing Jiang, Song Liu, linux-raid, Linux Kernel Mailing List, it+raid

On 09.02.21 01:46, Guoqing Jiang wrote:

> Great. I will send a formal patch with your reported-by and tested-by.

Yes, that's fine.

Thanks a lot for your help!

Donald

> 
> Thanks,
> Guoqing

^ permalink raw reply	[flat|nested] 24+ messages in thread

Thread overview: 24 messages
2020-11-28 12:25 md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition Donald Buczek
2020-11-30  2:06 ` Guoqing Jiang
2020-12-01  9:29   ` Donald Buczek
2020-12-02 17:28     ` Donald Buczek
2020-12-03  1:55       ` Guoqing Jiang
2020-12-03 11:42         ` Donald Buczek
2020-12-21 12:33           ` Donald Buczek
2021-01-19 11:30             ` Donald Buczek
2021-01-20 16:33               ` Guoqing Jiang
2021-01-23 13:04                 ` Donald Buczek
2021-01-25  8:54                   ` Donald Buczek
2021-01-25 21:32                     ` Donald Buczek
2021-01-26  0:44                       ` Guoqing Jiang
2021-01-26  9:50                         ` Donald Buczek
2021-01-26 11:14                           ` Guoqing Jiang
2021-01-26 12:58                             ` Donald Buczek
2021-01-26 14:06                               ` Guoqing Jiang
2021-01-26 16:05                                 ` Donald Buczek
2021-02-02 15:42                                   ` Guoqing Jiang
2021-02-08 11:38                                     ` Donald Buczek
2021-02-08 14:53                                       ` Guoqing Jiang
2021-02-08 18:41                                         ` Donald Buczek
2021-02-09  0:46                                           ` Guoqing Jiang
2021-02-09  9:24                                             ` Donald Buczek
