Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
To: Donald Buczek <buczek@molgen.mpg.de>, Song Liu <song@kernel.org>,
	linux-raid@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	it+raid@molgen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Date: Wed, 20 Jan 2021 17:33:38 +0100	[thread overview]
Message-ID: <d95aa962-9750-c27c-639a-2362bdb32f41@cloud.ionos.com> (raw)
In-Reply-To: <55e30408-ac63-965f-769f-18be5fd5885c@molgen.mpg.de>

Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
> 
> I've reported a problem in this thread in December:
> 
> "We are using raid6 on several servers. Occasionally we had failures, 
> where a mdX_raid6 process seems to go into a busy loop and all I/O to 
> the md device blocks. We've seen this on various kernel versions." It 
> was clear, that this is related to "mdcheck" running, because we found, 
> that the command, which should stop the scrubbing in the morning (`echo 
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
> 
> On 12/21/20, I've reported, that the problem might be caused by a 
> failure of the underlying block device driver, because I've found 
> "inflight" counters of the block devices not being zero. However, this 
> is not the case. We were able to run into the mdX_raid6 looping 
> condition a few times again, but the non-zero inflight counters have not 
> been observed again.
> 
> I was able to collect a lot of additional information from a blocked 
> system.
> 
> - The `cat idle > /sys/devices/virtual/block/md1/md/sync_action` command 
> is waiting at kthread_stop to stop the sync thread. [ 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
> 
> - The sync thread ("md1_resync") does not finish, because its blocked at
> 
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
> 
> - yes, gdb confirms that `conf->cache_state` is 0x03 ( 
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

The resync thread is blocked since it can't get sh from inactive list, 
so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are set, that is why `echo idle 
 > /sys/devices/virtual/block/md1/md/sync_action` can't stop sync thread.

> 
> - We have lots of active stripes:
> 
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat 
> /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

so raid5_get_active_stripe has to wait till it becomes true, either 
increase max_nr_stripes or decrease active_stripes.

1. Increase max_nr_stripes
since "mdX_raid6 process seems to go into a busy loop" and R5_ALLOC_MORE 
is set, if there is enough memory, I suppose mdX_raid6 (raid5d) could 
alloc new stripe in grow_one_stripe and increase max_nr_stripes. So 
please check the memory usage of your system.

Another thing is you can try to increase the number of sh manually by 
write new number to stripe_cache_size if there is enough memory.

2. Or decrease active_stripes
This is suppose to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
> 
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
> 
> In dmesg, we see the debug output from 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 
> but never from 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
> 
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> 
> The sector numbers repeat after some time. We have only the following 
> variants of stripe state and "check":
> 
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, 
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, 
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, 
> check_state_idle
> 
> - We have MD_SB_CHANGE_PENDING set:

because MD_SB_CHANGE_PENDING flag. So do_release_stripe can't call 
atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared from md_update_sb which could be 
called by md_reap_sync_thread. But md_reap_sync_thread stuck with 
unregister sync_thread (it was blocked in raid5_get_active_stripe).

Still I don't understand well why mdX_raid6 in a busy loop, maybe raid5d 
can't break from the while(1) loop because "if (!batch_size && 
!released)" is false. I would assume released is '0' since
 >      released_stripes = {
 >          first = 0x0
 >      }
And __get_priority_stripe fetches sh from conf->handle_list and delete
it from handle_list, handle_stripe marks the state of the sh with 
STRIPE_HANDLE, then do_release_stripe put the sh back to handle_list.
So batch_size can't be '0', and handle_active_stripes in the loop
repeats the process in the busy loop. This is my best guess to explain 
the busy loop situation.

> 
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat 
> /sys/block/md1/md/array_state
> write-pending
> 
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

since rdev_set_badblocks could set them, could you check if there is 
badblock of underlying device (sd*)?

> 
> So it can be assumed that handle_stripe breaks out at 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
> 
> - The system can manually be freed from the deadlock:
> 
> When `echo active > /sys/block/md1/md/array_state` is used, the 
> scrubbing and other I/O continue. Probably because of 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, seems clear SB_CHANGE_PENDING made the trick, so the blocked 
process can make progress.

> 
> I, of coruse, don't fully understand it yet. Any ideas?
> 
> I append some data from a hanging raid... (mddev, r5conf and a sample 
> stripe_head from handle_list with it first disks)

These data did help for investigation!

Thanks,
Guoqing