Linux-Raid Archives on lore.kernel.org
From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
To: Donald Buczek <buczek@molgen.mpg.de>, Song Liu <song@kernel.org>,
	linux-raid@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	it+raid@molgen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Date: Wed, 20 Jan 2021 17:33:38 +0100
Message-ID: <d95aa962-9750-c27c-639a-2362bdb32f41@cloud.ionos.com> (raw)
In-Reply-To: <55e30408-ac63-965f-769f-18be5fd5885c@molgen.mpg.de>

Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
> 
> I've reported a problem in this thread in December:
> 
> "We are using raid6 on several servers. Occasionally we had failures, 
> where a mdX_raid6 process seems to go into a busy loop and all I/O to 
> the md device blocks. We've seen this on various kernel versions." It 
> was clear that this is related to "mdcheck" running, because we found 
> that the command which should stop the scrubbing in the morning (`echo 
> idle > /sys/devices/virtual/block/md1/md/sync_action`) is also blocked.
> 
> On 12/21/20, I reported that the problem might be caused by a failure 
> of the underlying block device driver, because I had found "inflight" 
> counters of the block devices not being zero. However, this is not the 
> case: we were able to run into the mdX_raid6 looping condition a few 
> times again, but the non-zero inflight counters have not been observed 
> again.
> 
> I was able to collect a lot of additional information from a blocked 
> system.
> 
> - The `echo idle > /sys/devices/virtual/block/md1/md/sync_action` command 
> is waiting at kthread_stop to stop the sync thread. [ 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
> 
> - The sync thread ("md1_resync") does not finish, because it's blocked at
> 
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
> 
> - yes, gdb confirms that `conf->cache_state` is 0x03 ( 
> R5_INACTIVE_BLOCKED + R5_ALLOC_MORE )

The resync thread is blocked because it can't get a stripe_head (sh) 
from the inactive list, so R5_ALLOC_MORE and R5_INACTIVE_BLOCKED are 
set; that is why `echo idle > 
/sys/devices/virtual/block/md1/md/sync_action` can't stop the sync thread.

> 
> - We have lots of active stripes:
> 
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat 
> /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this condition is false:

atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

and raid5_get_active_stripe has to wait until it becomes true; either 
max_nr_stripes needs to increase or active_stripes needs to decrease.
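
For reference, the wait the resync thread is sitting in looks roughly 
like this (paraphrased from memory of raid5_get_active_stripe() around 
the line Donald referenced, so not verbatim):

	/* no free stripe_head was found, mark the cache blocked and
	 * wait for stripes to be released back to the inactive list */
	set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
	wait_event_lock_irq(conf->wait_for_stripe,
		!list_empty(conf->inactive_list + hash) &&
		(atomic_read(&conf->active_stripes)
			< (conf->max_nr_stripes * 3 / 4) ||
		 !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
		*(conf->hash_locks + hash));
	clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);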

1. Increase max_nr_stripes
Since the "mdX_raid6 process seems to go into a busy loop" and 
R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6 
(raid5d) could allocate new stripes in grow_one_stripe and increase 
max_nr_stripes. So please check the memory usage of your system.

Another thing you can try is to increase the number of stripe_heads 
manually by writing a new number to stripe_cache_size, provided there 
is enough memory.
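
For example (the number is only an illustration), something like 
`echo 16384 > /sys/block/md1/md/stripe_cache_size` would grow the 
cache; each stripe_head costs roughly one page per member device, so 
please pick the value based on how much memory you can spare.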

2. Or decrease active_stripes
This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is set

> - handle_stripe() doesn't make progress:
> 
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
> 
> In dmesg, we see the debug output from 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925 
> but never from 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
> 
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912, 
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10
>                  , check:4, reconstruct:0
> 
> The sector numbers repeat after some time. We have only the following 
> variants of stripe state and "check":
> 
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, 
> check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, 
> check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, 
> check_state_idle
> 
> - We have MD_SB_CHANGE_PENDING set:

because of the MD_SB_CHANGE_PENDING flag, so do_release_stripe can't 
call atomic_dec(&conf->active_stripes).
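
Roughly, the relevant branch of do_release_stripe looks like this (a 
trimmed paraphrase from memory, not the exact code):

	if (test_bit(STRIPE_HANDLE, &sh->state)) {
		/* stripe still wants attention: it goes back onto
		 * delayed_list, bitmap_list or handle_list, and
		 * active_stripes is NOT decremented */
		list_add_tail(&sh->lru, &conf->handle_list);
	} else {
		/* only this branch really retires the stripe */
		atomic_dec(&conf->active_stripes);
		if (!test_bit(STRIPE_EXPANDING, &sh->state))
			list_add(&sh->lru, temp_inactive_list);
	}

So as long as every stripe comes out of handle_stripe with 
STRIPE_HANDLE still set, active_stripes can never drop below the 3/4 
threshold above.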

Normally, MD_SB_CHANGE_PENDING is cleared by md_update_sb, which could 
be called from md_reap_sync_thread. But md_reap_sync_thread is stuck 
waiting to unregister the sync thread (which itself is blocked in 
raid5_get_active_stripe).


Still, I don't fully understand why mdX_raid6 is in a busy loop. Maybe 
raid5d can't break out of the while(1) loop because "if (!batch_size && 
!released)" is false. I would assume released is '0', since
 >      released_stripes = {
 >          first = 0x0
 >      }
And __get_priority_stripe fetches the sh from conf->handle_list and 
deletes it from handle_list, handle_stripe marks the sh with 
STRIPE_HANDLE, and then do_release_stripe puts the sh back onto 
handle_list. So batch_size can't be '0', and handle_active_stripes 
repeats the process in the busy loop. This is my best guess to explain 
the busy-loop situation.
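
The loop I mean is roughly this (trimmed, from memory of raid5d()):

	while (1) {
		...
		released = release_stripe_list(conf, conf->temp_inactive_list);
		...
		batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
						   conf->temp_inactive_list);
		if (!batch_size && !released)
			break;
		handled += batch_size;
		...
	}
	/* the R5_ALLOC_MORE -> grow_one_stripe() path only runs after
	 * raid5d breaks out of this loop */

So if every pass finds the same stripes on handle_list again, 
batch_size stays non-zero and raid5d just spins.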

> 
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat 
> /sys/block/md1/md/array_state
> write-pending
> 
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check whether there 
are bad blocks on the underlying devices (sd*)?
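
(If I remember the sysfs layout correctly, the per-device bad block 
lists of the array are exported under /sys/block/md1/md/dev-*/bad_blocks 
and .../unacknowledged_bad_blocks; a non-empty file there would point at 
rdev_set_badblocks having been called.)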

> 
> So it can be assumed that handle_stripe breaks out at 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
> 
> - The system can manually be freed from the deadlock:
> 
> When `echo active > /sys/block/md1/md/array_state` is used, the 
> scrubbing and other I/O continue. Probably because of 
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems that clearing MD_SB_CHANGE_PENDING did the trick, so the 
blocked processes can make progress.
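
That matches the md.c line you referenced; as far as I recall, the 
"active" case of array_state_store() does roughly this (paraphrased, 
not verbatim):

	case active:
		if (mddev->pers) {
			err = restart_array(mddev);
			if (err)
				break;
			/* this is what un-wedges the stuck stripes */
			clear_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
			wake_up(&mddev->sb_wait);
			err = 0;
		}
		...
		break;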

> 
> I, of course, don't fully understand it yet. Any ideas?
> 
> I append some data from a hanging raid... (mddev, r5conf and a sample 
> stripe_head from handle_list with its first disks)

These data did help the investigation!

Thanks,
Guoqing
