From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
To: Donald Buczek <buczek@molgen.mpg.de>, Song Liu <song@kernel.org>,
linux-raid@vger.kernel.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
it+raid@molgen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Date: Thu, 3 Dec 2020 02:55:15 +0100
Message-ID: <b289ae15-ff82-b36e-4be4-a1c8bbdbacd7@cloud.ionos.com>
In-Reply-To: <7c5438c7-2324-cc50-db4d-512587cb0ec9@molgen.mpg.de>
Hi Donald,
On 12/2/20 18:28, Donald Buczek wrote:
> Dear Guoqing,
>
> unfortunately the patch didn't fix the problem (unless I messed it up
> with my logging). This is what I used:
>
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)
>                          clear_bit(MD_RECOVERY_NEEDED,
>                                    &mddev->recovery);
>                          goto unlock;
>                  }
I think you can add the RECOVERY_CHECK test to the existing condition
above instead of adding a new block.
> +                if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
> +                    (!test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
> +                     test_bit(MD_RECOVERY_CHECK, &mddev->recovery))) {
> +                        /* resync/recovery still happening */
> +                        pr_info("md: XXX BUGFIX applied\n");
> +                        clear_bit(MD_RECOVERY_NEEDED,
> +                                  &mddev->recovery);
> +                        goto unlock;
> +                }
>                  if (mddev->sync_thread) {
>                          md_reap_sync_thread(mddev);
>                          goto unlock;
>
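The condition the patch adds can be exercised in isolation. Below is a
userspace sketch of that test; the enum values and the by-value
test_bit() are simplifications for illustration, not the kernel's
definitions:

```c
#include <stdbool.h>

/* Illustrative flag numbers; the kernel defines the real ones in md.h. */
enum {
	MD_RECOVERY_RUNNING,
	MD_RECOVERY_DONE,
	MD_RECOVERY_CHECK,
};

/* Simplified test_bit(): the kernel version takes a pointer to the
 * flag word, this sketch takes the word by value. */
static bool test_bit(int nr, unsigned long flags)
{
	return (flags >> nr) & 1UL;
}

/* The patched test: the array counts as busy while a resync/recovery
 * is running and is either not yet done or is a "check". */
static bool resync_still_active(unsigned long recovery)
{
	return test_bit(MD_RECOVERY_RUNNING, recovery) &&
	       (!test_bit(MD_RECOVERY_DONE, recovery) ||
	        test_bit(MD_RECOVERY_CHECK, recovery));
}
```

So with this patch a running "check" takes the early unlock path (and
skips md_reap_sync_thread) even after MD_RECOVERY_DONE is set.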
> With pausing and continuing the check four times an hour, I could
> trigger the problem after about 48 hours. This time, the other device
> (md0) has locked up on `echo idle >
> /sys/devices/virtual/block/md0/md/sync_action` , while the check of md1
> is still ongoing:
Without the patch, md0 was fine while md1 was locked up, so the patch
swaps which of the two arrays gets stuck. A little weird ...
What is the stack of the stuck process? I guess it is the same as the
stack of pid 23333 in your previous mail, but just to confirm.
>
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [=>...................]  check =  8.5% (666852112/7813894144) finish=1271.2min speed=93701K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>
> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [>....................]  check =  0.2% (19510348/7813894144) finish=253779.6min speed=511K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>
> after 1 minute:
>
> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath]
> md1 : active raid6 sdk[0] sdj[15] sdi[14] sdh[13] sdg[12] sdf[11] sde[10] sdd[9] sdc[8] sdr[7] sdq[6] sdp[5] sdo[4] sdn[3] sdm[2] sdl[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [=>...................]  check =  8.6% (674914560/7813894144) finish=941.1min speed=126418K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>
> md0 : active raid6 sds[0] sdah[15] sdag[16] sdaf[13] sdae[12] sdad[11] sdac[10] sdab[9] sdaa[8] sdz[7] sdy[6] sdx[17] sdw[4] sdv[3] sdu[2] sdt[1]
>       109394518016 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
>       [>....................]  check =  0.2% (19510348/7813894144) finish=256805.0min speed=505K/sec
>       bitmap: 0/59 pages [0KB], 65536KB chunk
>
> A data point, I didn't mention in my previous mail, is that the
> mdX_resync thread is not gone when the problem occurs:
>
> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ ps -Af|fgrep [md
> root 134 2 0 Nov30 ? 00:00:00 [md]
> root 1321 2 27 Nov30 ? 12:57:14 [md0_raid6]
> root 1454 2 26 Nov30 ? 12:37:23 [md1_raid6]
> root 5845 2 0 16:20 ? 00:00:30 [md0_resync]
> root 5855 2 13 16:20 ? 00:14:11 [md1_resync]
> buczek 9880 9072 0 18:05 pts/0 00:00:00 grep -F [md
> buczek@done:/scratch/local/linux (v5.10-rc6-mpi)$ sudo cat /proc/5845/stack
> [<0>] md_bitmap_cond_end_sync+0x12d/0x170
> [<0>] raid5_sync_request+0x24b/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x1f/0x30
>
> I guess, md_bitmap_cond_end_sync+0x12d is the
> `wait_event(bitmap->mddev->recovery_wait,atomic_read(&bitmap->mddev->recovery_active)
> == 0);` in md-bitmap.c.
>
Could be. If so, then I think md_done_sync was not triggered by the path
md0_raid6 -> ... -> handle_stripe.
I'd suggest comparing the stacks of md0 and md1 to find the difference.
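For background on why that wait can hang: the sleeping side waits for
recovery_active to reach zero, and md_done_sync() is what decrements the
counter and wakes recovery_wait, so if completions stop arriving the
resync thread sleeps forever. A minimal userspace sketch of that
handshake (pthreads stand in for the kernel wait queue; the names are
illustrative, not the kernel's code):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t recovery_wait = PTHREAD_COND_INITIALIZER;
static int recovery_active;

/* Completion side, analogous to md_done_sync(): account for finished
 * blocks and wake anyone sleeping on recovery_wait. */
static void done_sync(int blocks)
{
	pthread_mutex_lock(&lock);
	recovery_active -= blocks;
	pthread_cond_broadcast(&recovery_wait);
	pthread_mutex_unlock(&lock);
}

/* Waiting side, analogous to the wait_event() in
 * md_bitmap_cond_end_sync(): block until all in-flight resync I/O has
 * been accounted for.  If done_sync() is never called again, this
 * sleeps forever, which matches the md0_resync stack above. */
static void wait_for_idle(void)
{
	pthread_mutex_lock(&lock);
	while (recovery_active != 0)
		pthread_cond_wait(&recovery_wait, &lock);
	pthread_mutex_unlock(&lock);
}
```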
Thanks,
Guoqing