Linux-Raid Archives on lore.kernel.org
From: Donald Buczek <buczek@molgen.mpg.de>
To: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>,
	Song Liu <song@kernel.org>,
	linux-raid@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	it+raid@molgen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
Date: Mon, 25 Jan 2021 22:32:30 +0100
Message-ID: <bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de>
In-Reply-To: <cdc0b03c-db53-35bc-2f75-93bbca0363b5@molgen.mpg.de>



On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
> 
> a colleague of mine was able to reproduce the issue inside a VM and found a procedure that drives the VM into the issue within minutes (rather than unreliably after hours on a physical system, as before). This, of course, helped to pinpoint the problem.
> 
> My current theory of what is happening is:
> 
> - MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
> 
> - Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check with `echo idle > /sys/block/mdX/md/sync_action`.
> 
> - action_store() takes the reconfig_mutex before trying to stop the sync thread (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689). The dump of struct mddev from my email of 1/19/21 confirms that reconfig_mutex is held (non-zero).
> 
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329), although raid5d()->handle_active_stripes()->handle_stripe() doesn't actually process any stripe because MD_SB_CHANGE_PENDING is set (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is why raid5d is busy-looping.
> 
> - raid5d()->md_check_recovery() is called by the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so, it either needs MD_ALLOW_SB_UPDATE to be set or must be able to take the reconfig_mutex (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967 , https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither is the case, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared. (A condensed sketch of this interaction follows the list.)
> 
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
> 
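> To illustrate the cycle, here is a heavily condensed sketch of the raid5d() main loop. This is not the literal kernel code, only an illustration using the functions and flags named above, as I read them in v5.4.57:
> 
> ```
> /* Illustrative sketch only -- not the literal v5.4.57 code. */
> while (1) {
>         int batch_size = handle_active_stripes(conf, ...);
>         /* handle_stripe() bails out on every stripe because
>          * MD_SB_CHANGE_PENDING is set, yet batch_size stays > 0,
>          * so this exit is never taken: */
>         if (!batch_size && !released)
>                 break;
>         /* md_check_recovery() is supposed to write the superblock and
>          * clear MD_SB_CHANGE_PENDING, but it needs MD_ALLOW_SB_UPDATE
>          * or a successful mddev_trylock() to do so; action_store()
>          * already holds the reconfig_mutex, so nothing changes and
>          * the loop spins. */
>         md_check_recovery(mddev);
> }
> ```
> 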
> Does this make sense?
> 
> Just for reference, here is the procedure which triggers the issue on the VM (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
> 
> ```
> #! /bin/bash
> 
> (
>          while true; do
>                  echo "start check"
>                  echo check > /sys/block/md3/md/sync_action
>                  sleep 10
>                  echo "stop check"
>                  echo idle > /sys/block/md3/md/sync_action
>                  sleep 2
>          done
> ) &
> 
> (
>          while true; do
>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>                  sync /mnt/raid_md3/bigfile
>                  rm /mnt/raid_md3/bigfile
>                  sleep .1
>          done
> ) &
> 
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>          array_state=$(cat array_state)
>          if [ "$array_state" = write-pending ]; then
>                  wp_count=$(($wp_count+1))
>          else
>                  wp_count=0
>          fi
>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>          if [ $wp_count -ge 3 ]; then
>                  kill -- -$$
>                  exit
>          fi
>          sleep 1
> done
> ```
> 
> The time this needs to trigger the issue varies from under a minute to one hour, with 5 minutes being typical. The output ends like this:
> 
>      309 check 6283872 / 8378368 active-idle 4144
>      310 check 6283872 / 8378368 active 1702
>      311 check 6807528 / 8378368 active 4152
>      312 check 7331184 / 8378368 clean 3021
>      stop check
>      313 check 7331184 / 8378368 write-pending 3905
>      314 check 7331184 / 8378368 write-pending 3905
>      315 check 7331184 / 8378368 write-pending 3905
>      Terminated
> 
> If I add
> 
>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
> 
> in raid5d after the call to handle_active_stripes() and enable that debug location after the deadlock has occurred, I get:
> 
>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> 
> If I add
> 
>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
> 
> at the head of md_check_recovery, I get:
> 
>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> 
> More debug lines in md_check_recovery confirm the control flow (the `if (mddev_trylock(mddev))` block is not taken).
> 
> What approach would you suggest to fix this?

I naively tried the following patch and it seems to fix the problem. The test procedure didn't trigger the deadlock in 10 hours.
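
The patch below just sets MD_ALLOW_SB_UPDATE around the reaping of the sync thread, mirroring (if I read v5.4 correctly) the pattern used by mddev_suspend(). With the flag set, md_check_recovery() can take the branch near its top that writes the superblock without needing the reconfig_mutex, so MD_SB_CHANGE_PENDING gets cleared and the active stripes can drain. Paraphrased from v5.4 (not a verbatim quote), that branch looks roughly like this:

```
/* Paraphrased from md_check_recovery() in v5.4 -- not a verbatim quote. */
if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
	/* Whoever set MD_ALLOW_SB_UPDATE holds the reconfig_mutex for us,
	 * so the superblock can be written without taking it here. */
	md_update_sb(mddev, 0);
	wake_up(&mddev->sb_wait);
}
```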

D.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
  		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
  		    mddev_lock(mddev) == 0) {
+			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
  			flush_workqueue(md_misc_wq);
  			if (mddev->sync_thread) {
  				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
  				md_reap_sync_thread(mddev);
  			}
+			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
  			mddev_unlock(mddev);
  		}
  	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-- 
2.30.0
