[md PATCH 0/5] Address deadlock associated with setting suspend

All of lore.kernel.org
 help / color / mirror / Atom feed

* [md PATCH 0/5] Address deadlock associated with setting suspend_lo
@ 2017-10-17  2:46 NeilBrown
  2017-10-17  2:46 ` [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: NeilBrown @ 2017-10-17  2:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

This series fixes the issue that Xiao Ni found
where a deadlock can happen if there are active writes
to an md/raid5 waiting for suspend_lo to be increased.

The last patch contains the important fix.  The others
prepare for it so that it can work reliably.

Thanks,
NeilBrown

---

NeilBrown (5):
      md: always hold reconfig_mutex when calling mddev_suspend()
      md: don't call bitmap_create() while array is quiesced.
      md: move suspend_hi/lo handling into core md code
      md: use mddev_suspend/resume instead of ->quiesce()
      md: allow metadata update while suspending.


 drivers/md/dm-raid.c     |    5 ++-
 drivers/md/md.c          |   74 ++++++++++++++++++++++++++++++++++------------
 drivers/md/md.h          |    6 ++++
 drivers/md/raid1.c       |   12 ++-----
 drivers/md/raid5-cache.c |   18 ++++++++---
 drivers/md/raid5.c       |   22 --------------
 6 files changed, 82 insertions(+), 55 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [md PATCH 2/5] md: don't call bitmap_create() while array is quiesced.
  2017-10-17  2:46 [md PATCH 0/5] Address deadlock associated with setting suspend_lo NeilBrown
                   ` (3 preceding siblings ...)
  2017-10-17  2:46 ` [md PATCH 3/5] md: move suspend_hi/lo handling into core md code NeilBrown
@ 2017-10-17  2:46 ` NeilBrown
  4 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2017-10-17  2:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

bitmap_create() allocates memory with GFP_KERNEL and
so can wait for IO.
If called while the array is quiesced, it could wait indefinitely
for write out to the array - deadlock.
So call bitmap_create() before quiescing the array.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 04538b60f8f3..308456842d3e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6619,22 +6619,26 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 		return -ENOENT; /* cannot remove what isn't there */
 	err = 0;
 	if (mddev->pers) {
-		mddev->pers->quiesce(mddev, 1);
 		if (fd >= 0) {
 			struct bitmap *bitmap;
 
 			bitmap = bitmap_create(mddev, -1);
+			mddev->pers->quiesce(mddev, 1);
 			if (!IS_ERR(bitmap)) {
 				mddev->bitmap = bitmap;
 				err = bitmap_load(mddev);
 			} else
 				err = PTR_ERR(bitmap);
-		}
-		if (fd < 0 || err) {
+			if (err) {
+				bitmap_destroy(mddev);
+				fd = -1;
+			}
+			mddev->pers->quiesce(mddev, 0);
+		} else if (fd < 0) {
+			mddev->pers->quiesce(mddev, 1);
 			bitmap_destroy(mddev);
-			fd = -1; /* make sure to put the file */
+			mddev->pers->quiesce(mddev, 0);
 		}
-		mddev->pers->quiesce(mddev, 0);
 	}
 	if (fd < 0) {
 		struct file *f = mddev->bitmap_info.file;
@@ -6918,8 +6922,8 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				mddev->bitmap_info.default_offset;
 			mddev->bitmap_info.space =
 				mddev->bitmap_info.default_space;
-			mddev->pers->quiesce(mddev, 1);
 			bitmap = bitmap_create(mddev, -1);
+			mddev->pers->quiesce(mddev, 1);
 			if (!IS_ERR(bitmap)) {
 				mddev->bitmap = bitmap;
 				rv = bitmap_load(mddev);



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-17  2:46 [md PATCH 0/5] Address deadlock associated with setting suspend_lo NeilBrown
@ 2017-10-17  2:46 ` NeilBrown
  2017-10-18  6:11   ` Shaohua Li
  2017-10-17  2:46 ` [md PATCH 5/5] md: allow metadata update while suspending NeilBrown
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-17  2:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

Most often mddev_suspend() is called with
reconfig_mutex held.  Make this a requirement in
preparation a subsequent patch.

Taking the mutex in r5c_disable_writeback_async() is
a little tricky as this is called from a work queue
via log->disable_writeback_work, and flush_work()
is called on that while holding ->reconfig_mutex.
If the work item hasn't run before flush_work()
is called, the work function will not be able to
get the mutex.

So we use mddev_trylock() inside the wait_event() call, and have that
abort when conf->log is set to NULL, which happens before
flush_work() is called.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/dm-raid.c     |    5 ++++-
 drivers/md/md.c          |    1 +
 drivers/md/raid5-cache.c |   18 +++++++++++++-----
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 2245d06d2045..cc2fed784a5f 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3629,8 +3629,11 @@ static void raid_postsuspend(struct dm_target *ti)
 {
 	struct raid_set *rs = ti->private;
 
-	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
+	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+		mddev_lock_nointr(&rs->md);
 		mddev_suspend(&rs->md);
+		mddev_unlock(&rs->md);
+	}
 
 	rs->md.ro = 1;
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ff1bbf6c90e..04538b60f8f3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -344,6 +344,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 void mddev_suspend(struct mddev *mddev)
 {
 	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
+	lockdep_assert_held(&mddev->reconfig_mutex);
 	if (mddev->suspended++)
 		return;
 	synchronize_rcu();
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 0b7406ac8ce1..6a631dd21f0b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
 	struct r5l_log *log = container_of(work, struct r5l_log,
 					   disable_writeback_work);
 	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	int locked = 0;
 
 	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
 		return;
@@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
 
 	/* wait superblock change before suspend */
 	wait_event(mddev->sb_wait,
-		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
-
-	mddev_suspend(mddev);
-	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
-	mddev_resume(mddev);
+		   conf->log == NULL ||
+		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
+		    (locked = mddev_trylock(mddev))));
+	if (locked) {
+		mddev_suspend(mddev);
+		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+		mddev_resume(mddev);
+		mddev_unlock(mddev);
+	}
 }
 
 static void r5l_submit_current_io(struct r5l_log *log)
@@ -3165,6 +3171,8 @@ void r5l_exit_log(struct r5conf *conf)
 	conf->log = NULL;
 	synchronize_rcu();
 
+	/* Ensure disable_writeback_work wakes up and exits */
+	wake_up(&conf->mddev->sb_wait);
 	flush_work(&log->disable_writeback_work);
 	md_unregister_thread(&log->reclaim_thread);
 	mempool_destroy(log->meta_pool);



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 3/5] md: move suspend_hi/lo handling into core md code
  2017-10-17  2:46 [md PATCH 0/5] Address deadlock associated with setting suspend_lo NeilBrown
                   ` (2 preceding siblings ...)
  2017-10-17  2:46 ` [md PATCH 4/5] md: use mddev_suspend/resume instead of ->quiesce() NeilBrown
@ 2017-10-17  2:46 ` NeilBrown
  2017-10-18  6:16   ` Shaohua Li
  2017-10-17  2:46 ` [md PATCH 2/5] md: don't call bitmap_create() while array is quiesced NeilBrown
  4 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-17  2:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

responding to ->suspend_lo and ->suspend_hi is similar
to responding to ->suspended.  It is best to wait in
the common core code without incrementing ->active_io.
This allows mddev_suspend()/mddev_resume() to work while
requests are waiting for suspend_lo/hi to change.
This is will be important after a subsequent patch
which uses mddev_suspend() to synchronize updating for
suspend_lo/hi.

So move the code for testing suspend_lo/hi out of raid1.c
and raid5.c, and place it in md.c

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c    |   29 +++++++++++++++++++++++------
 drivers/md/raid1.c |   12 ++++--------
 drivers/md/raid5.c |   22 ----------------------
 3 files changed, 27 insertions(+), 36 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 308456842d3e..f8d0e41120cb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -266,16 +266,31 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
  * call has finished, the bio has been linked into some internal structure
  * and so is visible to ->quiesce(), so we don't need the refcount any more.
  */
+static bool is_suspended(struct mddev *mddev, struct bio *bio)
+{
+	if (mddev->suspended)
+		return true;
+	if (bio_data_dir(bio) != WRITE)
+		return false;
+	if (mddev->suspend_lo >= mddev->suspend_hi)
+		return false;
+	if (bio->bi_iter.bi_sector >= mddev->suspend_hi)
+		return false;
+	if (bio_end_sector(bio) < mddev->suspend_lo)
+		return false;
+	return true;
+}
+
 void md_handle_request(struct mddev *mddev, struct bio *bio)
 {
 check_suspended:
 	rcu_read_lock();
-	if (mddev->suspended) {
+	if (is_suspended(mddev, bio)) {
 		DEFINE_WAIT(__wait);
 		for (;;) {
 			prepare_to_wait(&mddev->sb_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
-			if (!mddev->suspended)
+			if (!is_suspended(mddev, bio))
 				break;
 			rcu_read_unlock();
 			schedule();
@@ -4843,10 +4858,11 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
 		goto unlock;
 	old = mddev->suspend_lo;
 	mddev->suspend_lo = new;
-	if (new >= old)
+	if (new >= old) {
 		/* Shrinking suspended region */
+		wake_up(&mddev->sb_wait);
 		mddev->pers->quiesce(mddev, 2);
-	else {
+	} else {
 		/* Expanding suspended region - need to wait */
 		mddev->pers->quiesce(mddev, 1);
 		mddev->pers->quiesce(mddev, 0);
@@ -4886,10 +4902,11 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
 		goto unlock;
 	old = mddev->suspend_hi;
 	mddev->suspend_hi = new;
-	if (new <= old)
+	if (new <= old) {
 		/* Shrinking suspended region */
+		wake_up(&mddev->sb_wait);
 		mddev->pers->quiesce(mddev, 2);
-	else {
+	} else {
 		/* Expanding suspended region - need to wait */
 		mddev->pers->quiesce(mddev, 1);
 		mddev->pers->quiesce(mddev, 0);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index f3f3e40dc9d8..277a145b3ff5 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1310,11 +1310,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 */
 
 
-	if ((bio_end_sector(bio) > mddev->suspend_lo &&
-	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
-	    (mddev_is_clustered(mddev) &&
+	if (mddev_is_clustered(mddev) &&
 	     md_cluster_ops->area_resyncing(mddev, WRITE,
-		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
+		     bio->bi_iter.bi_sector, bio_end_sector(bio))) {
 
 		/*
 		 * As the suspend_* range is controlled by userspace, we want
@@ -1325,12 +1323,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 			sigset_t full, old;
 			prepare_to_wait(&conf->wait_barrier,
 					&w, TASK_INTERRUPTIBLE);
-			if (bio_end_sector(bio) <= mddev->suspend_lo ||
-			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
-			    (mddev_is_clustered(mddev) &&
+			if (mddev_is_clustered(mddev) &&
 			     !md_cluster_ops->area_resyncing(mddev, WRITE,
 				     bio->bi_iter.bi_sector,
-				     bio_end_sector(bio))))
+				     bio_end_sector(bio)))
 				break;
 			sigfillset(&full);
 			sigprocmask(SIG_BLOCK, &full, &old);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 928e24a07133..4f40ccd21cbb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5682,28 +5682,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 				goto retry;
 			}
 
-			if (rw == WRITE &&
-			    logical_sector >= mddev->suspend_lo &&
-			    logical_sector < mddev->suspend_hi) {
-				raid5_release_stripe(sh);
-				/* As the suspend_* range is controlled by
-				 * userspace, we want an interruptible
-				 * wait.
-				 */
-				prepare_to_wait(&conf->wait_for_overlap,
-						&w, TASK_INTERRUPTIBLE);
-				if (logical_sector >= mddev->suspend_lo &&
-				    logical_sector < mddev->suspend_hi) {
-					sigset_t full, old;
-					sigfillset(&full);
-					sigprocmask(SIG_BLOCK, &full, &old);
-					schedule();
-					sigprocmask(SIG_SETMASK, &old, NULL);
-					do_prepare = true;
-				}
-				goto retry;
-			}
-
 			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
 			    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
 				/* Stripe is busy expanding or



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 5/5] md: allow metadata update while suspending.
  2017-10-17  2:46 [md PATCH 0/5] Address deadlock associated with setting suspend_lo NeilBrown
  2017-10-17  2:46 ` [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
@ 2017-10-17  2:46 ` NeilBrown
  2017-10-17  2:46 ` [md PATCH 4/5] md: use mddev_suspend/resume instead of ->quiesce() NeilBrown
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2017-10-17  2:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write request block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held.  So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says the unlocked updates is
allowed, and one which ways it is happening.  Then allow it
while the quiesce completes, and then wait for it to finish.

Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   14 ++++++++++++++
 drivers/md/md.h |    6 ++++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 60a5022bd9bb..63ecfb063b76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -364,8 +364,12 @@ void mddev_suspend(struct mddev *mddev)
 		return;
 	synchronize_rcu();
 	wake_up(&mddev->sb_wait);
+	set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
+	smp_mb__after_atomic();
 	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
 	mddev->pers->quiesce(mddev, 1);
+	clear_bit_unlock(MD_ALLOW_SB_UPDATE, &mddev->flags);
+	wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags));
 
 	del_timer_sync(&mddev->safemode_timer);
 }
@@ -8835,6 +8839,16 @@ void md_check_recovery(struct mddev *mddev)
 	unlock:
 		wake_up(&mddev->sb_wait);
 		mddev_unlock(mddev);
+	} else if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
+		/* Write superblock - thread that called mddev_suspend()
+		 * holds reconfig_mutex for us.
+		 */
+		set_bit(MD_UPDATING_SB, &mddev->flags);
+		smp_mb__after_atomic();
+		if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags))
+			md_update_sb(mddev, 0);
+		clear_bit_unlock(MD_UPDATING_SB, &mddev->flags);
+		wake_up(&mddev->sb_wait);
 	}
 }
 EXPORT_SYMBOL(md_check_recovery);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d8287d3cd1bf..03fc641e5da1 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -237,6 +237,12 @@ enum mddev_flags {
 				 */
 	MD_HAS_PPL,		/* The raid array has PPL feature set */
 	MD_HAS_MULTIPLE_PPLS,	/* The raid array has multiple PPLs feature set */
+	MD_ALLOW_SB_UPDATE,	/* md_check_recovery is allowed to update
+				 * the metadata without taking reconfig_mutex.
+				 */
+	MD_UPDATING_SB,		/* md_check_recovery is updating the metadata
+				 * without explicitly holding reconfig_mutex.
+				 */
 };
 
 enum mddev_sb_flags {



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 4/5] md: use mddev_suspend/resume instead of ->quiesce()
  2017-10-17  2:46 [md PATCH 0/5] Address deadlock associated with setting suspend_lo NeilBrown
  2017-10-17  2:46 ` [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
  2017-10-17  2:46 ` [md PATCH 5/5] md: allow metadata update while suspending NeilBrown
@ 2017-10-17  2:46 ` NeilBrown
  2017-10-17  2:46 ` [md PATCH 3/5] md: move suspend_hi/lo handling into core md code NeilBrown
  2017-10-17  2:46 ` [md PATCH 2/5] md: don't call bitmap_create() while array is quiesced NeilBrown
  4 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2017-10-17  2:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

mddev_suspend() is a more general interface than
calling ->quiesce() and is so more extensible.  A
future patch will make use of this.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f8d0e41120cb..60a5022bd9bb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4864,8 +4864,8 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
 		mddev->pers->quiesce(mddev, 2);
 	} else {
 		/* Expanding suspended region - need to wait */
-		mddev->pers->quiesce(mddev, 1);
-		mddev->pers->quiesce(mddev, 0);
+		mddev_suspend(mddev);
+		mddev_resume(mddev);
 	}
 	err = 0;
 unlock:
@@ -4908,8 +4908,8 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
 		mddev->pers->quiesce(mddev, 2);
 	} else {
 		/* Expanding suspended region - need to wait */
-		mddev->pers->quiesce(mddev, 1);
-		mddev->pers->quiesce(mddev, 0);
+		mddev_suspend(mddev);
+		mddev_resume(mddev);
 	}
 	err = 0;
 unlock:
@@ -6640,7 +6640,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 			struct bitmap *bitmap;
 
 			bitmap = bitmap_create(mddev, -1);
-			mddev->pers->quiesce(mddev, 1);
+			mddev_suspend(mddev);
 			if (!IS_ERR(bitmap)) {
 				mddev->bitmap = bitmap;
 				err = bitmap_load(mddev);
@@ -6650,11 +6650,11 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
 				bitmap_destroy(mddev);
 				fd = -1;
 			}
-			mddev->pers->quiesce(mddev, 0);
+			mddev_resume(mddev);
 		} else if (fd < 0) {
-			mddev->pers->quiesce(mddev, 1);
+			mddev_suspend(mddev);
 			bitmap_destroy(mddev);
-			mddev->pers->quiesce(mddev, 0);
+			mddev_resume(mddev);
 		}
 	}
 	if (fd < 0) {
@@ -6940,7 +6940,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 			mddev->bitmap_info.space =
 				mddev->bitmap_info.default_space;
 			bitmap = bitmap_create(mddev, -1);
-			mddev->pers->quiesce(mddev, 1);
+			mddev_suspend(mddev);
 			if (!IS_ERR(bitmap)) {
 				mddev->bitmap = bitmap;
 				rv = bitmap_load(mddev);
@@ -6948,7 +6948,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				rv = PTR_ERR(bitmap);
 			if (rv)
 				bitmap_destroy(mddev);
-			mddev->pers->quiesce(mddev, 0);
+			mddev_resume(mddev);
 		} else {
 			/* remove the bitmap */
 			if (!mddev->bitmap) {
@@ -6971,9 +6971,9 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
 				mddev->bitmap_info.nodes = 0;
 				md_cluster_ops->leave(mddev);
 			}
-			mddev->pers->quiesce(mddev, 1);
+			mddev_suspend(mddev);
 			bitmap_destroy(mddev);
-			mddev->pers->quiesce(mddev, 0);
+			mddev_resume(mddev);
 			mddev->bitmap_info.offset = 0;
 		}
 	}



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-17  2:46 ` [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
@ 2017-10-18  6:11   ` Shaohua Li
  2017-10-18  7:35     ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2017-10-18  6:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, Oct 17, 2017 at 01:46:43PM +1100, Neil Brown wrote:
> Most often mddev_suspend() is called with
> reconfig_mutex held.  Make this a requirement in
> preparation a subsequent patch.

can we do further, eg, make mddev_resumed() called with the mutex held. That's
symmetrical. It appears only dm-raid.c doesn't hold the mutext for mddev_resume
in a quick scan.
 
> Taking the mutex in r5c_disable_writeback_async() is
> a little tricky as this is called from a work queue
> via log->disable_writeback_work, and flush_work()
> is called on that while holding ->reconfig_mutex.
> If the work item hasn't run before flush_work()
> is called, the work function will not be able to
> get the mutex.
> 
> So we use mddev_trylock() inside the wait_event() call, and have that
> abort when conf->log is set to NULL, which happens before
> flush_work() is called.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/dm-raid.c     |    5 ++++-
>  drivers/md/md.c          |    1 +
>  drivers/md/raid5-cache.c |   18 +++++++++++++-----
>  3 files changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 2245d06d2045..cc2fed784a5f 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -3629,8 +3629,11 @@ static void raid_postsuspend(struct dm_target *ti)
>  {
>  	struct raid_set *rs = ti->private;
>  
> -	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
> +	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
> +		mddev_lock_nointr(&rs->md);
>  		mddev_suspend(&rs->md);
> +		mddev_unlock(&rs->md);
> +	}
>  
>  	rs->md.ro = 1;
>  }
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0ff1bbf6c90e..04538b60f8f3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -344,6 +344,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>  void mddev_suspend(struct mddev *mddev)
>  {
>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
> +	lockdep_assert_held(&mddev->reconfig_mutex);
>  	if (mddev->suspended++)
>  		return;
>  	synchronize_rcu();
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 0b7406ac8ce1..6a631dd21f0b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>  	struct r5l_log *log = container_of(work, struct r5l_log,
>  					   disable_writeback_work);
>  	struct mddev *mddev = log->rdev->mddev;
> +	struct r5conf *conf = mddev->private;
> +	int locked = 0;
>  
>  	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
>  		return;
> @@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>  
>  	/* wait superblock change before suspend */
>  	wait_event(mddev->sb_wait,
> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> -
> -	mddev_suspend(mddev);
> -	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> -	mddev_resume(mddev);
> +		   conf->log == NULL ||
> +		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
> +		    (locked = mddev_trylock(mddev))));

Probably we just bail out if conf->log == NULL. The whole trylock point is for
the exit case, we can handle it separately. The bonus is
r5c_disable_writeback_async will not magically do nothing if the mutex is
already held by others.

Thanks,
Shaohua

> +	if (locked) {
> +		mddev_suspend(mddev);
> +		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> +		mddev_resume(mddev);
> +		mddev_unlock(mddev);
> +	}
>  }
>  
>  static void r5l_submit_current_io(struct r5l_log *log)
> @@ -3165,6 +3171,8 @@ void r5l_exit_log(struct r5conf *conf)
>  	conf->log = NULL;
>  	synchronize_rcu();
>  
> +	/* Ensure disable_writeback_work wakes up and exits */
> +	wake_up(&conf->mddev->sb_wait);
>  	flush_work(&log->disable_writeback_work);
>  	md_unregister_thread(&log->reclaim_thread);
>  	mempool_destroy(log->meta_pool);
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 3/5] md: move suspend_hi/lo handling into core md code
  2017-10-17  2:46 ` [md PATCH 3/5] md: move suspend_hi/lo handling into core md code NeilBrown
@ 2017-10-18  6:16   ` Shaohua Li
  2017-10-18  7:40     ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2017-10-18  6:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, Oct 17, 2017 at 01:46:43PM +1100, Neil Brown wrote:
> responding to ->suspend_lo and ->suspend_hi is similar
> to responding to ->suspended.  It is best to wait in
> the common core code without incrementing ->active_io.
> This allows mddev_suspend()/mddev_resume() to work while
> requests are waiting for suspend_lo/hi to change.
> This is will be important after a subsequent patch
> which uses mddev_suspend() to synchronize updating for
> suspend_lo/hi.
> 
> So move the code for testing suspend_lo/hi out of raid1.c
> and raid5.c, and place it in md.c
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/md.c    |   29 +++++++++++++++++++++++------
>  drivers/md/raid1.c |   12 ++++--------
>  drivers/md/raid5.c |   22 ----------------------
>  3 files changed, 27 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 308456842d3e..f8d0e41120cb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -266,16 +266,31 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
>   * call has finished, the bio has been linked into some internal structure
>   * and so is visible to ->quiesce(), so we don't need the refcount any more.
>   */
> +static bool is_suspended(struct mddev *mddev, struct bio *bio)
> +{
> +	if (mddev->suspended)
> +		return true;
> +	if (bio_data_dir(bio) != WRITE)
> +		return false;
> +	if (mddev->suspend_lo >= mddev->suspend_hi)
> +		return false;
> +	if (bio->bi_iter.bi_sector >= mddev->suspend_hi)
> +		return false;
> +	if (bio_end_sector(bio) < mddev->suspend_lo)
> +		return false;
> +	return true;
> +}
> +
>  void md_handle_request(struct mddev *mddev, struct bio *bio)
>  {
>  check_suspended:
>  	rcu_read_lock();
> -	if (mddev->suspended) {
> +	if (is_suspended(mddev, bio)) {
>  		DEFINE_WAIT(__wait);
>  		for (;;) {
>  			prepare_to_wait(&mddev->sb_wait, &__wait,
>  					TASK_UNINTERRUPTIBLE);
> -			if (!mddev->suspended)
> +			if (!is_suspended(mddev, bio))
>  				break;
>  			rcu_read_unlock();
>  			schedule();
> @@ -4843,10 +4858,11 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
>  		goto unlock;
>  	old = mddev->suspend_lo;
>  	mddev->suspend_lo = new;
> -	if (new >= old)
> +	if (new >= old) {
>  		/* Shrinking suspended region */
> +		wake_up(&mddev->sb_wait);
>  		mddev->pers->quiesce(mddev, 2);

Do we still need the quiesce(mddev, 2)? sounds not to me.

> -	else {
> +	} else {
>  		/* Expanding suspended region - need to wait */
>  		mddev->pers->quiesce(mddev, 1);
>  		mddev->pers->quiesce(mddev, 0);
> @@ -4886,10 +4902,11 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
>  		goto unlock;
>  	old = mddev->suspend_hi;
>  	mddev->suspend_hi = new;
> -	if (new <= old)
> +	if (new <= old) {
>  		/* Shrinking suspended region */
> +		wake_up(&mddev->sb_wait);
>  		mddev->pers->quiesce(mddev, 2);
> -	else {
> +	} else {
>  		/* Expanding suspended region - need to wait */
>  		mddev->pers->quiesce(mddev, 1);
>  		mddev->pers->quiesce(mddev, 0);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index f3f3e40dc9d8..277a145b3ff5 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1310,11 +1310,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  	 */
>  
>  
> -	if ((bio_end_sector(bio) > mddev->suspend_lo &&
> -	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
> -	    (mddev_is_clustered(mddev) &&
> +	if (mddev_is_clustered(mddev) &&
>  	     md_cluster_ops->area_resyncing(mddev, WRITE,
> -		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
> +		     bio->bi_iter.bi_sector, bio_end_sector(bio))) {
>  
>  		/*
>  		 * As the suspend_* range is controlled by userspace, we want
> @@ -1325,12 +1323,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  			sigset_t full, old;
>  			prepare_to_wait(&conf->wait_barrier,
>  					&w, TASK_INTERRUPTIBLE);
> -			if (bio_end_sector(bio) <= mddev->suspend_lo ||
> -			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
> -			    (mddev_is_clustered(mddev) &&
> +			if (mddev_is_clustered(mddev) &&
>  			     !md_cluster_ops->area_resyncing(mddev, WRITE,
>  				     bio->bi_iter.bi_sector,
> -				     bio_end_sector(bio))))
> +				     bio_end_sector(bio)))
>  				break;
>  			sigfillset(&full);
>  			sigprocmask(SIG_BLOCK, &full, &old);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 928e24a07133..4f40ccd21cbb 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5682,28 +5682,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>  				goto retry;
>  			}
>  
> -			if (rw == WRITE &&
> -			    logical_sector >= mddev->suspend_lo &&
> -			    logical_sector < mddev->suspend_hi) {
> -				raid5_release_stripe(sh);
> -				/* As the suspend_* range is controlled by
> -				 * userspace, we want an interruptible
> -				 * wait.
> -				 */
> -				prepare_to_wait(&conf->wait_for_overlap,
> -						&w, TASK_INTERRUPTIBLE);
> -				if (logical_sector >= mddev->suspend_lo &&
> -				    logical_sector < mddev->suspend_hi) {
> -					sigset_t full, old;
> -					sigfillset(&full);
> -					sigprocmask(SIG_BLOCK, &full, &old);
> -					schedule();
> -					sigprocmask(SIG_SETMASK, &old, NULL);
> -					do_prepare = true;
> -				}
> -				goto retry;
> -			}
> -
>  			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
>  			    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
>  				/* Stripe is busy expanding or

I think we can remove the state == 2 branch in raid5_quiesce().

This leaves only raid1 and md-cluster needs the quiesce(2) calls. In the
future, we probably should abstract that md-cluster logic to separate API and
let raid1 use it for waitting. The quiesce(2) is a strange name for the wait
purpose.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-18  6:11   ` Shaohua Li
@ 2017-10-18  7:35     ` NeilBrown
  2017-10-19  1:17       ` [md PATCH 1/5 v2] " NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-18  7:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4902 bytes --]

On Tue, Oct 17 2017, Shaohua Li wrote:

> On Tue, Oct 17, 2017 at 01:46:43PM +1100, Neil Brown wrote:
>> Most often mddev_suspend() is called with
>> reconfig_mutex held.  Make this a requirement in
>> preparation a subsequent patch.
>
> can we do further, eg, make mddev_resumed() called with the mutex held. That's
> symmetrical. It appears only dm-raid.c doesn't hold the mutext for mddev_resume
> in a quick scan.
>  
>> Taking the mutex in r5c_disable_writeback_async() is
>> a little tricky as this is called from a work queue
>> via log->disable_writeback_work, and flush_work()
>> is called on that while holding ->reconfig_mutex.
>> If the work item hasn't run before flush_work()
>> is called, the work function will not be able to
>> get the mutex.
>> 
>> So we use mddev_trylock() inside the wait_event() call, and have that
>> abort when conf->log is set to NULL, which happens before
>> flush_work() is called.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/dm-raid.c     |    5 ++++-
>>  drivers/md/md.c          |    1 +
>>  drivers/md/raid5-cache.c |   18 +++++++++++++-----
>>  3 files changed, 18 insertions(+), 6 deletions(-)
>> 
>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>> index 2245d06d2045..cc2fed784a5f 100644
>> --- a/drivers/md/dm-raid.c
>> +++ b/drivers/md/dm-raid.c
>> @@ -3629,8 +3629,11 @@ static void raid_postsuspend(struct dm_target *ti)
>>  {
>>  	struct raid_set *rs = ti->private;
>>  
>> -	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
>> +	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
>> +		mddev_lock_nointr(&rs->md);
>>  		mddev_suspend(&rs->md);
>> +		mddev_unlock(&rs->md);
>> +	}
>>  
>>  	rs->md.ro = 1;
>>  }
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 0ff1bbf6c90e..04538b60f8f3 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -344,6 +344,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
>>  void mddev_suspend(struct mddev *mddev)
>>  {
>>  	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
>> +	lockdep_assert_held(&mddev->reconfig_mutex);
>>  	if (mddev->suspended++)
>>  		return;
>>  	synchronize_rcu();
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 0b7406ac8ce1..6a631dd21f0b 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>  	struct r5l_log *log = container_of(work, struct r5l_log,
>>  					   disable_writeback_work);
>>  	struct mddev *mddev = log->rdev->mddev;
>> +	struct r5conf *conf = mddev->private;
>> +	int locked = 0;
>>  
>>  	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
>>  		return;
>> @@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>  
>>  	/* wait superblock change before suspend */
>>  	wait_event(mddev->sb_wait,
>> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
>> -
>> -	mddev_suspend(mddev);
>> -	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> -	mddev_resume(mddev);
>> +		   conf->log == NULL ||
>> +		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
>> +		    (locked = mddev_trylock(mddev))));
>
> Probably we just bail out if conf->log == NULL.

That is exactly what this code does.  If ->log is NULL or ever becomes
NULL, we bail out.

>                                                  The whole trylock point is for
> the exit case, we can handle it separately. The bonus is
> r5c_disable_writeback_async will not magically do nothing if the mutex is
> already held by others.

I don't understand... the try_lock is so we can wait for the lock, or
for other things.  The "wait_event()" waits until it can get the lock,
or it doesn't need to. r5c_disable_writeback_async
will not "do nothing if the mutex is already held by others"

Though I just noticed ->sb_wait doesn't get woken when the mddev is
unlocked. 
I could just add a wake_up in mddev_unlock but that probably isn't
a good idea.  I'll have a proper look...

So drop this patch for now, thanks.

NeilBrown



>
> Thanks,
> Shaohua
>
>> +	if (locked) {
>> +		mddev_suspend(mddev);
>> +		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> +		mddev_resume(mddev);
>> +		mddev_unlock(mddev);
>> +	}
>>  }
>>  
>>  static void r5l_submit_current_io(struct r5l_log *log)
>> @@ -3165,6 +3171,8 @@ void r5l_exit_log(struct r5conf *conf)
>>  	conf->log = NULL;
>>  	synchronize_rcu();
>>  
>> +	/* Ensure disable_writeback_work wakes up and exits */
>> +	wake_up(&conf->mddev->sb_wait);
>>  	flush_work(&log->disable_writeback_work);
>>  	md_unregister_thread(&log->reclaim_thread);
>>  	mempool_destroy(log->meta_pool);
>> 
>> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 3/5] md: move suspend_hi/lo handling into core md code
  2017-10-18  6:16   ` Shaohua Li
@ 2017-10-18  7:40     ` NeilBrown
  2017-10-19  1:49       ` [md PATCH 6/5] md: remove special meaning of ->quiesce(.., 2) NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-18  7:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 6272 bytes --]

On Tue, Oct 17 2017, Shaohua Li wrote:

> On Tue, Oct 17, 2017 at 01:46:43PM +1100, Neil Brown wrote:
>> responding to ->suspend_lo and ->suspend_hi is similar
>> to responding to ->suspended.  It is best to wait in
>> the common core code without incrementing ->active_io.
>> This allows mddev_suspend()/mddev_resume() to work while
>> requests are waiting for suspend_lo/hi to change.
>> This is will be important after a subsequent patch
>> which uses mddev_suspend() to synchronize updating for
>> suspend_lo/hi.
>> 
>> So move the code for testing suspend_lo/hi out of raid1.c
>> and raid5.c, and place it in md.c
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  drivers/md/md.c    |   29 +++++++++++++++++++++++------
>>  drivers/md/raid1.c |   12 ++++--------
>>  drivers/md/raid5.c |   22 ----------------------
>>  3 files changed, 27 insertions(+), 36 deletions(-)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 308456842d3e..f8d0e41120cb 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -266,16 +266,31 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
>>   * call has finished, the bio has been linked into some internal structure
>>   * and so is visible to ->quiesce(), so we don't need the refcount any more.
>>   */
>> +static bool is_suspended(struct mddev *mddev, struct bio *bio)
>> +{
>> +	if (mddev->suspended)
>> +		return true;
>> +	if (bio_data_dir(bio) != WRITE)
>> +		return false;
>> +	if (mddev->suspend_lo >= mddev->suspend_hi)
>> +		return false;
>> +	if (bio->bi_iter.bi_sector >= mddev->suspend_hi)
>> +		return false;
>> +	if (bio_end_sector(bio) < mddev->suspend_lo)
>> +		return false;
>> +	return true;
>> +}
>> +
>>  void md_handle_request(struct mddev *mddev, struct bio *bio)
>>  {
>>  check_suspended:
>>  	rcu_read_lock();
>> -	if (mddev->suspended) {
>> +	if (is_suspended(mddev, bio)) {
>>  		DEFINE_WAIT(__wait);
>>  		for (;;) {
>>  			prepare_to_wait(&mddev->sb_wait, &__wait,
>>  					TASK_UNINTERRUPTIBLE);
>> -			if (!mddev->suspended)
>> +			if (!is_suspended(mddev, bio))
>>  				break;
>>  			rcu_read_unlock();
>>  			schedule();
>> @@ -4843,10 +4858,11 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
>>  		goto unlock;
>>  	old = mddev->suspend_lo;
>>  	mddev->suspend_lo = new;
>> -	if (new >= old)
>> +	if (new >= old) {
>>  		/* Shrinking suspended region */
>> +		wake_up(&mddev->sb_wait);
>>  		mddev->pers->quiesce(mddev, 2);
>
> Do we still need the quiesce(mddev, 2)? sounds not to me.

No, I don't think we do - thanks for noticing that;.

>
>> -	else {
>> +	} else {
>>  		/* Expanding suspended region - need to wait */
>>  		mddev->pers->quiesce(mddev, 1);
>>  		mddev->pers->quiesce(mddev, 0);
>> @@ -4886,10 +4902,11 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
>>  		goto unlock;
>>  	old = mddev->suspend_hi;
>>  	mddev->suspend_hi = new;
>> -	if (new <= old)
>> +	if (new <= old) {
>>  		/* Shrinking suspended region */
>> +		wake_up(&mddev->sb_wait);
>>  		mddev->pers->quiesce(mddev, 2);
>> -	else {
>> +	} else {
>>  		/* Expanding suspended region - need to wait */
>>  		mddev->pers->quiesce(mddev, 1);
>>  		mddev->pers->quiesce(mddev, 0);
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index f3f3e40dc9d8..277a145b3ff5 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1310,11 +1310,9 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>  	 */
>>  
>>  
>> -	if ((bio_end_sector(bio) > mddev->suspend_lo &&
>> -	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
>> -	    (mddev_is_clustered(mddev) &&
>> +	if (mddev_is_clustered(mddev) &&
>>  	     md_cluster_ops->area_resyncing(mddev, WRITE,
>> -		     bio->bi_iter.bi_sector, bio_end_sector(bio)))) {
>> +		     bio->bi_iter.bi_sector, bio_end_sector(bio))) {
>>  
>>  		/*
>>  		 * As the suspend_* range is controlled by userspace, we want
>> @@ -1325,12 +1323,10 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>  			sigset_t full, old;
>>  			prepare_to_wait(&conf->wait_barrier,
>>  					&w, TASK_INTERRUPTIBLE);
>> -			if (bio_end_sector(bio) <= mddev->suspend_lo ||
>> -			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
>> -			    (mddev_is_clustered(mddev) &&
>> +			if (mddev_is_clustered(mddev) &&
>>  			     !md_cluster_ops->area_resyncing(mddev, WRITE,
>>  				     bio->bi_iter.bi_sector,
>> -				     bio_end_sector(bio))))
>> +				     bio_end_sector(bio)))
>>  				break;
>>  			sigfillset(&full);
>>  			sigprocmask(SIG_BLOCK, &full, &old);
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 928e24a07133..4f40ccd21cbb 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5682,28 +5682,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>>  				goto retry;
>>  			}
>>  
>> -			if (rw == WRITE &&
>> -			    logical_sector >= mddev->suspend_lo &&
>> -			    logical_sector < mddev->suspend_hi) {
>> -				raid5_release_stripe(sh);
>> -				/* As the suspend_* range is controlled by
>> -				 * userspace, we want an interruptible
>> -				 * wait.
>> -				 */
>> -				prepare_to_wait(&conf->wait_for_overlap,
>> -						&w, TASK_INTERRUPTIBLE);
>> -				if (logical_sector >= mddev->suspend_lo &&
>> -				    logical_sector < mddev->suspend_hi) {
>> -					sigset_t full, old;
>> -					sigfillset(&full);
>> -					sigprocmask(SIG_BLOCK, &full, &old);
>> -					schedule();
>> -					sigprocmask(SIG_SETMASK, &old, NULL);
>> -					do_prepare = true;
>> -				}
>> -				goto retry;
>> -			}
>> -
>>  			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
>>  			    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
>>  				/* Stripe is busy expanding or
>
> I think we can remove the state == 2 branch in raid5_quiesce().
>

Yes.


> This leaves only raid1 and md-cluster needs the quiesce(2) calls. In the
> future, we probably should abstract that md-cluster logic to separate API and
> let raid1 use it for waitting. The quiesce(2) is a strange name for the wait
> purpose.

Agreed.  I'll try to find a nice solution.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [md PATCH 1/5 v2] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-18  7:35     ` NeilBrown
@ 2017-10-19  1:17       ` NeilBrown
  2017-10-19  3:45         ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-19  1:17 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5066 bytes --]


Most often mddev_suspend() is called with
reconfig_mutex held.  Make this a requirement in
preparation a subsequent patch.  Also require
reconfig_mutex to be held for mddev_resume(),
partly for symmetry and partly to guarantee
no races with incr/decr of mddev->suspend.

Taking the mutex in r5c_disable_writeback_async() is
a little tricky as this is called from a work queue
via log->disable_writeback_work, and flush_work()
is called on that while holding ->reconfig_mutex.
If the work item hasn't run before flush_work()
is called, the work function will not be able to
get the mutex.

So we use mddev_trylock() inside the wait_event() call, and have that
abort when conf->log is set to NULL, which happens before
flush_work() is called.
We wait in mddev->sb_wait and ensure this is woken
when any of the conditions change.  This requires
waking mddev->sb_wait in mddev_unlock().  This is only
like to trigger extra wake_ups of threads that needn't
be woken when metadata is being written, and that
doesn't happen often enough that the cost would be
noticeable.

Signed-off-by: NeilBrown <neilb@suse.com>
---

I decided I would just wake_up(&mddev->sb_wait) in mddev_unlock().
It isn't that bad, and every other idea I came up with was worse.

I also now require reconfig_mutex on mddev_resume().  That
ensures the mddev->suspended counted never has races.

Thanks,
NeilBrown


 drivers/md/dm-raid.c     | 10 ++++++++--
 drivers/md/md.c          |  3 +++
 drivers/md/raid5-cache.c | 18 +++++++++++++-----
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 2245d06d2045..49d2193602a5 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3629,8 +3629,11 @@ static void raid_postsuspend(struct dm_target *ti)
 {
 	struct raid_set *rs = ti->private;
 
-	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
+	if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+		mddev_lock_nointr(&rs->md);
 		mddev_suspend(&rs->md);
+		mddev_unlock(&rs->md);
+	}
 
 	rs->md.ro = 1;
 }
@@ -3887,8 +3890,11 @@ static void raid_resume(struct dm_target *ti)
 	if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS))
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 
-	if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags))
+	if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
+		mddev_lock_nointr(mddev);
 		mddev_resume(mddev);
+		mddev_unlock(mddev);
+	}
 }
 
 static struct target_type raid_target = {
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ff1bbf6c90e..b664dc252106 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -344,6 +344,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
 void mddev_suspend(struct mddev *mddev)
 {
 	WARN_ON_ONCE(mddev->thread && current == mddev->thread->tsk);
+	lockdep_assert_held(&mddev->reconfig_mutex);
 	if (mddev->suspended++)
 		return;
 	synchronize_rcu();
@@ -357,6 +358,7 @@ EXPORT_SYMBOL_GPL(mddev_suspend);
 
 void mddev_resume(struct mddev *mddev)
 {
+	lockdep_assert_held(&mddev->reconfig_mutex);
 	if (--mddev->suspended)
 		return;
 	wake_up(&mddev->sb_wait);
@@ -663,6 +665,7 @@ void mddev_unlock(struct mddev *mddev)
 	 */
 	spin_lock(&pers_lock);
 	md_wakeup_thread(mddev->thread);
+	wake_up(&mddev->sb_wait);
 	spin_unlock(&pers_lock);
 }
 EXPORT_SYMBOL_GPL(mddev_unlock);
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 0b7406ac8ce1..6a631dd21f0b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
 	struct r5l_log *log = container_of(work, struct r5l_log,
 					   disable_writeback_work);
 	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	int locked = 0;
 
 	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
 		return;
@@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
 
 	/* wait superblock change before suspend */
 	wait_event(mddev->sb_wait,
-		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
-
-	mddev_suspend(mddev);
-	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
-	mddev_resume(mddev);
+		   conf->log == NULL ||
+		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
+		    (locked = mddev_trylock(mddev))));
+	if (locked) {
+		mddev_suspend(mddev);
+		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
+		mddev_resume(mddev);
+		mddev_unlock(mddev);
+	}
 }
 
 static void r5l_submit_current_io(struct r5l_log *log)
@@ -3165,6 +3171,8 @@ void r5l_exit_log(struct r5conf *conf)
 	conf->log = NULL;
 	synchronize_rcu();
 
+	/* Ensure disable_writeback_work wakes up and exits */
+	wake_up(&conf->mddev->sb_wait);
 	flush_work(&log->disable_writeback_work);
 	md_unregister_thread(&log->reclaim_thread);
 	mempool_destroy(log->meta_pool);
-- 
2.14.0.rc0.dirty


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [md PATCH 6/5] md: remove special meaning of ->quiesce(.., 2)
  2017-10-18  7:40     ` NeilBrown
@ 2017-10-19  1:49       ` NeilBrown
  0 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2017-10-19  1:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 9986 bytes --]


The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.

Suggested-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.com>
---

As suggested, this remove the special meaning of '2'.

Thanks,
NeilBrown

 drivers/md/md-cluster.c  |  6 +++---
 drivers/md/md.c          | 34 ++++++++++------------------------
 drivers/md/md.h          |  9 ++++-----
 drivers/md/raid0.c       |  2 +-
 drivers/md/raid1.c       | 13 +++----------
 drivers/md/raid10.c      | 10 +++-------
 drivers/md/raid5-cache.c | 12 ++++++------
 drivers/md/raid5-log.h   |  2 +-
 drivers/md/raid5.c       | 18 ++++++------------
 9 files changed, 37 insertions(+), 69 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 03082e17c65c..72ce0bccc865 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -442,10 +442,11 @@ static void __remove_suspend_info(struct md_cluster_info *cinfo, int slot)
 static void remove_suspend_info(struct mddev *mddev, int slot)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
+	mddev->pers->quiesce(mddev, 1);
 	spin_lock_irq(&cinfo->suspend_lock);
 	__remove_suspend_info(cinfo, slot);
 	spin_unlock_irq(&cinfo->suspend_lock);
-	mddev->pers->quiesce(mddev, 2);
+	mddev->pers->quiesce(mddev, 0);
 }
 
 
@@ -492,13 +493,12 @@ static void process_suspend_info(struct mddev *mddev,
 	s->lo = lo;
 	s->hi = hi;
 	mddev->pers->quiesce(mddev, 1);
-	mddev->pers->quiesce(mddev, 0);
 	spin_lock_irq(&cinfo->suspend_lock);
 	/* Remove existing entry (if exists) before adding */
 	__remove_suspend_info(cinfo, slot);
 	list_add(&s->list, &cinfo->suspend_list);
 	spin_unlock_irq(&cinfo->suspend_lock);
-	mddev->pers->quiesce(mddev, 2);
+	mddev->pers->quiesce(mddev, 0);
 }
 
 static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4be3140adfe8..e1e7e8dc6878 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4857,7 +4857,7 @@ suspend_lo_show(struct mddev *mddev, char *page)
 static ssize_t
 suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
 {
-	unsigned long long old, new;
+	unsigned long long new;
 	int err;
 
 	err = kstrtoull(buf, 10, &new);
@@ -4873,17 +4873,10 @@ suspend_lo_store(struct mddev *mddev, const char *buf, size_t len)
 	if (mddev->pers == NULL ||
 	    mddev->pers->quiesce == NULL)
 		goto unlock;
-	old = mddev->suspend_lo;
+	mddev_suspend(mddev);
 	mddev->suspend_lo = new;
-	if (new >= old) {
-		/* Shrinking suspended region */
-		wake_up(&mddev->sb_wait);
-		mddev->pers->quiesce(mddev, 2);
-	} else {
-		/* Expanding suspended region - need to wait */
-		mddev_suspend(mddev);
-		mddev_resume(mddev);
-	}
+	mddev_resume(mddev);
+
 	err = 0;
 unlock:
 	mddev_unlock(mddev);
@@ -4901,7 +4894,7 @@ suspend_hi_show(struct mddev *mddev, char *page)
 static ssize_t
 suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
 {
-	unsigned long long old, new;
+	unsigned long long new;
 	int err;
 
 	err = kstrtoull(buf, 10, &new);
@@ -4914,20 +4907,13 @@ suspend_hi_store(struct mddev *mddev, const char *buf, size_t len)
 	if (err)
 		return err;
 	err = -EINVAL;
-	if (mddev->pers == NULL ||
-	    mddev->pers->quiesce == NULL)
+	if (mddev->pers == NULL)
 		goto unlock;
-	old = mddev->suspend_hi;
+
+	mddev_suspend(mddev);
 	mddev->suspend_hi = new;
-	if (new <= old) {
-		/* Shrinking suspended region */
-		wake_up(&mddev->sb_wait);
-		mddev->pers->quiesce(mddev, 2);
-	} else {
-		/* Expanding suspended region - need to wait */
-		mddev_suspend(mddev);
-		mddev_resume(mddev);
-	}
+	mddev_resume(mddev);
+
 	err = 0;
 unlock:
 	mddev_unlock(mddev);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 8c2158f3bd59..5dcdba84932f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -545,12 +545,11 @@ struct md_personality
 	int (*check_reshape) (struct mddev *mddev);
 	int (*start_reshape) (struct mddev *mddev);
 	void (*finish_reshape) (struct mddev *mddev);
-	/* quiesce moves between quiescence states
-	 * 0 - fully active
-	 * 1 - no new requests allowed
-	 * others - reserved
+	/* quiesce suspends or resumes internal processing.
+	 * 1 - stop new actions and wait for action io to complete
+	 * 0 - return to normal behaviour
 	 */
-	void (*quiesce) (struct mddev *mddev, int state);
+	void (*quiesce) (struct mddev *mddev, int quiesce);
 	/* takeover is used to transition an array from one
 	 * personality to another.  The new personality must be able
 	 * to handle the data in the current layout.
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 5a00fc118470..5ecba9eef441 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -768,7 +768,7 @@ static void *raid0_takeover(struct mddev *mddev)
 	return ERR_PTR(-EINVAL);
 }
 
-static void raid0_quiesce(struct mddev *mddev, int state)
+static void raid0_quiesce(struct mddev *mddev, int quiesce)
 {
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 277a145b3ff5..5f21cb9ac778 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3273,21 +3273,14 @@ static int raid1_reshape(struct mddev *mddev)
 	return 0;
 }
 
-static void raid1_quiesce(struct mddev *mddev, int state)
+static void raid1_quiesce(struct mddev *mddev, int quiesce)
 {
 	struct r1conf *conf = mddev->private;
 
-	switch(state) {
-	case 2: /* wake for suspend */
-		wake_up(&conf->wait_barrier);
-		break;
-	case 1:
+	if (quiesce)
 		freeze_array(conf, 0);
-		break;
-	case 0:
+	else
 		unfreeze_array(conf);
-		break;
-	}
 }
 
 static void *raid1_takeover(struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 374df5796649..76f042bf9a57 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3832,18 +3832,14 @@ static void raid10_free(struct mddev *mddev, void *priv)
 	kfree(conf);
 }
 
-static void raid10_quiesce(struct mddev *mddev, int state)
+static void raid10_quiesce(struct mddev *mddev, int quiesce)
 {
 	struct r10conf *conf = mddev->private;
 
-	switch(state) {
-	case 1:
+	if (quiesce)
 		raise_barrier(conf, 0);
-		break;
-	case 0:
+	else
 		lower_barrier(conf);
-		break;
-	}
 }
 
 static int raid10_resize(struct mddev *mddev, sector_t sectors)
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 6a631dd21f0b..fce290878eb5 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1589,21 +1589,21 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 	md_wakeup_thread(log->reclaim_thread);
 }
 
-void r5l_quiesce(struct r5l_log *log, int state)
+void r5l_quiesce(struct r5l_log *log, int quiesce)
 {
 	struct mddev *mddev;
-	if (!log || state == 2)
+	if (!log)
 		return;
-	if (state == 0)
-		kthread_unpark(log->reclaim_thread->tsk);
-	else if (state == 1) {
+
+	if (quiesce) {
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
 		kthread_park(log->reclaim_thread->tsk);
 		r5l_wake_reclaim(log, MaxSector);
 		r5l_do_reclaim(log);
-	}
+	} else
+		kthread_unpark(log->reclaim_thread->tsk);
 }
 
 bool r5l_log_disk_error(struct r5conf *conf)
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 328d67aedda4..c3596a27a5a8 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -8,7 +8,7 @@ extern void r5l_write_stripe_run(struct r5l_log *log);
 extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
 extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
-extern void r5l_quiesce(struct r5l_log *log, int state);
+extern void r5l_quiesce(struct r5l_log *log, int quiesce);
 extern bool r5l_log_disk_error(struct r5conf *conf);
 extern bool r5c_is_writeback(struct r5l_log *log);
 extern int
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 89ad79614309..13db3d31e983 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8026,16 +8026,12 @@ static void raid5_finish_reshape(struct mddev *mddev)
 	}
 }
 
-static void raid5_quiesce(struct mddev *mddev, int state)
+static void raid5_quiesce(struct mddev *mddev, int quiesce)
 {
 	struct r5conf *conf = mddev->private;
 
-	switch(state) {
-	case 2: /* resume for a suspend */
-		wake_up(&conf->wait_for_overlap);
-		break;
-
-	case 1: /* stop all writes */
+	if (quiesce) {
+		/* stop all writes */
 		lock_all_device_hash_locks_irq(conf);
 		/* '2' tells resync/reshape to pause so that all
 		 * active stripes can drain
@@ -8051,17 +8047,15 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 		unlock_all_device_hash_locks_irq(conf);
 		/* allow reshape to continue */
 		wake_up(&conf->wait_for_overlap);
-		break;
-
-	case 0: /* re-enable writes */
+	} else {
+		/* re-enable writes */
 		lock_all_device_hash_locks_irq(conf);
 		conf->quiesce = 0;
 		wake_up(&conf->wait_for_quiescent);
 		wake_up(&conf->wait_for_overlap);
 		unlock_all_device_hash_locks_irq(conf);
-		break;
 	}
-	r5l_quiesce(conf->log, state);
+	r5l_quiesce(conf->log, quiesce);
 }
 
 static void *raid45_takeover_raid0(struct mddev *mddev, int level)
-- 
2.14.0.rc0.dirty


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5 v2] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-19  1:17       ` [md PATCH 1/5 v2] " NeilBrown
@ 2017-10-19  3:45         ` Shaohua Li
  2017-10-19  6:29           ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2017-10-19  3:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Oct 19, 2017 at 12:17:16PM +1100, Neil Brown wrote:
> 
> Most often mddev_suspend() is called with
> reconfig_mutex held.  Make this a requirement in
> preparation a subsequent patch.  Also require
> reconfig_mutex to be held for mddev_resume(),
> partly for symmetry and partly to guarantee
> no races with incr/decr of mddev->suspend.
> 
> Taking the mutex in r5c_disable_writeback_async() is
> a little tricky as this is called from a work queue
> via log->disable_writeback_work, and flush_work()
> is called on that while holding ->reconfig_mutex.
> If the work item hasn't run before flush_work()
> is called, the work function will not be able to
> get the mutex.
> 
> So we use mddev_trylock() inside the wait_event() call, and have that
> abort when conf->log is set to NULL, which happens before
> flush_work() is called.
> We wait in mddev->sb_wait and ensure this is woken
> when any of the conditions change.  This requires
> waking mddev->sb_wait in mddev_unlock().  This is only
> like to trigger extra wake_ups of threads that needn't
> be woken when metadata is being written, and that
> doesn't happen often enough that the cost would be
> noticeable.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>

--snip--

Applied the other patches in the series including the 'remove quiesce(2)' one.

> index 0b7406ac8ce1..6a631dd21f0b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>  	struct r5l_log *log = container_of(work, struct r5l_log,
>  					   disable_writeback_work);
>  	struct mddev *mddev = log->rdev->mddev;
> +	struct r5conf *conf = mddev->private;
> +	int locked = 0;
>  
>  	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
>  		return;
> @@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>  
>  	/* wait superblock change before suspend */
>  	wait_event(mddev->sb_wait,
> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> -
> -	mddev_suspend(mddev);
> -	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> -	mddev_resume(mddev);
> +		   conf->log == NULL ||
> +		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
> +		    (locked = mddev_trylock(mddev))));
> +	if (locked) {
> +		mddev_suspend(mddev);
> +		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> +		mddev_resume(mddev);
> +		mddev_unlock(mddev);
> +	}

For this one, my point is:

	wait_event(mddev->sb_wait, conf->log == NULL ||
		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
	if (conf->log == NULL)
		return;

	mddev_suspend(mddev);
	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
	mddev_resume(mddev);

does it work?

>  }
>  
>  static void r5l_submit_current_io(struct r5l_log *log)
> @@ -3165,6 +3171,8 @@ void r5l_exit_log(struct r5conf *conf)
>  	conf->log = NULL;
>  	synchronize_rcu();
>  
> +	/* Ensure disable_writeback_work wakes up and exits */
> +	wake_up(&conf->mddev->sb_wait);
>  	flush_work(&log->disable_writeback_work);
>  	md_unregister_thread(&log->reclaim_thread);
>  	mempool_destroy(log->meta_pool);
> -- 
> 2.14.0.rc0.dirty
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5 v2] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-19  3:45         ` Shaohua Li
@ 2017-10-19  6:29           ` NeilBrown
  2017-10-20  4:37             ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-19  6:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3311 bytes --]

On Wed, Oct 18 2017, Shaohua Li wrote:

> On Thu, Oct 19, 2017 at 12:17:16PM +1100, Neil Brown wrote:
>> 
>> Most often mddev_suspend() is called with
>> reconfig_mutex held.  Make this a requirement in
>> preparation a subsequent patch.  Also require
>> reconfig_mutex to be held for mddev_resume(),
>> partly for symmetry and partly to guarantee
>> no races with incr/decr of mddev->suspend.
>> 
>> Taking the mutex in r5c_disable_writeback_async() is
>> a little tricky as this is called from a work queue
>> via log->disable_writeback_work, and flush_work()
>> is called on that while holding ->reconfig_mutex.
>> If the work item hasn't run before flush_work()
>> is called, the work function will not be able to
>> get the mutex.
>> 
>> So we use mddev_trylock() inside the wait_event() call, and have that
>> abort when conf->log is set to NULL, which happens before
>> flush_work() is called.
>> We wait in mddev->sb_wait and ensure this is woken
>> when any of the conditions change.  This requires
>> waking mddev->sb_wait in mddev_unlock().  This is only
>> like to trigger extra wake_ups of threads that needn't
>> be woken when metadata is being written, and that
>> doesn't happen often enough that the cost would be
>> noticeable.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>
> --snip--
>
> Applied the other patches in the series including the 'remove quiesce(2)' one.

Thanks.

>
>> index 0b7406ac8ce1..6a631dd21f0b 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>  	struct r5l_log *log = container_of(work, struct r5l_log,
>>  					   disable_writeback_work);
>>  	struct mddev *mddev = log->rdev->mddev;
>> +	struct r5conf *conf = mddev->private;
>> +	int locked = 0;
>>  
>>  	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
>>  		return;
>> @@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>  
>>  	/* wait superblock change before suspend */
>>  	wait_event(mddev->sb_wait,
>> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
>> -
>> -	mddev_suspend(mddev);
>> -	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> -	mddev_resume(mddev);
>> +		   conf->log == NULL ||
>> +		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
>> +		    (locked = mddev_trylock(mddev))));
>> +	if (locked) {
>> +		mddev_suspend(mddev);
>> +		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> +		mddev_resume(mddev);
>> +		mddev_unlock(mddev);
>> +	}
>
> For this one, my point is:
>
> 	wait_event(mddev->sb_wait, conf->log == NULL ||
> 		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> 	if (conf->log == NULL)
> 		return;
>
> 	mddev_suspend(mddev);
> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> 	mddev_resume(mddev);
>
> does it work?

The
	lockdep_assert_held(&mddev->reconfig_mutex);
in mddev_suspend() will complain.

If you put an mddev_lock() call in there to stop the complaint, and if
the work item doesn't start before the reconfig_mutex is taken prior to
stopping the array, then r5l_exit_log() can deadlock at
	flush_work(&log->disable_writeback_work);

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5 v2] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-19  6:29           ` NeilBrown
@ 2017-10-20  4:37             ` Shaohua Li
  2017-10-23  0:02               ` NeilBrown
  0 siblings, 1 reply; 17+ messages in thread
From: Shaohua Li @ 2017-10-20  4:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Oct 19, 2017 at 05:29:22PM +1100, Neil Brown wrote:
> On Wed, Oct 18 2017, Shaohua Li wrote:
> 
> > On Thu, Oct 19, 2017 at 12:17:16PM +1100, Neil Brown wrote:
> >> 
> >> Most often mddev_suspend() is called with
> >> reconfig_mutex held.  Make this a requirement in
> >> preparation a subsequent patch.  Also require
> >> reconfig_mutex to be held for mddev_resume(),
> >> partly for symmetry and partly to guarantee
> >> no races with incr/decr of mddev->suspend.
> >> 
> >> Taking the mutex in r5c_disable_writeback_async() is
> >> a little tricky as this is called from a work queue
> >> via log->disable_writeback_work, and flush_work()
> >> is called on that while holding ->reconfig_mutex.
> >> If the work item hasn't run before flush_work()
> >> is called, the work function will not be able to
> >> get the mutex.
> >> 
> >> So we use mddev_trylock() inside the wait_event() call, and have that
> >> abort when conf->log is set to NULL, which happens before
> >> flush_work() is called.
> >> We wait in mddev->sb_wait and ensure this is woken
> >> when any of the conditions change.  This requires
> >> waking mddev->sb_wait in mddev_unlock().  This is only
> >> like to trigger extra wake_ups of threads that needn't
> >> be woken when metadata is being written, and that
> >> doesn't happen often enough that the cost would be
> >> noticeable.
> >> 
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >
> > --snip--
> >
> > Applied the other patches in the series including the 'remove quiesce(2)' one.
> 
> Thanks.
> 
> >
> >> index 0b7406ac8ce1..6a631dd21f0b 100644
> >> --- a/drivers/md/raid5-cache.c
> >> +++ b/drivers/md/raid5-cache.c
> >> @@ -693,6 +693,8 @@ static void r5c_disable_writeback_async(struct work_struct *work)
> >>  	struct r5l_log *log = container_of(work, struct r5l_log,
> >>  					   disable_writeback_work);
> >>  	struct mddev *mddev = log->rdev->mddev;
> >> +	struct r5conf *conf = mddev->private;
> >> +	int locked = 0;
> >>  
> >>  	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
> >>  		return;
> >> @@ -701,11 +703,15 @@ static void r5c_disable_writeback_async(struct work_struct *work)
> >>  
> >>  	/* wait superblock change before suspend */
> >>  	wait_event(mddev->sb_wait,
> >> -		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> >> -
> >> -	mddev_suspend(mddev);
> >> -	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> >> -	mddev_resume(mddev);
> >> +		   conf->log == NULL ||
> >> +		   (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags) &&
> >> +		    (locked = mddev_trylock(mddev))));
> >> +	if (locked) {
> >> +		mddev_suspend(mddev);
> >> +		log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> >> +		mddev_resume(mddev);
> >> +		mddev_unlock(mddev);
> >> +	}
> >
> > For this one, my point is:
> >
> > 	wait_event(mddev->sb_wait, conf->log == NULL ||
> > 		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> > 	if (conf->log == NULL)
> > 		return;
> >
> > 	mddev_suspend(mddev);
> > 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> > 	mddev_resume(mddev);
> >
> > does it work?
> 
> The
> 	lockdep_assert_held(&mddev->reconfig_mutex);
> in mddev_suspend() will complain.
> 
> If you put an mddev_lock() call in there to stop the complaint, and if
> the work item doesn't start before the reconfig_mutex is taken prior to
> stopping the array, then r5l_exit_log() can deadlock at
> 	flush_work(&log->disable_writeback_work);

Ok, got it now. But really don't like this patch. The mddev_unlock is strange,
r5c_disable_writeback_async could skip disabling writeback too. Could we add a
new callback like .prepare_free, and flush workqueue there. After we drop the
reconfig_mutex in do_md_stop, we call the prepare_free. We can probably set a
flag, so later r5c_disable_writeback_async will bail out doing nothing. I think
this should work, right?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5 v2] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-20  4:37             ` Shaohua Li
@ 2017-10-23  0:02               ` NeilBrown
  2017-10-23  1:48                 ` Shaohua Li
  0 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2017-10-23  0:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2247 bytes --]

On Thu, Oct 19 2017, Shaohua Li wrote:
>> >
>> > For this one, my point is:
>> >
>> > 	wait_event(mddev->sb_wait, conf->log == NULL ||
>> > 		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
>> > 	if (conf->log == NULL)
>> > 		return;
>> >
>> > 	mddev_suspend(mddev);
>> > 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> > 	mddev_resume(mddev);
>> >
>> > does it work?
>> 
>> The
>> 	lockdep_assert_held(&mddev->reconfig_mutex);
>> in mddev_suspend() will complain.
>> 
>> If you put an mddev_lock() call in there to stop the complaint, and if
>> the work item doesn't start before the reconfig_mutex is taken prior to
>> stopping the array, then r5l_exit_log() can deadlock at
>> 	flush_work(&log->disable_writeback_work);
>
> Ok, got it now. But really don't like this patch. The mddev_unlock is strange,
> r5c_disable_writeback_async could skip disabling writeback too. Could we add a
> new callback like .prepare_free, and flush workqueue there. After we drop the
> reconfig_mutex in do_md_stop, we call the prepare_free. We can probably set a
> flag, so later r5c_disable_writeback_async will bail out doing nothing. I think
> this should work, right?

Might work, though it sounds more messy to me (assuming I understand).

I would like to get rid of disable_writeback_work altogether.
Just set  log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH in
r5c_update_on_rdev_error(), and make sure that does the right thing.

The distinction between write-through and write-back should be able to
be a per-stripe_head distinction.  Once we set r5c_journal_mode, new
stripe_heads will get the new mode, old ones are allowed to continue how
they are.
Maybe we could keep a counter of how many stripes are in WRITE_BACK
mode, and test that counter in r5c_is_writeback()??

I don't know what all the issues are so it would need careful review,
preferably by someone familiar with the code.

Short of that, I think my current patch is the best interim step.  I
agree that it isn't the most elegant thing ever, but it is localized and
I believe it works correctly.
The "mddev_unlock()" shouldn't look too strange, it perfectly balances
he mddev_try_lock().

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [md PATCH 1/5 v2] md: always hold reconfig_mutex when calling mddev_suspend()
  2017-10-23  0:02               ` NeilBrown
@ 2017-10-23  1:48                 ` Shaohua Li
  0 siblings, 0 replies; 17+ messages in thread
From: Shaohua Li @ 2017-10-23  1:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Mon, Oct 23, 2017 at 11:02:23AM +1100, Neil Brown wrote:
> On Thu, Oct 19 2017, Shaohua Li wrote:
> >> >
> >> > For this one, my point is:
> >> >
> >> > 	wait_event(mddev->sb_wait, conf->log == NULL ||
> >> > 		   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
> >> > 	if (conf->log == NULL)
> >> > 		return;
> >> >
> >> > 	mddev_suspend(mddev);
> >> > 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> >> > 	mddev_resume(mddev);
> >> >
> >> > does it work?
> >> 
> >> The
> >> 	lockdep_assert_held(&mddev->reconfig_mutex);
> >> in mddev_suspend() will complain.
> >> 
> >> If you put an mddev_lock() call in there to stop the complaint, and if
> >> the work item doesn't start before the reconfig_mutex is taken prior to
> >> stopping the array, then r5l_exit_log() can deadlock at
> >> 	flush_work(&log->disable_writeback_work);
> >
> > Ok, got it now. But really don't like this patch. The mddev_unlock is strange,
> > r5c_disable_writeback_async could skip disabling writeback too. Could we add a
> > new callback like .prepare_free, and flush workqueue there. After we drop the
> > reconfig_mutex in do_md_stop, we call the prepare_free. We can probably set a
> > flag, so later r5c_disable_writeback_async will bail out doing nothing. I think
> > this should work, right?
> 
> Might work, though it sounds more messy to me (assuming I understand).
> 
> I would like to get rid of disable_writeback_work altogether.
> Just set  log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH in
> r5c_update_on_rdev_error(), and make sure that does the right thing.
> 
> The distinction between write-through and write-back should be able to
> be a per-stripe_head distinction.  Once we set r5c_journal_mode, new
> stripe_heads will get the new mode, old ones are allowed to continue how
> they are.
> Maybe we could keep a counter of how many stripes are in WRITE_BACK
> mode, and test that counter in r5c_is_writeback()??
> 
> I don't know what all the issues are so it would need careful review,
> preferably by someone familiar with the code.

We check the writeback in several steps, would be problematical, but I didn't
take a close look yet.

> Short of that, I think my current patch is the best interim step.  I
> agree that it isn't the most elegant thing ever, but it is localized and
> I believe it works correctly.
> The "mddev_unlock()" shouldn't look too strange, it perfectly balances
> he mddev_try_lock().

Alright, we need this patch to avoid the lockdep warning, so I queued the patch
now. Once I got time, I'll check if the proposal works.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-10-23  1:48 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-17  2:46 [md PATCH 0/5] Address deadlock associated with setting suspend_lo NeilBrown
2017-10-17  2:46 ` [md PATCH 1/5] md: always hold reconfig_mutex when calling mddev_suspend() NeilBrown
2017-10-18  6:11   ` Shaohua Li
2017-10-18  7:35     ` NeilBrown
2017-10-19  1:17       ` [md PATCH 1/5 v2] " NeilBrown
2017-10-19  3:45         ` Shaohua Li
2017-10-19  6:29           ` NeilBrown
2017-10-20  4:37             ` Shaohua Li
2017-10-23  0:02               ` NeilBrown
2017-10-23  1:48                 ` Shaohua Li
2017-10-17  2:46 ` [md PATCH 5/5] md: allow metadata update while suspending NeilBrown
2017-10-17  2:46 ` [md PATCH 4/5] md: use mddev_suspend/resume instead of ->quiesce() NeilBrown
2017-10-17  2:46 ` [md PATCH 3/5] md: move suspend_hi/lo handling into core md code NeilBrown
2017-10-18  6:16   ` Shaohua Li
2017-10-18  7:40     ` NeilBrown
2017-10-19  1:49       ` [md PATCH 6/5] md: remove special meaning of ->quiesce(.., 2) NeilBrown
2017-10-17  2:46 ` [md PATCH 2/5] md: don't call bitmap_create() while array is quiesced NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.