All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 0/5] the latest changes for md-cluster
@ 2017-03-01  8:42 Guoqing Jiang
  2017-03-01  8:42 ` [PATCH V2 2/5] md: move bitmap_destroy before __md_stop Guoqing Jiang
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-01  8:42 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Since 2, 3 and 13 are merged, so the relationship between V2
and previous patchset are:
	 V2	 	     old
	0001 -> 0004, 0005, 0006, 0008 and 0009
	0002 -> 	    0007
	0003 -> 	    0010
	0004 -> 	0011 and 0012
	0005 -> 	    0014
     
     
And based on Shaohua's comments, this version also changes:

1. discard the first patch for now.
2. add mode check before destroy bitmap.
3. for patch 11, add more codes to handle lockres_init returns err.
4. call revalidate_disk after set_capacity in both patch 3 and 5.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
  2017-03-01  8:42 [PATCH V2 0/5] the latest changes for md-cluster Guoqing Jiang
@ 2017-03-01  8:42 ` Guoqing Jiang
  2017-03-02 18:28   ` Shaohua Li
  2017-03-01  8:42 ` [PATCH V2 3/5] md-cluster: add CHANGE_CAPACITY message type Guoqing Jiang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-01  8:42 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Since we have switched to sync way to handle METADATA_UPDATED
msg for md-cluster, then process_metadata_update is depended
on mddev->thread->wqueue.

With the new change, clustered raid could possible hang if
array received a METADATA_UPDATED msg after array unregistered
mddev->thread, so we need to stop clustered raid earlier
than before.

And this change should be safe for non-clustered raid since
all writes are stopped before the destroy. Also in md_run,
we activate the personality (pers->run()) before activating
the bitmap (bitmap_create()). So it is pleasingly symmetric
to stop the bitmap (bitmap_destroy()) before stopping the
personality (__md_stop() calls pers->free()).

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 44206bc6e3aa..e1d9116044ae 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
 	/* stop the array and free an attached data structures.
 	 * This is called from dm-raid
 	 */
-	__md_stop(mddev);
 	bitmap_destroy(mddev);
+	__md_stop(mddev);
 	if (mddev->bio_set)
 		bioset_free(mddev->bio_set);
 }
@@ -5688,6 +5688,22 @@ static int do_md_stop(struct mddev *mddev, int mode,
 			set_disk_ro(disk, 0);
 
 		__md_stop_writes(mddev);
+
+		/*
+		 * Destroy bitmap after all writes are stopped
+		 */
+		if (mode == 0) {
+			bitmap_destroy(mddev);
+			if (mddev->bitmap_info.file) {
+				struct file *f = mddev->bitmap_info.file;
+				spin_lock(&mddev->lock);
+				mddev->bitmap_info.file = NULL;
+				spin_unlock(&mddev->lock);
+				fput(f);
+			}
+			mddev->bitmap_info.offset = 0;
+		}
+
 		__md_stop(mddev);
 		mddev->queue->backing_dev_info->congested_fn = NULL;
 
@@ -5712,19 +5728,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	 */
 	if (mode == 0) {
 		pr_info("md: %s stopped.\n", mdname(mddev));
-
-		bitmap_destroy(mddev);
-		if (mddev->bitmap_info.file) {
-			struct file *f = mddev->bitmap_info.file;
-			spin_lock(&mddev->lock);
-			mddev->bitmap_info.file = NULL;
-			spin_unlock(&mddev->lock);
-			fput(f);
-		}
-		mddev->bitmap_info.offset = 0;
-
 		export_array(mddev);
-
 		md_clean(mddev);
 		if (mddev->hold_active == UNTIL_STOP)
 			mddev->hold_active = 0;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V2 3/5] md-cluster: add CHANGE_CAPACITY message type
  2017-03-01  8:42 [PATCH V2 0/5] the latest changes for md-cluster Guoqing Jiang
  2017-03-01  8:42 ` [PATCH V2 2/5] md: move bitmap_destroy before __md_stop Guoqing Jiang
@ 2017-03-01  8:42 ` Guoqing Jiang
  2017-03-01  8:42 ` [PATCH V2 4/5] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-01  8:42 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

The msg type CHANGE_CAPACITY is introduced to support
resize clustered raid in later patch, and it is sent
after all the nodes have the same sync_size, receiver
node just need to set new capacity once received this
msg.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 5cf0a9d29bf0..8b7d55bf5aa2 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -104,6 +104,7 @@ enum msg_type {
 	REMOVE,
 	RE_ADD,
 	BITMAP_NEEDS_SYNC,
+	CHANGE_CAPACITY,
 };
 
 struct cluster_msg {
@@ -579,6 +580,10 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
 	case METADATA_UPDATED:
 		process_metadata_update(mddev, msg);
 		break;
+	case CHANGE_CAPACITY:
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+		revalidate_disk(mddev->gendisk);
+		break;
 	case RESYNCING:
 		process_suspend_info(mddev, le32_to_cpu(msg->slot),
 				     le64_to_cpu(msg->low),
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V2 4/5] md-cluster: introduce cluster_check_sync_size
  2017-03-01  8:42 [PATCH V2 0/5] the latest changes for md-cluster Guoqing Jiang
  2017-03-01  8:42 ` [PATCH V2 2/5] md: move bitmap_destroy before __md_stop Guoqing Jiang
  2017-03-01  8:42 ` [PATCH V2 3/5] md-cluster: add CHANGE_CAPACITY message type Guoqing Jiang
@ 2017-03-01  8:42 ` Guoqing Jiang
  2017-03-01  8:42 ` [PATCH V2 5/5] md-cluster: add the support for resize Guoqing Jiang
  2017-03-01  9:30 ` [PATCH V2 1/5] md-cluster: use sync way to handle METADATA_UPDATED msg Guoqing Jiang
  4 siblings, 0 replies; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-01  8:42 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Support resize is a little complex for clustered
raid, since we need to ensure all the nodes share
the same knowledge about the size of raid.

We achieve the goal by check the sync_size which
is in each node's bitmap, we can only change the
capacity after cluster_check_sync_size returns 0.

Also, get_bitmap_from_slot is added to get a slot's
bitmap. And we exported some funcs since they are
used in cluster_check_sync_size().

We can also reuse get_bitmap_from_slot to remove
redundant code existed in bitmap_copy_from_slot.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c     | 41 ++++++++++++++++++++++++---------
 drivers/md/bitmap.h     |  2 ++
 drivers/md/md-cluster.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9fb2ccac958a..b6fa55a3cff8 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
 	kunmap_atomic(sb);
 	write_page(bitmap, bitmap->storage.sb_page, 1);
 }
+EXPORT_SYMBOL(bitmap_update_sb);
 
 /* print out the bitmap file superblock */
 void bitmap_print_sb(struct bitmap *bitmap)
@@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
 /*
  * free memory that was allocated
  */
-static void bitmap_free(struct bitmap *bitmap)
+void bitmap_free(struct bitmap *bitmap)
 {
 	unsigned long k, pages;
 	struct bitmap_page *bp;
@@ -1761,6 +1762,7 @@ static void bitmap_free(struct bitmap *bitmap)
 	kfree(bp);
 	kfree(bitmap);
 }
+EXPORT_SYMBOL(bitmap_free);
 
 void bitmap_destroy(struct mddev *mddev)
 {
@@ -1920,6 +1922,27 @@ int bitmap_load(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(bitmap_load);
 
+struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
+{
+	int rv = 0;
+	struct bitmap *bitmap;
+
+	bitmap = bitmap_create(mddev, slot);
+	if (IS_ERR(bitmap)) {
+		rv = PTR_ERR(bitmap);
+		return ERR_PTR(rv);
+	}
+
+	rv = bitmap_init_from_disk(bitmap, 0);
+	if (rv) {
+		bitmap_free(bitmap);
+		return ERR_PTR(rv);
+	}
+
+	return bitmap;
+}
+EXPORT_SYMBOL(get_bitmap_from_slot);
+
 /* Loads the bitmap associated with slot and copies the resync information
  * to our bitmap
  */
@@ -1929,14 +1952,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	int rv = 0, i, j;
 	sector_t block, lo = 0, hi = 0;
 	struct bitmap_counts *counts;
-	struct bitmap *bitmap = bitmap_create(mddev, slot);
-
-	if (IS_ERR(bitmap))
-		return PTR_ERR(bitmap);
+	struct bitmap *bitmap;
 
-	rv = bitmap_init_from_disk(bitmap, 0);
-	if (rv)
-		goto err;
+	bitmap = get_bitmap_from_slot(mddev, slot);
+	if (IS_ERR(bitmap)) {
+		pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
+		return -1;
+	}
 
 	counts = &bitmap->counts;
 	for (j = 0; j < counts->chunks; j++) {
@@ -1963,8 +1985,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	bitmap_unplug(mddev->bitmap);
 	*low = lo;
 	*high = hi;
-err:
-	bitmap_free(bitmap);
+
 	return rv;
 }
 EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 5b6dd63dda91..9f761097aab2 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -267,8 +267,10 @@ void bitmap_daemon_work(struct mddev *mddev);
 
 int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		  int chunksize, int init);
+struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
 int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 				sector_t *lo, sector_t *hi, bool clear_bits);
+void bitmap_free(struct bitmap *bitmap);
 #endif
 
 #endif
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 8b7d55bf5aa2..5a42fc5202ff 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -1091,6 +1091,66 @@ static void metadata_update_cancel(struct mddev *mddev)
 	unlock_comm(cinfo);
 }
 
+/*
+ * retun 0 if all the bitmaps have the same sync_size
+ */
+int cluster_check_sync_size(struct mddev *mddev)
+{
+	int i, rv;
+	bitmap_super_t *sb;
+	unsigned long my_sync_size, sync_size = 0;
+	int node_num = mddev->bitmap_info.nodes;
+	int current_slot = md_cluster_ops->slot_number(mddev);
+	struct bitmap *bitmap = mddev->bitmap;
+	char str[64];
+	struct dlm_lock_resource *bm_lockres;
+
+	sb = kmap_atomic(bitmap->storage.sb_page);
+	my_sync_size = sb->sync_size;
+	kunmap_atomic(sb);
+
+	for (i = 0; i < node_num; i++) {
+		if (i == current_slot)
+			continue;
+
+		bitmap = get_bitmap_from_slot(mddev, i);
+		if (IS_ERR(bitmap)) {
+			pr_err("can't get bitmap from slot %d\n", i);
+			return -1;
+		}
+
+		/*
+		 * If we can hold the bitmap lock of one node then
+		 * the slot is not occupied, update the sb.
+		 */
+		snprintf(str, 64, "bitmap%04d", i);
+		bm_lockres = lockres_init(mddev, str, NULL, 1);
+		if (!bm_lockres) {
+			pr_err("md-cluster: Cannot initialize %s\n", str);
+			lockres_free(bm_lockres);
+			return -ENOMEM;
+		}
+		bm_lockres->flags |= DLM_LKF_NOQUEUE;
+		rv = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
+		if (!rv)
+			bitmap_update_sb(bitmap);
+		lockres_free(bm_lockres);
+
+		sb = kmap_atomic(bitmap->storage.sb_page);
+		if (sync_size == 0)
+			sync_size = sb->sync_size;
+		else if (sync_size != sb->sync_size) {
+			kunmap_atomic(sb);
+			bitmap_free(bitmap);
+			return -1;
+		}
+		kunmap_atomic(sb);
+		bitmap_free(bitmap);
+	}
+
+	return (my_sync_size == sync_size) ? 0 : -1;
+}
+
 static int resync_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V2 5/5] md-cluster: add the support for resize
  2017-03-01  8:42 [PATCH V2 0/5] the latest changes for md-cluster Guoqing Jiang
                   ` (2 preceding siblings ...)
  2017-03-01  8:42 ` [PATCH V2 4/5] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
@ 2017-03-01  8:42 ` Guoqing Jiang
  2017-03-01  9:30 ` [PATCH V2 1/5] md-cluster: use sync way to handle METADATA_UPDATED msg Guoqing Jiang
  4 siblings, 0 replies; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-01  8:42 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

To update size for cluster raid, we need to make
sure all nodes can perform the change successfully.
However, it is possible that some of them can't do
it due to failure (bitmap_resize could fail). So
we need to consider the issue before we set the
capacity unconditionally, and we use below steps
to perform sanity check.

1. A change the size, then broadcast METADATA_UPDATED
   msg.
2. B and C receive METADATA_UPDATED change the size
   excepts call set_capacity, sync_size is not update
   if the change failed. Also call bitmap_update_sb
   to sync sb to disk.
3. A checks other node's sync_size, if sync_size has
   been updated in all nodes, then send CHANGE_CAPACITY
   msg otherwise send msg to revert previous change.
4. B and C call set_capacity if receive CHANGE_CAPACITY
   msg, otherwise pers->resize will be called to restore
   the old value.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Documentation/md/md-cluster.txt |  2 +-
 drivers/md/md-cluster.c         | 76 +++++++++++++++++++++++++++++++++++++++++
 drivers/md/md-cluster.h         |  1 +
 drivers/md/md.c                 | 21 +++++++++---
 4 files changed, 94 insertions(+), 6 deletions(-)

diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
index 38883276d31c..2663d49dd8a0 100644
--- a/Documentation/md/md-cluster.txt
+++ b/Documentation/md/md-cluster.txt
@@ -321,4 +321,4 @@ The algorithm is:
 
 There are somethings which are not supported by cluster MD yet.
 
-- update size and change array_sectors.
+- change array_sectors.
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 5a42fc5202ff..88ce15843a42 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -1151,6 +1151,81 @@ int cluster_check_sync_size(struct mddev *mddev)
 	return (my_sync_size == sync_size) ? 0 : -1;
 }
 
+/*
+ * Update the size for cluster raid is a little more complex, we perform it
+ * by the steps:
+ * 1. hold token lock and update superblock in initiator node.
+ * 2. send METADATA_UPDATED msg to other nodes.
+ * 3. The initiator node continues to check each bitmap's sync_size, if all
+ *    bitmaps have the same value of sync_size, then we can set capacity and
+ *    let other nodes to perform it. If one node can't update sync_size
+ *    accordingly, we need to revert to previous value.
+ */
+static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
+{
+	struct md_cluster_info *cinfo = mddev->cluster_info;
+	struct cluster_msg cmsg;
+	struct md_rdev *rdev;
+	int ret = 0;
+	int raid_slot = -1;
+
+	md_update_sb(mddev, 1);
+	lock_comm(cinfo, 1);
+
+	memset(&cmsg, 0, sizeof(cmsg));
+	cmsg.type = cpu_to_le32(METADATA_UPDATED);
+	rdev_for_each(rdev, mddev)
+		if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) {
+			raid_slot = rdev->desc_nr;
+			break;
+		}
+	if (raid_slot >= 0) {
+		cmsg.raid_slot = cpu_to_le32(raid_slot);
+		/*
+		 * We can only change capiticy after all the nodes can do it,
+		 * so need to wait after other nodes already received the msg
+		 * and handled the change
+		 */
+		ret = __sendmsg(cinfo, &cmsg);
+		if (ret) {
+			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
+			       __func__, __LINE__);
+			unlock_comm(cinfo);
+			return;
+		}
+	} else {
+		pr_err("md-cluster: No good device id found to send\n");
+		unlock_comm(cinfo);
+		return;
+	}
+
+	/*
+	 * check the sync_size from other node's bitmap, if sync_size
+	 * have already updated in other nodes as expected, send an
+	 * empty metadata msg to permit the change of capacity
+	 */
+	if (cluster_check_sync_size(mddev) == 0) {
+		memset(&cmsg, 0, sizeof(cmsg));
+		cmsg.type = cpu_to_le32(CHANGE_CAPACITY);
+		ret = __sendmsg(cinfo, &cmsg);
+		if (ret)
+			pr_err("%s:%d: failed to send CHANGE_CAPACITY msg\n",
+			       __func__, __LINE__);
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+		revalidate_disk(mddev->gendisk);
+	} else {
+		/* revert to previous sectors */
+		ret = mddev->pers->resize(mddev, old_dev_sectors);
+		if (!ret)
+			revalidate_disk(mddev->gendisk);
+		ret = __sendmsg(cinfo, &cmsg);
+		if (ret)
+			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
+			       __func__, __LINE__);
+	}
+	unlock_comm(cinfo);
+}
+
 static int resync_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
@@ -1396,6 +1471,7 @@ static struct md_cluster_operations cluster_ops = {
 	.gather_bitmaps = gather_bitmaps,
 	.lock_all_bitmaps = lock_all_bitmaps,
 	.unlock_all_bitmaps = unlock_all_bitmaps,
+	.update_size = update_size,
 };
 
 static int __init cluster_init(void)
diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
index e765499ba591..274016177983 100644
--- a/drivers/md/md-cluster.h
+++ b/drivers/md/md-cluster.h
@@ -27,6 +27,7 @@ struct md_cluster_operations {
 	int (*gather_bitmaps)(struct md_rdev *rdev);
 	int (*lock_all_bitmaps)(struct mddev *mddev);
 	void (*unlock_all_bitmaps)(struct mddev *mddev);
+	void (*update_size)(struct mddev *mddev, sector_t old_dev_sectors);
 };
 
 #endif /* _MD_CLUSTER_H */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e1d9116044ae..46bae2fb4f9a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6496,10 +6496,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 	struct md_rdev *rdev;
 	int rv;
 	int fit = (num_sectors == 0);
-
-	/* cluster raid doesn't support update size */
-	if (mddev_is_clustered(mddev))
-		return -EINVAL;
+	sector_t old_dev_sectors = mddev->dev_sectors;
 
 	if (mddev->pers->resize == NULL)
 		return -EINVAL;
@@ -6528,7 +6525,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 	}
 	rv = mddev->pers->resize(mddev, num_sectors);
 	if (!rv) {
-		if (mddev->queue) {
+		if (mddev_is_clustered(mddev))
+			md_cluster_ops->update_size(mddev, old_dev_sectors);
+		else if (mddev->queue) {
 			set_capacity(mddev->gendisk, mddev->array_sectors);
 			revalidate_disk(mddev->gendisk);
 		}
@@ -8746,6 +8745,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 	int role, ret;
 	char b[BDEVNAME_SIZE];
 
+	/*
+	 * If size is changed in another node then we need to
+	 * do resize as well.
+	 */
+	if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
+		ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
+		if (ret)
+			pr_info("md-cluster: resize failed\n");
+		else
+			bitmap_update_sb(mddev->bitmap);
+	}
+
 	/* Check for change of roles in the active devices */
 	rdev_for_each(rdev2, mddev) {
 		if (test_bit(Faulty, &rdev2->flags))
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V2 1/5] md-cluster: use sync way to handle METADATA_UPDATED msg
  2017-03-01  8:42 [PATCH V2 0/5] the latest changes for md-cluster Guoqing Jiang
                   ` (3 preceding siblings ...)
  2017-03-01  8:42 ` [PATCH V2 5/5] md-cluster: add the support for resize Guoqing Jiang
@ 2017-03-01  9:30 ` Guoqing Jiang
  4 siblings, 0 replies; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-01  9:30 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove RELOAD_SB related codes since there are not valid anymore.
2. mddev is added into md_cluster_info then we can get mddev inside
   lock_token.
3. add new parameter for lock_token to distinguish reconfig_mutex
   is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
1. set it before unregister thread, otherwise a deadlock could
   appear if stop a resyncing array.
   This is because md_unregister_thread(&cinfo->recv_thread) is
   blocked by recv_daemon -> process_recvd_msg
			  -> process_metadata_update.
   To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
   also need to be set before unregister thread.
2. set it in metadata_update_start to fix another deadlock.
	a. Node A sends METADATA_UPDATED msg (held Token lock).
	b. Node B wants to do resync, and is blocked since it can't
	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
	   not set since the callchain
	   (md_do_sync -> sync_request
        	       -> resync_info_update
		       -> sendmsg
		       -> lock_comm -> lock_token)
	   doesn't hold reconfig_mutex.
	c. Node B trys to update sb (held reconfig_mutex), but stopped
	   at wait_event() in metadata_update_start since we have set
	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
	d. Then Node B receives METADATA_UPDATED msg from A, of course
	   recv_daemon is blocked forever.
   Since metadata_update_start always calls lock_token with reconfig_mutex,
   we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
   lock_token don't need to set it twice unless lock_token is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 82 +++++++++++++++++++++++++++++++++++++++----------
 drivers/md/md.c         |  4 ---
 drivers/md/md.h         |  3 --
 3 files changed, 66 insertions(+), 23 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 321ecac23027..5cf0a9d29bf0 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -67,9 +67,10 @@ struct resync_info {
  * set up all the related infos such as bitmap and personality */
 #define		MD_CLUSTER_ALREADY_IN_CLUSTER		6
 #define		MD_CLUSTER_PENDING_RECV_EVENT		7
-
+#define 	MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD		8
 
 struct md_cluster_info {
+	struct mddev *mddev; /* the md device which md_cluster_info belongs to */
 	/* dlm lock space and resources for clustered raid. */
 	dlm_lockspace_t *lockspace;
 	int slot_number;
@@ -523,11 +524,17 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 
 static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
 {
+	int got_lock = 0;
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
-	set_bit(MD_RELOAD_SB, &mddev->flags);
+
 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
-	md_wakeup_thread(mddev->thread);
+	wait_event(mddev->thread->wqueue,
+		   (got_lock = mddev_trylock(mddev)) ||
+		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
+	md_reload_sb(mddev, mddev->good_device_nr);
+	if (got_lock)
+		mddev_unlock(mddev);
 }
 
 static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
@@ -646,11 +653,29 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo)
+static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 {
-	int error;
+	int error, set_bit = 0;
+	struct mddev *mddev = cinfo->mddev;
 
+	/*
+	 * If resync thread run after raid1d thread, then process_metadata_update
+	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
+	 * since another node already got EX on Token and waitting the EX of Ack),
+	 * so let resync wake up thread in case flag is set.
+	 */
+	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				      &cinfo->state)) {
+		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+					      &cinfo->state);
+		WARN_ON_ONCE(error);
+		md_wakeup_thread(mddev->thread);
+		set_bit = 1;
+	}
 	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (set_bit)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+
 	if (error)
 		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
 				__func__, __LINE__, error);
@@ -663,12 +688,12 @@ static int lock_token(struct md_cluster_info *cinfo)
 /* lock_comm()
  * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
  */
-static int lock_comm(struct md_cluster_info *cinfo)
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
 {
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
 
-	return lock_token(cinfo);
+	return lock_token(cinfo, mddev_locked);
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -743,11 +768,12 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	return error;
 }
 
-static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
+static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
+		   bool mddev_locked)
 {
 	int ret;
 
-	lock_comm(cinfo);
+	lock_comm(cinfo, mddev_locked);
 	ret = __sendmsg(cinfo, cmsg);
 	unlock_comm(cinfo);
 	return ret;
@@ -834,6 +860,7 @@ static int join(struct mddev *mddev, int nodes)
 	mutex_init(&cinfo->recv_mutex);
 
 	mddev->cluster_info = cinfo;
+	cinfo->mddev = mddev;
 
 	memset(str, 0, 64);
 	sprintf(str, "%pU", mddev->uuid);
@@ -908,6 +935,7 @@ static int join(struct mddev *mddev, int nodes)
 
 	return 0;
 err:
+	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	md_unregister_thread(&cinfo->recovery_thread);
 	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
@@ -943,7 +971,7 @@ static void resync_bitmap(struct mddev *mddev)
 	int err;
 
 	cmsg.type = cpu_to_le32(BITMAP_NEEDS_SYNC);
-	err = sendmsg(cinfo, &cmsg);
+	err = sendmsg(cinfo, &cmsg, 1);
 	if (err)
 		pr_err("%s:%d: failed to send BITMAP_NEEDS_SYNC message (%d)\n",
 			__func__, __LINE__, err);
@@ -963,6 +991,7 @@ static int leave(struct mddev *mddev)
 	if (cinfo->slot_number > 0 && mddev->recovery_cp != MaxSector)
 		resync_bitmap(mddev);
 
+	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	md_unregister_thread(&cinfo->recovery_thread);
 	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
@@ -997,16 +1026,30 @@ static int slot_number(struct mddev *mddev)
 static int metadata_update_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
+	int ret;
+
+	/*
+	 * metadata_update_start is always called with the protection of
+	 * reconfig_mutex, so set WAITING_FOR_TOKEN here.
+	 */
+	ret = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				    &cinfo->state);
+	WARN_ON_ONCE(ret);
+	md_wakeup_thread(mddev->thread);
 
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state) ||
 		   test_and_clear_bit(MD_CLUSTER_SEND_LOCKED_ALREADY, &cinfo->state));
 
 	/* If token is already locked, return 0 */
-	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
+	if (cinfo->token_lockres->mode == DLM_LOCK_EX) {
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 		return 0;
+	}
 
-	return lock_token(cinfo);
+	ret = lock_token(cinfo, 1);
+	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return ret;
 }
 
 static int metadata_update_finish(struct mddev *mddev)
@@ -1069,7 +1112,14 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
 	cmsg.low = cpu_to_le64(lo);
 	cmsg.high = cpu_to_le64(hi);
 
-	return sendmsg(cinfo, &cmsg);
+	/*
+	 * mddev_lock is held if resync_info_update is called from
+	 * resync_finish (md_reap_sync_thread -> resync_finish)
+	 */
+	if (lo == 0 && hi == 0)
+		return sendmsg(cinfo, &cmsg, 1);
+	else
+		return sendmsg(cinfo, &cmsg, 0);
 }
 
 static int resync_finish(struct mddev *mddev)
@@ -1119,7 +1169,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
 	cmsg.type = cpu_to_le32(NEWDISK);
 	memcpy(cmsg.uuid, uuid, 16);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	lock_comm(cinfo);
+	lock_comm(cinfo, 1);
 	ret = __sendmsg(cinfo, &cmsg);
 	if (ret)
 		return ret;
@@ -1179,7 +1229,7 @@ static int remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	cmsg.type = cpu_to_le32(REMOVE);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	return sendmsg(cinfo, &cmsg);
+	return sendmsg(cinfo, &cmsg, 1);
 }
 
 static int lock_all_bitmaps(struct mddev *mddev)
@@ -1243,7 +1293,7 @@ static int gather_bitmaps(struct md_rdev *rdev)
 
 	cmsg.type = cpu_to_le32(RE_ADD);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	err = sendmsg(cinfo, &cmsg);
+	err = sendmsg(cinfo, &cmsg, 1);
 	if (err)
 		goto out;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f25bd6de0c72..44206bc6e3aa 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8384,7 +8384,6 @@ void md_check_recovery(struct mddev *mddev)
 		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
-		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
 		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
@@ -8433,9 +8432,6 @@ void md_check_recovery(struct mddev *mddev)
 						rdev->raid_disk < 0)
 					md_kick_rdev_from_array(rdev);
 			}
-
-			if (test_and_clear_bit(MD_RELOAD_SB, &mddev->flags))
-				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
 		if (!mddev->external) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1c00160b09f9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -219,9 +219,6 @@ enum mddev_flags {
 				 * it then */
 	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
 	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
-	MD_RELOAD_SB,		/* Reload the superblock because another node
-				 * updated it.
-				 */
 	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
 				   * already took resync lock, need to
 				   * release the lock */
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
  2017-03-01  8:42 ` [PATCH V2 2/5] md: move bitmap_destroy before __md_stop Guoqing Jiang
@ 2017-03-02 18:28   ` Shaohua Li
  2017-03-02 22:15     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2017-03-02 18:28 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
> Since we have switched to sync way to handle METADATA_UPDATED
> msg for md-cluster, then process_metadata_update is depended
> on mddev->thread->wqueue.
> 
> With the new change, clustered raid could possible hang if
> array received a METADATA_UPDATED msg after array unregistered
> mddev->thread, so we need to stop clustered raid earlier
> than before.
> 
> And this change should be safe for non-clustered raid since
> all writes are stopped before the destroy. Also in md_run,
> we activate the personality (pers->run()) before activating
> the bitmap (bitmap_create()). So it is pleasingly symmetric
> to stop the bitmap (bitmap_destroy()) before stopping the
> personality (__md_stop() calls pers->free()).
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md.c | 30 +++++++++++++++++-------------
>  1 file changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 44206bc6e3aa..e1d9116044ae 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>  	/* stop the array and free an attached data structures.
>  	 * This is called from dm-raid
>  	 */
> -	__md_stop(mddev);
>  	bitmap_destroy(mddev);
> +	__md_stop(mddev);
>  	if (mddev->bio_set)
>  		bioset_free(mddev->bio_set);
>  }

Applied other 4 patches. But this one I still have concerns.

For raid1, if a bio is behind IO, we return the bio to upper layer but don't
wait behind IO completion. So even there are no writes running, there might be
behind IO running. mddev_detach will do the wait which checks bitmap. If we
bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.

Probably we should move mddev_detach out of __md_stop and always do:
mddev_detach()
bitmap_destroy()
__md_stop()

This looks safer to me.

Thanks,
Shaohua
> @@ -5688,6 +5688,22 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  			set_disk_ro(disk, 0);
>  
>  		__md_stop_writes(mddev);
> +
> +		/*
> +		 * Destroy bitmap after all writes are stopped
> +		 */
> +		if (mode == 0) {
> +			bitmap_destroy(mddev);
> +			if (mddev->bitmap_info.file) {
> +				struct file *f = mddev->bitmap_info.file;
> +				spin_lock(&mddev->lock);
> +				mddev->bitmap_info.file = NULL;
> +				spin_unlock(&mddev->lock);
> +				fput(f);
> +			}
> +			mddev->bitmap_info.offset = 0;
> +		}
> +
>  		__md_stop(mddev);
>  		mddev->queue->backing_dev_info->congested_fn = NULL;
>  
> @@ -5712,19 +5728,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  	 */
>  	if (mode == 0) {
>  		pr_info("md: %s stopped.\n", mdname(mddev));
> -
> -		bitmap_destroy(mddev);
> -		if (mddev->bitmap_info.file) {
> -			struct file *f = mddev->bitmap_info.file;
> -			spin_lock(&mddev->lock);
> -			mddev->bitmap_info.file = NULL;
> -			spin_unlock(&mddev->lock);
> -			fput(f);
> -		}
> -		mddev->bitmap_info.offset = 0;
> -
>  		export_array(mddev);
> -
>  		md_clean(mddev);
>  		if (mddev->hold_active == UNTIL_STOP)
>  			mddev->hold_active = 0;
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
  2017-03-02 18:28   ` Shaohua Li
@ 2017-03-02 22:15     ` NeilBrown
  2017-03-03  3:08       ` Guoqing Jiang
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2017-03-02 22:15 UTC (permalink / raw)
  To: Shaohua Li, Guoqing Jiang; +Cc: linux-raid, shli, neilb

[-- Attachment #1: Type: text/plain, Size: 4536 bytes --]

On Thu, Mar 02 2017, Shaohua Li wrote:

> On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
>> Since we have switched to sync way to handle METADATA_UPDATED
>> msg for md-cluster, then process_metadata_update is depended
>> on mddev->thread->wqueue.
>> 
>> With the new change, clustered raid could possible hang if
>> array received a METADATA_UPDATED msg after array unregistered
>> mddev->thread, so we need to stop clustered raid earlier
>> than before.
>> 
>> And this change should be safe for non-clustered raid since
>> all writes are stopped before the destroy. Also in md_run,
>> we activate the personality (pers->run()) before activating
>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>> to stop the bitmap (bitmap_destroy()) before stopping the
>> personality (__md_stop() calls pers->free()).
>> 
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  drivers/md/md.c | 30 +++++++++++++++++-------------
>>  1 file changed, 17 insertions(+), 13 deletions(-)
>> 
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 44206bc6e3aa..e1d9116044ae 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>>  	/* stop the array and free an attached data structures.
>>  	 * This is called from dm-raid
>>  	 */
>> -	__md_stop(mddev);
>>  	bitmap_destroy(mddev);
>> +	__md_stop(mddev);
>>  	if (mddev->bio_set)
>>  		bioset_free(mddev->bio_set);
>>  }
>
> Applied other 4 patches. But this one I still have concerns.
>
> For raid1, if a bio is behind IO, we return the bio to upper layer but don't
> wait behind IO completion. So even there are no writes running, there might be
> behind IO running. mddev_detach will do the wait which checks bitmap. If we
> bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.
>
> Probably we should move mddev_detach out of __md_stop and always do:
> mddev_detach()
> bitmap_destroy()
> __md_stop()
>
> This looks safer to me.

Thanks for catching that.  I agree - mddev_detach should come before
bitmap_destroy.
I might be best to change __md_stop() to start

static void __md_stop(struct mddev *mddev)
{
	struct md_personality *pers = mddev->pers;
	mddev_detach(mddev);
+	bitmap_destroy(mddev);
	/* Ensure ->event_work is done */
	flush_workqueue(md_misc_wq);


That would make the remainder of the patch (below) unnecessary.
I think it is wrong anyway.
We were correct to move the "bitmap_destroy() call up to the
"if (mddev->pers)" case, but we were not correct to move the
closing of bitmap_info.file.
If a file is added to an array but that array is not started
(so ->pers is not set), then stopping the array will not close the file
with the change below, and that isn't good.

So if we move bitmap_destroy() into  __md_stop() and remove it from
do_md_stop and md_stop(), that might be exactly what we need.

Thanks,
NeilBrown

>
> Thanks,
> Shaohua
>> @@ -5688,6 +5688,22 @@ static int do_md_stop(struct mddev *mddev, int mode,
>>  			set_disk_ro(disk, 0);
>>  
>>  		__md_stop_writes(mddev);
>> +
>> +		/*
>> +		 * Destroy bitmap after all writes are stopped
>> +		 */
>> +		if (mode == 0) {
>> +			bitmap_destroy(mddev);
>> +			if (mddev->bitmap_info.file) {
>> +				struct file *f = mddev->bitmap_info.file;
>> +				spin_lock(&mddev->lock);
>> +				mddev->bitmap_info.file = NULL;
>> +				spin_unlock(&mddev->lock);
>> +				fput(f);
>> +			}
>> +			mddev->bitmap_info.offset = 0;
>> +		}
>> +
>>  		__md_stop(mddev);
>>  		mddev->queue->backing_dev_info->congested_fn = NULL;
>>  
>> @@ -5712,19 +5728,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>>  	 */
>>  	if (mode == 0) {
>>  		pr_info("md: %s stopped.\n", mdname(mddev));
>> -
>> -		bitmap_destroy(mddev);
>> -		if (mddev->bitmap_info.file) {
>> -			struct file *f = mddev->bitmap_info.file;
>> -			spin_lock(&mddev->lock);
>> -			mddev->bitmap_info.file = NULL;
>> -			spin_unlock(&mddev->lock);
>> -			fput(f);
>> -		}
>> -		mddev->bitmap_info.offset = 0;
>> -
>>  		export_array(mddev);
>> -
>>  		md_clean(mddev);
>>  		if (mddev->hold_active == UNTIL_STOP)
>>  			mddev->hold_active = 0;
>> -- 
>> 2.6.2
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
  2017-03-02 22:15     ` NeilBrown
@ 2017-03-03  3:08       ` Guoqing Jiang
  2017-03-03  5:20         ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Guoqing Jiang @ 2017-03-03  3:08 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/03/2017 06:15 AM, NeilBrown wrote:
> On Thu, Mar 02 2017, Shaohua Li wrote:
>
>> On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
>>> Since we have switched to sync way to handle METADATA_UPDATED
>>> msg for md-cluster, then process_metadata_update is depended
>>> on mddev->thread->wqueue.
>>>
>>> With the new change, clustered raid could possible hang if
>>> array received a METADATA_UPDATED msg after array unregistered
>>> mddev->thread, so we need to stop clustered raid earlier
>>> than before.
>>>
>>> And this change should be safe for non-clustered raid since
>>> all writes are stopped before the destroy. Also in md_run,
>>> we activate the personality (pers->run()) before activating
>>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>>> to stop the bitmap (bitmap_destroy()) before stopping the
>>> personality (__md_stop() calls pers->free()).
>>>
>>> Reviewed-by: NeilBrown <neilb@suse.com>
>>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>> ---
>>>   drivers/md/md.c | 30 +++++++++++++++++-------------
>>>   1 file changed, 17 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 44206bc6e3aa..e1d9116044ae 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>>>   	/* stop the array and free an attached data structures.
>>>   	 * This is called from dm-raid
>>>   	 */
>>> -	__md_stop(mddev);
>>>   	bitmap_destroy(mddev);
>>> +	__md_stop(mddev);
>>>   	if (mddev->bio_set)
>>>   		bioset_free(mddev->bio_set);
>>>   }
>> Applied other 4 patches. But this one I still have concerns.
>>
>> For raid1, if a bio is behind IO, we return the bio to upper layer but don't
>> wait behind IO completion. So even there are no writes running, there might be
>> behind IO running. mddev_detach will do the wait which checks bitmap. If we
>> bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.

Got it, thanks for explanation!

>>
>> Probably we should move mddev_detach out of __md_stop and always do:
>> mddev_detach()
>> bitmap_destroy()
>> __md_stop()
>>
>> This looks safer to me.
> Thanks for catching that.  I agree - mddev_detach should come before
> bitmap_destroy.
> I might be best to change __md_stop() to start
>
> static void __md_stop(struct mddev *mddev)
> {
> 	struct md_personality *pers = mddev->pers;
> 	mddev_detach(mddev);
> +	bitmap_destroy(mddev);
> 	/* Ensure ->event_work is done */
> 	flush_workqueue(md_misc_wq);

With this change, does it mean we destroy bitmap unconditionally even
if the mode is 2? Thanks.

> That would make the remainder of the patch (below) unnecessary.
> I think it is wrong anyway.
> We were correct to move the "bitmap_destroy() call up to the
> "if (mddev->pers)" case, but we were not correct to move the
> closing of bitmap_info.file.
> If a file is added to an array but that array is not started
> (so ->pers is not set), then stopping the array will not close the file
> with the change below, and that isn't good.

Hmm, thanks for reminder. Though I have some questions about above.
I guess the file is pointed to bitmap file, from mdadm manpage, I see
  "-b, --bitmap=" is used under create, build, grow and assemble mode.
But  how to add a file to a array which is not started? Please correct me
for my wrong understanding.

> So if we move bitmap_destroy() into  __md_stop() and remove it from
> do_md_stop and md_stop(), that might be exactly what we need.

How about the below changes? It may addresses both your concern and
Shaohua's concern, but not sure ...

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 79a99a1..a0e79d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5555,7 +5555,6 @@ static void mddev_detach(struct mddev *mddev)
  static void __md_stop(struct mddev *mddev)
  {
         struct md_personality *pers = mddev->pers;
-       mddev_detach(mddev);
         /* Ensure ->event_work is done */
         flush_workqueue(md_misc_wq);
         spin_lock(&mddev->lock);
@@ -5574,8 +5573,9 @@ void md_stop(struct mddev *mddev)
         /* stop the array and free an attached data structures.
          * This is called from dm-raid
          */
-       __md_stop(mddev);
+       mddev_detach(mddev);
         bitmap_destroy(mddev);
+       __md_stop(mddev);
         if (mddev->bio_set)
                 bioset_free(mddev->bio_set);
  }
@@ -5688,6 +5688,10 @@ static int do_md_stop(struct mddev *mddev, int mode,
                         set_disk_ro(disk, 0);

                 __md_stop_writes(mddev);
+               mddev_detach(mddev);
+               if (mode == 0)
+                       bitmap_destroy(mddev);
+
                 __md_stop(mddev);
mddev->queue->backing_dev_info->congested_fn = NULL;

@@ -5713,7 +5717,8 @@ static int do_md_stop(struct mddev *mddev, int mode,
         if (mode == 0) {
                 pr_info("md: %s stopped.\n", mdname(mddev));

-               bitmap_destroy(mddev);
+               if (!mddev->pers)
+                       bitmap_destroy(mddev);
                 if (mddev->bitmap_info.file) {
                         struct file *f = mddev->bitmap_info.file;
                         spin_lock(&mddev->lock);

Thanks,
Guoqing

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 2/5] md: move bitmap_destroy before __md_stop
  2017-03-03  3:08       ` Guoqing Jiang
@ 2017-03-03  5:20         ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2017-03-03  5:20 UTC (permalink / raw)
  To: Guoqing Jiang, Shaohua Li; +Cc: linux-raid, shli

[-- Attachment #1: Type: text/plain, Size: 5863 bytes --]

On Fri, Mar 03 2017, Guoqing Jiang wrote:

> On 03/03/2017 06:15 AM, NeilBrown wrote:
>> On Thu, Mar 02 2017, Shaohua Li wrote:
>>
>>> On Wed, Mar 01, 2017 at 04:42:37PM +0800, Guoqing Jiang wrote:
>>>> Since we have switched to sync way to handle METADATA_UPDATED
>>>> msg for md-cluster, then process_metadata_update is depended
>>>> on mddev->thread->wqueue.
>>>>
>>>> With the new change, clustered raid could possible hang if
>>>> array received a METADATA_UPDATED msg after array unregistered
>>>> mddev->thread, so we need to stop clustered raid earlier
>>>> than before.
>>>>
>>>> And this change should be safe for non-clustered raid since
>>>> all writes are stopped before the destroy. Also in md_run,
>>>> we activate the personality (pers->run()) before activating
>>>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>>>> to stop the bitmap (bitmap_destroy()) before stopping the
>>>> personality (__md_stop() calls pers->free()).
>>>>
>>>> Reviewed-by: NeilBrown <neilb@suse.com>
>>>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>>>> ---
>>>>   drivers/md/md.c | 30 +++++++++++++++++-------------
>>>>   1 file changed, 17 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index 44206bc6e3aa..e1d9116044ae 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -5574,8 +5574,8 @@ void md_stop(struct mddev *mddev)
>>>>   	/* stop the array and free an attached data structures.
>>>>   	 * This is called from dm-raid
>>>>   	 */
>>>> -	__md_stop(mddev);
>>>>   	bitmap_destroy(mddev);
>>>> +	__md_stop(mddev);
>>>>   	if (mddev->bio_set)
>>>>   		bioset_free(mddev->bio_set);
>>>>   }
>>> Applied other 4 patches. But this one I still have concerns.
>>>
>>> For raid1, if a bio is behind IO, we return the bio to upper layer but don't
>>> wait behind IO completion. So even there are no writes running, there might be
>>> behind IO running. mddev_detach will do the wait which checks bitmap. If we
>>> bitmap_destroy before __md_stop, mddev_detach doesn't do the wait.
>
> Got it, thanks for explanation!
>
>>>
>>> Probably we should move mddev_detach out of __md_stop and always do:
>>> mddev_detach()
>>> bitmap_destroy()
>>> __md_stop()
>>>
>>> This looks safer to me.
>> Thanks for catching that.  I agree - mddev_detach should come before
>> bitmap_destroy.
>> I might be best to change __md_stop() to start
>>
>> static void __md_stop(struct mddev *mddev)
>> {
>> 	struct md_personality *pers = mddev->pers;
>> 	mddev_detach(mddev);
>> +	bitmap_destroy(mddev);
>> 	/* Ensure ->event_work is done */
>> 	flush_workqueue(md_misc_wq);
>
> With this change, does it mean we destroy bitmap unconditionally even
> if the mode is 2? Thanks.

Yes, and we should destroy the bitmap.  In that case we need to return
the array to the start it was before RUN_ARRAY ioctl.

>
>> That would make the remainder of the patch (below) unnecessary.
>> I think it is wrong anyway.
>> We were correct to move the "bitmap_destroy() call up to the
>> "if (mddev->pers)" case, but we were not correct to move the
>> closing of bitmap_info.file.
>> If a file is added to an array but that array is not started
>> (so ->pers is not set), then stopping the array will not close the file
>> with the change below, and that isn't good.
>
> Hmm, thanks for reminder. Though I have some questions about above.
> I guess the file is pointed to bitmap file, from mdadm manpage, I see
>   "-b, --bitmap=" is used under create, build, grow and assemble mode.
> But  how to add a file to a array which is not started? Please correct me
> for my wrong understanding.

Call SET_BITMAP_FILE ioctl.
mdadm won't do this without then calling RUN_ARRAY (or similar).
But is it is possible and we should handle it.
o

>
>> So if we move bitmap_destroy() into  __md_stop() and remove it from
>> do_md_stop and md_stop(), that might be exactly what we need.
>
> How about the below changes? It may addresses both your concern and
> Shaohua's concern, but not sure ...


No, that is more complex than needed, and doesn't call bitmap_destroy()
always when required.

Thanks,
NeilBrown


>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 79a99a1..a0e79d6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5555,7 +5555,6 @@ static void mddev_detach(struct mddev *mddev)
>   static void __md_stop(struct mddev *mddev)
>   {
>          struct md_personality *pers = mddev->pers;
> -       mddev_detach(mddev);
>          /* Ensure ->event_work is done */
>          flush_workqueue(md_misc_wq);
>          spin_lock(&mddev->lock);
> @@ -5574,8 +5573,9 @@ void md_stop(struct mddev *mddev)
>          /* stop the array and free an attached data structures.
>           * This is called from dm-raid
>           */
> -       __md_stop(mddev);
> +       mddev_detach(mddev);
>          bitmap_destroy(mddev);
> +       __md_stop(mddev);
>          if (mddev->bio_set)
>                  bioset_free(mddev->bio_set);
>   }
> @@ -5688,6 +5688,10 @@ static int do_md_stop(struct mddev *mddev, int mode,
>                          set_disk_ro(disk, 0);
>
>                  __md_stop_writes(mddev);
> +               mddev_detach(mddev);
> +               if (mode == 0)
> +                       bitmap_destroy(mddev);
> +
>                  __md_stop(mddev);
> mddev->queue->backing_dev_info->congested_fn = NULL;
>
> @@ -5713,7 +5717,8 @@ static int do_md_stop(struct mddev *mddev, int mode,
>          if (mode == 0) {
>                  pr_info("md: %s stopped.\n", mdname(mddev));
>
> -               bitmap_destroy(mddev);
> +               if (!mddev->pers)
> +                       bitmap_destroy(mddev);
>                  if (mddev->bitmap_info.file) {
>                          struct file *f = mddev->bitmap_info.file;
>                          spin_lock(&mddev->lock);
>
> Thanks,
> Guoqing

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-03-03  5:20 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01  8:42 [PATCH V2 0/5] the latest changes for md-cluster Guoqing Jiang
2017-03-01  8:42 ` [PATCH V2 2/5] md: move bitmap_destroy before __md_stop Guoqing Jiang
2017-03-02 18:28   ` Shaohua Li
2017-03-02 22:15     ` NeilBrown
2017-03-03  3:08       ` Guoqing Jiang
2017-03-03  5:20         ` NeilBrown
2017-03-01  8:42 ` [PATCH V2 3/5] md-cluster: add CHANGE_CAPACITY message type Guoqing Jiang
2017-03-01  8:42 ` [PATCH V2 4/5] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
2017-03-01  8:42 ` [PATCH V2 5/5] md-cluster: add the support for resize Guoqing Jiang
2017-03-01  9:30 ` [PATCH V2 1/5] md-cluster: use sync way to handle METADATA_UPDATED msg Guoqing Jiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.