All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/14] the latest changes for md-cluster
@ 2017-02-24  3:15 Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 01/14] md-cluster: remove unnecessary header files Guoqing Jiang
                   ` (14 more replies)
  0 siblings, 15 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Hi,

This patchset is based on md/for-next branch, which includes below parts:

1. some code improvements (0001-0004).
2. changes for use sync way to handle METADATA_UPDATED msg (0005-0009).
3. add resize support for md-cluster (0010-0012 and 0014), and also
   make some related changes inside md (0013).

Thanks,
Guoqing   

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 01/14] md-cluster: remove unnecessary header files
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 18:03   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 02/14] md-cluster: free md_cluster_info if node leave cluster Guoqing Jiang
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

md-cluster.h is already included in md.h, so remove
the redundant one and we don't want to cross include
header file too.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 1 -
 drivers/md/md-cluster.h | 2 --
 2 files changed, 3 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 2b13117fb918..03c38ed222ce 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -16,7 +16,6 @@
 #include <linux/raid/md_p.h>
 #include "md.h"
 #include "bitmap.h"
-#include "md-cluster.h"
 
 #define LVB_SIZE	64
 #define NEW_DEV_TIMEOUT 5000
diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
index e765499ba591..8f26a5e80810 100644
--- a/drivers/md/md-cluster.h
+++ b/drivers/md/md-cluster.h
@@ -3,8 +3,6 @@
 #ifndef _MD_CLUSTER_H
 #define _MD_CLUSTER_H
 
-#include "md.h"
-
 struct mddev;
 struct md_rdev;
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 02/14] md-cluster: free md_cluster_info if node leave cluster
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 01/14] md-cluster: remove unnecessary header files Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 03/14] md-cluster: remove useless memset from gather_all_resync_info Guoqing Jiang
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

To avoid memory leak, we need to free the cinfo which
is allocated when node join cluster.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 03c38ed222ce..84c001063983 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -973,6 +973,7 @@ static int leave(struct mddev *mddev)
 	lockres_free(cinfo->bitmap_lockres);
 	unlock_all_bitmaps(mddev);
 	dlm_release_lockspace(cinfo->lockspace, 2);
+	kfree(cinfo);
 	return 0;
 }
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 03/14] md-cluster: remove useless memset from gather_all_resync_info
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 01/14] md-cluster: remove unnecessary header files Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 02/14] md-cluster: free md_cluster_info if node leave cluster Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info Guoqing Jiang
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

This memset is not needed.  The lvb is already zeroed because
it was recently allocated by lockres_init, which uses kzalloc(),
and read_resync_info() doesn't need it to be zero anyway.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 84c001063983..620828e56dc8 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -776,7 +776,6 @@ static int gather_all_resync_info(struct mddev *mddev, int total_slots)
 		bm_lockres->flags |= DLM_LKF_NOQUEUE;
 		ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
 		if (ret == -EAGAIN) {
-			memset(bm_lockres->lksb.sb_lvbptr, '\0', LVB_SIZE);
 			s = read_resync_info(mddev, bm_lockres);
 			if (s) {
 				pr_info("%s:%d Resync[%llu..%llu] in progress on %d\n",
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (2 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 03/14] md-cluster: remove useless memset from gather_all_resync_info Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 18:06   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 05/14] md-cluster: add new parameter for lock_token Guoqing Jiang
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Store "struct mddev *mddev" in md_cluster_info,
then we can get mddev inside lock_token in next
patch.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 620828e56dc8..1ec89273ff99 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -69,6 +69,7 @@ struct resync_info {
 
 
 struct md_cluster_info {
+	struct mddev *mddev; /* the md device which md_cluster_info belongs to */
 	/* dlm lock space and resources for clustered raid. */
 	dlm_lockspace_t *lockspace;
 	int slot_number;
@@ -833,6 +834,7 @@ static int join(struct mddev *mddev, int nodes)
 	mutex_init(&cinfo->recv_mutex);
 
 	mddev->cluster_info = cinfo;
+	cinfo->mddev = mddev;
 
 	memset(str, 0, 64);
 	sprintf(str, "%pU", mddev->uuid);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 05/14] md-cluster: add new parameter for lock_token
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (3 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 18:07   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 07/14] md: move bitmap_destroy before __md_stop Guoqing Jiang
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

lock_token is called from either lock_comm or
metadata_update_start, if we look back more for
the call chain, some of them are called with the
reconfig_mutex held while a few of them don't.

Specifically, resync_info_update is mostly called
without the protection of reconfig_mutex. But
resync_finish calls resync_info_update within
the mutex.

We make the change to lock_token since we need
to use synchronous way to handle METADATA_UPDATED
message in latter patch.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 1ec89273ff99..a6cf02c5c7bc 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -646,7 +646,7 @@ static void recv_daemon(struct md_thread *thread)
  * Takes the lock on the TOKEN lock resource so no other
  * node can communicate while the operation is underway.
  */
-static int lock_token(struct md_cluster_info *cinfo)
+static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 {
 	int error;
 
@@ -663,12 +663,12 @@ static int lock_token(struct md_cluster_info *cinfo)
 /* lock_comm()
  * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
  */
-static int lock_comm(struct md_cluster_info *cinfo)
+static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
 {
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
 
-	return lock_token(cinfo);
+	return lock_token(cinfo, mddev_locked);
 }
 
 static void unlock_comm(struct md_cluster_info *cinfo)
@@ -743,11 +743,12 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
 	return error;
 }
 
-static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
+static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
+		   bool mddev_locked)
 {
 	int ret;
 
-	lock_comm(cinfo);
+	lock_comm(cinfo, mddev_locked);
 	ret = __sendmsg(cinfo, cmsg);
 	unlock_comm(cinfo);
 	return ret;
@@ -944,7 +945,7 @@ static void resync_bitmap(struct mddev *mddev)
 	int err;
 
 	cmsg.type = cpu_to_le32(BITMAP_NEEDS_SYNC);
-	err = sendmsg(cinfo, &cmsg);
+	err = sendmsg(cinfo, &cmsg, 1);
 	if (err)
 		pr_err("%s:%d: failed to send BITMAP_NEEDS_SYNC message (%d)\n",
 			__func__, __LINE__, err);
@@ -1007,7 +1008,7 @@ static int metadata_update_start(struct mddev *mddev)
 	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
 		return 0;
 
-	return lock_token(cinfo);
+	return lock_token(cinfo, 1);
 }
 
 static int metadata_update_finish(struct mddev *mddev)
@@ -1070,7 +1071,14 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
 	cmsg.low = cpu_to_le64(lo);
 	cmsg.high = cpu_to_le64(hi);
 
-	return sendmsg(cinfo, &cmsg);
+	/*
+	 * mddev_lock is held if resync_info_update is called from
+	 * resync_finish (md_reap_sync_thread -> resync_finish)
+	 */
+	if (lo == 0 && hi == 0)
+		return sendmsg(cinfo, &cmsg, 1);
+	else
+		return sendmsg(cinfo, &cmsg, 0);
 }
 
 static int resync_finish(struct mddev *mddev)
@@ -1120,7 +1128,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
 	cmsg.type = cpu_to_le32(NEWDISK);
 	memcpy(cmsg.uuid, uuid, 16);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	lock_comm(cinfo);
+	lock_comm(cinfo, 1);
 	ret = __sendmsg(cinfo, &cmsg);
 	if (ret)
 		return ret;
@@ -1180,7 +1188,7 @@ static int remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	cmsg.type = cpu_to_le32(REMOVE);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	return sendmsg(cinfo, &cmsg);
+	return sendmsg(cinfo, &cmsg, 1);
 }
 
 static int lock_all_bitmaps(struct mddev *mddev)
@@ -1244,7 +1252,7 @@ static int gather_bitmaps(struct md_rdev *rdev)
 
 	cmsg.type = cpu_to_le32(RE_ADD);
 	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
-	err = sendmsg(cinfo, &cmsg);
+	err = sendmsg(cinfo, &cmsg, 1);
 	if (err)
 		goto out;
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 07/14] md: move bitmap_destroy before __md_stop
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (4 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 05/14] md-cluster: add new parameter for lock_token Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 18:20   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 08/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before unregister thread Guoqing Jiang
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Since we have switched to sync way to handle METADATA_UPDATED
msg for md-cluster, then process_metadata_update is depended
on mddev->thread->wqueue.

With the new change, clustered raid could possible hang if
array received a METADATA_UPDATED msg after array unregistered
mddev->thread, so we need to stop clustered raid earlier
than before.

And this change should be safe for non-clustered raid since
all writes are stopped before the destroy. Also in md_run,
we activate the personality (pers->run()) before activating
the bitmap (bitmap_create()). So it is pleasingly symmetric
to stop the bitmap (bitmap_destroy()) before stopping the
personality (__md_stop() calls pers->free()).

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6e910d99c9c1..7bcc757386e1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5582,8 +5582,8 @@ void md_stop(struct mddev *mddev)
 	/* stop the array and free an attached data structures.
 	 * This is called from dm-raid
 	 */
-	__md_stop(mddev);
 	bitmap_destroy(mddev);
+	__md_stop(mddev);
 	if (mddev->bio_set)
 		bioset_free(mddev->bio_set);
 }
@@ -5696,6 +5696,20 @@ static int do_md_stop(struct mddev *mddev, int mode,
 			set_disk_ro(disk, 0);
 
 		__md_stop_writes(mddev);
+
+		/*
+		 * Destroy bitmap after all writes are stopped
+		 */
+		bitmap_destroy(mddev);
+		if (mddev->bitmap_info.file) {
+			struct file *f = mddev->bitmap_info.file;
+			spin_lock(&mddev->lock);
+			mddev->bitmap_info.file = NULL;
+			spin_unlock(&mddev->lock);
+			fput(f);
+		}
+		mddev->bitmap_info.offset = 0;
+
 		__md_stop(mddev);
 		mddev->queue->backing_dev_info.congested_fn = NULL;
 
@@ -5720,19 +5734,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	 */
 	if (mode == 0) {
 		pr_info("md: %s stopped.\n", mdname(mddev));
-
-		bitmap_destroy(mddev);
-		if (mddev->bitmap_info.file) {
-			struct file *f = mddev->bitmap_info.file;
-			spin_lock(&mddev->lock);
-			mddev->bitmap_info.file = NULL;
-			spin_unlock(&mddev->lock);
-			fput(f);
-		}
-		mddev->bitmap_info.offset = 0;
-
 		export_array(mddev);
-
 		md_clean(mddev);
 		if (mddev->hold_active == UNTIL_STOP)
 			mddev->hold_active = 0;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 08/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before unregister thread
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (5 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 07/14] md: move bitmap_destroy before __md_stop Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 18:22   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start Guoqing Jiang
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

After used sync way to handle METADATA_UPDATED msg, a deadlock
could appear if stop a resyncing array.

betalinux244:~ # ps aux|grep md|grep D
root     17164  0.0  0.0      0     0 ?        D    Jan09   0:00 [md0_cluster_rec]
root     18151  0.0  0.1  19852  2008 ?        Ds   Jan09   0:00 /sbin/mdadm -Ssq
betalinux244:~ # cat /proc/17164/stack
[<ffffffffa06a7395>] recv_daemon+0x1f5/0x590 [md_cluster]
[<ffffffffa067be20>] md_thread+0x130/0x150 [md_mod]
[<ffffffff810995ed>] kthread+0xbd/0xe0
[<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
[<ffffffff81099530>] kthread+0x0/0xe0
[<ffffffffffffffff>] 0xffffffffffffffff
betalinux244:~ # cat /proc/18151/stack
[<ffffffff81099879>] kthread_stop+0x59/0x130
[<ffffffffa067c566>] md_unregister_thread+0x46/0x80 [md_mod]
[<ffffffffa06a6e71>] leave+0x81/0x120 [md_cluster]
[<ffffffffa0684e94>] md_cluster_stop+0x14/0x30 [md_mod]
[<ffffffffa06858b6>] bitmap_free+0x126/0x130 [md_mod]
[<ffffffffa0682d06>] do_md_stop+0x356/0x5f0 [md_mod]
[<ffffffffa0683cbe>] md_ioctl+0x6fe/0x1680 [md_mod]
[<ffffffff812ed158>] blkdev_ioctl+0x258/0x920
[<ffffffff8122f81d>] block_ioctl+0x3d/0x40
[<ffffffff8120ac0d>] do_vfs_ioctl+0x2cd/0x4a0
[<ffffffff8120ae54>] SyS_ioctl+0x74/0x80
[<ffffffff815e936e>] entry_SYSCALL_64_fastpath+0x12/0x6d
[<ffffffffffffffff>] 0xffffffffffffffff

Since md_unregister_thread(&cinfo->recv_thread) is blocked by
recv_daemon -> process_recvd_msg -> process_metadata_update.
To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
also need to be set before unregister thread.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 0aad477d1b20..5e2c54be6f30 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -932,6 +932,7 @@ static int join(struct mddev *mddev, int nodes)
 
 	return 0;
 err:
+	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	md_unregister_thread(&cinfo->recovery_thread);
 	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
@@ -987,6 +988,7 @@ static int leave(struct mddev *mddev)
 	if (cinfo->slot_number > 0 && mddev->recovery_cp != MaxSector)
 		resync_bitmap(mddev);
 
+	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 	md_unregister_thread(&cinfo->recovery_thread);
 	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (6 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 08/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before unregister thread Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 18:24   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 10/14] md-cluster: add CHANGE_CAPACITY message type Guoqing Jiang
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Since we have switched to sync way to handle METADATA_UPDATED
msg, then process_metadata_update can only continue with either
get the reconfig_mutex or MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
set.

However, there is a deadlock issue which happened as follows:

1. Node A sends METADATA_UPDATED msg (held Token lock).
2. Node B wants to do resync, and is blocked since it can't
   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
   not set since the callchain
   (md_do_sync -> sync_request
               -> resync_info_update
	       -> sendmsg
	       -> lock_comm -> lock_token)
   doesn't hold reconfig_mutex.
3. Node B trys to update sb (held reconfig_mutex), but stopped
   at wait_event() in metadata_update_start since we have set
   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
4. Then Node B receives METADATA_UPDATED msg from A, of course
   recv_daemon is blocked forever.

The followings are more detailed infos.
betalinux240:~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 vdc[1] vdb[0]
2095104 blocks super 1.2 [2/2] [UU]
[>....................]  resync =  4.6% (98112/2095104) finish=28.8min speed=1154K/sec
bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
betalinux240:~ # ps aux|grep md|grep D
root      4393  0.0  0.0      0     0 ?        D    11:42   0:00 [md0_raid1]
root      4402  0.0  0.0      0     0 ?        D    11:42   0:00 [md0_cluster_rec]
root      4407  0.0  0.0      0     0 ?        D    11:42   0:00 [md0_resync]
betalinux240:~ # cat /proc/4407/stack
[<ffffffffa05eb531>] dlm_lock_sync+0x81/0xc0 [md_cluster]
[<ffffffffa05eba23>] lock_token+0x23/0xa0 [md_cluster]
[<ffffffffa05ebad2>] lock_comm+0x32/0x90 [md_cluster]
[<ffffffffa05ebb45>] sendmsg+0x15/0x30 [md_cluster]
[<ffffffffa05ebd5a>] resync_info_update+0x8a/0xc0 [md_cluster]
[<ffffffffa06b6ae7>] sync_request+0xa57/0xaf0 [raid1]
[<ffffffffa069825d>] md_do_sync+0x90d/0xe70 [md_mod]
[<ffffffffa0694c50>] md_thread+0x130/0x150 [md_mod]
[<ffffffff810995ed>] kthread+0xbd/0xe0
[<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
[<ffffffff81099530>] kthread+0x0/0xe0
betalinux240:~ # cat /proc/4402/stack
[<ffffffffa05ecf77>] recv_daemon+0x1d7/0x570 [md_cluster]
[<ffffffffa0694c50>] md_thread+0x130/0x150 [md_mod]
[<ffffffff810995ed>] kthread+0xbd/0xe0
[<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
[<ffffffff81099530>] kthread+0x0/0xe0
betalinux240:~ # cat /proc/4393/stack
[<ffffffffa05ec0e5>] metadata_update_start+0xa5/0xb0 [md_cluster]
[<ffffffffa069913e>] md_update_sb.part.50+0x8e/0x850 [md_mod]
[<ffffffffa069ab1d>] md_check_recovery+0x21d/0x4e0 [md_mod]
[<ffffffffa06b9322>] raid1d+0x42/0x7d0 [raid1]
[<ffffffffa0694c50>] md_thread+0x130/0x150 [md_mod]
[<ffffffff810995ed>] kthread+0xbd/0xe0
[<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
[<ffffffff81099530>] kthread+0x0/0xe0

Since metadata_update_start always calls lock_token with
reconfig_mutex, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
here as well, and lock_token don't need to set it twice unless
lock_token is invoked from lock_comm.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 5e2c54be6f30..bcc269312b71 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -654,7 +654,7 @@ static void recv_daemon(struct md_thread *thread)
  */
 static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 {
-	int error;
+	int error, set_bit = 0;
 	struct mddev *mddev = cinfo->mddev;
 
 	/*
@@ -663,14 +663,16 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 	 * since another node already got EX on Token and waitting the EX of Ack),
 	 * so let resync wake up thread in case flag is set.
 	 */
-	if (mddev_locked) {
+	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				      &cinfo->state)) {
 		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
 					      &cinfo->state);
 		WARN_ON_ONCE(error);
 		md_wakeup_thread(mddev->thread);
+		set_bit = 1;
 	}
 	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
-	if (mddev_locked)
+	if (set_bit)
 		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 
 	if (error)
@@ -1023,16 +1025,30 @@ static int slot_number(struct mddev *mddev)
 static int metadata_update_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
+	int ret;
+
+	/*
+	 * metadata_update_start is always called with the protection of
+	 * reconfig_mutex, so set WAITING_FOR_TOKEN here.
+	 */
+	ret = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+				    &cinfo->state);
+	WARN_ON_ONCE(ret);
+	md_wakeup_thread(mddev->thread);
 
 	wait_event(cinfo->wait,
 		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state) ||
 		   test_and_clear_bit(MD_CLUSTER_SEND_LOCKED_ALREADY, &cinfo->state));
 
 	/* If token is already locked, return 0 */
-	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
+	if (cinfo->token_lockres->mode == DLM_LOCK_EX) {
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
 		return 0;
+	}
 
-	return lock_token(cinfo, 1);
+	ret = lock_token(cinfo, 1);
+	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+	return ret;
 }
 
 static int metadata_update_finish(struct mddev *mddev)
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/14] md-cluster: add CHANGE_CAPACITY message type
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (7 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 11/14] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

The msg type CHANGE_CAPACITY is introduced to support
resize clustered raid in later patch, and it is sent
after all the nodes have the same sync_size, receiver
node just need to set new capacity once received this
msg.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index bcc269312b71..f1b9a7a3ddd2 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -103,6 +103,7 @@ enum msg_type {
 	REMOVE,
 	RE_ADD,
 	BITMAP_NEEDS_SYNC,
+	CHANGE_CAPACITY,
 };
 
 struct cluster_msg {
@@ -578,6 +579,9 @@ static int process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg)
 	case METADATA_UPDATED:
 		process_metadata_update(mddev, msg);
 		break;
+	case CHANGE_CAPACITY:
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+		break;
 	case RESYNCING:
 		process_suspend_info(mddev, le32_to_cpu(msg->slot),
 				     le64_to_cpu(msg->low),
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/14] md-cluster: introduce cluster_check_sync_size
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (8 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 10/14] md-cluster: add CHANGE_CAPACITY message type Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 19:05   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 12/14] md/bitmap: replace redundant codes with get_bitmap_from_slot Guoqing Jiang
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Support resize is a little complex for clustered
raid, since we need to ensure all the nodes share
the same knowledge about the size of raid.

We achieve the goal by check the sync_size which
is in each node's bitmap, we can only change the
capacity after cluster_check_sync_size returns 0.

Also, get_bitmap_from_slot is added to get a slot's
bitmap. And we exported some funcs since they are
used in cluster_check_sync_size().

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c     | 25 ++++++++++++++++++++-
 drivers/md/bitmap.h     |  2 ++
 drivers/md/md-cluster.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 9fb2ccac958a..67a7d399f501 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
 	kunmap_atomic(sb);
 	write_page(bitmap, bitmap->storage.sb_page, 1);
 }
+EXPORT_SYMBOL(bitmap_update_sb);
 
 /* print out the bitmap file superblock */
 void bitmap_print_sb(struct bitmap *bitmap)
@@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
 /*
  * free memory that was allocated
  */
-static void bitmap_free(struct bitmap *bitmap)
+void bitmap_free(struct bitmap *bitmap)
 {
 	unsigned long k, pages;
 	struct bitmap_page *bp;
@@ -1761,6 +1762,7 @@ static void bitmap_free(struct bitmap *bitmap)
 	kfree(bp);
 	kfree(bitmap);
 }
+EXPORT_SYMBOL(bitmap_free);
 
 void bitmap_destroy(struct mddev *mddev)
 {
@@ -1920,6 +1922,27 @@ int bitmap_load(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(bitmap_load);
 
+struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
+{
+	int rv = 0;
+	struct bitmap *bitmap;
+
+	bitmap = bitmap_create(mddev, slot);
+	if (IS_ERR(bitmap)) {
+		rv = PTR_ERR(bitmap);
+		return ERR_PTR(rv);
+	}
+
+	rv = bitmap_init_from_disk(bitmap, 0);
+	if (rv) {
+		bitmap_free(bitmap);
+		return ERR_PTR(rv);
+	}
+
+	return bitmap;
+}
+EXPORT_SYMBOL(get_bitmap_from_slot);
+
 /* Loads the bitmap associated with slot and copies the resync information
  * to our bitmap
  */
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 5b6dd63dda91..9f761097aab2 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -267,8 +267,10 @@ void bitmap_daemon_work(struct mddev *mddev);
 
 int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		  int chunksize, int init);
+struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
 int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 				sector_t *lo, sector_t *hi, bool clear_bits);
+void bitmap_free(struct bitmap *bitmap);
 #endif
 
 #endif
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index f1b9a7a3ddd2..d3c024e6bfcf 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -1089,6 +1089,64 @@ static void metadata_update_cancel(struct mddev *mddev)
 	unlock_comm(cinfo);
 }
 
+/*
+ * retun 0 if all the bitmaps have the same sync_size
+ */
+int cluster_check_sync_size(struct mddev *mddev)
+{
+	int i, rv;
+	bitmap_super_t *sb;
+	unsigned long my_sync_size, sync_size = 0;
+	int node_num = mddev->bitmap_info.nodes;
+	int current_slot = md_cluster_ops->slot_number(mddev);
+	struct bitmap *bitmap = mddev->bitmap;
+	char str[64];
+	struct dlm_lock_resource *bm_lockres;
+
+	sb = kmap_atomic(bitmap->storage.sb_page);
+	my_sync_size = sb->sync_size;
+	kunmap_atomic(sb);
+
+	for (i = 0; i < node_num; i++) {
+		if (i == current_slot)
+			continue;
+
+		bitmap = get_bitmap_from_slot(mddev, i);
+		if (IS_ERR(bitmap)) {
+			pr_err("can't get bitmap from slot %d\n", i);
+			return -1;
+		}
+
+		/*
+		 * If we can hold the bitmap lock of one node then
+		 * the slot is not occupied, update the sb.
+		 */
+		snprintf(str, 64, "bitmap%04d", i);
+		bm_lockres = lockres_init(mddev, str, NULL, 1);
+		if (!bm_lockres) {
+			pr_err("md-cluster: Cannot initialize %s\n", str);
+		}
+		bm_lockres->flags |= DLM_LKF_NOQUEUE;
+		rv = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
+		if (!rv)
+			bitmap_update_sb(bitmap);
+		lockres_free(bm_lockres);
+
+		sb = kmap_atomic(bitmap->storage.sb_page);
+		if (sync_size == 0)
+			sync_size = sb->sync_size;
+		else if (sync_size != sb->sync_size) {
+			kunmap_atomic(sb);
+			bitmap_free(bitmap);
+			return -1;
+		}
+		kunmap_atomic(sb);
+		bitmap_free(bitmap);
+	}
+
+	return (my_sync_size == sync_size) ? 0 : -1;
+}
+
 static int resync_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 12/14] md/bitmap: replace redundant codes with get_bitmap_from_slot
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (9 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 11/14] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 19:06   ` Shaohua Li
  2017-02-24  3:15 ` [PATCH 13/14] md: move funcs from pers->resize to update_size Guoqing Jiang
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Since get_bitmap_from_slot is introduced in previous
commit, we can use it in bitmap_copy_from_slot to
remove redundant code.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 67a7d399f501..b6fa55a3cff8 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1952,14 +1952,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	int rv = 0, i, j;
 	sector_t block, lo = 0, hi = 0;
 	struct bitmap_counts *counts;
-	struct bitmap *bitmap = bitmap_create(mddev, slot);
-
-	if (IS_ERR(bitmap))
-		return PTR_ERR(bitmap);
+	struct bitmap *bitmap;
 
-	rv = bitmap_init_from_disk(bitmap, 0);
-	if (rv)
-		goto err;
+	bitmap = get_bitmap_from_slot(mddev, slot);
+	if (IS_ERR(bitmap)) {
+		pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
+		return -1;
+	}
 
 	counts = &bitmap->counts;
 	for (j = 0; j < counts->chunks; j++) {
@@ -1986,8 +1985,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	bitmap_unplug(mddev->bitmap);
 	*low = lo;
 	*high = hi;
-err:
-	bitmap_free(bitmap);
+
 	return rv;
 }
 EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 13/14] md: move funcs from pers->resize to update_size
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (10 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 12/14] md/bitmap: replace redundant codes with get_bitmap_from_slot Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-24  3:15 ` [PATCH 14/14] md-cluster: add the support for resize Guoqing Jiang
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

raid1_resize and raid5_resize should also check the
mddev->queue if run underneath dm-raid.

And both set_capacity and revalidate_disk are used in
pers->resize such as raid1, raid10 and raid5. So
move them from personality file to common code.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md.c     | 8 ++++++--
 drivers/md/raid1.c  | 2 --
 drivers/md/raid10.c | 4 ----
 drivers/md/raid5.c  | 2 --
 4 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7bcc757386e1..975c9dd60c05 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6534,8 +6534,12 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 			return -ENOSPC;
 	}
 	rv = mddev->pers->resize(mddev, num_sectors);
-	if (!rv)
-		revalidate_disk(mddev->gendisk);
+	if (!rv) {
+		if (mddev->queue) {
+			set_capacity(mddev->gendisk, mddev->array_sectors);
+			revalidate_disk(mddev->gendisk);
+		}
+	}
 	return rv;
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d4e8796b4569..9688f81ea7e9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3246,8 +3246,6 @@ static int raid1_resize(struct mddev *mddev, sector_t sectors)
 			return ret;
 	}
 	md_set_array_sectors(mddev, newsize);
-	set_capacity(mddev->gendisk, mddev->array_sectors);
-	revalidate_disk(mddev->gendisk);
 	if (sectors > mddev->dev_sectors &&
 	    mddev->recovery_cp > mddev->dev_sectors) {
 		mddev->recovery_cp = mddev->dev_sectors;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ade7d69234d5..59533cbea446 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3943,10 +3943,6 @@ static int raid10_resize(struct mddev *mddev, sector_t sectors)
 			return ret;
 	}
 	md_set_array_sectors(mddev, size);
-	if (mddev->queue) {
-		set_capacity(mddev->gendisk, mddev->array_sectors);
-		revalidate_disk(mddev->gendisk);
-	}
 	if (sectors > mddev->dev_sectors &&
 	    mddev->recovery_cp > oldsize) {
 		mddev->recovery_cp = oldsize;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7b7722bb2e8d..df63cd7830ce 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7603,8 +7603,6 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
 			return ret;
 	}
 	md_set_array_sectors(mddev, newsize);
-	set_capacity(mddev->gendisk, mddev->array_sectors);
-	revalidate_disk(mddev->gendisk);
 	if (sectors > mddev->dev_sectors &&
 	    mddev->recovery_cp > mddev->dev_sectors) {
 		mddev->recovery_cp = mddev->dev_sectors;
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 14/14] md-cluster: add the support for resize
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (11 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 13/14] md: move funcs from pers->resize to update_size Guoqing Jiang
@ 2017-02-24  3:15 ` Guoqing Jiang
  2017-02-28 19:25   ` Shaohua Li
  2017-02-28 19:27 ` [PATCH 00/14] the latest changes for md-cluster Shaohua Li
  2017-03-01  2:02 ` [PATCH 06/14] md-cluster: use sync way to handle METADATA_UPDATED msg Guoqing Jiang
  14 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2017-02-24  3:15 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

To update size for cluster raid, we need to make
sure all nodes can perform the change successfully.
However, it is possible that some of them can't do
it due to failure (bitmap_resize could fail). So
we need to consider the issue before we set the
capacity unconditionally, and we use below steps
to perform sanity check.

1. A change the size, then broadcast METADATA_UPDATED
   msg.
2. B and C receive METADATA_UPDATED change the size
   excepts call set_capacity, sync_size is not update
   if the change failed. Also call bitmap_update_sb
   to sync sb to disk.
3. A checks other node's sync_size, if sync_size has
   been updated in all nodes, then send CHANGE_CAPACITY
   msg otherwise send msg to revert previous change.
4. B and C call set_capacity if receive CHANGE_CAPACITY
   msg, otherwise pers->resize will be called to restore
   the old value.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Documentation/md/md-cluster.txt |  2 +-
 drivers/md/md-cluster.c         | 75 +++++++++++++++++++++++++++++++++++++++++
 drivers/md/md-cluster.h         |  1 +
 drivers/md/md.c                 | 21 +++++++++---
 4 files changed, 93 insertions(+), 6 deletions(-)

diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
index 38883276d31c..2663d49dd8a0 100644
--- a/Documentation/md/md-cluster.txt
+++ b/Documentation/md/md-cluster.txt
@@ -321,4 +321,4 @@ The algorithm is:
 
 There are somethings which are not supported by cluster MD yet.
 
-- update size and change array_sectors.
+- change array_sectors.
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index d3c024e6bfcf..75da83187c31 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -1147,6 +1147,80 @@ int cluster_check_sync_size(struct mddev *mddev)
 	return (my_sync_size == sync_size) ? 0 : -1;
 }
 
+/*
+ * Update the size for cluster raid is a little more complex, we perform it
+ * by the steps:
+ * 1. hold token lock and update superblock in initiator node.
+ * 2. send METADATA_UPDATED msg to other nodes.
+ * 3. The initiator node continues to check each bitmap's sync_size, if all
+ *    bitmaps have the same value of sync_size, then we can set capacity and
+ *    let other nodes to perform it. If one node can't update sync_size
+ *    accordingly, we need to revert to previous value.
+ */
+static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
+{
+	struct md_cluster_info *cinfo = mddev->cluster_info;
+	struct cluster_msg cmsg;
+	struct md_rdev *rdev;
+	int ret = 0;
+	int raid_slot = -1;
+
+	md_update_sb(mddev, 1);
+	lock_comm(cinfo, 1);
+
+	memset(&cmsg, 0, sizeof(cmsg));
+	cmsg.type = cpu_to_le32(METADATA_UPDATED);
+	rdev_for_each(rdev, mddev)
+		if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) {
+			raid_slot = rdev->desc_nr;
+			break;
+		}
+	if (raid_slot >= 0) {
+		cmsg.raid_slot = cpu_to_le32(raid_slot);
+		/*
+		 * We can only change capiticy after all the nodes can do it,
+		 * so need to wait after other nodes already received the msg
+		 * and handled the change
+		 */
+		ret = __sendmsg(cinfo, &cmsg);
+		if (ret) {
+			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
+			       __func__, __LINE__);
+			unlock_comm(cinfo);
+			return;
+		}
+	} else {
+		pr_err("md-cluster: No good device id found to send\n");
+		unlock_comm(cinfo);
+		return;
+	}
+
+	/*
+	 * check the sync_size from other node's bitmap, if sync_size
+	 * have already updated in other nodes as expected, send an
+	 * empty metadata msg to permit the change of capacity
+	 */
+	if (cluster_check_sync_size(mddev) == 0) {
+		memset(&cmsg, 0, sizeof(cmsg));
+		cmsg.type = cpu_to_le32(CHANGE_CAPACITY);
+		ret = __sendmsg(cinfo, &cmsg);
+		if (ret)
+			pr_err("%s:%d: failed to send CHANGE_CAPACITY msg\n",
+			       __func__, __LINE__);
+		set_capacity(mddev->gendisk, mddev->array_sectors);
+	} else {
+		/* revert to previous sectors */
+		ret = mddev->pers->resize(mddev, old_dev_sectors);
+		if (!ret)
+			revalidate_disk(mddev->gendisk);
+		ret = __sendmsg(cinfo, &cmsg);
+		if (ret)
+			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
+			       __func__, __LINE__);
+	}
+	unlock_comm(cinfo);
+}
+
 static int resync_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
@@ -1392,6 +1466,7 @@ static struct md_cluster_operations cluster_ops = {
 	.gather_bitmaps = gather_bitmaps,
 	.lock_all_bitmaps = lock_all_bitmaps,
 	.unlock_all_bitmaps = unlock_all_bitmaps,
+	.update_size = update_size,
 };
 
 static int __init cluster_init(void)
diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
index 8f26a5e80810..e939c14222ba 100644
--- a/drivers/md/md-cluster.h
+++ b/drivers/md/md-cluster.h
@@ -25,6 +25,7 @@ struct md_cluster_operations {
 	int (*gather_bitmaps)(struct md_rdev *rdev);
 	int (*lock_all_bitmaps)(struct mddev *mddev);
 	void (*unlock_all_bitmaps)(struct mddev *mddev);
+	void (*update_size)(struct mddev *mddev, sector_t old_dev_sectors);
 };
 
 #endif /* _MD_CLUSTER_H */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 975c9dd60c05..2bc64059ff57 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6503,10 +6503,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 	struct md_rdev *rdev;
 	int rv;
 	int fit = (num_sectors == 0);
-
-	/* cluster raid doesn't support update size */
-	if (mddev_is_clustered(mddev))
-		return -EINVAL;
+	sector_t old_dev_sectors = mddev->dev_sectors;
 
 	if (mddev->pers->resize == NULL)
 		return -EINVAL;
@@ -6535,7 +6532,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 	}
 	rv = mddev->pers->resize(mddev, num_sectors);
 	if (!rv) {
-		if (mddev->queue) {
+		if (mddev_is_clustered(mddev))
+			md_cluster_ops->update_size(mddev, old_dev_sectors);
+		else if (mddev->queue) {
 			set_capacity(mddev->gendisk, mddev->array_sectors);
 			revalidate_disk(mddev->gendisk);
 		}
@@ -8753,6 +8752,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 	int role, ret;
 	char b[BDEVNAME_SIZE];
 
+	/*
+	 * If size is changed in another node then we need to
+	 * do resize as well.
+	 */
+	if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
+		ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
+		if (ret)
+			pr_info("md-cluster: resize failed\n");
+		else
+			bitmap_update_sb(mddev->bitmap);
+	}
+
 	/* Check for change of roles in the active devices */
 	rdev_for_each(rdev2, mddev) {
 		if (test_bit(Faulty, &rdev2->flags))
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/14] md-cluster: remove unnecessary header files
  2017-02-24  3:15 ` [PATCH 01/14] md-cluster: remove unnecessary header files Guoqing Jiang
@ 2017-02-28 18:03   ` Shaohua Li
  2017-03-01  2:47     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 18:03 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:11AM +0800, Guoqing Jiang wrote:
> md-cluster.h is already included in md.h, so remove
> the redundant one and we don't want to cross include
> header file too.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>

It would be better md.h doesn't include md-cluster.h and include md-cluster.h
in required .c files.

> ---
>  drivers/md/md-cluster.c | 1 -
>  drivers/md/md-cluster.h | 2 --
>  2 files changed, 3 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 2b13117fb918..03c38ed222ce 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -16,7 +16,6 @@
>  #include <linux/raid/md_p.h>
>  #include "md.h"
>  #include "bitmap.h"
> -#include "md-cluster.h"
>  
>  #define LVB_SIZE	64
>  #define NEW_DEV_TIMEOUT 5000
> diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
> index e765499ba591..8f26a5e80810 100644
> --- a/drivers/md/md-cluster.h
> +++ b/drivers/md/md-cluster.h
> @@ -3,8 +3,6 @@
>  #ifndef _MD_CLUSTER_H
>  #define _MD_CLUSTER_H
>  
> -#include "md.h"
> -
>  struct mddev;
>  struct md_rdev;
>  
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info
  2017-02-24  3:15 ` [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info Guoqing Jiang
@ 2017-02-28 18:06   ` Shaohua Li
  2017-03-01  2:49     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 18:06 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:14AM +0800, Guoqing Jiang wrote:
> Store "struct mddev *mddev" in md_cluster_info,
> then we can get mddev inside lock_token in next
> patch.

Please merge this to the patch which requires the mddev. Don't think we need a
separate patch.
 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md-cluster.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 620828e56dc8..1ec89273ff99 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -69,6 +69,7 @@ struct resync_info {
>  
>  
>  struct md_cluster_info {
> +	struct mddev *mddev; /* the md device which md_cluster_info belongs to */
>  	/* dlm lock space and resources for clustered raid. */
>  	dlm_lockspace_t *lockspace;
>  	int slot_number;
> @@ -833,6 +834,7 @@ static int join(struct mddev *mddev, int nodes)
>  	mutex_init(&cinfo->recv_mutex);
>  
>  	mddev->cluster_info = cinfo;
> +	cinfo->mddev = mddev;
>  
>  	memset(str, 0, 64);
>  	sprintf(str, "%pU", mddev->uuid);
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/14] md-cluster: add new parameter for lock_token
  2017-02-24  3:15 ` [PATCH 05/14] md-cluster: add new parameter for lock_token Guoqing Jiang
@ 2017-02-28 18:07   ` Shaohua Li
  2017-03-01  2:48     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 18:07 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:15AM +0800, Guoqing Jiang wrote:
> lock_token is called from either lock_comm or
> metadata_update_start, if we look back more for
> the call chain, some of them are called with the
> reconfig_mutex held while a few of them don't.
> 
> Specifically, resync_info_update is mostly called
> without the protection of reconfig_mutex. But
> resync_finish calls resync_info_update within
> the mutex.
> 
> We make the change to lock_token since we need
> to use synchronous way to handle METADATA_UPDATED
> message in latter patch.

please merge this to the patch which uses the new paramter

 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md-cluster.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 1ec89273ff99..a6cf02c5c7bc 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -646,7 +646,7 @@ static void recv_daemon(struct md_thread *thread)
>   * Takes the lock on the TOKEN lock resource so no other
>   * node can communicate while the operation is underway.
>   */
> -static int lock_token(struct md_cluster_info *cinfo)
> +static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
>  {
>  	int error;
>  
> @@ -663,12 +663,12 @@ static int lock_token(struct md_cluster_info *cinfo)
>  /* lock_comm()
>   * Sets the MD_CLUSTER_SEND_LOCK bit to lock the send channel.
>   */
> -static int lock_comm(struct md_cluster_info *cinfo)
> +static int lock_comm(struct md_cluster_info *cinfo, bool mddev_locked)
>  {
>  	wait_event(cinfo->wait,
>  		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state));
>  
> -	return lock_token(cinfo);
> +	return lock_token(cinfo, mddev_locked);
>  }
>  
>  static void unlock_comm(struct md_cluster_info *cinfo)
> @@ -743,11 +743,12 @@ static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
>  	return error;
>  }
>  
> -static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg)
> +static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg,
> +		   bool mddev_locked)
>  {
>  	int ret;
>  
> -	lock_comm(cinfo);
> +	lock_comm(cinfo, mddev_locked);
>  	ret = __sendmsg(cinfo, cmsg);
>  	unlock_comm(cinfo);
>  	return ret;
> @@ -944,7 +945,7 @@ static void resync_bitmap(struct mddev *mddev)
>  	int err;
>  
>  	cmsg.type = cpu_to_le32(BITMAP_NEEDS_SYNC);
> -	err = sendmsg(cinfo, &cmsg);
> +	err = sendmsg(cinfo, &cmsg, 1);
>  	if (err)
>  		pr_err("%s:%d: failed to send BITMAP_NEEDS_SYNC message (%d)\n",
>  			__func__, __LINE__, err);
> @@ -1007,7 +1008,7 @@ static int metadata_update_start(struct mddev *mddev)
>  	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
>  		return 0;
>  
> -	return lock_token(cinfo);
> +	return lock_token(cinfo, 1);
>  }
>  
>  static int metadata_update_finish(struct mddev *mddev)
> @@ -1070,7 +1071,14 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
>  	cmsg.low = cpu_to_le64(lo);
>  	cmsg.high = cpu_to_le64(hi);
>  
> -	return sendmsg(cinfo, &cmsg);
> +	/*
> +	 * mddev_lock is held if resync_info_update is called from
> +	 * resync_finish (md_reap_sync_thread -> resync_finish)
> +	 */
> +	if (lo == 0 && hi == 0)
> +		return sendmsg(cinfo, &cmsg, 1);
> +	else
> +		return sendmsg(cinfo, &cmsg, 0);
>  }
>  
>  static int resync_finish(struct mddev *mddev)
> @@ -1120,7 +1128,7 @@ static int add_new_disk(struct mddev *mddev, struct md_rdev *rdev)
>  	cmsg.type = cpu_to_le32(NEWDISK);
>  	memcpy(cmsg.uuid, uuid, 16);
>  	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
> -	lock_comm(cinfo);
> +	lock_comm(cinfo, 1);
>  	ret = __sendmsg(cinfo, &cmsg);
>  	if (ret)
>  		return ret;
> @@ -1180,7 +1188,7 @@ static int remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  	struct md_cluster_info *cinfo = mddev->cluster_info;
>  	cmsg.type = cpu_to_le32(REMOVE);
>  	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
> -	return sendmsg(cinfo, &cmsg);
> +	return sendmsg(cinfo, &cmsg, 1);
>  }
>  
>  static int lock_all_bitmaps(struct mddev *mddev)
> @@ -1244,7 +1252,7 @@ static int gather_bitmaps(struct md_rdev *rdev)
>  
>  	cmsg.type = cpu_to_le32(RE_ADD);
>  	cmsg.raid_slot = cpu_to_le32(rdev->desc_nr);
> -	err = sendmsg(cinfo, &cmsg);
> +	err = sendmsg(cinfo, &cmsg, 1);
>  	if (err)
>  		goto out;
>  
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 07/14] md: move bitmap_destroy before __md_stop
  2017-02-24  3:15 ` [PATCH 07/14] md: move bitmap_destroy before __md_stop Guoqing Jiang
@ 2017-02-28 18:20   ` Shaohua Li
  2017-03-01  3:46     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 18:20 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:17AM +0800, Guoqing Jiang wrote:
> Since we have switched to sync way to handle METADATA_UPDATED
> msg for md-cluster, then process_metadata_update is depended
> on mddev->thread->wqueue.
> 
> With the new change, clustered raid could possible hang if
> array received a METADATA_UPDATED msg after array unregistered
> mddev->thread, so we need to stop clustered raid earlier
> than before.
> 
> And this change should be safe for non-clustered raid since
> all writes are stopped before the destroy. Also in md_run,
> we activate the personality (pers->run()) before activating
> the bitmap (bitmap_create()). So it is pleasingly symmetric
> to stop the bitmap (bitmap_destroy()) before stopping the
> personality (__md_stop() calls pers->free()).

There is no patch 6. So I can't judge the purpose of the patch.

The patch changed behaviors. We only destroy bitmap with mode == 0, now we do it
even for mode == 2. Please specify why it's safe. The __md_stop will wait for
behind writes for example only with bitmap set, but the patch makes us not do
the wait.

> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md.c | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 6e910d99c9c1..7bcc757386e1 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5582,8 +5582,8 @@ void md_stop(struct mddev *mddev)
>  	/* stop the array and free an attached data structures.
>  	 * This is called from dm-raid
>  	 */
> -	__md_stop(mddev);
>  	bitmap_destroy(mddev);
> +	__md_stop(mddev);
>  	if (mddev->bio_set)
>  		bioset_free(mddev->bio_set);
>  }
> @@ -5696,6 +5696,20 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  			set_disk_ro(disk, 0);
>  
>  		__md_stop_writes(mddev);
> +
> +		/*
> +		 * Destroy bitmap after all writes are stopped
> +		 */
> +		bitmap_destroy(mddev);
> +		if (mddev->bitmap_info.file) {
> +			struct file *f = mddev->bitmap_info.file;
> +			spin_lock(&mddev->lock);
> +			mddev->bitmap_info.file = NULL;
> +			spin_unlock(&mddev->lock);
> +			fput(f);
> +		}
> +		mddev->bitmap_info.offset = 0;
> +
>  		__md_stop(mddev);
>  		mddev->queue->backing_dev_info.congested_fn = NULL;
>  
> @@ -5720,19 +5734,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
>  	 */
>  	if (mode == 0) {
>  		pr_info("md: %s stopped.\n", mdname(mddev));
> -
> -		bitmap_destroy(mddev);
> -		if (mddev->bitmap_info.file) {
> -			struct file *f = mddev->bitmap_info.file;
> -			spin_lock(&mddev->lock);
> -			mddev->bitmap_info.file = NULL;
> -			spin_unlock(&mddev->lock);
> -			fput(f);
> -		}
> -		mddev->bitmap_info.offset = 0;
> -
>  		export_array(mddev);
> -
>  		md_clean(mddev);
>  		if (mddev->hold_active == UNTIL_STOP)
>  			mddev->hold_active = 0;
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 08/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before unregister thread
  2017-02-24  3:15 ` [PATCH 08/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before unregister thread Guoqing Jiang
@ 2017-02-28 18:22   ` Shaohua Li
  0 siblings, 0 replies; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 18:22 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:18AM +0800, Guoqing Jiang wrote:
> After used sync way to handle METADATA_UPDATED msg, a deadlock
> could appear if stop a resyncing array.

shouldn't this put into the patch of 'handle METADATA_UPDATED' msg?
 
> betalinux244:~ # ps aux|grep md|grep D
> root     17164  0.0  0.0      0     0 ?        D    Jan09   0:00 [md0_cluster_rec]
> root     18151  0.0  0.1  19852  2008 ?        Ds   Jan09   0:00 /sbin/mdadm -Ssq
> betalinux244:~ # cat /proc/17164/stack
> [<ffffffffa06a7395>] recv_daemon+0x1f5/0x590 [md_cluster]
> [<ffffffffa067be20>] md_thread+0x130/0x150 [md_mod]
> [<ffffffff810995ed>] kthread+0xbd/0xe0
> [<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
> [<ffffffff81099530>] kthread+0x0/0xe0
> [<ffffffffffffffff>] 0xffffffffffffffff
> betalinux244:~ # cat /proc/18151/stack
> [<ffffffff81099879>] kthread_stop+0x59/0x130
> [<ffffffffa067c566>] md_unregister_thread+0x46/0x80 [md_mod]
> [<ffffffffa06a6e71>] leave+0x81/0x120 [md_cluster]
> [<ffffffffa0684e94>] md_cluster_stop+0x14/0x30 [md_mod]
> [<ffffffffa06858b6>] bitmap_free+0x126/0x130 [md_mod]
> [<ffffffffa0682d06>] do_md_stop+0x356/0x5f0 [md_mod]
> [<ffffffffa0683cbe>] md_ioctl+0x6fe/0x1680 [md_mod]
> [<ffffffff812ed158>] blkdev_ioctl+0x258/0x920
> [<ffffffff8122f81d>] block_ioctl+0x3d/0x40
> [<ffffffff8120ac0d>] do_vfs_ioctl+0x2cd/0x4a0
> [<ffffffff8120ae54>] SyS_ioctl+0x74/0x80
> [<ffffffff815e936e>] entry_SYSCALL_64_fastpath+0x12/0x6d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Since md_unregister_thread(&cinfo->recv_thread) is blocked by
> recv_daemon -> process_recvd_msg -> process_metadata_update.
> To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
> also need to be set before unregister thread.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md-cluster.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 0aad477d1b20..5e2c54be6f30 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -932,6 +932,7 @@ static int join(struct mddev *mddev, int nodes)
>  
>  	return 0;
>  err:
> +	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
>  	md_unregister_thread(&cinfo->recovery_thread);
>  	md_unregister_thread(&cinfo->recv_thread);
>  	lockres_free(cinfo->message_lockres);
> @@ -987,6 +988,7 @@ static int leave(struct mddev *mddev)
>  	if (cinfo->slot_number > 0 && mddev->recovery_cp != MaxSector)
>  		resync_bitmap(mddev);
>  
> +	set_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
>  	md_unregister_thread(&cinfo->recovery_thread);
>  	md_unregister_thread(&cinfo->recv_thread);
>  	lockres_free(cinfo->message_lockres);
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start
  2017-02-24  3:15 ` [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start Guoqing Jiang
@ 2017-02-28 18:24   ` Shaohua Li
  2017-03-01  2:58     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 18:24 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:19AM +0800, Guoqing Jiang wrote:
> Since we have switched to sync way to handle METADATA_UPDATED
> msg, then process_metadata_update can only continue with either
> get the reconfig_mutex or MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
> set.

same here. Either the patch should be put into the 'METADATA_UPDATED' patch or
this patch should be before the 'METADATA_UPDATED' patch.

We shouldn't introduce a broken patch then fix it in latter patch.
 
> However, there is a deadlock issue which happened as follows:
> 
> 1. Node A sends METADATA_UPDATED msg (held Token lock).
> 2. Node B wants to do resync, and is blocked since it can't
>    get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
>    not set since the callchain
>    (md_do_sync -> sync_request
>                -> resync_info_update
> 	       -> sendmsg
> 	       -> lock_comm -> lock_token)
>    doesn't hold reconfig_mutex.
> 3. Node B trys to update sb (held reconfig_mutex), but stopped
>    at wait_event() in metadata_update_start since we have set
>    MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
> 4. Then Node B receives METADATA_UPDATED msg from A, of course
>    recv_daemon is blocked forever.
> 
> The followings are more detailed infos.
> betalinux240:~ # cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 vdc[1] vdb[0]
> 2095104 blocks super 1.2 [2/2] [UU]
> [>....................]  resync =  4.6% (98112/2095104) finish=28.8min speed=1154K/sec
> bitmap: 1/1 pages [4KB], 65536KB chunk
> 
> unused devices: <none>
> betalinux240:~ # ps aux|grep md|grep D
> root      4393  0.0  0.0      0     0 ?        D    11:42   0:00 [md0_raid1]
> root      4402  0.0  0.0      0     0 ?        D    11:42   0:00 [md0_cluster_rec]
> root      4407  0.0  0.0      0     0 ?        D    11:42   0:00 [md0_resync]
> betalinux240:~ # cat /proc/4407/stack
> [<ffffffffa05eb531>] dlm_lock_sync+0x81/0xc0 [md_cluster]
> [<ffffffffa05eba23>] lock_token+0x23/0xa0 [md_cluster]
> [<ffffffffa05ebad2>] lock_comm+0x32/0x90 [md_cluster]
> [<ffffffffa05ebb45>] sendmsg+0x15/0x30 [md_cluster]
> [<ffffffffa05ebd5a>] resync_info_update+0x8a/0xc0 [md_cluster]
> [<ffffffffa06b6ae7>] sync_request+0xa57/0xaf0 [raid1]
> [<ffffffffa069825d>] md_do_sync+0x90d/0xe70 [md_mod]
> [<ffffffffa0694c50>] md_thread+0x130/0x150 [md_mod]
> [<ffffffff810995ed>] kthread+0xbd/0xe0
> [<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
> [<ffffffff81099530>] kthread+0x0/0xe0
> betalinux240:~ # cat /proc/4402/stack
> [<ffffffffa05ecf77>] recv_daemon+0x1d7/0x570 [md_cluster]
> [<ffffffffa0694c50>] md_thread+0x130/0x150 [md_mod]
> [<ffffffff810995ed>] kthread+0xbd/0xe0
> [<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
> [<ffffffff81099530>] kthread+0x0/0xe0
> betalinux240:~ # cat /proc/4393/stack
> [<ffffffffa05ec0e5>] metadata_update_start+0xa5/0xb0 [md_cluster]
> [<ffffffffa069913e>] md_update_sb.part.50+0x8e/0x850 [md_mod]
> [<ffffffffa069ab1d>] md_check_recovery+0x21d/0x4e0 [md_mod]
> [<ffffffffa06b9322>] raid1d+0x42/0x7d0 [raid1]
> [<ffffffffa0694c50>] md_thread+0x130/0x150 [md_mod]
> [<ffffffff810995ed>] kthread+0xbd/0xe0
> [<ffffffff815e96bf>] ret_from_fork+0x3f/0x70
> [<ffffffff81099530>] kthread+0x0/0xe0
> 
> Since metadata_update_start always calls lock_token with
> reconfig_mutex, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
> here as well, and lock_token don't need to set it twice unless
> lock_token is invoked from lock_comm.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/md-cluster.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index 5e2c54be6f30..bcc269312b71 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -654,7 +654,7 @@ static void recv_daemon(struct md_thread *thread)
>   */
>  static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
>  {
> -	int error;
> +	int error, set_bit = 0;
>  	struct mddev *mddev = cinfo->mddev;
>  
>  	/*
> @@ -663,14 +663,16 @@ static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
>  	 * since another node already got EX on Token and waitting the EX of Ack),
>  	 * so let resync wake up thread in case flag is set.
>  	 */
> -	if (mddev_locked) {
> +	if (mddev_locked && !test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> +				      &cinfo->state)) {
>  		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
>  					      &cinfo->state);
>  		WARN_ON_ONCE(error);
>  		md_wakeup_thread(mddev->thread);
> +		set_bit = 1;
>  	}
>  	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
> -	if (mddev_locked)
> +	if (set_bit)
>  		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
>  
>  	if (error)
> @@ -1023,16 +1025,30 @@ static int slot_number(struct mddev *mddev)
>  static int metadata_update_start(struct mddev *mddev)
>  {
>  	struct md_cluster_info *cinfo = mddev->cluster_info;
> +	int ret;
> +
> +	/*
> +	 * metadata_update_start is always called with the protection of
> +	 * reconfig_mutex, so set WAITING_FOR_TOKEN here.
> +	 */
> +	ret = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
> +				    &cinfo->state);
> +	WARN_ON_ONCE(ret);
> +	md_wakeup_thread(mddev->thread);
>  
>  	wait_event(cinfo->wait,
>  		   !test_and_set_bit(MD_CLUSTER_SEND_LOCK, &cinfo->state) ||
>  		   test_and_clear_bit(MD_CLUSTER_SEND_LOCKED_ALREADY, &cinfo->state));
>  
>  	/* If token is already locked, return 0 */
> -	if (cinfo->token_lockres->mode == DLM_LOCK_EX)
> +	if (cinfo->token_lockres->mode == DLM_LOCK_EX) {
> +		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
>  		return 0;
> +	}
>  
> -	return lock_token(cinfo, 1);
> +	ret = lock_token(cinfo, 1);
> +	clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
> +	return ret;
>  }
>  
>  static int metadata_update_finish(struct mddev *mddev)
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 11/14] md-cluster: introduce cluster_check_sync_size
  2017-02-24  3:15 ` [PATCH 11/14] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
@ 2017-02-28 19:05   ` Shaohua Li
  2017-03-01  3:11     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 19:05 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:21AM +0800, Guoqing Jiang wrote:
> Support resize is a little complex for clustered
> raid, since we need to ensure all the nodes share
> the same knowledge about the size of raid.
> 
> We achieve the goal by check the sync_size which
> is in each node's bitmap, we can only change the
> capacity after cluster_check_sync_size returns 0.
> 
> Also, get_bitmap_from_slot is added to get a slot's
> bitmap. And we exported some funcs since they are
> used in cluster_check_sync_size().
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/bitmap.c     | 25 ++++++++++++++++++++-
>  drivers/md/bitmap.h     |  2 ++
>  drivers/md/md-cluster.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 9fb2ccac958a..67a7d399f501 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
>  	kunmap_atomic(sb);
>  	write_page(bitmap, bitmap->storage.sb_page, 1);
>  }
> +EXPORT_SYMBOL(bitmap_update_sb);
>  
>  /* print out the bitmap file superblock */
>  void bitmap_print_sb(struct bitmap *bitmap)
> @@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
>  /*
>   * free memory that was allocated
>   */
> -static void bitmap_free(struct bitmap *bitmap)
> +void bitmap_free(struct bitmap *bitmap)
>  {
>  	unsigned long k, pages;
>  	struct bitmap_page *bp;
> @@ -1761,6 +1762,7 @@ static void bitmap_free(struct bitmap *bitmap)
>  	kfree(bp);
>  	kfree(bitmap);
>  }
> +EXPORT_SYMBOL(bitmap_free);
>  
>  void bitmap_destroy(struct mddev *mddev)
>  {
> @@ -1920,6 +1922,27 @@ int bitmap_load(struct mddev *mddev)
>  }
>  EXPORT_SYMBOL_GPL(bitmap_load);
>  
> +struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
> +{
> +	int rv = 0;
> +	struct bitmap *bitmap;
> +
> +	bitmap = bitmap_create(mddev, slot);
> +	if (IS_ERR(bitmap)) {
> +		rv = PTR_ERR(bitmap);
> +		return ERR_PTR(rv);
> +	}
> +
> +	rv = bitmap_init_from_disk(bitmap, 0);
> +	if (rv) {
> +		bitmap_free(bitmap);
> +		return ERR_PTR(rv);
> +	}
> +
> +	return bitmap;
> +}
> +EXPORT_SYMBOL(get_bitmap_from_slot);
> +
>  /* Loads the bitmap associated with slot and copies the resync information
>   * to our bitmap
>   */
> diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
> index 5b6dd63dda91..9f761097aab2 100644
> --- a/drivers/md/bitmap.h
> +++ b/drivers/md/bitmap.h
> @@ -267,8 +267,10 @@ void bitmap_daemon_work(struct mddev *mddev);
>  
>  int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
>  		  int chunksize, int init);
> +struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
>  int bitmap_copy_from_slot(struct mddev *mddev, int slot,
>  				sector_t *lo, sector_t *hi, bool clear_bits);
> +void bitmap_free(struct bitmap *bitmap);
>  #endif
>  
>  #endif
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index f1b9a7a3ddd2..d3c024e6bfcf 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -1089,6 +1089,64 @@ static void metadata_update_cancel(struct mddev *mddev)
>  	unlock_comm(cinfo);
>  }
>  
> +/*
> + * retun 0 if all the bitmaps have the same sync_size
> + */
> +int cluster_check_sync_size(struct mddev *mddev)
> +{
> +	int i, rv;
> +	bitmap_super_t *sb;
> +	unsigned long my_sync_size, sync_size = 0;
> +	int node_num = mddev->bitmap_info.nodes;
> +	int current_slot = md_cluster_ops->slot_number(mddev);
> +	struct bitmap *bitmap = mddev->bitmap;
> +	char str[64];
> +	struct dlm_lock_resource *bm_lockres;
> +
> +	sb = kmap_atomic(bitmap->storage.sb_page);
> +	my_sync_size = sb->sync_size;
> +	kunmap_atomic(sb);
> +
> +	for (i = 0; i < node_num; i++) {
> +		if (i == current_slot)
> +			continue;
> +
> +		bitmap = get_bitmap_from_slot(mddev, i);
> +		if (IS_ERR(bitmap)) {
> +			pr_err("can't get bitmap from slot %d\n", i);
> +			return -1;
> +		}
> +
> +		/*
> +		 * If we can hold the bitmap lock of one node then
> +		 * the slot is not occupied, update the sb.
> +		 */
> +		snprintf(str, 64, "bitmap%04d", i);
> +		bm_lockres = lockres_init(mddev, str, NULL, 1);
> +		if (!bm_lockres) {
> +			pr_err("md-cluster: Cannot initialize %s\n", str);
> +		}

just print error here?

> +		bm_lockres->flags |= DLM_LKF_NOQUEUE;
> +		rv = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
> +		if (!rv)
> +			bitmap_update_sb(bitmap);

always the sb even the sync_size is the same?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 12/14] md/bitmap: replace redundant codes with get_bitmap_from_slot
  2017-02-24  3:15 ` [PATCH 12/14] md/bitmap: replace redundant codes with get_bitmap_from_slot Guoqing Jiang
@ 2017-02-28 19:06   ` Shaohua Li
  0 siblings, 0 replies; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 19:06 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:22AM +0800, Guoqing Jiang wrote:
> Since get_bitmap_from_slot is introduced in previous
> commit, we can use it in bitmap_copy_from_slot to
> remove redundant code.

this should be merged into patch 11
 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/bitmap.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 67a7d399f501..b6fa55a3cff8 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -1952,14 +1952,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
>  	int rv = 0, i, j;
>  	sector_t block, lo = 0, hi = 0;
>  	struct bitmap_counts *counts;
> -	struct bitmap *bitmap = bitmap_create(mddev, slot);
> -
> -	if (IS_ERR(bitmap))
> -		return PTR_ERR(bitmap);
> +	struct bitmap *bitmap;
>  
> -	rv = bitmap_init_from_disk(bitmap, 0);
> -	if (rv)
> -		goto err;
> +	bitmap = get_bitmap_from_slot(mddev, slot);
> +	if (IS_ERR(bitmap)) {
> +		pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
> +		return -1;
> +	}
>  
>  	counts = &bitmap->counts;
>  	for (j = 0; j < counts->chunks; j++) {
> @@ -1986,8 +1985,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
>  	bitmap_unplug(mddev->bitmap);
>  	*low = lo;
>  	*high = hi;
> -err:
> -	bitmap_free(bitmap);
> +
>  	return rv;
>  }
>  EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
> -- 
> 2.6.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 14/14] md-cluster: add the support for resize
  2017-02-24  3:15 ` [PATCH 14/14] md-cluster: add the support for resize Guoqing Jiang
@ 2017-02-28 19:25   ` Shaohua Li
  2017-03-01  3:28     ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 19:25 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:24AM +0800, Guoqing Jiang wrote:
> To update size for cluster raid, we need to make
> sure all nodes can perform the change successfully.
> However, it is possible that some of them can't do
> it due to failure (bitmap_resize could fail). So
> we need to consider the issue before we set the
> capacity unconditionally, and we use below steps
> to perform sanity check.
> 
> 1. A change the size, then broadcast METADATA_UPDATED
>    msg.
> 2. B and C receive METADATA_UPDATED change the size
>    excepts call set_capacity, sync_size is not update
>    if the change failed. Also call bitmap_update_sb
>    to sync sb to disk.
> 3. A checks other node's sync_size, if sync_size has
>    been updated in all nodes, then send CHANGE_CAPACITY
>    msg otherwise send msg to revert previous change.
> 4. B and C call set_capacity if receive CHANGE_CAPACITY
>    msg, otherwise pers->resize will be called to restore
>    the old value.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  Documentation/md/md-cluster.txt |  2 +-
>  drivers/md/md-cluster.c         | 75 +++++++++++++++++++++++++++++++++++++++++
>  drivers/md/md-cluster.h         |  1 +
>  drivers/md/md.c                 | 21 +++++++++---
>  4 files changed, 93 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
> index 38883276d31c..2663d49dd8a0 100644
> --- a/Documentation/md/md-cluster.txt
> +++ b/Documentation/md/md-cluster.txt
> @@ -321,4 +321,4 @@ The algorithm is:
>  
>  There are somethings which are not supported by cluster MD yet.
>  
> -- update size and change array_sectors.
> +- change array_sectors.
> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
> index d3c024e6bfcf..75da83187c31 100644
> --- a/drivers/md/md-cluster.c
> +++ b/drivers/md/md-cluster.c
> @@ -1147,6 +1147,80 @@ int cluster_check_sync_size(struct mddev *mddev)
>  	return (my_sync_size == sync_size) ? 0 : -1;
>  }
>  
> +/*
> + * Update the size for cluster raid is a little more complex, we perform it
> + * by the steps:
> + * 1. hold token lock and update superblock in initiator node.
> + * 2. send METADATA_UPDATED msg to other nodes.
> + * 3. The initiator node continues to check each bitmap's sync_size, if all
> + *    bitmaps have the same value of sync_size, then we can set capacity and
> + *    let other nodes to perform it. If one node can't update sync_size
> + *    accordingly, we need to revert to previous value.
> + */
> +static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
> +{
> +	struct md_cluster_info *cinfo = mddev->cluster_info;
> +	struct cluster_msg cmsg;
> +	struct md_rdev *rdev;
> +	int ret = 0;
> +	int raid_slot = -1;
> +
> +	md_update_sb(mddev, 1);
> +	lock_comm(cinfo, 1);
> +
> +	memset(&cmsg, 0, sizeof(cmsg));
> +	cmsg.type = cpu_to_le32(METADATA_UPDATED);
> +	rdev_for_each(rdev, mddev)
> +		if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) {
> +			raid_slot = rdev->desc_nr;
> +			break;
> +		}
> +	if (raid_slot >= 0) {
> +		cmsg.raid_slot = cpu_to_le32(raid_slot);
> +		/*
> +		 * We can only change capiticy after all the nodes can do it,
> +		 * so need to wait after other nodes already received the msg
> +		 * and handled the change
> +		 */
> +		ret = __sendmsg(cinfo, &cmsg);
> +		if (ret) {
> +			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
> +			       __func__, __LINE__);
> +			unlock_comm(cinfo);
> +			return;
> +		}
> +	} else {
> +		pr_err("md-cluster: No good device id found to send\n");
> +		unlock_comm(cinfo);
> +		return;
> +	}
> +
> +	/*
> +	 * check the sync_size from other node's bitmap, if sync_size
> +	 * have already updated in other nodes as expected, send an
> +	 * empty metadata msg to permit the change of capacity
> +	 */
> +	if (cluster_check_sync_size(mddev) == 0) {
> +		memset(&cmsg, 0, sizeof(cmsg));
> +		cmsg.type = cpu_to_le32(CHANGE_CAPACITY);
> +		ret = __sendmsg(cinfo, &cmsg);
> +		if (ret)
> +			pr_err("%s:%d: failed to send CHANGE_CAPACITY msg\n",
> +			       __func__, __LINE__);
> +		set_capacity(mddev->gendisk, mddev->array_sectors);

don't call revalidate_disk here? And why don't move the gendisk related stuff to md.c.

> +	} else {
> +		/* revert to previous sectors */
> +		ret = mddev->pers->resize(mddev, old_dev_sectors);
> +		if (!ret)
> +			revalidate_disk(mddev->gendisk);
> +		ret = __sendmsg(cinfo, &cmsg);
> +		if (ret)
> +			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
> +			       __func__, __LINE__);
> +	}
> +	unlock_comm(cinfo);
> +}
> +
>  static int resync_start(struct mddev *mddev)
>  {
>  	struct md_cluster_info *cinfo = mddev->cluster_info;
> @@ -1392,6 +1466,7 @@ static struct md_cluster_operations cluster_ops = {
>  	.gather_bitmaps = gather_bitmaps,
>  	.lock_all_bitmaps = lock_all_bitmaps,
>  	.unlock_all_bitmaps = unlock_all_bitmaps,
> +	.update_size = update_size,
>  };
>  
>  static int __init cluster_init(void)
> diff --git a/drivers/md/md-cluster.h b/drivers/md/md-cluster.h
> index 8f26a5e80810..e939c14222ba 100644
> --- a/drivers/md/md-cluster.h
> +++ b/drivers/md/md-cluster.h
> @@ -25,6 +25,7 @@ struct md_cluster_operations {
>  	int (*gather_bitmaps)(struct md_rdev *rdev);
>  	int (*lock_all_bitmaps)(struct mddev *mddev);
>  	void (*unlock_all_bitmaps)(struct mddev *mddev);
> +	void (*update_size)(struct mddev *mddev, sector_t old_dev_sectors);
>  };
>  
>  #endif /* _MD_CLUSTER_H */
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 975c9dd60c05..2bc64059ff57 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6503,10 +6503,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
>  	struct md_rdev *rdev;
>  	int rv;
>  	int fit = (num_sectors == 0);
> -
> -	/* cluster raid doesn't support update size */
> -	if (mddev_is_clustered(mddev))
> -		return -EINVAL;
> +	sector_t old_dev_sectors = mddev->dev_sectors;
>  
>  	if (mddev->pers->resize == NULL)
>  		return -EINVAL;
> @@ -6535,7 +6532,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
>  	}
>  	rv = mddev->pers->resize(mddev, num_sectors);
>  	if (!rv) {
> -		if (mddev->queue) {
> +		if (mddev_is_clustered(mddev))
> +			md_cluster_ops->update_size(mddev, old_dev_sectors);
> +		else if (mddev->queue) {
>  			set_capacity(mddev->gendisk, mddev->array_sectors);
>  			revalidate_disk(mddev->gendisk);
>  		}
> @@ -8753,6 +8752,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>  	int role, ret;
>  	char b[BDEVNAME_SIZE];
>  
> +	/*
> +	 * If size is changed in another node then we need to
> +	 * do resize as well.
> +	 */
> +	if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
> +		ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
> +		if (ret)
> +			pr_info("md-cluster: resize failed\n");
> +		else
> +			bitmap_update_sb(mddev->bitmap);
> +	}

I'm confused, who will trigger this? The patch 10 only calls set_capacity.
Please describe the details in each node. Also please add it to md-cluster.txt
document.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/14] the latest changes for md-cluster
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (12 preceding siblings ...)
  2017-02-24  3:15 ` [PATCH 14/14] md-cluster: add the support for resize Guoqing Jiang
@ 2017-02-28 19:27 ` Shaohua Li
  2017-03-01  3:50   ` Guoqing Jiang
  2017-03-01  2:02 ` [PATCH 06/14] md-cluster: use sync way to handle METADATA_UPDATED msg Guoqing Jiang
  14 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2017-02-28 19:27 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, shli, neilb

On Fri, Feb 24, 2017 at 11:15:10AM +0800, Guoqing Jiang wrote:
> Hi,
> 
> This patchset is based on md/for-next branch, which includes below parts:
> 
> 1. some code improvements (0001-0004).
> 2. changes for use sync way to handle METADATA_UPDATED msg (0005-0009).
> 3. add resize support for md-cluster (0010-0012 and 0014), and also
>    make some related changes inside md (0013).

Applied patch 2, 3, 13. Replied to other patches.

BTW, is it possible you can create a test suite people can verify the md-cluster stuff?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 06/14] md-cluster: use sync way to handle METADATA_UPDATED msg
  2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
                   ` (13 preceding siblings ...)
  2017-02-28 19:27 ` [PATCH 00/14] the latest changes for md-cluster Shaohua Li
@ 2017-03-01  2:02 ` Guoqing Jiang
  14 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  2:02 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

Also remove RELOAD_SB related codes since there are not valid
anymore.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 28 +++++++++++++++++++++++++---
 drivers/md/md.c         |  4 ----
 drivers/md/md.h         |  3 ---
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index a6cf02c5c7bc..0aad477d1b20 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -66,7 +66,7 @@ struct resync_info {
  * set up all the related infos such as bitmap and personality */
 #define		MD_CLUSTER_ALREADY_IN_CLUSTER		6
 #define		MD_CLUSTER_PENDING_RECV_EVENT		7
-
+#define 	MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD		8
 
 struct md_cluster_info {
 	struct mddev *mddev; /* the md device which md_cluster_info belongs to */
@@ -523,11 +523,17 @@ static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
 
 static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg)
 {
+	int got_lock = 0;
 	struct md_cluster_info *cinfo = mddev->cluster_info;
 	mddev->good_device_nr = le32_to_cpu(msg->raid_slot);
-	set_bit(MD_RELOAD_SB, &mddev->flags);
+
 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
-	md_wakeup_thread(mddev->thread);
+	wait_event(mddev->thread->wqueue,
+		   (got_lock = mddev_trylock(mddev)) ||
+		    test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
+	md_reload_sb(mddev, mddev->good_device_nr);
+	if (got_lock)
+		mddev_unlock(mddev);
 }
 
 static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg)
@@ -649,8 +655,24 @@ static void recv_daemon(struct md_thread *thread)
 static int lock_token(struct md_cluster_info *cinfo, bool mddev_locked)
 {
 	int error;
+	struct mddev *mddev = cinfo->mddev;
 
+	/*
+	 * If resync thread run after raid1d thread, then process_metadata_update
+	 * could not continue if raid1d held reconfig_mutex (and raid1d is blocked
+	 * since another node already got EX on Token and waitting the EX of Ack),
+	 * so let resync wake up thread in case flag is set.
+	 */
+	if (mddev_locked) {
+		error = test_and_set_bit_lock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
+					      &cinfo->state);
+		WARN_ON_ONCE(error);
+		md_wakeup_thread(mddev->thread);
+	}
 	error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (mddev_locked)
+		clear_bit_unlock(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state);
+
 	if (error)
 		pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n",
 				__func__, __LINE__, error);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 55e7e7a8714e..6e910d99c9c1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8389,7 +8389,6 @@ void md_check_recovery(struct mddev *mddev)
 		(mddev->sb_flags & ~ (1<<MD_SB_CHANGE_PENDING)) ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
-		test_bit(MD_RELOAD_SB, &mddev->flags) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||
 		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
 		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
@@ -8438,9 +8437,6 @@ void md_check_recovery(struct mddev *mddev)
 						rdev->raid_disk < 0)
 					md_kick_rdev_from_array(rdev);
 			}
-
-			if (test_and_clear_bit(MD_RELOAD_SB, &mddev->flags))
-				md_reload_sb(mddev, mddev->good_device_nr);
 		}
 
 		if (!mddev->external) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b8859cbf84b6..ec5ffde03ccf 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -219,9 +219,6 @@ enum mddev_flags {
 				 * it then */
 	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
 	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
-	MD_RELOAD_SB,		/* Reload the superblock because another node
-				 * updated it.
-				 */
 	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
 				   * already took resync lock, need to
 				   * release the lock */
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/14] md-cluster: remove unnecessary header files
  2017-02-28 18:03   ` Shaohua Li
@ 2017-03-01  2:47     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  2:47 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 02:03 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:11AM +0800, Guoqing Jiang wrote:
>> md-cluster.h is already included in md.h, so remove
>> the redundant one and we don't want to cross include
>> header file too.
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> It would be better md.h doesn't include md-cluster.h and include md-cluster.h
> in required .c files.

Currently, there are compile dependency since md_cluster_operations and
md_cluster_info (though it need to move from md-cluster.c to md-cluster.h
in future) are used inside md.h. Anyway, I will consider about how to 
improve
this.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 05/14] md-cluster: add new parameter for lock_token
  2017-02-28 18:07   ` Shaohua Li
@ 2017-03-01  2:48     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  2:48 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 02:07 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:15AM +0800, Guoqing Jiang wrote:
>> lock_token is called from either lock_comm or
>> metadata_update_start, if we look back more for
>> the call chain, some of them are called with the
>> reconfig_mutex held while a few of them don't.
>>
>> Specifically, resync_info_update is mostly called
>> without the protection of reconfig_mutex. But
>> resync_finish calls resync_info_update within
>> the mutex.
>>
>> We make the change to lock_token since we need
>> to use synchronous way to handle METADATA_UPDATED
>> message in latter patch.
> please merge this to the patch which uses the new paramter
>

NP, though I think it is easy for review.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info
  2017-02-28 18:06   ` Shaohua Li
@ 2017-03-01  2:49     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  2:49 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 02:06 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:14AM +0800, Guoqing Jiang wrote:
>> Store "struct mddev *mddev" in md_cluster_info,
>> then we can get mddev inside lock_token in next
>> patch.
> Please merge this to the patch which requires the mddev. Don't think we need a
> separate patch.
>   

I will merge it.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start
  2017-02-28 18:24   ` Shaohua Li
@ 2017-03-01  2:58     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  2:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 02:24 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:19AM +0800, Guoqing Jiang wrote:
>> Since we have switched to sync way to handle METADATA_UPDATED
>> msg, then process_metadata_update can only continue with either
>> get the reconfig_mutex or MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
>> set.
> same here. Either the patch should be put into the 'METADATA_UPDATED' patch or
> this patch should be before the 'METADATA_UPDATED' patch.
>
> We shouldn't introduce a broken patch then fix it in latter patch.

Yes, for both 8 and 9, it is possible to merge with 6 (I resend it though
it was not posted successfully for some reason). I seperated them since
it would be more easy for review and don't want mess up  header of the
6th patch.

Thanks,
Guoqing


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 11/14] md-cluster: introduce cluster_check_sync_size
  2017-02-28 19:05   ` Shaohua Li
@ 2017-03-01  3:11     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  3:11 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 03:05 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:21AM +0800, Guoqing Jiang wrote:
>> Support resize is a little complex for clustered
>> raid, since we need to ensure all the nodes share
>> the same knowledge about the size of raid.
>>
>> We achieve the goal by check the sync_size which
>> is in each node's bitmap, we can only change the
>> capacity after cluster_check_sync_size returns 0.
>>
>> Also, get_bitmap_from_slot is added to get a slot's
>> bitmap. And we exported some funcs since they are
>> used in cluster_check_sync_size().
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>   drivers/md/bitmap.c     | 25 ++++++++++++++++++++-
>>   drivers/md/bitmap.h     |  2 ++
>>   drivers/md/md-cluster.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
>> index 9fb2ccac958a..67a7d399f501 100644
>> --- a/drivers/md/bitmap.c
>> +++ b/drivers/md/bitmap.c
>> @@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
>>   	kunmap_atomic(sb);
>>   	write_page(bitmap, bitmap->storage.sb_page, 1);
>>   }
>> +EXPORT_SYMBOL(bitmap_update_sb);
>>   
>>   /* print out the bitmap file superblock */
>>   void bitmap_print_sb(struct bitmap *bitmap)
>> @@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
>>   /*
>>    * free memory that was allocated
>>    */
>> -static void bitmap_free(struct bitmap *bitmap)
>> +void bitmap_free(struct bitmap *bitmap)
>>   {
>>   	unsigned long k, pages;
>>   	struct bitmap_page *bp;
>> @@ -1761,6 +1762,7 @@ static void bitmap_free(struct bitmap *bitmap)
>>   	kfree(bp);
>>   	kfree(bitmap);
>>   }
>> +EXPORT_SYMBOL(bitmap_free);
>>   
>>   void bitmap_destroy(struct mddev *mddev)
>>   {
>> @@ -1920,6 +1922,27 @@ int bitmap_load(struct mddev *mddev)
>>   }
>>   EXPORT_SYMBOL_GPL(bitmap_load);
>>   
>> +struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
>> +{
>> +	int rv = 0;
>> +	struct bitmap *bitmap;
>> +
>> +	bitmap = bitmap_create(mddev, slot);
>> +	if (IS_ERR(bitmap)) {
>> +		rv = PTR_ERR(bitmap);
>> +		return ERR_PTR(rv);
>> +	}
>> +
>> +	rv = bitmap_init_from_disk(bitmap, 0);
>> +	if (rv) {
>> +		bitmap_free(bitmap);
>> +		return ERR_PTR(rv);
>> +	}
>> +
>> +	return bitmap;
>> +}
>> +EXPORT_SYMBOL(get_bitmap_from_slot);
>> +
>>   /* Loads the bitmap associated with slot and copies the resync information
>>    * to our bitmap
>>    */
>> diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
>> index 5b6dd63dda91..9f761097aab2 100644
>> --- a/drivers/md/bitmap.h
>> +++ b/drivers/md/bitmap.h
>> @@ -267,8 +267,10 @@ void bitmap_daemon_work(struct mddev *mddev);
>>   
>>   int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
>>   		  int chunksize, int init);
>> +struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
>>   int bitmap_copy_from_slot(struct mddev *mddev, int slot,
>>   				sector_t *lo, sector_t *hi, bool clear_bits);
>> +void bitmap_free(struct bitmap *bitmap);
>>   #endif
>>   
>>   #endif
>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>> index f1b9a7a3ddd2..d3c024e6bfcf 100644
>> --- a/drivers/md/md-cluster.c
>> +++ b/drivers/md/md-cluster.c
>> @@ -1089,6 +1089,64 @@ static void metadata_update_cancel(struct mddev *mddev)
>>   	unlock_comm(cinfo);
>>   }
>>   
>> +/*
>> + * retun 0 if all the bitmaps have the same sync_size
>> + */
>> +int cluster_check_sync_size(struct mddev *mddev)
>> +{
>> +	int i, rv;
>> +	bitmap_super_t *sb;
>> +	unsigned long my_sync_size, sync_size = 0;
>> +	int node_num = mddev->bitmap_info.nodes;
>> +	int current_slot = md_cluster_ops->slot_number(mddev);
>> +	struct bitmap *bitmap = mddev->bitmap;
>> +	char str[64];
>> +	struct dlm_lock_resource *bm_lockres;
>> +
>> +	sb = kmap_atomic(bitmap->storage.sb_page);
>> +	my_sync_size = sb->sync_size;
>> +	kunmap_atomic(sb);
>> +
>> +	for (i = 0; i < node_num; i++) {
>> +		if (i == current_slot)
>> +			continue;
>> +
>> +		bitmap = get_bitmap_from_slot(mddev, i);
>> +		if (IS_ERR(bitmap)) {
>> +			pr_err("can't get bitmap from slot %d\n", i);
>> +			return -1;
>> +		}
>> +
>> +		/*
>> +		 * If we can hold the bitmap lock of one node then
>> +		 * the slot is not occupied, update the sb.
>> +		 */
>> +		snprintf(str, 64, "bitmap%04d", i);
>> +		bm_lockres = lockres_init(mddev, str, NULL, 1);
>> +		if (!bm_lockres) {
>> +			pr_err("md-cluster: Cannot initialize %s\n", str);
>> +		}
> just print error here?

It should return -ENOMEM, thanks.

>
>> +		bm_lockres->flags |= DLM_LKF_NOQUEUE;
>> +		rv = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
>> +		if (!rv)
>> +			bitmap_update_sb(bitmap);
> always the sb even the sync_size is the same?

I suppose you mean "always *update* the sb even the sync_size is the same".

If a node can hold PW on one bitmap lock, then we believe that the bitmap is
not used (we create 4 bitmaps by default, though it is possible that 
only one
node is on-line), so the bitmap without related on-line node just keep 
the original
data, which means sync_size can't be same as the master node (which issues
update_size cmd).

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 14/14] md-cluster: add the support for resize
  2017-02-28 19:25   ` Shaohua Li
@ 2017-03-01  3:28     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  3:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 03:25 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:24AM +0800, Guoqing Jiang wrote:
>> To update size for cluster raid, we need to make
>> sure all nodes can perform the change successfully.
>> However, it is possible that some of them can't do
>> it due to failure (bitmap_resize could fail). So
>> we need to consider the issue before we set the
>> capacity unconditionally, and we use below steps
>> to perform sanity check.
>>
>> 1. A change the size, then broadcast METADATA_UPDATED
>>     msg.
>> 2. B and C receive METADATA_UPDATED change the size
>>     excepts call set_capacity, sync_size is not update
>>     if the change failed. Also call bitmap_update_sb
>>     to sync sb to disk.
>> 3. A checks other node's sync_size, if sync_size has
>>     been updated in all nodes, then send CHANGE_CAPACITY
>>     msg otherwise send msg to revert previous change.
>> 4. B and C call set_capacity if receive CHANGE_CAPACITY
>>     msg, otherwise pers->resize will be called to restore
>>     the old value.
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>   Documentation/md/md-cluster.txt |  2 +-
>>   drivers/md/md-cluster.c         | 75 +++++++++++++++++++++++++++++++++++++++++
>>   drivers/md/md-cluster.h         |  1 +
>>   drivers/md/md.c                 | 21 +++++++++---
>>   4 files changed, 93 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/md/md-cluster.txt b/Documentation/md/md-cluster.txt
>> index 38883276d31c..2663d49dd8a0 100644
>> --- a/Documentation/md/md-cluster.txt
>> +++ b/Documentation/md/md-cluster.txt
>> @@ -321,4 +321,4 @@ The algorithm is:
>>   
>>   There are somethings which are not supported by cluster MD yet.
>>   
>> -- update size and change array_sectors.
>> +- change array_sectors.
>> diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
>> index d3c024e6bfcf..75da83187c31 100644
>> --- a/drivers/md/md-cluster.c
>> +++ b/drivers/md/md-cluster.c
>> @@ -1147,6 +1147,80 @@ int cluster_check_sync_size(struct mddev *mddev)
>>   	return (my_sync_size == sync_size) ? 0 : -1;
>>   }
>>   
>> +/*
>> + * Update the size for cluster raid is a little more complex, we perform it
>> + * by the steps:
>> + * 1. hold token lock and update superblock in initiator node.
>> + * 2. send METADATA_UPDATED msg to other nodes.
>> + * 3. The initiator node continues to check each bitmap's sync_size, if all
>> + *    bitmaps have the same value of sync_size, then we can set capacity and
>> + *    let other nodes to perform it. If one node can't update sync_size
>> + *    accordingly, we need to revert to previous value.
>> + */
>> +static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
>> +{
>> +	struct md_cluster_info *cinfo = mddev->cluster_info;
>> +	struct cluster_msg cmsg;
>> +	struct md_rdev *rdev;
>> +	int ret = 0;
>> +	int raid_slot = -1;
>> +
>> +	md_update_sb(mddev, 1);
>> +	lock_comm(cinfo, 1);
>> +
>> +	memset(&cmsg, 0, sizeof(cmsg));
>> +	cmsg.type = cpu_to_le32(METADATA_UPDATED);
>> +	rdev_for_each(rdev, mddev)
>> +		if (rdev->raid_disk >= 0 && !test_bit(Faulty, &rdev->flags)) {
>> +			raid_slot = rdev->desc_nr;
>> +			break;
>> +		}
>> +	if (raid_slot >= 0) {
>> +		cmsg.raid_slot = cpu_to_le32(raid_slot);
>> +		/*
>> +		 * We can only change capiticy after all the nodes can do it,
>> +		 * so need to wait after other nodes already received the msg
>> +		 * and handled the change
>> +		 */
>> +		ret = __sendmsg(cinfo, &cmsg);
>> +		if (ret) {
>> +			pr_err("%s:%d: failed to send METADATA_UPDATED msg\n",
>> +			       __func__, __LINE__);
>> +			unlock_comm(cinfo);
>> +			return;
>> +		}
>> +	} else {
>> +		pr_err("md-cluster: No good device id found to send\n");
>> +		unlock_comm(cinfo);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * check the sync_size from other node's bitmap, if sync_size
>> +	 * have already updated in other nodes as expected, send an
>> +	 * empty metadata msg to permit the change of capacity
>> +	 */
>> +	if (cluster_check_sync_size(mddev) == 0) {
>> +		memset(&cmsg, 0, sizeof(cmsg));
>> +		cmsg.type = cpu_to_le32(CHANGE_CAPACITY);
>> +		ret = __sendmsg(cinfo, &cmsg);
>> +		if (ret)
>> +			pr_err("%s:%d: failed to send CHANGE_CAPACITY msg\n",
>> +			       __func__, __LINE__);
>> +		set_capacity(mddev->gendisk, mddev->array_sectors);
> don't call revalidate_disk here? And why don't move the gendisk related stuff to md.c.

Thanks, I will add revalidate_disk after set_capacity. But we can't move 
it to md.c
since md-cluster runs a different way for resize.

>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -6503,10 +6503,7 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
>>   	struct md_rdev *rdev;
>>   	int rv;
>>   	int fit = (num_sectors == 0);
>> -
>> -	/* cluster raid doesn't support update size */
>> -	if (mddev_is_clustered(mddev))
>> -		return -EINVAL;
>> +	sector_t old_dev_sectors = mddev->dev_sectors;
>>   
>>   	if (mddev->pers->resize == NULL)
>>   		return -EINVAL;
>> @@ -6535,7 +6532,9 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
>>   	}
>>   	rv = mddev->pers->resize(mddev, num_sectors);
>>   	if (!rv) {
>> -		if (mddev->queue) {
>> +		if (mddev_is_clustered(mddev))
>> +			md_cluster_ops->update_size(mddev, old_dev_sectors);
>> +		else if (mddev->queue) {
>>   			set_capacity(mddev->gendisk, mddev->array_sectors);
>>   			revalidate_disk(mddev->gendisk);
>>   		}

You can see the path is different between common md and md-cluster, 
because we have
to check if all the bitmaps have the same sync_size or not, then set 
capacity and revalidate
disk.

>> @@ -8753,6 +8752,18 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
>>   	int role, ret;
>>   	char b[BDEVNAME_SIZE];
>>   
>> +	/*
>> +	 * If size is changed in another node then we need to
>> +	 * do resize as well.
>> +	 */
>> +	if (mddev->dev_sectors != le64_to_cpu(sb->size)) {
>> +		ret = mddev->pers->resize(mddev, le64_to_cpu(sb->size));
>> +		if (ret)
>> +			pr_info("md-cluster: resize failed\n");
>> +		else
>> +			bitmap_update_sb(mddev->bitmap);
>> +	}
> I'm confused, who will trigger this? The patch 10 only calls set_capacity.

process_metadata_update -> md_reload_sb -> check_sb_changes, so which
means if a node received METADATA_UPDATED msg, it will call the path.

And in this patch you may see that update_size sends the msg.

+static void update_size(struct mddev *mddev, sector_t old_dev_sectors)
+{
+	struct md_cluster_info *cinfo = mddev->cluster_info;
+	struct cluster_msg cmsg;
+	struct md_rdev *rdev;
+	int ret = 0;
+	int raid_slot = -1;
+
+	md_update_sb(mddev, 1);
+	lock_comm(cinfo, 1);
+
+	memset(&cmsg, 0, sizeof(cmsg));
+	cmsg.type = cpu_to_le32(METADATA_UPDATED);

> Please describe the details in each node. Also please add it to md-cluster.txt
> document.

Does it help?

diff --git a/Documentation/md/md-cluster.txt 
b/Documentation/md/md-cluster.txt
index 2663d49dd8a0..709439e687c2 100644
--- a/Documentation/md/md-cluster.txt
+++ b/Documentation/md/md-cluster.txt
@@ -71,8 +71,8 @@ There are three groups of locks for managing the device:

   3.1.1 METADATA_UPDATED: informs other nodes that the metadata has
     been updated, and the node must re-read the md superblock. This is
-   performed synchronously. It is primarily used to signal device
-   failure.
+   performed synchronously. We send this message if sb is updated or
+   the size is changed.

Thanks,
Guoqing

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 07/14] md: move bitmap_destroy before __md_stop
  2017-02-28 18:20   ` Shaohua Li
@ 2017-03-01  3:46     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  3:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb



On 03/01/2017 02:20 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:17AM +0800, Guoqing Jiang wrote:
>> Since we have switched to sync way to handle METADATA_UPDATED
>> msg for md-cluster, then process_metadata_update is depended
>> on mddev->thread->wqueue.
>>
>> With the new change, clustered raid could possible hang if
>> array received a METADATA_UPDATED msg after array unregistered
>> mddev->thread, so we need to stop clustered raid earlier
>> than before.
>>
>> And this change should be safe for non-clustered raid since
>> all writes are stopped before the destroy. Also in md_run,
>> we activate the personality (pers->run()) before activating
>> the bitmap (bitmap_create()). So it is pleasingly symmetric
>> to stop the bitmap (bitmap_destroy()) before stopping the
>> personality (__md_stop() calls pers->free()).
> There is no patch 6. So I can't judge the purpose of the patch.

Sorry, it is resent now.

> The patch changed behaviors. We only destroy bitmap with mode == 0, now we do it
> even for mode == 2. Please specify why it's safe.

Thanks for reminder, I will add the check for mode.

> The __md_stop will wait for
> behind writes for example only with bitmap set, but the patch makes us not do
> the wait.

I don't know behind write very well,  but we have call __md_stop_writes 
before
destroy bitmap, also bitmap is flushed. Shouldn't it mean the 
behind_writes is
cleared? If not, when the behind_write bit can be cleared? Thanks.

Guoqing


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 00/14] the latest changes for md-cluster
  2017-02-28 19:27 ` [PATCH 00/14] the latest changes for md-cluster Shaohua Li
@ 2017-03-01  3:50   ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2017-03-01  3:50 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, zhilong



On 03/01/2017 03:27 AM, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 11:15:10AM +0800, Guoqing Jiang wrote:
>> Hi,
>>
>> This patchset is based on md/for-next branch, which includes below parts:
>>
>> 1. some code improvements (0001-0004).
>> 2. changes for use sync way to handle METADATA_UPDATED msg (0005-0009).
>> 3. add resize support for md-cluster (0010-0012 and 0014), and also
>>     make some related changes inside md (0013).
> Applied patch 2, 3, 13. Replied to other patches.

Thanks for review.

> BTW, is it possible you can create a test suite people can verify the md-cluster stuff?

Actually, Zhilong is currently working on this, we will push the test
suite into mdadm/test and also improve current test too (you may
see his effort in the list).

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-03-01  3:50 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-24  3:15 [PATCH 00/14] the latest changes for md-cluster Guoqing Jiang
2017-02-24  3:15 ` [PATCH 01/14] md-cluster: remove unnecessary header files Guoqing Jiang
2017-02-28 18:03   ` Shaohua Li
2017-03-01  2:47     ` Guoqing Jiang
2017-02-24  3:15 ` [PATCH 02/14] md-cluster: free md_cluster_info if node leave cluster Guoqing Jiang
2017-02-24  3:15 ` [PATCH 03/14] md-cluster: remove useless memset from gather_all_resync_info Guoqing Jiang
2017-02-24  3:15 ` [PATCH 04/14] md-cluster: add mddev into struct md_cluster_info Guoqing Jiang
2017-02-28 18:06   ` Shaohua Li
2017-03-01  2:49     ` Guoqing Jiang
2017-02-24  3:15 ` [PATCH 05/14] md-cluster: add new parameter for lock_token Guoqing Jiang
2017-02-28 18:07   ` Shaohua Li
2017-03-01  2:48     ` Guoqing Jiang
2017-02-24  3:15 ` [PATCH 07/14] md: move bitmap_destroy before __md_stop Guoqing Jiang
2017-02-28 18:20   ` Shaohua Li
2017-03-01  3:46     ` Guoqing Jiang
2017-02-24  3:15 ` [PATCH 08/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD before unregister thread Guoqing Jiang
2017-02-28 18:22   ` Shaohua Li
2017-02-24  3:15 ` [PATCH 09/14] md-cluster: set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in metadata_update_start Guoqing Jiang
2017-02-28 18:24   ` Shaohua Li
2017-03-01  2:58     ` Guoqing Jiang
2017-02-24  3:15 ` [PATCH 10/14] md-cluster: add CHANGE_CAPACITY message type Guoqing Jiang
2017-02-24  3:15 ` [PATCH 11/14] md-cluster: introduce cluster_check_sync_size Guoqing Jiang
2017-02-28 19:05   ` Shaohua Li
2017-03-01  3:11     ` Guoqing Jiang
2017-02-24  3:15 ` [PATCH 12/14] md/bitmap: replace redundant codes with get_bitmap_from_slot Guoqing Jiang
2017-02-28 19:06   ` Shaohua Li
2017-02-24  3:15 ` [PATCH 13/14] md: move funcs from pers->resize to update_size Guoqing Jiang
2017-02-24  3:15 ` [PATCH 14/14] md-cluster: add the support for resize Guoqing Jiang
2017-02-28 19:25   ` Shaohua Li
2017-03-01  3:28     ` Guoqing Jiang
2017-02-28 19:27 ` [PATCH 00/14] the latest changes for md-cluster Shaohua Li
2017-03-01  3:50   ` Guoqing Jiang
2017-03-01  2:02 ` [PATCH 06/14] md-cluster: use sync way to handle METADATA_UPDATED msg Guoqing Jiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.