* [PATCH 00/13] The latest patches for md-cluster
@ 2016-04-21  5:58 Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous Guoqing Jiang
                   ` (12 more replies)
  0 siblings, 13 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

Over the past months we found some issues with cluster raid1 and
some code that needed to be improved. All the patches are based on
the for-next branch of the md tree.

The patchset is also available on GitHub:

https://github.com/GuoqingJiang/linux/tree/md-for-next

Thanks,
Guoqing

Guoqing Jiang (13):
  md-cluster: change resync lock from asynchronous to synchronous
  md-cluster: make resync_finish only called after pers->sync_request
  md-cluster: wake up thread to continue recovery
  md-cluster: unregister thread if err happened
  md-cluster: fix locking when node joins cluster during message
    broadcast
  md-cluster: change array_sectors and update size are not supported
  md-cluster: wakeup thread if activated a spare disk
  md: set MD_CHANGE_PENDING in a spinlocked region
  md-cluster: always setup in-memory bitmap
  md-cluster: sync bitmap when node received RESYNCING msg
  md-cluster/bitmap: fix wrong calculation of offset
  md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and
    bitmap_file_set_bit
  md-cluster/bitmap: unplug bitmap to sync dirty pages to disk

 Documentation/md-cluster.txt |   6 +++
 drivers/md/bitmap.c          | 108 ++++++++++++++++++++++++++++++++++---------
 drivers/md/bitmap.h          |   3 ++
 drivers/md/md-cluster.c      |  53 +++++++++++++++++----
 drivers/md/md.c              |  80 ++++++++++++++++++++++----------
 drivers/md/raid1.c           |   2 +
 drivers/md/raid10.c          |   6 ++-
 drivers/md/raid5-cache.c     |   2 +
 drivers/md/raid5.c           |   2 +
 9 files changed, 207 insertions(+), 55 deletions(-)

-- 
2.6.6


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  6:20   ` kbuild test robot
  2016-04-21  6:55   ` [PATCH V2 " Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 02/13] md-cluster: make resync_finish only called after pers->sync_request Guoqing Jiang
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

If multiple nodes choose to attempt resync at the same time
they need to be serialized so they don't duplicate effort. This
serialization is done by locking the 'resync' DLM lock.

Currently if a node cannot get the lock immediately it doesn't
request notification when the lock becomes available (i.e.
DLM_LKF_NOQUEUE is set), so it may not reliably find out when it
is safe to try again.

Rather than trying to arrange an async wake-up when the lock
becomes available, switch to using synchronous locking - this is
a lot easier to think about.  As it is not permitted to block in
the 'raid1d' thread, move the locking to the resync thread.  So
the resync thread is forked immediately, but it blocks until the
resync lock is available. Once the lock is acquired it checks again
whether any resync action is needed.

A particular symptom of the current problem is that a node can
get stuck with "resync=pending" indefinitely.
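
For reference, the synchronous wrapper is just an asynchronous
dlm_lock() call followed by a wait for the completion AST; without
DLM_LKF_NOQUEUE the request simply queues until the lock can be
granted.  A simplified sketch of the pattern (following the existing
dlm_lock_sync()/sync_ast() in drivers/md/md-cluster.c of this era,
details trimmed):

    static void sync_ast(void *arg)
    {
            struct dlm_lock_resource *res = arg;

            complete(&res->completion);
    }

    static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
    {
            int ret;

            init_completion(&res->completion);
            ret = dlm_lock(res->ls, mode, &res->lksb, res->flags,
                           res->name, strlen(res->name), 0,
                           sync_ast, res, res->bast);
            if (ret)
                    return ret;
            /* blocks until the DLM grants the lock */
            wait_for_completion(&res->completion);
            return res->lksb.sb_status;
    }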

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c |  2 --
 drivers/md/md.c         | 22 ++++++++++++++--------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index dd97d42..12fbfec 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -937,7 +937,6 @@ static void metadata_update_cancel(struct mddev *mddev)
 static int resync_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
-	cinfo->resync_lockres->flags |= DLM_LKF_NOQUEUE;
 	return dlm_lock_sync(cinfo->resync_lockres, DLM_LOCK_EX);
 }
 
@@ -967,7 +966,6 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
 static int resync_finish(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
-	cinfo->resync_lockres->flags &= ~DLM_LKF_NOQUEUE;
 	dlm_unlock_sync(cinfo->resync_lockres);
 	return resync_info_update(mddev, 0, 0);
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c068f17..113e71c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7781,6 +7781,7 @@ void md_do_sync(struct md_thread *thread)
 	char *desc, *action = NULL;
 	struct blk_plug plug;
 	bool cluster_resync_finished = false;
+	int ret;
 
 	/* just incase thread restarts... */
 	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
@@ -7790,6 +7791,19 @@ void md_do_sync(struct md_thread *thread)
 		return;
 	}
 
+	if (mddev_is_clustered(mddev)) {
+		ret = md_cluster_ops->resync_start(mddev);
+		if (ret)
+			goto skip;
+
+		if (!(test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
+			test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ||
+			test_bit(MD_RECOVERY_RECOVER, &mddev->recovery))
+		     && ((unsigned long long)mddev->curr_resync_completed
+			 < (unsigned long long)mddev->resync_max_sectors))
+			goto skip;
+	}
+
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
 		if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
 			desc = "data-check";
@@ -8221,14 +8235,6 @@ static void md_start_sync(struct work_struct *ws)
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 	int ret = 0;
 
-	if (mddev_is_clustered(mddev)) {
-		ret = md_cluster_ops->resync_start(mddev);
-		if (ret) {
-			mddev->sync_thread = NULL;
-			goto out;
-		}
-	}
-
 	mddev->sync_thread = md_register_thread(md_do_sync,
 						mddev,
 						"resync");
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 02/13] md-cluster: make resync_finish only called after pers->sync_request
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 03/13] md-cluster: wake up thread to continue recovery Guoqing Jiang
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

It is not reasonable for cluster raid to release the resync
lock before the last pers->sync_request has finished.

As the metadata will be changed when a node performs resync,
we need to inform other nodes to update their metadata, so the
MD_CHANGE_PENDING flag is set before resync finishes.

Then metadata_update_finish is moved ahead to ensure that the
METADATA_UPDATED msg is sent before resync finishes, and
metadata_update_start needs to be run after the "repeat:" label
accordingly.
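
The resulting order at the end of a cluster resync is then
(condensed sketch of the hunks below):

    mddev->pers->sync_request(mddev, max_sectors, &skipped);
    ...
    set_bit(MD_CHANGE_PENDING, &mddev->flags);   /* request sb update */
    md_wakeup_thread(mddev->thread);             /* writes sb and sends
                                                  * METADATA_UPDATED */
    wait_event(mddev->sb_wait,
               !test_bit(MD_CHANGE_PENDING, &mddev->flags));
    md_cluster_ops->resync_finish(mddev);        /* drop the resync lock
                                                  * only after that */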

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 113e71c..0639c01 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2286,6 +2286,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
 		return;
 	}
 
+repeat:
 	if (mddev_is_clustered(mddev)) {
 		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
 			force_change = 1;
@@ -2298,7 +2299,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
 			return;
 		}
 	}
-repeat:
+
 	/* First make sure individual recovery_offsets are correct */
 	rdev_for_each(rdev, mddev) {
 		if (rdev->raid_disk >= 0 &&
@@ -2425,6 +2426,9 @@ repeat:
 	md_super_wait(mddev);
 	/* if there was a failure, MD_CHANGE_DEVS was set, and we re-write super */
 
+	if (mddev_is_clustered(mddev) && ret == 0)
+		md_cluster_ops->metadata_update_finish(mddev);
+
 	spin_lock(&mddev->lock);
 	if (mddev->in_sync != sync_req ||
 	    test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
@@ -2447,9 +2451,6 @@ repeat:
 		clear_bit(BlockedBadBlocks, &rdev->flags);
 		wake_up(&rdev->blocked_wait);
 	}
-
-	if (mddev_is_clustered(mddev) && ret == 0)
-		md_cluster_ops->metadata_update_finish(mddev);
 }
 EXPORT_SYMBOL(md_update_sb);
 
@@ -7780,7 +7781,6 @@ void md_do_sync(struct md_thread *thread)
 	struct md_rdev *rdev;
 	char *desc, *action = NULL;
 	struct blk_plug plug;
-	bool cluster_resync_finished = false;
 	int ret;
 
 	/* just incase thread restarts... */
@@ -8098,11 +8098,6 @@ void md_do_sync(struct md_thread *thread)
 		mddev->curr_resync_completed = mddev->curr_resync;
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
-	/* tell personality and other nodes that we are finished */
-	if (mddev_is_clustered(mddev)) {
-		md_cluster_ops->resync_finish(mddev);
-		cluster_resync_finished = true;
-	}
 	mddev->pers->sync_request(mddev, max_sectors, &skipped);
 
 	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
@@ -8142,9 +8137,15 @@ void md_do_sync(struct md_thread *thread)
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
 	if (mddev_is_clustered(mddev) &&
-	    test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
-	    !cluster_resync_finished)
+	    ret == 0) {
+		/* set CHANGE_PENDING here since maybe another
+		 * update is needed, so other nodes are informed */
+		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 		md_cluster_ops->resync_finish(mddev);
+	}
 
 	spin_lock(&mddev->lock);
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 03/13] md-cluster: wake up thread to continue recovery
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 02/13] md-cluster: make resync_finish only called after pers->sync_request Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 04/13] md-cluster: unregister thread if err happened Guoqing Jiang
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

In the recovery case, we need to set MD_RECOVERY_NEEDED
and wake up the thread only if recovery is not finished.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 12fbfec..0d4ddf8 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -284,11 +284,14 @@ static void recover_bitmaps(struct md_thread *thread)
 			goto dlm_unlock;
 		}
 		if (hi > 0) {
-			/* TODO:Wait for current resync to get over */
-			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 			if (lo < mddev->recovery_cp)
 				mddev->recovery_cp = lo;
-			md_check_recovery(mddev);
+			/* wake up thread to continue resync in case resync
+			 * is not finished */
+			if (mddev->recovery_cp != MaxSector) {
+			    set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+			    md_wakeup_thread(mddev->thread);
+			}
 		}
 dlm_unlock:
 		dlm_unlock_sync(bm_lockres);
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 04/13] md-cluster: unregister thread if err happened
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (2 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 03/13] md-cluster: wake up thread to continue recovery Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 05/13] md-cluster: fix locking when node joins cluster during message broadcast Guoqing Jiang
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

The two threads need to be unregistered if a node
can't join the cluster successfully.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 0d4ddf8..76f88f7 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -818,6 +818,8 @@ static int join(struct mddev *mddev, int nodes)
 
 	return 0;
 err:
+	md_unregister_thread(&cinfo->recovery_thread);
+	md_unregister_thread(&cinfo->recv_thread);
 	lockres_free(cinfo->message_lockres);
 	lockres_free(cinfo->token_lockres);
 	lockres_free(cinfo->ack_lockres);
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 05/13] md-cluster: fix locking when node joins cluster during message broadcast
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (3 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 04/13] md-cluster: unregister thread if err happened Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 06/13] md-cluster: change array_sectors and update size are not supported Guoqing Jiang
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

If a node joins the cluster while a message broadcast
is under way, a locking issue can arise as follows.

Consider a cluster of two nodes: node A is calling
__sendmsg before up-converting CR to EX on ack, and
node B has released CR on ack. If a new node C joins
the cluster at this point, it hasn't received the
message A sent earlier, so it could take CR on ack
before A up-converts CR to EX on ack.

So a node joining the cluster should first get an EX
lock on "token" to ensure no broadcast is ongoing, and
release it only after it holds CR on ack.
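
The resulting ordering can be sketched as follows (illustrative
timeline, not actual code; the sender holds "token" EX for the
whole broadcast):

    node A (broadcasting)            node C (joining)
    ---------------------            ----------------
    holds EX on "token"              dlm_lock_sync(token, EX) blocks
    up-converts CR to EX on "ack"
    ... broadcast completes ...
    releases "token"                 "token" granted
                                     dlm_lock_sync(ack, CR)
                                     dlm_unlock_sync(token)

so C can no longer slip a CR lock on "ack" into the middle of A's
broadcast.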

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md-cluster.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 76f88f7..30f1160 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -781,17 +781,24 @@ static int join(struct mddev *mddev, int nodes)
 	cinfo->token_lockres = lockres_init(mddev, "token", NULL, 0);
 	if (!cinfo->token_lockres)
 		goto err;
-	cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
-	if (!cinfo->ack_lockres)
-		goto err;
 	cinfo->no_new_dev_lockres = lockres_init(mddev, "no-new-dev", NULL, 0);
 	if (!cinfo->no_new_dev_lockres)
 		goto err;
 
+	ret = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX);
+	if (ret) {
+		ret = -EAGAIN;
+		pr_err("md-cluster: can't join cluster to avoid lock issue\n");
+		goto err;
+	}
+	cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0);
+	if (!cinfo->ack_lockres)
+		goto err;
 	/* get sync CR lock on ACK. */
 	if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR))
 		pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n",
 				ret);
+	dlm_unlock_sync(cinfo->token_lockres);
 	/* get sync CR lock on no-new-dev. */
 	if (dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR))
 		pr_err("md-cluster: failed to get a sync CR lock on no-new-dev!(%d)\n", ret);
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 06/13] md-cluster: change array_sectors and update size are not supported
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (4 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 05/13] md-cluster: fix locking when node joins cluster during message broadcast Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 07/13] md-cluster: wakeup thread if activated a spare disk Guoqing Jiang
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

Currently, some features are not supported yet, such as
changing array_sectors and updating the size, so return
-EINVAL for them and list them in the documentation.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 Documentation/md-cluster.txt | 6 ++++++
 drivers/md/md.c              | 8 ++++++++
 2 files changed, 14 insertions(+)

diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt
index c100c71..3888327 100644
--- a/Documentation/md-cluster.txt
+++ b/Documentation/md-cluster.txt
@@ -316,3 +316,9 @@ The algorithm is:
  nodes are using the raid which is achieved by lock all bitmap
  locks within the cluster, and also those locks are unlocked
  accordingly.
+
+7. Unsupported features
+
+There are some things which are not supported by cluster MD yet.
+
+- update size and change array_sectors.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0639c01..1b26397 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4812,6 +4812,10 @@ array_size_store(struct mddev *mddev, const char *buf, size_t len)
 	if (err)
 		return err;
 
+	/* cluster raid doesn't support change array_sectors */
+	if (mddev_is_clustered(mddev))
+		return -EINVAL;
+
 	if (strncmp(buf, "default", 7) == 0) {
 		if (mddev->pers)
 			sectors = mddev->pers->size(mddev, 0, 0);
@@ -6433,6 +6437,10 @@ static int update_size(struct mddev *mddev, sector_t num_sectors)
 	int rv;
 	int fit = (num_sectors == 0);
 
+	/* cluster raid doesn't support update size */
+	if (mddev_is_clustered(mddev))
+		return -EINVAL;
+
 	if (mddev->pers->resize == NULL)
 		return -EINVAL;
 	/* The "num_sectors" is the number of sectors of each device that
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 07/13] md-cluster: wakeup thread if activated a spare disk
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (5 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 06/13] md-cluster: change array_sectors and update size are not supported Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region Guoqing Jiang
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

When a device is re-added, it will ultimately need
to be activated and that happens in md_check_recovery,
so we need to set MD_RECOVERY_NEEDED right after
remove_and_add_spares.

A specific issue without this change is that when
one node performs fail/remove/re-add on a disk, the
slave nodes cannot add the disk back to the array as
expected (it is added as missing instead of in sync).
So give the slave nodes a chance to do resync.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1b26397..dbc7c83 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8690,6 +8690,11 @@ static void check_sb_changes(struct mddev *mddev, struct md_rdev *rdev)
 				ret = remove_and_add_spares(mddev, rdev2);
 				pr_info("Activated spare: %s\n",
 						bdevname(rdev2->bdev,b));
+				/* wakeup mddev->thread here, so array could
+				 * perform resync with the new activated disk */
+				set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+				md_wakeup_thread(mddev->thread);
+
 			}
 			/* device faulty
 			 * We just want to do the minimum to mark the disk
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (6 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 07/13] md-cluster: wakeup thread if activated a spare disk Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  6:26   ` kbuild test robot
  2016-04-21  6:58   ` [PATCH V2 " Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Guoqing Jiang
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are protected by
mddev->lock.
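
Each such call site then follows the same shape (sketch, condensed
from the hunks below):

    spin_lock(&mddev->lock);
    set_bit(MD_CHANGE_DEVS, &mddev->flags);        /* step 1 */
    set_bit(MD_CHANGE_PENDING, &mddev->flags);     /* step 2 */
    spin_unlock(&mddev->lock);
    md_wakeup_thread(mddev->thread);
    wait_event(mddev->sb_wait,
               !test_bit(MD_CHANGE_PENDING, &mddev->flags));  /* step 3 */

so the repeat check in md_update_sb(), which runs under the same
lock, can no longer test the flags and clear MD_CHANGE_PENDING in
between steps 1 and 2.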

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/md.c          | 22 +++++++++++++++++-----
 drivers/md/raid1.c       |  2 ++
 drivers/md/raid10.c      |  6 +++++-
 drivers/md/raid5-cache.c |  2 ++
 drivers/md/raid5.c       |  2 ++
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index dbc7c83..80f0e8a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2290,12 +2290,18 @@ repeat:
 	if (mddev_is_clustered(mddev)) {
 		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
 			force_change = 1;
+		if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
+			nospares = 1;
 		ret = md_cluster_ops->metadata_update_start(mddev);
 		/* Has someone else has updated the sb */
 		if (!does_sb_need_changing(mddev)) {
 			if (ret == 0)
 				md_cluster_ops->metadata_update_cancel(mddev);
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			spin_lock(&mddev->write_lock);
+			if (!test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
+			    !test_bit(MD_CHANGE_CLEAN, &mddev->flags))
+				clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			spin_unlock(&mddev->lock);
 			return;
 		}
 	}
@@ -2431,7 +2437,8 @@ repeat:
 
 	spin_lock(&mddev->lock);
 	if (mddev->in_sync != sync_req ||
-	    test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
+	    test_bit(MD_CHANGE_DEVS, &mddev->flags) ||
+	    test_bit(MD_CHANGE_CLEAN, &mddev->flags)) {
 		/* have to write it out again */
 		spin_unlock(&mddev->lock);
 		goto repeat;
@@ -8142,18 +8149,20 @@ void md_do_sync(struct md_thread *thread)
 		}
 	}
  skip:
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
-
 	if (mddev_is_clustered(mddev) &&
 	    ret == 0) {
 		/* set CHANGE_PENDING here since maybe another
 		 * update is needed, so other nodes are informed */
+		spin_lock(&mddev->lock);
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 		md_cluster_ops->resync_finish(mddev);
-	}
+	} else
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
 	spin_lock(&mddev->lock);
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -8546,6 +8555,7 @@ EXPORT_SYMBOL(md_finish_reshape);
 int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		       int is_new)
 {
+	struct mddev *mddev = rdev->mddev;
 	int rv;
 	if (is_new)
 		s += rdev->new_data_offset;
@@ -8555,8 +8565,10 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 	if (rv == 0) {
 		/* Make sure they get written out promptly */
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &rdev->mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(rdev->mddev->thread);
 		return 1;
 	} else
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 39fb21e..cd9e4cc 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1474,8 +1474,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	 * if recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	spin_lock(&mddev->lock);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
+	spin_unlock(&mddev->lock);
 	printk(KERN_ALERT
 	       "md/raid1:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid1:%s: Operation continuing on %d devices.\n",
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e3fd725..98a4cf1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1102,8 +1102,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		bio->bi_iter.bi_sector < conf->reshape_progress))) {
 		/* Need to update reshape_position in metadata */
 		mddev->reshape_position = conf->reshape_progress;
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
@@ -1585,15 +1587,17 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * If recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
+	spin_lock(&mddev->lock);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock(&mddev->lock);
 	printk(KERN_ALERT
 	       "md/raid10:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid10:%s: Operation continuing on %d devices.\n",
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9531f5f..2ba9366 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -712,8 +712,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	 * in_teardown check workaround this issue.
 	 */
 	if (!log->in_teardown) {
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 31ac0f0..a36af8b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2514,8 +2514,10 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
+	spin_lock(&mddev->lock);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
+	spin_unlock(&mddev->lock);
 	printk(KERN_ALERT
 	       "md/raid:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid:%s: Operation continuing on %d devices.\n",
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/13] md-cluster: always setup in-memory bitmap
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (7 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  7:00   ` kbuild test robot
                     ` (2 more replies)
  2016-04-21  5:58 ` [PATCH 10/13] md-cluster: sync bitmap when node received RESYNCING msg Guoqing Jiang
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

The in-memory bitmap for raid is allocated on demand,
so in the cluster scenario it is possible that the slave
node which received the RESYNCING message doesn't have
the in-memory bitmap while the master node is performing
resync, and then we can't keep the bitmaps consistent
among the nodes.

So for the cluster scenario, we need to always preserve
the bitmap and ensure the pages will not be freed. A
no_hijack flag is introduced to both bitmap_checkpage
and bitmap_get_counter, which makes cluster raid return
failure once the allocation fails.

The next patch relies on this change, since it keeps the
bitmaps in sync among the nodes during the resync stage.
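
For context, the "hijack" fallback that no_hijack opts out of works
roughly as follows (simplified sketch of the existing
bitmap_get_counter() logic): when a counter page cannot be
allocated, the page-pointer slot itself is reused as two 16-bit
counters, each covering half of the page's block range:

    if (bitmap->bp[page].hijacked) {
            /* no real page: the pointer field doubles as two
             * bitmap_counter_t values, pick the one covering
             * this offset */
            int hi = (pageoff > PAGE_COUNTER_MASK);
            return &((bitmap_counter_t *)
                     &bitmap->bp[page].map)[hi];
    }

Two counters instead of a page's worth is far too coarse to keep
the bitmaps of different nodes in step, hence the hard -ENOMEM for
cluster raid.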

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 46 insertions(+), 13 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 7df6b4f..00cf1c1 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -46,7 +46,7 @@ static inline char *bmname(struct bitmap *bitmap)
  * allocated while we're using it
  */
 static int bitmap_checkpage(struct bitmap_counts *bitmap,
-			    unsigned long page, int create)
+			    unsigned long page, int create, int no_hijack)
 __releases(bitmap->lock)
 __acquires(bitmap->lock)
 {
@@ -90,6 +90,9 @@ __acquires(bitmap->lock)
 
 	if (mappage == NULL) {
 		pr_debug("md/bitmap: map page allocation failed, hijacking\n");
+		/* We don't support hijack for cluster raid */
+		if (no_hijack)
+			return -ENOMEM;
 		/* failed - set the hijacked flag so that we can use the
 		 * pointer as a counter */
 		if (!bitmap->bp[page].map)
@@ -1177,7 +1180,7 @@ static void bitmap_set_pending(struct bitmap_counts *bitmap, sector_t offset)
 
 static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
 					    sector_t offset, sector_t *blocks,
-					    int create);
+					    int create, int no_hijack);
 
 /*
  * bitmap daemon -- periodically wakes up to clean bits and flush pages
@@ -1257,7 +1260,7 @@ void bitmap_daemon_work(struct mddev *mddev)
 		}
 		bmc = bitmap_get_counter(counts,
 					 block,
-					 &blocks, 0);
+					 &blocks, 0, 0);
 
 		if (!bmc) {
 			j |= PAGE_COUNTER_MASK;
@@ -1307,7 +1310,7 @@ void bitmap_daemon_work(struct mddev *mddev)
 
 static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
 					    sector_t offset, sector_t *blocks,
-					    int create)
+					    int create, int no_hijack)
 __releases(bitmap->lock)
 __acquires(bitmap->lock)
 {
@@ -1321,7 +1324,7 @@ __acquires(bitmap->lock)
 	sector_t csize;
 	int err;
 
-	err = bitmap_checkpage(bitmap, page, create);
+	err = bitmap_checkpage(bitmap, page, create, 0);
 
 	if (bitmap->bp[page].hijacked ||
 	    bitmap->bp[page].map == NULL)
@@ -1368,7 +1371,7 @@ int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sect
 		bitmap_counter_t *bmc;
 
 		spin_lock_irq(&bitmap->counts.lock);
-		bmc = bitmap_get_counter(&bitmap->counts, offset, &blocks, 1);
+		bmc = bitmap_get_counter(&bitmap->counts, offset, &blocks, 1, 0);
 		if (!bmc) {
 			spin_unlock_irq(&bitmap->counts.lock);
 			return 0;
@@ -1430,7 +1433,7 @@ void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long secto
 		bitmap_counter_t *bmc;
 
 		spin_lock_irqsave(&bitmap->counts.lock, flags);
-		bmc = bitmap_get_counter(&bitmap->counts, offset, &blocks, 0);
+		bmc = bitmap_get_counter(&bitmap->counts, offset, &blocks, 0, 0);
 		if (!bmc) {
 			spin_unlock_irqrestore(&bitmap->counts.lock, flags);
 			return;
@@ -1474,7 +1477,7 @@ static int __bitmap_start_sync(struct bitmap *bitmap, sector_t offset, sector_t
 		return 1; /* always resync if no bitmap */
 	}
 	spin_lock_irq(&bitmap->counts.lock);
-	bmc = bitmap_get_counter(&bitmap->counts, offset, blocks, 0);
+	bmc = bitmap_get_counter(&bitmap->counts, offset, blocks, 0, 0);
 	rv = 0;
 	if (bmc) {
 		/* locked */
@@ -1526,7 +1529,7 @@ void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, sector_t *blocks, i
 		return;
 	}
 	spin_lock_irqsave(&bitmap->counts.lock, flags);
-	bmc = bitmap_get_counter(&bitmap->counts, offset, blocks, 0);
+	bmc = bitmap_get_counter(&bitmap->counts, offset, blocks, 0, 0);
 	if (bmc == NULL)
 		goto unlock;
 	/* locked */
@@ -1604,7 +1607,7 @@ static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int n
 	sector_t secs;
 	bitmap_counter_t *bmc;
 	spin_lock_irq(&bitmap->counts.lock);
-	bmc = bitmap_get_counter(&bitmap->counts, offset, &secs, 1);
+	bmc = bitmap_get_counter(&bitmap->counts, offset, &secs, 1, 0);
 	if (!bmc) {
 		spin_unlock_irq(&bitmap->counts.lock);
 		return;
@@ -2029,17 +2032,47 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		     chunks << chunkshift);
 
 	spin_lock_irq(&bitmap->counts.lock);
+	/* For cluster raid, need to pre-allocate bitmap */
+	if (mddev_is_clustered(bitmap->mddev)) {
+		unsigned long page;
+		for (page = 0; page < pages; page++) {
+			ret = bitmap_checkpage(&bitmap->counts, page, 1, 1);
+			if (ret) {
+				unsigned long k;
+
+				/* deallocate the page memory */
+				for (k = 0; k < page; k++) {
+					if (new_bp[k].map)
+						kfree(new_bp[k].map);
+				}
+
+				/* restore some fields from old_counts */
+				bitmap->counts.bp = old_counts.bp;
+				bitmap->counts.pages = old_counts.pages;
+				bitmap->counts.missing_pages = old_counts.pages;
+				bitmap->counts.chunkshift = old_counts.chunkshift;
+				bitmap->counts.chunks = old_counts.chunks;
+				bitmap->mddev->bitmap_info.chunksize = 1 << (old_counts.chunkshift +
+									     BITMAP_BLOCK_SHIFT);
+				blocks = old_counts.chunks << old_counts.chunkshift;
+				pr_err("Could not pre-allocate in-memory bitmap for cluster raid\n");
+				break;
+			} else
+				bitmap->counts.bp[page].count += 1;
+		}
+	}
+
 	for (block = 0; block < blocks; ) {
 		bitmap_counter_t *bmc_old, *bmc_new;
 		int set;
 
 		bmc_old = bitmap_get_counter(&old_counts, block,
-					     &old_blocks, 0);
+					     &old_blocks, 0, 0);
 		set = bmc_old && NEEDED(*bmc_old);
 
 		if (set) {
 			bmc_new = bitmap_get_counter(&bitmap->counts, block,
-						     &new_blocks, 1);
+						     &new_blocks, 1, 0);
 			if (*bmc_new == 0) {
 				/* need to set on-disk bits too. */
 				sector_t end = block + new_blocks;
@@ -2067,7 +2100,7 @@ int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		while (block < (chunks << chunkshift)) {
 			bitmap_counter_t *bmc;
 			bmc = bitmap_get_counter(&bitmap->counts, block,
-						 &new_blocks, 1);
+						 &new_blocks, 1, 0);
 			if (bmc) {
 				/* new space.  It needs to be resynced, so
 				 * we set NEEDED_MASK.
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/13] md-cluster: sync bitmap when node received RESYNCING msg
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (8 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 11/13] md-cluster/bitmap: fix wrong calculation of offset Guoqing Jiang
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

If a node receives a RESYNCING message, which means
another node will perform resync on that area, then we
don't want to do it again on this node.

Let's set RESYNC_MASK and clear NEEDED_MASK for the
region from old-low to new-low, which has finished
syncing, while the region from old-hi to new-hi is
about to be synced; bitmap_sync_with_cluster is
introduced for this purpose.
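
A worked example (illustrative numbers): if the previous RESYNCING
message carried [old_lo=0, old_hi=1000] and the new one carries
[lo=800, hi=2000], then bitmap_end_sync() is applied over [0, 800),
which the resyncing node has finished, and bitmap_start_sync() over
[1000, 2000), which it is about to sync, so this node won't resync
those sections itself.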

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c     | 21 +++++++++++++++++++++
 drivers/md/bitmap.h     |  3 +++
 drivers/md/md-cluster.c | 27 +++++++++++++++++++++++++++
 3 files changed, 51 insertions(+)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 00cf1c1..2e198f3 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1597,6 +1597,27 @@ void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector, bool force)
 }
 EXPORT_SYMBOL(bitmap_cond_end_sync);
 
+void bitmap_sync_with_cluster(struct mddev *mddev,
+			      sector_t old_lo, sector_t old_hi,
+			      sector_t new_lo, sector_t new_hi)
+{
+	struct bitmap *bitmap = mddev->bitmap;
+	sector_t sector, blocks = 0;
+
+	for (sector = old_lo; sector < new_lo; ) {
+		bitmap_end_sync(bitmap, sector, &blocks, 0);
+		sector += blocks;
+	}
+	WARN((blocks > new_lo) && old_lo, "alignment is not correct for lo\n");
+
+	for (sector = old_hi; sector < new_hi; ) {
+		bitmap_start_sync(bitmap, sector, &blocks, 0);
+		sector += blocks;
+	}
+	WARN((blocks > new_hi) && old_hi, "alignment is not correct for hi\n");
+}
+EXPORT_SYMBOL(bitmap_sync_with_cluster);
+
 static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int needed)
 {
 	/* For each chunk covered by any of these sectors, set the
diff --git a/drivers/md/bitmap.h b/drivers/md/bitmap.h
index 5e3fcd6..5b6dd63 100644
--- a/drivers/md/bitmap.h
+++ b/drivers/md/bitmap.h
@@ -258,6 +258,9 @@ int bitmap_start_sync(struct bitmap *bitmap, sector_t offset, sector_t *blocks,
 void bitmap_end_sync(struct bitmap *bitmap, sector_t offset, sector_t *blocks, int aborted);
 void bitmap_close_sync(struct bitmap *bitmap);
 void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector, bool force);
+void bitmap_sync_with_cluster(struct mddev *mddev,
+			      sector_t old_lo, sector_t old_hi,
+			      sector_t new_lo, sector_t new_hi);
 
 void bitmap_unplug(struct bitmap *bitmap);
 void bitmap_daemon_work(struct mddev *mddev);
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 30f1160..a55b5f4 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -85,6 +85,9 @@ struct md_cluster_info {
 	struct completion newdisk_completion;
 	wait_queue_head_t wait;
 	unsigned long state;
+	/* record the region in RESYNCING message */
+	sector_t sync_low;
+	sector_t sync_hi;
 };
 
 enum msg_type {
@@ -411,6 +414,30 @@ static void process_suspend_info(struct mddev *mddev,
 		md_wakeup_thread(mddev->thread);
 		return;
 	}
+
+	/*
+	 * The bitmaps are not same for different nodes
+	 * if RESYNCING is happening in one node, then
+	 * the node which received the RESYNCING message
+	 * probably will perform resync with the region
+	 * [lo, hi] again, so we could reduce resync time
+	 * a lot if we can ensure that the bitmaps among
+	 * different nodes are match up well.
+	 *
+	 * sync_low/hi is used to record the region which
+	 * arrived in the previous RESYNCING message,
+	 *
+	 * Call bitmap_sync_with_cluster to clear
+	 * NEEDED_MASK and set RESYNC_MASK since
+	 * resync thread is running in another node,
+	 * so we don't need to do the resync again
+	 * with the same section */
+	bitmap_sync_with_cluster(mddev, cinfo->sync_low,
+					cinfo->sync_hi,
+					lo, hi);
+	cinfo->sync_low = lo;
+	cinfo->sync_hi = hi;
+
 	s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
 	if (!s)
 		return;
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/13] md-cluster/bitmap: fix wrong calculation of offset
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (9 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 10/13] md-cluster: sync bitmap when node received RESYNCING msg Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 12/13] md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 13/13] md-cluster/bitmap: unplug bitmap to sync dirty pages to disk Guoqing Jiang
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

The offset is wrong in bitmap_storage_alloc; we should
set it as below, consistent with bitmap_init_from_disk():

node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));

Because 'offset' is only assigned to 'page->index', and
that is usually over-written by read_sb_page, it does not
cause a problem in general, but it still needs to be fixed.
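
A worked example (illustrative numbers): if the bitmap data plus
superblock need num_pages = 3 pages per slot, slot 0 owns pages 0-2
and slot 1 should start at page 3 (slot_number * num_pages), but
the old formula slot_number * (num_pages - 1) yields 2 for slot 1,
overlapping the last page of slot 0.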

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 2e198f3..bdae67a 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -759,7 +759,7 @@ static int bitmap_storage_alloc(struct bitmap_storage *store,
 		bytes += sizeof(bitmap_super_t);
 
 	num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE);
-	offset = slot_number * (num_pages - 1);
+	offset = slot_number * num_pages;
 
 	store->filemap = kmalloc(sizeof(struct page *)
 				 * num_pages, GFP_KERNEL);
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 12/13] md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (10 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 11/13] md-cluster/bitmap: fix wrong calculation of offset Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  2016-04-21  5:58 ` [PATCH 13/13] md-cluster/bitmap: unplug bitmap to sync dirty pages to disk Guoqing Jiang
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

The pnum passed to set_page_attr and test_page_attr
should range from 0 to storage.file_pages - 1, but
bitmap_file_set_bit and bitmap_file_clear_bit call
set_page_attr and test_page_attr with the page->index
parameter, and page->index has already had node_offset
added to it.

So we need to subtract node_offset in both
bitmap_file_clear_bit and bitmap_file_set_bit.
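
For example (illustrative numbers): with store->file_pages = 4 and
cluster_slot = 2, node_offset is 8, so this node's pages carry
page->index 8..11 while the per-page attribute bits are indexed
0..3, hence the subtraction of node_offset.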

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index bdae67a..08f4c04 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -903,6 +903,11 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
 	struct page *page;
 	void *kaddr;
 	unsigned long chunk = block >> bitmap->counts.chunkshift;
+	struct bitmap_storage *store = &bitmap->storage;
+	unsigned long node_offset = 0;
+
+	if (mddev_is_clustered(bitmap->mddev))
+		node_offset = bitmap->cluster_slot * store->file_pages;
 
 	page = filemap_get_page(&bitmap->storage, chunk);
 	if (!page)
@@ -918,7 +923,7 @@ static void bitmap_file_set_bit(struct bitmap *bitmap, sector_t block)
 	kunmap_atomic(kaddr);
 	pr_debug("set file bit %lu page %lu\n", bit, page->index);
 	/* record page number so it gets flushed to disk when unplug occurs */
-	set_page_attr(bitmap, page->index, BITMAP_PAGE_DIRTY);
+	set_page_attr(bitmap, page->index - node_offset, BITMAP_PAGE_DIRTY);
 }
 
 static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block)
@@ -927,6 +932,11 @@ static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block)
 	struct page *page;
 	void *paddr;
 	unsigned long chunk = block >> bitmap->counts.chunkshift;
+	struct bitmap_storage *store = &bitmap->storage;
+	unsigned long node_offset = 0;
+
+	if (mddev_is_clustered(bitmap->mddev))
+		node_offset = bitmap->cluster_slot * store->file_pages;
 
 	page = filemap_get_page(&bitmap->storage, chunk);
 	if (!page)
@@ -938,8 +948,8 @@ static void bitmap_file_clear_bit(struct bitmap *bitmap, sector_t block)
 	else
 		clear_bit_le(bit, paddr);
 	kunmap_atomic(paddr);
-	if (!test_page_attr(bitmap, page->index, BITMAP_PAGE_NEEDWRITE)) {
-		set_page_attr(bitmap, page->index, BITMAP_PAGE_PENDING);
+	if (!test_page_attr(bitmap, page->index - node_offset, BITMAP_PAGE_NEEDWRITE)) {
+		set_page_attr(bitmap, page->index - node_offset, BITMAP_PAGE_PENDING);
 		bitmap->allclean = 0;
 	}
 }
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 13/13] md-cluster/bitmap: unplug bitmap to sync dirty pages to disk
  2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
                   ` (11 preceding siblings ...)
  2016-04-21  5:58 ` [PATCH 12/13] md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit Guoqing Jiang
@ 2016-04-21  5:58 ` Guoqing Jiang
  12 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  5:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

This patch is doing two distinct but related things.

1. It adds bitmap_unplug() for the main bitmap (mddev->bitmap).  As bits
have been set, BITMAP_PAGE_DIRTY is set so bitmap_daemon_work() will
not write those pages out in its regular scans, only bitmap_unplug()
will.  If there are no writes to the array, bitmap_unplug() won't be
called, so we need to call it explicitly here.

2. bitmap_write_all() is a bit of a confusing interface as it doesn't
actually write anything.  The current code for writing "bitmap" works
but this change makes it a bit clearer.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/bitmap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 08f4c04..9830dbf 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1921,14 +1921,14 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 
 	if (clear_bits) {
 		bitmap_update_sb(bitmap);
-		/* Setting this for the ev_page should be enough.
-		 * And we do not require both write_all and PAGE_DIRT either
-		 */
+		/* BITMAP_PAGE_PENDING is set, but bitmap_unplug needs
+		 * BITMAP_PAGE_DIRTY or _NEEDWRITE to write ... */
 		for (i = 0; i < bitmap->storage.file_pages; i++)
-			set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY);
-		bitmap_write_all(bitmap);
+			if (test_page_attr(bitmap, i, BITMAP_PAGE_PENDING))
+				set_page_attr(bitmap, i, BITMAP_PAGE_NEEDWRITE);
 		bitmap_unplug(bitmap);
 	}
+	bitmap_unplug(mddev->bitmap);
 	*low = lo;
 	*high = hi;
 err:
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous
  2016-04-21  5:58 ` [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous Guoqing Jiang
@ 2016-04-21  6:20   ` kbuild test robot
  2016-04-21  6:55   ` [PATCH V2 " Guoqing Jiang
  1 sibling, 0 replies; 33+ messages in thread
From: kbuild test robot @ 2016-04-21  6:20 UTC (permalink / raw)
  Cc: kbuild-all, shli, neilb, linux-raid, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 2879 bytes --]

Hi,

[auto build test WARNING on v4.6-rc4]
[also build test WARNING on next-20160420]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/The-latest-patches-for-md-cluster/20160421-140507
config: x86_64-randconfig-x010-201616 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/md/md.c: In function 'md_start_sync':
>> drivers/md/md.c:8244:1: warning: label 'out' defined but not used [-Wunused-label]
    out:
    ^

vim +/out +8244 drivers/md/md.c

60fc1370 NeilBrown         2011-12-23  8228  		}
b4c4c7b8 NeilBrown         2007-02-28  8229  	}
746d3207 NeilBrown         2013-04-24  8230  no_add:
6dafab6b NeilBrown         2012-09-19  8231  	if (removed)
6dafab6b NeilBrown         2012-09-19  8232  		set_bit(MD_CHANGE_DEVS, &mddev->flags);
b4c4c7b8 NeilBrown         2007-02-28  8233  	return spares;
b4c4c7b8 NeilBrown         2007-02-28  8234  }
7ebc0be7 NeilBrown         2011-01-14  8235  
ac05f256 NeilBrown         2014-09-30  8236  static void md_start_sync(struct work_struct *ws)
ac05f256 NeilBrown         2014-09-30  8237  {
ac05f256 NeilBrown         2014-09-30  8238  	struct mddev *mddev = container_of(ws, struct mddev, del_work);
c186b128 Goldwyn Rodrigues 2015-09-30  8239  	int ret = 0;
c186b128 Goldwyn Rodrigues 2015-09-30  8240  
ac05f256 NeilBrown         2014-09-30  8241  	mddev->sync_thread = md_register_thread(md_do_sync,
ac05f256 NeilBrown         2014-09-30  8242  						mddev,
ac05f256 NeilBrown         2014-09-30  8243  						"resync");
c186b128 Goldwyn Rodrigues 2015-09-30 @8244  out:
ac05f256 NeilBrown         2014-09-30  8245  	if (!mddev->sync_thread) {
c186b128 Goldwyn Rodrigues 2015-09-30  8246  		if (!(mddev_is_clustered(mddev) && ret == -EAGAIN))
ac05f256 NeilBrown         2014-09-30  8247  			printk(KERN_ERR "%s: could not start resync"
ac05f256 NeilBrown         2014-09-30  8248  			       " thread...\n",
ac05f256 NeilBrown         2014-09-30  8249  			       mdname(mddev));
ac05f256 NeilBrown         2014-09-30  8250  		/* leave the spares where they are, it shouldn't hurt */
ac05f256 NeilBrown         2014-09-30  8251  		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
ac05f256 NeilBrown         2014-09-30  8252  		clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);

:::::: The code at line 8244 was first introduced by commit
:::::: c186b128cda5a246da25f474e4689cb2bfacfcac md-cluster: Perform resync/recovery under a DLM lock

:::::: TO: Goldwyn Rodrigues <rgoldwyn@suse.com>
:::::: CC: Goldwyn Rodrigues <rgoldwyn@suse.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 23720 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-21  5:58 ` [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region Guoqing Jiang
@ 2016-04-21  6:26   ` kbuild test robot
  2016-04-21  6:58   ` [PATCH V2 " Guoqing Jiang
  1 sibling, 0 replies; 33+ messages in thread
From: kbuild test robot @ 2016-04-21  6:26 UTC (permalink / raw)
  Cc: kbuild-all, shli, neilb, linux-raid, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

Hi,

[auto build test ERROR on v4.6-rc4]
[also build test ERROR on next-20160420]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/The-latest-patches-for-md-cluster/20160421-140507
config: x86_64-randconfig-x010-201616 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/md/md.c: In function 'md_update_sb':
>> drivers/md/md.c:2303:20: error: 'struct mddev' has no member named 'write_lock'
       spin_lock(&mddev->write_lock);
                       ^
   drivers/md/md.c: In function 'md_start_sync':
   drivers/md/md.c:8262:1: warning: label 'out' defined but not used [-Wunused-label]
    out:
    ^

vim +2303 drivers/md/md.c

  2297				nospares = 1;
  2298			ret = md_cluster_ops->metadata_update_start(mddev);
  2299			/* Has someone else has updated the sb */
  2300			if (!does_sb_need_changing(mddev)) {
  2301				if (ret == 0)
  2302					md_cluster_ops->metadata_update_cancel(mddev);
> 2303				spin_lock(&mddev->write_lock);
  2304				if (!test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
  2305				    !test_bit(MD_CHANGE_CLEAN, &mddev->flags))
  2306					clear_bit(MD_CHANGE_PENDING, &mddev->flags);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 23720 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH V2 01/13] md-cluster: change resync lock from asynchronous to synchronous
  2016-04-21  5:58 ` [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous Guoqing Jiang
  2016-04-21  6:20   ` kbuild test robot
@ 2016-04-21  6:55   ` Guoqing Jiang
  1 sibling, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  6:55 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

If multiple nodes choose to attempt resync at the same time
they need to be serialized so they don't duplicate effort. This
serialization is done by locking the 'resync' DLM lock.

Currently if a node cannot get the lock immediately it doesn't
request notification when the lock becomes available (i.e.
DLM_LKF_NOQUEUE is set), so it may not reliably find out when it
is safe to try again.

Rather than trying to arrange an async wake-up when the lock
becomes available, switch to using synchronous locking - this is
a lot easier to think about.  As it is not permitted to block in
the 'raid1d' thread, move the locking to the resync thread.  So
the resync thread is forked immediately, but it blocks until the
resync lock is available. Once the lock is acquired it checks again
whether any resync action is needed.

A particular symptom of the current problem is that a node can
get stuck with "resync=pending" indefinitely.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
Changes:
1. fix the warning reported by the auto build test

 drivers/md/md-cluster.c |  2 --
 drivers/md/md.c         | 23 ++++++++++++++---------
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index dd97d42..12fbfec 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -937,7 +937,6 @@ static void metadata_update_cancel(struct mddev *mddev)
 static int resync_start(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
-	cinfo->resync_lockres->flags |= DLM_LKF_NOQUEUE;
 	return dlm_lock_sync(cinfo->resync_lockres, DLM_LOCK_EX);
 }
 
@@ -967,7 +966,6 @@ static int resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi)
 static int resync_finish(struct mddev *mddev)
 {
 	struct md_cluster_info *cinfo = mddev->cluster_info;
-	cinfo->resync_lockres->flags &= ~DLM_LKF_NOQUEUE;
 	dlm_unlock_sync(cinfo->resync_lockres);
 	return resync_info_update(mddev, 0, 0);
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 194580f..6a0a5b2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7784,6 +7784,7 @@ void md_do_sync(struct md_thread *thread)
 	char *desc, *action = NULL;
 	struct blk_plug plug;
 	bool cluster_resync_finished = false;
+	int ret;
 
 	/* just incase thread restarts... */
 	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
@@ -7793,6 +7794,19 @@ void md_do_sync(struct md_thread *thread)
 		return;
 	}
 
+	if (mddev_is_clustered(mddev)) {
+		ret = md_cluster_ops->resync_start(mddev);
+		if (ret)
+			goto skip;
+
+		if (!(test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
+			test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ||
+			test_bit(MD_RECOVERY_RECOVER, &mddev->recovery))
+		     && ((unsigned long long)mddev->curr_resync_completed
+			 < (unsigned long long)mddev->resync_max_sectors))
+			goto skip;
+	}
+
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
 		if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
 			desc = "data-check";
@@ -8224,18 +8238,9 @@ static void md_start_sync(struct work_struct *ws)
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 	int ret = 0;
 
-	if (mddev_is_clustered(mddev)) {
-		ret = md_cluster_ops->resync_start(mddev);
-		if (ret) {
-			mddev->sync_thread = NULL;
-			goto out;
-		}
-	}
-
 	mddev->sync_thread = md_register_thread(md_do_sync,
 						mddev,
 						"resync");
-out:
 	if (!mddev->sync_thread) {
 		if (!(mddev_is_clustered(mddev) && ret == -EAGAIN))
 			printk(KERN_ERR "%s: could not start resync"
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH V2 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-21  5:58 ` [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region Guoqing Jiang
  2016-04-21  6:26   ` kbuild test robot
@ 2016-04-21  6:58   ` Guoqing Jiang
  2016-04-25 17:32     ` Shaohua Li
  2016-04-27  1:56     ` [PATCH V3 " Guoqing Jiang
  1 sibling, 2 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  6:58 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are protected by
mddev->lock.
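
One interleaving that loses the update, shown as an illustrative
timeline (not part of the patch itself):

	waiting thread                      md_update_sb()
	--------------                      --------------
	                                    tests: no update needed yet
	set_bit(MD_CHANGE_DEVS)
	set_bit(MD_CHANGE_PENDING)
	                                    clear_bit(MD_CHANGE_PENDING)
	wait_event(!MD_CHANGE_PENDING)      /* returns at once, but the
	                                       superblock update carrying
	                                       the new flags was never
	                                       written out */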

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
Changes:
1. s/write_lock/lock which is reported by auto build

 drivers/md/md.c          | 22 +++++++++++++++++-----
 drivers/md/raid1.c       |  2 ++
 drivers/md/raid10.c      |  6 +++++-
 drivers/md/raid5-cache.c |  2 ++
 drivers/md/raid5.c       |  2 ++
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bf2b74d..e74657e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2293,12 +2293,18 @@ repeat:
 	if (mddev_is_clustered(mddev)) {
 		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
 			force_change = 1;
+		if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
+			nospares = 1;
 		ret = md_cluster_ops->metadata_update_start(mddev);
 		/* Has someone else has updated the sb */
 		if (!does_sb_need_changing(mddev)) {
 			if (ret == 0)
 				md_cluster_ops->metadata_update_cancel(mddev);
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			spin_lock(&mddev->lock);
+			if (!test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
+			    !test_bit(MD_CHANGE_CLEAN, &mddev->flags))
+				clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			spin_unlock(&mddev->lock);
 			return;
 		}
 	}
@@ -2434,7 +2440,8 @@ repeat:
 
 	spin_lock(&mddev->lock);
 	if (mddev->in_sync != sync_req ||
-	    test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
+	    test_bit(MD_CHANGE_DEVS, &mddev->flags) ||
+	    test_bit(MD_CHANGE_CLEAN, &mddev->flags)) {
 		/* have to write it out again */
 		spin_unlock(&mddev->lock);
 		goto repeat;
@@ -8145,18 +8152,20 @@ void md_do_sync(struct md_thread *thread)
 		}
 	}
  skip:
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
-
 	if (mddev_is_clustered(mddev) &&
 	    ret == 0) {
 		/* set CHANGE_PENDING here since maybe another
 		 * update is needed, so other nodes are informed */
+		spin_lock(&mddev->lock);
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 		md_cluster_ops->resync_finish(mddev);
-	}
+	} else
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
 	spin_lock(&mddev->lock);
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -8548,6 +8557,7 @@ EXPORT_SYMBOL(md_finish_reshape);
 int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		       int is_new)
 {
+	struct mddev *mddev = rdev->mddev;
 	int rv;
 	if (is_new)
 		s += rdev->new_data_offset;
@@ -8557,8 +8567,10 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 	if (rv == 0) {
 		/* Make sure they get written out promptly */
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &rdev->mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(rdev->mddev->thread);
 		return 1;
 	} else
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a7f2b9c..985fa07 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1474,8 +1474,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	 * if recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	spin_lock(&mddev->lock);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
+	spin_unlock(&mddev->lock);
 	printk(KERN_ALERT
 	       "md/raid1:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid1:%s: Operation continuing on %d devices.\n",
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e3fd725..98a4cf1 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1102,8 +1102,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		bio->bi_iter.bi_sector < conf->reshape_progress))) {
 		/* Need to update reshape_position in metadata */
 		mddev->reshape_position = conf->reshape_progress;
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
@@ -1585,15 +1587,17 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * If recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
+	spin_lock(&mddev->lock);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock(&mddev->lock);
 	printk(KERN_ALERT
 	       "md/raid10:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid10:%s: Operation continuing on %d devices.\n",
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9531f5f..2ba9366 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -712,8 +712,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	 * in_teardown check workaround this issue.
 	 */
 	if (!log->in_teardown) {
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8ab8b65..d4a1e37 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2514,8 +2514,10 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
+	spin_lock(&mddev->lock);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
+	spin_unlock(&mddev->lock);
 	printk(KERN_ALERT
 	       "md/raid:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid:%s: Operation continuing on %d devices.\n",
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/13] md-cluster: always setup in-memory bitmap
  2016-04-21  5:58 ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Guoqing Jiang
@ 2016-04-21  7:00   ` kbuild test robot
  2016-04-21  7:00   ` [PATCH] md-cluster: fix ifnullfree.cocci warnings kbuild test robot
  2016-04-25 17:45   ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Shaohua Li
  2 siblings, 0 replies; 33+ messages in thread
From: kbuild test robot @ 2016-04-21  7:00 UTC (permalink / raw)
  Cc: kbuild-all, shli, neilb, linux-raid, Guoqing Jiang

Hi,

[auto build test WARNING on v4.6-rc4]
[also build test WARNING on next-20160420]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Guoqing-Jiang/The-latest-patches-for-md-cluster/20160421-140507


coccinelle warnings: (new ones prefixed by >>)

>> drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] md-cluster: fix ifnullfree.cocci warnings
  2016-04-21  5:58 ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Guoqing Jiang
  2016-04-21  7:00   ` kbuild test robot
@ 2016-04-21  7:00   ` kbuild test robot
  2016-04-21  9:10     ` Guoqing Jiang
  2016-04-25 17:45   ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Shaohua Li
  2 siblings, 1 reply; 33+ messages in thread
From: kbuild test robot @ 2016-04-21  7:00 UTC (permalink / raw)
  Cc: kbuild-all, shli, neilb, linux-raid, Guoqing Jiang

drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.

 NULL check before some freeing functions is not needed.

 Based on checkpatch warning
 "kfree(NULL) is safe this check is probably not required"
 and kfreeaddr.cocci by Julia Lawall.

Generated by: scripts/coccinelle/free/ifnullfree.cocci

CC: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---

 bitmap.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -2045,8 +2045,7 @@ int bitmap_resize(struct bitmap *bitmap,
 
 				/* deallocate the page memory */
 				for (k = 0; k < page; k++) {
-					if (new_bp[k].map)
-						kfree(new_bp[k].map);
+					kfree(new_bp[k].map);
 				}
 
 				/* restore some fields from old_counts */

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] md-cluster: fix ifnullfree.cocci warnings
  2016-04-21  7:00   ` [PATCH] md-cluster: fix ifnullfree.cocci warnings kbuild test robot
@ 2016-04-21  9:10     ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-21  9:10 UTC (permalink / raw)
  To: kbuild test robot; +Cc: kbuild-all, shli, neilb, linux-raid



On 04/21/2016 03:00 PM, kbuild test robot wrote:
> drivers/md/bitmap.c:2049:6-11: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values.
>
>   NULL check before some freeing functions is not needed.
>
>   Based on checkpatch warning
>   "kfree(NULL) is safe this check is probably not required"
>   and kfreeaddr.cocci by Julia Lawall.

Oops, I missed it.

Acked-by: Guoqing Jiang <gqjiang@suse.com>

Thanks,
Guoqing

> Generated by: scripts/coccinelle/free/ifnullfree.cocci
>
> CC: Guoqing Jiang <gqjiang@suse.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>
>   bitmap.c |    3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
>
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -2045,8 +2045,7 @@ int bitmap_resize(struct bitmap *bitmap,
>   
>   				/* deallocate the page memory */
>   				for (k = 0; k < page; k++) {
> -					if (new_bp[k].map)
> -						kfree(new_bp[k].map);
> +					kfree(new_bp[k].map);
>   				}
>   
>   				/* restore some fields from old_counts */
> --


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-21  6:58   ` [PATCH V2 " Guoqing Jiang
@ 2016-04-25 17:32     ` Shaohua Li
  2016-04-26  3:19       ` Guoqing Jiang
  2016-04-27  1:56     ` [PATCH V3 " Guoqing Jiang
  1 sibling, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2016-04-25 17:32 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: neilb, linux-raid

On Thu, Apr 21, 2016 at 02:58:15PM +0800, Guoqing Jiang wrote:
> Some code waits for a metadata update by:
> 
> 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
> 2. setting MD_CHANGE_PENDING and waking the management thread
> 3. waiting for MD_CHANGE_PENDING to be cleared
> 
> If the first two are done without locking, the code in md_update_sb()
> which checks if it needs to repeat might test if an update is needed
> before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
> in the wait returning early.
> 
> So make sure all places that set MD_CHANGE_PENDING are protected by
> mddev->lock.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
> Changes:
> 1. s/write_lock/lock which is reported by auto build
> 
>  drivers/md/md.c          | 22 +++++++++++++++++-----
>  drivers/md/raid1.c       |  2 ++
>  drivers/md/raid10.c      |  6 +++++-
>  drivers/md/raid5-cache.c |  2 ++
>  drivers/md/raid5.c       |  2 ++
>  5 files changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index bf2b74d..e74657e 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2293,12 +2293,18 @@ repeat:
>  	if (mddev_is_clustered(mddev)) {
>  		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
>  			force_change = 1;
> +		if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
> +			nospares = 1;
>  		ret = md_cluster_ops->metadata_update_start(mddev);
>  		/* Has someone else has updated the sb */
>  		if (!does_sb_need_changing(mddev)) {
>  			if (ret == 0)
>  				md_cluster_ops->metadata_update_cancel(mddev);
> -			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
> +			spin_lock(&mddev->lock);
> +			if (!test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
> +			    !test_bit(MD_CHANGE_CLEAN, &mddev->flags))
> +				clear_bit(MD_CHANGE_PENDING, &mddev->flags);
> +			spin_unlock(&mddev->lock);
>  			return;
>  		}
>  	}
> @@ -2434,7 +2440,8 @@ repeat:
>  
>  	spin_lock(&mddev->lock);
>  	if (mddev->in_sync != sync_req ||
> -	    test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
> +	    test_bit(MD_CHANGE_DEVS, &mddev->flags) ||
> +	    test_bit(MD_CHANGE_CLEAN, &mddev->flags)) {
>  		/* have to write it out again */
>  		spin_unlock(&mddev->lock);
>  		goto repeat;
> @@ -8145,18 +8152,20 @@ void md_do_sync(struct md_thread *thread)
>  		}
>  	}
>   skip:
> -	set_bit(MD_CHANGE_DEVS, &mddev->flags);
> -
>  	if (mddev_is_clustered(mddev) &&
>  	    ret == 0) {
>  		/* set CHANGE_PENDING here since maybe another
>  		 * update is needed, so other nodes are informed */
> +		spin_lock(&mddev->lock);
> +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +		spin_unlock(&mddev->lock);
>  		md_wakeup_thread(mddev->thread);
>  		wait_event(mddev->sb_wait,
>  			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>  		md_cluster_ops->resync_finish(mddev);
> -	}
> +	} else
> +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  
>  	spin_lock(&mddev->lock);
>  	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
> @@ -8548,6 +8557,7 @@ EXPORT_SYMBOL(md_finish_reshape);
>  int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
>  		       int is_new)
>  {
> +	struct mddev *mddev = rdev->mddev;
>  	int rv;
>  	if (is_new)
>  		s += rdev->new_data_offset;
> @@ -8557,8 +8567,10 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
>  	if (rv == 0) {
>  		/* Make sure they get written out promptly */
>  		sysfs_notify_dirent_safe(rdev->sysfs_state);
> +		spin_lock(&mddev->lock);
>  		set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
>  		set_bit(MD_CHANGE_PENDING, &rdev->mddev->flags);
> +		spin_unlock(&mddev->lock);
>  		md_wakeup_thread(rdev->mddev->thread);
>  		return 1;
>  	} else
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a7f2b9c..985fa07 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1474,8 +1474,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>  	 * if recovery is running, make sure it aborts.
>  	 */
>  	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> +	spin_lock(&mddev->lock);
>  	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  	set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +	spin_unlock(&mddev->lock);
>  	printk(KERN_ALERT
>  	       "md/raid1:%s: Disk failure on %s, disabling device.\n"
>  	       "md/raid1:%s: Operation continuing on %d devices.\n",
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index e3fd725..98a4cf1 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1102,8 +1102,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>  		bio->bi_iter.bi_sector < conf->reshape_progress))) {
>  		/* Need to update reshape_position in metadata */
>  		mddev->reshape_position = conf->reshape_progress;
> +		spin_lock(&mddev->lock);
>  		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +		spin_unlock(&mddev->lock);
>  		md_wakeup_thread(mddev->thread);
>  		wait_event(mddev->sb_wait,
>  			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> @@ -1585,15 +1587,17 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>  	}
>  	if (test_and_clear_bit(In_sync, &rdev->flags))
>  		mddev->degraded++;
> +	spin_unlock_irqrestore(&conf->device_lock, flags);
>  	/*
>  	 * If recovery is running, make sure it aborts.
>  	 */
>  	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>  	set_bit(Blocked, &rdev->flags);
>  	set_bit(Faulty, &rdev->flags);
> +	spin_lock(&mddev->lock);
>  	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  	set_bit(MD_CHANGE_PENDING, &mddev->flags);
> -	spin_unlock_irqrestore(&conf->device_lock, flags);
> +	spin_unlock(&mddev->lock);
>  	printk(KERN_ALERT
>  	       "md/raid10:%s: Disk failure on %s, disabling device.\n"
>  	       "md/raid10:%s: Operation continuing on %d devices.\n",
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 9531f5f..2ba9366 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -712,8 +712,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>  	 * in_teardown check workaround this issue.
>  	 */
>  	if (!log->in_teardown) {
> +		spin_lock(&mddev->lock);
>  		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +		spin_unlock(&mddev->lock);
>  		md_wakeup_thread(mddev->thread);
>  		wait_event(mddev->sb_wait,
>  			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8ab8b65..d4a1e37 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2514,8 +2514,10 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>  
>  	set_bit(Blocked, &rdev->flags);
>  	set_bit(Faulty, &rdev->flags);
> +	spin_lock(&mddev->lock);
>  	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  	set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +	spin_unlock(&mddev->lock);
>  	printk(KERN_ALERT
>  	       "md/raid:%s: Disk failure on %s, disabling device.\n"
>  	       "md/raid:%s: Operation continuing on %d devices.\n",

The .error_handler is called by md_error, which is called by .bio_endio, which
runs in softirq. So the lock should be held with irq disabled.
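
For illustration, the irq-safe variant those paths need looks roughly
like this (a sketch only):

	unsigned long flags;

	/* mddev->lock can now be taken from softirq context via
	 * md_error(), so process-context users must disable interrupts
	 * while holding it to avoid a self-deadlock. */
	spin_lock_irqsave(&mddev->lock, flags);
	set_bit(MD_CHANGE_DEVS, &mddev->flags);
	set_bit(MD_CHANGE_PENDING, &mddev->flags);
	spin_unlock_irqrestore(&mddev->lock, flags);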

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/13] md-cluster: always setup in-memory bitmap
  2016-04-21  5:58 ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Guoqing Jiang
  2016-04-21  7:00   ` kbuild test robot
  2016-04-21  7:00   ` [PATCH] md-cluster: fix ifnullfree.cocci warnings kbuild test robot
@ 2016-04-25 17:45   ` Shaohua Li
  2016-04-26  3:22     ` Guoqing Jiang
  2 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2016-04-25 17:45 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: neilb, linux-raid

On Thu, Apr 21, 2016 at 01:58:10PM +0800, Guoqing Jiang wrote:
> The in-memory bitmap for raid is allocated on demand,
> so in the cluster scenario it is possible that the
> slave node which received the RESYNCING message doesn't
> have the in-memory bitmap while the master node is
> performing a resync, and we can't keep the bitmaps
> consistent across nodes.
> 
> So for the cluster scenario we must always preserve
> the bitmap and ensure its pages will not be freed. A
> no_hijack flag is introduced to both bitmap_checkpage
> and bitmap_get_counter, which makes cluster raid return
> failure once an allocation fails.
> 
> The next patch relies on this change, since it keeps
> the bitmaps in sync among the nodes during the resync
> stage.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/bitmap.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 46 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index 7df6b4f..00cf1c1 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -46,7 +46,7 @@ static inline char *bmname(struct bitmap *bitmap)
>   * allocated while we're using it
>   */
>  static int bitmap_checkpage(struct bitmap_counts *bitmap,
> -			    unsigned long page, int create)
> +			    unsigned long page, int create, int no_hijack)
>  __releases(bitmap->lock)
>  __acquires(bitmap->lock)
>  {
> @@ -90,6 +90,9 @@ __acquires(bitmap->lock)
>  
>  	if (mappage == NULL) {
>  		pr_debug("md/bitmap: map page allocation failed, hijacking\n");
> +		/* We don't support hijack for cluster raid */
> +		if (no_hijack)
> +			return -ENOMEM;
>  		/* failed - set the hijacked flag so that we can use the
>  		 * pointer as a counter */
>  		if (!bitmap->bp[page].map)
> @@ -1177,7 +1180,7 @@ static void bitmap_set_pending(struct bitmap_counts *bitmap, sector_t offset)
>  
>  static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
>  					    sector_t offset, sector_t *blocks,
> -					    int create);
> +					    int create, int no_hijack);
>  
>  /*
>   * bitmap daemon -- periodically wakes up to clean bits and flush pages
> @@ -1257,7 +1260,7 @@ void bitmap_daemon_work(struct mddev *mddev)
>  		}
>  		bmc = bitmap_get_counter(counts,
>  					 block,
> -					 &blocks, 0);
> +					 &blocks, 0, 0);
>  
>  		if (!bmc) {
>  			j |= PAGE_COUNTER_MASK;
> @@ -1307,7 +1310,7 @@ void bitmap_daemon_work(struct mddev *mddev)
>  
>  static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
>  					    sector_t offset, sector_t *blocks,
> -					    int create)
> +					    int create, int no_hijack)
>  __releases(bitmap->lock)
>  __acquires(bitmap->lock)
>  {
> @@ -1321,7 +1324,7 @@ __acquires(bitmap->lock)
>  	sector_t csize;
>  	int err;
>  
> -	err = bitmap_checkpage(bitmap, page, create);
> +	err = bitmap_checkpage(bitmap, page, create, 0);
>  
>  	if (bitmap->bp[page].hijacked ||
>  	    bitmap->bp[page].map == NULL)

bitmap_get_counter doesn't use the new no_hijack parameter, and you always pass
0 to this function, so it looks like this change isn't required.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V2 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-25 17:32     ` Shaohua Li
@ 2016-04-26  3:19       ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-26  3:19 UTC (permalink / raw)
  To: Shaohua Li; +Cc: neilb, linux-raid



On 04/26/2016 01:32 AM, Shaohua Li wrote:
> On Thu, Apr 21, 2016 at 02:58:15PM +0800, Guoqing Jiang wrote:
>> Some code waits for a metadata update by:
>>
>> 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
>> 2. setting MD_CHANGE_PENDING and waking the management thread
>> 3. waiting for MD_CHANGE_PENDING to be cleared
>>
>> If the first two are done without locking, the code in md_update_sb()
>> which checks if it needs to repeat might test if an update is needed
>> before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
>> in the wait returning early.
>>
>> So make sure all places that set MD_CHANGE_PENDING are protected by
>> mddev->lock.
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>> Changes:
>> 1. s/write_lock/lock which is reported by auto build
>>
>>   drivers/md/md.c          | 22 +++++++++++++++++-----
>>   drivers/md/raid1.c       |  2 ++
>>   drivers/md/raid10.c      |  6 +++++-
>>   drivers/md/raid5-cache.c |  2 ++
>>   drivers/md/raid5.c       |  2 ++
>>   5 files changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index bf2b74d..e74657e 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -2293,12 +2293,18 @@ repeat:
>>   	if (mddev_is_clustered(mddev)) {
>>   		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
>>   			force_change = 1;
>> +		if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
>> +			nospares = 1;
>>   		ret = md_cluster_ops->metadata_update_start(mddev);
>>   		/* Has someone else has updated the sb */
>>   		if (!does_sb_need_changing(mddev)) {
>>   			if (ret == 0)
>>   				md_cluster_ops->metadata_update_cancel(mddev);
>> -			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +			spin_lock(&mddev->lock);
>> +			if (!test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
>> +			    !test_bit(MD_CHANGE_CLEAN, &mddev->flags))
>> +				clear_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +			spin_unlock(&mddev->lock);
>>   			return;
>>   		}
>>   	}
>> @@ -2434,7 +2440,8 @@ repeat:
>>   
>>   	spin_lock(&mddev->lock);
>>   	if (mddev->in_sync != sync_req ||
>> -	    test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
>> +	    test_bit(MD_CHANGE_DEVS, &mddev->flags) ||
>> +	    test_bit(MD_CHANGE_CLEAN, &mddev->flags)) {
>>   		/* have to write it out again */
>>   		spin_unlock(&mddev->lock);
>>   		goto repeat;
>> @@ -8145,18 +8152,20 @@ void md_do_sync(struct md_thread *thread)
>>   		}
>>   	}
>>    skip:
>> -	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>> -
>>   	if (mddev_is_clustered(mddev) &&
>>   	    ret == 0) {
>>   		/* set CHANGE_PENDING here since maybe another
>>   		 * update is needed, so other nodes are informed */
>> +		spin_lock(&mddev->lock);
>> +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   		set_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +		spin_unlock(&mddev->lock);
>>   		md_wakeup_thread(mddev->thread);
>>   		wait_event(mddev->sb_wait,
>>   			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>>   		md_cluster_ops->resync_finish(mddev);
>> -	}
>> +	} else
>> +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   
>>   	spin_lock(&mddev->lock);
>>   	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
>> @@ -8548,6 +8557,7 @@ EXPORT_SYMBOL(md_finish_reshape);
>>   int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
>>   		       int is_new)
>>   {
>> +	struct mddev *mddev = rdev->mddev;
>>   	int rv;
>>   	if (is_new)
>>   		s += rdev->new_data_offset;
>> @@ -8557,8 +8567,10 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
>>   	if (rv == 0) {
>>   		/* Make sure they get written out promptly */
>>   		sysfs_notify_dirent_safe(rdev->sysfs_state);
>> +		spin_lock(&mddev->lock);
>>   		set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
>>   		set_bit(MD_CHANGE_PENDING, &rdev->mddev->flags);
>> +		spin_unlock(&mddev->lock);
>>   		md_wakeup_thread(rdev->mddev->thread);
>>   		return 1;
>>   	} else
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index a7f2b9c..985fa07 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1474,8 +1474,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   	 * if recovery is running, make sure it aborts.
>>   	 */
>>   	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>> +	spin_lock(&mddev->lock);
>>   	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   	set_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +	spin_unlock(&mddev->lock);
>>   	printk(KERN_ALERT
>>   	       "md/raid1:%s: Disk failure on %s, disabling device.\n"
>>   	       "md/raid1:%s: Operation continuing on %d devices.\n",
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index e3fd725..98a4cf1 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1102,8 +1102,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
>>   		bio->bi_iter.bi_sector < conf->reshape_progress))) {
>>   		/* Need to update reshape_position in metadata */
>>   		mddev->reshape_position = conf->reshape_progress;
>> +		spin_lock(&mddev->lock);
>>   		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   		set_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +		spin_unlock(&mddev->lock);
>>   		md_wakeup_thread(mddev->thread);
>>   		wait_event(mddev->sb_wait,
>>   			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>> @@ -1585,15 +1587,17 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>   	}
>>   	if (test_and_clear_bit(In_sync, &rdev->flags))
>>   		mddev->degraded++;
>> +	spin_unlock_irqrestore(&conf->device_lock, flags);
>>   	/*
>>   	 * If recovery is running, make sure it aborts.
>>   	 */
>>   	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>>   	set_bit(Blocked, &rdev->flags);
>>   	set_bit(Faulty, &rdev->flags);
>> +	spin_lock(&mddev->lock);
>>   	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   	set_bit(MD_CHANGE_PENDING, &mddev->flags);
>> -	spin_unlock_irqrestore(&conf->device_lock, flags);
>> +	spin_unlock(&mddev->lock);
>>   	printk(KERN_ALERT
>>   	       "md/raid10:%s: Disk failure on %s, disabling device.\n"
>>   	       "md/raid10:%s: Operation continuing on %d devices.\n",
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 9531f5f..2ba9366 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -712,8 +712,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>>   	 * in_teardown check workaround this issue.
>>   	 */
>>   	if (!log->in_teardown) {
>> +		spin_lock(&mddev->lock);
>>   		set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   		set_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +		spin_unlock(&mddev->lock);
>>   		md_wakeup_thread(mddev->thread);
>>   		wait_event(mddev->sb_wait,
>>   			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 8ab8b65..d4a1e37 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -2514,8 +2514,10 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>>   
>>   	set_bit(Blocked, &rdev->flags);
>>   	set_bit(Faulty, &rdev->flags);
>> +	spin_lock(&mddev->lock);
>>   	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>>   	set_bit(MD_CHANGE_PENDING, &mddev->flags);
>> +	spin_unlock(&mddev->lock);
>>   	printk(KERN_ALERT
>>   	       "md/raid:%s: Disk failure on %s, disabling device.\n"
>>   	       "md/raid:%s: Operation continuing on %d devices.\n",
> The .error_handler is called by md_error, which is called by .bio_endio, which
> runs in softirq. So the lock should be held with irq disabled.

Thanks, I will change them to spin_lock_irqsave/restore.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/13] md-cluster: always setup in-memory bitmap
  2016-04-25 17:45   ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Shaohua Li
@ 2016-04-26  3:22     ` Guoqing Jiang
  2016-04-27 15:24       ` Shaohua Li
  0 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-26  3:22 UTC (permalink / raw)
  To: Shaohua Li; +Cc: neilb, linux-raid



On 04/26/2016 01:45 AM, Shaohua Li wrote:
> On Thu, Apr 21, 2016 at 01:58:10PM +0800, Guoqing Jiang wrote:
>> The in-memory bitmap for raid is allocated on demand,
>> so in the cluster scenario it is possible that the
>> slave node which received the RESYNCING message doesn't
>> have the in-memory bitmap while the master node is
>> performing a resync, and we can't keep the bitmaps
>> consistent across nodes.
>>
>> So for the cluster scenario we must always preserve
>> the bitmap and ensure its pages will not be freed. A
>> no_hijack flag is introduced to both bitmap_checkpage
>> and bitmap_get_counter, which makes cluster raid return
>> failure once an allocation fails.
>>
>> The next patch relies on this change, since it keeps
>> the bitmaps in sync among the nodes during the resync
>> stage.
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>   drivers/md/bitmap.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 46 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
>> index 7df6b4f..00cf1c1 100644
>> --- a/drivers/md/bitmap.c
>> +++ b/drivers/md/bitmap.c
>> @@ -46,7 +46,7 @@ static inline char *bmname(struct bitmap *bitmap)
>>    * allocated while we're using it
>>    */
>>   static int bitmap_checkpage(struct bitmap_counts *bitmap,
>> -			    unsigned long page, int create)
>> +			    unsigned long page, int create, int no_hijack)
>>   __releases(bitmap->lock)
>>   __acquires(bitmap->lock)
>>   {
>> @@ -90,6 +90,9 @@ __acquires(bitmap->lock)
>>   
>>   	if (mappage == NULL) {
>>   		pr_debug("md/bitmap: map page allocation failed, hijacking\n");
>> +		/* We don't support hijack for cluster raid */
>> +		if (no_hijack)
>> +			return -ENOMEM;
>>   		/* failed - set the hijacked flag so that we can use the
>>   		 * pointer as a counter */
>>   		if (!bitmap->bp[page].map)
>> @@ -1177,7 +1180,7 @@ static void bitmap_set_pending(struct bitmap_counts *bitmap, sector_t offset)
>>   
>>   static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
>>   					    sector_t offset, sector_t *blocks,
>> -					    int create);
>> +					    int create, int no_hijack);
>>   
>>   /*
>>    * bitmap daemon -- periodically wakes up to clean bits and flush pages
>> @@ -1257,7 +1260,7 @@ void bitmap_daemon_work(struct mddev *mddev)
>>   		}
>>   		bmc = bitmap_get_counter(counts,
>>   					 block,
>> -					 &blocks, 0);
>> +					 &blocks, 0, 0);
>>   
>>   		if (!bmc) {
>>   			j |= PAGE_COUNTER_MASK;
>> @@ -1307,7 +1310,7 @@ void bitmap_daemon_work(struct mddev *mddev)
>>   
>>   static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
>>   					    sector_t offset, sector_t *blocks,
>> -					    int create)
>> +					    int create, int no_hijack)
>>   __releases(bitmap->lock)
>>   __acquires(bitmap->lock)
>>   {
>> @@ -1321,7 +1324,7 @@ __acquires(bitmap->lock)
>>   	sector_t csize;
>>   	int err;
>>   
>> -	err = bitmap_checkpage(bitmap, page, create);
>> +	err = bitmap_checkpage(bitmap, page, create, 0);
>>   
>>   	if (bitmap->bp[page].hijacked ||
>>   	    bitmap->bp[page].map == NULL)
> bitmap_get_counter doesn't use the new no_hijack parameter, and you always pass
> 0 to this function, so it looks like this change isn't required.
>

The part of this patch below passes 1 to bitmap_checkpage, so it is needed.

+	/* For cluster raid, need to pre-allocate bitmap */
+	if (mddev_is_clustered(bitmap->mddev)) {
+		unsigned long page;
+		for (page = 0; page < pages; page++) {
+			ret = bitmap_checkpage(&bitmap->counts, page, 1, 1);
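
The quoted hunk is cut short; filled out, the pre-allocation loop has
roughly this shape (an illustrative sketch, error handling elided):

	/* For cluster raid, need to pre-allocate bitmap */
	if (mddev_is_clustered(bitmap->mddev)) {
		unsigned long page;

		for (page = 0; page < pages; page++) {
			/* create=1: allocate the page now;
			 * no_hijack=1: fail with -ENOMEM instead of
			 * hijacking the pointer as a counter */
			ret = bitmap_checkpage(&bitmap->counts, page, 1, 1);
			if (ret)
				break;	/* caller unwinds on failure */
		}
	}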

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH V3 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-21  6:58   ` [PATCH V2 " Guoqing Jiang
  2016-04-25 17:32     ` Shaohua Li
@ 2016-04-27  1:56     ` Guoqing Jiang
  2016-04-27 15:27       ` Shaohua Li
  1 sibling, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-27  1:56 UTC (permalink / raw)
  To: shli; +Cc: neilb, linux-raid, Guoqing Jiang

Some code waits for a metadata update by:

1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
2. setting MD_CHANGE_PENDING and waking the management thread
3. waiting for MD_CHANGE_PENDING to be cleared

If the first two are done without locking, the code in md_update_sb()
which checks if it needs to repeat might test if an update is needed
before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
in the wait returning early.

So make sure all places that set MD_CHANGE_PENDING are protected by
mddev->lock.

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
V3 changes:
1. use spin_lock_irqsave/spin_unlock_irqrestore in error funcs and
   raid10's __make_request

V2 Changes:
1. s/write_lock/lock which is reported by auto build

 drivers/md/md.c          | 22 +++++++++++++++++-----
 drivers/md/raid1.c       |  2 ++
 drivers/md/raid10.c      |  6 +++++-
 drivers/md/raid5-cache.c |  2 ++
 drivers/md/raid5.c       |  2 ++
 5 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 06f6e81..943c953 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2295,12 +2295,18 @@ repeat:
 	if (mddev_is_clustered(mddev)) {
 		if (test_and_clear_bit(MD_CHANGE_DEVS, &mddev->flags))
 			force_change = 1;
+		if (test_and_clear_bit(MD_CHANGE_CLEAN, &mddev->flags))
+			nospares = 1;
 		ret = md_cluster_ops->metadata_update_start(mddev);
 		/* Has someone else has updated the sb */
 		if (!does_sb_need_changing(mddev)) {
 			if (ret == 0)
 				md_cluster_ops->metadata_update_cancel(mddev);
-			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			spin_lock(&mddev->lock);
+			if (!test_bit(MD_CHANGE_DEVS, &mddev->flags) &&
+			    !test_bit(MD_CHANGE_CLEAN, &mddev->flags))
+				clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+			spin_unlock(&mddev->lock);
 			return;
 		}
 	}
@@ -2436,7 +2442,8 @@ repeat:
 
 	spin_lock(&mddev->lock);
 	if (mddev->in_sync != sync_req ||
-	    test_bit(MD_CHANGE_DEVS, &mddev->flags)) {
+	    test_bit(MD_CHANGE_DEVS, &mddev->flags) ||
+	    test_bit(MD_CHANGE_CLEAN, &mddev->flags)) {
 		/* have to write it out again */
 		spin_unlock(&mddev->lock);
 		goto repeat;
@@ -8147,18 +8154,20 @@ void md_do_sync(struct md_thread *thread)
 		}
 	}
  skip:
-	set_bit(MD_CHANGE_DEVS, &mddev->flags);
-
 	if (mddev_is_clustered(mddev) &&
 	    ret == 0) {
 		/* set CHANGE_PENDING here since maybe another
 		 * update is needed, so other nodes are informed */
+		spin_lock(&mddev->lock);
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 		md_cluster_ops->resync_finish(mddev);
-	}
+	} else
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
 	spin_lock(&mddev->lock);
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -8550,6 +8559,7 @@ EXPORT_SYMBOL(md_finish_reshape);
 int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 		       int is_new)
 {
+	struct mddev *mddev = rdev->mddev;
 	int rv;
 	if (is_new)
 		s += rdev->new_data_offset;
@@ -8559,8 +8569,10 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 	if (rv == 0) {
 		/* Make sure they get written out promptly */
 		sysfs_notify_dirent_safe(rdev->sysfs_state);
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &rdev->mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(rdev->mddev->thread);
 		return 1;
 	} else
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a7f2b9c..cffd224 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1474,8 +1474,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 	 * if recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	spin_lock_irqsave(&mddev->lock, flags);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
+	spin_unlock_irqrestore(&mddev->lock, flags);
 	printk(KERN_ALERT
 	       "md/raid1:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid1:%s: Operation continuing on %d devices.\n",
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e3fd725..cd615fe 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1102,8 +1102,10 @@ static void __make_request(struct mddev *mddev, struct bio *bio)
 		bio->bi_iter.bi_sector < conf->reshape_progress))) {
 		/* Need to update reshape_position in metadata */
 		mddev->reshape_position = conf->reshape_progress;
+		spin_lock_irqsave(&mddev->lock, flags);
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock_irqrestore(&mddev->lock, flags);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
@@ -1585,15 +1587,17 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	/*
 	 * If recovery is running, make sure it aborts.
 	 */
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
+	spin_lock_irqsave(&mddev->lock, flags);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	spin_unlock_irqrestore(&mddev->lock, flags);
 	printk(KERN_ALERT
 	       "md/raid10:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid10:%s: Operation continuing on %d devices.\n",
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9531f5f..2ba9366 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -712,8 +712,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	 * in_teardown check workaround this issue.
 	 */
 	if (!log->in_teardown) {
+		spin_lock(&mddev->lock);
 		set_bit(MD_CHANGE_DEVS, &mddev->flags);
 		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		spin_unlock(&mddev->lock);
 		md_wakeup_thread(mddev->thread);
 		wait_event(mddev->sb_wait,
 			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8ab8b65..fffa53f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2514,8 +2514,10 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	set_bit(Blocked, &rdev->flags);
 	set_bit(Faulty, &rdev->flags);
+	spin_lock_irqsave(&mddev->lock, flags);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	set_bit(MD_CHANGE_PENDING, &mddev->flags);
+	spin_unlock_irqrestore(&mddev->lock, flags);
 	printk(KERN_ALERT
 	       "md/raid:%s: Disk failure on %s, disabling device.\n"
 	       "md/raid:%s: Operation continuing on %d devices.\n",
-- 
2.6.6


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/13] md-cluster: always setup in-memory bitmap
  2016-04-26  3:22     ` Guoqing Jiang
@ 2016-04-27 15:24       ` Shaohua Li
  2016-04-28  2:59         ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2016-04-27 15:24 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: neilb, linux-raid

On Tue, Apr 26, 2016 at 11:22:37AM +0800, Guoqing Jiang wrote:
> 
> 
> On 04/26/2016 01:45 AM, Shaohua Li wrote:
> >On Thu, Apr 21, 2016 at 01:58:10PM +0800, Guoqing Jiang wrote:
> >>The in-memory bitmap for raid is allocated on demand,
> >>so in the cluster scenario it is possible that the
> >>slave node which received the RESYNCING message doesn't
> >>have the in-memory bitmap while the master node is
> >>performing a resync, and we can't keep the bitmaps
> >>consistent across nodes.
> >>
> >>So for the cluster scenario we must always preserve
> >>the bitmap and ensure its pages will not be freed. A
> >>no_hijack flag is introduced to both bitmap_checkpage
> >>and bitmap_get_counter, which makes cluster raid return
> >>failure once an allocation fails.
> >>
> >>The next patch relies on this change, since it keeps
> >>the bitmaps in sync among the nodes during the resync
> >>stage.
> >>
> >>Reviewed-by: NeilBrown <neilb@suse.com>
> >>Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >>---
> >>  drivers/md/bitmap.c | 59 +++++++++++++++++++++++++++++++++++++++++------------
> >>  1 file changed, 46 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> >>index 7df6b4f..00cf1c1 100644
> >>--- a/drivers/md/bitmap.c
> >>+++ b/drivers/md/bitmap.c
> >>@@ -46,7 +46,7 @@ static inline char *bmname(struct bitmap *bitmap)
> >>   * allocated while we're using it
> >>   */
> >>  static int bitmap_checkpage(struct bitmap_counts *bitmap,
> >>-			    unsigned long page, int create)
> >>+			    unsigned long page, int create, int no_hijack)
> >>  __releases(bitmap->lock)
> >>  __acquires(bitmap->lock)
> >>  {
> >>@@ -90,6 +90,9 @@ __acquires(bitmap->lock)
> >>  	if (mappage == NULL) {
> >>  		pr_debug("md/bitmap: map page allocation failed, hijacking\n");
> >>+		/* We don't support hijack for cluster raid */
> >>+		if (no_hijack)
> >>+			return -ENOMEM;
> >>  		/* failed - set the hijacked flag so that we can use the
> >>  		 * pointer as a counter */
> >>  		if (!bitmap->bp[page].map)
> >>@@ -1177,7 +1180,7 @@ static void bitmap_set_pending(struct bitmap_counts *bitmap, sector_t offset)
> >>  static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
> >>  					    sector_t offset, sector_t *blocks,
> >>-					    int create);
> >>+					    int create, int no_hijack);
> >>  /*
> >>   * bitmap daemon -- periodically wakes up to clean bits and flush pages
> >>@@ -1257,7 +1260,7 @@ void bitmap_daemon_work(struct mddev *mddev)
> >>  		}
> >>  		bmc = bitmap_get_counter(counts,
> >>  					 block,
> >>-					 &blocks, 0);
> >>+					 &blocks, 0, 0);
> >>  		if (!bmc) {
> >>  			j |= PAGE_COUNTER_MASK;
> >>@@ -1307,7 +1310,7 @@ void bitmap_daemon_work(struct mddev *mddev)
> >>  static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
> >>  					    sector_t offset, sector_t *blocks,
> >>-					    int create)
> >>+					    int create, int no_hijack)
> >>  __releases(bitmap->lock)
> >>  __acquires(bitmap->lock)
> >>  {
> >>@@ -1321,7 +1324,7 @@ __acquires(bitmap->lock)
> >>  	sector_t csize;
> >>  	int err;
> >>-	err = bitmap_checkpage(bitmap, page, create);
> >>+	err = bitmap_checkpage(bitmap, page, create, 0);
> >>  	if (bitmap->bp[page].hijacked ||
> >>  	    bitmap->bp[page].map == NULL)
> >bitmap_get_counter doesn't use the new no_hijack parameter, and you always pass
> >0 to this function, so it looks like this change isn't required.
> >
> 
> The part of this patch below passes 1 to bitmap_checkpage, so it is needed.
> 
> +	/* For cluster raid, need to pre-allocate bitmap */
> +	if (mddev_is_clustered(bitmap->mddev)) {
> +		unsigned long page;
> +		for (page = 0; page < pages; page++) {
> +			ret = bitmap_checkpage(&bitmap->counts, page, 1, 1);

I mean bitmap_get_counter(). You add the no_hijack parameter, but don't use it.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V3 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-27  1:56     ` [PATCH V3 " Guoqing Jiang
@ 2016-04-27 15:27       ` Shaohua Li
  2016-04-28  2:55         ` Guoqing Jiang
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2016-04-27 15:27 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: neilb, linux-raid

On Tue, Apr 26, 2016 at 09:56:26PM -0400, Guoqing Jiang wrote:
> Some code waits for a metadata update by:
> 
> 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
> 2. setting MD_CHANGE_PENDING and waking the management thread
> 3. waiting for MD_CHANGE_PENDING to be cleared
> 
> If the first two are done without locking, the code in md_update_sb()
> which checks if it needs to repeat might test if an update is needed
> before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
> in the wait returning early.
> 
> So make sure all places that set MD_CHANGE_PENDING are protected by
> mddev->lock.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
> V3 changes:
> 1. use spin_lock_irqsave/spin_unlock_irqrestore in error funcs and
>    raid10's __make_request

Shouldn't other places use spin_lock_irq/spin_unlock_irq? An interrupt can occur
after you do spin_lock(), and if it's md_error, we deadlock.
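
A sketch of that deadlock on a single CPU (illustrative):

	process context                    softirq on the same CPU
	---------------                    ------------------------
	spin_lock(&mddev->lock);
	   <interrupt arrives>
	                                   bio_endio() -> md_error()
	                                     -> pers->error_handler()
	                                       spin_lock(&mddev->lock);
	                                       /* spins forever: the lock
	                                          owner can never run to
	                                          unlock it */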

please resend the whole series.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V3 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-27 15:27       ` Shaohua Li
@ 2016-04-28  2:55         ` Guoqing Jiang
  2016-04-28  3:58           ` Shaohua Li
  0 siblings, 1 reply; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-28  2:55 UTC (permalink / raw)
  To: Shaohua Li; +Cc: neilb, linux-raid



On 04/27/2016 11:27 AM, Shaohua Li wrote:
> On Tue, Apr 26, 2016 at 09:56:26PM -0400, Guoqing Jiang wrote:
>> Some code waits for a metadata update by:
>>
>> 1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
>> 2. setting MD_CHANGE_PENDING and waking the management thread
>> 3. waiting for MD_CHANGE_PENDING to be cleared
>>
>> If the first two are done without locking, the code in md_update_sb()
>> which checks if it needs to repeat might test if an update is needed
>> before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
>> in the wait returning early.
>>
>> So make sure all places that set MD_CHANGE_PENDING are protected by
>> mddev->lock.
>>
>> Reviewed-by: NeilBrown <neilb@suse.com>
>> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> ---
>> V3 changes:
>> 1. use spin_lock_irqsave/spin_unlock_irqrestore in error funcs and
>>     raid10's __make_request
> Shouldn't other places use spin_lock_irq/spin_unlock_irq? An interrupt can occur
> after you do spin_lock(), and if it's md_error, we deadlock.

It could be possible in theory if the func was interrupted by md_error
after it called spin_lock, but it seems lots of places in md.c also use
spin_lock/unlock for mddev->lock; take md_do_sync and md_update_sb as
examples, both of them used spin_lock(&mddev->lock) and
spin_unlock(&mddev->lock) before.

So I guess it will not cause trouble; otherwise we need to change all
the usages of spin_lock/unlock(&mddev->lock), or introduce a new lock
for this scenario. I am not sure which one is more acceptable.

> please resend the whole series.

No problem, I will re-post them (except this one for now).

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/13] md-cluster: always setup in-memory bitmap
  2016-04-27 15:24       ` Shaohua Li
@ 2016-04-28  2:59         ` Guoqing Jiang
  0 siblings, 0 replies; 33+ messages in thread
From: Guoqing Jiang @ 2016-04-28  2:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: neilb, linux-raid




On 04/27/2016 11:24 AM, Shaohua Li wrote:
>>   static bitmap_counter_t *bitmap_get_counter(struct bitmap_counts *bitmap,
>>   					    sector_t offset, sector_t *blocks,
>> -					    int create)
>> +					    int create, int no_hijack)
>>   __releases(bitmap->lock)
>>   __acquires(bitmap->lock)
>>   {
>> @@ -1321,7 +1324,7 @@ __acquires(bitmap->lock)
>>   	sector_t csize;
>>   	int err;
>> -	err = bitmap_checkpage(bitmap, page, create);
>> +	err = bitmap_checkpage(bitmap, page, create, 0);
>>   	if (bitmap->bp[page].hijacked ||
>>   	    bitmap->bp[page].map == NULL)
>>> bitmap_get_counter doesn't use the new no_hijack parameter, and you always pass
>>> 0 to this function, so it looks like this change isn't required.
>>>
>> The below part of this patch pass 1 to bitmap_checkpage, so it is needed.
>>
>> +	/* For cluster raid, need to pre-allocate bitmap */
>> +	if (mddev_is_clustered(bitmap->mddev)) {
>> +		unsigned long page;
>> +		for (page = 0; page < pages; page++) {
>> +			ret = bitmap_checkpage(&bitmap->counts, page, 1, 1);
> I mean bitmap_get_counter(). You add the no_hijack parameter, but don't use it.

Yes, I misunderstood it, thanks for pointing it out; I will remove it.

Regards,
Guoqing

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V3 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-28  2:55         ` Guoqing Jiang
@ 2016-04-28  3:58           ` Shaohua Li
  2016-04-29  1:26             ` NeilBrown
  0 siblings, 1 reply; 33+ messages in thread
From: Shaohua Li @ 2016-04-28  3:58 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: neilb, linux-raid

On Wed, Apr 27, 2016 at 10:55:43PM -0400, Guoqing Jiang wrote:
> 
> 
> On 04/27/2016 11:27 AM, Shaohua Li wrote:
> >On Tue, Apr 26, 2016 at 09:56:26PM -0400, Guoqing Jiang wrote:
> >>Some code waits for a metadata update by:
> >>
> >>1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
> >>2. setting MD_CHANGE_PENDING and waking the management thread
> >>3. waiting for MD_CHANGE_PENDING to be cleared
> >>
> >>If the first two are done without locking, the code in md_update_sb()
> >>which checks if it needs to repeat might test if an update is needed
> >>before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
> >>in the wait returning early.
> >>
> >>So make sure all places that set MD_CHANGE_PENDING are protected by
> >>mddev->lock.
> >>
> >>Reviewed-by: NeilBrown <neilb@suse.com>
> >>Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >>---
> >>V3 changes:
> >>1. use spin_lock_irqsave/spin_unlock_irqrestore in error funcs and
> >>    raid10's __make_request
> >Shouldn't other places use spin_lock_irq/spin_unlock_irq? An interrupt can occur
> >after you do spin_lock(), and if it's md_error, we deadlock.
> 
> It could be possible in theory if the func was interrupted by md_error
> after it called spin_lock, but it seems lots of places in md.c also use
> spin_lock/unlock for mddev->lock; take md_do_sync and md_update_sb as
> examples, both of them used spin_lock(&mddev->lock) and
> spin_unlock(&mddev->lock) before.
> 
> So I guess it will not cause trouble; otherwise we need to change all
> the usages of spin_lock/unlock(&mddev->lock), or introduce a new lock
> for this scenario. I am not sure which one is more acceptable.

It doesn't cause trouble, because no interrupt/bh used the lock before. But now
we use it in softirq; that's the difference. Please enable lockdep, I think it
will complain. Either we change all the locking to irq-save or introduce a
new lock; either is ok.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V3 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-28  3:58           ` Shaohua Li
@ 2016-04-29  1:26             ` NeilBrown
  2016-04-29 18:21               ` Shaohua Li
  0 siblings, 1 reply; 33+ messages in thread
From: NeilBrown @ 2016-04-29  1:26 UTC (permalink / raw)
  To: Shaohua Li, Guoqing Jiang; +Cc: linux-raid


On Thu, Apr 28 2016, Shaohua Li wrote:

> On Wed, Apr 27, 2016 at 10:55:43PM -0400, Guoqing Jiang wrote:
>> 
>> 
>> On 04/27/2016 11:27 AM, Shaohua Li wrote:
>> >On Tue, Apr 26, 2016 at 09:56:26PM -0400, Guoqing Jiang wrote:
>> >>Some code waits for a metadata update by:
>> >>
>> >>1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
>> >>2. setting MD_CHANGE_PENDING and waking the management thread
>> >>3. waiting for MD_CHANGE_PENDING to be cleared
>> >>
>> >>If the first two are done without locking, the code in md_update_sb()
>> >>which checks if it needs to repeat might test if an update is needed
>> >>before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
>> >>in the wait returning early.
>> >>
>> >>So make sure all places that set MD_CHANGE_PENDING are protected by
>> >>mddev->lock.
>> >>
>> >>Reviewed-by: NeilBrown <neilb@suse.com>
>> >>Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
>> >>---
>> >>V3 changes:
>> >>1. use spin_lock_irqsave/spin_unlock_irqrestore in error funcs and
>> >>    raid10's __make_request
>> >Shouldn't other places use spin_lock_irq/spin_unlock_irq? An interrupt can occur
>> >after you do spin_lock(), and if it's md_error, we deadlock.
>> 
>> It could be possible in theory if the func was interrupted by md_error
>> after it called spin_lock, but it seems lots of places in md.c also use
>> spin_lock/unlock for mddev->lock; take md_do_sync and md_update_sb as
>> examples, both of them used spin_lock(&mddev->lock) and
>> spin_unlock(&mddev->lock) before.
>> 
>> So I guess it will not cause trouble; otherwise we need to change all
>> the usages of spin_lock/unlock(&mddev->lock), or introduce a new lock
>> for this scenario. I am not sure which one is more acceptable.
>
> It doesn't cause trouble, because no interrupt/bh used the lock before. But now
> we use it in softirq; that's the difference. Please enable lockdep, I think it
> will complain. Either we change all the locking to irq-save or introduce a
> new lock; either is ok.

Thanks for catching this!

As you say, the straightforward solution is to change all the current
  spin_lock(&mddev->lock);
to
  spin_lock_irq(&mddev->lock);

and similarly for unlock.  In places that can be called from interrupt
context we have to use spin_lock_irqsave() instead.
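
For concreteness, the two variants would look like this (only a sketch
of the calling conventions, not a patch):

	/* process context where interrupts are known to be enabled */
	spin_lock_irq(&mddev->lock);
	set_bit(MD_CHANGE_PENDING, &mddev->flags);
	spin_unlock_irq(&mddev->lock);

	/*
	 * Paths that may run in interrupt/softirq context, or with
	 * interrupts already disabled, must save and restore irq state.
	 */
	unsigned long flags;

	spin_lock_irqsave(&mddev->lock, flags);
	set_bit(MD_CHANGE_PENDING, &mddev->flags);
	spin_unlock_irqrestore(&mddev->lock, flags);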

There is another option that occurs to me.  Not sure if it is elegant or
ugly, so I'm keen to see what you think.

In the places where we set MD_CHANGE_PENDING and one other bit -
either MD_CHANGE_DEVS or MD_CHANGE_CLEAN - we could use set_mask_bits()
to set them both atomically.

  set_mask_bits(&mddev->flags, BIT(MD_CHANGE_PENDING) | BIT(MD_CHANGE_DEVS));

Then in md_update_sb, when deciding whether to loop back to "repeat:" we
use a new "bit_clear_unless".

#define bit_clear_unless(ptr, _clear, _test)			\
({								\
	const typeof(*ptr) clear = (_clear), test = (_test);	\
	typeof(*ptr) old, new;					\
								\
	do {							\
		old = ACCESS_ONCE(*ptr);			\
		new = old & ~clear;				\
	} while (!(old & test) && cmpxchg(ptr, old, new) != old);\
								\
	!(old & test);							\
})

The code in md_update_sb() would be

 if (mddev->in_sync != sync_req ||
     !bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
                       BIT(MD_CHANGE_DEVS)|BIT(MD_CHANGE_CLEAN)))
        goto repeat;


So if either DEVS or CLEAN was set, PENDING would not be cleared and the
code would goto repeat.
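
A quick userspace check of those semantics (mapping ACCESS_ONCE and
cmpxchg onto GCC builtins purely for illustration; BIT(0)/BIT(1)/BIT(2)
stand in for PENDING/DEVS/CLEAN, the real flag values in md.h differ):

	#include <stdio.h>

	#define BIT(n)		(1UL << (n))
	#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))
	#define cmpxchg(p, o, n) __sync_val_compare_and_swap(p, o, n)

	#define bit_clear_unless(ptr, _clear, _test)			\
	({								\
		const typeof(*ptr) clear = (_clear), test = (_test);	\
		typeof(*ptr) old, new;					\
									\
		do {							\
			old = ACCESS_ONCE(*ptr);			\
			new = old & ~clear;				\
		} while (!(old & test) && cmpxchg(ptr, old, new) != old);\
									\
		!(old & test);						\
	})

	int main(void)
	{
		unsigned long flags;

		/* only PENDING set: it is cleared, no repeat needed */
		flags = BIT(0);
		printf("cleared=%d flags=%lx\n",
		       bit_clear_unless(&flags, BIT(0), BIT(1) | BIT(2)),
		       flags);	/* prints: cleared=1 flags=0 */

		/* DEVS raced in: PENDING is left set, caller repeats */
		flags = BIT(0) | BIT(1);
		printf("cleared=%d flags=%lx\n",
		       bit_clear_unless(&flags, BIT(0), BIT(1) | BIT(2)),
		       flags);	/* prints: cleared=0 flags=3 */
		return 0;
	}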

What do you think?

NeilBrown


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH V3 08/13] md: set MD_CHANGE_PENDING in a spinlocked region
  2016-04-29  1:26             ` NeilBrown
@ 2016-04-29 18:21               ` Shaohua Li
  0 siblings, 0 replies; 33+ messages in thread
From: Shaohua Li @ 2016-04-29 18:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: Guoqing Jiang, linux-raid

On Fri, Apr 29, 2016 at 11:26:32AM +1000, NeilBrown wrote:
> On Thu, Apr 28 2016, Shaohua Li wrote:
> 
> > On Wed, Apr 27, 2016 at 10:55:43PM -0400, Guoqing Jiang wrote:
> >> 
> >> 
> >> On 04/27/2016 11:27 AM, Shaohua Li wrote:
> >> >On Tue, Apr 26, 2016 at 09:56:26PM -0400, Guoqing Jiang wrote:
> >> >>Some code waits for a metadata update by:
> >> >>
> >> >>1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
> >> >>2. setting MD_CHANGE_PENDING and waking the management thread
> >> >>3. waiting for MD_CHANGE_PENDING to be cleared
> >> >>
> >> >>If the first two are done without locking, the code in md_update_sb()
> >> >>which checks if it needs to repeat might test if an update is needed
> >> >>before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
> >> >>in the wait returning early.
> >> >>
> >> >>So make sure all places that set MD_CHANGE_PENDING are protected by
> >> >>mddev->lock.
> >> >>
> >> >>Reviewed-by: NeilBrown <neilb@suse.com>
> >> >>Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> >> >>---
> >> >>V3 changes:
> >> >>1. use spin_lock_irqsave/spin_unlock_irqrestore in error funcs and
> >> >>    raid10's __make_request
> >> >shouldn't other places use spin_lock_irq/spin_unlock_irq? An interrupt can
> >> >occur after you do spin_lock(), and if it's md_error, we deadlock.
> >> 
> >> It could possibly happen in theory, if a function was interrupted by
> >> md_error after it called spin_lock, but it seems lots of places in md.c
> >> also use plain spin_lock/unlock for mddev->lock; take md_do_sync and
> >> md_update_sb as examples, both of them already use
> >> spin_lock(&mddev->lock) and spin_unlock(&mddev->lock).
> >> 
> >> So I guess it will not cause trouble; otherwise we would need to change
> >> all the usages of spin_lock/unlock(&mddev->lock), or introduce a new
> >> lock for this scenario. I am not sure which one is more acceptable.
> >
> > It didn't cause trouble before, because no interrupt/bh path used the lock.
> > But now we use it in softirq context; that's the difference. Please enable
> > lockdep, I think it will complain. Either we change all the locking to
> > irqsave or introduce a new lock; either is ok.
> 
> Thanks for catching this!
> 
> As you say, the straightforward solution is to change all the current
>   spin_lock(&mddev->lock);
> to
>   spin_lock_irq(&mddev->lock);
> 
> and similarly for unlock.  In places that can be called from interrupt
> context we have to use spin_lock_irqsave() instead.
> 
> There is another option that occurs to me.  Not sure if it is elegant or
> ugly, so I'm keen to see what you think.
> 
> In the places where we set MD_CHANGE_PENDING and one other bit -
> either MD_CHANGE_DEVS or MD_CHANGE_CLEAN - we could use set_mask_bits()
> to set them both atomically.
> 
>   set_mask_bits(&mddev->flags, BIT(MD_CHANGE_PENDING) | BIT(MD_CHANGE_DEVS));
> 
> Then in md_update_sb, when deciding whether to loop back to "repeat:" we
> use a new "bit_clear_unless".
> 
> #define bit_clear_unless(ptr, _clear, _test)			\
> ({								\
> 	const typeof(*ptr) clear = (_clear), test = (_test);	\
> 	typeof(*ptr) old, new;					\
> 								\
> 	do {							\
> 		old = ACCESS_ONCE(*ptr);			\
> 		new = old & ~clear;				\
> 	} while (!(old & test) && cmpxchg(ptr, old, new) != old);\
> 								\
> 	!(old & test);							\
> })
> 
> The code in md_update_sb() would be
> 
>  if (mddev->in_sync != sync_req ||
>      !bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
>                        BIT(MD_CHANGE_DEVS)|BIT(MD_CHANGE_CLEAN)))
>         goto repeat;
> 
> 
> So if either DEVS or CLEAN was set, PENDING would not be cleared and the
> code would goto repeat.
> 
> What do you think?

Looks great, I like this idea. Thanks!

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread

Thread overview: 33+ messages
2016-04-21  5:58 [PATCH 00/13] The latest patches for md-cluster Guoqing Jiang
2016-04-21  5:58 ` [PATCH 01/13] md-cluster: change resync lock from asynchronous to synchronous Guoqing Jiang
2016-04-21  6:20   ` kbuild test robot
2016-04-21  6:55   ` [PATCH V2 " Guoqing Jiang
2016-04-21  5:58 ` [PATCH 02/13] md-cluser: make resync_finish only called after pers->sync_request Guoqing Jiang
2016-04-21  5:58 ` [PATCH 03/13] md-cluster: wake up thread to continue recovery Guoqing Jiang
2016-04-21  5:58 ` [PATCH 04/13] md-cluster: unregister thread if err happened Guoqing Jiang
2016-04-21  5:58 ` [PATCH 05/13] md-cluster: fix locking when node joins cluster during message broadcast Guoqing Jiang
2016-04-21  5:58 ` [PATCH 06/13] md-cluster: change array_sectors and update size are not supported Guoqing Jiang
2016-04-21  5:58 ` [PATCH 07/13] md-cluster: wakeup thread if activated a spare disk Guoqing Jiang
2016-04-21  5:58 ` [PATCH 08/13] md: set MD_CHANGE_PENDING in a spinlocked region Guoqing Jiang
2016-04-21  6:26   ` kbuild test robot
2016-04-21  6:58   ` [PATCH V2 " Guoqing Jiang
2016-04-25 17:32     ` Shaohua Li
2016-04-26  3:19       ` Guoqing Jiang
2016-04-27  1:56     ` [PATCH V3 " Guoqing Jiang
2016-04-27 15:27       ` Shaohua Li
2016-04-28  2:55         ` Guoqing Jiang
2016-04-28  3:58           ` Shaohua Li
2016-04-29  1:26             ` NeilBrown
2016-04-29 18:21               ` Shaohua Li
2016-04-21  5:58 ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Guoqing Jiang
2016-04-21  7:00   ` kbuild test robot
2016-04-21  7:00   ` [PATCH] md-cluster: fix ifnullfree.cocci warnings kbuild test robot
2016-04-21  9:10     ` Guoqing Jiang
2016-04-25 17:45   ` [PATCH 09/13] md-cluster: always setup in-memory bitmap Shaohua Li
2016-04-26  3:22     ` Guoqing Jiang
2016-04-27 15:24       ` Shaohua Li
2016-04-28  2:59         ` Guoqing Jiang
2016-04-21  5:58 ` [PATCH 10/13] md-cluster: sync bitmap when node received RESYNCING msg Guoqing Jiang
2016-04-21  5:58 ` [PATCH 11/13] md-cluster/bitmap: fix wrong calcuation of offset Guoqing Jiang
2016-04-21  5:58 ` [PATCH 12/13] md-cluster/bitmap: fix wrong page num in bitmap_file_clear_bit and bitmap_file_set_bit Guoqing Jiang
2016-04-21  5:58 ` [PATCH 13/13] md-cluster/bitmap: unplug bitmap to sync dirty pages to disk Guoqing Jiang
