* [md PATCH 0/8] Assort md updates for 2.6.31
@ 2009-08-02 21:58 NeilBrown
  2009-08-02 21:58 ` [md PATCH 2/8] md: Push down data integrity code to personalities NeilBrown
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid

Following is a small assortment of md fixes/improvements that I will
be pushing to Linus for 2.6.31.
Review always appreciated.

NeilBrown



---

Andre Noll (1):
      md: Push down data integrity code to personalities.

Dan Williams (1):
      md/raid6: release spare page at ->stop()

NeilBrown (6):
      md: Use revalidate_disk to effect changes in size of device.
      md: allow raid5_quiesce to work properly when reshape is happening.
      md/raid5: set reshape_position correctly when reshape starts.
      md: Handle growth of v1.x metadata correctly.
      md: avoid array overflow with bad v1.x metadata
      md: when a level change reduces the number of devices, remove the excess.


 drivers/md/linear.c    |    2 +
 drivers/md/md.c        |  143 ++++++++++++++++++++++++++++++------------------
 drivers/md/md.h        |    2 +
 drivers/md/multipath.c |    5 +-
 drivers/md/raid0.c     |    1 +
 drivers/md/raid1.c     |    7 ++
 drivers/md/raid10.c    |    4 +
 drivers/md/raid5.c     |   51 ++++++++---------
 8 files changed, 131 insertions(+), 84 deletions(-)

-- 



* [md PATCH 1/8] md/raid6: release spare page at ->stop()
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
  2009-08-02 21:58 ` [md PATCH 2/8] md: Push down data integrity code to personalities NeilBrown
  2009-08-02 21:58 ` [md PATCH 3/8] md: when a level change reduces the number of devices, remove the excess NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 7/8] md: allow raid5_quiesce to work properly when reshape is happening NeilBrown
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: stable, Dan Williams, NeilBrown

From: Dan Williams <dan.j.williams@intel.com>

Add the missing call to safe_put_page from stop() by unifying the
open-coded raid5_conf_t de-allocation under free_conf().

Cc: <stable@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |   26 ++++++++++++--------------
 1 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3783553..3937423 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4316,6 +4316,15 @@ raid5_size(mddev_t *mddev, sector_t sectors, int raid_disks)
 	return sectors * (raid_disks - conf->max_degraded);
 }
 
+static void free_conf(raid5_conf_t *conf)
+{
+	shrink_stripes(conf);
+	safe_put_page(conf->spare_page);
+	kfree(conf->disks);
+	kfree(conf->stripe_hashtbl);
+	kfree(conf);
+}
+
 static raid5_conf_t *setup_conf(mddev_t *mddev)
 {
 	raid5_conf_t *conf;
@@ -4447,11 +4456,7 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 
  abort:
 	if (conf) {
-		shrink_stripes(conf);
-		safe_put_page(conf->spare_page);
-		kfree(conf->disks);
-		kfree(conf->stripe_hashtbl);
-		kfree(conf);
+		free_conf(conf);
 		return ERR_PTR(-EIO);
 	} else
 		return ERR_PTR(-ENOMEM);
@@ -4629,12 +4634,8 @@ abort:
 	md_unregister_thread(mddev->thread);
 	mddev->thread = NULL;
 	if (conf) {
-		shrink_stripes(conf);
 		print_raid5_conf(conf);
-		safe_put_page(conf->spare_page);
-		kfree(conf->disks);
-		kfree(conf->stripe_hashtbl);
-		kfree(conf);
+		free_conf(conf);
 	}
 	mddev->private = NULL;
 	printk(KERN_ALERT "raid5: failed to run raid set %s\n", mdname(mddev));
@@ -4649,13 +4650,10 @@ static int stop(mddev_t *mddev)
 
 	md_unregister_thread(mddev->thread);
 	mddev->thread = NULL;
-	shrink_stripes(conf);
-	kfree(conf->stripe_hashtbl);
 	mddev->queue->backing_dev_info.congested_fn = NULL;
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
-	kfree(conf->disks);
-	kfree(conf);
+	free_conf(conf);
 	mddev->private = NULL;
 	return 0;
 }
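Unifying the four open-coded kfree sequences under one helper works because
kfree() and safe_put_page() tolerate NULL, so a single teardown function is
safe to call from any partially-initialized state. A toy userspace sketch of
the same pattern (the struct and field names here are invented stand-ins for
raid5_conf_t):

```c
#include <stdlib.h>

/* Toy conf with several independently-allocated members. */
struct conf {
	int *disks;
	int *hashtbl;
	int *spare_page;
};

/* Single teardown helper, usable from every failure/exit path.
 * free(NULL) is a no-op, so a partially-built conf is handled too --
 * the same property kfree()/safe_put_page() give free_conf() above. */
static void free_conf(struct conf *c)
{
	if (!c)
		return;
	free(c->spare_page);
	free(c->disks);
	free(c->hashtbl);
	free(c);
}

static struct conf *setup_conf(int fail_early)
{
	struct conf *c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;
	c->disks = malloc(16 * sizeof(int));
	if (fail_early || !c->disks) {	/* error path: one call cleans up */
		free_conf(c);
		return NULL;
	}
	c->hashtbl = malloc(16 * sizeof(int));
	return c;
}
```

free(NULL) being a no-op is what lets the abort path in setup_conf() call the
same helper as stop(), which is exactly the simplification the patch makes.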




* [md PATCH 3/8] md: when a level change reduces the number of devices, remove the excess.
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
  2009-08-02 21:58 ` [md PATCH 2/8] md: Push down data integrity code to personalities NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 1/8] md/raid6: release spare page at ->stop() NeilBrown
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: stable, NeilBrown

When an array is changed from RAID6 to RAID5, fewer drives are
needed.  So any device that is made superfluous by the level
conversion must be marked as not-active.
For the RAID6->RAID5 conversion, this will be a drive which only
has 'Q' blocks on it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 13535f0..5bd7896 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2694,6 +2694,7 @@ level_store(mddev_t *mddev, const char *buf, size_t len)
 	ssize_t rv = len;
 	struct mdk_personality *pers;
 	void *priv;
+	mdk_rdev_t *rdev;
 
 	if (mddev->pers == NULL) {
 		if (len == 0)
@@ -2773,6 +2774,12 @@ level_store(mddev_t *mddev, const char *buf, size_t len)
 	mddev_suspend(mddev);
 	mddev->pers->stop(mddev);
 	module_put(mddev->pers->owner);
+	/* Invalidate devices that are now superfluous */
+	list_for_each_entry(rdev, &mddev->disks, same_set)
+		if (rdev->raid_disk >= mddev->raid_disks) {
+			rdev->raid_disk = -1;
+			clear_bit(In_sync, &rdev->flags);
+		}
 	mddev->pers = pers;
 	mddev->private = priv;
 	strlcpy(mddev->clevel, pers->name, sizeof(mddev->clevel));




* [md PATCH 2/8] md: Push down data integrity code to personalities.
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 3/8] md: when a level change reduces the number of devices, remove the excess NeilBrown
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: Andre Noll, NeilBrown

From: Andre Noll <maan@systemlinux.org>

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity.  The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev(), is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/linear.c    |    1 +
 drivers/md/md.c        |   93 +++++++++++++++++++++++++++++++++---------------
 drivers/md/md.h        |    2 +
 drivers/md/multipath.c |    5 ++-
 drivers/md/raid0.c     |    1 +
 drivers/md/raid1.c     |    6 ++-
 drivers/md/raid10.c    |    4 ++
 7 files changed, 79 insertions(+), 33 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5810fa9..54c8677 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -220,6 +220,7 @@ static int linear_run (mddev_t *mddev)
 	mddev->queue->unplug_fn = linear_unplug;
 	mddev->queue->backing_dev_info.congested_fn = linear_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
+	md_integrity_register(mddev);
 	return 0;
 }
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index d4351ff..13535f0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1487,36 +1487,74 @@ static int match_mddev_units(mddev_t *mddev1, mddev_t *mddev2)
 
 static LIST_HEAD(pending_raid_disks);
 
-static void md_integrity_check(mdk_rdev_t *rdev, mddev_t *mddev)
+/*
+ * Try to register data integrity profile for an mddev
+ *
+ * This is called when an array is started and after a disk has been kicked
+ * from the array. It only succeeds if all working and active component devices
+ * are integrity capable with matching profiles.
+ */
+int md_integrity_register(mddev_t *mddev)
+{
+	mdk_rdev_t *rdev, *reference = NULL;
+
+	if (list_empty(&mddev->disks))
+		return 0; /* nothing to do */
+	if (blk_get_integrity(mddev->gendisk))
+		return 0; /* already registered */
+	list_for_each_entry(rdev, &mddev->disks, same_set) {
+		/* skip spares and non-functional disks */
+		if (test_bit(Faulty, &rdev->flags))
+			continue;
+		if (rdev->raid_disk < 0)
+			continue;
+		/*
+		 * If at least one rdev is not integrity capable, we can not
+		 * enable data integrity for the md device.
+		 */
+		if (!bdev_get_integrity(rdev->bdev))
+			return -EINVAL;
+		if (!reference) {
+			/* Use the first rdev as the reference */
+			reference = rdev;
+			continue;
+		}
+		/* does this rdev's profile match the reference profile? */
+		if (blk_integrity_compare(reference->bdev->bd_disk,
+				rdev->bdev->bd_disk) < 0)
+			return -EINVAL;
+	}
+	/*
+	 * All component devices are integrity capable and have matching
+	 * profiles, register the common profile for the md device.
+	 */
+	if (blk_integrity_register(mddev->gendisk,
+			bdev_get_integrity(reference->bdev)) != 0) {
+		printk(KERN_ERR "md: failed to register integrity for %s\n",
+			mdname(mddev));
+		return -EINVAL;
+	}
+	printk(KERN_NOTICE "md: data integrity on %s enabled\n",
+		mdname(mddev));
+	return 0;
+}
+EXPORT_SYMBOL(md_integrity_register);
+
+/* Disable data integrity if non-capable/non-matching disk is being added */
+void md_integrity_add_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
 {
-	struct mdk_personality *pers = mddev->pers;
-	struct gendisk *disk = mddev->gendisk;
 	struct blk_integrity *bi_rdev = bdev_get_integrity(rdev->bdev);
-	struct blk_integrity *bi_mddev = blk_get_integrity(disk);
+	struct blk_integrity *bi_mddev = blk_get_integrity(mddev->gendisk);
 
-	/* Data integrity passthrough not supported on RAID 4, 5 and 6 */
-	if (pers && pers->level >= 4 && pers->level <= 6)
+	if (!bi_mddev) /* nothing to do */
 		return;
-
-	/* If rdev is integrity capable, register profile for mddev */
-	if (!bi_mddev && bi_rdev) {
-		if (blk_integrity_register(disk, bi_rdev))
-			printk(KERN_ERR "%s: %s Could not register integrity!\n",
-			       __func__, disk->disk_name);
-		else
-			printk(KERN_NOTICE "Enabling data integrity on %s\n",
-			       disk->disk_name);
+	if (rdev->raid_disk < 0) /* skip spares */
 		return;
-	}
-
-	/* Check that mddev and rdev have matching profiles */
-	if (blk_integrity_compare(disk, rdev->bdev->bd_disk) < 0) {
-		printk(KERN_ERR "%s: %s/%s integrity mismatch!\n", __func__,
-		       disk->disk_name, rdev->bdev->bd_disk->disk_name);
-		printk(KERN_NOTICE "Disabling data integrity on %s\n",
-		       disk->disk_name);
-		blk_integrity_unregister(disk);
-	}
+	if (bi_rdev && blk_integrity_compare(mddev->gendisk,
+					     rdev->bdev->bd_disk) >= 0)
+		return;
+	printk(KERN_NOTICE "disabling data integrity on %s\n", mdname(mddev));
+	blk_integrity_unregister(mddev->gendisk);
 }
 
 static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
@@ -1591,7 +1629,6 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	/* May as well allow recovery to be retried once */
 	mddev->recovery_disabled = 0;
 
-	md_integrity_check(rdev, mddev);
 	return 0;
 
  fail:
@@ -4048,10 +4085,6 @@ static int do_md_run(mddev_t * mddev)
 	}
 	strlcpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
 
-	if (pers->level >= 4 && pers->level <= 6)
-		/* Cannot support integrity (yet) */
-		blk_integrity_unregister(mddev->gendisk);
-
 	if (mddev->reshape_position != MaxSector &&
 	    pers->start_reshape == NULL) {
 		/* This personality cannot handle reshaping... */
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 9430a11..78f0316 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -431,5 +431,7 @@ extern int md_allow_write(mddev_t *mddev);
 extern void md_wait_for_blocked_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
 extern void md_set_array_sectors(mddev_t *mddev, sector_t array_sectors);
 extern int md_check_no_bitmap(mddev_t *mddev);
+extern int md_integrity_register(mddev_t *mddev);
+void md_integrity_add_rdev(mdk_rdev_t *rdev, mddev_t *mddev);
 
 #endif /* _MD_MD_H */
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 237fe3f..7140909 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -313,6 +313,7 @@ static int multipath_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
 			set_bit(In_sync, &rdev->flags);
 			rcu_assign_pointer(p->rdev, rdev);
 			err = 0;
+			md_integrity_add_rdev(rdev, mddev);
 			break;
 		}
 
@@ -345,7 +346,9 @@ static int multipath_remove_disk(mddev_t *mddev, int number)
 			/* lost the race, try later */
 			err = -EBUSY;
 			p->rdev = rdev;
+			goto abort;
 		}
+		md_integrity_register(mddev);
 	}
 abort:
 
@@ -519,7 +522,7 @@ static int multipath_run (mddev_t *mddev)
 	mddev->queue->unplug_fn = multipath_unplug;
 	mddev->queue->backing_dev_info.congested_fn = multipath_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
-
+	md_integrity_register(mddev);
 	return 0;
 
 out_free_conf:
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 335f490..898e2bd 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -351,6 +351,7 @@ static int raid0_run(mddev_t *mddev)
 
 	blk_queue_merge_bvec(mddev->queue, raid0_mergeable_bvec);
 	dump_zones(mddev);
+	md_integrity_register(mddev);
 	return 0;
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 0569efb..67e794d 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1144,7 +1144,7 @@ static int raid1_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
 			rcu_assign_pointer(p->rdev, rdev);
 			break;
 		}
-
+	md_integrity_add_rdev(rdev, mddev);
 	print_conf(conf);
 	return err;
 }
@@ -1178,7 +1178,9 @@ static int raid1_remove_disk(mddev_t *mddev, int number)
 			/* lost the race, try later */
 			err = -EBUSY;
 			p->rdev = rdev;
+			goto abort;
 		}
+		md_integrity_register(mddev);
 	}
 abort:
 
@@ -2067,7 +2069,7 @@ static int run(mddev_t *mddev)
 	mddev->queue->unplug_fn = raid1_unplug;
 	mddev->queue->backing_dev_info.congested_fn = raid1_congested;
 	mddev->queue->backing_dev_info.congested_data = mddev;
-
+	md_integrity_register(mddev);
 	return 0;
 
 out_no_mem:
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 7298a5e..3d9020c 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1170,6 +1170,7 @@ static int raid10_add_disk(mddev_t *mddev, mdk_rdev_t *rdev)
 			break;
 		}
 
+	md_integrity_add_rdev(rdev, mddev);
 	print_conf(conf);
 	return err;
 }
@@ -1203,7 +1204,9 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
 			/* lost the race, try later */
 			err = -EBUSY;
 			p->rdev = rdev;
+			goto abort;
 		}
+		md_integrity_register(mddev);
 	}
 abort:
 
@@ -2225,6 +2228,7 @@ static int run(mddev_t *mddev)
 
 	if (conf->near_copies < mddev->raid_disks)
 		blk_queue_merge_bvec(mddev->queue, raid10_mergeable_bvec);
+	md_integrity_register(mddev);
 	return 0;
 
 out_free_conf:
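The scan in md_integrity_register() above is a "compare everything against
the first usable element" loop. A minimal userspace sketch of that shape,
with a made-up profile struct standing in for struct blk_integrity (the
fields and values here are hypothetical):

```c
#include <stddef.h>

/* Hypothetical stand-in for a device's blk_integrity profile. */
struct profile {
	int capable;	/* 0 = no integrity support, like a NULL profile */
	int tag_size;	/* fields that must match across all members */
	int tuple_size;
};

/* Return 0 if every device matches the first one found, -1 if any
 * device lacks integrity support or mismatches.  Mirrors the loop in
 * md_integrity_register(): the first working device becomes the
 * reference, the rest are compared against it. */
static int check_profiles(const struct profile *devs, size_t n)
{
	const struct profile *ref = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!devs[i].capable)
			return -1;	/* one incapable rdev disables it all */
		if (!ref) {
			ref = &devs[i];	/* first rdev is the reference */
			continue;
		}
		if (devs[i].tag_size != ref->tag_size ||
		    devs[i].tuple_size != ref->tuple_size)
			return -1;	/* profile mismatch */
	}
	return 0;			/* all match (or list was empty) */
}
```

As in the patch, an empty list succeeds trivially, and a single
non-capable member is enough to veto registration for the whole array.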




* [md PATCH 4/8] md: avoid array overflow with bad v1.x metadata
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
                   ` (5 preceding siblings ...)
  2009-08-02 21:58 ` [md PATCH 6/8] md/raid5: set reshape_position correctly when reshape starts NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device NeilBrown
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown

We trust the 'desc_nr' field in v1.x metadata enough to use it
as an index in an array.  This isn't really safe.
So range-check the value first.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5bd7896..fc04963 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1308,7 +1308,12 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 	}
 	if (mddev->level != LEVEL_MULTIPATH) {
 		int role;
-		role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
+		if (rdev->desc_nr < 0 ||
+		    rdev->desc_nr >= le32_to_cpu(sb->max_dev)) {
+			role = 0xffff;
+			rdev->desc_nr = -1;
+		} else
+			role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
 		switch(role) {
 		case 0xffff: /* spare */
 			break;
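The fix is the standard "validate before indexing" rule for on-disk
metadata; 0xffff is the real v1.x role value for a spare, which the
out-of-range case now falls back to. A standalone sketch of the lookup
(array contents here are made up):

```c
#include <stdint.h>

#define ROLE_SPARE 0xffff	/* v1.x role value meaning "spare" */

/* Look up a device role, treating any out-of-range desc_nr as a spare,
 * as super_1_validate() now does, instead of reading past the end of
 * the dev_roles[] array. */
static uint16_t safe_role(const uint16_t *dev_roles, uint32_t max_dev,
			  int desc_nr)
{
	if (desc_nr < 0 || (uint32_t)desc_nr >= max_dev)
		return ROLE_SPARE;
	return dev_roles[desc_nr];
}
```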




* [md PATCH 6/8] md/raid5: set reshape_position correctly when reshape starts.
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
                   ` (4 preceding siblings ...)
  2009-08-02 21:58 ` [md PATCH 5/8] md: Handle growth of v1.x metadata correctly NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 4/8] md: avoid array overflow with bad v1.x metadata NeilBrown
  2009-08-02 21:58 ` [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device NeilBrown
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown

As the internal reshape_progress counter is the main driver
for reshape, the fact that reshape_position sometimes starts with the
wrong value has minimal effect.  It is visible in sysfs and that
is all.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3937423..659151e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5000,7 +5000,7 @@ static int raid5_start_reshape(mddev_t *mddev)
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	mddev->raid_disks = conf->raid_disks;
-	mddev->reshape_position = 0;
+	mddev->reshape_position = conf->reshape_progress;
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
 	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);




* [md PATCH 5/8] md: Handle growth of v1.x metadata correctly.
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
                   ` (3 preceding siblings ...)
  2009-08-02 21:58 ` [md PATCH 7/8] md: allow raid5_quiesce to work properly when reshape is happening NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 6/8] md/raid5: set reshape_position correctly when reshape starts NeilBrown
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: stable, NeilBrown

The v1.x metadata does not have a fixed size and can grow
when devices are added.
If it grows enough to require an extra sector of storage,
we need to update the 'sb_size' to match.

Without this, md can write out an incomplete superblock with a
bad checksum, which will be rejected when trying to re-assemble
the array.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index fc04963..5b507e9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1399,8 +1399,14 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 		if (rdev2->desc_nr+1 > max_dev)
 			max_dev = rdev2->desc_nr+1;
 
-	if (max_dev > le32_to_cpu(sb->max_dev))
+	if (max_dev > le32_to_cpu(sb->max_dev)) {
+		int bmask;
 		sb->max_dev = cpu_to_le32(max_dev);
+		rdev->sb_size = max_dev * 2 + 256;
+		bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;
+		if (rdev->sb_size & bmask)
+			rdev->sb_size = (rdev->sb_size | bmask) + 1;
+	}
 	for (i=0; i<max_dev;i++)
 		sb->dev_roles[i] = cpu_to_le16(0xfffe);
 	

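The new hunk rounds sb_size up to the device's logical block size with the
usual power-of-two mask idiom. Shown standalone (in the kernel the block
size comes from queue_logical_block_size(); here it is just a parameter):

```c
/* Round sz up to the next multiple of bsize (bsize must be a power of
 * two).  This is the (sz | bmask) + 1 idiom used in super_1_sync(). */
static int round_up_bsize(int sz, int bsize)
{
	int bmask = bsize - 1;

	if (sz & bmask)
		sz = (sz | bmask) + 1;
	return sz;
}
```

For example, with max_dev = 400 the raw superblock size is
400 * 2 + 256 = 1056 bytes, which rounds up to 1536 on a 512-byte
sector device; without the rounding, only part of the superblock would
be written and the checksum would be wrong on re-assembly.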



* [md PATCH 7/8] md: allow raid5_quiesce to work properly when reshape is happening.
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
                   ` (2 preceding siblings ...)
  2009-08-02 21:58 ` [md PATCH 1/8] md/raid6: release spare page at ->stop() NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-02 21:58 ` [md PATCH 5/8] md: Handle growth of v1.x metadata correctly NeilBrown
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown

The ->quiesce method is not supposed to stop resync/recovery/reshape,
just normal IO.
But in raid5 we don't have a way to know which stripes are being
used for normal IO and which for resync etc, so we need to wait for
all stripes to be idle to be sure that all writes have completed.

However, reshape keeps some stripes busy for an extended period of
time, so a call to raid5_quiesce can block for several seconds
needlessly.
So arrange for reshape etc to pause briefly while raid5_quiesce is
trying to quiesce the array so that the active_stripes count can
drop to zero.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 659151e..2dc35b4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3999,6 +3999,9 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 		return 0;
 	}
 
+	/* Allow raid5_quiesce to complete */
+	wait_event(conf->wait_for_overlap, conf->quiesce != 2);
+
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		return reshape_request(mddev, sector_nr, skipped);
 
@@ -5104,12 +5107,18 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 
 	case 1: /* stop all writes */
 		spin_lock_irq(&conf->device_lock);
-		conf->quiesce = 1;
+		/* '2' tells resync/reshape to pause so that all
+		 * active stripes can drain
+		 */
+		conf->quiesce = 2;
 		wait_event_lock_irq(conf->wait_for_stripe,
 				    atomic_read(&conf->active_stripes) == 0 &&
 				    atomic_read(&conf->active_aligned_reads) == 0,
 				    conf->device_lock, /* nothing */);
+		conf->quiesce = 1;
 		spin_unlock_irq(&conf->device_lock);
+		/* allow reshape to continue */
+		wake_up(&conf->wait_for_overlap);
 		break;
 
 	case 0: /* re-enable writes */
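The handshake added above can be modeled in userspace with two condition
variables: quiesce goes 0 -> 2 -> 1, and state 2 asks resync/reshape to
pause while the active-stripe count drains. A toy pthreads sketch (all
names are stand-ins for the conf fields; the real code uses the kernel's
wait_event primitives, not pthreads):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wait_for_stripe = PTHREAD_COND_INITIALIZER;
static pthread_cond_t wait_for_overlap = PTHREAD_COND_INITIALIZER;
static int quiesce;		/* 0 = normal, 2 = draining, 1 = quiesced */
static int active_stripes = 3;

/* Models sync_request()/reshape: release busy stripes, then pause
 * until quiesce settles, as the new wait_event in sync_request does. */
static void *reshape_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (active_stripes > 0) {
		active_stripes--;		/* drop one busy stripe */
		pthread_cond_broadcast(&wait_for_stripe);
	}
	while (quiesce == 2)			/* pause while draining */
		pthread_cond_wait(&wait_for_overlap, &lock);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Plays the raid5_quiesce(mddev, 1) side; returns 1 on success. */
static int run_quiesce_demo(void)
{
	pthread_t t;

	pthread_mutex_lock(&lock);
	quiesce = 2;				/* ask reshape to pause */
	pthread_create(&t, NULL, reshape_thread, NULL);
	while (active_stripes != 0)		/* drain active stripes */
		pthread_cond_wait(&wait_for_stripe, &lock);
	quiesce = 1;				/* all writes now stopped */
	pthread_cond_broadcast(&wait_for_overlap); /* reshape may go on */
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	return quiesce == 1 && active_stripes == 0;
}
```

The point of the intermediate state 2 is visible here: without it, the
quiescing thread would wait forever for active_stripes to reach zero
while reshape kept stripes pinned.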




* [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
                   ` (6 preceding siblings ...)
  2009-08-02 21:58 ` [md PATCH 4/8] md: avoid array overflow with bad v1.x metadata NeilBrown
@ 2009-08-02 21:58 ` NeilBrown
  2009-08-05 22:07   ` H. Peter Anvin
  7 siblings, 1 reply; 19+ messages in thread
From: NeilBrown @ 2009-08-02 21:58 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown

As revalidate_disk calls check_disk_size_change, it will cause
any capacity change of a gendisk to be propagated to the blockdev
inode.  So use that instead of mucking about with locks and
i_size_write.

Also add a call to revalidate_disk in do_md_run and a few other places
where the gendisk capacity is changed.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/linear.c |    1 +
 drivers/md/md.c     |   28 +++++-----------------------
 drivers/md/raid1.c  |    1 +
 drivers/md/raid5.c  |   12 ++----------
 4 files changed, 9 insertions(+), 33 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 54c8677..5fe39c2 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -257,6 +257,7 @@ static int linear_add(mddev_t *mddev, mdk_rdev_t *rdev)
 	rcu_assign_pointer(mddev->private, newconf);
 	md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
 	set_capacity(mddev->gendisk, mddev->array_sectors);
+	revalidate_disk(mddev->gendisk);
 	call_rcu(&oldconf->rcu, free_conf);
 	return 0;
 }
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5b507e9..5b722d8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3740,17 +3740,8 @@ array_size_store(mddev_t *mddev, const char *buf, size_t len)
 
 	mddev->array_sectors = sectors;
 	set_capacity(mddev->gendisk, mddev->array_sectors);
-	if (mddev->pers) {
-		struct block_device *bdev = bdget_disk(mddev->gendisk, 0);
-
-		if (bdev) {
-			mutex_lock(&bdev->bd_inode->i_mutex);
-			i_size_write(bdev->bd_inode,
-				     (loff_t)mddev->array_sectors << 9);
-			mutex_unlock(&bdev->bd_inode->i_mutex);
-			bdput(bdev);
-		}
-	}
+	if (mddev->pers)
+		revalidate_disk(mddev->gendisk);
 
 	return len;
 }
@@ -4240,6 +4231,7 @@ static int do_md_run(mddev_t * mddev)
 	md_wakeup_thread(mddev->thread);
 	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
 
+	revalidate_disk(mddev->gendisk);
 	mddev->changed = 1;
 	md_new_event(mddev);
 	sysfs_notify_dirent(mddev->sysfs_state);
@@ -5138,18 +5130,8 @@ static int update_size(mddev_t *mddev, sector_t num_sectors)
 			return -ENOSPC;
 	}
 	rv = mddev->pers->resize(mddev, num_sectors);
-	if (!rv) {
-		struct block_device *bdev;
-
-		bdev = bdget_disk(mddev->gendisk, 0);
-		if (bdev) {
-			mutex_lock(&bdev->bd_inode->i_mutex);
-			i_size_write(bdev->bd_inode,
-				     (loff_t)mddev->array_sectors << 9);
-			mutex_unlock(&bdev->bd_inode->i_mutex);
-			bdput(bdev);
-		}
-	}
+	if (!rv)
+		revalidate_disk(mddev->gendisk);
 	return rv;
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 67e794d..8726fd7 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2134,6 +2134,7 @@ static int raid1_resize(mddev_t *mddev, sector_t sectors)
 		return -EINVAL;
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	mddev->changed = 1;
+	revalidate_disk(mddev->gendisk);
 	if (sectors > mddev->dev_sectors &&
 	    mddev->recovery_cp == MaxSector) {
 		mddev->recovery_cp = mddev->dev_sectors;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2dc35b4..2b521ee 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4858,6 +4858,7 @@ static int raid5_resize(mddev_t *mddev, sector_t sectors)
 		return -EINVAL;
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	mddev->changed = 1;
+	revalidate_disk(mddev->gendisk);
 	if (sectors > mddev->dev_sectors && mddev->recovery_cp == MaxSector) {
 		mddev->recovery_cp = mddev->dev_sectors;
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -5058,7 +5059,6 @@ static void end_reshape(raid5_conf_t *conf)
  */
 static void raid5_finish_reshape(mddev_t *mddev)
 {
-	struct block_device *bdev;
 	raid5_conf_t *conf = mddev->private;
 
 	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
@@ -5067,15 +5067,7 @@ static void raid5_finish_reshape(mddev_t *mddev)
 			md_set_array_sectors(mddev, raid5_size(mddev, 0, 0));
 			set_capacity(mddev->gendisk, mddev->array_sectors);
 			mddev->changed = 1;
-
-			bdev = bdget_disk(mddev->gendisk, 0);
-			if (bdev) {
-				mutex_lock(&bdev->bd_inode->i_mutex);
-				i_size_write(bdev->bd_inode,
-					     (loff_t)mddev->array_sectors << 9);
-				mutex_unlock(&bdev->bd_inode->i_mutex);
-				bdput(bdev);
-			}
+			revalidate_disk(mddev->gendisk);
 		} else {
 			int d;
 			mddev->degraded = conf->raid_disks;




* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-02 21:58 ` [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device NeilBrown
@ 2009-08-05 22:07   ` H. Peter Anvin
  2009-08-06  1:03     ` Mike Snitzer
  0 siblings, 1 reply; 19+ messages in thread
From: H. Peter Anvin @ 2009-08-05 22:07 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On 08/02/2009 02:58 PM, NeilBrown wrote:
> As revalidate_disk calls check_disk_size_change, it will cause
> any capacity change of a gendisk to be propagated to the blockdev
> inode.  So use that instead of mucking about with locks and
> i_size_write.
> 
> Also add a call to revalidate_disk in do_md_run and a few other places
> where the gendisk capacity is changed.
> 

This patch causes my Fedora 11 system with all filesystems on RAID-1 to
not boot (it hangs in early userspace, Ctrl-Alt-Del reboots the system.)

	-hpa



* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-05 22:07   ` H. Peter Anvin
@ 2009-08-06  1:03     ` Mike Snitzer
  2009-08-06  1:15       ` H. Peter Anvin
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Snitzer @ 2009-08-06  1:03 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: NeilBrown, linux-raid

On Wed, Aug 5, 2009 at 6:07 PM, H. Peter Anvin<hpa@zytor.com> wrote:
> On 08/02/2009 02:58 PM, NeilBrown wrote:
>> As revalidate_disk calls check_disk_size_change, it will cause
>> any capacity change of a gendisk to be propagated to the blockdev
>> inode.  So use that instead of mucking about with locks and
>> i_size_write.
>>
>> Also add a call to revalidate_disk in do_md_run and a few other places
>> where the gendisk capacity is changed.
>>
>
> This patch causes my Fedora 11 system with all filesystems on RAID-1 to
> not boot (it hangs in early userspace, Ctrl-Alt-Del reboots the system.)

I reported similar findings, with some more detail, relative to
Fedora's rawhide here:
http://lkml.org/lkml/2009/8/5/275


* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-06  1:03     ` Mike Snitzer
@ 2009-08-06  1:15       ` H. Peter Anvin
  2009-08-06  3:13           ` Neil Brown
  0 siblings, 1 reply; 19+ messages in thread
From: H. Peter Anvin @ 2009-08-06  1:15 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: NeilBrown, linux-raid

On 08/05/2009 06:03 PM, Mike Snitzer wrote:
> On Wed, Aug 5, 2009 at 6:07 PM, H. Peter Anvin<hpa@zytor.com> wrote:
>> On 08/02/2009 02:58 PM, NeilBrown wrote:
>>> As revalidate_disk calls check_disk_size_change, it will cause
>>> any capacity change of a gendisk to be propagated to the blockdev
>>> inode.  So use that instead of mucking about with locks and
>>> i_size_write.
>>>
>>> Also add a call to revalidate_disk in do_md_run and a few other places
>>> where the gendisk capacity is changed.
>>>
>> This patch causes my Fedora 11 system with all filesystems on RAID-1 to
>> not boot (it hangs in early userspace, Ctrl-Alt-Del reboots the system.)
> 
> I reported similar findings, with some more detail, relative to
> Fedora's rawhide here:
> http://lkml.org/lkml/2009/8/5/275

Sounds to be the same, yes.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-06  1:15       ` H. Peter Anvin
@ 2009-08-06  3:13           ` Neil Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Neil Brown @ 2009-08-06  3:13 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Mike Snitzer, linux-raid, linux-kernel

On Wednesday August 5, hpa@zytor.com wrote:
> On 08/05/2009 06:03 PM, Mike Snitzer wrote:
> > On Wed, Aug 5, 2009 at 6:07 PM, H. Peter Anvin<hpa@zytor.com> wrote:
> >> On 08/02/2009 02:58 PM, NeilBrown wrote:
> >>> As revalidate_disk calls check_disk_size_change, it will cause
> >>> any capacity change of a gendisk to be propagated to the blockdev
> >>> inode.  So use that instead of mucking about with locks and
> >>> i_size_write.
> >>>
> >>> Also add a call to revalidate_disk in do_md_run and a few other places
> >>> where the gendisk capacity is changed.
> >>>
> >> This patch causes my Fedora 11 system with all filesystems on RAID-1 to
> >> not boot (it hangs in early userspace, Ctrl-Alt-Del reboots the system.)
> > 
> > I reported similar findings, with some more detail, relative to
> > Fedora's rawhide here:
> > http://lkml.org/lkml/2009/8/5/275
> 
> Sounds to be the same, yes.

Thanks for the reports guys.

I managed to reproduce the lockup and I think this patch should fix
it.
If you could review/test I would appreciate it.

Thanks,
NeilBrown


From cf90cb85596d05d9595bb8ee4cadb7d4091212b5 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 6 Aug 2009 13:10:43 +1000
Subject: [PATCH] Remove deadlock potential in md_open

A recent commit:
  commit 449aad3e25358812c43afc60918c5ad3819488e7

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call to revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nested which was introduced
by
   commit d63a5a74dee87883fda6b7d170244acaac5b05e8
to avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex into two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d.

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c |   10 +++++++---
 drivers/md/md.h |   10 ++++++++++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5b98bea..1ecb219 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -359,6 +359,7 @@ static mddev_t * mddev_find(dev_t unit)
 	else
 		new->md_minor = MINOR(unit) >> MdpMinorShift;
 
+	mutex_init(&new->open_mutex);
 	mutex_init(&new->reconfig_mutex);
 	INIT_LIST_HEAD(&new->disks);
 	INIT_LIST_HEAD(&new->all_mddevs);
@@ -4304,9 +4305,11 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 	struct gendisk *disk = mddev->gendisk;
 	mdk_rdev_t *rdev;
 
+	mutex_lock(&mddev->open_mutex);
 	if (atomic_read(&mddev->openers) > is_open) {
 		printk("md: %s still in use.\n",mdname(mddev));
-		return -EBUSY;
+		err = -EBUSY;
+		goto out;
 	}
 
 	if (mddev->pers) {
@@ -4434,6 +4437,7 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 	md_new_event(mddev);
 	sysfs_notify_dirent(mddev->sysfs_state);
 out:
+	mutex_unlock(&mddev->open_mutex);
 	return err;
 }
 
@@ -5518,12 +5522,12 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	}
 	BUG_ON(mddev != bdev->bd_disk->private_data);
 
-	if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
+	if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
 		goto out;
 
 	err = 0;
 	atomic_inc(&mddev->openers);
-	mddev_unlock(mddev);
+	mutex_unlock(&mddev->open_mutex);
 
 	check_disk_change(bdev);
  out:
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 78f0316..f8fc188 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -223,6 +223,16 @@ struct mddev_s
 							    * so we don't loop trying */
 
 	int				in_sync;	/* know to not need resync */
+	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
+	 * that we are never stopping an array while it is open.
+	 * 'reconfig_mutex' protects all other reconfiguration.
+	 * These locks are separate due to conflicting interactions
+	 * with bdev->bd_mutex.
+	 * Lock ordering is:
+	 *  reconfig_mutex -> bd_mutex : e.g. do_md_run -> revalidate_disk
+	 *  bd_mutex -> open_mutex:  e.g. __blkdev_get -> md_open
+	 */
+	struct mutex			open_mutex;
 	struct mutex			reconfig_mutex;
 	atomic_t			active;		/* general refcount */
 	atomic_t			openers;	/* number of active opens */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 19+ messages in thread


* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-06  3:13           ` Neil Brown
  (?)
@ 2009-08-06  4:35           ` H. Peter Anvin
  -1 siblings, 0 replies; 19+ messages in thread
From: H. Peter Anvin @ 2009-08-06  4:35 UTC (permalink / raw)
  To: Neil Brown; +Cc: Mike Snitzer, linux-raid, linux-kernel

On 08/05/2009 08:13 PM, Neil Brown wrote:

> 
> Thanks for the reports guys.
> 
> I managed to reproduce the lockup and I think this patch should fix
> it.
> If you could review/test I would appreciate it.
> 
> Thanks,
> NeilBrown
> 

This patch makes my system boot, whereas previously it would deadlock
100% of the time.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-06  3:13           ` Neil Brown
@ 2009-08-07 13:57             ` Mike Snitzer
  -1 siblings, 0 replies; 19+ messages in thread
From: Mike Snitzer @ 2009-08-07 13:57 UTC (permalink / raw)
  To: Neil Brown; +Cc: H. Peter Anvin, linux-raid, linux-kernel

On Wed, Aug 5, 2009 at 11:13 PM, Neil Brown<neilb@suse.de> wrote:
> On Wednesday August 5, hpa@zytor.com wrote:
>> On 08/05/2009 06:03 PM, Mike Snitzer wrote:
>> > On Wed, Aug 5, 2009 at 6:07 PM, H. Peter Anvin<hpa@zytor.com> wrote:
>> >> On 08/02/2009 02:58 PM, NeilBrown wrote:
>> >>> As revalidate_disk calls check_disk_size_change, it will cause
>> >>> any capacity change of a gendisk to be propagated to the blockdev
>> >>> inode.  So use that instead of mucking about with locks and
>> >>> i_size_write.
>> >>>
>> >>> Also add a call to revalidate_disk in do_md_run and a few other places
>> >>> where the gendisk capacity is changed.
>> >>>
>> >> This patch causes my Fedora 11 system with all filesystems on RAID-1 to
>> >> not boot (it hangs in early userspace, Ctrl-Alt-Del reboots the system.)
>> >
>> > I reported similar findings, with some more detail, relative to
>> > Fedora's rawhide here:
>> > http://lkml.org/lkml/2009/8/5/275
>>
>> Sounds to be the same, yes.
>
> Thanks for the reports guys.
>
> I managed to reproduce the lockup and I think this patch should fix
> it.
> If you could review/test I would appreciate it.
>
> Thanks,
> NeilBrown
>
>
> From cf90cb85596d05d9595bb8ee4cadb7d4091212b5 Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 6 Aug 2009 13:10:43 +1000
> Subject: [PATCH] Remove deadlock potential in md_open
>
> A recent commit:
>  commit 449aad3e25358812c43afc60918c5ad3819488e7
>
> introduced the possibility of an A-B/B-A deadlock between
> bd_mutex and reconfig_mutex.
>
> __blkdev_get holds bd_mutex while calling md_open which takes
>   reconfig_mutex,
> do_md_run is always called with reconfig_mutex held, and it now
>   takes bd_mutex in the call to revalidate_disk.
>
> This potential deadlock was not caught by lockdep due to the
> use of mutex_lock_interruptible_nested which was introduced
> by
>   commit d63a5a74dee87883fda6b7d170244acaac5b05e8
> to avoid a warning of an impossible deadlock.
>
> It is quite possible to split reconfig_mutex into two locks.
> One protects the array data structures while it is being
> reconfigured, the other ensures that an array is never even partially
> open while it is being deactivated.
>
> So create a new lock, open_mutex, just to ensure exclusion between
> 'open' and 'stop'.
>
> This avoids the deadlock and also avoids the lockdep warning mentioned
> in commit d63a5a74d.
>
> Reported-by: "Mike Snitzer" <snitzer@gmail.com>
> Reported-by: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: NeilBrown <neilb@suse.de>

Neil,

I finally tested this with the LVM testsuite's
t-pvcreate-operation-md.sh script that I mentioned here:
http://lkml.org/lkml/2009/8/5/275

The test succeeds but unfortunately triggers the following lockdep warning:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.31-0.125.rc5.git2.snitm.x86_64 #1
-------------------------------------------------------
mdadm/1348 is trying to acquire lock:
 (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811809dc>]
bd_release_from_disk+0x49/0x113

but task is already holding lock:
 (&new->open_mutex){+.+...}, at: [<ffffffff81418ac7>] do_md_stop+0x3f/0x529

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&new->open_mutex){+.+...}:
       [<ffffffff8109ed70>] __lock_acquire+0x990/0xb29
       [<ffffffff810a0298>] lock_acquire+0xd5/0x115
       [<ffffffff8152e318>] __mutex_lock_common+0x5d/0x3b0
       [<ffffffff8152e6b8>] mutex_lock_interruptible_nested+0x4d/0x68
       [<ffffffff8141e683>] md_open+0x71/0xb3
       [<ffffffff81181821>] __blkdev_get+0xe1/0x37e
       [<ffffffff81181900>] __blkdev_get+0x1c0/0x37e
       [<ffffffff81181ae1>] blkdev_get+0x23/0x39
       [<ffffffff81181b7c>] blkdev_open+0x85/0xd1
       [<ffffffff8114f909>] __dentry_open+0x1af/0x303
       [<ffffffff81150da8>] nameidata_to_filp+0x55/0x7d
       [<ffffffff8115ecb1>] do_filp_open+0x504/0x960
       [<ffffffff8114f595>] do_sys_open+0x70/0x12e
       [<ffffffff8114f6c0>] sys_open+0x33/0x49
       [<ffffffff810141c2>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #1 (&bdev->bd_mutex/1){+.+...}:
       [<ffffffff8109ed70>] __lock_acquire+0x990/0xb29
       [<ffffffff810a0298>] lock_acquire+0xd5/0x115
       [<ffffffff8152e318>] __mutex_lock_common+0x5d/0x3b0
       [<ffffffff8152e789>] mutex_lock_nested+0x4d/0x6a
       [<ffffffff811817ce>] __blkdev_get+0x8e/0x37e
       [<ffffffff81181900>] __blkdev_get+0x1c0/0x37e
       [<ffffffff81181ae1>] blkdev_get+0x23/0x39
       [<ffffffff81181b7c>] blkdev_open+0x85/0xd1
       [<ffffffff8114f909>] __dentry_open+0x1af/0x303
       [<ffffffff81150da8>] nameidata_to_filp+0x55/0x7d
       [<ffffffff8115ecb1>] do_filp_open+0x504/0x960
       [<ffffffff8114f595>] do_sys_open+0x70/0x12e
       [<ffffffff8114f6c0>] sys_open+0x33/0x49
       [<ffffffff810141c2>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (&bdev->bd_mutex){+.+.+.}:
       [<ffffffff8109ec64>] __lock_acquire+0x884/0xb29
       [<ffffffff810a0298>] lock_acquire+0xd5/0x115
       [<ffffffff8152e318>] __mutex_lock_common+0x5d/0x3b0
       [<ffffffff8152e789>] mutex_lock_nested+0x4d/0x6a
       [<ffffffff811809dc>] bd_release_from_disk+0x49/0x113
       [<ffffffff81415cd5>] unbind_rdev_from_array+0x5f/0x148
       [<ffffffff81416f96>] kick_rdev_from_array+0x25/0x48
       [<ffffffff81417011>] export_array+0x58/0xc2
       [<ffffffff81418d57>] do_md_stop+0x2cf/0x529
       [<ffffffff8141d7ab>] md_ioctl+0xa5e/0xffe
       [<ffffffff8128141f>] __blkdev_driver_ioctl+0x4b/0xa2
       [<ffffffff81281d9b>] blkdev_ioctl+0x881/0x8ee
       [<ffffffff81181f54>] block_ioctl+0x50/0x68
       [<ffffffff81161466>] vfs_ioctl+0x3e/0xa2
       [<ffffffff81161a03>] do_vfs_ioctl+0x499/0x500
       [<ffffffff81161ad5>] sys_ioctl+0x6b/0xa2
       [<ffffffff810141c2>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by mdadm/1348:
 #0:  (&new->reconfig_mutex){+.+.+.}, at: [<ffffffff81415e82>]
mddev_lock+0x2a/0x40
 #1:  (&new->open_mutex){+.+...}, at: [<ffffffff81418ac7>] do_md_stop+0x3f/0x529

stack backtrace:
Pid: 1348, comm: mdadm Tainted: G        W
2.6.31-0.125.rc5.git2.snitm.x86_64 #1
Call Trace:
 [<ffffffff8109e3c1>] print_circular_bug_tail+0x80/0x9f
 [<ffffffff8109ec64>] __lock_acquire+0x884/0xb29
 [<ffffffff811809dc>] ? bd_release_from_disk+0x49/0x113
 [<ffffffff81014d10>] ? restore_args+0x0/0x30
 [<ffffffff811809dc>] ? bd_release_from_disk+0x49/0x113
 [<ffffffff810a0298>] lock_acquire+0xd5/0x115
 [<ffffffff811809dc>] ? bd_release_from_disk+0x49/0x113
 [<ffffffff8152e318>] __mutex_lock_common+0x5d/0x3b0
 [<ffffffff811809dc>] ? bd_release_from_disk+0x49/0x113
 [<ffffffff811809dc>] ? bd_release_from_disk+0x49/0x113
 [<ffffffff810537e7>] ? need_resched+0x36/0x54
 [<ffffffff8152fbdf>] ? _spin_unlock_irq+0x46/0x61
 [<ffffffff8152e789>] mutex_lock_nested+0x4d/0x6a
 [<ffffffff811809dc>] bd_release_from_disk+0x49/0x113
 [<ffffffff81415cd5>] unbind_rdev_from_array+0x5f/0x148
 [<ffffffff81416f96>] kick_rdev_from_array+0x25/0x48
 [<ffffffff81082dd6>] ? flush_workqueue+0x0/0xd1
 [<ffffffff81417011>] export_array+0x58/0xc2
 [<ffffffff81418d57>] do_md_stop+0x2cf/0x529
 [<ffffffff8152e6b8>] ? mutex_lock_interruptible_nested+0x4d/0x68
 [<ffffffff8141d7ab>] md_ioctl+0xa5e/0xffe
 [<ffffffff8109e710>] ? __lock_acquire+0x330/0xb29
 [<ffffffff8103bd8f>] ? pvclock_clocksource_read+0x56/0xa6
 [<ffffffff8128141f>] __blkdev_driver_ioctl+0x4b/0xa2
 [<ffffffff8109bc58>] ? lock_release_holdtime+0x61/0x7c
 [<ffffffff81281d9b>] blkdev_ioctl+0x881/0x8ee
 [<ffffffff81246cb1>] ? __rcu_read_unlock+0x34/0x4a
 [<ffffffff812472c9>] ? avc_has_perm_noaudit+0x358/0x37d
 [<ffffffff8109d37a>] ? mark_lock+0x31/0x221
 [<ffffffff81247bee>] ? avc_has_perm+0x60/0x86
 [<ffffffff8103bd8f>] ? pvclock_clocksource_read+0x56/0xa6
 [<ffffffff812492b3>] ? inode_has_perm+0x7d/0xa0
 [<ffffffff815305e6>] ? _spin_unlock_irqrestore+0x5e/0x83
 [<ffffffff81181f54>] block_ioctl+0x50/0x68
 [<ffffffff81161466>] vfs_ioctl+0x3e/0xa2
 [<ffffffff81161a03>] do_vfs_ioctl+0x499/0x500
 [<ffffffff81161ad5>] sys_ioctl+0x6b/0xa2
 [<ffffffff810141c2>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
  2009-08-07 13:57             ` Mike Snitzer
@ 2009-08-10  1:26               ` Neil Brown
  -1 siblings, 0 replies; 19+ messages in thread
From: Neil Brown @ 2009-08-10  1:26 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: H. Peter Anvin, linux-raid, linux-kernel

On Friday August 7, snitzer@gmail.com wrote:
> 
> Neil,
> 
> I finally tested this with the LVM testsuite's
> t-pvcreate-operation-md.sh script that I mentioned here:
> http://lkml.org/lkml/2009/8/5/275
> 
> The test succeeds but unfortunately triggers the following lockdep warning:

Thanks!
It looks like I was protecting too much of do_md_stop with the new
lock.
This patch should do better.

I'll see if I can try out that testsuite...

Thanks,
NeilBrown

From 603c37e9533d7a9b329c79a9075a2d50955dbf2c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 10 Aug 2009 10:37:12 +1000
Subject: [PATCH] Remove deadlock potential in md_open

A recent commit:
  commit 449aad3e25358812c43afc60918c5ad3819488e7

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call to revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nested which was introduced
by
   commit d63a5a74dee87883fda6b7d170244acaac5b05e8
to avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex into two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d.

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c |   18 ++++++++++--------
 drivers/md/md.h |   10 ++++++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5b98bea..5614500 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -359,6 +359,7 @@ static mddev_t * mddev_find(dev_t unit)
 	else
 		new->md_minor = MINOR(unit) >> MdpMinorShift;
 
+	mutex_init(&new->open_mutex);
 	mutex_init(&new->reconfig_mutex);
 	INIT_LIST_HEAD(&new->disks);
 	INIT_LIST_HEAD(&new->all_mddevs);
@@ -4304,12 +4305,11 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 	struct gendisk *disk = mddev->gendisk;
 	mdk_rdev_t *rdev;
 
+	mutex_lock(&mddev->open_mutex);
 	if (atomic_read(&mddev->openers) > is_open) {
 		printk("md: %s still in use.\n",mdname(mddev));
-		return -EBUSY;
-	}
-
-	if (mddev->pers) {
+		err = -EBUSY;
+	} else if (mddev->pers) {
 
 		if (mddev->sync_thread) {
 			set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
@@ -4367,7 +4367,10 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 			set_disk_ro(disk, 1);
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	}
-
+out:
+	mutex_unlock(&mddev->open_mutex);
+	if (err)
+		return err;
 	/*
 	 * Free resources if final stop
 	 */
@@ -4433,7 +4436,6 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 	blk_integrity_unregister(disk);
 	md_new_event(mddev);
 	sysfs_notify_dirent(mddev->sysfs_state);
-out:
 	return err;
 }
 
@@ -5518,12 +5520,12 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	}
 	BUG_ON(mddev != bdev->bd_disk->private_data);
 
-	if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
+	if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
 		goto out;
 
 	err = 0;
 	atomic_inc(&mddev->openers);
-	mddev_unlock(mddev);
+	mutex_unlock(&mddev->open_mutex);
 
 	check_disk_change(bdev);
  out:
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 78f0316..f8fc188 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -223,6 +223,16 @@ struct mddev_s
 							    * so we don't loop trying */
 
 	int				in_sync;	/* know to not need resync */
+	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
+	 * that we are never stopping an array while it is open.
+	 * 'reconfig_mutex' protects all other reconfiguration.
+	 * These locks are separate due to conflicting interactions
+	 * with bdev->bd_mutex.
+	 * Lock ordering is:
+	 *  reconfig_mutex -> bd_mutex : e.g. do_md_run -> revalidate_disk
+	 *  bd_mutex -> open_mutex:  e.g. __blkdev_get -> md_open
+	 */
+	struct mutex			open_mutex;
 	struct mutex			reconfig_mutex;
 	atomic_t			active;		/* general refcount */
 	atomic_t			openers;	/* number of active opens */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device.
@ 2009-08-10  1:26               ` Neil Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Neil Brown @ 2009-08-10  1:26 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: H. Peter Anvin, linux-raid, linux-kernel

On Friday August 7, snitzer@gmail.com wrote:
> 
> Neil,
> 
> I finally tested this with the LVM testsuite's
> t-pvcreate-operation-md.sh script that I mentioned here:
> http://lkml.org/lkml/2009/8/5/275
> 
> The test succeeds but unfortunately triggers the following lockdep warning:

Thanks!
It looks like I was protecting too much of do_md_stop with the new
lock.
This patch should do better.

I'll see if I can try out that testsuite...

Thanks,
NeilBrown

>From 603c37e9533d7a9b329c79a9075a2d50955dbf2c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 10 Aug 2009 10:37:12 +1000
Subject: [PATCH] Remove deadlock potential in md_open

A recent commit:
  commit 449aad3e25358812c43afc60918c5ad3819488e7

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call to revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nested, which was introduced
by
   commit d63a5a74dee87883fda6b7d170244acaac5b05e8
to avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex into two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning
mentioned in commit d63a5a74d.

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c |   18 ++++++++++--------
 drivers/md/md.h |   10 ++++++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5b98bea..5614500 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -359,6 +359,7 @@ static mddev_t * mddev_find(dev_t unit)
 	else
 		new->md_minor = MINOR(unit) >> MdpMinorShift;
 
+	mutex_init(&new->open_mutex);
 	mutex_init(&new->reconfig_mutex);
 	INIT_LIST_HEAD(&new->disks);
 	INIT_LIST_HEAD(&new->all_mddevs);
@@ -4304,12 +4305,11 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 	struct gendisk *disk = mddev->gendisk;
 	mdk_rdev_t *rdev;
 
+	mutex_lock(&mddev->open_mutex);
 	if (atomic_read(&mddev->openers) > is_open) {
 		printk("md: %s still in use.\n",mdname(mddev));
-		return -EBUSY;
-	}
-
-	if (mddev->pers) {
+		err = -EBUSY;
+	} else if (mddev->pers) {
 
 		if (mddev->sync_thread) {
 			set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
@@ -4367,7 +4367,10 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 			set_disk_ro(disk, 1);
 		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 	}
-
+out:
+	mutex_unlock(&mddev->open_mutex);
+	if (err)
+		return err;
 	/*
 	 * Free resources if final stop
 	 */
@@ -4433,7 +4436,6 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 	blk_integrity_unregister(disk);
 	md_new_event(mddev);
 	sysfs_notify_dirent(mddev->sysfs_state);
-out:
 	return err;
 }
 
@@ -5518,12 +5520,12 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 	}
 	BUG_ON(mddev != bdev->bd_disk->private_data);
 
-	if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
+	if ((err = mutex_lock_interruptible(&mddev->open_mutex)))
 		goto out;
 
 	err = 0;
 	atomic_inc(&mddev->openers);
-	mddev_unlock(mddev);
+	mutex_unlock(&mddev->open_mutex);
 
 	check_disk_change(bdev);
  out:
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 78f0316..f8fc188 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -223,6 +223,16 @@ struct mddev_s
 							    * so we don't loop trying */
 
 	int				in_sync;	/* know to not need resync */
+	/* 'open_mutex' avoids races between 'md_open' and 'do_md_stop', so
+	 * that we are never stopping an array while it is open.
+	 * 'reconfig_mutex' protects all other reconfiguration.
+	 * These locks are separate due to conflicting interactions
+	 * with bdev->bd_mutex.
+	 * Lock ordering is:
+	 *  reconfig_mutex -> bd_mutex : e.g. do_md_run -> revalidate_disk
+	 *  bd_mutex -> open_mutex:  e.g. __blkdev_get -> md_open
+	 */
+	struct mutex			open_mutex;
 	struct mutex			reconfig_mutex;
 	atomic_t			active;		/* general refcount */
 	atomic_t			openers;	/* number of active opens */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2009-08-10  1:26 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-02 21:58 [md PATCH 0/8] Assort md updates for 2.6.31 NeilBrown
2009-08-02 21:58 ` [md PATCH 2/8] md: Push down data integrity code to personalities NeilBrown
2009-08-02 21:58 ` [md PATCH 3/8] md: when a level change reduces the number of devices, remove the excess NeilBrown
2009-08-02 21:58 ` [md PATCH 1/8] md/raid6: release spare page at ->stop() NeilBrown
2009-08-02 21:58 ` [md PATCH 7/8] md: allow raid5_quiesce to work properly when reshape is happening NeilBrown
2009-08-02 21:58 ` [md PATCH 5/8] md: Handle growth of v1.x metadata correctly NeilBrown
2009-08-02 21:58 ` [md PATCH 6/8] md/raid5: set reshape_position correctly when reshape starts NeilBrown
2009-08-02 21:58 ` [md PATCH 4/8] md: avoid array overflow with bad v1.x metadata NeilBrown
2009-08-02 21:58 ` [md PATCH 8/8] md: Use revalidate_disk to effect changes in size of device NeilBrown
2009-08-05 22:07   ` H. Peter Anvin
2009-08-06  1:03     ` Mike Snitzer
2009-08-06  1:15       ` H. Peter Anvin
2009-08-06  3:13         ` Neil Brown
2009-08-06  4:35           ` H. Peter Anvin
2009-08-07 13:57           ` Mike Snitzer
2009-08-10  1:26             ` Neil Brown
