linux-kernel.vger.kernel.org archive mirror
* [PATCH 000 of 6] md: various fixes for md
@ 2008-01-14  1:45 NeilBrown
  2008-01-14  1:45 ` [PATCH 001 of 6] md: Fix an occasional deadlock in raid5 NeilBrown
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Al Viro, Dan Williams

Following are 6 patches for md which are suitable for 2.6.25-rc1.

The first fixes a bug which could make it a candidate for 2.6.24-final.
However it is a deadlock that seems to occur very rarely, and has been
in mainline since 2.6.22.  So letting it into one more release
shouldn't be a big problem.  While the fix is fairly simple, it could
have some unexpected consequences, so I'd rather go for the next cycle.

The second patch fixes a bug which only affects -mm at the moment but
will probably affect 2.6.25 unless fixed.

The rest are cleanups with no functional change (I hope).

Thanks,
NeilBrown


 [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
 [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
 [PATCH 003 of 6] md: Change a few 'int' to 'size_t' in md
 [PATCH 004 of 6] md: Change ITERATE_MDDEV to for_each_mddev
 [PATCH 005 of 6] md: Change ITERATE_RDEV to rdev_for_each
 [PATCH 006 of 6] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
@ 2008-01-14  1:45 ` NeilBrown
  2008-01-16  5:01   ` dean gaudet
  2008-01-14  1:45 ` [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array NeilBrown
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Dan Williams


raid5's 'make_request' function calls generic_make_request on
underlying devices and if we run out of stripe heads, it could end up
waiting for one of those requests to complete.
This is bad as recursive calls to generic_make_request go on a queue
and are not even attempted until make_request completes.

So: don't make any generic_make_request calls in raid5 make_request
until all waiting has been done.  We do this by simply setting
STRIPE_HANDLE instead of calling handle_stripe().

If we need more stripe_heads, raid5d will get called to process the
pending stripe_heads which will call generic_make_request from a
different thread where no deadlock will happen.


This change by itself causes a performance hit.  So add a change so
that raid5_activate_delayed is only called at unplug time, never in
raid5d.  This seems to bring back the performance numbers.  Calling it
in raid5d was sometimes too soon...

Cc: "Dan Williams" <dan.j.williams@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c |   13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2008-01-11 15:01:14.000000000 +1100
+++ ./drivers/md/raid5.c	2008-01-14 12:24:07.000000000 +1100
@@ -3157,7 +3157,8 @@ static void raid5_activate_delayed(raid5
 				atomic_inc(&conf->preread_active_stripes);
 			list_add_tail(&sh->lru, &conf->handle_list);
 		}
-	}
+	} else
+		blk_plug_device(conf->mddev->queue);
 }
 
 static void activate_bit_delay(raid5_conf_t *conf)
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
 				goto retry;
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
-			handle_stripe(sh, NULL);
+			set_bit(STRIPE_HANDLE, &sh->state);
 			release_stripe(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
@@ -3890,7 +3891,7 @@ static int  retry_aligned_read(raid5_con
  * During the scan, completed stripes are saved for us by the interrupt
  * handler, so that they will not have to wait for our next wakeup.
  */
-static void raid5d (mddev_t *mddev)
+static void raid5d(mddev_t *mddev)
 {
 	struct stripe_head *sh;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3915,12 +3916,6 @@ static void raid5d (mddev_t *mddev)
 			activate_bit_delay(conf);
 		}
 
-		if (list_empty(&conf->handle_list) &&
-		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
-		    !blk_queue_plugged(mddev->queue) &&
-		    !list_empty(&conf->delayed_list))
-			raid5_activate_delayed(conf);
-
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);


* [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
  2008-01-14  1:45 ` [PATCH 001 of 6] md: Fix an occasional deadlock in raid5 NeilBrown
@ 2008-01-14  1:45 ` NeilBrown
  2008-01-14  2:04   ` Al Viro
  2008-01-14  1:45 ` [PATCH 003 of 6] md: Change a few 'int' to 'size_t' in md NeilBrown
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Al Viro


Due to possible deadlock issues we need to use scheduled work to
kobject_del an 'rdev' object from a different thread.

A recent change means that kobject_add no longer gets a reference, and
kobject_del doesn't put a reference.  Consequently, we need to
explicitly hold a reference to ensure that the last reference isn't
dropped before the scheduled work gets a chance to call kobject_del.

Also, rename delayed_delete to md_delayed_delete so that it is more
obvious in a stack trace which code is to blame.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-14 12:23:53.000000000 +1100
+++ ./drivers/md/md.c	2008-01-14 12:24:17.000000000 +1100
@@ -1421,10 +1421,11 @@ static int bind_rdev_to_array(mdk_rdev_t
 	return err;
 }
 
-static void delayed_delete(struct work_struct *ws)
+static void md_delayed_delete(struct work_struct *ws)
 {
 	mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
 	kobject_del(&rdev->kobj);
+	kobject_put(&rdev->kobj);
 }
 
 static void unbind_rdev_from_array(mdk_rdev_t * rdev)
@@ -1443,7 +1444,8 @@ static void unbind_rdev_from_array(mdk_r
 	/* We need to delay this, otherwise we can deadlock when
 	 * writing to 'remove' to "dev/state"
 	 */
-	INIT_WORK(&rdev->del_work, delayed_delete);
+	INIT_WORK(&rdev->del_work, md_delayed_delete);
+	kobject_get(&rdev->kobj);
 	schedule_work(&rdev->del_work);
 }
 
@@ -3688,7 +3690,7 @@ static int do_md_stop(mddev_t * mddev, i
 				sysfs_remove_link(&mddev->kobj, nm);
 			}
 
-		/* make sure all delayed_delete calls have finished */
+		/* make sure all md_delayed_delete calls have finished */
 		flush_scheduled_work();
 
 		export_array(mddev);


* [PATCH 003 of 6] md: Change a few 'int' to 'size_t' in md
  2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
  2008-01-14  1:45 ` [PATCH 001 of 6] md: Fix an occasional deadlock in raid5 NeilBrown
  2008-01-14  1:45 ` [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array NeilBrown
@ 2008-01-14  1:45 ` NeilBrown
  2008-01-14  1:45 ` [PATCH 004 of 6] md: Change ITERATE_MDDEV to for_each_mddev NeilBrown
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


As suggested by Andrew Morton.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-14 12:24:17.000000000 +1100
+++ ./drivers/md/md.c	2008-01-14 12:24:31.000000000 +1100
@@ -1802,7 +1802,7 @@ static ssize_t
 state_show(mdk_rdev_t *rdev, char *page)
 {
 	char *sep = "";
-	int len=0;
+	size_t len = 0;
 
 	if (test_bit(Faulty, &rdev->flags)) {
 		len+= sprintf(page+len, "%sfaulty",sep);
@@ -2320,7 +2320,7 @@ level_show(mddev_t *mddev, char *page)
 static ssize_t
 level_store(mddev_t *mddev, const char *buf, size_t len)
 {
-	int rv = len;
+	ssize_t rv = len;
 	if (mddev->pers)
 		return -EBUSY;
 	if (len == 0)
@@ -2807,7 +2807,7 @@ metadata_store(mddev_t *mddev, const cha
 		return len;
 	}
 	if (strncmp(buf, "external:", 9) == 0) {
-		int namelen = len-9;
+		size_t namelen = len-9;
 		if (namelen >= sizeof(mddev->metadata_type))
 			namelen = sizeof(mddev->metadata_type)-1;
 		strncpy(mddev->metadata_type, buf+9, namelen);


* [PATCH 004 of 6] md: Change ITERATE_MDDEV to for_each_mddev
  2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
                   ` (2 preceding siblings ...)
  2008-01-14  1:45 ` [PATCH 003 of 6] md: Change a few 'int' to 'size_t' in md NeilBrown
@ 2008-01-14  1:45 ` NeilBrown
  2008-01-14  1:45 ` [PATCH 005 of 6] md: Change ITERATE_RDEV to rdev_for_each NeilBrown
  2008-01-14  1:45 ` [PATCH 006 of 6] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING NeilBrown
  5 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


As this is more consistent with kernel style.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-14 12:24:54.000000000 +1100
+++ ./drivers/md/md.c	2008-01-14 12:26:04.000000000 +1100
@@ -195,7 +195,7 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
  * Any code which breaks out of this loop while own
  * a reference to the current mddev and must mddev_put it.
  */
-#define ITERATE_MDDEV(mddev,tmp)					\
+#define for_each_mddev(mddev,tmp)					\
 									\
 	for (({ spin_lock(&all_mddevs_lock); 				\
 		tmp = all_mddevs.next;					\
@@ -1596,7 +1596,7 @@ static void md_print_devices(void)
 	printk("md:	**********************************\n");
 	printk("md:	* <COMPLETE RAID STATE PRINTOUT> *\n");
 	printk("md:	**********************************\n");
-	ITERATE_MDDEV(mddev,tmp) {
+	for_each_mddev(mddev, tmp) {
 
 		if (mddev->bitmap)
 			bitmap_print_sb(mddev->bitmap);
@@ -2014,7 +2014,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 		struct list_head *tmp, *tmp2;
 
 		mddev_unlock(rdev->mddev);
-		ITERATE_MDDEV(mddev, tmp) {
+		for_each_mddev(mddev, tmp) {
 			mdk_rdev_t *rdev2;
 
 			mddev_lock(mddev);
@@ -5464,7 +5464,7 @@ void md_do_sync(mddev_t *mddev)
 			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 			goto skip;
 		}
-		ITERATE_MDDEV(mddev2,tmp) {
+		for_each_mddev(mddev2, tmp) {
 			if (mddev2 == mddev)
 				continue;
 			if (mddev2->curr_resync && 
@@ -5912,7 +5912,7 @@ static int md_notify_reboot(struct notif
 
 		printk(KERN_INFO "md: stopping all md devices.\n");
 
-		ITERATE_MDDEV(mddev,tmp)
+		for_each_mddev(mddev, tmp)
 			if (mddev_trylock(mddev)) {
 				do_md_stop (mddev, 1);
 				mddev_unlock(mddev);
@@ -6046,7 +6046,7 @@ static __exit void md_exit(void)
 	unregister_reboot_notifier(&md_notifier);
 	unregister_sysctl_table(raid_table_header);
 	remove_proc_entry("mdstat", NULL);
-	ITERATE_MDDEV(mddev,tmp) {
+	for_each_mddev(mddev, tmp) {
 		struct gendisk *disk = mddev->gendisk;
 		if (!disk)
 			continue;


* [PATCH 005 of 6] md: Change ITERATE_RDEV to rdev_for_each
  2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
                   ` (3 preceding siblings ...)
  2008-01-14  1:45 ` [PATCH 004 of 6] md: Change ITERATE_MDDEV to for_each_mddev NeilBrown
@ 2008-01-14  1:45 ` NeilBrown
  2008-01-14  1:45 ` [PATCH 006 of 6] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING NeilBrown
  5 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


as this is more in line with common practice in the kernel.
Also swap the args around to be more like list_for_each.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c       |    4 +-
 ./drivers/md/faulty.c       |    2 -
 ./drivers/md/linear.c       |    2 -
 ./drivers/md/md.c           |   64 ++++++++++++++++++++++----------------------
 ./drivers/md/multipath.c    |    2 -
 ./drivers/md/raid0.c        |    8 ++---
 ./drivers/md/raid1.c        |    2 -
 ./drivers/md/raid10.c       |    2 -
 ./drivers/md/raid5.c        |    6 ++--
 ./include/linux/raid/md_k.h |    2 -
 10 files changed, 47 insertions(+), 47 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/bitmap.c	2008-01-14 12:26:10.000000000 +1100
@@ -231,7 +231,7 @@ static struct page *read_sb_page(mddev_t
 	if (!page)
 		return ERR_PTR(-ENOMEM);
 
-	ITERATE_RDEV(mddev, rdev, tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (! test_bit(In_sync, &rdev->flags)
 		    || test_bit(Faulty, &rdev->flags))
 			continue;
@@ -255,7 +255,7 @@ static int write_sb_page(struct bitmap *
 	struct list_head *tmp;
 	mddev_t *mddev = bitmap->mddev;
 
-	ITERATE_RDEV(mddev, rdev, tmp)
+	rdev_for_each(rdev, tmp, mddev)
 		if (test_bit(In_sync, &rdev->flags)
 		    && !test_bit(Faulty, &rdev->flags)) {
 			int size = PAGE_SIZE;

diff .prev/drivers/md/faulty.c ./drivers/md/faulty.c
--- .prev/drivers/md/faulty.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/faulty.c	2008-01-14 12:26:10.000000000 +1100
@@ -294,7 +294,7 @@ static int run(mddev_t *mddev)
 	}
 	conf->nfaults = 0;
 
-	ITERATE_RDEV(mddev, rdev, tmp)
+	rdev_for_each(rdev, tmp, mddev)
 		conf->rdev = rdev;
 
 	mddev->array_size = mddev->size;

diff .prev/drivers/md/linear.c ./drivers/md/linear.c
--- .prev/drivers/md/linear.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/linear.c	2008-01-14 12:26:10.000000000 +1100
@@ -122,7 +122,7 @@ static linear_conf_t *linear_conf(mddev_
 	cnt = 0;
 	conf->array_size = 0;
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		int j = rdev->raid_disk;
 		dev_info_t *disk = conf->disks + j;
 

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-14 12:26:04.000000000 +1100
+++ ./drivers/md/md.c	2008-01-14 12:26:10.000000000 +1100
@@ -311,7 +311,7 @@ static mdk_rdev_t * find_rdev_nr(mddev_t
 	mdk_rdev_t * rdev;
 	struct list_head *tmp;
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (rdev->desc_nr == nr)
 			return rdev;
 	}
@@ -323,7 +323,7 @@ static mdk_rdev_t * find_rdev(mddev_t * 
 	struct list_head *tmp;
 	mdk_rdev_t *rdev;
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (rdev->bdev->bd_dev == dev)
 			return rdev;
 	}
@@ -944,7 +944,7 @@ static void super_90_sync(mddev_t *mddev
 		sb->state |= (1<<MD_SB_BITMAP_PRESENT);
 
 	sb->disks[0].state = (1<<MD_DISK_REMOVED);
-	ITERATE_RDEV(mddev,rdev2,tmp) {
+	rdev_for_each(rdev2, tmp, mddev) {
 		mdp_disk_t *d;
 		int desc_nr;
 		if (rdev2->raid_disk >= 0 && test_bit(In_sync, &rdev2->flags)
@@ -1297,7 +1297,7 @@ static void super_1_sync(mddev_t *mddev,
 	}
 
 	max_dev = 0;
-	ITERATE_RDEV(mddev,rdev2,tmp)
+	rdev_for_each(rdev2, tmp, mddev)
 		if (rdev2->desc_nr+1 > max_dev)
 			max_dev = rdev2->desc_nr+1;
 
@@ -1306,7 +1306,7 @@ static void super_1_sync(mddev_t *mddev,
 	for (i=0; i<max_dev;i++)
 		sb->dev_roles[i] = cpu_to_le16(0xfffe);
 	
-	ITERATE_RDEV(mddev,rdev2,tmp) {
+	rdev_for_each(rdev2, tmp, mddev) {
 		i = rdev2->desc_nr;
 		if (test_bit(Faulty, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(0xfffe);
@@ -1344,8 +1344,8 @@ static int match_mddev_units(mddev_t *md
 	struct list_head *tmp, *tmp2;
 	mdk_rdev_t *rdev, *rdev2;
 
-	ITERATE_RDEV(mddev1,rdev,tmp)
-		ITERATE_RDEV(mddev2, rdev2, tmp2)
+	rdev_for_each(rdev, tmp, mddev1)
+		rdev_for_each(rdev2, tmp2, mddev2)
 			if (rdev->bdev->bd_contains ==
 			    rdev2->bdev->bd_contains)
 				return 1;
@@ -1518,7 +1518,7 @@ static void export_array(mddev_t *mddev)
 	struct list_head *tmp;
 	mdk_rdev_t *rdev;
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (!rdev->mddev) {
 			MD_BUG();
 			continue;
@@ -1602,11 +1602,11 @@ static void md_print_devices(void)
 			bitmap_print_sb(mddev->bitmap);
 		else
 			printk("%s: ", mdname(mddev));
-		ITERATE_RDEV(mddev,rdev,tmp2)
+		rdev_for_each(rdev, tmp2, mddev)
 			printk("<%s>", bdevname(rdev->bdev,b));
 		printk("\n");
 
-		ITERATE_RDEV(mddev,rdev,tmp2)
+		rdev_for_each(rdev, tmp2, mddev)
 			print_rdev(rdev);
 	}
 	printk("md:	**********************************\n");
@@ -1625,7 +1625,7 @@ static void sync_sbs(mddev_t * mddev, in
 	mdk_rdev_t *rdev;
 	struct list_head *tmp;
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (rdev->sb_events == mddev->events ||
 		    (nospares &&
 		     rdev->raid_disk < 0 &&
@@ -1732,7 +1732,7 @@ repeat:
 		mdname(mddev),mddev->in_sync);
 
 	bitmap_update_sb(mddev->bitmap);
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		char b[BDEVNAME_SIZE];
 		dprintk(KERN_INFO "md: ");
 		if (rdev->sb_loaded != 1)
@@ -2018,7 +2018,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 			mdk_rdev_t *rdev2;
 
 			mddev_lock(mddev);
-			ITERATE_RDEV(mddev, rdev2, tmp2)
+			rdev_for_each(rdev2, tmp2, mddev)
 				if (test_bit(AllReserved, &rdev2->flags) ||
 				    (rdev->bdev == rdev2->bdev &&
 				     rdev != rdev2 &&
@@ -2204,7 +2204,7 @@ static void analyze_sbs(mddev_t * mddev)
 	char b[BDEVNAME_SIZE];
 
 	freshest = NULL;
-	ITERATE_RDEV(mddev,rdev,tmp)
+	rdev_for_each(rdev, tmp, mddev)
 		switch (super_types[mddev->major_version].
 			load_super(rdev, freshest, mddev->minor_version)) {
 		case 1:
@@ -2225,7 +2225,7 @@ static void analyze_sbs(mddev_t * mddev)
 		validate_super(mddev, freshest);
 
 	i = 0;
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (rdev != freshest)
 			if (super_types[mddev->major_version].
 			    validate_super(mddev, rdev)) {
@@ -3319,7 +3319,7 @@ static int do_md_run(mddev_t * mddev)
 		}
 
 		/* devices must have minimum size of one chunk */
-		ITERATE_RDEV(mddev,rdev,tmp) {
+		rdev_for_each(rdev, tmp, mddev) {
 			if (test_bit(Faulty, &rdev->flags))
 				continue;
 			if (rdev->size < chunk_size / 1024) {
@@ -3346,7 +3346,7 @@ static int do_md_run(mddev_t * mddev)
 	 * the only valid external interface is through the md
 	 * device.
 	 */
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		if (test_bit(Faulty, &rdev->flags))
 			continue;
 		sync_blockdev(rdev->bdev);
@@ -3412,8 +3412,8 @@ static int do_md_run(mddev_t * mddev)
 		mdk_rdev_t *rdev2;
 		struct list_head *tmp2;
 		int warned = 0;
-		ITERATE_RDEV(mddev, rdev, tmp) {
-			ITERATE_RDEV(mddev, rdev2, tmp2) {
+		rdev_for_each(rdev, tmp, mddev) {
+			rdev_for_each(rdev2, tmp2, mddev) {
 				if (rdev < rdev2 &&
 				    rdev->bdev->bd_contains ==
 				    rdev2->bdev->bd_contains) {
@@ -3473,7 +3473,7 @@ static int do_md_run(mddev_t * mddev)
 	mddev->safemode_delay = (200 * HZ)/1000 +1; /* 200 msec delay */
 	mddev->in_sync = 1;
 
-	ITERATE_RDEV(mddev,rdev,tmp)
+	rdev_for_each(rdev, tmp, mddev)
 		if (rdev->raid_disk >= 0) {
 			char nm[20];
 			sprintf(nm, "rd%d", rdev->raid_disk);
@@ -3506,7 +3506,7 @@ static int do_md_run(mddev_t * mddev)
 	if (mddev->degraded && !mddev->sync_thread) {
 		struct list_head *rtmp;
 		int spares = 0;
-		ITERATE_RDEV(mddev,rdev,rtmp)
+		rdev_for_each(rdev, rtmp, mddev)
 			if (rdev->raid_disk >= 0 &&
 			    !test_bit(In_sync, &rdev->flags) &&
 			    !test_bit(Faulty, &rdev->flags))
@@ -3683,7 +3683,7 @@ static int do_md_stop(mddev_t * mddev, i
 		}
 		mddev->bitmap_offset = 0;
 
-		ITERATE_RDEV(mddev,rdev,tmp)
+		rdev_for_each(rdev, tmp, mddev)
 			if (rdev->raid_disk >= 0) {
 				char nm[20];
 				sprintf(nm, "rd%d", rdev->raid_disk);
@@ -3724,7 +3724,7 @@ static void autorun_array(mddev_t *mddev
 
 	printk(KERN_INFO "md: running: ");
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		char b[BDEVNAME_SIZE];
 		printk("<%s>", bdevname(rdev->bdev,b));
 	}
@@ -3851,7 +3851,7 @@ static int get_array_info(mddev_t * mdde
 	struct list_head *tmp;
 
 	nr=working=active=failed=spare=0;
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		nr++;
 		if (test_bit(Faulty, &rdev->flags))
 			failed++;
@@ -4391,7 +4391,7 @@ static int update_size(mddev_t *mddev, u
 	 */
 	if (mddev->sync_thread)
 		return -EBUSY;
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		sector_t avail;
 		avail = rdev->size * 2;
 
@@ -5132,7 +5132,7 @@ static int md_seq_show(struct seq_file *
 		}
 
 		size = 0;
-		ITERATE_RDEV(mddev,rdev,tmp2) {
+		rdev_for_each(rdev, tmp2, mddev) {
 			char b[BDEVNAME_SIZE];
 			seq_printf(seq, " %s[%d]",
 				bdevname(rdev->bdev,b), rdev->desc_nr);
@@ -5287,7 +5287,7 @@ static int is_mddev_idle(mddev_t *mddev)
 	long curr_events;
 
 	idle = 1;
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		struct gendisk *disk = rdev->bdev->bd_contains->bd_disk;
 		curr_events = disk_stat_read(disk, sectors[0]) + 
 				disk_stat_read(disk, sectors[1]) - 
@@ -5514,7 +5514,7 @@ void md_do_sync(mddev_t *mddev)
 		/* recovery follows the physical size of devices */
 		max_sectors = mddev->size << 1;
 		j = MaxSector;
-		ITERATE_RDEV(mddev,rdev,rtmp)
+		rdev_for_each(rdev, rtmp, mddev)
 			if (rdev->raid_disk >= 0 &&
 			    !test_bit(Faulty, &rdev->flags) &&
 			    !test_bit(In_sync, &rdev->flags) &&
@@ -5667,7 +5667,7 @@ void md_do_sync(mddev_t *mddev)
 		} else {
 			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
 				mddev->curr_resync = MaxSector;
-			ITERATE_RDEV(mddev,rdev,rtmp)
+			rdev_for_each(rdev, rtmp, mddev)
 				if (rdev->raid_disk >= 0 &&
 				    !test_bit(Faulty, &rdev->flags) &&
 				    !test_bit(In_sync, &rdev->flags) &&
@@ -5705,7 +5705,7 @@ static int remove_and_add_spares(mddev_t
 	struct list_head *rtmp;
 	int spares = 0;
 
-	ITERATE_RDEV(mddev,rdev,rtmp)
+	rdev_for_each(rdev, rtmp, mddev)
 		if (rdev->raid_disk >= 0 &&
 		    !mddev->external &&
 		    (test_bit(Faulty, &rdev->flags) ||
@@ -5721,7 +5721,7 @@ static int remove_and_add_spares(mddev_t
 		}
 
 	if (mddev->degraded) {
-		ITERATE_RDEV(mddev,rdev,rtmp)
+		rdev_for_each(rdev, rtmp, mddev)
 			if (rdev->raid_disk < 0
 			    && !test_bit(Faulty, &rdev->flags)) {
 				rdev->recovery_offset = 0;
@@ -5835,7 +5835,7 @@ void md_check_recovery(mddev_t *mddev)
 			 * information must be scrapped
 			 */
 			if (!mddev->degraded)
-				ITERATE_RDEV(mddev,rdev,rtmp)
+				rdev_for_each(rdev, rtmp, mddev)
 					rdev->saved_raid_disk = -1;
 
 			mddev->recovery = 0;

diff .prev/drivers/md/multipath.c ./drivers/md/multipath.c
--- .prev/drivers/md/multipath.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/multipath.c	2008-01-14 12:26:10.000000000 +1100
@@ -436,7 +436,7 @@ static int multipath_run (mddev_t *mddev
 	}
 
 	conf->working_disks = 0;
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		disk_idx = rdev->raid_disk;
 		if (disk_idx < 0 ||
 		    disk_idx >= mddev->raid_disks)

diff .prev/drivers/md/raid0.c ./drivers/md/raid0.c
--- .prev/drivers/md/raid0.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/raid0.c	2008-01-14 12:26:10.000000000 +1100
@@ -72,11 +72,11 @@ static int create_strip_zones (mddev_t *
 	 */
 	conf->nr_strip_zones = 0;
  
-	ITERATE_RDEV(mddev,rdev1,tmp1) {
+	rdev_for_each(rdev1, tmp1, mddev) {
 		printk("raid0: looking at %s\n",
 			bdevname(rdev1->bdev,b));
 		c = 0;
-		ITERATE_RDEV(mddev,rdev2,tmp2) {
+		rdev_for_each(rdev2, tmp2, mddev) {
 			printk("raid0:   comparing %s(%llu)",
 			       bdevname(rdev1->bdev,b),
 			       (unsigned long long)rdev1->size);
@@ -124,7 +124,7 @@ static int create_strip_zones (mddev_t *
 	cnt = 0;
 	smallest = NULL;
 	zone->dev = conf->devlist;
-	ITERATE_RDEV(mddev, rdev1, tmp1) {
+	rdev_for_each(rdev1, tmp1, mddev) {
 		int j = rdev1->raid_disk;
 
 		if (j < 0 || j >= mddev->raid_disks) {
@@ -293,7 +293,7 @@ static int raid0_run (mddev_t *mddev)
 
 	/* calculate array device size */
 	mddev->array_size = 0;
-	ITERATE_RDEV(mddev,rdev,tmp)
+	rdev_for_each(rdev, tmp, mddev)
 		mddev->array_size += rdev->size;
 
 	printk("raid0 : md_size is %llu blocks.\n", 

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/raid10.c	2008-01-14 12:26:10.000000000 +1100
@@ -2026,7 +2026,7 @@ static int run(mddev_t *mddev)
 		goto out_free_conf;
 	}
 
-	ITERATE_RDEV(mddev, rdev, tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
 		    || disk_idx < 0)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2008-01-11 15:01:13.000000000 +1100
+++ ./drivers/md/raid1.c	2008-01-14 12:26:10.000000000 +1100
@@ -1887,7 +1887,7 @@ static int run(mddev_t *mddev)
 	if (!conf->r1bio_pool)
 		goto out_no_mem;
 
-	ITERATE_RDEV(mddev, rdev, tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
 		    || disk_idx < 0)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2008-01-14 12:24:07.000000000 +1100
+++ ./drivers/md/raid5.c	2008-01-14 12:26:10.000000000 +1100
@@ -4129,7 +4129,7 @@ static int run(mddev_t *mddev)
 
 	pr_debug("raid5: run(%s) called.\n", mdname(mddev));
 
-	ITERATE_RDEV(mddev,rdev,tmp) {
+	rdev_for_each(rdev, tmp, mddev) {
 		raid_disk = rdev->raid_disk;
 		if (raid_disk >= conf->raid_disks
 		    || raid_disk < 0)
@@ -4542,7 +4542,7 @@ static int raid5_start_reshape(mddev_t *
 	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return -EBUSY;
 
-	ITERATE_RDEV(mddev, rdev, rtmp)
+	rdev_for_each(rdev, rtmp, mddev)
 		if (rdev->raid_disk < 0 &&
 		    !test_bit(Faulty, &rdev->flags))
 			spares++;
@@ -4564,7 +4564,7 @@ static int raid5_start_reshape(mddev_t *
 	/* Add some new drives, as many as will fit.
 	 * We know there are enough to make the newly sized array work.
 	 */
-	ITERATE_RDEV(mddev, rdev, rtmp)
+	rdev_for_each(rdev, rtmp, mddev)
 		if (rdev->raid_disk < 0 &&
 		    !test_bit(Faulty, &rdev->flags)) {
 			if (raid5_add_disk(mddev, rdev)) {

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-01-11 15:01:13.000000000 +1100
+++ ./include/linux/raid/md_k.h	2008-01-14 12:26:10.000000000 +1100
@@ -322,7 +322,7 @@ static inline char * mdname (mddev_t * m
 /*
  * iterates through the 'same array disks' ringlist
  */
-#define ITERATE_RDEV(mddev,rdev,tmp)					\
+#define rdev_for_each(rdev, tmp, mddev)				\
 	ITERATE_RDEV_GENERIC((mddev)->disks,rdev,tmp)
 
 /*


* [PATCH 006 of 6] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.
  2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
                   ` (4 preceding siblings ...)
  2008-01-14  1:45 ` [PATCH 005 of 6] md: Change ITERATE_RDEV to rdev_for_each NeilBrown
@ 2008-01-14  1:45 ` NeilBrown
  5 siblings, 0 replies; 20+ messages in thread
From: NeilBrown @ 2008-01-14  1:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Finish the ITERATE_* to for_each_* conversion.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |    8 ++++----
 ./include/linux/raid/md_k.h |   14 ++++----------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-14 12:26:10.000000000 +1100
+++ ./drivers/md/md.c	2008-01-14 12:26:15.000000000 +1100
@@ -3767,7 +3767,7 @@ static void autorun_devices(int part)
 		printk(KERN_INFO "md: considering %s ...\n",
 			bdevname(rdev0->bdev,b));
 		INIT_LIST_HEAD(&candidates);
-		ITERATE_RDEV_PENDING(rdev,tmp)
+		rdev_for_each_list(rdev, tmp, pending_raid_disks)
 			if (super_90_load(rdev, rdev0, 0) >= 0) {
 				printk(KERN_INFO "md:  adding %s ...\n",
 					bdevname(rdev->bdev,b));
@@ -3810,7 +3810,7 @@ static void autorun_devices(int part)
 			mddev_unlock(mddev);
 		} else {
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
-			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
+			rdev_for_each_list(rdev, tmp, candidates) {
 				list_del_init(&rdev->same_set);
 				if (bind_rdev_to_array(rdev, mddev))
 					export_rdev(rdev);
@@ -3821,7 +3821,7 @@ static void autorun_devices(int part)
 		/* on success, candidates will be empty, on error
 		 * it won't...
 		 */
-		ITERATE_RDEV_GENERIC(candidates,rdev,tmp)
+		rdev_for_each_list(rdev, tmp, candidates)
 			export_rdev(rdev);
 		mddev_put(mddev);
 	}
@@ -4936,7 +4936,7 @@ static void status_unused(struct seq_fil
 
 	seq_printf(seq, "unused devices: ");
 
-	ITERATE_RDEV_PENDING(rdev,tmp) {
+	rdev_for_each_list(rdev, tmp, pending_raid_disks) {
 		char b[BDEVNAME_SIZE];
 		i++;
 		seq_printf(seq, "%s ",

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-01-14 12:26:10.000000000 +1100
+++ ./include/linux/raid/md_k.h	2008-01-14 12:26:15.000000000 +1100
@@ -313,23 +313,17 @@ static inline char * mdname (mddev_t * m
  * iterates through some rdev ringlist. It's safe to remove the
  * current 'rdev'. Dont touch 'tmp' though.
  */
-#define ITERATE_RDEV_GENERIC(head,rdev,tmp)				\
+#define rdev_for_each_list(rdev, tmp, list)				\
 									\
-	for ((tmp) = (head).next;					\
+	for ((tmp) = (list).next;					\
 		(rdev) = (list_entry((tmp), mdk_rdev_t, same_set)),	\
-			(tmp) = (tmp)->next, (tmp)->prev != &(head)	\
+			(tmp) = (tmp)->next, (tmp)->prev != &(list)	\
 		; )
 /*
  * iterates through the 'same array disks' ringlist
  */
 #define rdev_for_each(rdev, tmp, mddev)				\
-	ITERATE_RDEV_GENERIC((mddev)->disks,rdev,tmp)
-
-/*
- * Iterates through 'pending RAID disks'
- */
-#define ITERATE_RDEV_PENDING(rdev,tmp)					\
-	ITERATE_RDEV_GENERIC(pending_raid_disks,rdev,tmp)
+	rdev_for_each_list(rdev, tmp, (mddev)->disks)
 
 typedef struct mdk_thread_s {
 	void			(*run) (mddev_t *mddev);


* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  1:45 ` [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array NeilBrown
@ 2008-01-14  2:04   ` Al Viro
  2008-01-14  3:21     ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2008-01-14  2:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel

On Mon, Jan 14, 2008 at 12:45:31PM +1100, NeilBrown wrote:
> 
> Due to possible deadlock issues we need to use scheduled work to
> kobject_del an 'rdev' object from a different thread.
> 
> A recent change means that kobject_add no longer gets a reference, and
> kobject_del doesn't put a reference.  Consequently, we need to
> explicitly hold a reference to ensure that the last reference isn't
> dropped before the scheduled work gets a chance to call kobject_del.
> 
> Also, rename delayed_delete to md_delayed_delete so that it is more
> obvious in a stack trace which code is to blame.

I don't know...  You still get kobject_del() and export_rdev()
in unpredictable order; sure, it won't be freed under you, but...

What is that deadlock problem, anyway?  I don't see anything that
would look like an obvious candidate in the stuff you are delaying...


* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  2:04   ` Al Viro
@ 2008-01-14  3:21     ` Neil Brown
  2008-01-14  3:43       ` Al Viro
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2008-01-14  3:21 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrew Morton, linux-raid, linux-kernel

On Monday January 14, viro@ZenIV.linux.org.uk wrote:
> On Mon, Jan 14, 2008 at 12:45:31PM +1100, NeilBrown wrote:
> > 
> > Due to possible deadlock issues we need to use scheduled work to
> > kobject_del an 'rdev' object from a different thread.
> > 
> > A recent change means that kobject_add no longer gets a reference, and
> > kobject_del doesn't put a reference.  Consequently, we need to
> > explicitly hold a reference to ensure that the last reference isn't
> > dropped before the scheduled work gets a chance to call kobject_del.
> > 
> > Also, rename delayed_delete to md_delayed_delete so that it is more
> > obvious in a stack trace which code is to blame.
> 
> I don't know...  You still get kobject_del() and export_rdev()
> in unpredictable order; sure, it won't be freed under you, but...

I cannot see why that would matter.
kobject_del deletes the object from the kobject tree and frees its
sysfs entries.  export_rdev disconnects the object from the md
structures and releases the connection with the device.  They are quite
independent.

> 
> What is that deadlock problem, anyway?  I don't see anything that
> would look like an obvious candidate in the stuff you are delaying...

Maybe it isn't there any more....

Once upon a time, when I 
   echo remove > /sys/block/mdX/md/dev-YYY/state

sysfs_write_file would hold buffer->sem while calling my store
handler.
When my store handler tried to delete the relevant kobject, it would
eventually call orphan_all_buffers which would try to take buf->sem
and deadlock.

orphan_all_buffers doesn't exist any more, so maybe the deadlock is
gone too.
However the comment at the top of sysfs_schedule_callback in
sysfs/file.c says:

 *
 * sysfs attribute methods must not unregister themselves or their parent
 * kobject (which would amount to the same thing).  Attempts to do so will
 * deadlock, since unregistration is mutually exclusive with driver
 * callbacks.
 *

so I'm inclined to leave the code as it is....  Of course, the comment
could be well out of date.

NeilBrown


* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  3:21     ` Neil Brown
@ 2008-01-14  3:43       ` Al Viro
  2008-01-14  4:48         ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2008-01-14  3:43 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel

On Mon, Jan 14, 2008 at 02:21:45PM +1100, Neil Brown wrote:

> Maybe it isn't there any more....
> 
> Once upon a time, when I 
>    echo remove > /sys/block/mdX/md/dev-YYY/state

Egads.  And just what will protect you from parallel callers
of state_store()?  buffer->mutex does *not* do that - it only
gives you exclusion on given struct file.  Run the command
above from several shells and you've got independent open
from each redirect => different struct file *and* different
buffer for each => no exclusion whatsoever.

And _that_ is present right in the mainline tree - it's unrelated
to -mm kobject changes.

BTW, yes, you do have a deadlock there - kobject_del() will try to evict
children, which will include waiting for currently running ->store()
to finish, which will include the caller since .../state *is* a child of
that sucker.

The real problem is the lack of any kind of exclusion considerations in
md.c itself, AFAICS.  Fun with ordering is secondary (BTW, yes, it is
a problem - will sysfs ->store() to attribute between export_rdev() and
kobject_del() work correctly?)



* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  3:43       ` Al Viro
@ 2008-01-14  4:48         ` Neil Brown
  2008-01-14  6:28           ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2008-01-14  4:48 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrew Morton, linux-raid, linux-kernel

On Monday January 14, viro@ZenIV.linux.org.uk wrote:
> On Mon, Jan 14, 2008 at 02:21:45PM +1100, Neil Brown wrote:
> 
> > Maybe it isn't there any more....
> > 
> > Once upon a time, when I 
> >    echo remove > /sys/block/mdX/md/dev-YYY/state
> 
> Egads.  And just what will protect you from parallel callers
> of state_store()?  buffer->mutex does *not* do that - it only
> gives you exclusion on given struct file.  Run the command
> above from several shells and you've got independent open
> from each redirect => different struct file *and* different
> buffer for each => no exclusion whatsoever.

Well, in -mm, rdev_attr_store gets a lock on
rdev->mddev->reconfig_mutex.
It doesn't test whether rdev->mddev is NULL though, so if the write happens
after unbind_rdev_from_array, we lose.
A test for NULL would be easy enough.  And I think that the mddev
won't actually disappear until the rdevs are all gone (your subsequent
comment about kobject_del ordering seems to confirm that) so a simple test
for NULL should be sufficient.

> 
> And _that_ is present right in the mainline tree - it's unrelated
> to -mm kobject changes.
> 
> BTW, yes, you do have a deadlock there - kobject_del() will try to evict
> children, which will include waiting for currently running ->store()
> to finish, which will include the caller since .../state *is* a child of
> that sucker.
> 
> The real problem is the lack of any kind of exclusion considerations in
> md.c itself, AFAICS.  Fun with ordering is secondary (BTW, yes, it is
> a problem - will sysfs ->store() to attribute between export_rdev() and
> kobject_del() work correctly?)

Probably not.  The possibility that rdev->mddev could be NULL would
break a lot of these.  Maybe I should delay setting rdev->mddev to
NULL until after the kobject_del.  Then audit them all.

Thanks.  I'll see what I can come up with.

NeilBrown


* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  4:48         ` Neil Brown
@ 2008-01-14  6:28           ` Neil Brown
  2008-01-14 12:59             ` Al Viro
  0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2008-01-14  6:28 UTC (permalink / raw)
  To: Al Viro; +Cc: Andrew Morton, linux-raid, linux-kernel

On Monday January 14, neilb@suse.de wrote:
> 
> > Thanks.  I'll see what I can come up with.

How about this, against current -mm

On both the read and write path for an rdev attribute, we
call mddev_lock, first checking that mddev is not NULL.
Once we get the lock, we check again.
If rdev->mddev is not NULL, we know it will stay that way as it only
gets cleared under the same lock.

While in the rdev show/store routines, we know that the mddev cannot
get freed, due to the kobject relationships.

rdev_size_store is awkward because it has to drop the lock.  So we
take a copy of rdev->mddev before the drop, and we are safe...

Comments?

NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-14 12:26:15.000000000 +1100
+++ ./drivers/md/md.c	2008-01-14 17:05:53.000000000 +1100
@@ -1998,9 +1998,11 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 	char *e;
 	unsigned long long size = simple_strtoull(buf, &e, 10);
 	unsigned long long oldsize = rdev->size;
+	mddev_t *my_mddev = rdev->mddev;
+
 	if (e==buf || (*e && *e != '\n'))
 		return -EINVAL;
-	if (rdev->mddev->pers)
+	if (my_mddev->pers)
 		return -EBUSY;
 	rdev->size = size;
 	if (size > oldsize && rdev->mddev->external) {
@@ -2013,7 +2015,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 		int overlap = 0;
 		struct list_head *tmp, *tmp2;
 
-		mddev_unlock(rdev->mddev);
+		mddev_unlock(my_mddev);
 		for_each_mddev(mddev, tmp) {
 			mdk_rdev_t *rdev2;
 
@@ -2033,7 +2035,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 				break;
 			}
 		}
-		mddev_lock(rdev->mddev);
+		mddev_lock(my_mddev);
 		if (overlap) {
 			/* Someone else could have slipped in a size
 			 * change here, but doing so is just silly.
@@ -2045,8 +2047,8 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 			return -EBUSY;
 		}
 	}
-	if (size < rdev->mddev->size || rdev->mddev->size == 0)
-		rdev->mddev->size = size;
+	if (size < my_mddev->size || my_mddev->size == 0)
+		my_mddev->size = size;
 	return len;
 }
 
@@ -2067,10 +2069,21 @@ rdev_attr_show(struct kobject *kobj, str
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+	mddev_t *mddev = rdev->mddev;
+	ssize_t rv;
 
 	if (!entry->show)
 		return -EIO;
-	return entry->show(rdev, page);
+
+	rv = mddev ? mddev_lock(mddev) : -EBUSY;
+	if (!rv) {
+		if (rdev->mddev == NULL)
+			rv = -EBUSY;
+		else
+			rv = entry->show(rdev, page);
+		mddev_unlock(mddev);
+	}
+	return rv;
 }
 
 static ssize_t
@@ -2079,15 +2092,19 @@ rdev_attr_store(struct kobject *kobj, st
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
-	int rv;
+	ssize_t rv;
+	mddev_t *mddev = rdev->mddev;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
-	rv = mddev_lock(rdev->mddev);
+	rv = mddev ? mddev_lock(mddev): -EBUSY;
 	if (!rv) {
-		rv = entry->store(rdev, page, length);
+		if (rdev->mddev == NULL)
+			rv = -EBUSY;
+		else
+			rv = entry->store(rdev, page, length);
 		mddev_unlock(rdev->mddev);
 	}
 	return rv;


* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14  6:28           ` Neil Brown
@ 2008-01-14 12:59             ` Al Viro
  2008-01-14 13:56               ` Al Viro
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2008-01-14 12:59 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel

On Mon, Jan 14, 2008 at 05:28:44PM +1100, Neil Brown wrote:
> On Monday January 14, neilb@suse.de wrote:
> > 
> > Thanks.  I'll see what I can come up with.
> 
> How about this, against current -mm
> 
> On both the read and write path for an rdev attribute, we
> call mddev_lock, first checking that mddev is not NULL.
> Once we get the lock, we check again.
> If rdev->mddev is not NULL, we know it will stay that way as it only
> gets cleared under the same lock.
> 
> While in the rdev show/store routines, we know that the mddev cannot
> get freed, due to the kobject relationships.
> 
> rdev_size_store is awkward because it has to drop the lock.  So we
> take a copy of rdev->mddev before the drop, and we are safe...
> 
> Comments?

*cringe*

I really don't like the entire scheme, to be honest.  BTW, what happens
if you try to add the same device to the same array after having it kicked
out?  If that comes before your delayed kobject_del(), the things will
get nasty since sysfs will (rightfully) refuse to add another entry with
the same name and parent while the old one is still there and for all
sysfs knows is going to stay there...


* Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.
  2008-01-14 12:59             ` Al Viro
@ 2008-01-14 13:56               ` Al Viro
  0 siblings, 0 replies; 20+ messages in thread
From: Al Viro @ 2008-01-14 13:56 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel

On Mon, Jan 14, 2008 at 12:59:39PM +0000, Al Viro wrote:

> I really don't like the entire scheme, to be honest.  BTW, what happens
> if you try to add the same device to the same array after having it kicked
> out?  If that comes before your delayed kobject_del(), the things will
> get nasty since sysfs will (rightfully) refuse to add another entry with
> the same name and parent while the old one is still there and for all
> sysfs knows is going to stay there...

More fun questions: what are the locking requirements for ->resize()?
You are calling it with no exclusion whatsoever...  What about
bind_rdev_to_array()?  At the very least, you want to protect
mddev->disks, and AFAICS new_dev_store() has no exclusion at all.
And I suspect that you have other things in there in need of protection
(finding free desc_nr, for one); can all of those be handled by simple
spinlocks?


* Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-14  1:45 ` [PATCH 001 of 6] md: Fix an occasional deadlock in raid5 NeilBrown
@ 2008-01-16  5:01   ` dean gaudet
  2008-01-16  5:54     ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: dean gaudet @ 2008-01-16  5:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel, Dan Williams

On Mon, 14 Jan 2008, NeilBrown wrote:

> 
> raid5's 'make_request' function calls generic_make_request on
> underlying devices and if we run out of stripe heads, it could end up
> waiting for one of those requests to complete.
> This is bad as recursive calls to generic_make_request go on a queue
> and are not even attempted until make_request completes.
> 
> So: don't make any generic_make_request calls in raid5 make_request
> until all waiting has been done.  We do this by simply setting
> STRIPE_HANDLE instead of calling handle_stripe().
> 
> If we need more stripe_heads, raid5d will get called to process the
> pending stripe_heads which will call generic_make_request from a
> different thread where no deadlock will happen.
> 
> 
> This change by itself causes a performance hit.  So add a change so
> that raid5_activate_delayed is only called at unplug time, never in
> raid5.  This seems to bring back the performance numbers.  Calling it
> in raid5d was sometimes too soon...
> 
> Cc: "Dan Williams" <dan.j.williams@gmail.com>
> Signed-off-by: Neil Brown <neilb@suse.de>

probably doesn't matter, but for the record:

Tested-by: dean gaudet <dean@arctic.org>

this time i tested with internal and external bitmaps and it survived 8h 
and 14h resp. under the parallel tar workload i used to reproduce the 
hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

thanks
-dean


* Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-16  5:01   ` dean gaudet
@ 2008-01-16  5:54     ` Andrew Morton
  2008-01-16  6:13       ` dean gaudet
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2008-01-16  5:54 UTC (permalink / raw)
  To: dean gaudet; +Cc: NeilBrown, linux-raid, linux-kernel, Dan Williams

On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet <dean@arctic.org> wrote:

> On Mon, 14 Jan 2008, NeilBrown wrote:
> 
> > 
> > raid5's 'make_request' function calls generic_make_request on
> > underlying devices and if we run out of stripe heads, it could end up
> > waiting for one of those requests to complete.
> > This is bad as recursive calls to generic_make_request go on a queue
> > and are not even attempted until make_request completes.
> > 
> > So: don't make any generic_make_request calls in raid5 make_request
> > until all waiting has been done.  We do this by simply setting
> > STRIPE_HANDLE instead of calling handle_stripe().
> > 
> > If we need more stripe_heads, raid5d will get called to process the
> > pending stripe_heads which will call generic_make_request from a
> > different thread where no deadlock will happen.
> > 
> > 
> > This change by itself causes a performance hit.  So add a change so
> > that raid5_activate_delayed is only called at unplug time, never in
> > raid5.  This seems to bring back the performance numbers.  Calling it
> > in raid5d was sometimes too soon...
> > 
> > Cc: "Dan Williams" <dan.j.williams@gmail.com>
> > Signed-off-by: Neil Brown <neilb@suse.de>
> 
> probably doesn't matter, but for the record:
> 
> Tested-by: dean gaudet <dean@arctic.org>
> 
> this time i tested with internal and external bitmaps and it survived 8h 
> and 14h resp. under the parallel tar workload i used to reproduce the 
> hang.
> 
> btw this should probably be a candidate for 2.6.22 and .23 stable.
> 

hm, Neil said

  The first fixes a bug which could make it a candidate for 24-final. 
  However it is a deadlock that seems to occur very rarely, and has been in
  mainline since 2.6.22.  So letting it into one more release shouldn't be
  a big problem.  While the fix is fairly simple, it could have some
  unexpected consequences, so I'd rather go for the next cycle.

food fight!


* Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-16  5:54     ` Andrew Morton
@ 2008-01-16  6:13       ` dean gaudet
  2008-01-16  7:09         ` Dan Williams
  0 siblings, 1 reply; 20+ messages in thread
From: dean gaudet @ 2008-01-16  6:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: NeilBrown, linux-raid, linux-kernel, Dan Williams

On Tue, 15 Jan 2008, Andrew Morton wrote:

> On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet <dean@arctic.org> wrote:
> 
> > On Mon, 14 Jan 2008, NeilBrown wrote:
> > 
> > > 
> > > raid5's 'make_request' function calls generic_make_request on
> > > underlying devices and if we run out of stripe heads, it could end up
> > > waiting for one of those requests to complete.
> > > This is bad as recursive calls to generic_make_request go on a queue
> > > and are not even attempted until make_request completes.
> > > 
> > > So: don't make any generic_make_request calls in raid5 make_request
> > > until all waiting has been done.  We do this by simply setting
> > > STRIPE_HANDLE instead of calling handle_stripe().
> > > 
> > > If we need more stripe_heads, raid5d will get called to process the
> > > pending stripe_heads which will call generic_make_request from a
> > > different thread where no deadlock will happen.
> > > 
> > > 
> > > This change by itself causes a performance hit.  So add a change so
> > > that raid5_activate_delayed is only called at unplug time, never in
> > > raid5.  This seems to bring back the performance numbers.  Calling it
> > > in raid5d was sometimes too soon...
> > > 
> > > Cc: "Dan Williams" <dan.j.williams@gmail.com>
> > > Signed-off-by: Neil Brown <neilb@suse.de>
> > 
> > probably doesn't matter, but for the record:
> > 
> > Tested-by: dean gaudet <dean@arctic.org>
> > 
> > this time i tested with internal and external bitmaps and it survived 8h 
> > and 14h resp. under the parallel tar workload i used to reproduce the 
> > hang.
> > 
> > btw this should probably be a candidate for 2.6.22 and .23 stable.
> > 
> 
> hm, Neil said
> 
>   The first fixes a bug which could make it a candidate for 24-final. 
>   However it is a deadlock that seems to occur very rarely, and has been in
>   mainline since 2.6.22.  So letting it into one more release shouldn't be
>   a big problem.  While the fix is fairly simple, it could have some
>   unexpected consequences, so I'd rather go for the next cycle.
> 
> food fight!
> 

heheh.

it's really easy to reproduce the hang without the patch -- i could
hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
i'll try with ext3... Dan's experiences suggest it won't happen with ext3
(or is even more rare), which would explain why this is overall a
rare problem.

but it doesn't result in data loss or permanent system hangups as long
as you can become root and raise the size of the stripe cache...

so OK i agree with Neil, let's test more... food fight over! :)

-dean


* Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-16  6:13       ` dean gaudet
@ 2008-01-16  7:09         ` Dan Williams
  2008-01-16  7:15           ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Dan Williams @ 2008-01-16  7:09 UTC (permalink / raw)
  To: dean gaudet; +Cc: Andrew Morton, NeilBrown, linux-raid, linux-kernel

> heheh.
>
> it's really easy to reproduce the hang without the patch -- i could
> hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> (or is even more rare), which would explain why this is overall a
> rare problem.
>

Hmmm... how rare?

http://marc.info/?l=linux-kernel&m=119461747005776&w=2

There is nothing specific that prevents other filesystems from hitting
it, perhaps XFS is just better at submitting large i/o's.  -stable
should get some kind of treatment.  I'll take altered performance over
a hung system.

--
Dan


* Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-16  7:09         ` Dan Williams
@ 2008-01-16  7:15           ` Andrew Morton
  2008-01-16 21:54             ` Neil Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2008-01-16  7:15 UTC (permalink / raw)
  To: Dan Williams; +Cc: dean gaudet, NeilBrown, linux-raid, linux-kernel

On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <dan.j.williams@intel.com> wrote:

> > heheh.
> >
> > it's really easy to reproduce the hang without the patch -- i could
> > hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> > i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> > (or is even more rare), which would explain why this is overall a
> > rare problem.
> >
> 
> Hmmm... how rare?
> 
> http://marc.info/?l=linux-kernel&m=119461747005776&w=2
> 
> There is nothing specific that prevents other filesystems from hitting
> it, perhaps XFS is just better at submitting large i/o's.  -stable
> should get some kind of treatment.  I'll take altered performance over
> a hung system.

We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
wimpy.



* Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
  2008-01-16  7:15           ` Andrew Morton
@ 2008-01-16 21:54             ` Neil Brown
  0 siblings, 0 replies; 20+ messages in thread
From: Neil Brown @ 2008-01-16 21:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dan Williams, dean gaudet, linux-raid, linux-kernel

On Tuesday January 15, akpm@linux-foundation.org wrote:
> On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <dan.j.williams@intel.com> wrote:
> 
> > > heheh.
> > >
> > > it's really easy to reproduce the hang without the patch -- i could
> > > hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> > > i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> > > (or is even more rare), which would explain why this is overall a
> > > rare problem.
> > >
> > 
> > Hmmm... how rare?
> > 
> > http://marc.info/?l=linux-kernel&m=119461747005776&w=2
> > 
> > There is nothing specific that prevents other filesystems from hitting
> > it, perhaps XFS is just better at submitting large i/o's.  -stable
> > should get some kind of treatment.  I'll take altered performance over
> > a hung system.
> 
> We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
> wimpy.

I am feeling wimpy.  There've been a few too many raid5 breakages
recently, and it is very hard to really judge the performance impact of
this change.  I even have some small uncertainty about correctness -
could it still hang in some other way?  I don't think so, but this is
complex code...

If it were really common I would have expected more noise on the
mailing list.  Sure, there has been some, but not much.  However maybe
people are searching the archives and finding the "increase stripe
cache size" trick, and not reporting anything .... seems unlikely
though.

How about we queue it for 2.6.25-rc1 and then about when -rc2 comes
out, we queue it for 2.6.24.y?  Anyone (or any distro) that really
needs it can of course grab the patch themselves...

??

NeilBrown


end of thread, other threads:[~2008-01-16 21:54 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-14  1:45 [PATCH 000 of 6] md: various fixes for md NeilBrown
2008-01-14  1:45 ` [PATCH 001 of 6] md: Fix an occasional deadlock in raid5 NeilBrown
2008-01-16  5:01   ` dean gaudet
2008-01-16  5:54     ` Andrew Morton
2008-01-16  6:13       ` dean gaudet
2008-01-16  7:09         ` Dan Williams
2008-01-16  7:15           ` Andrew Morton
2008-01-16 21:54             ` Neil Brown
2008-01-14  1:45 ` [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array NeilBrown
2008-01-14  2:04   ` Al Viro
2008-01-14  3:21     ` Neil Brown
2008-01-14  3:43       ` Al Viro
2008-01-14  4:48         ` Neil Brown
2008-01-14  6:28           ` Neil Brown
2008-01-14 12:59             ` Al Viro
2008-01-14 13:56               ` Al Viro
2008-01-14  1:45 ` [PATCH 003 of 6] md: Change a few 'int' to 'size_t' in md NeilBrown
2008-01-14  1:45 ` [PATCH 004 of 6] md: Change INTERATE_MDDEV to for_each_mddev NeilBrown
2008-01-14  1:45 ` [PATCH 005 of 6] md: Change ITERATE_RDEV to rdev_for_each NeilBrown
2008-01-14  1:45 ` [PATCH 006 of 6] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING NeilBrown
