linux-kernel.vger.kernel.org archive mirror
* [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE
@ 2008-03-03  0:16 NeilBrown
  2008-03-03  0:17 ` [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error NeilBrown
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Keld Simonsen, K.Tanaka

Following are 9 patches for md in 2.6.25-rc2-mm1.  All are fairly
simple bugfixes and are suitable for 2.6.25 (though I confess I
should have sent some of them earlier :-( ).

They include fixes for three bugs found by the scsi fault injection
framework (thanks!), a small raid10 read optimisation, and various
other bits and pieces.

NeilBrown


 [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
 [PATCH 002 of 9] md: Reduce CPU wastage on idle md array with a write-intent bitmap.
 [PATCH 003 of 9] md: Guard against possible bad array geometry in v1 metadata.
 [PATCH 004 of 9] md: Clean up irregularity with raid autodetect.
 [PATCH 005 of 9] md: Make sure a reshape is started when device switches to read-write.
 [PATCH 006 of 9] md: Lock access to rdev attributes properly
 [PATCH 007 of 9] md: Don't attempt read-balancing for raid10 'far' layouts.
 [PATCH 008 of 9] md: Fix possible raid1/raid10 deadlock on read error during resync.
 [PATCH 009 of 9] md: The md RAID10 resync thread could cause a md RAID10 array deadlock

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03 15:54   ` Andre Noll
  2008-03-03  0:17 ` [PATCH 002 of 9] md: Reduce CPU wastage on idle md array with a write-intent bitmap NeilBrown
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, K.Tanaka


When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d(raid10d) thread and must wait for all
submitted IO to complete (except for requests that failed and are
sitting in the retry queue - these are counted in ->nr_queued and will
stay there during a freeze).

However write requests need attention from raid1d as bitmap updates
might be required.  This can cause a deadlock, as raid1d is waiting
for requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting
this problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c  |   62 +++++++++++++++++++++++++++++++++-----------------
 ./drivers/md/raid10.c |   60 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 80 insertions(+), 42 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2008-02-22 15:45:35.000000000 +1100
+++ ./drivers/md/raid10.c	2008-02-22 15:45:35.000000000 +1100
@@ -629,7 +629,36 @@ static int raid10_congested(void *data, 
 	return ret;
 }
 
+static int flush_pending_writes(conf_t *conf)
+{
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 * We return 1 if any requests were actually submitted.
+	 */
+	int rv = 0;
+
+	spin_lock_irq(&conf->device_lock);
 
+	if (conf->pending_bio_list.head) {
+		struct bio *bio;
+		bio = bio_list_get(&conf->pending_bio_list);
+		blk_remove_plug(conf->mddev->queue);
+		spin_unlock_irq(&conf->device_lock);
+		/* flush any pending bitmap writes to disk
+		 * before proceeding w/ I/O */
+		bitmap_unplug(conf->mddev->bitmap);
+
+		while (bio) { /* submit pending writes */
+			struct bio *next = bio->bi_next;
+			bio->bi_next = NULL;
+			generic_make_request(bio);
+			bio = next;
+		}
+		rv = 1;
+	} else
+		spin_unlock_irq(&conf->device_lock);
+	return rv;
+}
 /* Barriers....
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -720,7 +749,8 @@ static void freeze_array(conf_t *conf)
 	wait_event_lock_irq(conf->wait_barrier,
 			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
 			    conf->resync_lock,
-			    raid10_unplug(conf->mddev->queue));
+			    ({ flush_pending_writes(conf);
+			       raid10_unplug(conf->mddev->queue); }));
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -892,6 +922,9 @@ static int make_request(struct request_q
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
+	/* In case raid10d snuck in to freeze_array */
+	wake_up(&conf->wait_barrier);
+
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
 
@@ -1464,28 +1497,14 @@ static void raid10d(mddev_t *mddev)
 
 	for (;;) {
 		char b[BDEVNAME_SIZE];
-		spin_lock_irqsave(&conf->device_lock, flags);
-
-		if (conf->pending_bio_list.head) {
-			bio = bio_list_get(&conf->pending_bio_list);
-			blk_remove_plug(mddev->queue);
-			spin_unlock_irqrestore(&conf->device_lock, flags);
-			/* flush any pending bitmap writes to disk before proceeding w/ I/O */
-			bitmap_unplug(mddev->bitmap);
 
-			while (bio) { /* submit pending writes */
-				struct bio *next = bio->bi_next;
-				bio->bi_next = NULL;
-				generic_make_request(bio);
-				bio = next;
-			}
-			unplug = 1;
+		unplug += flush_pending_writes(conf);
 
-			continue;
-		}
-
-		if (list_empty(head))
+		spin_lock_irqsave(&conf->device_lock, flags);
+		if (list_empty(head)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			break;
+		}
 		r10_bio = list_entry(head->prev, r10bio_t, retry_list);
 		list_del(head->prev);
 		conf->nr_queued--;
@@ -1548,7 +1567,6 @@ static void raid10d(mddev_t *mddev)
 			}
 		}
 	}
-	spin_unlock_irqrestore(&conf->device_lock, flags);
 	if (unplug)
 		unplug_slaves(mddev);
 }

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2008-02-22 15:45:35.000000000 +1100
+++ ./drivers/md/raid1.c	2008-02-22 15:45:35.000000000 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
 }
 
 
+static int flush_pending_writes(conf_t *conf)
+{
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 * We return 1 if any requests were actually submitted.
+	 */
+	int rv = 0;
+
+	spin_lock_irq(&conf->device_lock);
+
+	if (conf->pending_bio_list.head) {
+		struct bio *bio;
+		bio = bio_list_get(&conf->pending_bio_list);
+		blk_remove_plug(conf->mddev->queue);
+		spin_unlock_irq(&conf->device_lock);
+		/* flush any pending bitmap writes to
+		 * disk before proceeding w/ I/O */
+		bitmap_unplug(conf->mddev->bitmap);
+
+		while (bio) { /* submit pending writes */
+			struct bio *next = bio->bi_next;
+			bio->bi_next = NULL;
+			generic_make_request(bio);
+			bio = next;
+		}
+		rv = 1;
+	} else
+		spin_unlock_irq(&conf->device_lock);
+	return rv;
+}
+
 /* Barriers....
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -681,7 +712,8 @@ static void freeze_array(conf_t *conf)
 	wait_event_lock_irq(conf->wait_barrier,
 			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
 			    conf->resync_lock,
-			    raid1_unplug(conf->mddev->queue));
+			    ({ flush_pending_writes(conf);
+			       raid1_unplug(conf->mddev->queue); }));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -907,6 +939,9 @@ static int make_request(struct request_q
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
+	/* In case raid1d snuck into freeze_array */
+	wake_up(&conf->wait_barrier);
+
 	if (do_sync)
 		md_wakeup_thread(mddev->thread);
 #if 0
@@ -1473,28 +1508,14 @@ static void raid1d(mddev_t *mddev)
 	
 	for (;;) {
 		char b[BDEVNAME_SIZE];
-		spin_lock_irqsave(&conf->device_lock, flags);
 
-		if (conf->pending_bio_list.head) {
-			bio = bio_list_get(&conf->pending_bio_list);
-			blk_remove_plug(mddev->queue);
-			spin_unlock_irqrestore(&conf->device_lock, flags);
-			/* flush any pending bitmap writes to disk before proceeding w/ I/O */
-			bitmap_unplug(mddev->bitmap);
-
-			while (bio) { /* submit pending writes */
-				struct bio *next = bio->bi_next;
-				bio->bi_next = NULL;
-				generic_make_request(bio);
-				bio = next;
-			}
-			unplug = 1;
-
-			continue;
-		}
+		unplug += flush_pending_writes(conf);
 
-		if (list_empty(head))
+		spin_lock_irqsave(&conf->device_lock, flags);
+		if (list_empty(head)) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			break;
+		}
 		r1_bio = list_entry(head->prev, r1bio_t, retry_list);
 		list_del(head->prev);
 		conf->nr_queued--;
@@ -1590,7 +1611,6 @@ static void raid1d(mddev_t *mddev)
 			}
 		}
 	}
-	spin_unlock_irqrestore(&conf->device_lock, flags);
 	if (unplug)
 		unplug_slaves(mddev);
 }


* [PATCH 002 of 9] md: Reduce CPU wastage on idle md array with a write-intent bitmap.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
  2008-03-03  0:17 ` [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:17 ` [PATCH 003 of 9] md: Guard against possible bad array geometry in v1 metadata NeilBrown
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


On an md array with a write-intent bitmap, a thread wakes up every few
seconds and scans the bitmap looking for work to do.  If the
array is idle, there will be no work to do, but a lot of scanning is
done to discover this.

So cache the fact that the bitmap is completely clean, and avoid
scanning the whole bitmap when the cache is known to be clean.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/bitmap.c         |   19 +++++++++++++++++--
 ./include/linux/raid/bitmap.h |    2 ++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2008-02-22 15:45:56.000000000 +1100
+++ ./drivers/md/bitmap.c	2008-02-22 15:45:56.000000000 +1100
@@ -1047,6 +1047,11 @@ void bitmap_daemon_work(struct bitmap *b
 	if (time_before(jiffies, bitmap->daemon_lastrun + bitmap->daemon_sleep*HZ))
 		return;
 	bitmap->daemon_lastrun = jiffies;
+	if (bitmap->allclean) {
+		bitmap->mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+		return;
+	}
+	bitmap->allclean = 1;
 
 	for (j = 0; j < bitmap->chunks; j++) {
 		bitmap_counter_t *bmc;
@@ -1068,8 +1073,10 @@ void bitmap_daemon_work(struct bitmap *b
 					clear_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);
 
 				spin_unlock_irqrestore(&bitmap->lock, flags);
-				if (need_write)
+				if (need_write) {
 					write_page(bitmap, page, 0);
+					bitmap->allclean = 0;
+				}
 				continue;
 			}
 
@@ -1098,6 +1105,9 @@ void bitmap_daemon_work(struct bitmap *b
 /*
   if (j < 100) printk("bitmap: j=%lu, *bmc = 0x%x\n", j, *bmc);
 */
+			if (*bmc)
+				bitmap->allclean = 0;
+
 			if (*bmc == 2) {
 				*bmc=1; /* maybe clear the bit next time */
 				set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
@@ -1132,6 +1142,8 @@ void bitmap_daemon_work(struct bitmap *b
 		}
 	}
 
+	if (bitmap->allclean == 0)
+		bitmap->mddev->thread->timeout = bitmap->daemon_sleep * HZ;
 }
 
 static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
@@ -1226,6 +1238,7 @@ int bitmap_startwrite(struct bitmap *bit
 			sectors -= blocks;
 		else sectors = 0;
 	}
+	bitmap->allclean = 0;
 	return 0;
 }
 
@@ -1296,6 +1309,7 @@ int bitmap_start_sync(struct bitmap *bit
 		}
 	}
 	spin_unlock_irq(&bitmap->lock);
+	bitmap->allclean = 0;
 	return rv;
 }
 
@@ -1332,6 +1346,7 @@ void bitmap_end_sync(struct bitmap *bitm
 	}
  unlock:
 	spin_unlock_irqrestore(&bitmap->lock, flags);
+	bitmap->allclean = 0;
 }
 
 void bitmap_close_sync(struct bitmap *bitmap)
@@ -1399,7 +1414,7 @@ static void bitmap_set_memory_bits(struc
 		set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 	}
 	spin_unlock_irq(&bitmap->lock);
-
+	bitmap->allclean = 0;
 }
 
 /* dirty the memory and file bits for bitmap chunks "s" to "e" */

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h	2008-02-22 15:45:56.000000000 +1100
+++ ./include/linux/raid/bitmap.h	2008-02-22 15:45:56.000000000 +1100
@@ -235,6 +235,8 @@ struct bitmap {
 
 	unsigned long flags;
 
+	int allclean;
+
 	unsigned long max_write_behind; /* write-behind mode */
 	atomic_t behind_writes;
 


* [PATCH 003 of 9] md: Guard against possible bad array geometry in v1 metadata.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
  2008-03-03  0:17 ` [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error NeilBrown
  2008-03-03  0:17 ` [PATCH 002 of 9] md: Reduce CPU wastage on idle md array with a write-intent bitmap NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:17 ` [PATCH 004 of 9] md: Clean up irregularity with raid autodetect NeilBrown
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Make sure the data doesn't start before the end of the superblock
when the superblock is at the start of the device.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-02-22 15:46:10.000000000 +1100
+++ ./drivers/md/md.c	2008-02-22 15:46:10.000000000 +1100
@@ -1105,7 +1105,11 @@ static int super_1_load(mdk_rdev_t *rdev
 	rdev->sb_size = le32_to_cpu(sb->max_dev) * 2 + 256;
 	bmask = queue_hardsect_size(rdev->bdev->bd_disk->queue)-1;
 	if (rdev->sb_size & bmask)
-		rdev-> sb_size = (rdev->sb_size | bmask)+1;
+		rdev->sb_size = (rdev->sb_size | bmask) + 1;
+
+	if (minor_version
+	    && rdev->data_offset < sb_offset + (rdev->sb_size/512))
+		return -EINVAL;
 
 	if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
 		rdev->desc_nr = -1;
@@ -1137,7 +1141,7 @@ static int super_1_load(mdk_rdev_t *rdev
 		else
 			ret = 0;
 	}
-	if (minor_version) 
+	if (minor_version)
 		rdev->size = ((rdev->bdev->bd_inode->i_size>>9) - le64_to_cpu(sb->data_offset)) / 2;
 	else
 		rdev->size = rdev->sb_offset;


* [PATCH 004 of 9] md: Clean up irregularity with raid autodetect.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (2 preceding siblings ...)
  2008-03-03  0:17 ` [PATCH 003 of 9] md: Guard against possible bad array geometry in v1 metadata NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:17 ` [PATCH 005 of 9] md: Make sure a reshape is started when device switches to read-write NeilBrown
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


When a raid1 array is stopped, all components currently get added
to the list for auto-detection.  However we should really only add
components that were found by autodetection in the first place.
So add a flag to record that information, and use it.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |    4 +++-
 ./include/linux/raid/md_k.h |    1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-02-22 15:46:10.000000000 +1100
+++ ./drivers/md/md.c	2008-02-22 15:46:25.000000000 +1100
@@ -1503,7 +1503,8 @@ static void export_rdev(mdk_rdev_t * rde
 	free_disk_sb(rdev);
 	list_del_init(&rdev->same_set);
 #ifndef MODULE
-	md_autodetect_dev(rdev->bdev->bd_dev);
+	if (test_bit(AutoDetected, &rdev->flags))
+		md_autodetect_dev(rdev->bdev->bd_dev);
 #endif
 	unlock_rdev(rdev);
 	kobject_put(&rdev->kobj);
@@ -6025,6 +6026,7 @@ static void autostart_arrays(int part)
 			MD_BUG();
 			continue;
 		}
+		set_bit(AutoDetected, &rdev->flags);
 		list_add(&rdev->same_set, &pending_raid_disks);
 		i_passed++;
 	}

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-02-22 15:46:25.000000000 +1100
+++ ./include/linux/raid/md_k.h	2008-02-22 15:46:25.000000000 +1100
@@ -83,6 +83,7 @@ struct mdk_rdev_s
 #define	BarriersNotsupp	5		/* BIO_RW_BARRIER is not supported */
 #define	AllReserved	6		/* If whole device is reserved for
 					 * one array */
+#define	AutoDetected	7		/* added by auto-detect */
 
 	int desc_nr;			/* descriptor index in the superblock */
 	int raid_disk;			/* role of device in array */


* [PATCH 005 of 9] md: Make sure a reshape is started when device switches to read-write.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (3 preceding siblings ...)
  2008-03-03  0:17 ` [PATCH 004 of 9] md: Clean up irregularity with raid autodetect NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:17 ` [PATCH 006 of 9] md: Lock access to rdev attributes properly NeilBrown
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


A resync/reshape/recovery thread will refuse to progress when the
array is marked read-only.  So whenever the array is marked
read-write, it is important to wake up the resync thread.
There is one place where we didn't do this.

The problem manifests if the start_ro module parameter is set, and a
raid5 array that is in the middle of a reshape (restripe) is started.
The array will initially be semi-read-only (meaning it acts like it is
readonly until the first write).  So the reshape will not proceed.

On the first write, the array will become read-write, but the reshape
will not be started, and there is no event which will ever restart
that thread.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-02-22 15:46:25.000000000 +1100
+++ ./drivers/md/md.c	2008-02-22 15:46:52.000000000 +1100
@@ -5356,6 +5356,7 @@ void md_write_start(mddev_t *mddev, stru
 		mddev->ro = 0;
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 		md_wakeup_thread(mddev->thread);
+		md_wakeup_thread(mddev->sync_thread);
 	}
 	atomic_inc(&mddev->writes_pending);
 	if (mddev->in_sync) {


* [PATCH 006 of 9] md: Lock access to rdev attributes properly
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (4 preceding siblings ...)
  2008-03-03  0:17 ` [PATCH 005 of 9] md: Make sure a reshape is started when device switches to read-write NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:17 ` [PATCH 007 of 9] md: Don't attempt read-balancing for raid10 'far' layouts NeilBrown
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


When we access attributes of an rdev (component device on an md array)
through sysfs, we really need to lock the array against concurrent
changes.  We currently do that when we change an attribute, but not
when we read an attribute.  We need to lock when reading as well, or
rdev->mddev could become NULL while we are accessing it.

So add appropriate locking (mddev_lock) to rdev_attr_show.

rdev_size_store requires some extra care as well, as it needs to
unlock the mddev while scanning other mddevs for overlapping regions.
We currently assume that rdev->mddev will still be unchanged after the
scan, but that cannot be guaranteed.  So take a copy of rdev->mddev
for use at the end of the function.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-02-22 15:46:52.000000000 +1100
+++ ./drivers/md/md.c	2008-02-22 15:56:20.000000000 +1100
@@ -2001,9 +2001,11 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 	char *e;
 	unsigned long long size = simple_strtoull(buf, &e, 10);
 	unsigned long long oldsize = rdev->size;
+	mddev_t *my_mddev = rdev->mddev;
+
 	if (e==buf || (*e && *e != '\n'))
 		return -EINVAL;
-	if (rdev->mddev->pers)
+	if (my_mddev->pers)
 		return -EBUSY;
 	rdev->size = size;
 	if (size > oldsize && rdev->mddev->external) {
@@ -2016,7 +2018,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 		int overlap = 0;
 		struct list_head *tmp, *tmp2;
 
-		mddev_unlock(rdev->mddev);
+		mddev_unlock(my_mddev);
 		for_each_mddev(mddev, tmp) {
 			mdk_rdev_t *rdev2;
 
@@ -2036,7 +2038,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 				break;
 			}
 		}
-		mddev_lock(rdev->mddev);
+		mddev_lock(my_mddev);
 		if (overlap) {
 			/* Someone else could have slipped in a size
 			 * change here, but doing so is just silly.
@@ -2048,8 +2050,8 @@ rdev_size_store(mdk_rdev_t *rdev, const 
 			return -EBUSY;
 		}
 	}
-	if (size < rdev->mddev->size || rdev->mddev->size == 0)
-		rdev->mddev->size = size;
+	if (size < my_mddev->size || my_mddev->size == 0)
+		my_mddev->size = size;
 	return len;
 }
 
@@ -2070,10 +2072,21 @@ rdev_attr_show(struct kobject *kobj, str
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+	mddev_t *mddev = rdev->mddev;
+	ssize_t rv;
 
 	if (!entry->show)
 		return -EIO;
-	return entry->show(rdev, page);
+
+	rv = mddev ? mddev_lock(mddev) : -EBUSY;
+	if (!rv) {
+		if (rdev->mddev == NULL)
+			rv = -EBUSY;
+		else
+			rv = entry->show(rdev, page);
+		mddev_unlock(mddev);
+	}
+	return rv;
 }
 
 static ssize_t
@@ -2082,15 +2095,19 @@ rdev_attr_store(struct kobject *kobj, st
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
-	int rv;
+	ssize_t rv;
+	mddev_t *mddev = rdev->mddev;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
-	rv = mddev_lock(rdev->mddev);
+	rv = mddev ? mddev_lock(mddev): -EBUSY;
 	if (!rv) {
-		rv = entry->store(rdev, page, length);
+		if (rdev->mddev == NULL)
+			rv = -EBUSY;
+		else
+			rv = entry->store(rdev, page, length);
 		mddev_unlock(rdev->mddev);
 	}
 	return rv;


* [PATCH 007 of 9] md: Don't attempt read-balancing for raid10 'far' layouts.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (5 preceding siblings ...)
  2008-03-03  0:17 ` [PATCH 006 of 9] md: Lock access to rdev attributes properly NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:17 ` [PATCH 008 of 9] md: Fix possible raid1/raid10 deadlock on read error during resync NeilBrown
  2008-03-03  0:18 ` [PATCH 009 of 9] md: The md RAID10 resync thread could cause a md RAID10 array deadlock NeilBrown
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Keld Simonsen


From: Keld Simonsen <keld@dkuug.dk>

This patch changes read selection for layout "far > 1" to always use
the disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array)
from the first band of stripes, and the raid will then work as a raid0
consisting of the first band of stripes.

Some advantages:

The fastest part of the disks involved, the outer sectors, will be used.
The outer blocks of a disk may be as much as 100% faster than the inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better,
as they will work as raid0; the sequential read rate will be the number
of disks involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the first working disk with the lowest
block address for the logical block will be used.

Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2008-03-03 09:35:29.000000000 +1100
+++ ./drivers/md/raid10.c	2008-03-03 09:35:52.000000000 +1100
@@ -537,7 +537,8 @@ static int read_balance(conf_t *conf, r1
 	current_distance = abs(r10_bio->devs[slot].addr -
 			       conf->mirrors[disk].head_position);
 
-	/* Find the disk whose head is closest */
+	/* Find the disk whose head is closest,
+	 * or - for far > 1 - find the closest to partition beginning */
 
 	for (nslot = slot; nslot < conf->copies; nslot++) {
 		int ndisk = r10_bio->devs[nslot].devnum;
@@ -557,8 +558,13 @@ static int read_balance(conf_t *conf, r1
 			slot = nslot;
 			break;
 		}
-		new_distance = abs(r10_bio->devs[nslot].addr -
-				   conf->mirrors[ndisk].head_position);
+
+		/* for far > 1 always use the lowest address */
+		if (conf->far_copies > 1)
+			new_distance = r10_bio->devs[nslot].addr;
+		else
+			new_distance = abs(r10_bio->devs[nslot].addr -
+					   conf->mirrors[ndisk].head_position);
 		if (new_distance < current_distance) {
 			current_distance = new_distance;
 			disk = ndisk;


* [PATCH 008 of 9] md: Fix possible raid1/raid10 deadlock on read error during resync.
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (6 preceding siblings ...)
  2008-03-03  0:17 ` [PATCH 007 of 9] md: Don't attempt read-balancing for raid10 'far' layouts NeilBrown
@ 2008-03-03  0:17 ` NeilBrown
  2008-03-03  0:18 ` [PATCH 009 of 9] md: The md RAID10 resync thread could cause a md RAID10 array deadlock NeilBrown
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, K.Tanaka


Thanks to K.Tanaka and the scsi fault injection framework, here is
a fix for another possible deadlock in raid1/raid10 error handling.

If a read request returns an error while a resync is happening and
a resync request is pending, the attempt to fix the error will block
until the resync progresses, and the resync will block until the
read request completes.  Thus a deadlock.

This patch fixes the problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c  |   11 +++++++++--
 ./drivers/md/raid10.c |   11 +++++++++--
 2 files changed, 18 insertions(+), 4 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2008-03-03 11:03:39.000000000 +1100
+++ ./drivers/md/raid10.c	2008-03-03 09:56:53.000000000 +1100
@@ -747,13 +747,20 @@ static void freeze_array(conf_t *conf)
 	/* stop syncio and normal IO and wait for everything to
 	 * go quiet.
 	 * We increment barrier and nr_waiting, and then
-	 * wait until barrier+nr_pending match nr_queued+2
+	 * wait until nr_pending match nr_queued+1
+	 * This is called in the context of one normal IO request
+	 * that has failed. Thus any sync request that might be pending
+	 * will be blocked by nr_pending, and we need to wait for
+	 * pending IO requests to complete or be queued for re-try.
+	 * Thus the number queued (nr_queued) plus this request (1)
+	 * must match the number of pending IOs (nr_pending) before
+	 * we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	conf->nr_waiting++;
 	wait_event_lock_irq(conf->wait_barrier,
-			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
+			    conf->nr_pending == conf->nr_queued+1,
 			    conf->resync_lock,
 			    ({ flush_pending_writes(conf);
 			       raid10_unplug(conf->mddev->queue); }));

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2008-03-03 11:03:39.000000000 +1100
+++ ./drivers/md/raid1.c	2008-03-03 09:56:52.000000000 +1100
@@ -704,13 +704,20 @@ static void freeze_array(conf_t *conf)
 	/* stop syncio and normal IO and wait for everything to
 	 * go quite.
 	 * We increment barrier and nr_waiting, and then
-	 * wait until barrier+nr_pending match nr_queued+2
+	 * wait until nr_pending match nr_queued+1
+	 * This is called in the context of one normal IO request
+	 * that has failed. Thus any sync request that might be pending
+	 * will be blocked by nr_pending, and we need to wait for
+	 * pending IO requests to complete or be queued for re-try.
+	 * Thus the number queued (nr_queued) plus this request (1)
+	 * must match the number of pending IOs (nr_pending) before
+	 * we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->barrier++;
 	conf->nr_waiting++;
 	wait_event_lock_irq(conf->wait_barrier,
-			    conf->barrier+conf->nr_pending == conf->nr_queued+2,
+			    conf->nr_pending == conf->nr_queued+1,
 			    conf->resync_lock,
 			    ({ flush_pending_writes(conf);
 			       raid1_unplug(conf->mddev->queue); }));


* [PATCH 009 of 9] md: The md RAID10 resync thread could cause a md RAID10 array deadlock
  2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (7 preceding siblings ...)
  2008-03-03  0:17 ` [PATCH 008 of 9] md: Fix possible raid1/raid10 deadlock on read error during resync NeilBrown
@ 2008-03-03  0:18 ` NeilBrown
  8 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2008-03-03  0:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, K.Tanaka


From: "K.Tanaka" <k-tanaka@ce.jp.nec.com>

This message describes another issue with md RAID10, found by testing
the 2.6.24 md RAID10 using the new scsi fault injection framework.

Abstract:
When a scsi error results in disabling a disk during RAID10 recovery,
the resync threads of md RAID10 could stall.
In this case the raid array has already been broken, so it may not
matter.  But I think a stall is not preferable.  If it occurs, even
shutdown or reboot will fail because resources are busy.

The deadlock mechanism:
The r10bio_s structure has a "remaining" member to keep track of BIOs
yet to be handled during recovery.  The "remaining" counter is
incremented when building a BIO in sync_request() and decremented when
finishing a BIO in end_sync_write().

If building a BIO fails for some reason in sync_request(), the
"remaining" counter should be decremented if it has already been
incremented.  I found a case where this decrement is forgotten.  This
causes a md_do_sync() deadlock: md_do_sync() waits for md_done_sync()
to be called from end_sync_write(), but end_sync_write() never calls
md_done_sync() because of the "remaining" counter mismatch.

For example, the problem can be reproduced in the following situation:

Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
      3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
      [>....................]  recovery =  2.2% (45376/1959808) finish=0.7min speed=45376K/sec

In this case sdf1 is recovering, while sdb1 and sde1 have been disabled.
An additional error that detaches sdd1 causes the deadlock:

md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
      3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
      [=>...................]  recovery =  5.0% (99520/1959808) finish=5.9min speed=5237K/sec

 2739 ?        S<     0:17 [md0_raid10]
28608 ?        D<     0:00 [md0_resync]
28629 pts/1    Ss     0:00 bash
28830 pts/1    R+     0:00 ps ax
31819 ?        D<     0:00 [kjournald]

The resync thread appears to keep working, but it is actually deadlocked.

Patch:
With this patch, the "remaining" counter is decremented on the failure
path as well.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c |    2 ++
 1 file changed, 2 insertions(+)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2008-03-03 09:56:53.000000000 +1100
+++ ./drivers/md/raid10.c	2008-03-03 11:08:28.000000000 +1100
@@ -1818,6 +1818,8 @@ static sector_t sync_request(mddev_t *md
 				if (j == conf->copies) {
 					/* Cannot recover, so abort the recovery */
 					put_buf(r10_bio);
+					if (rb2)
+						atomic_dec(&rb2->remaining);
 					r10_bio = rb2;
 					if (!test_and_set_bit(MD_RECOVERY_ERR, &mddev->recovery))
 						printk(KERN_INFO "raid10: %s: insufficient working devices for recovery.\n",


* Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
  2008-03-03  0:17 ` [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error NeilBrown
@ 2008-03-03 15:54   ` Andre Noll
  2008-03-04  6:08     ` Neil Brown
  0 siblings, 1 reply; 15+ messages in thread
From: Andre Noll @ 2008-03-03 15:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel, K.Tanaka


On 11:17, NeilBrown wrote:

> So we create a new function 'flush_pending_writes' to give that attention,
> and call it in freeze_array to be sure that we aren't waiting on raid1d.

Two minor remarks:

> +static int flush_pending_writes(conf_t *conf)
> +{
> +	/* Any writes that have been queued but are awaiting
> +	 * bitmap updates get flushed here.
> +	 * We return 1 if any requests were actually submitted.
> +	 */
> +	int rv = 0;
> +
> +	spin_lock_irq(&conf->device_lock);
>  
> +	if (conf->pending_bio_list.head) {
> +		struct bio *bio;
> +		bio = bio_list_get(&conf->pending_bio_list);
> +		blk_remove_plug(conf->mddev->queue);
> +		spin_unlock_irq(&conf->device_lock);
> +		/* flush any pending bitmap writes to disk
> +		 * before proceeding w/ I/O */
> +		bitmap_unplug(conf->mddev->bitmap);
> +
> +		while (bio) { /* submit pending writes */
> +			struct bio *next = bio->bi_next;
> +			bio->bi_next = NULL;
> +			generic_make_request(bio);
> +			bio = next;
> +		}
> +		rv = 1;
> +	} else
> +		spin_unlock_irq(&conf->device_lock);
> +	return rv;
> +}

Do we really need to take the spin lock in the common case where
conf->pending_bio_list.head is NULL? If not, the above could be
optimized to the slightly faster and more readable

	struct bio *bio;

	if (!conf->pending_bio_list.head)
		return 0;
	spin_lock_irq(&conf->device_lock);
	bio = bio_list_get(&conf->pending_bio_list);
	...
	spin_unlock_irq(&conf->device_lock);
	return 1;


> diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> --- .prev/drivers/md/raid1.c	2008-02-22 15:45:35.000000000 +1100
> +++ ./drivers/md/raid1.c	2008-02-22 15:45:35.000000000 +1100
> @@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
>  }
>  
>  
> +static int flush_pending_writes(conf_t *conf)
> +{

[snip]

Any chance to avoid this code duplication?

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe



* Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
  2008-03-03 15:54   ` Andre Noll
@ 2008-03-04  6:08     ` Neil Brown
  2008-03-04 11:29       ` Andre Noll
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Brown @ 2008-03-04  6:08 UTC (permalink / raw)
  To: Andre Noll; +Cc: Andrew Morton, linux-raid, linux-kernel, K.Tanaka

On Monday March 3, maan@systemlinux.org wrote:
> On 11:17, NeilBrown wrote:
> 
> > So we create a new function 'flush_pending_writes' to give that attention,
> > and call it in freeze_array to be sure that we aren't waiting on raid1d.
> 
> Two minor remarks:

Thanks for looking.

> 
> Do we really need to take the spin lock in the common case where
> conf->pending_bio_list.head is NULL? If not, the above could be
> optimized to the slightly faster and more readable
> 
> 	struct bio *bio;
> 
> 	if (!conf->pending_bio_list.head)
> 		return 0;
> 	spin_lock_irq(&conf->device_lock);
> 	bio = bio_list_get(&conf->pending_bio_list);
> 	...
> 	spin_unlock_irq(&conf->device_lock);
> 	return 1;

Maybe... If I write a memory location inside a spinlock, then after
the spinlock is dropped, I read that location on a different CPU,
am I always guaranteed to see the new value? or do I need some sort of
memory barrier?
If I could be clear on that (and memory-barriers.txt isn't
helping... maybe if I read it 7 more times) then we probably could
make the change you suggest.
???

> 
> 
> > diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> > --- .prev/drivers/md/raid1.c	2008-02-22 15:45:35.000000000 +1100
> > +++ ./drivers/md/raid1.c	2008-02-22 15:45:35.000000000 +1100
> > @@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
> >  }
> >  
> >  
> > +static int flush_pending_writes(conf_t *conf)
> > +{
> 
> [snip]
> 
> Any chance to avoid this code duplication?

Not really.
The "conf_t" in each case is a very different conf_t that just happens
to have the same name and some common fields.
There is actually quite a lot of structural similarity between raid1
and raid10, but I don't think there is much to be gained by trying to
share code there.

Thanks,
NeilBrown


* Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
  2008-03-04  6:08     ` Neil Brown
@ 2008-03-04 11:29       ` Andre Noll
  2008-03-06  3:29         ` Neil Brown
  0 siblings, 1 reply; 15+ messages in thread
From: Andre Noll @ 2008-03-04 11:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel, K.Tanaka


On 17:08, Neil Brown wrote:
> > Do we really need to take the spin lock in the common case where
> > conf->pending_bio_list.head is NULL? If not, the above could be
> > optimized to the slightly faster and more readable
> > 
> > 	struct bio *bio;
> > 
> > 	if (!conf->pending_bio_list.head)
> > 		return 0;
> > 	spin_lock_irq(&conf->device_lock);
> > 	bio = bio_list_get(&conf->pending_bio_list);
> > 	...
> > 	spin_unlock_irq(&conf->device_lock);
> > 	return 1;
> 
> Maybe... If I write a memory location inside a spinlock, then after
> the spinlock is dropped, I read that location on a different CPU,
> am I always guaranteed to see the new value? or do I need some sort of
> memory barrier?

Are you worried about another CPU setting conf->pending_bio_list.head
to != NULL after the if statement? If that's an issue I think also
the original patch is problematic because the same might happen after
the final spin_unlock_irq() but before flush_pending_writes()
returns zero.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe



* Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
  2008-03-04 11:29       ` Andre Noll
@ 2008-03-06  3:29         ` Neil Brown
  2008-03-06 10:51           ` Andre Noll
  0 siblings, 1 reply; 15+ messages in thread
From: Neil Brown @ 2008-03-06  3:29 UTC (permalink / raw)
  To: Andre Noll; +Cc: Andrew Morton, linux-raid, linux-kernel, K.Tanaka

On Tuesday March 4, maan@systemlinux.org wrote:
> On 17:08, Neil Brown wrote:
> > > Do we really need to take the spin lock in the common case where
> > > conf->pending_bio_list.head is NULL? If not, the above could be
> > > optimized to the slightly faster and more readable
> > > 
> > > 	struct bio *bio;
> > > 
> > > 	if (!conf->pending_bio_list.head)
> > > 		return 0;
> > > 	spin_lock_irq(&conf->device_lock);
> > > 	bio = bio_list_get(&conf->pending_bio_list);
> > > 	...
> > > 	spin_unlock_irq(&conf->device_lock);
> > > 	return 1;
> > 
> > Maybe... If I write a memory location inside a spinlock, then after
> > the spinlock is dropped, I read that location on a different CPU,
> > am I always guaranteed to see the new value? or do I need some sort of
> > memory barrier?
> 
> Are you worried about another CPU setting conf->pending_bio_list.head
> to != NULL after the if statement? If that's an issue I think also
> the original patch is problematic because the same might happen after
> the final spin_unlock_irq() but before flush_pending_writes()
> returns zero.

No.  I'm worried that another CPU might set
conf->pending_bio_list.head *before* the if statement, but it isn't
seen by this CPU because of the lack of memory barriers.  The spinlock
ensures that the memory state is consistent.
It is possible that I am being overcautious.  But I think that is
better than the alternative.

Thanks,
NeilBrown


* Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
  2008-03-06  3:29         ` Neil Brown
@ 2008-03-06 10:51           ` Andre Noll
  0 siblings, 0 replies; 15+ messages in thread
From: Andre Noll @ 2008-03-06 10:51 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel, K.Tanaka


On 14:29, Neil Brown wrote:
> > Are you worried about another CPU setting conf->pending_bio_list.head
> > to != NULL after the if statement? If that's an issue I think also
> > the original patch is problematic because the same might happen after
> > the final spin_unlock_irq() but before flush_pending_writes()
> > returns zero.
> 
> No.  I'm worried that another CPU might set
> conf->pending_bio_list.head *before* the if statement, but it isn't
> seen by this CPU because of the lack of memory barriers.  The spinlock
> ensures that the memory state is consistent.

But is that enough to avoid the deadlock? I think the following
scenario would be possible with the code in the original patch:

	// suppose conf->pending_bio_list.head==NULL ATM

	CPU0:
	int rv = 0;
	spin_lock_irq(&conf->device_lock);
	if (conf->pending_bio_list.head) // false
	spin_unlock_irq(&conf->device_lock);

	CPU1:
	conf->pending_bio_list.head = something;

	CPU0:
	return rv; // zero

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe



end of thread, other threads:[~2008-03-06 10:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2008-03-03  0:16 [PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
2008-03-03  0:17 ` [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error NeilBrown
2008-03-03 15:54   ` Andre Noll
2008-03-04  6:08     ` Neil Brown
2008-03-04 11:29       ` Andre Noll
2008-03-06  3:29         ` Neil Brown
2008-03-06 10:51           ` Andre Noll
2008-03-03  0:17 ` [PATCH 002 of 9] md: Reduce CPU wastage on idle md array with a write-intent bitmap NeilBrown
2008-03-03  0:17 ` [PATCH 003 of 9] md: Guard against possible bad array geometry in v1 metadata NeilBrown
2008-03-03  0:17 ` [PATCH 004 of 9] md: Clean up irregularity with raid autodetect NeilBrown
2008-03-03  0:17 ` [PATCH 005 of 9] md: Make sure a reshape is started when device switches to read-write NeilBrown
2008-03-03  0:17 ` [PATCH 006 of 9] md: Lock access to rdev attributes properly NeilBrown
2008-03-03  0:17 ` [PATCH 007 of 9] md: Don't attempt read-balancing for raid10 'far' layouts NeilBrown
2008-03-03  0:17 ` [PATCH 008 of 9] md: Fix possible raid1/raid10 deadlock on read error during resync NeilBrown
2008-03-03  0:18 ` [PATCH 009 of 9] md: The md RAID10 resync thread could cause a md RAID10 array deadlock NeilBrown
