* [PATCH 00/18] Assorted md patches headed for 2.6.30
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Hi,
 following is my current patch queue for 2.6.30, in case anyone would
like to review or otherwise comment.
They should show up in -next shortly.

Probably the most interesting are the last few, which provide support
for converting a raid1 into a raid5, and a raid5 into a raid6.
I plan to do some more work here so the code might change a bit before
final submission, as I work out how best to factor the code.

mdadm doesn't currently support these conversions, but you can
simply
   echo raid5 > /sys/block/md0/md/level
to change a 2-drive raid1 into a raid5.  Similarly for 5 -> 6.
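That is, on an array that is already a raid5 (again assuming it is md0):
   echo raid6 > /sys/block/md0/md/level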

The raid6 array created will have a somewhat unusual layout in that
all the Q blocks will be on the last drive.  Later I'll create
functionality to restripe the array so that the Q block is rotated
around all the drives as you would expect.
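
Schematically, a 4-drive raid6 produced this way looks something like
this (D = data, P = raid5 parity, Q = raid6 syndrome; the exact
rotation order is elided):

   disk0  disk1  disk2  disk3
     D      D      P      Q
     D      P      D      Q
     P      D      D      Q

i.e. P rotates across the first three drives as in an ordinary raid5,
while Q always sits on the last drive.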

Comments and testing very welcome.

Thanks,
NeilBrown





* [PATCH 01/18] md: never clear bit from the write-intent bitmap when the array is degraded.
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

It is safe to clear a bit from the write-intent bitmap for a raid1
if we know the data has been written to all devices, which is
what the current test does.
But it is not always safe to update the 'events_cleared' counter in
that case.  This is because one request could complete successfully after
some other request has partially failed.

So simply disable the clearing and updating of events_cleared whenever
the array is degraded.  This might end up not clearing some bits that
could safely be cleared, but it is the safest approach.

Note that the bug fixed here did not risk corrupting data by letting
the array get out-of-sync.  Rather it meant that when a device is
removed and re-added to the array, it might incorrectly require a full
recovery rather than just recovering based on the bitmap.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/bitmap.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 7199437..be29937 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1306,6 +1306,9 @@ void bitmap_endwrite(struct bitmap *bitmap, sector_t offset, unsigned long secto
 		PRINTK(KERN_DEBUG "dec write-behind count %d/%d\n",
 		  atomic_read(&bitmap->behind_writes), bitmap->max_write_behind);
 	}
+	if (bitmap->mddev->degraded)
+		/* Never clear bits or update events_cleared when degraded */
+		success = 0;
 
 	while (sectors) {
 		int blocks;




* [PATCH 02/18] md: write bitmap information to devices that are undergoing recovery.
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

When we add some spares to an array and start recovery, and we have
a bitmap which is stored 'internally' on all devices, we call
bitmap_write_all to make sure the bitmap is correct on the new
device(s).
However that doesn't work, as write_sb_page only writes to
'In_sync' devices, and devices undergoing recovery are not
'In_sync' until recovery finishes.

So extend write_sb_page (actually next_active_rdev) to include devices
that are under recovery.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/bitmap.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index be29937..96a43bd 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -265,7 +265,6 @@ static mdk_rdev_t *next_active_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
 	list_for_each_continue_rcu(pos, &mddev->disks) {
 		rdev = list_entry(pos, mdk_rdev_t, same_set);
 		if (rdev->raid_disk >= 0 &&
-		    test_bit(In_sync, &rdev->flags) &&
 		    !test_bit(Faulty, &rdev->flags)) {
 			/* this is a usable devices */
 			atomic_inc(&rdev->nr_pending);




* [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Version 1.x metadata has the ability to record the status of a
partially completed drive recovery.
However we only update that record on a clean shutdown.
It would be nice to update it on unclean shutdowns too, particularly
when using a bitmap that removes much of the 'sync' effort after an
unclean shutdown.

One complication with checkpointing recovery is that we only know
where we are up to in terms of IO requests started, not which ones
have completed.  And we need to know what has completed to record
how much is recovered.  So occasionally pause the recovery until all
submitted requests are completed, then update the record of where
we are up to.

When we have a bitmap, we already do that pause occasionally to keep
the bitmap up-to-date.  So enhance that code to record the recovery
offset and schedule a superblock update.
And when there is no bitmap, just pause 16 times during the resync to
do a checkpoint.
'16' is a fairly arbitrary number.  But we don't really have any good
way to judge how often is acceptable, and it seems like a reasonable
number for now.


Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/bitmap.c       |    2 ++
 drivers/md/md.c           |   26 ++++++++++++++++++++++----
 include/linux/raid/md_k.h |    7 +++++++
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 96a43bd..ff43e4b 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1445,6 +1445,8 @@ void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
 	wait_event(bitmap->mddev->recovery_wait,
 		   atomic_read(&bitmap->mddev->recovery_active) == 0);
 
+	bitmap->mddev->curr_resync_completed = bitmap->mddev->curr_resync;
+	set_bit(MD_CHANGE_CLEAN, &bitmap->mddev->flags);
 	sector &= ~((1ULL << CHUNK_BLOCK_SHIFT(bitmap)) - 1);
 	s = 0;
 	while (s < sector && s < bitmap->mddev->resync_max_sectors) {
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4495104..7639c08 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1325,10 +1325,13 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 	}
 
 	if (rdev->raid_disk >= 0 &&
-	    !test_bit(In_sync, &rdev->flags) &&
-	    rdev->recovery_offset > 0) {
-		sb->feature_map |= cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
-		sb->recovery_offset = cpu_to_le64(rdev->recovery_offset);
+	    !test_bit(In_sync, &rdev->flags)) {
+		if (mddev->curr_resync_completed > rdev->recovery_offset)
+			rdev->recovery_offset = mddev->curr_resync_completed;
+		if (rdev->recovery_offset > 0) {
+			sb->feature_map |= cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
+			sb->recovery_offset = cpu_to_le64(rdev->recovery_offset);
+		}
 	}
 
 	if (mddev->reshape_position != MaxSector) {
@@ -6034,6 +6037,19 @@ void md_do_sync(mddev_t *mddev)
 		}
 		if (kthread_should_stop())
 			goto interrupted;
+
+		if (mddev->curr_resync > mddev->curr_resync_completed &&
+		    (mddev->curr_resync - mddev->curr_resync_completed)
+		    > (max_sectors >> 4)) {
+			/* time to update curr_resync_completed */
+			printk("updating\n");
+			blk_unplug(mddev->queue);
+			wait_event(mddev->recovery_wait,
+				   atomic_read(&mddev->recovery_active) == 0);
+			mddev->curr_resync_completed =
+				mddev->curr_resync;
+			set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+		}
 		sectors = mddev->pers->sync_request(mddev, j, &skipped,
 						  currspeed < speed_min(mddev));
 		if (sectors == 0) {
@@ -6167,6 +6183,8 @@ static int remove_and_add_spares(mddev_t *mddev)
 	mdk_rdev_t *rdev;
 	int spares = 0;
 
+	mddev->curr_resync_completed = 0;
+
 	list_for_each_entry(rdev, &mddev->disks, same_set)
 		if (rdev->raid_disk >= 0 &&
 		    !test_bit(Blocked, &rdev->flags) &&
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 9743e4d..92bd993 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -172,6 +172,13 @@ struct mddev_s
 	struct mdk_thread_s		*thread;	/* management thread */
 	struct mdk_thread_s		*sync_thread;	/* doing resync or reconstruct */
 	sector_t			curr_resync;	/* last block scheduled */
+	/* As resync requests can complete out of order, we cannot easily track
+	 * how much resync has been completed.  So we occasionally pause until
+	 * every completes, then set curr_resync_completed to curr_resync.
+	 * As such it may be well behind the real resync mark, but it is a value
+	 * we are certain of.
+	 */
+	sector_t			curr_resync_completed;
 	unsigned long			resync_mark;	/* a recent timestamp */
 	sector_t			resync_mark_cnt;/* blocks written at resync_mark */
 	sector_t			curr_mark_cnt; /* blocks scheduled now */




* [PATCH 04/18] md: be more consistent about setting WriteMostly flag when adding a drive to an array
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

When a drive is added to an array using ADD_NEW_DISK, there are two
places we can get certain flags from: the metadata on the disk or the
flags passed in through the ioctl.

For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
from either of those sources depending on whether it is set (i.e. we
effectively 'or' the two sources together).

This makes it awkward to clear, and is at best inconsistent.

As documented code in mdadm requires that setting
MD_DISK_WRITEMOSTLY in the ioctl be effective, we resolve the
inconsistency by always using the value of this flag from the ioctl
and ignoring the value on disk.
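(In particular, a device whose on-disk superblock has WriteMostly set
can now be added without the flag simply by leaving MD_DISK_WRITEMOSTLY
out of the state passed in the ioctl.)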


Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 7639c08..8c5cc02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4477,6 +4477,8 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 		clear_bit(In_sync, &rdev->flags); /* just to be sure */
 		if (info->state & (1<<MD_DISK_WRITEMOSTLY))
 			set_bit(WriteMostly, &rdev->flags);
+		else
+			clear_bit(WriteMostly, &rdev->flags);
 
 		rdev->raid_disk = -1;
 		err = bind_rdev_to_array(rdev, mddev);




* [PATCH 05/18] md: Make mddev->size sector-based.
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

From: Andre Noll <maan@systemlinux.org>

This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1, 4-6 and 10, are
adjusted accordingly. This simplifies the code a bit because it allows
us to get rid of a couple of divisions/multiplications by two.
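(For example, a 100GiB component that used to be recorded as
size = 104857600 1K-blocks is now stored as dev_sectors = 209715200
sectors.)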

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/bitmap.c       |    2 -
 drivers/md/faulty.c       |    2 -
 drivers/md/md.c           |   97 +++++++++++++++++++++++++--------------------
 drivers/md/multipath.c    |    2 -
 drivers/md/raid1.c        |   10 ++---
 drivers/md/raid10.c       |    6 +--
 drivers/md/raid5.c        |   24 ++++++-----
 include/linux/raid/md_k.h |    3 +
 8 files changed, 78 insertions(+), 68 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index ff43e4b..2233b4c 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -296,7 +296,7 @@ static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 				    + size/512 > 0)
 					/* bitmap runs in to metadata */
 					goto bad_alignment;
-				if (rdev->data_offset + mddev->size*2
+				if (rdev->data_offset + mddev->dev_sectors
 				    > rdev->sb_start + bitmap->offset)
 					/* data runs in to bitmap */
 					goto bad_alignment;
diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 86d9adf..b3b0437 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -298,7 +298,7 @@ static int run(mddev_t *mddev)
 	list_for_each_entry(rdev, &mddev->disks, same_set)
 		conf->rdev = rdev;
 
-	mddev->array_sectors = mddev->size * 2;
+	mddev->array_sectors = mddev->dev_sectors;
 	mddev->private = conf;
 
 	reconfig(mddev, mddev->layout, -1);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 8c5cc02..df28dca 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -817,7 +817,7 @@ static int super_90_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 		mddev->clevel[0] = 0;
 		mddev->layout = sb->layout;
 		mddev->raid_disks = sb->raid_disks;
-		mddev->size = sb->size;
+		mddev->dev_sectors = sb->size * 2;
 		mddev->events = ev1;
 		mddev->bitmap_offset = 0;
 		mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
@@ -931,7 +931,7 @@ static void super_90_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 
 	sb->ctime = mddev->ctime;
 	sb->level = mddev->level;
-	sb->size  = mddev->size;
+	sb->size = mddev->dev_sectors / 2;
 	sb->raid_disks = mddev->raid_disks;
 	sb->md_minor = mddev->md_minor;
 	sb->not_persistent = 0;
@@ -1029,7 +1029,7 @@ static void super_90_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 static unsigned long long
 super_90_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
 {
-	if (num_sectors && num_sectors < rdev->mddev->size * 2)
+	if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
 		return 0; /* component must fit device */
 	if (rdev->mddev->bitmap_offset)
 		return 0; /* can't move bitmap */
@@ -1221,7 +1221,7 @@ static int super_1_validate(mddev_t *mddev, mdk_rdev_t *rdev)
 		mddev->clevel[0] = 0;
 		mddev->layout = le32_to_cpu(sb->layout);
 		mddev->raid_disks = le32_to_cpu(sb->raid_disks);
-		mddev->size = le64_to_cpu(sb->size)/2;
+		mddev->dev_sectors = le64_to_cpu(sb->size);
 		mddev->events = ev1;
 		mddev->bitmap_offset = 0;
 		mddev->default_bitmap_offset = 1024 >> 9;
@@ -1317,7 +1317,7 @@ static void super_1_sync(mddev_t *mddev, mdk_rdev_t *rdev)
 	sb->cnt_corrected_read = cpu_to_le32(atomic_read(&rdev->corrected_errors));
 
 	sb->raid_disks = cpu_to_le32(mddev->raid_disks);
-	sb->size = cpu_to_le64(mddev->size<<1);
+	sb->size = cpu_to_le64(mddev->dev_sectors);
 
 	if (mddev->bitmap && mddev->bitmap_file == NULL) {
 		sb->bitmap_offset = cpu_to_le32((__u32)mddev->bitmap_offset);
@@ -1373,7 +1373,7 @@ super_1_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
 {
 	struct mdp_superblock_1 *sb;
 	sector_t max_sectors;
-	if (num_sectors && num_sectors < rdev->mddev->size * 2)
+	if (num_sectors && num_sectors < rdev->mddev->dev_sectors)
 		return 0; /* component must fit device */
 	if (rdev->sb_start < rdev->data_offset) {
 		/* minor versions 1 and 2; superblock before data */
@@ -1457,8 +1457,9 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	if (find_rdev(mddev, rdev->bdev->bd_dev))
 		return -EEXIST;
 
-	/* make sure rdev->size exceeds mddev->size */
-	if (rdev->size && (mddev->size == 0 || rdev->size < mddev->size)) {
+	/* make sure rdev->size exceeds mddev->dev_sectors / 2 */
+	if (rdev->size && (mddev->dev_sectors == 0 ||
+			rdev->size < mddev->dev_sectors / 2)) {
 		if (mddev->pers) {
 			/* Cannot change size, so fail
 			 * If mddev->level <= 0, then we don't care
@@ -1467,7 +1468,7 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 			if (mddev->level > 0)
 				return -ENOSPC;
 		} else
-			mddev->size = rdev->size;
+			mddev->dev_sectors = rdev->size * 2;
 	}
 
 	/* Verify rdev->desc_nr is unique.
@@ -2208,7 +2209,7 @@ rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 			size -= rdev->data_offset/2;
 		}
 	}
-	if (size < my_mddev->size)
+	if (size < my_mddev->dev_sectors / 2)
 		return -EINVAL; /* component must fit device */
 
 	rdev->size = size;
@@ -2774,7 +2775,7 @@ array_state_show(mddev_t *mddev, char *page)
 	else {
 		if (list_empty(&mddev->disks) &&
 		    mddev->raid_disks == 0 &&
-		    mddev->size == 0)
+		    mddev->dev_sectors == 0)
 			st = clear;
 		else
 			st = inactive;
@@ -2981,7 +2982,8 @@ __ATTR(bitmap_set_bits, S_IWUSR, null_show, bitmap_store);
 static ssize_t
 size_show(mddev_t *mddev, char *page)
 {
-	return sprintf(page, "%llu\n", (unsigned long long)mddev->size);
+	return sprintf(page, "%llu\n",
+		(unsigned long long)mddev->dev_sectors / 2);
 }
 
 static int update_size(mddev_t *mddev, sector_t num_sectors);
@@ -2993,20 +2995,19 @@ size_store(mddev_t *mddev, const char *buf, size_t len)
 	 * not increase it (except from 0).
 	 * If array is active, we can try an on-line resize
 	 */
-	char *e;
-	int err = 0;
-	unsigned long long size = simple_strtoull(buf, &e, 10);
-	if (!*buf || *buf == '\n' ||
-	    (*e && *e != '\n'))
-		return -EINVAL;
+	unsigned long long sectors;
+	int err = strict_strtoull(buf, 10, &sectors);
 
+	if (err < 0)
+		return err;
+	sectors *= 2;
 	if (mddev->pers) {
-		err = update_size(mddev, size * 2);
+		err = update_size(mddev, sectors);
 		md_update_sb(mddev, 1);
 	} else {
-		if (mddev->size == 0 ||
-		    mddev->size > size)
-			mddev->size = size;
+		if (mddev->dev_sectors == 0 ||
+		    mddev->dev_sectors > sectors)
+			mddev->dev_sectors = sectors;
 		else
 			err = -ENOSPC;
 	}
@@ -3271,15 +3272,15 @@ static struct md_sysfs_entry md_sync_speed = __ATTR_RO(sync_speed);
 static ssize_t
 sync_completed_show(mddev_t *mddev, char *page)
 {
-	unsigned long max_blocks, resync;
+	unsigned long max_sectors, resync;
 
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
-		max_blocks = mddev->resync_max_sectors;
+		max_sectors = mddev->resync_max_sectors;
 	else
-		max_blocks = mddev->size << 1;
+		max_sectors = mddev->dev_sectors;
 
 	resync = (mddev->curr_resync - atomic_read(&mddev->recovery_active));
-	return sprintf(page, "%lu / %lu\n", resync, max_blocks);
+	return sprintf(page, "%lu / %lu\n", resync, max_sectors);
 }
 
 static struct md_sysfs_entry md_sync_completed = __ATTR_RO(sync_completed);
@@ -3754,11 +3755,11 @@ static int do_md_run(mddev_t * mddev)
 
 		/* perform some consistency tests on the device.
 		 * We don't want the data to overlap the metadata,
-		 * Internal Bitmap issues has handled elsewhere.
+		 * Internal Bitmap issues have been handled elsewhere.
 		 */
 		if (rdev->data_offset < rdev->sb_start) {
-			if (mddev->size &&
-			    rdev->data_offset + mddev->size*2
+			if (mddev->dev_sectors &&
+			    rdev->data_offset + mddev->dev_sectors
 			    > rdev->sb_start) {
 				printk("md: %s: data overlaps metadata\n",
 				       mdname(mddev));
@@ -3836,7 +3837,9 @@ static int do_md_run(mddev_t * mddev)
 	}
 
 	mddev->recovery = 0;
-	mddev->resync_max_sectors = mddev->size << 1; /* may be over-ridden by personality */
+	/* may be over-ridden by personality */
+	mddev->resync_max_sectors = mddev->dev_sectors;
+
 	mddev->barriers_work = 1;
 	mddev->ok_start_degraded = start_dirty_degraded;
 
@@ -4092,7 +4095,7 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 		export_array(mddev);
 
 		mddev->array_sectors = 0;
-		mddev->size = 0;
+		mddev->dev_sectors = 0;
 		mddev->raid_disks = 0;
 		mddev->recovery_cp = 0;
 		mddev->resync_min = 0;
@@ -4297,8 +4300,8 @@ static int get_array_info(mddev_t * mddev, void __user * arg)
 	info.patch_version = MD_PATCHLEVEL_VERSION;
 	info.ctime         = mddev->ctime;
 	info.level         = mddev->level;
-	info.size          = mddev->size;
-	if (info.size != mddev->size) /* overflow */
+	info.size          = mddev->dev_sectors / 2;
+	if (info.size != mddev->dev_sectors / 2) /* overflow */
 		info.size = -1;
 	info.nr_disks      = nr;
 	info.raid_disks    = mddev->raid_disks;
@@ -4748,7 +4751,7 @@ static int set_array_info(mddev_t * mddev, mdu_array_info_t *info)
 
 	mddev->level         = info->level;
 	mddev->clevel[0]     = 0;
-	mddev->size          = info->size;
+	mddev->dev_sectors   = 2 * (sector_t)info->size;
 	mddev->raid_disks    = info->raid_disks;
 	/* don't set md_minor, it is determined by which /dev/md* was
 	 * openned
@@ -4886,12 +4889,18 @@ static int update_array_info(mddev_t *mddev, mdu_array_info_t *info)
 		)
 		return -EINVAL;
 	/* Check there is only one change */
-	if (info->size >= 0 && mddev->size != info->size) cnt++;
-	if (mddev->raid_disks != info->raid_disks) cnt++;
-	if (mddev->layout != info->layout) cnt++;
-	if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) cnt++;
-	if (cnt == 0) return 0;
-	if (cnt > 1) return -EINVAL;
+	if (info->size >= 0 && mddev->dev_sectors / 2 != info->size)
+		cnt++;
+	if (mddev->raid_disks != info->raid_disks)
+		cnt++;
+	if (mddev->layout != info->layout)
+		cnt++;
+	if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT))
+		cnt++;
+	if (cnt == 0)
+		return 0;
+	if (cnt > 1)
+		return -EINVAL;
 
 	if (mddev->layout != info->layout) {
 		/* Change layout
@@ -4903,7 +4912,7 @@ static int update_array_info(mddev_t *mddev, mdu_array_info_t *info)
 		else
 			return mddev->pers->reconfig(mddev, info->layout, -1);
 	}
-	if (info->size >= 0 && mddev->size != info->size)
+	if (info->size >= 0 && mddev->dev_sectors / 2 != info->size)
 		rv = update_size(mddev, (sector_t)info->size * 2);
 
 	if (mddev->raid_disks    != info->raid_disks)
@@ -5403,7 +5412,7 @@ static void status_resync(struct seq_file *seq, mddev_t * mddev)
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 		max_blocks = mddev->resync_max_sectors >> 1;
 	else
-		max_blocks = mddev->size;
+		max_blocks = mddev->dev_sectors / 2;
 
 	/*
 	 * Should not happen.
@@ -5979,10 +5988,10 @@ void md_do_sync(mddev_t *mddev)
 			j = mddev->recovery_cp;
 
 	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
-		max_sectors = mddev->size << 1;
+		max_sectors = mddev->dev_sectors;
 	else {
 		/* recovery follows the physical size of devices */
-		max_sectors = mddev->size << 1;
+		max_sectors = mddev->dev_sectors;
 		j = MaxSector;
 		list_for_each_entry(rdev, &mddev->disks, same_set)
 			if (rdev->raid_disk >= 0 &&
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index f6d08f2..a21084b 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -498,7 +498,7 @@ static int multipath_run (mddev_t *mddev)
 	/*
 	 * Ok, everything is just fine now
 	 */
-	mddev->array_sectors = mddev->size * 2;
+	mddev->array_sectors = mddev->dev_sectors;
 
 	mddev->queue->unplug_fn = multipath_unplug;
 	mddev->queue->backing_dev_info.congested_fn = multipath_congested;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 01e3cff..c4b7780 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1722,7 +1722,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 			return 0;
 	}
 
-	max_sector = mddev->size << 1;
+	max_sector = mddev->dev_sectors;
 	if (sector_nr >= max_sector) {
 		/* If we aborted, we need to abort the
 		 * sync on the 'current' bitmap chunk (there will
@@ -2047,7 +2047,7 @@ static int run(mddev_t *mddev)
 	/*
 	 * Ok, everything is just fine now
 	 */
-	mddev->array_sectors = mddev->size * 2;
+	mddev->array_sectors = mddev->dev_sectors;
 
 	mddev->queue->unplug_fn = raid1_unplug;
 	mddev->queue->backing_dev_info.congested_fn = raid1_congested;
@@ -2112,12 +2112,12 @@ static int raid1_resize(mddev_t *mddev, sector_t sectors)
 	mddev->array_sectors = sectors;
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	mddev->changed = 1;
-	if (mddev->array_sectors / 2 > mddev->size &&
+	if (mddev->array_sectors > mddev->dev_sectors &&
 	    mddev->recovery_cp == MaxSector) {
-		mddev->recovery_cp = mddev->size << 1;
+		mddev->recovery_cp = mddev->dev_sectors;
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	}
-	mddev->size = mddev->array_sectors / 2;
+	mddev->dev_sectors = mddev->array_sectors;
 	mddev->resync_max_sectors = sectors;
 	return 0;
 }
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6736d6d..0484c7e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1694,7 +1694,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 			return 0;
 
  skipped:
-	max_sector = mddev->size << 1;
+	max_sector = mddev->dev_sectors;
 	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
 		max_sector = mddev->resync_max_sectors;
 	if (sector_nr >= max_sector) {
@@ -2075,7 +2075,7 @@ static int run(mddev_t *mddev)
 	conf->far_offset = fo;
 	conf->chunk_mask = (sector_t)(mddev->chunk_size>>9)-1;
 	conf->chunk_shift = ffz(~mddev->chunk_size) - 9;
-	size = mddev->size >> (conf->chunk_shift-1);
+	size = mddev->dev_sectors >> conf->chunk_shift;
 	sector_div(size, fc);
 	size = size * conf->raid_disks;
 	sector_div(size, nc);
@@ -2088,7 +2088,7 @@ static int run(mddev_t *mddev)
 	 */
 	stride += conf->raid_disks - 1;
 	sector_div(stride, conf->raid_disks);
-	mddev->size = stride  << (conf->chunk_shift-1);
+	mddev->dev_sectors = stride << conf->chunk_shift;
 
 	if (fo)
 		stride = 1;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a5ba080..99c5261 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3626,8 +3626,8 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 				     *(new_data_disks) -1,
 				     raid_disks, data_disks,
 				     &dd_idx, &pd_idx, conf);
-	if (last_sector >= (mddev->size<<1))
-		last_sector = (mddev->size<<1)-1;
+	if (last_sector >= mddev->dev_sectors)
+		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
 		pd_idx = stripe_to_pdidx(first_sector, conf,
 					 conf->previous_raid_disks);
@@ -3667,7 +3667,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 	struct stripe_head *sh;
 	int pd_idx;
 	int raid_disks = conf->raid_disks;
-	sector_t max_sector = mddev->size << 1;
+	sector_t max_sector = mddev->dev_sectors;
 	int sync_blocks;
 	int still_degraded = 0;
 	int i;
@@ -3705,7 +3705,7 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 	 */
 	if (mddev->degraded >= conf->max_degraded &&
 	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-		sector_t rv = (mddev->size << 1) - sector_nr;
+		sector_t rv = mddev->dev_sectors - sector_nr;
 		*skipped = 1;
 		return rv;
 	}
@@ -4143,8 +4143,8 @@ static int run(mddev_t *mddev)
 	conf->expand_progress = mddev->reshape_position;
 
 	/* device size must be a multiple of chunk size */
-	mddev->size &= ~(mddev->chunk_size/1024 -1);
-	mddev->resync_max_sectors = mddev->size << 1;
+	mddev->dev_sectors &= ~(mddev->chunk_size / 512 - 1);
+	mddev->resync_max_sectors = mddev->dev_sectors;
 
 	if (conf->level == 6 && conf->raid_disks < 4) {
 		printk(KERN_ERR "raid6: not enough configured devices for %s (%d, minimum 4)\n",
@@ -4251,8 +4251,8 @@ static int run(mddev_t *mddev)
 	mddev->queue->backing_dev_info.congested_data = mddev;
 	mddev->queue->backing_dev_info.congested_fn = raid5_congested;
 
-	mddev->array_sectors = 2 * mddev->size * (conf->previous_raid_disks -
-					    conf->max_degraded);
+	mddev->array_sectors = mddev->dev_sectors *
+		(conf->previous_raid_disks - conf->max_degraded);
 
 	blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
 
@@ -4479,11 +4479,11 @@ static int raid5_resize(mddev_t *mddev, sector_t sectors)
 					  - conf->max_degraded);
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	mddev->changed = 1;
-	if (sectors/2  > mddev->size && mddev->recovery_cp == MaxSector) {
-		mddev->recovery_cp = mddev->size << 1;
+	if (sectors > mddev->dev_sectors && mddev->recovery_cp == MaxSector) {
+		mddev->recovery_cp = mddev->dev_sectors;
 		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
 	}
-	mddev->size = sectors /2;
+	mddev->dev_sectors = sectors;
 	mddev->resync_max_sectors = sectors;
 	return 0;
 }
@@ -4612,7 +4612,7 @@ static void end_reshape(raid5_conf_t *conf)
 	struct block_device *bdev;
 
 	if (!test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery)) {
-		conf->mddev->array_sectors = 2 * conf->mddev->size *
+		conf->mddev->array_sectors = conf->mddev->dev_sectors *
 			(conf->raid_disks - conf->max_degraded);
 		set_capacity(conf->mddev->gendisk, conf->mddev->array_sectors);
 		conf->mddev->changed = 1;
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 92bd993..24f899d 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -155,7 +155,8 @@ struct mddev_s
 	char				clevel[16];
 	int				raid_disks;
 	int				max_disks;
-	sector_t			size; /* used size of component devices */
+	sector_t			dev_sectors; 	/* used size of
+							 * component devices */
 	sector_t			array_sectors; /* exported array size */
 	__u64				events;
 




* [PATCH 06/18] md: Represent raid device size in sectors.
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

From: Andre Noll <maan@systemlinux.org>

This patch renames the "size" field of struct mdk_rdev_s to
"sectors" and changes this field to store sectors instead of
blocks.

All users of this field, linear.c, raid0.c and md.c, are fixed up
accordingly, which gets rid of many multiplications and divisions.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/linear.c       |    4 +-
 drivers/md/md.c           |   96 +++++++++++++++++++++++----------------------
 drivers/md/raid0.c        |   41 +++++++++----------
 include/linux/raid/md_k.h |    2 -
 4 files changed, 71 insertions(+), 72 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 09658b2..83110f8 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -135,8 +135,8 @@ static linear_conf_t *linear_conf(mddev_t *mddev, int raid_disks)
 		    mddev->queue->max_sectors > (PAGE_SIZE>>9))
 			blk_queue_max_sectors(mddev->queue, PAGE_SIZE>>9);
 
-		disk->num_sectors = rdev->size * 2;
-		conf->array_sectors += rdev->size * 2;
+		disk->num_sectors = rdev->sectors;
+		conf->array_sectors += rdev->sectors;
 
 		cnt++;
 	}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index df28dca..d3c6020 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -414,7 +414,7 @@ static void free_disk_sb(mdk_rdev_t * rdev)
 		rdev->sb_loaded = 0;
 		rdev->sb_page = NULL;
 		rdev->sb_start = 0;
-		rdev->size = 0;
+		rdev->sectors = 0;
 	}
 }
 
@@ -780,9 +780,9 @@ static int super_90_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version
 		else 
 			ret = 0;
 	}
-	rdev->size = calc_num_sectors(rdev, sb->chunk_size) / 2;
+	rdev->sectors = calc_num_sectors(rdev, sb->chunk_size);
 
-	if (rdev->size < sb->size && sb->level > 1)
+	if (rdev->sectors < sb->size * 2 && sb->level > 1)
 		/* "this cannot possibly happen" ... */
 		ret = -EINVAL;
 
@@ -1185,16 +1185,17 @@ static int super_1_load(mdk_rdev_t *rdev, mdk_rdev_t *refdev, int minor_version)
 			ret = 0;
 	}
 	if (minor_version)
-		rdev->size = ((rdev->bdev->bd_inode->i_size>>9) - le64_to_cpu(sb->data_offset)) / 2;
+		rdev->sectors = (rdev->bdev->bd_inode->i_size >> 9) -
+			le64_to_cpu(sb->data_offset);
 	else
-		rdev->size = rdev->sb_start / 2;
-	if (rdev->size < le64_to_cpu(sb->data_size)/2)
+		rdev->sectors = rdev->sb_start;
+	if (rdev->sectors < le64_to_cpu(sb->data_size))
 		return -EINVAL;
-	rdev->size = le64_to_cpu(sb->data_size)/2;
+	rdev->sectors = le64_to_cpu(sb->data_size);
 	if (le32_to_cpu(sb->chunksize))
-		rdev->size &= ~((sector_t)le32_to_cpu(sb->chunksize)/2 - 1);
+		rdev->sectors &= ~((sector_t)le32_to_cpu(sb->chunksize) - 1);
 
-	if (le64_to_cpu(sb->size) > rdev->size*2)
+	if (le64_to_cpu(sb->size) > rdev->sectors)
 		return -EINVAL;
 	return ret;
 }
@@ -1389,7 +1390,7 @@ super_1_rdev_size_change(mdk_rdev_t *rdev, sector_t num_sectors)
 		sector_t sb_start;
 		sb_start = (rdev->bdev->bd_inode->i_size >> 9) - 8*2;
 		sb_start &= ~(sector_t)(4*2 - 1);
-		max_sectors = rdev->size * 2 + sb_start - rdev->sb_start;
+		max_sectors = rdev->sectors + sb_start - rdev->sb_start;
 		if (!num_sectors || num_sectors > max_sectors)
 			num_sectors = max_sectors;
 		rdev->sb_start = sb_start;
@@ -1457,9 +1458,9 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 	if (find_rdev(mddev, rdev->bdev->bd_dev))
 		return -EEXIST;
 
-	/* make sure rdev->size exceeds mddev->dev_sectors / 2 */
-	if (rdev->size && (mddev->dev_sectors == 0 ||
-			rdev->size < mddev->dev_sectors / 2)) {
+	/* make sure rdev->sectors exceeds mddev->dev_sectors */
+	if (rdev->sectors && (mddev->dev_sectors == 0 ||
+			rdev->sectors < mddev->dev_sectors)) {
 		if (mddev->pers) {
 			/* Cannot change size, so fail
 			 * If mddev->level <= 0, then we don't care
@@ -1468,7 +1469,7 @@ static int bind_rdev_to_array(mdk_rdev_t * rdev, mddev_t * mddev)
 			if (mddev->level > 0)
 				return -ENOSPC;
 		} else
-			mddev->dev_sectors = rdev->size * 2;
+			mddev->dev_sectors = rdev->sectors;
 	}
 
 	/* Verify rdev->desc_nr is unique.
@@ -1722,8 +1723,8 @@ static void print_sb_1(struct mdp_superblock_1 *sb)
 static void print_rdev(mdk_rdev_t *rdev, int major_version)
 {
 	char b[BDEVNAME_SIZE];
-	printk(KERN_INFO "md: rdev %s, SZ:%08llu F:%d S:%d DN:%u\n",
-		bdevname(rdev->bdev,b), (unsigned long long)rdev->size,
+	printk(KERN_INFO "md: rdev %s, Sect:%08llu F:%d S:%d DN:%u\n",
+		bdevname(rdev->bdev, b), (unsigned long long)rdev->sectors,
 	        test_bit(Faulty, &rdev->flags), test_bit(In_sync, &rdev->flags),
 	        rdev->desc_nr);
 	if (rdev->sb_loaded) {
@@ -2162,7 +2163,7 @@ offset_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 		return -EINVAL;
 	if (rdev->mddev->pers && rdev->raid_disk >= 0)
 		return -EBUSY;
-	if (rdev->size && rdev->mddev->external)
+	if (rdev->sectors && rdev->mddev->external)
 		/* Must set offset before size, so overlap checks
 		 * can be sane */
 		return -EBUSY;
@@ -2176,7 +2177,7 @@ __ATTR(offset, S_IRUGO|S_IWUSR, offset_show, offset_store);
 static ssize_t
 rdev_size_show(mdk_rdev_t *rdev, char *page)
 {
-	return sprintf(page, "%llu\n", (unsigned long long)rdev->size);
+	return sprintf(page, "%llu\n", (unsigned long long)rdev->sectors / 2);
 }
 
 static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
@@ -2192,31 +2193,31 @@ static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
 static ssize_t
 rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
-	unsigned long long size;
-	unsigned long long oldsize = rdev->size;
 	mddev_t *my_mddev = rdev->mddev;
+	sector_t oldsectors = rdev->sectors;
+	unsigned long long sectors;
 
-	if (strict_strtoull(buf, 10, &size) < 0)
+	if (strict_strtoull(buf, 10, &sectors) < 0)
 		return -EINVAL;
+	sectors *= 2;
 	if (my_mddev->pers && rdev->raid_disk >= 0) {
 		if (my_mddev->persistent) {
-			size = super_types[my_mddev->major_version].
-				rdev_size_change(rdev, size * 2);
-			if (!size)
+			sectors = super_types[my_mddev->major_version].
+				rdev_size_change(rdev, sectors);
+			if (!sectors)
 				return -EBUSY;
-		} else if (!size) {
-			size = (rdev->bdev->bd_inode->i_size >> 10);
-			size -= rdev->data_offset/2;
-		}
+		} else if (!sectors)
+			sectors = (rdev->bdev->bd_inode->i_size >> 9) -
+				rdev->data_offset;
 	}
-	if (size < my_mddev->dev_sectors / 2)
+	if (sectors < my_mddev->dev_sectors)
 		return -EINVAL; /* component must fit device */
 
-	rdev->size = size;
-	if (size > oldsize && my_mddev->external) {
+	rdev->sectors = sectors;
+	if (sectors > oldsectors && my_mddev->external) {
 		/* need to check that all other rdevs with the same ->bdev
 		 * do not overlap.  We need to unlock the mddev to avoid
-		 * a deadlock.  We have already changed rdev->size, and if
+		 * a deadlock.  We have already changed rdev->sectors, and if
 		 * we have to change it back, we will have the lock again.
 		 */
 		mddev_t *mddev;
@@ -2232,9 +2233,9 @@ rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 				if (test_bit(AllReserved, &rdev2->flags) ||
 				    (rdev->bdev == rdev2->bdev &&
 				     rdev != rdev2 &&
-				     overlaps(rdev->data_offset, rdev->size * 2,
+				     overlaps(rdev->data_offset, rdev->sectors,
 					      rdev2->data_offset,
-					      rdev2->size * 2))) {
+					      rdev2->sectors))) {
 					overlap = 1;
 					break;
 				}
@@ -2248,11 +2249,11 @@ rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 		if (overlap) {
 			/* Someone else could have slipped in a size
 			 * change here, but doing so is just silly.
-			 * We put oldsize back because we *know* it is
+			 * We put oldsectors back because we *know* it is
 			 * safe, and trust userspace not to race with
 			 * itself
 			 */
-			rdev->size = oldsize;
+			rdev->sectors = oldsectors;
 			return -EBUSY;
 		}
 	}
@@ -3725,13 +3726,13 @@ static int do_md_run(mddev_t * mddev)
 		list_for_each_entry(rdev, &mddev->disks, same_set) {
 			if (test_bit(Faulty, &rdev->flags))
 				continue;
-			if (rdev->size < chunk_size / 1024) {
+			if (rdev->sectors < chunk_size / 512) {
 				printk(KERN_WARNING
 					"md: Dev %s smaller than chunk_size:"
-					" %lluk < %dk\n",
+					" %llu < %d\n",
 					bdevname(rdev->bdev,b),
-					(unsigned long long)rdev->size,
-					chunk_size / 1024);
+					(unsigned long long)rdev->sectors,
+					chunk_size / 512);
 				return -EINVAL;
 			}
 		}
@@ -4545,7 +4546,7 @@ static int add_new_disk(mddev_t * mddev, mdu_disk_info_t *info)
 			rdev->sb_start = rdev->bdev->bd_inode->i_size / 512;
 		} else 
 			rdev->sb_start = calc_dev_sboffset(rdev->bdev);
-		rdev->size = calc_num_sectors(rdev, mddev->chunk_size) / 2;
+		rdev->sectors = calc_num_sectors(rdev, mddev->chunk_size);
 
 		err = bind_rdev_to_array(rdev, mddev);
 		if (err) {
@@ -4615,7 +4616,7 @@ static int hot_add_disk(mddev_t * mddev, dev_t dev)
 	else
 		rdev->sb_start = rdev->bdev->bd_inode->i_size / 512;
 
-	rdev->size = calc_num_sectors(rdev, mddev->chunk_size) / 2;
+	rdev->sectors = calc_num_sectors(rdev, mddev->chunk_size);
 
 	if (test_bit(Faulty, &rdev->flags)) {
 		printk(KERN_WARNING 
@@ -4816,8 +4817,7 @@ static int update_size(mddev_t *mddev, sector_t num_sectors)
 		 */
 		return -EBUSY;
 	list_for_each_entry(rdev, &mddev->disks, same_set) {
-		sector_t avail;
-		avail = rdev->size * 2;
+		sector_t avail = rdev->sectors;
 
 		if (fit && (num_sectors == 0 || num_sectors > avail))
 			num_sectors = avail;
@@ -5545,7 +5545,7 @@ struct mdstat_info {
 static int md_seq_show(struct seq_file *seq, void *v)
 {
 	mddev_t *mddev = v;
-	sector_t size;
+	sector_t sectors;
 	mdk_rdev_t *rdev;
 	struct mdstat_info *mi = seq->private;
 	struct bitmap *bitmap;
@@ -5581,7 +5581,7 @@ static int md_seq_show(struct seq_file *seq, void *v)
 			seq_printf(seq, " %s", mddev->pers->name);
 		}
 
-		size = 0;
+		sectors = 0;
 		list_for_each_entry(rdev, &mddev->disks, same_set) {
 			char b[BDEVNAME_SIZE];
 			seq_printf(seq, " %s[%d]",
@@ -5593,7 +5593,7 @@ static int md_seq_show(struct seq_file *seq, void *v)
 				continue;
 			} else if (rdev->raid_disk < 0)
 				seq_printf(seq, "(S)"); /* spare */
-			size += rdev->size;
+			sectors += rdev->sectors;
 		}
 
 		if (!list_empty(&mddev->disks)) {
@@ -5603,7 +5603,7 @@ static int md_seq_show(struct seq_file *seq, void *v)
 					   mddev->array_sectors / 2);
 			else
 				seq_printf(seq, "\n      %llu blocks",
-					   (unsigned long long)size);
+					   (unsigned long long)sectors / 2);
 		}
 		if (mddev->persistent) {
 			if (mddev->major_version != 0 ||
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index c605ba8..a5037de 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -73,16 +73,15 @@ static int create_strip_zones (mddev_t *mddev)
 		list_for_each_entry(rdev2, &mddev->disks, same_set) {
 			printk(KERN_INFO "raid0:   comparing %s(%llu)",
 			       bdevname(rdev1->bdev,b),
-			       (unsigned long long)rdev1->size);
+			       (unsigned long long)rdev1->sectors);
 			printk(KERN_INFO " with %s(%llu)\n",
 			       bdevname(rdev2->bdev,b),
-			       (unsigned long long)rdev2->size);
+			       (unsigned long long)rdev2->sectors);
 			if (rdev2 == rdev1) {
 				printk(KERN_INFO "raid0:   END\n");
 				break;
 			}
-			if (rdev2->size == rdev1->size)
-			{
+			if (rdev2->sectors == rdev1->sectors) {
 				/*
 				 * Not unique, don't count it as a new
 				 * group
@@ -145,7 +144,7 @@ static int create_strip_zones (mddev_t *mddev)
 		    mddev->queue->max_sectors > (PAGE_SIZE>>9))
 			blk_queue_max_sectors(mddev->queue, PAGE_SIZE>>9);
 
-		if (!smallest || (rdev1->size <smallest->size))
+		if (!smallest || (rdev1->sectors < smallest->sectors))
 			smallest = rdev1;
 		cnt++;
 	}
@@ -155,10 +154,10 @@ static int create_strip_zones (mddev_t *mddev)
 		goto abort;
 	}
 	zone->nb_dev = cnt;
-	zone->sectors = smallest->size * cnt * 2;
+	zone->sectors = smallest->sectors * cnt;
 	zone->zone_start = 0;
 
-	current_start = smallest->size * 2;
+	current_start = smallest->sectors;
 	curr_zone_start = zone->sectors;
 
 	/* now do the other zones */
@@ -177,29 +176,29 @@ static int create_strip_zones (mddev_t *mddev)
 			rdev = conf->strip_zone[0].dev[j];
 			printk(KERN_INFO "raid0: checking %s ...",
 				bdevname(rdev->bdev, b));
-			if (rdev->size > current_start / 2) {
-				printk(KERN_INFO " contained as device %d\n",
-					c);
-				zone->dev[c] = rdev;
-				c++;
-				if (!smallest || (rdev->size <smallest->size)) {
-					smallest = rdev;
-					printk(KERN_INFO "  (%llu) is smallest!.\n",
-						(unsigned long long)rdev->size);
-				}
-			} else
+			if (rdev->sectors <= current_start) {
 				printk(KERN_INFO " nope.\n");
+				continue;
+			}
+			printk(KERN_INFO " contained as device %d\n", c);
+			zone->dev[c] = rdev;
+			c++;
+			if (!smallest || rdev->sectors < smallest->sectors) {
+				smallest = rdev;
+				printk(KERN_INFO "  (%llu) is smallest!.\n",
+					(unsigned long long)rdev->sectors);
+			}
 		}
 
 		zone->nb_dev = c;
-		zone->sectors = (smallest->size * 2 - current_start) * c;
+		zone->sectors = (smallest->sectors - current_start) * c;
 		printk(KERN_INFO "raid0: zone->nb_dev: %d, sectors: %llu\n",
 			zone->nb_dev, (unsigned long long)zone->sectors);
 
 		zone->zone_start = curr_zone_start;
 		curr_zone_start += zone->sectors;
 
-		current_start = smallest->size * 2;
+		current_start = smallest->sectors;
 		printk(KERN_INFO "raid0: current zone start: %llu\n",
 			(unsigned long long)current_start);
 	}
@@ -293,7 +292,7 @@ static int raid0_run (mddev_t *mddev)
 	/* calculate array device size */
 	mddev->array_sectors = 0;
 	list_for_each_entry(rdev, &mddev->disks, same_set)
-		mddev->array_sectors += rdev->size * 2;
+		mddev->array_sectors += rdev->sectors;
 
 	printk(KERN_INFO "raid0 : md_size is %llu sectors.\n",
 		(unsigned long long)mddev->array_sectors);
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 24f899d..9598155 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -49,7 +49,7 @@ struct mdk_rdev_s
 {
 	struct list_head same_set;	/* RAID devices within the same set */
 
-	sector_t size;			/* Device size (in blocks) */
+	sector_t sectors;		/* Device size (in 512bytes sectors) */
 	mddev_t *mddev;			/* RAID array if running */
 	long last_events;		/* IO event timestamp */
 




* [PATCH 07/18] md/raid5: simplify interface for init_stripe and get_active_stripe
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Rather than passing 'pd_idx' and 'disks' to these functions, just pass
'previous' which tells whether to use the 'previous' or 'current'
geometry during a reshape, and let init_stripe calculate
disks and pd_idx and anything else it might need.

This is not a substantial simplification and even adds a division.
However we will shortly be adding more complexity to init_stripe
to handle more interesting 'reshape' activities, and without this
change, the interface to these functions would get very complex.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |   43 ++++++++++++++++++++++---------------------
 1 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 99c5261..94ed206 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -271,8 +271,9 @@ static int grow_buffers(struct stripe_head *sh, int num)
 }
 
 static void raid5_build_block(struct stripe_head *sh, int i);
+static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
+static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 {
 	raid5_conf_t *conf = sh->raid_conf;
 	int i;
@@ -287,11 +288,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 
 	remove_hash(sh);
 
+	sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 	sh->sector = sector;
-	sh->pd_idx = pd_idx;
+	sh->pd_idx = stripe_to_pdidx(sector, conf, sh->disks);
 	sh->state = 0;
 
-	sh->disks = disks;
 
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
@@ -327,10 +328,12 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(struct request_queue *q);
 
-static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks,
-					     int pd_idx, int noblock)
+static struct stripe_head *
+get_active_stripe(raid5_conf_t *conf, sector_t sector,
+		  int previous, int noblock)
 {
 	struct stripe_head *sh;
+	int disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 
 	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
@@ -358,7 +361,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
 					);
 				conf->inactive_blocked = 0;
 			} else
-				init_stripe(sh, sector, pd_idx, disks);
+				init_stripe(sh, sector, previous);
 		} else {
 			if (atomic_read(&sh->count)) {
 			  BUG_ON(!list_empty(&sh->lru));
@@ -2476,8 +2479,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 						conf->raid_disks -
 						conf->max_degraded, &dd_idx,
 						&pd_idx, conf);
-			sh2 = get_active_stripe(conf, s, conf->raid_disks,
-						pd_idx, 1);
+			sh2 = get_active_stripe(conf, s, 0, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
 				 * have been requested.  When later blocks
@@ -3410,8 +3412,10 @@ static int make_request(struct request_queue *q, struct bio * bi)
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int disks, data_disks;
+		int previous;
 
 	retry:
+		previous = 0;
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 		if (likely(conf->expand_progress == MaxSector))
 			disks = conf->raid_disks;
@@ -3426,9 +3430,10 @@ static int make_request(struct request_queue *q, struct bio * bi)
 			 */
 			spin_lock_irq(&conf->device_lock);
 			disks = conf->raid_disks;
-			if (logical_sector >= conf->expand_progress)
+			if (logical_sector >= conf->expand_progress) {
 				disks = conf->previous_raid_disks;
-			else {
+				previous = 1;
+			} else {
 				if (logical_sector >= conf->expand_lo) {
 					spin_unlock_irq(&conf->device_lock);
 					schedule();
@@ -3445,7 +3450,8 @@ static int make_request(struct request_queue *q, struct bio * bi)
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
+		sh = get_active_stripe(conf, new_sector, previous,
+				       (bi->bi_rw&RWA_MASK));
 		if (sh) {
 			if (unlikely(conf->expand_progress != MaxSector)) {
 				/* expansion might have moved on while waiting for a
@@ -3579,9 +3585,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
 		int j;
 		int skipped = 0;
-		pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
-		sh = get_active_stripe(conf, sector_nr+i,
-				       conf->raid_disks, pd_idx, 0);
+		sh = get_active_stripe(conf, sector_nr+i, 0, 0);
 		set_bit(STRIPE_EXPANDING, &sh->state);
 		atomic_inc(&conf->reshape_stripes);
 		/* If any of this stripe is beyond the end of the old
@@ -3629,10 +3633,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
-		pd_idx = stripe_to_pdidx(first_sector, conf,
-					 conf->previous_raid_disks);
-		sh = get_active_stripe(conf, first_sector,
-				       conf->previous_raid_disks, pd_idx, 0);
+		sh = get_active_stripe(conf, first_sector, 1, 0);
 		set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		release_stripe(sh);
@@ -3722,9 +3723,9 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
 
 	pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
-	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
+	sh = get_active_stripe(conf, sector_nr, 0, 1);
 	if (sh == NULL) {
-		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);
+		sh = get_active_stripe(conf, sector_nr, 0, 0);
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access
 		 */
@@ -3790,7 +3791,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 			/* already done this stripe */
 			continue;
 
-		sh = get_active_stripe(conf, sector, conf->raid_disks, pd_idx, 1);
+		sh = get_active_stripe(conf, sector, 0, 1);
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */




* [PATCH 08/18] md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

This is similar to the recent change to get_active_stripe.
There is no functional change, just some rearrangement to make
future patches cleaner.

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/raid5.c |   78 +++++++++++++++++++++++-----------------------------
 1 files changed, 34 insertions(+), 44 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 94ed206..a061484 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -271,7 +271,7 @@ static int grow_buffers(struct stripe_head *sh, int num)
 }
 
 static void raid5_build_block(struct stripe_head *sh, int i);
-static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks);
+static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 {
@@ -290,7 +290,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 
 	sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 	sh->sector = sector;
-	sh->pd_idx = stripe_to_pdidx(sector, conf, sh->disks);
+	sh->pd_idx = stripe_to_pdidx(sector, conf, previous);
 	sh->state = 0;
 
 
@@ -1230,15 +1230,18 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
  * Input: a 'big' sector number,
  * Output: index of the data and parity disk, and the sector # in them.
  */
-static sector_t raid5_compute_sector(sector_t r_sector, unsigned int raid_disks,
-			unsigned int data_disks, unsigned int * dd_idx,
-			unsigned int * pd_idx, raid5_conf_t *conf)
+static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
+				     int previous,
+				     int *dd_idx, int *pd_idx)
 {
 	long stripe;
 	unsigned long chunk_number;
 	unsigned int chunk_offset;
 	sector_t new_sector;
 	int sectors_per_chunk = conf->chunk_size >> 9;
+	int raid_disks = previous ? conf->previous_raid_disks
+				  : conf->raid_disks;
+	int data_disks = raid_disks - conf->max_degraded;
 
 	/* First compute the information on this sector */
 
@@ -1403,7 +1406,9 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	chunk_number = stripe * data_disks + i;
 	r_sector = (sector_t)chunk_number * sectors_per_chunk + chunk_offset;
 
-	check = raid5_compute_sector(r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf);
+	check = raid5_compute_sector(conf, r_sector,
+				     (raid_disks != conf->raid_disks),
+				     &dummy1, &dummy2);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
 		printk(KERN_ERR "compute_blocknr: map not correct\n");
 		return 0;
@@ -1803,16 +1808,18 @@ static int page_is_zero(struct page *p)
 		memcmp(a, a+4, STRIPE_SIZE-4)==0);
 }
 
-static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
+static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous)
 {
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	int pd_idx, dd_idx;
 	int chunk_offset = sector_div(stripe, sectors_per_chunk);
+	int disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 
-	raid5_compute_sector(stripe * (disks - conf->max_degraded)
+	raid5_compute_sector(conf,
+			     stripe * (disks - conf->max_degraded)
 			     *sectors_per_chunk + chunk_offset,
-			     disks, disks - conf->max_degraded,
-			     &dd_idx, &pd_idx, conf);
+			     previous,
+			     &dd_idx, &pd_idx);
 	return pd_idx;
 }
 
@@ -2475,10 +2482,8 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 			struct stripe_head *sh2;
 
 			sector_t bn = compute_blocknr(sh, i);
-			sector_t s = raid5_compute_sector(bn, conf->raid_disks,
-						conf->raid_disks -
-						conf->max_degraded, &dd_idx,
-						&pd_idx, conf);
+			sector_t s = raid5_compute_sector(conf, bn, 0,
+							  &dd_idx, &pd_idx);
 			sh2 = get_active_stripe(conf, s, 0, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
@@ -2765,8 +2770,7 @@ static bool handle_stripe5(struct stripe_head *sh)
 	    !sh->reconstruct_state) {
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
-			conf->raid_disks);
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0);
 		schedule_reconstruction5(sh, &s, 1, 1);
 	} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -2984,8 +2988,7 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing P&Q */
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf,
-					     conf->raid_disks);
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0);
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		for (i = conf->raid_disks ; i-- ;  ) {
 			set_bit(R5_LOCKED, &sh->dev[i].flags);
@@ -3257,8 +3260,6 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	const unsigned int raid_disks = conf->raid_disks;
-	const unsigned int data_disks = raid_disks - conf->max_degraded;
 	unsigned int dd_idx, pd_idx;
 	struct bio* align_bi;
 	mdk_rdev_t *rdev;
@@ -3282,12 +3283,9 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 	/*
 	 *	compute position
 	 */
-	align_bi->bi_sector =  raid5_compute_sector(raid_bio->bi_sector,
-					raid_disks,
-					data_disks,
-					&dd_idx,
-					&pd_idx,
-					conf);
+	align_bi->bi_sector =  raid5_compute_sector(conf, raid_bio->bi_sector,
+						    0,
+						    &dd_idx, &pd_idx);
 
 	rcu_read_lock();
 	rdev = rcu_dereference(conf->disks[dd_idx].rdev);
@@ -3444,8 +3442,9 @@ static int make_request(struct request_queue *q, struct bio * bi)
 		}
 		data_disks = disks - conf->max_degraded;
 
- 		new_sector = raid5_compute_sector(logical_sector, disks, data_disks,
-						  &dd_idx, &pd_idx, conf);
+		new_sector = raid5_compute_sector(conf, logical_sector,
+						  previous,
+						  &dd_idx, &pd_idx);
 		pr_debug("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
@@ -3622,14 +3621,12 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 * block on the destination stripes.
 	 */
 	first_sector =
-		raid5_compute_sector(sector_nr*(new_data_disks),
-				     raid_disks, data_disks,
-				     &dd_idx, &pd_idx, conf);
+		raid5_compute_sector(conf, sector_nr*(new_data_disks),
+				     1, &dd_idx, &pd_idx);
 	last_sector =
-		raid5_compute_sector((sector_nr+conf->chunk_size/512)
-				     *(new_data_disks) -1,
-				     raid_disks, data_disks,
-				     &dd_idx, &pd_idx, conf);
+		raid5_compute_sector(conf, ((sector_nr+conf->chunk_size/512)
+					    *(new_data_disks) - 1),
+				     1, &dd_idx, &pd_idx);
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
@@ -3666,8 +3663,6 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
-	int pd_idx;
-	int raid_disks = conf->raid_disks;
 	sector_t max_sector = mddev->dev_sectors;
 	int sync_blocks;
 	int still_degraded = 0;
@@ -3722,7 +3717,6 @@ static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *ski
 
 	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
 
-	pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
 	sh = get_active_stripe(conf, sector_nr, 0, 1);
 	if (sh == NULL) {
 		sh = get_active_stripe(conf, sector_nr, 0, 0);
@@ -3774,12 +3768,8 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	int handled = 0;
 
 	logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
-	sector = raid5_compute_sector(	logical_sector,
-					conf->raid_disks,
-					conf->raid_disks - conf->max_degraded,
-					&dd_idx,
-					&pd_idx,
-					conf);
+	sector = raid5_compute_sector(conf, logical_sector,
+				      0, &dd_idx, &pd_idx);
 	last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);
 
 	for (; logical_sector < last_sector;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device.
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (7 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 01/18] md: never clear bit from the write-intent bitmap when the array is degraded NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12 16:56   ` Andre Noll
  2009-02-13 16:37   ` Bill Davidsen
  2009-02-12  3:10 ` [PATCH 17/18] md: add ->takeover method for raid5 to be able to take over raid1 NeilBrown
                   ` (10 subsequent siblings)
  19 siblings, 2 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Code currently assumes that the devices in a raid6 stripe are
  0 1 ... N-1 P Q
in some rotated order.  We will shortly add new layouts in which
this strict pattern is broken.
So remove this expectation.  We still assume that the data disks
are roughly in order.  However P and Q can be inserted anywhere within
that order.

Signed-off-by: NeilBrown <neilb@suse.de>
---
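
A minimal standalone sketch of the resulting iteration pattern (the helpers
below imitate raid6_d0()/raid6_next_disk(), and the device positions are
made-up example values, not a real layout): the data blocks are walked in
ring order starting at the first data device, and P and Q are simply skipped
wherever they happen to sit.

  #include <stdio.h>

  static int next_disk(int disk, int raid_disks)
  {
          return (disk + 1) % raid_disks;
  }

  /* first data disk: 0 unless Q sits in the last slot, as in raid6_d0() */
  static int first_data(int qd_idx, int disks)
  {
          return (qd_idx == disks - 1) ? 0 : qd_idx + 1;
  }

  int main(void)
  {
          int disks = 6, pd_idx = 2, qd_idx = 4;  /* arbitrary example layout */
          int d0 = first_data(qd_idx, disks);
          int i = d0, count = 0;

          do {
                  if (i != pd_idx && i != qd_idx)
                          printf("data slot %d -> device %d\n", count++, i);
                  i = next_disk(i, disks);
          } while (i != d0);
          return 0;
  }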

 drivers/md/raid5.c         |  197 +++++++++++++++++++++++++-------------------
 include/linux/raid/raid5.h |   15 ++-
 2 files changed, 120 insertions(+), 92 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a061484..95f39d2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -130,6 +130,14 @@ static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
 	bio->bi_phys_segments = raid5_bi_phys_segments(bio) || (cnt << 16);
 }
 
+/* Find first data disk in a raid6 stripe */
+static inline int raid6_d0(struct stripe_head *sh)
+{
+	if (sh->qd_idx == sh->disks - 1)
+		return 0;
+	else
+		return sh->qd_idx + 1;
+}
 static inline int raid6_next_disk(int disk, int raid_disks)
 {
 	disk++;
@@ -193,6 +201,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
 		}
 	}
 }
+
 static void release_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
@@ -271,12 +280,14 @@ static int grow_buffers(struct stripe_head *sh, int num)
 }
 
 static void raid5_build_block(struct stripe_head *sh, int i);
-static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous);
+static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
+			   int *qd_idx);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 {
 	raid5_conf_t *conf = sh->raid_conf;
 	int i;
+	int qd_idx;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
@@ -290,7 +301,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 
 	sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 	sh->sector = sector;
-	sh->pd_idx = stripe_to_pdidx(sector, conf, previous);
+	sh->pd_idx = stripe_to_pdidx(sector, conf, previous, &qd_idx);
+	sh->qd_idx = qd_idx;
 	sh->state = 0;
 
 
@@ -1232,7 +1244,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
  */
 static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 				     int previous,
-				     int *dd_idx, int *pd_idx)
+				     int *dd_idx, int *pd_idx, int *qd_idx)
 {
 	long stripe;
 	unsigned long chunk_number;
@@ -1265,6 +1277,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 	/*
 	 * Select the parity disk based on the user selected algorithm.
 	 */
+	*qd_idx = ~0;
 	switch(conf->level) {
 	case 4:
 		*pd_idx = data_disks;
@@ -1300,24 +1313,30 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 			*pd_idx = raid_disks - 1 - (stripe % raid_disks);
-			if (*pd_idx == raid_disks-1)
+			*qd_idx = *pd_idx + 1;
+			if (*pd_idx == raid_disks-1) {
 				(*dd_idx)++; 	/* Q D D D P */
-			else if (*dd_idx >= *pd_idx)
+				*qd_idx = 0;
+			} else if (*dd_idx >= *pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
 			break;
 		case ALGORITHM_RIGHT_ASYMMETRIC:
 			*pd_idx = stripe % raid_disks;
-			if (*pd_idx == raid_disks-1)
+			*qd_idx = *pd_idx + 1;
+			if (*pd_idx == raid_disks-1) {
 				(*dd_idx)++; 	/* Q D D D P */
-			else if (*dd_idx >= *pd_idx)
+				*qd_idx = 0;
+			} else if (*dd_idx >= *pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
 			*pd_idx = raid_disks - 1 - (stripe % raid_disks);
+			*qd_idx = (*pd_idx + 1) % raid_disks;
 			*dd_idx = (*pd_idx + 2 + *dd_idx) % raid_disks;
 			break;
 		case ALGORITHM_RIGHT_SYMMETRIC:
 			*pd_idx = stripe % raid_disks;
+			*qd_idx = (*pd_idx + 1) % raid_disks;
 			*dd_idx = (*pd_idx + 2 + *dd_idx) % raid_disks;
 			break;
 		default:
@@ -1344,7 +1363,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
 	int chunk_offset;
-	int chunk_number, dummy1, dummy2, dd_idx = i;
+	int chunk_number, dummy1, dummy2, dummy3, dd_idx = i;
 	sector_t r_sector;
 
 
@@ -1375,7 +1394,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 		}
 		break;
 	case 6:
-		if (i == raid6_next_disk(sh->pd_idx, raid_disks))
+		if (i == sh->qd_idx)
 			return 0; /* It is the Q disk */
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
@@ -1408,7 +1427,7 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 
 	check = raid5_compute_sector(conf, r_sector,
 				     (raid_disks != conf->raid_disks),
-				     &dummy1, &dummy2);
+				     &dummy1, &dummy2, &dummy3);
 	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
 		printk(KERN_ERR "compute_blocknr: map not correct\n");
 		return 0;
@@ -1477,13 +1496,14 @@ static void copy_data(int frombio, struct bio *bio,
 static void compute_parity6(struct stripe_head *sh, int method)
 {
 	raid6_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
+	int i, pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
 	struct bio *chosen;
 	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
 	void *ptrs[disks];
 
-	qd_idx = raid6_next_disk(pd_idx, disks);
-	d0_idx = raid6_next_disk(qd_idx, disks);
+	pd_idx = sh->pd_idx;
+	qd_idx = sh->qd_idx;
+	d0_idx = raid6_d0(sh);
 
 	pr_debug("compute_parity, stripe %llu, method %d\n",
 		(unsigned long long)sh->sector, method);
@@ -1521,22 +1541,23 @@ static void compute_parity6(struct stripe_head *sh, int method)
 			set_bit(R5_UPTODATE, &sh->dev[i].flags);
 		}
 
-//	switch(method) {
-//	case RECONSTRUCT_WRITE:
-//	case CHECK_PARITY:
-//	case UPDATE_PARITY:
-		/* Note that unlike RAID-5, the ordering of the disks matters greatly. */
-		/* FIX: Is this ordering of drives even remotely optimal? */
-		count = 0;
-		i = d0_idx;
-		do {
+	/* Note that unlike RAID-5, the ordering of the disks matters greatly.*/
+	/* FIX: Is this ordering of drives even remotely optimal? */
+	count = 0;
+	i = d0_idx;
+	do {
+		if (i == sh->pd_idx)
+			ptrs[disks-2] = page_address(sh->dev[i].page);
+		else if (i == sh->qd_idx)
+			ptrs[disks-1] = page_address(sh->dev[i].page);
+		else {
 			ptrs[count++] = page_address(sh->dev[i].page);
-			if (count <= disks-2 && !test_bit(R5_UPTODATE, &sh->dev[i].flags))
+			if (!test_bit(R5_UPTODATE, &sh->dev[i].flags))
 				printk("block %d/%d not uptodate on parity calc\n", i,count);
-			i = raid6_next_disk(i, disks);
-		} while ( i != d0_idx );
-//		break;
-//	}
+		}
+		i = raid6_next_disk(i, disks);
+	} while (i != d0_idx);
+	BUG_ON(count+2 != disks);
 
 	raid6_call.gen_syndrome(disks, STRIPE_SIZE, ptrs);
 
@@ -1560,8 +1581,7 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 {
 	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-	int pd_idx = sh->pd_idx;
-	int qd_idx = raid6_next_disk(pd_idx, disks);
+	int qd_idx = sh->qd_idx;
 
 	pr_debug("compute_block_1, stripe %llu, idx %d\n",
 		(unsigned long long)sh->sector, dd_idx);
@@ -1597,21 +1617,36 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 {
 	int i, count, disks = sh->disks;
-	int pd_idx = sh->pd_idx;
-	int qd_idx = raid6_next_disk(pd_idx, disks);
-	int d0_idx = raid6_next_disk(qd_idx, disks);
-	int faila, failb;
+	int d0_idx = raid6_d0(sh);
+	int faila = -1, failb = -1;
+	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
+	void *ptrs[disks];
 
-	/* faila and failb are disk numbers relative to d0_idx */
-	/* pd_idx become disks-2 and qd_idx become disks-1 */
-	faila = (dd_idx1 < d0_idx) ? dd_idx1+(disks-d0_idx) : dd_idx1-d0_idx;
-	failb = (dd_idx2 < d0_idx) ? dd_idx2+(disks-d0_idx) : dd_idx2-d0_idx;
+	count = 0;
+	i = d0_idx;
+	do {
+		int slot;
+		if (i == sh->pd_idx)
+			slot = disks-2;
+		else if (i == sh->qd_idx)
+			slot = disks-1;
+		else
+			slot = count++;
+		ptrs[slot] = page_address(sh->dev[i].page);
+		if (i == dd_idx1)
+			faila = slot;
+		if (i == dd_idx2)
+			failb = slot;
+		i = raid6_next_disk(i, disks);
+	} while (i != d0_idx);
+	BUG_ON(count+2 != disks);
 
 	BUG_ON(faila == failb);
 	if ( failb < faila ) { int tmp = faila; faila = failb; failb = tmp; }
 
 	pr_debug("compute_block_2, stripe %llu, idx %d,%d (%d,%d)\n",
-	       (unsigned long long)sh->sector, dd_idx1, dd_idx2, faila, failb);
+		 (unsigned long long)sh->sector, dd_idx1, dd_idx2,
+		 faila, failb);
 
 	if ( failb == disks-1 ) {
 		/* Q disk is one of the missing disks */
@@ -1621,39 +1656,26 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 			return;
 		} else {
 			/* We're missing D+Q; recompute D from P */
-			compute_block_1(sh, (dd_idx1 == qd_idx) ? dd_idx2 : dd_idx1, 0);
+			compute_block_1(sh, ((dd_idx1 == sh->qd_idx) ?
+					     dd_idx2 : dd_idx1),
+					0);
 			compute_parity6(sh, UPDATE_PARITY); /* Is this necessary? */
 			return;
 		}
 	}
 
-	/* We're missing D+P or D+D; build pointer table */
-	{
-		/**** FIX THIS: This could be very bad if disks is close to 256 ****/
-		void *ptrs[disks];
-
-		count = 0;
-		i = d0_idx;
-		do {
-			ptrs[count++] = page_address(sh->dev[i].page);
-			i = raid6_next_disk(i, disks);
-			if (i != dd_idx1 && i != dd_idx2 &&
-			    !test_bit(R5_UPTODATE, &sh->dev[i].flags))
-				printk("compute_2 with missing block %d/%d\n", count, i);
-		} while ( i != d0_idx );
-
-		if ( failb == disks-2 ) {
-			/* We're missing D+P. */
-			raid6_datap_recov(disks, STRIPE_SIZE, faila, ptrs);
-		} else {
-			/* We're missing D+D. */
-			raid6_2data_recov(disks, STRIPE_SIZE, faila, failb, ptrs);
-		}
-
-		/* Both the above update both missing blocks */
-		set_bit(R5_UPTODATE, &sh->dev[dd_idx1].flags);
-		set_bit(R5_UPTODATE, &sh->dev[dd_idx2].flags);
+	/* We're missing D+P or D+D; */
+	if (failb == disks-2) {
+		/* We're missing D+P. */
+		raid6_datap_recov(disks, STRIPE_SIZE, faila, ptrs);
+	} else {
+		/* We're missing D+D. */
+		raid6_2data_recov(disks, STRIPE_SIZE, faila, failb, ptrs);
 	}
+
+	/* Both the above update both missing blocks */
+	set_bit(R5_UPTODATE, &sh->dev[dd_idx1].flags);
+	set_bit(R5_UPTODATE, &sh->dev[dd_idx2].flags);
 }
 
 static void
@@ -1808,7 +1830,8 @@ static int page_is_zero(struct page *p)
 		memcmp(a, a+4, STRIPE_SIZE-4)==0);
 }
 
-static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous)
+static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
+			   int *qd_idxp)
 {
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	int pd_idx, dd_idx;
@@ -1819,7 +1842,7 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous)
 			     stripe * (disks - conf->max_degraded)
 			     *sectors_per_chunk + chunk_offset,
 			     previous,
-			     &dd_idx, &pd_idx);
+			     &dd_idx, &pd_idx, qd_idxp);
 	return pd_idx;
 }
 
@@ -2478,12 +2501,13 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 	clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	for (i = 0; i < sh->disks; i++)
 		if (i != sh->pd_idx && (!r6s || i != r6s->qd_idx)) {
-			int dd_idx, pd_idx, j;
+			int dd_idx, pd_idx, qd_idx, j;
 			struct stripe_head *sh2;
 
 			sector_t bn = compute_blocknr(sh, i);
-			sector_t s = raid5_compute_sector(conf, bn, 0,
-							  &dd_idx, &pd_idx);
+			sector_t s =
+				raid5_compute_sector(conf, bn, 0,
+						     &dd_idx, &pd_idx, &qd_idx);
 			sh2 = get_active_stripe(conf, s, 0, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
@@ -2507,8 +2531,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 			set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
 			for (j = 0; j < conf->raid_disks; j++)
 				if (j != sh2->pd_idx &&
-				    (!r6s || j != raid6_next_disk(sh2->pd_idx,
-								 sh2->disks)) &&
+				    (!r6s || j != sh2->qd_idx) &&
 				    !test_bit(R5_Expanded, &sh2->dev[j].flags))
 					break;
 			if (j == conf->raid_disks) {
@@ -2768,9 +2791,11 @@ static bool handle_stripe5(struct stripe_head *sh)
 
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
 	    !sh->reconstruct_state) {
+		int qd_idx;
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0);
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0, &qd_idx);
+		sh->qd_idx = qd_idx;
 		schedule_reconstruction5(sh, &s, 1, 1);
 	} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -2811,7 +2836,7 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 	struct r5dev *dev, *pdev, *qdev;
 	mdk_rdev_t *blocked_rdev = NULL;
 
-	r6s.qd_idx = raid6_next_disk(pd_idx, disks);
+	r6s.qd_idx = sh->qd_idx;
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
 		"pd_idx=%d, qd_idx=%d\n",
 	       (unsigned long long)sh->sector, sh->state,
@@ -2987,8 +3012,10 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing P&Q */
+		int qd_idx;
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0);
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0, &qd_idx);
+		sh->qd_idx = qd_idx;
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		for (i = conf->raid_disks ; i-- ;  ) {
 			set_bit(R5_LOCKED, &sh->dev[i].flags);
@@ -3260,7 +3287,7 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	unsigned int dd_idx, pd_idx;
+	unsigned int dd_idx, pd_idx, qd_idx;
 	struct bio* align_bi;
 	mdk_rdev_t *rdev;
 
@@ -3285,7 +3312,7 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 	 */
 	align_bi->bi_sector =  raid5_compute_sector(conf, raid_bio->bi_sector,
 						    0,
-						    &dd_idx, &pd_idx);
+						    &dd_idx, &pd_idx, &qd_idx);
 
 	rcu_read_lock();
 	rdev = rcu_dereference(conf->disks[dd_idx].rdev);
@@ -3377,7 +3404,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	unsigned int dd_idx, pd_idx;
+	int dd_idx, pd_idx, qd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
@@ -3444,7 +3471,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 
 		new_sector = raid5_compute_sector(conf, logical_sector,
 						  previous,
-						  &dd_idx, &pd_idx);
+						  &dd_idx, &pd_idx, &qd_idx);
 		pr_debug("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
@@ -3532,7 +3559,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 */
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
-	int pd_idx;
+	int pd_idx, qd_idx;
 	sector_t first_sector, last_sector;
 	int raid_disks = conf->previous_raid_disks;
 	int data_disks = raid_disks - conf->max_degraded;
@@ -3595,7 +3622,7 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 			if (j == sh->pd_idx)
 				continue;
 			if (conf->level == 6 &&
-			    j == raid6_next_disk(sh->pd_idx, sh->disks))
+			    j == sh->qd_idx)
 				continue;
 			s = compute_blocknr(sh, j);
 			if (s < mddev->array_sectors) {
@@ -3622,11 +3649,11 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 */
 	first_sector =
 		raid5_compute_sector(conf, sector_nr*(new_data_disks),
-				     1, &dd_idx, &pd_idx);
+				     1, &dd_idx, &pd_idx, &qd_idx);
 	last_sector =
 		raid5_compute_sector(conf, ((sector_nr+conf->chunk_size/512)
 					    *(new_data_disks) - 1),
-				     1, &dd_idx, &pd_idx);
+				     1, &dd_idx, &pd_idx, &qd_idx);
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
@@ -3761,7 +3788,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
 	struct stripe_head *sh;
-	int dd_idx, pd_idx;
+	int dd_idx, pd_idx, qd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
 	int remaining;
@@ -3769,7 +3796,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 
 	logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	sector = raid5_compute_sector(conf, logical_sector,
-				      0, &dd_idx, &pd_idx);
+				      0, &dd_idx, &pd_idx, &qd_idx);
 	last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);
 
 	for (; logical_sector < last_sector;
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 3b26727..804dac7 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -197,15 +197,16 @@ enum reconstruct_states {
 
 struct stripe_head {
 	struct hlist_node	hash;
-	struct list_head	lru;			/* inactive_list or handle_list */
-	struct raid5_private_data	*raid_conf;
-	sector_t		sector;			/* sector of this row */
-	int			pd_idx;			/* parity disk index */
-	unsigned long		state;			/* state flags */
-	atomic_t		count;			/* nr of active thread/requests */
+	struct list_head	lru;	      /* inactive_list or handle_list */
+	struct raid5_private_data *raid_conf;
+	sector_t		sector;		/* sector of this row */
+	short			pd_idx;		/* parity disk index */
+	short			qd_idx;		/* 'Q' disk index for raid6 */
+	unsigned long		state;		/* state flags */
+	atomic_t		count;	      /* nr of active thread/requests */
 	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
-	int			disks;			/* disks in stripe */
+	int			disks;		/* disks in stripe */
 	enum check_states	check_state;
 	enum reconstruct_states reconstruct_state;
 	/* stripe_operations



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 11/18] md/raid5: Add support for new layouts for raid5 and raid6.
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (10 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 13/18] md/raid5: refactor raid5 "run" NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 14/18] md: md_unregister_thread should cope with being passed NULL NeilBrown
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

DDF uses different layouts for P and Q blocks than current md/raid6,
so add those that are missing.
Also add support for RAID6 layouts that are identical to various
raid5 layouts with the simple addition of one device to hold all of
the 'Q' blocks.
Finally add 'raid5' layouts to match raid4.
These last two will allow online level conversion.

Note that this does not provide correct support for DDF/raid6 yet
as the order in which data blocks are summed to produce the Q block
is significant and different between current md code and DDF
requirements.

Signed-off-by: NeilBrown <neilb@suse.de>
---
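
To make the '_6' idea concrete, here is a small standalone sketch (example
disk counts only, not kernel code) of how a layout like LEFT_SYMMETRIC_6
places parity: P rotates across the first raid_disks-1 devices exactly as
the matching raid5 layout would, while Q is pinned to the last device, which
is what makes a raid5 -> raid6 conversion just a matter of adding one drive.

  #include <stdio.h>

  int main(void)
  {
          int raid_disks = 5;                     /* 4 raid5-style slots plus Q */
          int data_disks = raid_disks - 2;
          long long stripe;

          for (stripe = 0; stripe < 4; stripe++) {
                  /* same formula as the LEFT_SYMMETRIC_6 case below */
                  int pd_idx = data_disks - (int)(stripe % (raid_disks - 1));
                  int qd_idx = raid_disks - 1;    /* Q always on the last device */

                  printf("stripe %lld: P on device %d, Q on device %d\n",
                         stripe, pd_idx, qd_idx);
          }
          return 0;
  }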

 drivers/md/raid5.c         |  151 ++++++++++++++++++++++++++++++++++++++++----
 include/linux/raid/raid5.h |   61 +++++++++++++++++-
 2 files changed, 193 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ab751b2..b26b637 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1079,7 +1079,7 @@ static void shrink_stripes(raid5_conf_t *conf)
 
 static void raid5_end_read_request(struct bio * bi, int error)
 {
- 	struct stripe_head *sh = bi->bi_private;
+	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
 	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -1161,7 +1161,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 
 static void raid5_end_write_request(struct bio *bi, int error)
 {
- 	struct stripe_head *sh = bi->bi_private;
+	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
 	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -1301,20 +1301,27 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 			pd_idx = stripe % raid_disks;
 			*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
 			break;
+		case ALGORITHM_PARITY_0:
+			pd_idx = 0;
+			(*dd_idx)++;
+			break;
+		case ALGORITHM_PARITY_N:
+			pd_idx = data_disks;
+			break;
 		default:
 			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
 				conf->algorithm);
+			BUG();
 		}
 		break;
 	case 6:
 
-		/**** FIX THIS ****/
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 			pd_idx = raid_disks - 1 - (stripe % raid_disks);
 			qd_idx = pd_idx + 1;
 			if (pd_idx == raid_disks-1) {
-				(*dd_idx)++; 	/* Q D D D P */
+				(*dd_idx)++;	/* Q D D D P */
 				qd_idx = 0;
 			} else if (*dd_idx >= pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
@@ -1323,7 +1330,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 			pd_idx = stripe % raid_disks;
 			qd_idx = pd_idx + 1;
 			if (pd_idx == raid_disks-1) {
-				(*dd_idx)++; 	/* Q D D D P */
+				(*dd_idx)++;	/* Q D D D P */
 				qd_idx = 0;
 			} else if (*dd_idx >= pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
@@ -1338,9 +1345,89 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 			qd_idx = (pd_idx + 1) % raid_disks;
 			*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;
 			break;
+
+		case ALGORITHM_PARITY_0:
+			pd_idx = 0;
+			qd_idx = 1;
+			(*dd_idx) += 2;
+			break;
+		case ALGORITHM_PARITY_N:
+			pd_idx = data_disks;
+			qd_idx = data_disks + 1;
+			break;
+
+		case ALGORITHM_ROTATING_ZERO_RESTART:
+			/* Exactly the same as RIGHT_ASYMMETRIC, but the order
+			 * of blocks for computing Q is different.
+			 */
+			pd_idx = stripe % raid_disks;
+			qd_idx = pd_idx + 1;
+			if (pd_idx == raid_disks-1) {
+				(*dd_idx)++;	/* Q D D D P */
+				qd_idx = 0;
+			} else if (*dd_idx >= pd_idx)
+				(*dd_idx) += 2; /* D D P Q D */
+			break;
+
+		case ALGORITHM_ROTATING_N_RESTART:
+			/* Same as left_asymmetric, but first stripe is
+			 * D D D P Q  rather than
+			 * Q D D D P
+			 */
+			pd_idx = raid_disks - 1 - ((stripe + 1) % raid_disks);
+			qd_idx = pd_idx + 1;
+			if (pd_idx == raid_disks-1) {
+				(*dd_idx)++;	/* Q D D D P */
+				qd_idx = 0;
+			} else if (*dd_idx >= pd_idx)
+				(*dd_idx) += 2; /* D D P Q D */
+			break;
+
+		case ALGORITHM_ROTATING_N_CONTINUE:
+			/* Same as left_symmetric but Q is before P */
+			pd_idx = raid_disks - 1 - (stripe % raid_disks);
+			qd_idx = (pd_idx + raid_disks - 1) % raid_disks;
+			*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
+			break;
+
+		case ALGORITHM_LEFT_ASYMMETRIC_6:
+			/* RAID5 left_asymmetric, with Q on last device */
+			pd_idx = data_disks - stripe % (raid_disks-1);
+			if (*dd_idx >= pd_idx)
+				(*dd_idx)++;
+			qd_idx = raid_disks - 1;
+			break;
+
+		case ALGORITHM_RIGHT_ASYMMETRIC_6:
+			pd_idx = stripe % (raid_disks-1);
+			if (*dd_idx >= pd_idx)
+				(*dd_idx)++;
+			qd_idx = raid_disks - 1;
+			break;
+
+		case ALGORITHM_LEFT_SYMMETRIC_6:
+			pd_idx = data_disks - stripe % (raid_disks-1);
+			*dd_idx = (pd_idx + 1 + *dd_idx) % (raid_disks-1);
+			qd_idx = raid_disks - 1;
+			break;
+
+		case ALGORITHM_RIGHT_SYMMETRIC_6:
+			pd_idx = stripe % (raid_disks-1);
+			*dd_idx = (pd_idx + 1 + *dd_idx) % (raid_disks-1);
+			qd_idx = raid_disks - 1;
+			break;
+
+		case ALGORITHM_PARITY_0_6:
+			pd_idx = 0;
+			(*dd_idx)++;
+			qd_idx = raid_disks - 1;
+			break;
+
+
 		default:
 			printk(KERN_CRIT "raid6: unsupported algorithm %d\n",
 			       conf->algorithm);
+			BUG();
 		}
 		break;
 	}
@@ -1392,9 +1479,15 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 				i += raid_disks;
 			i -= (sh->pd_idx + 1);
 			break;
+		case ALGORITHM_PARITY_0:
+			i -= 1;
+			break;
+		case ALGORITHM_PARITY_N:
+			break;
 		default:
 			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
 			       conf->algorithm);
+			BUG();
 		}
 		break;
 	case 6:
@@ -1403,8 +1496,10 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-		  	if (sh->pd_idx == raid_disks-1)
-				i--; 	/* Q D D D P */
+		case ALGORITHM_ROTATING_ZERO_RESTART:
+		case ALGORITHM_ROTATING_N_RESTART:
+			if (sh->pd_idx == raid_disks-1)
+				i--;	/* Q D D D P */
 			else if (i > sh->pd_idx)
 				i -= 2; /* D D P Q D */
 			break;
@@ -1419,9 +1514,35 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 				i -= (sh->pd_idx + 2);
 			}
 			break;
+		case ALGORITHM_PARITY_0:
+			i -= 2;
+			break;
+		case ALGORITHM_PARITY_N:
+			break;
+		case ALGORITHM_ROTATING_N_CONTINUE:
+			if (sh->pd_idx == 0)
+				i--;	/* P D D D Q */
+			else if (i > sh->pd_idx)
+				i -= 2; /* D D Q P D */
+			break;
+		case ALGORITHM_LEFT_ASYMMETRIC_6:
+		case ALGORITHM_RIGHT_ASYMMETRIC_6:
+			if (i > sh->pd_idx)
+				i--;
+			break;
+		case ALGORITHM_LEFT_SYMMETRIC_6:
+		case ALGORITHM_RIGHT_SYMMETRIC_6:
+			if (i < sh->pd_idx)
+				i += data_disks + 1;
+			i -= (sh->pd_idx + 1);
+			break;
+		case ALGORITHM_PARITY_0_6:
+			i -= 1;
+			break;
 		default:
 			printk(KERN_CRIT "raid6: unsupported algorithm %d\n",
 			       conf->algorithm);
+			BUG();
 		}
 		break;
 	}
@@ -3295,7 +3416,7 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 		return 0;
 	}
 	/*
- 	 * use bio_clone to make a copy of the bio
+	 * use bio_clone to make a copy of the bio
 	 */
 	align_bi = bio_clone(raid_bio, GFP_NOIO);
 	if (!align_bi)
@@ -3426,7 +3547,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 	if (rw == READ &&
 	     mddev->reshape_position == MaxSector &&
 	     chunk_aligned_read(q,bi))
-            	return 0;
+		return 0;
 
 	logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	last_sector = bi->bi_sector + (bi->bi_size>>9);
@@ -4021,6 +4142,12 @@ static int run(mddev_t *mddev)
 		       mdname(mddev), mddev->level);
 		return -EIO;
 	}
+	if ((mddev->level == 5 && !algorithm_valid_raid5(mddev->layout)) ||
+	    (mddev->level == 6 && !algorithm_valid_raid6(mddev->layout))) {
+		printk(KERN_ERR "raid5: %s: layout %d not supported\n",
+		       mdname(mddev), mddev->layout);
+		return -EIO;
+	}
 
 	if (mddev->chunk_size < PAGE_SIZE) {
 		printk(KERN_ERR "md/raid5: chunk_size must be at least "
@@ -4172,12 +4299,6 @@ static int run(mddev_t *mddev)
 			conf->chunk_size, mdname(mddev));
 		goto abort;
 	}
-	if (conf->algorithm > ALGORITHM_RIGHT_SYMMETRIC) {
-		printk(KERN_ERR 
-			"raid5: unsupported parity algorithm %d for %s\n",
-			conf->algorithm, mdname(mddev));
-		goto abort;
-	}
 	if (mddev->degraded > conf->max_degraded) {
 		printk(KERN_ERR "raid5: not enough operational devices for %s"
 			" (%d/%d failed)\n",
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 804dac7..4d43b08 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -395,9 +395,62 @@ typedef struct raid5_private_data raid5_conf_t;
 /*
  * Our supported algorithms
  */
-#define ALGORITHM_LEFT_ASYMMETRIC	0
-#define ALGORITHM_RIGHT_ASYMMETRIC	1
-#define ALGORITHM_LEFT_SYMMETRIC	2
-#define ALGORITHM_RIGHT_SYMMETRIC	3
+#define ALGORITHM_LEFT_ASYMMETRIC	0 /* Rotating Parity N with Data Restart */
+#define ALGORITHM_RIGHT_ASYMMETRIC	1 /* Rotating Parity 0 with Data Restart */
+#define ALGORITHM_LEFT_SYMMETRIC	2 /* Rotating Parity N with Data Continuation */
+#define ALGORITHM_RIGHT_SYMMETRIC	3 /* Rotating Parity 0 with Data Continuation */
 
+/* Define non-rotating (raid4) algorithms.  These allow
+ * conversion of raid4 to raid5.
+ */
+#define ALGORITHM_PARITY_0		4 /* P or P,Q are initial devices */
+#define ALGORITHM_PARITY_N		5 /* P or P,Q are final devices. */
+
+/* DDF RAID6 layouts differ from md/raid6 layouts in two ways.
+ * Firstly, the exact positioning of the parity block is slightly
+ * different between the 'LEFT_*' modes of md and the "_N_*" modes
+ * of DDF.
+ * Secondly, the order of data blocks over which the Q syndrome is computed
+ * is different.
+ * Consequently we have different layouts for DDF/raid6 than md/raid6.
+ * These layouts are from the DDFv1.2 spec.
+ * Interestingly DDFv1.2-Errata-A does not specify N_CONTINUE but
+ * leaves RLQ=3 as 'Vendor Specific'
+ */
+
+#define ALGORITHM_ROTATING_ZERO_RESTART	8 /* DDF PRL=6 RLQ=1 */
+#define ALGORITHM_ROTATING_N_RESTART	9 /* DDF PRL=6 RLQ=2 */
+#define ALGORITHM_ROTATING_N_CONTINUE	10 /*DDF PRL=6 RLQ=3 */
+
+
+/* For every RAID5 algorithm we define a RAID6 algorithm
+ * with exactly the same layout for data and parity, and
+ * with the Q block always on the last device (N-1).
+ * This allows trivial conversion from RAID5 to RAID6
+ */
+#define ALGORITHM_LEFT_ASYMMETRIC_6	16
+#define ALGORITHM_RIGHT_ASYMMETRIC_6	17
+#define ALGORITHM_LEFT_SYMMETRIC_6	18
+#define ALGORITHM_RIGHT_SYMMETRIC_6	19
+#define ALGORITHM_PARITY_0_6		20
+#define ALGORITHM_PARITY_N_6		ALGORITHM_PARITY_N
+
+static inline int algorithm_valid_raid5(int layout)
+{
+	return (layout >= 0) &&
+		(layout <= 5);
+}
+static inline int algorithm_valid_raid6(int layout)
+{
+	return (layout >= 0 && layout <= 5)
+		||
+		(layout == 8 || layout == 10)
+		||
+		(layout >= 16 && layout <= 20);
+}
+
+static inline int algorithm_is_DDF(int layout)
+{
+	return layout >= 8 && layout <= 10;
+}
 #endif



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/18] md/raid5: simplify raid5_compute_sector interface
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (13 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 12/18] md/raid5: finish support for DDF/raid6 NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 16/18] md: add ->takeover method to support changing the personality managing an array NeilBrown
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
a 'struct stripe_head *' and fill in the relevant fields.  This is
more extensible.

Signed-off-by: NeilBrown <neilb@suse.de>
---
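
A toy sketch of the calling convention this gives (the types and the mapping
maths below are placeholders, not the kernel's): callers that only want the
data-device index and the mapped sector pass NULL, while code initialising a
stripe passes the stripe_head and gets pd_idx/qd_idx filled in as a side
effect.

  #include <stdio.h>

  struct stripe_head { short pd_idx, qd_idx; };

  /* placeholder maths only; the real layout selection lives in
   * raid5_compute_sector() */
  static long long compute_sector(long long r_sector, int *dd_idx,
                                  struct stripe_head *sh)
  {
          *dd_idx = (int)(r_sector % 3);
          if (sh) {                       /* fill result fields only on request */
                  sh->pd_idx = 3;
                  sh->qd_idx = 4;
          }
          return r_sector / 3;
  }

  int main(void)
  {
          struct stripe_head sh;
          int dd_idx;

          compute_sector(1000, &dd_idx, NULL);    /* read path: no stripe needed */
          compute_sector(1000, &dd_idx, &sh);     /* init path: fill pd/qd too */
          printf("dd=%d pd=%d qd=%d\n", dd_idx, sh.pd_idx, sh.qd_idx);
          return 0;
  }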

 drivers/md/raid5.c |  118 ++++++++++++++++++++++++++--------------------------
 1 files changed, 58 insertions(+), 60 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 95f39d2..ab751b2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -280,14 +280,13 @@ static int grow_buffers(struct stripe_head *sh, int num)
 }
 
 static void raid5_build_block(struct stripe_head *sh, int i);
-static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
-			   int *qd_idx);
+static void stripe_set_idx(sector_t stripe, raid5_conf_t *conf, int previous,
+			    struct stripe_head *sh);
 
 static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 {
 	raid5_conf_t *conf = sh->raid_conf;
 	int i;
-	int qd_idx;
 
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
@@ -301,8 +300,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 
 	sh->disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 	sh->sector = sector;
-	sh->pd_idx = stripe_to_pdidx(sector, conf, previous, &qd_idx);
-	sh->qd_idx = qd_idx;
+	stripe_set_idx(sector, conf, previous, sh);
 	sh->state = 0;
 
 
@@ -1243,12 +1241,13 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev)
  * Output: index of the data and parity disk, and the sector # in them.
  */
 static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
-				     int previous,
-				     int *dd_idx, int *pd_idx, int *qd_idx)
+				     int previous, int *dd_idx,
+				     struct stripe_head *sh)
 {
 	long stripe;
 	unsigned long chunk_number;
 	unsigned int chunk_offset;
+	int pd_idx, qd_idx;
 	sector_t new_sector;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	int raid_disks = previous ? conf->previous_raid_disks
@@ -1277,30 +1276,30 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 	/*
 	 * Select the parity disk based on the user selected algorithm.
 	 */
-	*qd_idx = ~0;
+	pd_idx = qd_idx = ~0;
 	switch(conf->level) {
 	case 4:
-		*pd_idx = data_disks;
+		pd_idx = data_disks;
 		break;
 	case 5:
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
-			*pd_idx = data_disks - stripe % raid_disks;
-			if (*dd_idx >= *pd_idx)
+			pd_idx = data_disks - stripe % raid_disks;
+			if (*dd_idx >= pd_idx)
 				(*dd_idx)++;
 			break;
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-			*pd_idx = stripe % raid_disks;
-			if (*dd_idx >= *pd_idx)
+			pd_idx = stripe % raid_disks;
+			if (*dd_idx >= pd_idx)
 				(*dd_idx)++;
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
-			*pd_idx = data_disks - stripe % raid_disks;
-			*dd_idx = (*pd_idx + 1 + *dd_idx) % raid_disks;
+			pd_idx = data_disks - stripe % raid_disks;
+			*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
 			break;
 		case ALGORITHM_RIGHT_SYMMETRIC:
-			*pd_idx = stripe % raid_disks;
-			*dd_idx = (*pd_idx + 1 + *dd_idx) % raid_disks;
+			pd_idx = stripe % raid_disks;
+			*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
 			break;
 		default:
 			printk(KERN_ERR "raid5: unsupported algorithm %d\n",
@@ -1312,32 +1311,32 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 		/**** FIX THIS ****/
 		switch (conf->algorithm) {
 		case ALGORITHM_LEFT_ASYMMETRIC:
-			*pd_idx = raid_disks - 1 - (stripe % raid_disks);
-			*qd_idx = *pd_idx + 1;
-			if (*pd_idx == raid_disks-1) {
+			pd_idx = raid_disks - 1 - (stripe % raid_disks);
+			qd_idx = pd_idx + 1;
+			if (pd_idx == raid_disks-1) {
 				(*dd_idx)++; 	/* Q D D D P */
-				*qd_idx = 0;
-			} else if (*dd_idx >= *pd_idx)
+				qd_idx = 0;
+			} else if (*dd_idx >= pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
 			break;
 		case ALGORITHM_RIGHT_ASYMMETRIC:
-			*pd_idx = stripe % raid_disks;
-			*qd_idx = *pd_idx + 1;
-			if (*pd_idx == raid_disks-1) {
+			pd_idx = stripe % raid_disks;
+			qd_idx = pd_idx + 1;
+			if (pd_idx == raid_disks-1) {
 				(*dd_idx)++; 	/* Q D D D P */
-				*qd_idx = 0;
-			} else if (*dd_idx >= *pd_idx)
+				qd_idx = 0;
+			} else if (*dd_idx >= pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
 			break;
 		case ALGORITHM_LEFT_SYMMETRIC:
-			*pd_idx = raid_disks - 1 - (stripe % raid_disks);
-			*qd_idx = (*pd_idx + 1) % raid_disks;
-			*dd_idx = (*pd_idx + 2 + *dd_idx) % raid_disks;
+			pd_idx = raid_disks - 1 - (stripe % raid_disks);
+			qd_idx = (pd_idx + 1) % raid_disks;
+			*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;
 			break;
 		case ALGORITHM_RIGHT_SYMMETRIC:
-			*pd_idx = stripe % raid_disks;
-			*qd_idx = (*pd_idx + 1) % raid_disks;
-			*dd_idx = (*pd_idx + 2 + *dd_idx) % raid_disks;
+			pd_idx = stripe % raid_disks;
+			qd_idx = (pd_idx + 1) % raid_disks;
+			*dd_idx = (pd_idx + 2 + *dd_idx) % raid_disks;
 			break;
 		default:
 			printk(KERN_CRIT "raid6: unsupported algorithm %d\n",
@@ -1346,6 +1345,10 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 		break;
 	}
 
+	if (sh) {
+		sh->pd_idx = pd_idx;
+		sh->qd_idx = qd_idx;
+	}
 	/*
 	 * Finally, compute the new sector number
 	 */
@@ -1363,8 +1366,9 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
 	int chunk_offset;
-	int chunk_number, dummy1, dummy2, dummy3, dd_idx = i;
+	int chunk_number, dummy1, dd_idx = i;
 	sector_t r_sector;
+	struct stripe_head sh2;
 
 
 	chunk_offset = sector_div(new_sector, sectors_per_chunk);
@@ -1427,8 +1431,9 @@ static sector_t compute_blocknr(struct stripe_head *sh, int i)
 
 	check = raid5_compute_sector(conf, r_sector,
 				     (raid_disks != conf->raid_disks),
-				     &dummy1, &dummy2, &dummy3);
-	if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) {
+				     &dummy1, &sh2);
+	if (check != sh->sector || dummy1 != dd_idx || sh2.pd_idx != sh->pd_idx
+		|| sh2.qd_idx != sh->qd_idx) {
 		printk(KERN_ERR "compute_blocknr: map not correct\n");
 		return 0;
 	}
@@ -1830,11 +1835,11 @@ static int page_is_zero(struct page *p)
 		memcmp(a, a+4, STRIPE_SIZE-4)==0);
 }
 
-static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
-			   int *qd_idxp)
+static void stripe_set_idx(sector_t stripe, raid5_conf_t *conf, int previous,
+			    struct stripe_head *sh)
 {
 	int sectors_per_chunk = conf->chunk_size >> 9;
-	int pd_idx, dd_idx;
+	int dd_idx;
 	int chunk_offset = sector_div(stripe, sectors_per_chunk);
 	int disks = previous ? conf->previous_raid_disks : conf->raid_disks;
 
@@ -1842,8 +1847,7 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
 			     stripe * (disks - conf->max_degraded)
 			     *sectors_per_chunk + chunk_offset,
 			     previous,
-			     &dd_idx, &pd_idx, qd_idxp);
-	return pd_idx;
+			     &dd_idx, sh);
 }
 
 static void
@@ -2501,13 +2505,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh,
 	clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	for (i = 0; i < sh->disks; i++)
 		if (i != sh->pd_idx && (!r6s || i != r6s->qd_idx)) {
-			int dd_idx, pd_idx, qd_idx, j;
+			int dd_idx, j;
 			struct stripe_head *sh2;
 
 			sector_t bn = compute_blocknr(sh, i);
-			sector_t s =
-				raid5_compute_sector(conf, bn, 0,
-						     &dd_idx, &pd_idx, &qd_idx);
+			sector_t s = raid5_compute_sector(conf, bn, 0,
+							  &dd_idx, NULL);
 			sh2 = get_active_stripe(conf, s, 0, 1);
 			if (sh2 == NULL)
 				/* so far only the early blocks of this stripe
@@ -2791,11 +2794,9 @@ static bool handle_stripe5(struct stripe_head *sh)
 
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
 	    !sh->reconstruct_state) {
-		int qd_idx;
 		/* Need to write out all blocks after computing parity */
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0, &qd_idx);
-		sh->qd_idx = qd_idx;
+		stripe_set_idx(sh->sector, conf, 0, sh);
 		schedule_reconstruction5(sh, &s, 1, 1);
 	} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
@@ -3012,10 +3013,8 @@ static bool handle_stripe6(struct stripe_head *sh, struct page *tmp_page)
 
 	if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
 		/* Need to write out all blocks after computing P&Q */
-		int qd_idx;
 		sh->disks = conf->raid_disks;
-		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 0, &qd_idx);
-		sh->qd_idx = qd_idx;
+		stripe_set_idx(sh->sector, conf, 0, sh);
 		compute_parity6(sh, RECONSTRUCT_WRITE);
 		for (i = conf->raid_disks ; i-- ;  ) {
 			set_bit(R5_LOCKED, &sh->dev[i].flags);
@@ -3287,7 +3286,7 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	unsigned int dd_idx, pd_idx, qd_idx;
+	unsigned int dd_idx;
 	struct bio* align_bi;
 	mdk_rdev_t *rdev;
 
@@ -3312,7 +3311,7 @@ static int chunk_aligned_read(struct request_queue *q, struct bio * raid_bio)
 	 */
 	align_bi->bi_sector =  raid5_compute_sector(conf, raid_bio->bi_sector,
 						    0,
-						    &dd_idx, &pd_idx, &qd_idx);
+						    &dd_idx, NULL);
 
 	rcu_read_lock();
 	rdev = rcu_dereference(conf->disks[dd_idx].rdev);
@@ -3404,7 +3403,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	int dd_idx, pd_idx, qd_idx;
+	int dd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
 	struct stripe_head *sh;
@@ -3471,7 +3470,7 @@ static int make_request(struct request_queue *q, struct bio * bi)
 
 		new_sector = raid5_compute_sector(conf, logical_sector,
 						  previous,
-						  &dd_idx, &pd_idx, &qd_idx);
+						  &dd_idx, NULL);
 		pr_debug("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
@@ -3559,7 +3558,6 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 */
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
-	int pd_idx, qd_idx;
 	sector_t first_sector, last_sector;
 	int raid_disks = conf->previous_raid_disks;
 	int data_disks = raid_disks - conf->max_degraded;
@@ -3649,11 +3647,11 @@ static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped
 	 */
 	first_sector =
 		raid5_compute_sector(conf, sector_nr*(new_data_disks),
-				     1, &dd_idx, &pd_idx, &qd_idx);
+				     1, &dd_idx, NULL);
 	last_sector =
 		raid5_compute_sector(conf, ((sector_nr+conf->chunk_size/512)
 					    *(new_data_disks) - 1),
-				     1, &dd_idx, &pd_idx, &qd_idx);
+				     1, &dd_idx, NULL);
 	if (last_sector >= mddev->dev_sectors)
 		last_sector = mddev->dev_sectors - 1;
 	while (first_sector <= last_sector) {
@@ -3788,7 +3786,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	 * it will be only one 'dd_idx' and only need one call to raid5_compute_sector.
 	 */
 	struct stripe_head *sh;
-	int dd_idx, pd_idx, qd_idx;
+	int dd_idx;
 	sector_t sector, logical_sector, last_sector;
 	int scnt = 0;
 	int remaining;
@@ -3796,7 +3794,7 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 
 	logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
 	sector = raid5_compute_sector(conf, logical_sector,
-				      0, &dd_idx, &pd_idx, &qd_idx);
+				      0, &dd_idx, NULL);
 	last_sector = raid_bio->bi_sector + (raid_bio->bi_size>>9);
 
 	for (; logical_sector < last_sector;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 12/18] md/raid5: finish support for DDF/raid6
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (12 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 14/18] md: md_unregister_thread should cope with being passed NULL NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 10/18] md/raid5: simplify raid5_compute_sector interface NeilBrown
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

DDF requires RAID6 calculations over different devices in a different
order.
For md/raid6, we calculate over just the data devices, starting
immediately after the 'Q' block.
For ddf/raid6 we calculate over all devices, using zeros in place of
the P and Q blocks.

This requires unfortunately complex loops...

Signed-off-by: NeilBrown <neilb@suse.de>
---
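
The difference in summation order can be sketched in a few lines of
standalone C (buffer sizes, indices and the zero block below are
illustrative stand-ins): md feeds the syndrome calculation only the data
blocks, with P and Q in the two slots after them, while ddf feeds one input
per device position and substitutes a zero block wherever P or Q lives.

  #include <stdio.h>

  #define NDISKS 6

  static char zero_block[16];              /* stands in for raid6_empty_zero_page */
  static char dev_block[NDISKS][16];

  int main(void)
  {
          int pd_idx = 2, qd_idx = 3, ddf_layout = 1;
          int syndrome_disks = ddf_layout ? NDISKS : NDISKS - 2;
          const void *ptrs[NDISKS + 2];           /* data inputs, then P, then Q */
          int i, count = 0;

          for (i = 0; i < NDISKS; i++) {
                  const void *dblk = ddf_layout ? zero_block : NULL;

                  if (i == pd_idx)
                          ptrs[syndrome_disks] = dev_block[i];
                  else if (i == qd_idx)
                          ptrs[syndrome_disks + 1] = dev_block[i];
                  else
                          dblk = dev_block[i];
                  if (dblk)                       /* zero block only in ddf mode */
                          ptrs[count++] = dblk;
          }
          /* sanity check: every slot up to and including Q has been filled */
          for (i = 0; i < syndrome_disks + 2; i++)
                  if (!ptrs[i])
                          printf("slot %d left empty\n", i);
          printf("Q computed over %d inputs; P in slot %d, Q in slot %d\n",
                 count, syndrome_disks, syndrome_disks + 1);
          return 0;
  }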

 drivers/md/raid5.c         |   62 +++++++++++++++++++++++++++++---------------
 include/linux/raid/raid5.h |    1 +
 2 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b26b637..f1dbfc4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -133,6 +133,10 @@ static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
+	if (sh->ddf_layout)
+		/* ddf always start from first device */
+		return 0;
+	/* md starts just after Q block */
 	if (sh->qd_idx == sh->disks - 1)
 		return 0;
 	else
@@ -1248,6 +1252,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 	unsigned long chunk_number;
 	unsigned int chunk_offset;
 	int pd_idx, qd_idx;
+	int ddf_layout = 0;
 	sector_t new_sector;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	int raid_disks = previous ? conf->previous_raid_disks
@@ -1367,6 +1372,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 				qd_idx = 0;
 			} else if (*dd_idx >= pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
+			ddf_layout = 1;
 			break;
 
 		case ALGORITHM_ROTATING_N_RESTART:
@@ -1381,6 +1387,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 				qd_idx = 0;
 			} else if (*dd_idx >= pd_idx)
 				(*dd_idx) += 2; /* D D P Q D */
+			ddf_layout = 1;
 			break;
 
 		case ALGORITHM_ROTATING_N_CONTINUE:
@@ -1388,6 +1395,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 			pd_idx = raid_disks - 1 - (stripe % raid_disks);
 			qd_idx = (pd_idx + raid_disks - 1) % raid_disks;
 			*dd_idx = (pd_idx + 1 + *dd_idx) % raid_disks;
+			ddf_layout = 1;
 			break;
 
 		case ALGORITHM_LEFT_ASYMMETRIC_6:
@@ -1435,6 +1443,7 @@ static sector_t raid5_compute_sector(raid5_conf_t *conf, sector_t r_sector,
 	if (sh) {
 		sh->pd_idx = pd_idx;
 		sh->qd_idx = qd_idx;
+		sh->ddf_layout = ddf_layout;
 	}
 	/*
 	 * Finally, compute the new sector number
@@ -1623,9 +1632,10 @@ static void compute_parity6(struct stripe_head *sh, int method)
 {
 	raid6_conf_t *conf = sh->raid_conf;
 	int i, pd_idx, qd_idx, d0_idx, disks = sh->disks, count;
+	int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
 	struct bio *chosen;
 	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
-	void *ptrs[disks];
+	void *ptrs[syndrome_disks+2];
 
 	pd_idx = sh->pd_idx;
 	qd_idx = sh->qd_idx;
@@ -1672,20 +1682,23 @@ static void compute_parity6(struct stripe_head *sh, int method)
 	count = 0;
 	i = d0_idx;
 	do {
+		const void *dblk = sh->ddf_layout ? raid6_empty_zero_page : NULL;
 		if (i == sh->pd_idx)
-			ptrs[disks-2] = page_address(sh->dev[i].page);
+			ptrs[syndrome_disks] = page_address(sh->dev[i].page);
 		else if (i == sh->qd_idx)
-			ptrs[disks-1] = page_address(sh->dev[i].page);
+			ptrs[syndrome_disks+1] = page_address(sh->dev[i].page);
 		else {
-			ptrs[count++] = page_address(sh->dev[i].page);
+			dblk = page_address(sh->dev[i].page);
 			if (!test_bit(R5_UPTODATE, &sh->dev[i].flags))
 				printk("block %d/%d not uptodate on parity calc\n", i,count);
 		}
+		if (dblk)
+			ptrs[count++] = (void*)dblk;
 		i = raid6_next_disk(i, disks);
 	} while (i != d0_idx);
-	BUG_ON(count+2 != disks);
+	BUG_ON(count != syndrome_disks);
 
-	raid6_call.gen_syndrome(disks, STRIPE_SIZE, ptrs);
+	raid6_call.gen_syndrome(syndrome_disks, STRIPE_SIZE, ptrs);
 
 	switch(method) {
 	case RECONSTRUCT_WRITE:
@@ -1743,29 +1756,35 @@ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero)
 static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 {
 	int i, count, disks = sh->disks;
+	int syndrome_disks = sh->ddf_layout ? disks : disks-2;
 	int d0_idx = raid6_d0(sh);
 	int faila = -1, failb = -1;
 	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
-	void *ptrs[disks];
+	void *ptrs[syndrome_disks+2];
 
 	count = 0;
 	i = d0_idx;
 	do {
-		int slot;
-		if (i == sh->pd_idx)
-			slot = disks-2;
-		else if (i == sh->qd_idx)
-			slot = disks-1;
-		else
-			slot = count++;
-		ptrs[slot] = page_address(sh->dev[i].page);
+		const void *dblk = sh->ddf_layout ? raid6_empty_zero_page : NULL;
+		int slot = count;
+		if (i == sh->pd_idx) {
+			slot = syndrome_disks;
+			ptrs[slot] = page_address(sh->dev[i].page);
+		} else if (i == sh->qd_idx) {
+			slot = syndrome_disks+1;
+			ptrs[slot] = page_address(sh->dev[i].page);
+		} else
+			dblk = page_address(sh->dev[i].page);
+		if (dblk)
+			ptrs[count++] = (void*)dblk;
+
 		if (i == dd_idx1)
 			faila = slot;
 		if (i == dd_idx2)
 			failb = slot;
 		i = raid6_next_disk(i, disks);
 	} while (i != d0_idx);
-	BUG_ON(count+2 != disks);
+	BUG_ON(count != syndrome_disks);
 
 	BUG_ON(faila == failb);
 	if ( failb < faila ) { int tmp = faila; faila = failb; failb = tmp; }
@@ -1774,9 +1793,9 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 		 (unsigned long long)sh->sector, dd_idx1, dd_idx2,
 		 faila, failb);
 
-	if ( failb == disks-1 ) {
+	if ( failb == syndrome_disks+1 ) {
 		/* Q disk is one of the missing disks */
-		if ( faila == disks-2 ) {
+		if ( faila == syndrome_disks ) {
 			/* Missing P+Q, just recompute */
 			compute_parity6(sh, UPDATE_PARITY);
 			return;
@@ -1791,12 +1810,13 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
 	}
 
 	/* We're missing D+P or D+D; */
-	if (failb == disks-2) {
+	if (failb == syndrome_disks) {
 		/* We're missing D+P. */
-		raid6_datap_recov(disks, STRIPE_SIZE, faila, ptrs);
+		raid6_datap_recov(syndrome_disks+2, STRIPE_SIZE, faila, ptrs);
 	} else {
 		/* We're missing D+D. */
-		raid6_2data_recov(disks, STRIPE_SIZE, faila, failb, ptrs);
+		raid6_2data_recov(syndrome_disks+2, STRIPE_SIZE, faila, failb,
+				  ptrs);
 	}
 
 	/* Both the above update both missing blocks */
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 4d43b08..3adda05 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -202,6 +202,7 @@ struct stripe_head {
 	sector_t		sector;		/* sector of this row */
 	short			pd_idx;		/* parity disk index */
 	short			qd_idx;		/* 'Q' disk index for raid6 */
+	short			ddf_layout;		/* use DDF ordering to calculate Q */
 	unsigned long		state;		/* state flags */
 	atomic_t		count;	      /* nr of active thread/requests */
 	spinlock_t		lock;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 13/18] md/raid5: refactor raid5 "run"
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (9 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 17/18] md: add ->takeover method for raid5 to be able to take over raid1 NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 11/18] md/raid5: Add support for new layouts for raid5 and raid6 NeilBrown
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

.. so that the code to create the private data structures is separate.
This will help with future code to change the level of an active
array.

Signed-off-by: NeilBrown <neilb@suse.de>
---
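
The shape of the split, in stripped-down standalone form (names and error
handling are simplified stand-ins; the kernel version returns ERR_PTR values
rather than NULL): setup_conf() owns validation and allocation of the
private data, and run() just consumes what it hands back, so a future
level-change path can call setup_conf() by itself.

  #include <stdio.h>
  #include <stdlib.h>
  #include <errno.h>

  struct conf { int raid_disks; };
  struct mddev { int raid_disks, level; };

  static struct conf *setup_conf(struct mddev *mddev)
  {
          struct conf *conf;

          if (mddev->level != 4 && mddev->level != 5 && mddev->level != 6)
                  return NULL;            /* simplified error reporting */

          conf = calloc(1, sizeof(*conf));
          if (!conf)
                  return NULL;
          conf->raid_disks = mddev->raid_disks;
          return conf;
  }

  static int run(struct mddev *mddev)
  {
          struct conf *conf = setup_conf(mddev);  /* reusable for level changes */

          if (!conf)
                  return -EINVAL;
          printf("array started with %d disks\n", conf->raid_disks);
          free(conf);
          return 0;
  }

  int main(void)
  {
          struct mddev m = { .raid_disks = 4, .level = 5 };

          return run(&m);
  }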

 drivers/md/raid5.c |  207 ++++++++++++++++++++++++++++++----------------------
 1 files changed, 119 insertions(+), 88 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f1dbfc4..ee42d1a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4149,89 +4149,36 @@ static struct attribute_group raid5_attrs_group = {
 	.attrs = raid5_attrs,
 };
 
-static int run(mddev_t *mddev)
+static raid5_conf_t *setup_conf(mddev_t *mddev)
 {
 	raid5_conf_t *conf;
 	int raid_disk, memory;
 	mdk_rdev_t *rdev;
 	struct disk_info *disk;
-	int working_disks = 0;
 
 	if (mddev->level != 5 && mddev->level != 4 && mddev->level != 6) {
 		printk(KERN_ERR "raid5: %s: raid level not set to 4/5/6 (%d)\n",
 		       mdname(mddev), mddev->level);
-		return -EIO;
+		return ERR_PTR(-EIO);
 	}
 	if ((mddev->level == 5 && !algorithm_valid_raid5(mddev->layout)) ||
 	    (mddev->level == 6 && !algorithm_valid_raid6(mddev->layout))) {
 		printk(KERN_ERR "raid5: %s: layout %d not supported\n",
 		       mdname(mddev), mddev->layout);
-		return -EIO;
+		return ERR_PTR(-EIO);
 	}
 
 	if (mddev->chunk_size < PAGE_SIZE) {
 		printk(KERN_ERR "md/raid5: chunk_size must be at least "
 		       "PAGE_SIZE but %d < %ld\n",
 		       mddev->chunk_size, PAGE_SIZE);
-		return -EINVAL;
-	}
-
-	if (mddev->reshape_position != MaxSector) {
-		/* Check that we can continue the reshape.
-		 * Currently only disks can change, it must
-		 * increase, and we must be past the point where
-		 * a stripe over-writes itself
-		 */
-		sector_t here_new, here_old;
-		int old_disks;
-		int max_degraded = (mddev->level == 5 ? 1 : 2);
-
-		if (mddev->new_level != mddev->level ||
-		    mddev->new_layout != mddev->layout ||
-		    mddev->new_chunk != mddev->chunk_size) {
-			printk(KERN_ERR "raid5: %s: unsupported reshape "
-			       "required - aborting.\n",
-			       mdname(mddev));
-			return -EINVAL;
-		}
-		if (mddev->delta_disks <= 0) {
-			printk(KERN_ERR "raid5: %s: unsupported reshape "
-			       "(reduce disks) required - aborting.\n",
-			       mdname(mddev));
-			return -EINVAL;
-		}
-		old_disks = mddev->raid_disks - mddev->delta_disks;
-		/* reshape_position must be on a new-stripe boundary, and one
-		 * further up in new geometry must map after here in old
-		 * geometry.
-		 */
-		here_new = mddev->reshape_position;
-		if (sector_div(here_new, (mddev->chunk_size>>9)*
-			       (mddev->raid_disks - max_degraded))) {
-			printk(KERN_ERR "raid5: reshape_position not "
-			       "on a stripe boundary\n");
-			return -EINVAL;
-		}
-		/* here_new is the stripe we will write to */
-		here_old = mddev->reshape_position;
-		sector_div(here_old, (mddev->chunk_size>>9)*
-			   (old_disks-max_degraded));
-		/* here_old is the first stripe that we might need to read
-		 * from */
-		if (here_new >= here_old) {
-			/* Reading from the same stripe as writing to - bad */
-			printk(KERN_ERR "raid5: reshape_position too early for "
-			       "auto-recovery - aborting.\n");
-			return -EINVAL;
-		}
-		printk(KERN_INFO "raid5: reshape will continue\n");
-		/* OK, we should be able to continue; */
+		return ERR_PTR(-EINVAL);
 	}
 
-
-	mddev->private = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
-	if ((conf = mddev->private) == NULL)
+	conf = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
+	if (conf == NULL)
 		goto abort;
+
 	if (mddev->reshape_position == MaxSector) {
 		conf->previous_raid_disks = conf->raid_disks = mddev->raid_disks;
 	} else {
@@ -4255,7 +4202,6 @@ static int run(mddev_t *mddev)
 			goto abort;
 	}
 	spin_lock_init(&conf->device_lock);
-	mddev->queue->queue_lock = &conf->device_lock;
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
@@ -4284,17 +4230,11 @@ static int run(mddev_t *mddev)
 			printk(KERN_INFO "raid5: device %s operational as raid"
 				" disk %d\n", bdevname(rdev->bdev,b),
 				raid_disk);
-			working_disks++;
 		} else
 			/* Cannot rely on bitmap to complete recovery */
 			conf->fullsync = 1;
 	}
 
-	/*
-	 * 0 for a fully functional array, 1 or 2 for a degraded array.
-	 */
-	mddev->degraded = conf->raid_disks - working_disks;
-	conf->mddev = mddev;
 	conf->chunk_size = mddev->chunk_size;
 	conf->level = mddev->level;
 	if (conf->level == 6)
@@ -4305,9 +4245,94 @@ static int run(mddev_t *mddev)
 	conf->max_nr_stripes = NR_STRIPES;
 	conf->expand_progress = mddev->reshape_position;
 
-	/* device size must be a multiple of chunk size */
-	mddev->dev_sectors &= ~(mddev->chunk_size / 512 - 1);
-	mddev->resync_max_sectors = mddev->dev_sectors;
+	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
+		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
+	if (grow_stripes(conf, conf->max_nr_stripes)) {
+		printk(KERN_ERR 
+			"raid5: couldn't allocate %dkB for buffers\n", memory);
+		shrink_stripes(conf);
+		goto abort;
+	} else
+		printk(KERN_INFO "raid5: allocated %dkB for %s\n",
+			memory, mdname(mddev));
+
+	return conf;
+
+ abort:
+	if (conf) {
+		safe_put_page(conf->spare_page);
+		kfree(conf->disks);
+		kfree(conf->stripe_hashtbl);
+		kfree(conf);
+		return ERR_PTR(-EIO);
+	} else
+		return ERR_PTR(-ENOMEM);
+}
+
+static int run(mddev_t *mddev)
+{
+	raid5_conf_t *conf;
+	int working_disks = 0;
+	mdk_rdev_t *rdev;
+
+	if (mddev->reshape_position != MaxSector) {
+		/* Check that we can continue the reshape.
+		 * Currently only disks can change, it must
+		 * increase, and we must be past the point where
+		 * a stripe over-writes itself
+		 */
+		sector_t here_new, here_old;
+		int old_disks;
+		int max_degraded = (mddev->level == 5 ? 1 : 2);
+
+		if (mddev->new_level != mddev->level ||
+		    mddev->new_layout != mddev->layout ||
+		    mddev->new_chunk != mddev->chunk_size) {
+			printk(KERN_ERR "raid5: %s: unsupported reshape "
+			       "required - aborting.\n",
+			       mdname(mddev));
+			return -EINVAL;
+		}
+		if (mddev->delta_disks <= 0) {
+			printk(KERN_ERR "raid5: %s: unsupported reshape "
+			       "(reduce disks) required - aborting.\n",
+			       mdname(mddev));
+			return -EINVAL;
+		}
+		old_disks = mddev->raid_disks - mddev->delta_disks;
+		/* reshape_position must be on a new-stripe boundary, and one
+		 * further up in new geometry must map after here in old
+		 * geometry.
+		 */
+		here_new = mddev->reshape_position;
+		if (sector_div(here_new, (mddev->chunk_size>>9)*
+			       (mddev->raid_disks - max_degraded))) {
+			printk(KERN_ERR "raid5: reshape_position not "
+			       "on a stripe boundary\n");
+			return -EINVAL;
+		}
+		/* here_new is the stripe we will write to */
+		here_old = mddev->reshape_position;
+		sector_div(here_old, (mddev->chunk_size>>9)*
+			   (old_disks-max_degraded));
+		/* here_old is the first stripe that we might need to read
+		 * from */
+		if (here_new >= here_old) {
+			/* Reading from the same stripe as writing to - bad */
+			printk(KERN_ERR "raid5: reshape_position too early for "
+			       "auto-recovery - aborting.\n");
+			return -EINVAL;
+		}
+		printk(KERN_INFO "raid5: reshape will continue\n");
+		/* OK, we should be able to continue; */
+	}
+
+	conf = setup_conf(mddev);
+
+	if (conf == NULL)
+		return -EIO;
+	if (IS_ERR(conf))
+		return PTR_ERR(conf);
 
 	if (conf->level == 6 && conf->raid_disks < 4) {
 		printk(KERN_ERR "raid6: not enough configured devices for %s (%d, minimum 4)\n",
@@ -4326,6 +4351,23 @@ static int run(mddev_t *mddev)
 		goto abort;
 	}
 
+	mddev->private = conf;
+	mddev->queue->queue_lock = &conf->device_lock;
+
+	/*
+	 * 0 for a fully functional array, 1 or 2 for a degraded array.
+	 */
+	list_for_each_entry(rdev, &mddev->disks, same_set)
+		if (rdev->raid_disk >= 0 &&
+		    test_bit(In_sync, &rdev->flags))
+			working_disks++;
+
+	mddev->degraded = conf->raid_disks - working_disks;
+
+	/* device size must be a multiple of chunk size */
+	mddev->dev_sectors &= ~(mddev->chunk_size / 512 - 1);
+	mddev->resync_max_sectors = mddev->dev_sectors;
+
 	if (mddev->degraded > 0 &&
 	    mddev->recovery_cp != MaxSector) {
 		if (mddev->ok_start_degraded)
@@ -4341,26 +4383,13 @@ static int run(mddev_t *mddev)
 		}
 	}
 
-	{
-		mddev->thread = md_register_thread(raid5d, mddev, "%s_raid5");
-		if (!mddev->thread) {
-			printk(KERN_ERR 
-				"raid5: couldn't allocate thread for %s\n",
-				mdname(mddev));
-			goto abort;
-		}
-	}
-	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
-		 conf->raid_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
-	if (grow_stripes(conf, conf->max_nr_stripes)) {
+	mddev->thread = md_register_thread(raid5d, mddev, "%s_raid5");
+	if (!mddev->thread) {
 		printk(KERN_ERR 
-			"raid5: couldn't allocate %dkB for buffers\n", memory);
-		shrink_stripes(conf);
-		md_unregister_thread(mddev->thread);
+		       "raid5: couldn't allocate thread for %s\n",
+		       mdname(mddev));
 		goto abort;
-	} else
-		printk(KERN_INFO "raid5: allocated %dkB for %s\n",
-			memory, mdname(mddev));
+	}
 
 	if (mddev->degraded == 0)
 		printk("raid5: raid level %d set %s active with %d out of %d"
@@ -4422,6 +4451,8 @@ abort:
 		kfree(conf->stripe_hashtbl);
 		kfree(conf);
 	}
+	if (mddev->thread)
+		md_unregister_thread(mddev->thread);
 	mddev->private = NULL;
 	printk(KERN_ALERT "raid5: failed to run raid set %s\n", mdname(mddev));
 	return -EIO;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 14/18] md: md_unregister_thread should cope with being passed NULL
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (11 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 11/18] md/raid5: Add support for new layouts for raid5 and raid6 NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 12/18] md/raid5: finish support for DDF/raid6 NeilBrown
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Mostly md_unregister_thread is only called when we know that the
thread is not NULL, but sometimes we need to check first.  It is safer
to put the check inside md_unregister_thread itself.
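
For example, callers can now treat it like kfree(NULL); a minimal
sketch of the caller side (the raid5.c hunk below is the real
instance):

	/* before: every caller had to guard the call itself */
	if (mddev->thread)
		md_unregister_thread(mddev->thread);

	/* after: the NULL check lives inside the helper */
	md_unregister_thread(mddev->thread);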

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c    |    2 ++
 drivers/md/raid5.c |    3 +--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d3c6020..cbe0f20 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5339,6 +5339,8 @@ mdk_thread_t *md_register_thread(void (*run) (mddev_t *), mddev_t *mddev,
 
 void md_unregister_thread(mdk_thread_t *thread)
 {
+	if (! thread)
+		return;
 	dprintk("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
 
 	kthread_stop(thread->tsk);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ee42d1a..6c33add 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4451,8 +4451,7 @@ abort:
 		kfree(conf->stripe_hashtbl);
 		kfree(conf);
 	}
-	if (mddev->thread)
-		md_unregister_thread(mddev->thread);
+	md_unregister_thread(mddev->thread);
 	mddev->private = NULL;
 	printk(KERN_ALERT "raid5: failed to run raid set %s\n", mdname(mddev));
 	return -EIO;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 15/18] md: hopefully enable suspend/resume of md devices.
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (16 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 18/18] md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5 NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  8:11 ` [PATCH 00/18] Assorted md patches headed for 2.6.30 Keld Jørn Simonsen
  2009-02-12  9:42 ` Farkas Levente
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid


---

 drivers/md/md.c           |   79 ++++++++++++++++++++++++++++++++++++---------
 include/linux/raid/md_k.h |    2 +
 2 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index cbe0f20..0e0e1ff 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -202,12 +202,68 @@ static DEFINE_SPINLOCK(all_mddevs_lock);
 		)
 
 
-static int md_fail_request(struct request_queue *q, struct bio *bio)
+/* Rather than calling directly into the personality make_request function,
+ * IO requests come here first so that we can check if the device is
+ * being suspended pending a reconfiguration.
+ * We hold a refcount over the call to ->make_request.  By the time that
+ * call has finished, the bio has been linked into some internal structure
+ * and so is visible to ->quiesce(), so we don't need the refcount any more.
+ */
+static int md_make_request(struct request_queue *q, struct bio *bio)
 {
-	bio_io_error(bio);
-	return 0;
+	mddev_t *mddev = q->queuedata;
+	int rv;
+	if (mddev == NULL || mddev->pers == NULL) {
+		bio_io_error(bio);
+		return 0;
+	}
+	rcu_read_lock();
+	if (mddev->suspended) {
+		DEFINE_WAIT(__wait);
+		for(;;) {
+			prepare_to_wait(&mddev->sb_wait, &__wait,
+					TASK_UNINTERRUPTIBLE);
+			if (!mddev->suspended)
+				break;
+			rcu_read_unlock();
+			schedule();
+			rcu_read_lock();
+		}
+		finish_wait(&mddev->sb_wait, &__wait);
+	}
+	atomic_inc(&mddev->active_io);
+	rcu_read_unlock();
+	rv = mddev->pers->make_request(q, bio);
+	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
+		wake_up(&mddev->sb_wait);
+	
+	return rv;
 }
 
+static void mddev_suspend(mddev_t *mddev)
+{
+	BUG_ON(mddev->suspended);
+	mddev->suspended = 1;
+	synchronize_rcu();
+	wait_event(mddev->sb_wait, atomic_read(&mddev->active_io) == 0);
+	mddev->pers->quiesce(mddev, 1);
+	md_unregister_thread(mddev->thread);
+	mddev->thread = NULL;
+	/* we now know that no code is executing in the personality module,
+	 * except possibly the tail end of a ->bi_end_io function, but that
+	 * is certain to complete before the module has a chance to get
+	 * unloaded
+	 */
+}
+
+static void mddev_resume(mddev_t *mddev)
+{
+	mddev->suspended = 0;
+	wake_up(&mddev->sb_wait);
+	mddev->pers->quiesce(mddev, 0);
+}
+
+
 static inline mddev_t *mddev_get(mddev_t *mddev)
 {
 	atomic_inc(&mddev->active);
@@ -315,6 +371,7 @@ static mddev_t * mddev_find(dev_t unit)
 	init_timer(&new->safemode_timer);
 	atomic_set(&new->active, 1);
 	atomic_set(&new->openers, 0);
+	atomic_set(&new->active_io, 0);
 	spin_lock_init(&new->write_lock);
 	init_waitqueue_head(&new->sb_wait);
 	init_waitqueue_head(&new->recovery_wait);
@@ -3597,10 +3654,12 @@ static int md_alloc(dev_t dev, char *name)
 		mddev_put(mddev);
 		return -ENOMEM;
 	}
+	mddev->queue->queuedata = mddev;
+
 	/* Can be unlocked because the queue is new: no concurrency */
 	queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, mddev->queue);
 
-	blk_queue_make_request(mddev->queue, md_fail_request);
+	blk_queue_make_request(mddev->queue, md_make_request);
 
 	disk = alloc_disk(1 << shift);
 	if (!disk) {
@@ -3896,16 +3955,6 @@ static int do_md_run(mddev_t * mddev)
 
 	set_capacity(disk, mddev->array_sectors);
 
-	/* If we call blk_queue_make_request here, it will
-	 * re-initialise max_sectors etc which may have been
-	 * refined inside -> run.  So just set the bits we need to set.
-	 * Most initialisation happended when we called
-	 * blk_queue_make_request(..., md_fail_request)
-	 * earlier.
-	 */
-	mddev->queue->queuedata = mddev;
-	mddev->queue->make_request_fn = mddev->pers->make_request;
-
 	/* If there is a partially-recovered drive we need to
 	 * start recovery here.  If we leave it to md_check_recovery,
 	 * it will remove the drives and not do the right thing
@@ -4035,7 +4084,7 @@ static int do_md_stop(mddev_t * mddev, int mode, int is_open)
 			md_super_wait(mddev);
 			if (mddev->ro)
 				set_disk_ro(disk, 0);
-			blk_queue_make_request(mddev->queue, md_fail_request);
+
 			mddev->pers->stop(mddev);
 			mddev->queue->merge_bvec_fn = NULL;
 			mddev->queue->unplug_fn = NULL;
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index 9598155..a815bab 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -132,6 +132,8 @@ struct mddev_s
 #define MD_CHANGE_CLEAN 1	/* transition to or from 'clean' */
 #define MD_CHANGE_PENDING 2	/* superblock update in progress */
 
+	int				suspended;
+	atomic_t			active_io;
 	int				ro;
 
 	struct gendisk			*gendisk;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 17/18] md: add ->takeover method for raid5 to be able to take over raid1
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (8 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 13/18] md/raid5: refactor raid5 "run" NeilBrown
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

The RAID1 must have exactly two drives, and its size must be a
multiple of a chunk size that isn't too small.
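
As a sketch of that rule (the helper name and the 4K stripe size are
illustrative, not taken from the md code): start at 64K and halve the
chunk until it evenly divides the array, giving up if it would fall
below one stripe.

	static int pick_chunk_sectors(unsigned long long array_sectors)
	{
		int chunksect = 64 * 2;		/* 64K in 512-byte sectors */

		/* halve until the chunk size divides the array size exactly */
		while (chunksect && (array_sectors & (chunksect - 1)))
			chunksect >>= 1;

		if ((chunksect << 9) < 4096)	/* assume a 4K STRIPE_SIZE */
			return -1;		/* no usable chunk size exists */

		return chunksect;
	}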
---

 drivers/md/raid5.c |   72 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 89ce65d..8847c14 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4151,7 +4151,8 @@ static struct attribute_group raid5_attrs_group = {
 	.attrs = raid5_attrs,
 };
 
-static raid5_conf_t *setup_conf(mddev_t *mddev, int raid_disks, int level, int layout)
+static raid5_conf_t *setup_conf(mddev_t *mddev, int raid_disks,
+				int level, int layout, int chunksize)
 {
 	raid5_conf_t *conf;
 	int raid_disk, memory;
@@ -4170,7 +4171,7 @@ static raid5_conf_t *setup_conf(mddev_t *mddev, int raid_disks, int level, int l
 		return ERR_PTR(-EIO);
 	}
 
-	if (mddev->chunk_size < PAGE_SIZE) {
+	if (chunksize < PAGE_SIZE) {
 		printk(KERN_ERR "md/raid5: chunk_size must be at least "
 		       "PAGE_SIZE but %d < %ld\n",
 		       mddev->chunk_size, PAGE_SIZE);
@@ -4237,7 +4238,7 @@ static raid5_conf_t *setup_conf(mddev_t *mddev, int raid_disks, int level, int l
 			conf->fullsync = 1;
 	}
 
-	conf->chunk_size = mddev->chunk_size;
+	conf->chunk_size = chunksize;
 	conf->level = level;
 	if (conf->level == 6)
 		conf->max_degraded = 2;
@@ -4330,11 +4331,13 @@ static int run(mddev_t *mddev)
 	}
 
 	if (mddev->private == NULL)
-		conf = setup_conf(mddev, mddev->raid_disks, mddev->level, mddev->layout);
+		conf = setup_conf(mddev, mddev->raid_disks, mddev->level,
+				  mddev->layout, mddev->chunk_size);
 	else {
 		conf = mddev->private;
 		mddev->raid_disks = conf->raid_disks;
 		mddev->layout = conf->algorithm;
+		mddev->chunk_size = conf->chunk_size;
 	}
 
 	if (IS_ERR(conf))
@@ -4869,6 +4872,63 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	}
 }
 
+
+static void *raid5_takeover_raid1(mddev_t *mddev)
+{
+	int chunksect;
+	raid5_conf_t *conf;
+
+	if (mddev->raid_disks != 2 ||
+	    mddev->degraded > 1)
+		return ERR_PTR(-EINVAL);
+
+	/* Should check if there are write-behind devices? */
+
+	chunksect = 64*2; /* 64K by default */
+
+	/* The array must be an exact multiple of chunksize */
+	while (chunksect && (mddev->array_sectors & (chunksect-1)))
+		chunksect >>= 1;
+
+	if ((chunksect<<9) < STRIPE_SIZE)
+		/* array size does not allow a suitable chunk size */
+		return ERR_PTR(-EINVAL);
+
+	conf = setup_conf(mddev, 2, 5, ALGORITHM_LEFT_SYMMETRIC,
+			  chunksect << 9);
+	if (IS_ERR(conf))
+		return conf;
+
+	conf->thread = md_register_thread(raid5d, mddev, "%s_raid5");
+	if (conf->thread)
+		return conf;
+
+	safe_put_page(conf->spare_page);
+	kfree(conf->disks);
+	kfree(conf->stripe_hashtbl);
+	kfree(conf);
+
+	return ERR_PTR(-ENOMEM);
+
+}
+
+static void *raid5_takeover(mddev_t *mddev)
+{
+	/* raid5 can take over:
+	 *  raid0 - if all devices are the same - make it a raid4 layout
+	 *  raid1 - if there are two drives.  We need to know the chunk size
+	 *  raid4 - trivial - just use a raid4 layout.
+	 *  raid6 - Providing it is a *_6 layout
+	 *
+	 * For now, just do raid1
+	 */
+
+	if (mddev->level == 1)
+		return raid5_takeover_raid1(mddev);
+
+	return ERR_PTR(-EINVAL);
+}
+
 static struct mdk_personality raid5_personality;
 
 static void *raid6_takeover(mddev_t *mddev)
@@ -4911,7 +4971,8 @@ static void *raid6_takeover(mddev_t *mddev)
 	default:
 		return ERR_PTR(-EINVAL);
 	}
-	conf = setup_conf(mddev, mddev->raid_disks + 1, 6, new_layout);
+	conf = setup_conf(mddev, mddev->raid_disks + 1, 6, new_layout,
+			  mddev->chunk_size);
 	if (IS_ERR(conf))
 		return conf;
 
@@ -4970,6 +5031,7 @@ static struct mdk_personality raid5_personality =
 	.start_reshape  = raid5_start_reshape,
 #endif
 	.quiesce	= raid5_quiesce,
+	.takeover	= raid5_takeover,
 };
 
 static struct mdk_personality raid4_personality =



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 16/18] md: add ->takeover method to support changing the personality managing an array
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (14 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 10/18] md/raid5: simplify raid5_compute_sector interface NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 18/18] md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5 NeilBrown
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
new RAID6 will use a layout which places Q on the last device, and
that device will be missing.
If there are any available spares, one will immediately have Q
recovered onto it.
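
The core of the new level_store() path, restated as a sketch (locking,
error handling and module reference counting are omitted; the md.c hunk
below is the real sequence):

	priv = pers->takeover(mddev);	/* new personality builds its conf */
	if (IS_ERR(priv))
		return PTR_ERR(priv);	/* it declined the array */

	mddev_suspend(mddev);		/* block new IO, park the old thread */
	mddev->pers->stop(mddev);	/* tear down the old personality */
	mddev->pers = pers;		/* install the new personality ... */
	mddev->private = priv;		/* ... and its private data */
	pers->run(mddev);		/* bring the array back up */
	mddev_resume(mddev);		/* let IO flow again */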

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c            |   92 ++++++++++++++++++++++++++++++++++----
 drivers/md/raid5.c         |  106 +++++++++++++++++++++++++++++++++++++-------
 include/linux/raid/md_k.h  |   10 ++++
 include/linux/raid/raid5.h |    5 ++
 4 files changed, 186 insertions(+), 27 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0e0e1ff..bd003d7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2614,18 +2614,92 @@ level_show(mddev_t *mddev, char *page)
 static ssize_t
 level_store(mddev_t *mddev, const char *buf, size_t len)
 {
+	char level[16];
 	ssize_t rv = len;
-	if (mddev->pers)
+	struct mdk_personality *pers;
+	void *priv;
+
+	if (mddev->pers == NULL) {
+		if (len == 0)
+			return 0;
+		if (len >= sizeof(mddev->clevel))
+			return -ENOSPC;
+		strncpy(mddev->clevel, buf, len);
+		if (mddev->clevel[len-1] == '\n')
+			len--;
+		mddev->clevel[len] = 0;
+		mddev->level = LEVEL_NONE;
+		return rv;
+	}
+
+	/* request to change the personality.  Need to ensure:
+	 *  - array is not engaged in resync/recovery/reshape
+	 *  - old personality can be suspended
+	 *  - new personality will access other array.
+	 */
+
+	if (mddev->sync_thread || mddev->reshape_position != MaxSector)
 		return -EBUSY;
-	if (len == 0)
-		return 0;
-	if (len >= sizeof(mddev->clevel))
-		return -ENOSPC;
-	strncpy(mddev->clevel, buf, len);
-	if (mddev->clevel[len-1] == '\n')
+
+	if (!mddev->pers->quiesce) {
+		printk(KERN_WARNING "md: %s: %s does not support online personality change\n",
+		       mdname(mddev), mddev->pers->name);
+		return -EINVAL;
+	}
+
+	/* Now find the new personality */
+	if (len == 0 || len >= sizeof(level))
+		return -EINVAL;
+	strncpy(level, buf, len);
+	if (level[len-1] == '\n')
 		len--;
-	mddev->clevel[len] = 0;
-	mddev->level = LEVEL_NONE;
+	level[len] = 0;
+
+	request_module("md-%s", level);
+	spin_lock(&pers_lock);
+	pers = find_pers(LEVEL_NONE, level);
+	if (!pers || !try_module_get(pers->owner)) {
+		spin_unlock(&pers_lock);
+		printk(KERN_WARNING "md: personality %s not loaded\n", level);
+		return -EINVAL;
+	}
+	spin_unlock(&pers_lock);
+
+	if (pers == mddev->pers) {
+		/* Nothing to do! */
+		module_put(pers->owner);
+		return rv;
+	}
+	if (!pers->takeover) {
+		module_put(pers->owner);
+		printk(KERN_WARNING "md: %s: %s does not support personality takeover\n",
+		       mdname(mddev), level);
+		return -EINVAL;
+	}
+
+	priv = pers->takeover(mddev);
+	if (IS_ERR(priv)) {
+		module_put(pers->owner);
+		printk(KERN_WARNING "md: %s: %s would not accept array\n",
+		       mdname(mddev), level);
+		return PTR_ERR(priv);
+	}
+
+	/* Looks like we have a winner */
+	mddev_suspend(mddev);
+	mddev->pers->stop(mddev);
+	module_put(mddev->pers->owner);
+	mddev->pers = pers;
+	mddev->private = priv;
+	strlcpy(mddev->clevel, pers->name, sizeof(mddev->clevel));
+	mddev->level = pers->level;
+	mddev->new_level = pers->level;
+	mddev->new_layout = mddev->layout;
+	mddev->new_chunk = mddev->chunk_size;
+	pers->run(mddev);
+	mddev_resume(mddev);
+	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+	md_wakeup_thread(mddev->thread);
 	return rv;
 }
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6c33add..89ce65d 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -912,8 +912,10 @@ static int grow_stripes(raid5_conf_t *conf, int num)
 	struct kmem_cache *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name[0], "raid5-%s", mdname(conf->mddev));
-	sprintf(conf->cache_name[1], "raid5-%s-alt", mdname(conf->mddev));
+	sprintf(conf->cache_name[0],
+		"raid%d-%s", conf->level, mdname(conf->mddev));
+	sprintf(conf->cache_name[1],
+		"raid%d-%s-alt", conf->level, mdname(conf->mddev));
 	conf->active_name = 0;
 	sc = kmem_cache_create(conf->cache_name[conf->active_name],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
@@ -4149,22 +4151,22 @@ static struct attribute_group raid5_attrs_group = {
 	.attrs = raid5_attrs,
 };
 
-static raid5_conf_t *setup_conf(mddev_t *mddev)
+static raid5_conf_t *setup_conf(mddev_t *mddev, int raid_disks, int level, int layout)
 {
 	raid5_conf_t *conf;
 	int raid_disk, memory;
 	mdk_rdev_t *rdev;
 	struct disk_info *disk;
 
-	if (mddev->level != 5 && mddev->level != 4 && mddev->level != 6) {
+	if (level != 5 && level != 4 && level != 6) {
 		printk(KERN_ERR "raid5: %s: raid level not set to 4/5/6 (%d)\n",
-		       mdname(mddev), mddev->level);
+		       mdname(mddev), level);
 		return ERR_PTR(-EIO);
 	}
-	if ((mddev->level == 5 && !algorithm_valid_raid5(mddev->layout)) ||
-	    (mddev->level == 6 && !algorithm_valid_raid6(mddev->layout))) {
+	if ((level == 5 && !algorithm_valid_raid5(layout)) ||
+	    (level == 6 && !algorithm_valid_raid6(layout))) {
 		printk(KERN_ERR "raid5: %s: layout %d not supported\n",
-		       mdname(mddev), mddev->layout);
+		       mdname(mddev), layout);
 		return ERR_PTR(-EIO);
 	}
 
@@ -4180,10 +4182,10 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 		goto abort;
 
 	if (mddev->reshape_position == MaxSector) {
-		conf->previous_raid_disks = conf->raid_disks = mddev->raid_disks;
+		conf->previous_raid_disks = conf->raid_disks = raid_disks;
 	} else {
-		conf->raid_disks = mddev->raid_disks;
-		conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks;
+		conf->raid_disks = raid_disks;
+		conf->previous_raid_disks = raid_disks - mddev->delta_disks;
 	}
 
 	conf->disks = kzalloc(conf->raid_disks * sizeof(struct disk_info),
@@ -4196,7 +4198,7 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 	if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
 		goto abort;
 
-	if (mddev->level == 6) {
+	if (level == 6) {
 		conf->spare_page = alloc_page(GFP_KERNEL);
 		if (!conf->spare_page)
 			goto abort;
@@ -4236,12 +4238,12 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 	}
 
 	conf->chunk_size = mddev->chunk_size;
-	conf->level = mddev->level;
+	conf->level = level;
 	if (conf->level == 6)
 		conf->max_degraded = 2;
 	else
 		conf->max_degraded = 1;
-	conf->algorithm = mddev->layout;
+	conf->algorithm = layout;
 	conf->max_nr_stripes = NR_STRIPES;
 	conf->expand_progress = mddev->reshape_position;
 
@@ -4327,10 +4329,14 @@ static int run(mddev_t *mddev)
 		/* OK, we should be able to continue; */
 	}
 
-	conf = setup_conf(mddev);
+	if (mddev->private == NULL)
+		conf = setup_conf(mddev, mddev->raid_disks, mddev->level, mddev->layout);
+	else {
+		conf = mddev->private;
+		mddev->raid_disks = conf->raid_disks;
+		mddev->layout = conf->algorithm;
+	}
 
-	if (conf == NULL)
-		return -EIO;
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
@@ -4383,7 +4389,11 @@ static int run(mddev_t *mddev)
 		}
 	}
 
-	mddev->thread = md_register_thread(raid5d, mddev, "%s_raid5");
+	if (conf->thread) {
+		mddev->thread = conf->thread;
+		conf->thread = NULL;
+	} else
+		mddev->thread = md_register_thread(raid5d, mddev, "%s_raid5");
 	if (!mddev->thread) {
 		printk(KERN_ERR 
 		       "raid5: couldn't allocate thread for %s\n",
@@ -4859,6 +4869,65 @@ static void raid5_quiesce(mddev_t *mddev, int state)
 	}
 }
 
+static struct mdk_personality raid5_personality;
+
+static void *raid6_takeover(mddev_t *mddev)
+{
+	/* Currently can only take over a raid5.  We map the
+	 * personality to an equivalent raid6 personality
+	 * with the Q block at the end.
+	 */
+	int new_layout;
+	raid5_conf_t *conf;
+
+	if (mddev->pers != &raid5_personality)
+		return ERR_PTR(-EINVAL);
+	if (mddev->degraded > 1)
+		return ERR_PTR(-EINVAL);
+	if (mddev->raid_disks > 253)
+		return ERR_PTR(-EINVAL);
+	if (mddev->raid_disks < 3)
+		return ERR_PTR(-EINVAL);
+
+	switch(mddev->layout) {
+	case ALGORITHM_LEFT_ASYMMETRIC:
+		new_layout = ALGORITHM_LEFT_ASYMMETRIC_6;
+		break;
+	case ALGORITHM_RIGHT_ASYMMETRIC:
+		new_layout = ALGORITHM_RIGHT_ASYMMETRIC_6;
+		break;
+	case ALGORITHM_LEFT_SYMMETRIC:
+		new_layout = ALGORITHM_LEFT_SYMMETRIC_6;
+		break;
+	case ALGORITHM_RIGHT_SYMMETRIC:
+		new_layout = ALGORITHM_RIGHT_SYMMETRIC_6;
+		break;
+	case ALGORITHM_PARITY_0:
+		new_layout = ALGORITHM_PARITY_0_6;
+		break;
+	case ALGORITHM_PARITY_N:
+		new_layout = ALGORITHM_PARITY_N;
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+	conf = setup_conf(mddev, mddev->raid_disks + 1, 6, new_layout);
+	if (IS_ERR(conf))
+		return conf;
+
+	conf->thread = md_register_thread(raid5d, mddev, "%s_raid5");
+	if (conf->thread)
+		return conf;
+
+	safe_put_page(conf->spare_page);
+	kfree(conf->disks);
+	kfree(conf->stripe_hashtbl);
+	kfree(conf);
+
+	return ERR_PTR(-ENOMEM);
+}
+
+
 static struct mdk_personality raid6_personality =
 {
 	.name		= "raid6",
@@ -4879,6 +4948,7 @@ static struct mdk_personality raid6_personality =
 	.start_reshape  = raid5_start_reshape,
 #endif
 	.quiesce	= raid5_quiesce,
+	.takeover	= raid6_takeover,
 };
 static struct mdk_personality raid5_personality =
 {
diff --git a/include/linux/raid/md_k.h b/include/linux/raid/md_k.h
index a815bab..3755045 100644
--- a/include/linux/raid/md_k.h
+++ b/include/linux/raid/md_k.h
@@ -334,6 +334,16 @@ struct mdk_personality
 	 * others - reserved
 	 */
 	void (*quiesce) (mddev_t *mddev, int state);
+	/* takeover is used to transition an array from one
+	 * personality to another.  The new personality must be able
+	 * to handle the data in the current layout.
+	 * e.g. 2drive raid1 -> 2drive raid5
+	 *      ndrive raid5 -> degraded n+1drive raid6 with special layout
+	 * If the takeover succeeds, a new 'private' structure is returned.
+	 * This needs to be installed and then ->quiesce used to activate the
+	 * array.
+	 */
+	void *(*takeover) (mddev_t *mddev);
 };
 
 
diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h
index 3adda05..4894cd5 100644
--- a/include/linux/raid/raid5.h
+++ b/include/linux/raid/raid5.h
@@ -387,6 +387,11 @@ struct raid5_private_data {
 	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
 	struct disk_info	*disks;
+
+	/* When taking over an array from a different personality, we store
+	 * the new thread here until we fully activate the array.
+	 */
+	struct mdk_thread_s		*thread;
 };
 
 typedef struct raid5_private_data raid5_conf_t;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 18/18] md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5.
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (15 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 16/18] md: add ->takeover method to support changing the personality managing an array NeilBrown
@ 2009-02-12  3:10 ` NeilBrown
  2009-02-12  3:10 ` [PATCH 15/18] md: hopefully enable suspend/resume of md devices NeilBrown
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12  3:10 UTC (permalink / raw)
  To: linux-raid

2-drive raid5s aren't very interesting.  But if you are converting
a raid1 into a raid5, you will at least temporarily have one.  And
that is a good time to set the layout/chunksize for the new RAID5
if you aren't happy with the defaults.

layout and chunksize don't actually affect the placement of data
on a 2-drive raid5 (each stripe holds one data block plus a parity
block that is identical to it, so both drives carry the same bytes
whatever the layout or chunk size), so we just do some internal
book-keeping.
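
The checks the new ->reconfig() hook applies to a requested chunk size,
restated as a sketch (PAGE_SIZE is assumed to be 4096 here):

	static int chunk_size_ok(unsigned long long array_sectors, int new_chunk)
	{
		if (new_chunk & (new_chunk - 1))
			return 0;	/* must be a power of two */
		if (new_chunk < 4096)
			return 0;	/* must be at least a page */
		if (array_sectors & ((new_chunk >> 9) - 1))
			return 0;	/* must divide the array size */
		return 1;
	}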

Signed-off-by: NeilBrown <neilb@suse.de>
---

 drivers/md/md.c    |   31 +++++++++++++++++++++----------
 drivers/md/raid5.c |   40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index bd003d7..6fc07fe 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2727,12 +2727,19 @@ layout_store(mddev_t *mddev, const char *buf, size_t len)
 	if (!*buf || (*e && *e != '\n'))
 		return -EINVAL;
 
-	if (mddev->pers)
-		return -EBUSY;
-	if (mddev->reshape_position != MaxSector)
-		mddev->new_layout = n;
-	else
-		mddev->layout = n;
+	if (mddev->pers) {
+		int err;
+		if (mddev->pers->reconfig == NULL)
+			return -EBUSY;
+		err = mddev->pers->reconfig(mddev, n, -1);
+		if (err)
+			return err;
+	} else {
+		if (mddev->reshape_position != MaxSector)
+			mddev->new_layout = n;
+		else
+			mddev->layout = n;
+	}
 	return len;
 }
 static struct md_sysfs_entry md_layout =
@@ -2789,16 +2796,20 @@ chunk_size_show(mddev_t *mddev, char *page)
 static ssize_t
 chunk_size_store(mddev_t *mddev, const char *buf, size_t len)
 {
-	/* can only set chunk_size if array is not yet active */
 	char *e;
 	unsigned long n = simple_strtoul(buf, &e, 10);
 
 	if (!*buf || (*e && *e != '\n'))
 		return -EINVAL;
 
-	if (mddev->pers)
-		return -EBUSY;
-	else if (mddev->reshape_position != MaxSector)
+	if (mddev->pers) {
+		int err;
+		if (mddev->pers->reconfig == NULL)
+			return -EBUSY;
+		err = mddev->pers->reconfig(mddev, -1, n);
+		if (err)
+			return err;
+	} else if (mddev->reshape_position != MaxSector)
 		mddev->new_chunk = n;
 	else
 		mddev->chunk_size = n;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8847c14..1546e9b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4912,6 +4912,45 @@ static void *raid5_takeover_raid1(mddev_t *mddev)
 
 }
 
+static int raid5_reconfig(mddev_t *mddev, int new_layout, int new_chunk)
+{
+	/* Currently the layout and chunk size can only be changed
+	 * for a 2-drive raid array, as in that case no data shuffling
+	 * is required.
+	 * Later we might validate these and set new_* so a reshape
+	 * can complete the change.
+	 */
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+
+	if (new_layout >= 0 && !algorithm_valid_raid5(new_layout))
+		return -EINVAL;
+	if (new_chunk > 0) {
+		if (new_chunk & (new_chunk-1))
+			/* not a power of 2 */
+			return -EINVAL;
+		if (new_chunk < PAGE_SIZE)
+			return -EINVAL;
+		if (mddev->array_sectors & ((new_chunk>>9)-1))
+			/* not factor of array size */
+			return -EINVAL;
+	}
+
+	/* They look valid */
+
+	if (mddev->raid_disks != 2)
+		return -EINVAL;
+
+	if (new_layout >= 0) {
+		conf->algorithm = new_layout;
+		mddev->layout = mddev->new_layout = new_layout;
+	}
+	if (new_chunk > 0) {
+		conf->chunk_size = new_chunk;
+		mddev->chunk_size = mddev->new_chunk = new_chunk;
+	}
+	return 0;
+}
+
 static void *raid5_takeover(mddev_t *mddev)
 {
 	/* raid5 can take over:
@@ -5032,6 +5071,7 @@ static struct mdk_personality raid5_personality =
 #endif
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
+	.reconfig	= raid5_reconfig,
 };
 
 static struct mdk_personality raid4_personality =



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (17 preceding siblings ...)
  2009-02-12  3:10 ` [PATCH 15/18] md: hopefully enable suspend/resume of md devices NeilBrown
@ 2009-02-12  8:11 ` Keld Jørn Simonsen
  2009-02-12  9:13   ` Steve Fairbairn
  2009-02-12  9:21   ` NeilBrown
  2009-02-12  9:42 ` Farkas Levente
  19 siblings, 2 replies; 50+ messages in thread
From: Keld Jørn Simonsen @ 2009-02-12  8:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
> Hi,
>  following is my current patch queue for 2.6.30, in case anyone would
> like to review or otherwise comment.
> They should show up in -next shortly.
> 
> Probably the most interesting are the last few which provide support
> for converting a raid1 into a raid5, and a raid5 into a raid6.
> I plan to do some more work here so the code might change a bit before
> final submission, as I work out how best ot factor the code.
> 
> mdadm doesn't current support these conversions, but you can
> simply
>    echo raid5 > /sys/block/md0/md/level
> to change a 2-drive raid1 into a raid5.  Similarly for 5->6
> 
> The raid6 array created will have a somewhat unusual layout in that
> all the Q blocks will be on the last drive.  Later I'll create
> functionality to restripe the array so that the Q block is rotated
> around all the drives as you would expect.
> 
> Comments and testing very welcome.

I would rather have functionality to convert raid10 to raid5.
raid1 should be deprecated, as raid10,n2 for all purposes is the same
but better implementation and performance, and raid10,f2 and raid10,o2
are even better.  Nobody should use raid1 anymore.

Best regards
keld

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  8:11 ` [PATCH 00/18] Assorted md patches headed for 2.6.30 Keld Jørn Simonsen
@ 2009-02-12  9:13   ` Steve Fairbairn
  2009-02-12  9:46     ` Keld Jørn Simonsen
                       ` (2 more replies)
  2009-02-12  9:21   ` NeilBrown
  1 sibling, 3 replies; 50+ messages in thread
From: Steve Fairbairn @ 2009-02-12  9:13 UTC (permalink / raw)
  To: linux-raid

Keld Jørn Simonsen wrote:
> 
> I would rather have functionality to convert raid10 to raid5.
> raid1 should be depreciated, as raid10,n2 for all purposes is the same
> but better implementation and performance, and raid10,f2 and raid10,o2
> are even better.  Nobody should use raid1 anymore.
> 
Complete ignorance of raid10 here, but is raid10,<anything> bootable, 
like raid1 is?  I use raid1 on my root and boot partitions.

Steve.

PS.  Apologies to Keld for sending this directly to him first time.  I 
never remember that this group is reply to sender, not reply to group.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  8:11 ` [PATCH 00/18] Assorted md patches headed for 2.6.30 Keld Jørn Simonsen
  2009-02-12  9:13   ` Steve Fairbairn
@ 2009-02-12  9:21   ` NeilBrown
  2009-02-12  9:53     ` Keld Jørn Simonsen
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2009-02-12  9:21 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On Thu, February 12, 2009 7:11 pm, Keld Jørn Simonsen wrote:
> On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
>> Comments and testing very welcome.
>
> I would rather have functionality to convert raid10 to raid5.
> raid1 should be depreciated, as raid10,n2 for all purposes is the same
> but better implementation and performance, and raid10,f2 and raid10,o2
> are even better.  Nobody should use raid1 anymore.

That is a fairly simplistic view.
raid1 supports --write-mostly and --write-behind which raid10 is unlikely
ever to support.
Certainly in many cases raid10 is just as good or better than raid1
though.

Certainly a raid10->raid5 conversion for a 2-drive n2 configuration
is trivial to arrange.  Other conversions are less likely to be supported
as they require significant non-trivial rearrangement of data.

Thanks for the comment.

NeilBrown


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
                   ` (18 preceding siblings ...)
  2009-02-12  8:11 ` [PATCH 00/18] Assorted md patches headed for 2.6.30 Keld Jørn Simonsen
@ 2009-02-12  9:42 ` Farkas Levente
  2009-02-12 10:40   ` NeilBrown
  19 siblings, 1 reply; 50+ messages in thread
From: Farkas Levente @ 2009-02-12  9:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown wrote:
> Hi,
>  following is my current patch queue for 2.6.30, in case anyone would
> like to review or otherwise comment.
> They should show up in -next shortly.
> 
> Probably the most interesting are the last few which provide support
> for converting a raid1 into a raid5, and a raid5 into a raid6.
> I plan to do some more work here so the code might change a bit before
> final submission, as I work out how best ot factor the code.
> 
> mdadm doesn't current support these conversions, but you can
> simply
>    echo raid5 > /sys/block/md0/md/level
> to change a 2-drive raid1 into a raid5.  Similarly for 5->6

any plan for non-raid to raid1 or anything else like in windows on can
convert a normal partition into a mirrored one online.

-- 
  Levente                               "Si vis pacem para bellum!"

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:13   ` Steve Fairbairn
@ 2009-02-12  9:46     ` Keld Jørn Simonsen
  2009-02-12 10:52       ` NeilBrown
  2009-02-12 10:53       ` Julian Cowley
  2009-02-12 22:57     ` Dan Williams
  2009-02-13 16:56     ` Bill Davidsen
  2 siblings, 2 replies; 50+ messages in thread
From: Keld Jørn Simonsen @ 2009-02-12  9:46 UTC (permalink / raw)
  To: Steve Fairbairn; +Cc: linux-raid

On Thu, Feb 12, 2009 at 09:13:17AM +0000, Steve Fairbairn wrote:
> Keld Jørn Simonsen wrote:
> >
> >I would rather have functionality to convert raid10 to raid5.
> >raid1 should be depreciated, as raid10,n2 for all purposes is the same
> >but better implementation and performance, and raid10,f2 and raid10,o2
> >are even better.  Nobody should use raid1 anymore.
> >
> Complete ignorance of raid10 here, but is raid10,<anything> bootable, 
> like raid1 is?  I use raid1 on my root and boot partitions.

AFAIK, raid10,n2 in default mode (superblock etc) is bootable, as it
looks like two copies of a normal FS. I think this was even reported 
on this list at some time. 

You are not the only one that does not know much about raid10. I think
most Linux administrators don't.  And other system administrators most
likely don't either.

Maybe we should rename raid10 to raid1?

Raid10 should just be an enhanced raid1. And I understand from Neil that
raid10,o2 is mostly done because it is a standard raid1 layout. So it is
strange that it is not available with raid1 in Linux. And I also think
that the raid10,f2 layout is available from some HW raid controllers, as
their implementation of raid1. So everything that is in raid10 is elsewhere
considered raid1 stuff.

Or we could merge the two drivers. 

I think we will never get people in general to know about raid10 as well
as they know about raid1, not even by far. 

Best regards
keld

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:21   ` NeilBrown
@ 2009-02-12  9:53     ` Keld Jørn Simonsen
  2009-02-12 10:45       ` NeilBrown
  0 siblings, 1 reply; 50+ messages in thread
From: Keld Jørn Simonsen @ 2009-02-12  9:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Feb 12, 2009 at 08:21:12PM +1100, NeilBrown wrote:
> On Thu, February 12, 2009 7:11 pm, Keld Jørn Simonsen wrote:
> > On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
> >> Comments and testing very welcome.
> >
> > I would rather have functionality to convert raid10 to raid5.
> > raid1 should be depreciated, as raid10,n2 for all purposes is the same
> > but better implementation and performance, and raid10,f2 and raid10,o2
> > are even better.  Nobody should use raid1 anymore.
> 
> That is a fairly simplistic view.

It was also formulated to provoke some thoughts.

> raid1 supports --write-mostly and --write-behind which raid10 is unlikely
> ever to support.

why?

Anyway, would it not be possible for this functionality to be implemented
for raid10,n2?

> Certainly in many cases raid10 is just as good or better than raid1
> though.
> 
> Certainly a raid10->raid5 conversion for a 2-drive n2 configuration
> is trivial to arrange.  Other conversions are less likely to be supported
> as they require significant non-trivial rearrangement of data.

Yes, possibly.


Some code to grow raid10 would also be desirable. Maybe it is some of
the same operations that need to be applied: getting the old data in,
having it restructured for the new format, in a safe way, and possibly
with the help of an extra disk, or possibly not. It sounds non-trivial
to me too.

Best regards
keld

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:42 ` Farkas Levente
@ 2009-02-12 10:40   ` NeilBrown
  2009-02-12 11:17     ` Farkas Levente
  0 siblings, 1 reply; 50+ messages in thread
From: NeilBrown @ 2009-02-12 10:40 UTC (permalink / raw)
  To: Farkas Levente; +Cc: linux-raid

On Thu, February 12, 2009 8:42 pm, Farkas Levente wrote:
> NeilBrown wrote:
>> Hi,
>>  following is my current patch queue for 2.6.30, in case anyone would
>> like to review or otherwise comment.
>> They should show up in -next shortly.
>>
>> Probably the most interesting are the last few which provide support
>> for converting a raid1 into a raid5, and a raid5 into a raid6.
>> I plan to do some more work here so the code might change a bit before
>> final submission, as I work out how best ot factor the code.
>>
>> mdadm doesn't current support these conversions, but you can
>> simply
>>    echo raid5 > /sys/block/md0/md/level
>> to change a 2-drive raid1 into a raid5.  Similarly for 5->6
>
> any plan for non-raid to raid1 or anything else like in windows on can
> convert a normal partition into a mirrored one online.

No plan exactly, but I do think about it from time to time.

There are two problems with this, and solving just one of them
doesn't help you much.  So you really have to solve both at once,
which reduces the motivation towards either ....

One problem is the task of changing the implementation of the device
underneath the filesystem without the filesystem needing to care.

i.e. the filesystem opens block device 8,1 (/dev/sda1) and starts doing
IO, then mdadm steps in and magically changes things so that /dev/sda1
is now on a raid1 array which happens to access the same data, but
through a different code path.
Figuring out exactly which data structure to install the redirection in,
and how to do it in a way that is guaranteed to be safe, is non-trivial.

dm has a mechanism to change the implementation under a given dm
device, and md now has a mechanism to change the implementation
under a given md device.  But generalising that to 'any device' is
not entirely trivial.  Now that I have done it for md I'm in a better
position to understand how it might be done.

The other problem is where to store the metadata.  You need at least a
few bytes and realistically 1K of space on the devices that is free to
be used by md to record information about device state to allow arrays to
be assembled correctly.

One idea I had was to get the filesystem to allocate a block and make that
available to md, then md would copy the data from the last block of the
device into that block and redirect all IO requests aimed at the
last block so they really access the relocated block.  Then md puts
its metadata in that last block.

This could work but is a little too error-prone for my liking.  e.g.
if you fsck the device, you suddenly lose your guarantee that
the filesystem isn't going to write to that relocation block.

I think it could only work if mdadm can inspect the device and ensure
that the last block isn't part of any partition, or any active filesystem.
This is possible, but messy.

e.g. on my notebook, which has a 250Gig drive, whatever I used to partition
it (cfdisk?) insisted on using multiples of cylinders for partitions
(what an out-of-date concept!), and as the reported geometry is

Disk /dev/sda: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders

There are 5103 unused sectors at the end - plenty of room for
md to put some metadata.  But if someone else had used sfdisk,
I think they would find no spare space and be unhappy.
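
Spelling the arithmetic out as a tiny stand-alone program (the numbers
are the ones quoted above):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long total = 250059350016ULL / 512; /* sectors on the disk */
		unsigned long long cyl   = 255ULL * 63;           /* sectors per cylinder */
		unsigned long long used  = cyl * 30401;           /* sectors in whole cylinders */

		printf("%llu sectors left over\n", total - used); /* prints 5103 */
		return 0;
	}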

Maybe it is sufficient to support just those people who are
lucky enough to not be using the whole device...


So it might happen, but it is just a little too easy to stick this
one in the too-hard basket.

Thanks,
NeilBrown




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:53     ` Keld Jørn Simonsen
@ 2009-02-12 10:45       ` NeilBrown
  2009-02-12 11:11         ` Keld Jørn Simonsen
  2009-02-12 15:28         ` Wil Reichert
  0 siblings, 2 replies; 50+ messages in thread
From: NeilBrown @ 2009-02-12 10:45 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On Thu, February 12, 2009 8:53 pm, Keld Jørn Simonsen wrote:
> On Thu, Feb 12, 2009 at 08:21:12PM +1100, NeilBrown wrote:
>> On Thu, February 12, 2009 7:11 pm, Keld Jørn Simonsen wrote:
>> > On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
>> >> Comments and testing very welcome.
>> >
>> > I would rather have functionality to convert raid10 to raid5.
>> > raid1 should be depreciated, as raid10,n2 for all purposes is the same
>> > but better implementation and performance, and raid10,f2 and raid10,o2
>> > are even better.  Nobody should use raid1 anymore.
>>
>> That is a fairly simplistic view.
>
> It was also formulated to provoke some thoughts.
>
>> raid1 supports --write-mostly and --write-behind which raid10 is
>> unlikely
>> ever to support.
>
> why?
>
> Anyway would it not be possible that this functionality be implemented
> for raid10,n2?

It would be possible, but it might not be sensible.

write-mostly and write-behind only really make sense when you have the
clear distinction between drives that raid1 gives you.
These options don't make sense for raid10 in general.  Only in very specific
layouts.
If you like, raid1 is an implementation of a specific raid10 layout,
where it makes sense to add some extra functionality.

>
> Some code to grow raid10 would also be desirable. Maybe it is some of
> the same operations that need to be applied: getting the old data in,
> have it restructured for the new format, in a safe way, and possibly
> with the help of an extra disk, or possibly not. It sounds non-trivial
> to me too.

What particular growth scenarios are you interested it?
Just adding a drive and restriping onto that?  i.e. keep the
same nominal layout but increase 'raid-disks'?

That would be quite similar to the raid5 grow operation so it shouldn't
be too hard to achieve.
A 'grow' which changed the layout (e.g. near to far) would be a lot
harder.

NeilBrown



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:46     ` Keld Jørn Simonsen
@ 2009-02-12 10:52       ` NeilBrown
  2009-02-12 11:16         ` Keld Jørn Simonsen
  2009-02-12 10:53       ` Julian Cowley
  1 sibling, 1 reply; 50+ messages in thread
From: NeilBrown @ 2009-02-12 10:52 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Steve Fairbairn, linux-raid

On Thu, February 12, 2009 8:46 pm, Keld Jørn Simonsen wrote:
> On Thu, Feb 12, 2009 at 09:13:17AM +0000, Steve Fairbairn wrote:
>> Keld Jørn Simonsen wrote:
>> >
>> >I would rather have functionality to convert raid10 to raid5.
>> >raid1 should be depreciated, as raid10,n2 for all purposes is the same
>> >but better implementation and performance, and raid10,f2 and raid10,o2
>> >are even better.  Nobody should use raid1 anymore.
>> >
>> Complete ignorance of raid10 here, but is raid10,<anything> bootable,
>> like raid1 is?  I use raid1 on my root and boot partitions.
>
> AFAIK, raid10,n2 in default mode (superblock etc) is bootable, as it
> looks like two copies of a normal FS. I think this was even reported
> on this list at some time.
>
> You are not the only one that does not know much about raid10. I think
> most Linux administrators don't.  And other system adminstrators most
> likely don't either.
>
> Maybe we should rename raid10 to raid1?
>
> Raid10 should just be an enhanced raid1. And I understand from Neil that
> raid10,o2 is mostly done because it is a standard raid1 layout. So it is
> strange that it is not available with raid1 in Linux. And I also think
> that the raid10,f2 layout is available from some HW raid controllers, as
> their implementation of raid1. So all what is in raid10 is other places
> considered raid1 stuff.
>

The 'offset' layout came about to be able to support a DDF format
which is called:

4.2.18 Integrated Offset Stripe Mirroring (PRL=11, RLQ=01)

(4.2.18 is the section of the document
 PRL is Primary RAID Level
 RLQ is RAID Level Qualifier
)
There is also

4.2.17 Integrated Adjacent Stripe Mirroring (PRL= 11, RLQ=00)

which is essentially the same as our n2 layout.

You should see their

4.3.4 Spanned Secondary RAID Level (SRL=03)

though.  That would be really ... interesting to implement.


You can download the DDF spec at

http://www.snia.org/tech_activities/standards/curr_standards/ddf/

NeilBrown



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:46     ` Keld Jørn Simonsen
  2009-02-12 10:52       ` NeilBrown
@ 2009-02-12 10:53       ` Julian Cowley
  2009-02-13 16:54         ` Bill Davidsen
  1 sibling, 1 reply; 50+ messages in thread
From: Julian Cowley @ 2009-02-12 10:53 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On Thu, 12 Feb 2009, Keld Jørn Simonsen wrote:
> On Thu, Feb 12, 2009 at 09:13:17AM +0000, Steve Fairbairn wrote:
>> Keld Jørn Simonsen wrote:
>>>
>>> I would rather have functionality to convert raid10 to raid5.
>>> raid1 should be depreciated, as raid10,n2 for all purposes is the same
>>> but better implementation and performance, and raid10,f2 and raid10,o2
>>> are even better.  Nobody should use raid1 anymore.

Interesting.

>> Complete ignorance of raid10 here, but is raid10,<anything> bootable,
>> like raid1 is?  I use raid1 on my root and boot partitions.
>
> AFAIK, raid10,n2 in default mode (superblock etc) is bootable, as it
> looks like two copies of a normal FS. I think this was even reported
> on this list at some time.
>
> You are not the only one that does not know much about raid10. I think
> most Linux administrators don't.  And other system adminstrators most
> likely don't either.

Probably because of its name, which is too close to RAID 1+0 and therefore 
can easily be ignored in situations where RAID 1+0 normally can't be used.

I always assumed (without looking into it) that raid10 was supposed to be 
a replacement/improvement on RAID 1+0, which requires a minimum of 4 
drives.  I'd seen the wikipedia page where 3 drives are used for raid10, 
but it never occurred to me that raid10 is such a generalization that it 
could be used with just two drives.

> Maybe we should rename raid10 to raid1?

That would be a mistake, too.  It's not tied to RAID 1+0, but it's not 
tied down to RAID 1 either, and its name should reflect that.  I'd prefer 
something like raid1+0E, raid0E, raid1+1, or even just raid11.  The latter 
term is not used anywhere I know of, so it stands out.

The horse has already probably left the barn on this one, though.

Perhaps instead the documentation in mdadm(8) and md(4) could be updated 
to mention that raid10 is a combination of the concepts in RAID 1 and RAID 
0, but is generalized enough so that it can be done with just two drives 
at a minimum.  That would have caught my eye, at least.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 10:45       ` NeilBrown
@ 2009-02-12 11:11         ` Keld Jørn Simonsen
  2009-02-12 15:28         ` Wil Reichert
  1 sibling, 0 replies; 50+ messages in thread
From: Keld Jørn Simonsen @ 2009-02-12 11:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Thu, Feb 12, 2009 at 09:45:38PM +1100, NeilBrown wrote:
> On Thu, February 12, 2009 8:53 pm, Keld Jørn Simonsen wrote:
> > On Thu, Feb 12, 2009 at 08:21:12PM +1100, NeilBrown wrote:
> >> On Thu, February 12, 2009 7:11 pm, Keld Jørn Simonsen wrote:
> >> > On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
> >> >> Comments and testing very welcome.
> >> >
> >> > I would rather have functionality to convert raid10 to raid5.
> >> > raid1 should be depreciated, as raid10,n2 for all purposes is the same
> >> > but better implementation and performance, and raid10,f2 and raid10,o2
> >> > are even better.  Nobody should use raid1 anymore.
> >>
> >> That is a fairly simplistic view.
> >
> > It was also formulated to provoke some thoughts.
> >
> >> raid1 supports --write-mostly and --write-behind which raid10 is
> >> unlikely
> >> ever to support.
> >
> > why?
> >
> > Anyway would it not be possible that this functionality be implemented
> > for raid10,n2?
> 
> It would be possible, but it might not be sensible.
> 
> write-mostly and write-behind only really make sense when you have the
> clear distinction between drives that raid1 gives you.
> These options don't make sense for raid10 in general.  Only in very specific
> layouts.
> If you like, raid1 is an implementation of a specific raid10 layout,
> where it makes sense to add some extra functionality.

Yes, I understand that.

> >
> > Some code to grow raid10 would also be desirable. Maybe it is some of
> > the same operations that need to be applied: getting the old data in,
> > have it restructured for the new format, in a safe way, and possibly
> > with the help of an extra disk, or possibly not. It sounds non-trivial
> > to me too.
> 
> What particular growth scenarios are you interested it?
> Just adding a drive and restriping onto that?  i.e keep that
> same nominal layout but increase 'raid-disks'?

yes. 

> That would be quite similar to the raid5 grow operation so it shouldn't
> be too hard to achieve.

Yes, I was thinking about growing a raid10,f2, and that should not be
that difficult. You could rebuild the raid-0 layers one at a time, each
from the other layer. And a way to make it fast would be to use some bigger
buffers, say 20 MB, to minimize head moves.

I have some needs for such growing for one of my servers.

I think some similar techniques could be used to grow n2 and o2,
given that there is a clear strategy for filling up the new layout that
requires less space than the old one, so the old space can be reused.
This should even be possible for growing by more than one drive.

> A 'grow' which changed the layout (e.g. near to far) would be a lot
> harder.

Hmm, my idea was that it should not be difficult, but it could be slow.
One could have a specific bitmap that tracks where all the data is
during the grow operation. But it could involve some intermediate
storage...

Best regards
keld

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 10:52       ` NeilBrown
@ 2009-02-12 11:16         ` Keld Jørn Simonsen
  0 siblings, 0 replies; 50+ messages in thread
From: Keld Jørn Simonsen @ 2009-02-12 11:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Steve Fairbairn, linux-raid

On Thu, Feb 12, 2009 at 09:52:32PM +1100, NeilBrown wrote:
> On Thu, February 12, 2009 8:46 pm, Keld Jørn Simonsen wrote:
> > On Thu, Feb 12, 2009 at 09:13:17AM +0000, Steve Fairbairn wrote:
> >> Keld Jørn Simonsen wrote:
> >> >
> >> >I would rather have functionality to convert raid10 to raid5.
> >> >raid1 should be depreciated, as raid10,n2 for all purposes is the same
> >> >but better implementation and performance, and raid10,f2 and raid10,o2
> >> >are even better.  Nobody should use raid1 anymore.
> >> >
> >> Complete ignorance of raid10 here, but is raid10,<anything> bootable,
> >> like raid1 is?  I use raid1 on my root and boot partitions.
> >
> > AFAIK, raid10,n2 in default mode (superblock etc) is bootable, as it
> > looks like two copies of a normal FS. I think this was even reported
> > on this list at some time.
> >
> > You are not the only one that does not know much about raid10. I think
> > most Linux administrators don't.  And other system adminstrators most
> > likely don't either.
> >
> > Maybe we should rename raid10 to raid1?
> >
> > Raid10 should just be an enhanced raid1. And I understand from Neil that
> > raid10,o2 is mostly done because it is a standard raid1 layout. So it is
> > strange that it is not available with raid1 in Linux. And I also think
> > that the raid10,f2 layout is available from some HW raid controllers, as
> > their implementation of raid1. So all what is in raid10 is other places
> > considered raid1 stuff.
> >
> 
> The 'offset' layout came about to be able to support a DDF format
> which is called:
> 
> 4.2.18 Integrated Offset Stripe Mirroring (PRL=11, RLQ=01)
> 
> (4.2.18 is the section of the document
>  PRL is Primary Raid Level
>  RLQ is Rail Level Qualifier
> )
> There is also
> 
> 4.2.17 Integrated Adjacent Stripe Mirroring (PRL= 11, RLQ=00)
> 
> which is essentially the same as our n2 layout.
> 
> You should see their
> 4.3.4 Spanned Secondary RAID Level (SRL=03)
> 
> Though.  That would be really .. interesting to implement.
> 
> 
> You can down load the ddf spec at
> 
> http://www.snia.org/tech_activities/standards/curr_standards/ddf/
> 
> NeilBrown

I did look at the spec, and I could not find anything that looked like
f2. I then sent them a mail some months ago suggesting that raid10,f2
be standardized. No answer (yet...). Am I right that raid10,f2 is not
described in their spec? A bit odd. The idea behind raid10,f2 is quite
straightforward, and gives good results.

Best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 10:40   ` NeilBrown
@ 2009-02-12 11:17     ` Farkas Levente
  2009-02-13 17:02       ` Bill Davidsen
  0 siblings, 1 reply; 50+ messages in thread
From: Farkas Levente @ 2009-02-12 11:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown wrote:
> On Thu, February 12, 2009 8:42 pm, Farkas Levente wrote:
>> NeilBrown wrote:
>>> Hi,
>>>  following is my current patch queue for 2.6.30, in case anyone would
>>> like to review or otherwise comment.
>>> They should show up in -next shortly.
>>>
>>> Probably the most interesting are the last few which provide support
>>> for converting a raid1 into a raid5, and a raid5 into a raid6.
>>> I plan to do some more work here so the code might change a bit before
>>> final submission, as I work out how best ot factor the code.
>>>
>>> mdadm doesn't current support these conversions, but you can
>>> simply
>>>    echo raid5 > /sys/block/md0/md/level
>>> to change a 2-drive raid1 into a raid5.  Similarly for 5->6
>> any plan for non-raid to raid1 or anything else like in windows on can
>> convert a normal partition into a mirrored one online.
> 
> No plan exactly, but I do think about it from time to time.
> 
> There are two problems with this, and solving just one of them
> doesn't help you much.  So you really have to solve both at once,
> which reduces the motivation towards either ....
> 
> One problem is the task of changing the implementation of the device
> underneath the filesystem without the filesystem needing to care.
> 
> i.e. the filesystem opens block device 8,1 (/dev/sda1) and starts do
> IO, then mdadm steps in and magically changes things so that /dev/sda1
> is now on a raid1 array which happens to access the same data, but
> through a different code path.
> Figuring out exactly which data structure to install the redirection
> and how to doing in a way that is guaranteed to be safe is non-trivial.
> 
> dm has a mechanism to change the implementation under a given dm
> device, and md now has an mechanism to change the implementation
> under a given md device.  But generalising that to 'any device' is
> not entirely trivial.  Now that I have done it for md I'm in a better
> position to understand how it might be done.
> 
> The other problem is where to store the metadata.  You need at least a
> few bytes and realistically 1K of space on the devices that is free to
> be used by md to record information about device state to allow arrays to
> be assembled correctly.
> 
> One idea I had was to get the filesystem to allocate a block and make that
> available to md, then md would copy the data from the last block of the
> device into that block and redirect all IO request aim at the
> last block so that really access the relocated block.  Then md puts
> it's metadata in that last block.
> 
> This could work but is a little to error prone for my liking.  e.g.
> if you fsck the device, you suddenly loose your guarantee that
> the filesystem isn't going to write to that relocation block.
> 
> I think it could only work if mdadm can inspect the device and ensure
> that the last block isn't part of any partition, or any active filesystem.
> This is possible, but messy.
> 
> e.g. on my notebook which has a 250Gig drive whatever I used to partition
> it (cfdisk?) insisted on using multiples of cylinders for partitions
> (what an out-of-date concept!) and as the reported geometry is
> 
> Disk /dev/sda: 250.0 GB, 250059350016 bytes
> 255 heads, 63 sectors/track, 30401 cylinders
> 
> There are 5013 unused sectors at the end - plenty of room for
> md to put some metadata.  But if someone else had used sfdisk,
> I think they would find no spare space and be unhappy.
> 
> Maybe it is sufficient to support just those people who are
> lucky enough to not be using the whole device...
> 
> 
> So it might happen, but it is just a little to easy to stick this
> one in the too-hard basket.

The main reason here is real life. I have seen many cases where a
system was installed on a single disk and only later did it become
desirable to make it redundant (as most sysadmins put it: "it doesn't
work on Linux, yet it even works on Windows - just put in a new disk
and make it a mirror"). I don't know the technical details, but it
would be a very useful feature.

-- 
  Levente                               "Si vis pacem para bellum!"

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 10:45       ` NeilBrown
  2009-02-12 11:11         ` Keld Jørn Simonsen
@ 2009-02-12 15:28         ` Wil Reichert
  2009-02-12 17:44           ` Keld Jørn Simonsen
  1 sibling, 1 reply; 50+ messages in thread
From: Wil Reichert @ 2009-02-12 15:28 UTC (permalink / raw)
  To: NeilBrown; +Cc: Keld Jørn Simonsen, linux-raid

On Thu, Feb 12, 2009 at 2:45 AM, NeilBrown <neilb@suse.de> wrote:
> On Thu, February 12, 2009 8:53 pm, Keld Jørn Simonsen wrote:
>> On Thu, Feb 12, 2009 at 08:21:12PM +1100, NeilBrown wrote:
>>> On Thu, February 12, 2009 7:11 pm, Keld Jørn Simonsen wrote:
>>> > On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
>>> >> Comments and testing very welcome.
>>> >
>>> > I would rather have functionality to convert raid10 to raid5.
>>> > raid1 should be depreciated, as raid10,n2 for all purposes is the same
>>> > but better implementation and performance, and raid10,f2 and raid10,o2
>>> > are even better.  Nobody should use raid1 anymore.
>>>
>>> That is a fairly simplistic view.
>>
>> It was also formulated to provoke some thoughts.
>>
>>> raid1 supports --write-mostly and --write-behind which raid10 is
>>> unlikely
>>> ever to support.
>>
>> why?
>>
>> Anyway would it not be possible that this functionality be implemented
>> for raid10,n2?
>
> It would be possible, but it might not be sensible.
>
> write-mostly and write-behind only really make sense when you have the
> clear distinction between drives that raid1 gives you.
> These options don't make sense for raid10 in general.  Only in very specific
> layouts.
> If you like, raid1 is an implementation of a specific raid10 layout,
> where it makes sense to add some extra functionality.
>
>>
>> Some code to grow raid10 would also be desirable. Maybe it is some of
>> the same operations that need to be applied: getting the old data in,
>> have it restructured for the new format, in a safe way, and possibly
>> with the help of an extra disk, or possibly not. It sounds non-trivial
>> to me too.
>
> What particular growth scenarios are you interested it?
> Just adding a drive and restriping onto that?  i.e keep that
> same nominal layout but increase 'raid-disks'?
>
> That would be quite similar to the raid5 grow operation so it shouldn't
> be too hard to achieve.
> A 'grow' which changed the layout (e.g. near to far) would be a lot
> harder.

I'd previously seen the wikipedia article regarding Linux RAID10 and
its mention of the 3 disk case.  Out of academic curiosity, how does
the 2 disk RAID10 work?  Is it just a matter of having 2 identical
volumes and reading subsequent stripes from the alternate drives?  Or
is the algorithm more complicated?

Wil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device.
  2009-02-12  3:10 ` [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device NeilBrown
@ 2009-02-12 16:56   ` Andre Noll
  2009-02-13 22:19     ` Dan Williams
  2009-02-16  0:08     ` Neil Brown
  2009-02-13 16:37   ` Bill Davidsen
  1 sibling, 2 replies; 50+ messages in thread
From: Andre Noll @ 2009-02-12 16:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3225 bytes --]

On 14:10, NeilBrown wrote:

> -static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous);
> +static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
> +			   int *qd_idx);

That's a bit ugly, isn't it? The function computes both the p index
and the q index which is not obvious from its name. Also, the p index
is the return value and the q index is returned via a pointer which
is a bit unsymmetric.

static void get_parity_indices(sector_t stripe, raid5_conf_t *conf, int previous,
				 int *pd_idx, int *qd_idx);

perhaps?

> +	/* Note that unlike RAID-5, the ordering of the disks matters greatly.*/
> +	/* FIX: Is this ordering of drives even remotely optimal? */
> +	count = 0;
> +	i = d0_idx;
> +	do {
> +		if (i == sh->pd_idx)
> +			ptrs[disks-2] = page_address(sh->dev[i].page);
> +		else if (i == sh->qd_idx)
> +			ptrs[disks-1] = page_address(sh->dev[i].page);
> +		else {
>  			ptrs[count++] = page_address(sh->dev[i].page);
> -			if (count <= disks-2 && !test_bit(R5_UPTODATE, &sh->dev[i].flags))
> +			if (!test_bit(R5_UPTODATE, &sh->dev[i].flags))
>  				printk("block %d/%d not uptodate on parity calc\n", i,count);
> -			i = raid6_next_disk(i, disks);
> -		} while ( i != d0_idx );
> -//		break;
> -//	}
> +		}
> +		i = raid6_next_disk(i, disks);
> +	} while (i != d0_idx);
> +	BUG_ON(count+2 != disks);

Isn't it really dangerous to ignore a dirty stripe head during parity
calculation? I understand that compute_parity6() does not have a
return value and that dirty stripe heads have always been ignored by
compute_parity6(). But it looks much saner to fail in this case.

BTW, the printk lacks a KERN_ERR.

> +	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
> +	void *ptrs[disks];

How serious is this? It's probably not a problem for version 0.90
superblocks as one can't have more than 26 disks. OTOH, for version
1 superblocks we might end up using up to 2K of stack space on 64
bit machines.

Would it be a reasonable to always allocate 26 pointers, say, and
fall back to malloc() if this turns out to be too small?

> +	count = 0;
> +	i = d0_idx;
> +	do {
> +		int slot;
> +		if (i == sh->pd_idx)
> +			slot = disks-2;
> +		else if (i == sh->qd_idx)
> +			slot = disks-1;
> +		else
> +			slot = count++;
> +		ptrs[slot] = page_address(sh->dev[i].page);
> +		if (i == dd_idx1)
> +			faila = slot;
> +		if (i == dd_idx2)
> +			failb = slot;
> +		i = raid6_next_disk(i, disks);
> +	} while (i != d0_idx);
> +	BUG_ON(count+2 != disks);

Deja vu. How about using a helper function like

static inline int is_parity_idx(int idx, struct stripe_head_sh)
{
	if (idx == sh->pd_idx)
		return sh->disks - 2;
	if (idx == sh->qd_idx)
		return sh->disks - 1;
	return 0;
}

Then the above becomes

	int slot = is_parity_idx(i, sh);
	if (!slot)
		slot = count++;

And in compute_parity6() we could write

	count = 0;
	i = d0_idx;
	do {
		int slot = is_parity_idx(i, sh);
		if (!slot)
			slot = count++;
		ptrs[slot] = page_address(sh->dev[i].page);


Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash
  2009-02-12  3:10 ` [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash NeilBrown
@ 2009-02-12 17:26   ` John Stoffel
  2009-02-13 16:20   ` Bill Davidsen
  1 sibling, 0 replies; 50+ messages in thread
From: John Stoffel @ 2009-02-12 17:26 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

>>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:

NeilBrown> '16' is a fairly arbitrary number.  But we don't really
NeilBrown> have any good way to judge how often is acceptable, and it
NeilBrown> seems like a reasonable number for now.

Maybe once an hour as well?  I don't have good numbers here, but just
a feeling that I'd hate to waste more than an hour if possible.  But hey,
this is a big improvement no matter what.

John

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 15:28         ` Wil Reichert
@ 2009-02-12 17:44           ` Keld Jørn Simonsen
  0 siblings, 0 replies; 50+ messages in thread
From: Keld Jørn Simonsen @ 2009-02-12 17:44 UTC (permalink / raw)
  To: Wil Reichert; +Cc: NeilBrown, linux-raid

On Thu, Feb 12, 2009 at 07:28:59AM -0800, Wil Reichert wrote:
> On Thu, Feb 12, 2009 at 2:45 AM, NeilBrown <neilb@suse.de> wrote:
> > On Thu, February 12, 2009 8:53 pm, Keld Jørn Simonsen wrote:
> >> On Thu, Feb 12, 2009 at 08:21:12PM +1100, NeilBrown wrote:
> >>> On Thu, February 12, 2009 7:11 pm, Keld Jørn Simonsen wrote:
> >>> > On Thu, Feb 12, 2009 at 02:10:10PM +1100, NeilBrown wrote:
> >>> >> Comments and testing very welcome.
> >>> >
> >>> > I would rather have functionality to convert raid10 to raid5.
> >>> > raid1 should be depreciated, as raid10,n2 for all purposes is the same
> >>> > but better implementation and performance, and raid10,f2 and raid10,o2
> >>> > are even better.  Nobody should use raid1 anymore.
> >>>
> >>> That is a fairly simplistic view.
> >>
> >> It was also formulated to provoke some thoughts.
> >>
> >>> raid1 supports --write-mostly and --write-behind which raid10 is
> >>> unlikely
> >>> ever to support.
> >>
> >> why?
> >>
> >> Anyway would it not be possible that this functionality be implemented
> >> for raid10,n2?
> >
> > It would be possible, but it might not be sensible.
> >
> > write-mostly and write-behind only really make sense when you have the
> > clear distinction between drives that raid1 gives you.
> > These options don't make sense for raid10 in general.  Only in very specific
> > layouts.
> > If you like, raid1 is an implementation of a specific raid10 layout,
> > where it makes sense to add some extra functionality.
> >
> >>
> >> Some code to grow raid10 would also be desirable. Maybe it is some of
> >> the same operations that need to be applied: getting the old data in,
> >> have it restructured for the new format, in a safe way, and possibly
> >> with the help of an extra disk, or possibly not. It sounds non-trivial
> >> to me too.
> >
> > What particular growth scenarios are you interested it?
> > Just adding a drive and restriping onto that?  i.e keep that
> > same nominal layout but increase 'raid-disks'?
> >
> > That would be quite similar to the raid5 grow operation so it shouldn't
> > be too hard to achieve.
> > A 'grow' which changed the layout (e.g. near to far) would be a lot
> > harder.
> 
> I'd previously seen the wikipedia article regarding Linux RAID10 and
> its mention of the 3 disk case.  Out of academic curiousity, how does
> the 2 disk RAID10 work?  Is it just a matter of have 2 identical
> volumes and reading subsequent stripes from the alteranate drives?  Or
> is the algorithm more complicated?

There are 3 different layouts for raid10: near, far and offset.
Basically raid10 with 2 disks works as RAID 1 - "near" and Linux
raid1 are almost equivalent. The far and offset layouts place the
blocks on the disks somewhat differently, which should lead to faster
operation for some common tasks.
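
As a rough illustration, here is a simplified model (chunk size ignored,
not the real mapping code in drivers/md/raid10.c) of where far=2 places
the two copies of each chunk on a 2-disk array:

#include <stdio.h>

/* far=2: the second copy of each chunk lives in the far half of the
 * disks, shifted one device over.  'half' is the number of chunk rows
 * in each half of a disk.  Illustration only. */
static void far2(int chunk, int disks, int half, int copy,
		 int *disk, int *row)
{
	*disk = (chunk + copy) % disks;		/* 2nd copy shifted one device */
	*row  = chunk / disks + copy * half;	/* ...in the far half of the disk */
}

int main(void)
{
	int c, copy, disk, row;

	for (c = 0; c < 4; c++)
		for (copy = 0; copy < 2; copy++) {
			far2(c, 2, 4, copy, &disk, &row);
			printf("chunk %d copy %d -> disk %d row %d\n",
			       c, copy, disk, row);
		}
	return 0;
}

For 2 disks this gives disk 0 = A C ... (first half) plus B D ... (second
half), and disk 1 = B D ... plus A C ..., so each disk holds all the data
and sequential reads can be striped across the outer halves of both disks
much like raid0.  With near=2 on two disks the copies simply sit at the
same offset on both disks, which is why it behaves like raid1.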

best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:13   ` Steve Fairbairn
  2009-02-12  9:46     ` Keld Jørn Simonsen
@ 2009-02-12 22:57     ` Dan Williams
  2009-02-13 16:56     ` Bill Davidsen
  2 siblings, 0 replies; 50+ messages in thread
From: Dan Williams @ 2009-02-12 22:57 UTC (permalink / raw)
  To: Steve Fairbairn; +Cc: linux-raid

On Thu, Feb 12, 2009 at 2:13 AM, Steve Fairbairn
<steve@fairbairn-family.com> wrote:
> Keld Jørn Simonsen wrote:
>>
>> I would rather have functionality to convert raid10 to raid5.
>> raid1 should be depreciated, as raid10,n2 for all purposes is the same
>> but better implementation and performance, and raid10,f2 and raid10,o2
>> are even better.  Nobody should use raid1 anymore.
>>
> Complete ignorance of raid10 here, but is raid10,<anything> bootable, like
> raid1 is?  I use raid1 on my root and boot partitions.
>

I would point out that raid1 is not, strictly speaking, bootable.
Yes, it happens to work because the bootloader can treat each disk as
a standalone device, but the bootloader has no idea if a given disk is
failed, rebuilding, or in-sync.  To reliably boot from raid you need
an option-rom that understands the on disk metadata format and exposes
the array as a single device to the bootloader.

Regards,
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash
  2009-02-12  3:10 ` [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash NeilBrown
  2009-02-12 17:26   ` John Stoffel
@ 2009-02-13 16:20   ` Bill Davidsen
  2009-02-13 16:34     ` Jon Nelson
  1 sibling, 1 reply; 50+ messages in thread
From: Bill Davidsen @ 2009-02-13 16:20 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown wrote:
> Version 1.x metadata has the ability to record the status of a
> partially completed drive recovery.
> However we only update that record on a clean shutdown.
> It would be nice to update it on unclean shutdowns too, particularly
> when using a bitmap that removes much to the 'sync' effort after an
> unclean shutdown.
>
> One complication with checkpointing recovery is that we only know
> where we are up to in terms of IO requests started, not which ones
> have completed.  And we need to know what has completed to record
> how much is recovered.  So occasionally pause the recovery until all
> submitted requests are completed, then update the record of where
> we are up to.
>
> When we have a bitmap, we already do that pause occasionally to keep
> the bitmap up-to-date.  So enhance that code to record the recovery
> offset and schedule a superblock update.
> And when there is no bitmap, just pause 16 times during the resync to
> do a checkpoint.
> '16' is a fairly arbitrary number.  But we don't really have any good
> way to judge how often is acceptable, and it seems like a reasonable
> number for now.
>   

Since the object of this code is to save time on shutdown and restart, 
16 has little relation to time. I would think that having this update on 
a time basis would more reasonably reflect this. I would like to see a 
fairly short time, say ten minutes, since the cost of a save is low, and 
ten minutes seems like a reasonable lower bound on "worth effort to 
save" recovery.

As arrays get larger even a 16th of the recovery time can be a pretty 
long time, particularly if the min recovery speed is set fairly low to 
avoid impact on a production server.

Thought for comment: I already move a lot of overhead to the 2-6am slot 
of low load; would changing the rebuild speeds during prime load be 
desirable? The con is longer degraded operation, the pro is less impact 
on performance.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash
  2009-02-13 16:20   ` Bill Davidsen
@ 2009-02-13 16:34     ` Jon Nelson
  0 siblings, 0 replies; 50+ messages in thread
From: Jon Nelson @ 2009-02-13 16:34 UTC (permalink / raw)
  Cc: NeilBrown, linux-raid

..
> Since the object of this code is to save time on shutdown and restart, 16
> has little relation to time. I would think that having this update on a time
> basis would more reasonably reflect this. I would like to see a fairly short
> time, say ten minutes, since the cost of a save is low, and ten minutes
> seems like a reasonable lower bound on "worth effort to save" recovery.
>
> As arrays get larger even a 16th of the recovery time can be a pretty long
> time, particularly if the min recovery speed is set fairly low to avoid
> impact on a production server.

I agree. This seems pretty reasonable. How can one protect against time
changes (like my clock just got reset back 21000 seconds type of
problem)?

-- 
Jon

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device.
  2009-02-12  3:10 ` [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device NeilBrown
  2009-02-12 16:56   ` Andre Noll
@ 2009-02-13 16:37   ` Bill Davidsen
  2009-02-16  5:15     ` Neil Brown
  1 sibling, 1 reply; 50+ messages in thread
From: Bill Davidsen @ 2009-02-13 16:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

NeilBrown wrote:
> Code currently assumes that the devices in a raid6 stripe are
>   0 1 ... N-1 P Q
> in some rotated order.  We will shortly add new layouts in which
> this strict pattern is broken.
> So remove this expectation.  We still assume that the data disks
> are roughly in-order.  However P and Q can be inserted anywhere within
> that order.
>   

Is this a change which could be done on the fly? Because if it is, 
there's obviously a huge performance gain to be had in degraded mode.

To clarify, if the order could be changed (raid5 example):
    running array      0 1 2 3 4 P
    failed device      0 1 F 3 4 P
reads from the failed device require reconstructing its data.  But if the
order could be changed on the fly, then "recover" would be a one-time
operation: recover the entire chunk previously on device 2, write it to
device 5 (P), and change the order so the parity sits on the failed device
(we can't recover from another failure anyway):
    remapped stripe    0 1 F 3 4 2

Now reads no longer cost a reconstruction, and writes hopefully could skip
the parity completely.  When the failed device is replaced, the rebuild
could restore the chunk order to improve leveling of head motion.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 10:53       ` Julian Cowley
@ 2009-02-13 16:54         ` Bill Davidsen
  2009-02-16  5:35           ` Neil Brown
  0 siblings, 1 reply; 50+ messages in thread
From: Bill Davidsen @ 2009-02-13 16:54 UTC (permalink / raw)
  To: Julian Cowley; +Cc: Keld Jørn Simonsen, linux-raid, Neil Brown

Julian Cowley wrote:
> That would be a mistake, too.  It's not tied to RAID 1+0, but it's not 
> tied down to RAID 1 either, and its name should reflect that.  I'd 
> prefer something like raid1+0E, raid0E, raid1+1, or even just raid11.  
> The latter term is not used anywhere I know of, so it stands out.
>
Storage Computer (NH) has a trademark on one of the raid levels, from 
memory raid7.  Raid5E is used to denote a raid5 with distributed spare 
(something I would love to see in md).

> The horse has already probably left the barn on this one, though.
>
And in this case locking the barn door after the horse has left is 
probably the path of least confusion.

> Perhaps instead the documentation in mdadm(8) and md(4) could be 
> updated to mention that raid10 is a combination of the concepts in 
> RAID 1 and RAID 0, but is generalized enough so that it can be done 
> with just two drives at a minimum.  That would have caught my eye, at 
> least.

Good idea.

Ob. plug for raid5E: the advantages of raid5E are two-fold. The most 
obvious is that head motion is spread over N+2 drives (N being number of 
data drives) which improves performance quite a bit in the common small 
business case of 4-5 drive setups. It also puts some use on each drive, 
so you don't suddenly start using a drive which may have been spun down 
for a month, may have developed issues since SMART was last run, etc.

While the distributed spare idea could be extended to raid6 and raid10, 
the mapping gets complex. Since Neil is currently adding code to allow 
for orders other than sequential in raid6, being able to quickly deploy 
the spare on a once-per-stripe basis might at least get him to rethink 
the concept.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12  9:13   ` Steve Fairbairn
  2009-02-12  9:46     ` Keld Jørn Simonsen
  2009-02-12 22:57     ` Dan Williams
@ 2009-02-13 16:56     ` Bill Davidsen
  2 siblings, 0 replies; 50+ messages in thread
From: Bill Davidsen @ 2009-02-13 16:56 UTC (permalink / raw)
  To: Steve Fairbairn; +Cc: linux-raid

Steve Fairbairn wrote:
> Keld Jørn Simonsen wrote:
>>
>> I would rather have functionality to convert raid10 to raid5.
>> raid1 should be depreciated, as raid10,n2 for all purposes is the same
>> but better implementation and performance, and raid10,f2 and raid10,o2
>> are even better.  Nobody should use raid1 anymore.
>>
> Complete ignorance of raid10 here, but is raid10,<anything> bootable, 
> like raid1 is?  I use raid1 on my root and boot partitions.

Note: if you use an initrd you can get away with raid1 on only the boot 
partition (I do with Fedora/GRUB). And most "recovery CDs" will not use 
a raid10 swap without starting it by hand, important only on small 
memory machines.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-12 11:17     ` Farkas Levente
@ 2009-02-13 17:02       ` Bill Davidsen
  0 siblings, 0 replies; 50+ messages in thread
From: Bill Davidsen @ 2009-02-13 17:02 UTC (permalink / raw)
  To: Farkas Levente; +Cc: NeilBrown, linux-raid

Farkas Levente wrote:
> NeilBrown wrote:
>   
>> On Thu, February 12, 2009 8:42 pm, Farkas Levente wrote:
>>     
>>> NeilBrown wrote:
>>>       
>>>> Hi,
>>>>  following is my current patch queue for 2.6.30, in case anyone would
>>>> like to review or otherwise comment.
>>>> They should show up in -next shortly.
>>>>
>>>> Probably the most interesting are the last few which provide support
>>>> for converting a raid1 into a raid5, and a raid5 into a raid6.
>>>> I plan to do some more work here so the code might change a bit before
>>>> final submission, as I work out how best ot factor the code.
>>>>
>>>> mdadm doesn't current support these conversions, but you can
>>>> simply
>>>>    echo raid5 > /sys/block/md0/md/level
>>>> to change a 2-drive raid1 into a raid5.  Similarly for 5->6
>>>>         
>>> any plan for non-raid to raid1 or anything else like in windows on can
>>> convert a normal partition into a mirrored one online.
>>>       
>> No plan exactly, but I do think about it from time to time.
>>
>> There are two problems with this, and solving just one of them
>> doesn't help you much.  So you really have to solve both at once,
>> which reduces the motivation towards either ....
>>
>> One problem is the task of changing the implementation of the device
>> underneath the filesystem without the filesystem needing to care.
>>
>> i.e. the filesystem opens block device 8,1 (/dev/sda1) and starts do
>> IO, then mdadm steps in and magically changes things so that /dev/sda1
>> is now on a raid1 array which happens to access the same data, but
>> through a different code path.
>> Figuring out exactly which data structure to install the redirection
>> and how to doing in a way that is guaranteed to be safe is non-trivial.
>>
>> dm has a mechanism to change the implementation under a given dm
>> device, and md now has an mechanism to change the implementation
>> under a given md device.  But generalising that to 'any device' is
>> not entirely trivial.  Now that I have done it for md I'm in a better
>> position to understand how it might be done.
>>
>> The other problem is where to store the metadata.  You need at least a
>> few bytes and realistically 1K of space on the devices that is free to
>> be used by md to record information about device state to allow arrays to
>> be assembled correctly.
>>
>> One idea I had was to get the filesystem to allocate a block and make that
>> available to md, then md would copy the data from the last block of the
>> device into that block and redirect all IO request aim at the
>> last block so that really access the relocated block.  Then md puts
>> it's metadata in that last block.
>>
>> This could work but is a little to error prone for my liking.  e.g.
>> if you fsck the device, you suddenly loose your guarantee that
>> the filesystem isn't going to write to that relocation block.
>>
>> I think it could only work if mdadm can inspect the device and ensure
>> that the last block isn't part of any partition, or any active filesystem.
>> This is possible, but messy.
>>
>> e.g. on my notebook which has a 250Gig drive whatever I used to partition
>> it (cfdisk?) insisted on using multiples of cylinders for partitions
>> (what an out-of-date concept!) and as the reported geometry is
>>
>> Disk /dev/sda: 250.0 GB, 250059350016 bytes
>> 255 heads, 63 sectors/track, 30401 cylinders
>>
>> There are 5013 unused sectors at the end - plenty of room for
>> md to put some metadata.  But if someone else had used sfdisk,
>> I think they would find no spare space and be unhappy.
>>
>> Maybe it is sufficient to support just those people who are
>> lucky enough to not be using the whole device...
>>
>>
>> So it might happen, but it is just a little to easy to stick this
>> one in the too-hard basket.
>>     
>
> the main reason here is our life. i saw many cases where there was a
> system installed to one system and later it'd be nice to make it
> redundant (a most sysadm said: it's not working on linux it's even
> working on windows, just put into a new disk and make it mirror).
> so i don't know the technical detail, but would be a very useful feature.
>
>   
I think you can get there for normal file systems data by creating a 
raid1 on a new drive using a failed drive. Then copy the data from the 
unmirrored drive to the mirrored f/s, unmount the original drive and 
mount the array, and add the original drive to the new array. This is 
ugly, and a verified backup and restore is better, but it can be done.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device.
  2009-02-12 16:56   ` Andre Noll
@ 2009-02-13 22:19     ` Dan Williams
  2009-02-16  0:08     ` Neil Brown
  1 sibling, 0 replies; 50+ messages in thread
From: Dan Williams @ 2009-02-13 22:19 UTC (permalink / raw)
  To: Andre Noll; +Cc: NeilBrown, linux-raid

On Thu, Feb 12, 2009 at 9:56 AM, Andre Noll <maan@systemlinux.org> wrote:
> On 14:10, NeilBrown wrote:
>
>> -static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous);
>> +static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
>> +                        int *qd_idx);
>
> That's a bit ugly, isn't it? The function computes both the p index
> and the q index which is not obvious from its name. Also, the p index
> is the return value and the q index is returned via a pointer which
> is a bit unsymmetric.
>
> static void get_parity_indices(sector_t stripe, raid5_conf_t *conf, int previous,
>                                 int *pd_idx, int *qd_idx);
>
> perhaps?
>
>> +     /* Note that unlike RAID-5, the ordering of the disks matters greatly.*/
>> +     /* FIX: Is this ordering of drives even remotely optimal? */
>> +     count = 0;
>> +     i = d0_idx;
>> +     do {
>> +             if (i == sh->pd_idx)
>> +                     ptrs[disks-2] = page_address(sh->dev[i].page);
>> +             else if (i == sh->qd_idx)
>> +                     ptrs[disks-1] = page_address(sh->dev[i].page);
>> +             else {
>>                       ptrs[count++] = page_address(sh->dev[i].page);
>> -                     if (count <= disks-2 && !test_bit(R5_UPTODATE, &sh->dev[i].flags))
>> +                     if (!test_bit(R5_UPTODATE, &sh->dev[i].flags))
>>                               printk("block %d/%d not uptodate on parity calc\n", i,count);
>> -                     i = raid6_next_disk(i, disks);
>> -             } while ( i != d0_idx );
>> -//           break;
>> -//   }
>> +             }
>> +             i = raid6_next_disk(i, disks);
>> +     } while (i != d0_idx);
>> +     BUG_ON(count+2 != disks);
>
> Isn't it really dangerous to ignore a dirty stripe head during parity
> calculation? I understand that compute_parity6() does not have a
> return value and that dirty stripe heads have always been ignored by
> compute_parity6(). But it looks much saner to fail in this case.
>
> BTW, the printk lacks a KERN_ERR.
>
>> +     /**** FIX THIS: This could be very bad if disks is close to 256 ****/
>> +     void *ptrs[disks];
>
> How serious is this? It's probably not a problem for version 0.90
> superblocks as one can't have more than 26 disks. OTOH, for version
> 1 superblocks we might end up using up to 2K of stack space on 64
> bit machines.

This is ok in recent kernels because we perform raid calculations in
raid5d or md_do_sync where we have our own stack.  However, with raid6
acceleration we also want to convert dma addresses on the stack which
means the calculation needs double the amount of space.

> Would it be a reasonable to always allocate 26 pointers, say, and
> fall back to malloc() if this turns out to be too small?

Even a GFP_ATOMIC allocation may fail and we would always be
allocating in the > 26 disks case.  The worst case scenario is
actually 4K stacks on a raid6 acceleration enabled 32-bit system with
64-bit dma addresses.  This configuration would require 3K of stack to
start a 256 disk parity calculation, but that should still be ok given
we have our own thread.

Regards,
Dan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device.
  2009-02-12 16:56   ` Andre Noll
  2009-02-13 22:19     ` Dan Williams
@ 2009-02-16  0:08     ` Neil Brown
  1 sibling, 0 replies; 50+ messages in thread
From: Neil Brown @ 2009-02-16  0:08 UTC (permalink / raw)
  To: Andre Noll; +Cc: linux-raid

On Thursday February 12, maan@systemlinux.org wrote:
> On 14:10, NeilBrown wrote:
> 
> > -static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous);
> > +static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int previous,
> > +			   int *qd_idx);
> 
> That's a bit ugly, isn't it? The function computes both the p index
> and the q index which is not obvious from its name. Also, the p index
> is the return value and the q index is returned via a pointer which
> is a bit unsymmetric.
> 
> static void get_parity_indices(sector_t stripe, raid5_conf_t *conf, int previous,
> 				 int *pd_idx, int *qd_idx);
> 
> perhaps?

Probably a good idea, except that it is only an interim state on the
way to changing the function to "stripe_set_idx" where we pass a
'struct stripe_head *' and it fills the details in directly.

So, while I agree it is ugly, I think I'll leave it as it is.

> 
> > +	/* Note that unlike RAID-5, the ordering of the disks matters greatly.*/
> > +	/* FIX: Is this ordering of drives even remotely optimal? */
> > +	count = 0;
> > +	i = d0_idx;
> > +	do {
> > +		if (i == sh->pd_idx)
> > +			ptrs[disks-2] = page_address(sh->dev[i].page);
> > +		else if (i == sh->qd_idx)
> > +			ptrs[disks-1] = page_address(sh->dev[i].page);
> > +		else {
> >  			ptrs[count++] = page_address(sh->dev[i].page);
> > -			if (count <= disks-2 && !test_bit(R5_UPTODATE, &sh->dev[i].flags))
> > +			if (!test_bit(R5_UPTODATE, &sh->dev[i].flags))
> >  				printk("block %d/%d not uptodate on parity calc\n", i,count);
> > -			i = raid6_next_disk(i, disks);
> > -		} while ( i != d0_idx );
> > -//		break;
> > -//	}
> > +		}
> > +		i = raid6_next_disk(i, disks);
> > +	} while (i != d0_idx);
> > +	BUG_ON(count+2 != disks);
> 
> Isn't it really dangerous to ignore a dirty stripe head during parity
> calculation? I understand that compute_parity6() does not have a
> return value and that dirty stripe heads have always been ignored by
> compute_parity6(). But it looks much saner to fail in this case.
> 
> BTW, the printk lacks a KERN_ERR.
> 

Good points.
The printk should probably be a BUG really.  It certainly is dangerous
to ignore blocks that aren't up-to-date, but they should never be
found.  If the block is not UPTODATE then there is a bug in the code.
And as such a bug could lead to data corruption, a BUG() call is probably
justified.
I've added the BUG and the KERN_ERR.

> > +	/**** FIX THIS: This could be very bad if disks is close to 256 ****/
> > +	void *ptrs[disks];
> 
> How serious is this? It's probably not a problem for version 0.90
> superblocks as one can't have more than 26 disks. OTOH, for version
> 1 superblocks we might end up using up to 2K of stack space on 64
> bit machines.
> 
> Would it be a reasonable to always allocate 26 pointers, say, and
> fall back to malloc() if this turns out to be too small?

As Dan Williams has said, malloc is not really an option in this
context, and it probably isn't a real problem - at the moment at
least.

However it feels rather fragile to me.  I think I would like to
allocate space somewhere for these sort of arrays, possibly in the
conf_t.  Before doing that we need to be sure to understand where we
could ever want more than one at a time, and ensure that any future
change that might add more parallelism won't silently break any
assumptions we make now.

> 
> > +	count = 0;
> > +	i = d0_idx;
> > +	do {
> > +		int slot;
> > +		if (i == sh->pd_idx)
> > +			slot = disks-2;
> > +		else if (i == sh->qd_idx)
> > +			slot = disks-1;
> > +		else
> > +			slot = count++;
> > +		ptrs[slot] = page_address(sh->dev[i].page);
> > +		if (i == dd_idx1)
> > +			faila = slot;
> > +		if (i == dd_idx2)
> > +			failb = slot;
> > +		i = raid6_next_disk(i, disks);
> > +	} while (i != d0_idx);
> > +	BUG_ON(count+2 != disks);
> 
> Deja vu. How about using a helper function like
> 
> static inline int is_parity_idx(int idx, struct stripe_head_sh)
> {
> 	if (idx == sh->pd_idx)
> 		return sh->disks - 2;
> 	if (idx == sh->qd_idx)
> 		return sh->disks - 1;
> 	return 0;
> }
> 
> Then the above becomes
> 
> 	int slot = is_parity_idx(i, sh);
> 	if (!slot)
> 		slot = count++;
> 
> And in compute_parity6() we could write
> 
> 	count = 0;
> 	i = d0_idx;
> 	do {
> 		int slot = is_parity_idx(i, sh);
> 		if (!slot)
> 			slot = count++;
> 		ptrs[slot] = page_address(sh->dev[i].page);

That is a nice improvement.... except that "is_parity_idx" sounds like
it should return True or False, but it actually returns False or a
slot number.
I might make it:
static int raid6_idx_to_slot(int idx, struct stripe_head *sh, int *count)
{
	int slot;
	if (idx == sh->pd_idx)
		return sh->disks - 2;
	if (idx == sh->qd_idx)
		return sh->disks - 1;
	slot = (*count)++;
	return slot;
}

and add a suitable comment
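
With that, the loop in compute_parity6() would read roughly (untested):

	count = 0;
	i = d0_idx;
	do {
		int slot = raid6_idx_to_slot(i, sh, &count);
		ptrs[slot] = page_address(sh->dev[i].page);
		i = raid6_next_disk(i, disks);
	} while (i != d0_idx);
	BUG_ON(count+2 != disks);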

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device.
  2009-02-13 16:37   ` Bill Davidsen
@ 2009-02-16  5:15     ` Neil Brown
  0 siblings, 0 replies; 50+ messages in thread
From: Neil Brown @ 2009-02-16  5:15 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-raid

On Friday February 13, davidsen@tmr.com wrote:
> NeilBrown wrote:
> > Code currently assumes that the devices in a raid6 stripe are
> >   0 1 ... N-1 P Q
> > in some rotated order.  We will shortly add new layouts in which
> > this strict pattern is broken.
> > So remove this expectation.  We still assume that the data disks
> > are roughly in-order.  However P and Q can be inserted anywhere within
> > that order.
> >   
> 
> Is this a change which could be done on the fly? Because if it is, 
> there's obviously a huge performance gain to be had in degraded mode.

You need to remember which stripes have been changed.

So the best you could do is start a raid6->raid5 conversion in the
event of a device failure.  That would be fairly intrusive while
running.  But when it has finished you get your performance back.

Then do a raid5->raid6 when you get a replacement drive.

I don't know that I would recommend this though.

NeilBrown

> 
> To clarify, if the order could be changed (raid5 example):
>     running array         0 1 2 3 4 P
>     failed device          0 1 F 3 4 P
> Leads to a recover of data from the failed device. But if the order 
> could be changed on the fly, then "recover" would be a one time 
> operation, recover the entire chunk previously on device 2, write to 
> device 5(P), change the order so the parity is on the failed device 
> (can't recover from another fail anyway), and:
>     remapped stripe    0 1 F 3 4 2
> 
> Now reads will not cost a recover, and writes hopefully could skip the 
> parity completely. When the failed device is replaced the rebuild could 
> restore the chunk order to improve leveling of head motion.
> 
> -- 
> Bill Davidsen <davidsen@tmr.com>
>   "Woe unto the statesman who makes war without a reason that will still
>   be valid when the war is over..." Otto von Bismark 
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-13 16:54         ` Bill Davidsen
@ 2009-02-16  5:35           ` Neil Brown
  2009-02-16 17:31             ` Nagilum
  0 siblings, 1 reply; 50+ messages in thread
From: Neil Brown @ 2009-02-16  5:35 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Julian Cowley, Keld Jørn Simonsen, linux-raid

On Friday February 13, davidsen@tmr.com wrote:
> Julian Cowley wrote:
> And in this case locking the bard door after the horse has left is 
> probably a path of least confusion.
> 
> > Perhaps instead the documentation in mdadm(8) and md(4) could be 
> > updated to mention that raid10 is a combination of the concepts in 
> > RAID 1 and RAID 0, but is generalized enough so that it can be done 
> > with just two drives at a minimum.  That would have caught my eye, at 
> > least.
> 
> Good idea.

Patches gladly accepted.


> 
> Ob. plug for raid5E: the advantages of raid5E are two-fold. The most 
> obvious is that head motion is spread over N+2 drives (N being number of 
> data drives) which improves performance quite a bit in the common small 
> business case of 4-5 drive setups. It also puts some use on each drive, 
> so you don't suddenly start using a drive which may have been spun down 
> for a month, may have developed issues since SMART was last run, etc.
> 

Are you thinking of raid5e, where all the spare space is at the end of
the devices, or raid5ee where it is more evenly distributed?

So raid5e is just a normal raid5 where you don't use all of the space.
When a failure happens, you reshape to n-1 drives, thus absorbing the
space.

raid5ee is much like raid6, but you don't read or write the Q block.
If you lose a drive, you rebuild it in the space were the Q block
lives. 

So would you just use raid6 normally and transition to a contorted
raid5 on device failure?  Or would you really want to leave those
blocks fallow?

I guess I could implement that by using 8 bits in the 'layout' number
to indicate which device in the array is 'failed', and run a reshape
pass that changes the layout, being careful not to re-write blocks
that haven't changed....
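
A minimal sketch of one possible encoding - entirely hypothetical, md
defines no such bits today and the names below are invented (0 meaning
"no failed device", otherwise device index + 1):

#include <stdio.h>

/* Bits 16-23 of 'layout' chosen purely for illustration. */
#define FAILED_SHIFT	16
#define FAILED_MASK	(0xff << FAILED_SHIFT)

static int layout_with_failed(int layout, int dev)
{
	return (layout & ~FAILED_MASK) | ((dev + 1) << FAILED_SHIFT);
}

static int layout_failed_dev(int layout)	/* -1 if none recorded */
{
	return ((layout & FAILED_MASK) >> FAILED_SHIFT) - 1;
}

int main(void)
{
	int layout = 2;				/* e.g. left-symmetric raid5 */
	int l2 = layout_with_failed(layout, 3);	/* device 3 has failed */

	printf("base layout %d, failed device %d\n",
	       l2 & ~FAILED_MASK, layout_failed_dev(l2));
	return 0;
}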

Not impossible, but I would much rather someone else wrote (and
tested) the code while I watched...

> While the distributed spare idea could be extended to raid6 and raid10, 
> the mapping gets complex. Since Neil is currently adding code to allow 
> for orders other than sequential in raid6, being able to quickly deploy 
> the spare on a once-per-stripe basis might at least get him to rethink 
> the concept.

I think raid6e is trivial and raid6ee would be quite straightforward.

For raid10, if you used a far=3 layout but only use the first two
copies, you would effectively have raid10e.
If you used a near=3 layout but only used 2 copies, you would have
something like a raid10ee, but if you have 3 or 6 drives, all the
spare space would be on the 1 (or 2) device(s).



NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
  2009-02-16  5:35           ` Neil Brown
@ 2009-02-16 17:31             ` Nagilum
  0 siblings, 0 replies; 50+ messages in thread
From: Nagilum @ 2009-02-16 17:31 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown


----- Message from neilb@suse.de ---------
     Date: Mon, 16 Feb 2009 16:35:52 +1100
     From: Neil Brown <neilb@suse.de>
  Subject: Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
       To: Bill Davidsen <davidsen@tmr.com>
       Cc: Julian Cowley <julian@lava.net>, Keld Jørn Simonsen  
<keld@dkuug.dk>, linux-raid@vger.kernel.org

>> Ob. plug for raid5E: the advantages of raid5E are two-fold. The most
>> obvious is that head motion is spread over N+2 drives (N being number of
>> data drives) which improves performance quite a bit in the common small
>> business case of 4-5 drive setups. It also puts some use on each drive,
>> so you don't suddenly start using a drive which may have been spun down
>> for a month, may have developed issues since SMART was last run, etc.
>>
>
> Are you thinking of raid5e, where all the spare space is at the end of
> the devices, or raid5ee where it is more evenly distributed?

raid5E I'd say.

> So raid5e is just a normal raid5 where you don't use all of the space.
> When a failure happens, you reshape to n-1 drives, thus absorbing the
> space.
>
> raid5ee is much like raid6, but you don't read or write the Q block.
> If you lose a drive, you rebuild it in the space were the Q block
> lives.
>
> So would you just use raid6 normally and transition to a contorted
> raid5 on device failure?  Or would you really want to leave those
> blocks fallow?

My understanding is that 5EE leaves those blocks empty. Doing real Q
blocks would entail too much overhead, but it reminds me of an idea I
had some time ago. I call it lazy-RAID6 ;)

Problem: You have enough disks to run RAID6 but you don't want to pay
the performance penalty* of RAID6.
The solution in those cases is usually RAID5+hotspare, but maybe we can
do better.
We could also use the hotspare to store the RAID6 Q blocks, but we only
calculate them (or more specifically read/write the stripe/block)
when the disks are idle. This of course means that the hotspare
will have a number of invalid blocks after each write operation, but
the majority of blocks will be up to date. (Use a bitmap to mark dirty
blocks and "clean up" when the disks are idle.)
The goal behind this is to have basically the same performance as
normal RAID5 but higher failure resilience. In my experience
hard disks often fail partially, so if you have a partial and a
complete disk failure, chances are you will be able to recover. Even
when two disks fail completely, the number of dirty blocks should
usually be pretty low, so we would be able to recover most of the data.
If there is a single disk failure we behave like a normal
raid5+(hot)spare, of course.
It is not intended as a replacement for normal RAID6, but it would give
most of your data about the same protection while maintaining the
speed of RAID5.

*) The main speed advantage of RAID5 vs. RAID6 comes from the fact  
that if you write one physical block**) in a RAID5 you only need to  
update***) one other additional physical block. If you write a  
physical block in a RAID6 you have to read the whole stripe and then  
write the RAID6 chunk of the stripe.
**) A RAID chunk consists of several physical blocks. Several chunks  
make up a stripe.
***) read+write

Ok, I hope no one can claim a patent on it now. ;)
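
Sketched as code, the dirty-bitmap part of the idea might look roughly
like this (conceptual only - nothing like it exists in md, and all the
names are invented):

#include <stdlib.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct lazy_q {
	unsigned long *dirty;		/* one bit per stripe */
	unsigned long  nr_stripes;
};

/* Fast path: a raid5-style write updates data + P only and just marks
 * the stripe's Q block on the spare as stale. */
static void lazyq_mark_dirty(struct lazy_q *lq, unsigned long stripe)
{
	lq->dirty[stripe / BITS_PER_WORD] |= 1UL << (stripe % BITS_PER_WORD);
}

/* Idle path: recompute Q for every dirty stripe and clear its bit. */
static void lazyq_idle_scrub(struct lazy_q *lq)
{
	unsigned long s;

	for (s = 0; s < lq->nr_stripes; s++) {
		unsigned long *w  = &lq->dirty[s / BITS_PER_WORD];
		unsigned long bit = 1UL << (s % BITS_PER_WORD);

		if (*w & bit) {
			/* ... read the stripe, compute Q, write it to
			 * the spare ... */
			*w &= ~bit;
		}
	}
}

int main(void)
{
	struct lazy_q lq;

	lq.nr_stripes = 1024;
	lq.dirty = calloc((lq.nr_stripes + BITS_PER_WORD - 1) / BITS_PER_WORD,
			  sizeof(unsigned long));
	lazyq_mark_dirty(&lq, 42);	/* a write dirties stripe 42's Q */
	lazyq_idle_scrub(&lq);		/* idle time brings Q up to date */
	free(lq.dirty);
	return 0;
}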
Alex.

========================================================================
#    _  __          _ __     http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__ ____ _(_) /_ ____ _  nagilum@nagilum.org \n +491776461165 #
#  /    / _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#           /___/     x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
========================================================================


----------------------------------------------------------------
cakebox.homeunix.net - all the machine one needs..

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
       [not found] <7554605.886551236670855947.JavaMail.coremail@bj163app40.163.com>
@ 2009-03-13  1:00 ` Neil Brown
  0 siblings, 0 replies; 50+ messages in thread
From: Neil Brown @ 2009-03-13  1:00 UTC (permalink / raw)
  To: linux-jay; +Cc: Dan Williams, Yuri Tikhonov, Wolfgang Denk, dzu, linux-raid


> Hi Neil:
> I got your latest patches (v2.6.29-rc5-335-g0ac4ee7) from git://neil.brown.name/md.
> I wanted to test the function that switches an array from raid5 to raid6, but I found a problem.
> I created a raid6 with 4 disks, wrote some files to it, and then failed two of the disks.
> If the raid6 were working correctly I should still be able to read the files I had written,
> but in fact I cannot, so there may be an error in the raid6 code in v2.6.29-rc5-335-g0ac4ee7.
> 
> 
> The steps I took are as follows:
> 1, mdadm -C /dev/md6 -l 6 -n 4 /dev/sda /dev/sdb /dev/sdc /dev/sdd --metadata=1.0 --size=1000000 -f
> 2, pvcreate /dev/md6
> 3, vgcreate md6vg /dev/md6
> 4, lvcreate -L 200M -n md6lv md6vg
> 5, mkfs.xfs /dev/mapper/md6vg-md6lv
> 6, mount /dev/mapper/md6vg-md6lv /tmp     (this works)
> 
> 7, mdadm -f /dev/md6 /dev/sda /dev/sdb  (fail two of the disks)
> 8, mount /dev/mapper/md6vg-md6lv /tmp   (now the filesystem information cannot be read
>    correctly; mount prints "mount: Structure needs cleaning")
> 
> Thanks Neil, I'm waiting for your help!

Thanks for reporting this.  There must be something broken in the
changes to raid6 to support hardware-offload of the calculations.

If you try my 'md-scratch' branch it might work better.  It has all
the code for raid level conversion, but none of the raid6 rework.

NeilBrown

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/18] Assorted md patches headed for 2.6.30
@ 2009-03-10  8:24 jzc-sina
  0 siblings, 0 replies; 50+ messages in thread
From: jzc-sina @ 2009-03-10  8:24 UTC (permalink / raw)
  To: linux-raid

Hi Neil:
I got your latest patches (v2.6.29-rc5-335-g0ac4ee7) from git://neil.brown.name/md.
I wanted to test the function that switches an array from raid5 to raid6, but I found a problem.
I created a raid6 with 4 disks, wrote some files to it, and then failed two of the disks.
If the raid6 were working correctly I should still be able to read the files I had written,
but in fact I cannot, so there may be an error in the raid6 code in v2.6.29-rc5-335-g0ac4ee7.


The steps I took are as follows:
1, mdadm -C /dev/md6 -l 6 -n 4 /dev/sda /dev/sdb /dev/sdc /dev/sdd --metadata=1.0 --size=1000000 -f
2, pvcreate /dev/md6
3, vgcreate md6vg /dev/md6
4, lvcreate -L 200M -n md6lv md6vg
5, mkfs.xfs /dev/mapper/md6vg-md6lv
6, mount /dev/mapper/md6vg-md6lv /tmp     (this works)

7, mdadm -f /dev/md6 /dev/sda /dev/sdb  (fail two of the disks)
8, mount /dev/mapper/md6vg-md6lv /tmp   (now the filesystem information cannot be read
   correctly; mount prints "mount: Structure needs cleaning")

Thanks Neil, I'm waiting for your help!

zhenchengjin 

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2009-03-13  1:00 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-12  3:10 [PATCH 00/18] Assorted md patches headed for 2.6.30 NeilBrown
2009-02-12  3:10 ` [PATCH 04/18] md: be more consistent about setting WriteMostly flag when adding a drive to an array NeilBrown
2009-02-12  3:10 ` [PATCH 05/18] md: Make mddev->size sector-based NeilBrown
2009-02-12  3:10 ` [PATCH 06/18] md: Represent raid device size in sectors NeilBrown
2009-02-12  3:10 ` [PATCH 07/18] md/raid5: simplify interface for init_stripe and get_active_stripe NeilBrown
2009-02-12  3:10 ` [PATCH 02/18] md: write bitmap information to devices that are undergoing recovery NeilBrown
2009-02-12  3:10 ` [PATCH 03/18] md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash NeilBrown
2009-02-12 17:26   ` John Stoffel
2009-02-13 16:20   ` Bill Davidsen
2009-02-13 16:34     ` Jon Nelson
2009-02-12  3:10 ` [PATCH 08/18] md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument NeilBrown
2009-02-12  3:10 ` [PATCH 01/18] md: never clear bit from the write-intent bitmap when the array is degraded NeilBrown
2009-02-12  3:10 ` [PATCH 09/18] md/raid6: remove expectation that Q device is immediately after P device NeilBrown
2009-02-12 16:56   ` Andre Noll
2009-02-13 22:19     ` Dan Williams
2009-02-16  0:08     ` Neil Brown
2009-02-13 16:37   ` Bill Davidsen
2009-02-16  5:15     ` Neil Brown
2009-02-12  3:10 ` [PATCH 17/18] md: add ->takeover method for raid5 to be able to take over raid1 NeilBrown
2009-02-12  3:10 ` [PATCH 13/18] md/raid5: refactor raid5 "run" NeilBrown
2009-02-12  3:10 ` [PATCH 11/18] md/raid5: Add support for new layouts for raid5 and raid6 NeilBrown
2009-02-12  3:10 ` [PATCH 14/18] md: md_unregister_thread should cope with being passed NULL NeilBrown
2009-02-12  3:10 ` [PATCH 12/18] md/raid5: finish support for DDF/raid6 NeilBrown
2009-02-12  3:10 ` [PATCH 10/18] md/raid5: simplify raid5_compute_sector interface NeilBrown
2009-02-12  3:10 ` [PATCH 16/18] md: add ->takeover method to support changing the personality managing an array NeilBrown
2009-02-12  3:10 ` [PATCH 18/18] md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5 NeilBrown
2009-02-12  3:10 ` [PATCH 15/18] md: hopefully enable suspend/resume of md devices NeilBrown
2009-02-12  8:11 ` [PATCH 00/18] Assorted md patches headed for 2.6.30 Keld Jørn Simonsen
2009-02-12  9:13   ` Steve Fairbairn
2009-02-12  9:46     ` Keld Jørn Simonsen
2009-02-12 10:52       ` NeilBrown
2009-02-12 11:16         ` Keld Jørn Simonsen
2009-02-12 10:53       ` Julian Cowley
2009-02-13 16:54         ` Bill Davidsen
2009-02-16  5:35           ` Neil Brown
2009-02-16 17:31             ` Nagilum
2009-02-12 22:57     ` Dan Williams
2009-02-13 16:56     ` Bill Davidsen
2009-02-12  9:21   ` NeilBrown
2009-02-12  9:53     ` Keld Jørn Simonsen
2009-02-12 10:45       ` NeilBrown
2009-02-12 11:11         ` Keld Jørn Simonsen
2009-02-12 15:28         ` Wil Reichert
2009-02-12 17:44           ` Keld Jørn Simonsen
2009-02-12  9:42 ` Farkas Levente
2009-02-12 10:40   ` NeilBrown
2009-02-12 11:17     ` Farkas Levente
2009-02-13 17:02       ` Bill Davidsen
2009-03-10  8:24 jzc-sina
     [not found] <7554605.886551236670855947.JavaMail.coremail@bj163app40.163.com>
2009-03-13  1:00 ` Neil Brown
