linux-kernel.vger.kernel.org archive mirror
* [PATCH 000 of 5] md: Introduction
@ 2006-01-17  6:56 NeilBrown
  2006-01-17  6:56 ` [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
                   ` (7 more replies)
  0 siblings, 8 replies; 71+ messages in thread
From: NeilBrown @ 2006-01-17  6:56 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Steinar H. Gunderson


Greetings.

In line with the principle of "release early", following are 5 patches
against md in 2.6.latest which implement reshaping of a raid5 array.
By this I mean adding 1 or more drives to the array and then re-laying
out all of the data.

This is still EXPERIMENTAL and could easily eat your data.  Don't use it on
valuable data.  Only use it for review and testing.

This release does not make ANY attempt to record how far the reshape
has progressed on stable storage.  That means that if the process is
interrupted either by a crash or by "mdadm -S", then you completely
lose your data.  All of it.
So don't use it on valuable data.

There are 5 patches to (hopefully) ease review.  Comments are most
welcome, as are test results (providing they aren't done on valuable data:-).

You will need to enable the experimental MD_RAID5_RESHAPE config option
for this to work.  Please read the help message that comes with it.
It gives an example mdadm command to effect a reshape (you do not need
a new mdadm; any vaguely recent version should work).

This code is based in part on earlier work by
  "Steinar H. Gunderson" <sgunderson@bigfoot.com>
Though little of his code remains, having access to it and having
discussed the issues with him greatly eased the process of creating
these patches.  Thanks Steinar.

NeilBrown


 [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
 [PATCH 002 of 5] md: Allow stripes to be expanded in preparation for expanding an array.
 [PATCH 003 of 5] md: Infrastructure to allow normal IO to continue while array is expanding.
 [PATCH 004 of 5] md: Core of raid5 resize process
 [PATCH 005 of 5] md: Final stages of raid5 expand code.

* [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
@ 2006-01-17  6:56 ` NeilBrown
  2006-01-17 14:37   ` John Stoffel
  2006-01-17  6:56 ` [PATCH 002 of 5] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2006-01-17  6:56 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Steinar H. Gunderson


Previously the array of disk information was included in the
raid5 'conf' structure which was allocated to an appropriate size.
This makes it awkward to change the size of that array.
So we split it off into a separate kmalloced array which will
require a little extra indexing, but is much easier to grow.
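
Roughly, as a userspace sketch (not the kernel code; calloc/free stand
in for kzalloc/kfree, and the structures are cut down to the fields
that matter here), the change looks like this:

#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct disk_info { void *rdev; /* stand-in for the real contents */ };

/* Before: the disk_info array was a zero-length trailing member, sized
 * when 'conf' itself was allocated, so growing it meant reallocating
 * and moving the whole conf structure.
 */
struct conf_before {
	int raid_disks;
	struct disk_info disks[0];
};

/* After: a separately allocated array, reached through one extra
 * pointer, which can later be swapped for a larger one.
 */
struct conf_after {
	int raid_disks;
	struct disk_info *disks;
};

static int grow_disks(struct conf_after *conf, int newsize)
{
	struct disk_info *nd = calloc(newsize, sizeof(*nd));

	if (!nd)
		return -ENOMEM;
	memcpy(nd, conf->disks, conf->raid_disks * sizeof(*nd));
	free(conf->disks);
	conf->disks = nd;
	return 0;
}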


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |   10 +++++++---
 ./drivers/md/raid6main.c     |   11 ++++++++---
 ./include/linux/raid/raid5.h |    2 +-
 3 files changed, 16 insertions(+), 7 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-01-17 17:33:05.000000000 +1100
+++ ./drivers/md/raid5.c	2006-01-17 17:33:09.000000000 +1100
@@ -1821,11 +1821,13 @@ static int run(mddev_t *mddev)
 		return -EIO;
 	}
 
-	mddev->private = kzalloc(sizeof (raid5_conf_t)
-				 + mddev->raid_disks * sizeof(struct disk_info),
-				 GFP_KERNEL);
+	mddev->private = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
+	conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+			      GFP_KERNEL);
+	if (!conf->disks)
+		goto abort;
 
 	conf->mddev = mddev;
 
@@ -1965,6 +1967,7 @@ static int run(mddev_t *mddev)
 abort:
 	if (conf) {
 		print_raid5_conf(conf);
+		kfree(conf->disks);
 		kfree(conf->stripe_hashtbl);
 		kfree(conf);
 	}
@@ -1985,6 +1988,7 @@ static int stop(mddev_t *mddev)
 	kfree(conf->stripe_hashtbl);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
 	sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
+	kfree(conf->disks);
 	kfree(conf);
 	mddev->private = NULL;
 	return 0;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2006-01-17 17:33:05.000000000 +1100
+++ ./drivers/md/raid6main.c	2006-01-17 17:33:09.000000000 +1100
@@ -1925,11 +1925,14 @@ static int run(mddev_t *mddev)
 		return -EIO;
 	}
 
-	mddev->private = kzalloc(sizeof (raid6_conf_t)
-				 + mddev->raid_disks * sizeof(struct disk_info),
-				 GFP_KERNEL);
+	mddev->private = kzalloc(sizeof (raid6_conf_t), GFP_KERNEL);
 	if ((conf = mddev->private) == NULL)
 		goto abort;
+	conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
+				 GFP_KERNEL);
+	if (!conf->disks)
+		goto abort;
+
 	conf->mddev = mddev;
 
 	if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
@@ -2077,6 +2080,7 @@ abort:
 		print_raid6_conf(conf);
 		safe_put_page(conf->spare_page);
 		kfree(conf->stripe_hashtbl);
+		kfree(conf->disks);
 		kfree(conf);
 	}
 	mddev->private = NULL;
@@ -2095,6 +2099,7 @@ static int stop (mddev_t *mddev)
 	shrink_stripes(conf);
 	kfree(conf->stripe_hashtbl);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
+	kfree(conf->disks);
 	kfree(conf);
 	mddev->private = NULL;
 	return 0;

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-01-17 17:33:05.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-01-17 17:33:09.000000000 +1100
@@ -240,7 +240,7 @@ struct raid5_private_data {
 							 * waiting for 25% to be free
 							 */        
 	spinlock_t		device_lock;
-	struct disk_info	disks[0];
+	struct disk_info	*disks;
 };
 
 typedef struct raid5_private_data raid5_conf_t;

* [PATCH 002 of 5] md: Allow stripes to be expanded in preparation for expanding an array.
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
  2006-01-17  6:56 ` [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
@ 2006-01-17  6:56 ` NeilBrown
  2006-01-17  6:56 ` [PATCH 003 of 5] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2006-01-17  6:56 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Steinar H. Gunderson


Before a RAID-5 can be expanded, we need to be able to expand the
stripe-cache data structure.  
This requires allocating new stripes in a new kmem_cache.
If this succeeds, we copy cache pages over and release the old
stripes and kmem_cache.
We then allocate new pages.  If that fails, we leave the stripe
cache at its new size; it isn't worth the effort to shrink it
back again.
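
Per stripe, the approach is roughly as follows (a userspace sketch only;
malloc/calloc stand in for the slab cache and the page allocator, and
the real code batches this over every stripe while holding the
appropriate locks):

#include <stdlib.h>

struct stripe {
	int ndisks;
	void **pages;
};

/* Replace 'osh' with a stripe able to hold 'newsize' devices.  The
 * larger object is allocated first and the existing pages are carried
 * across; only then do the brand-new slots get pages of their own.  If
 * one of those final allocations fails we still keep the larger stripe,
 * just with some empty slots - we never shrink back.
 */
static struct stripe *resize_one(struct stripe *osh, int newsize)
{
	struct stripe *nsh;
	int i;

	if (newsize <= osh->ndisks)
		return osh;			/* never bother to shrink */

	nsh = malloc(sizeof(*nsh));
	if (!nsh)
		return NULL;			/* old stripe left untouched */
	nsh->pages = calloc(newsize, sizeof(*nsh->pages));
	if (!nsh->pages) {
		free(nsh);
		return NULL;
	}
	nsh->ndisks = newsize;

	for (i = 0; i < osh->ndisks; i++)	/* reuse the existing pages */
		nsh->pages[i] = osh->pages[i];
	for (; i < newsize; i++)		/* may leave NULL slots behind */
		nsh->pages[i] = malloc(4096);

	free(osh->pages);
	free(osh);
	return nsh;
}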

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |  116 +++++++++++++++++++++++++++++++++++++++++--
 ./drivers/md/raid6main.c     |    4 -
 ./include/linux/raid/raid5.h |    9 ++-
 3 files changed, 121 insertions(+), 8 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-01-17 17:33:09.000000000 +1100
+++ ./drivers/md/raid5.c	2006-01-17 17:33:23.000000000 +1100
@@ -313,14 +313,16 @@ static int grow_stripes(raid5_conf_t *co
 	kmem_cache_t *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name, "raid5/%s", mdname(conf->mddev));
-
-	sc = kmem_cache_create(conf->cache_name, 
+	sprintf(conf->cache_name[0], "raid5/%s", mdname(conf->mddev));
+	sprintf(conf->cache_name[1], "raid5/%s-alt", mdname(conf->mddev));
+	conf->active_name = 0;
+	sc = kmem_cache_create(conf->cache_name[conf->active_name],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)
 		return 1;
 	conf->slab_cache = sc;
+	conf->pool_size = devs;
 	while (num--) {
 		if (!grow_one_stripe(conf))
 			return 1;
@@ -328,6 +330,112 @@ static int grow_stripes(raid5_conf_t *co
 	return 0;
 }
 
+static int resize_stripes(raid5_conf_t *conf, int newsize)
+{
+	/* make all the stripes able to hold 'newsize' devices.
+	 * New slots in each stripe get 'page' set to a new page.
+	 * We allocate all the new stripes first, then if that succeeds,
+	 * copy everything across.
+	 * Finally we add new pages.  This could fail, but we leave
+	 * the stripe cache at it's new size, just with some pages empty.
+	 */
+	struct stripe_head *osh, *nsh;
+	struct list_head newstripes, oldstripes;
+	struct disk_info *ndisks;
+	int err = 0;
+	kmem_cache_t *sc;
+	int i;
+
+	if (newsize <= conf->pool_size)
+		return 0; /* never bother to shrink */
+
+	sc = kmem_cache_create(conf->cache_name[1-conf->active_name],
+			       sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev),
+			       0, 0, NULL, NULL);
+	if (!sc)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&newstripes);
+	for (i = conf->max_nr_stripes; i; i--) {
+		nsh = kmem_cache_alloc(sc, GFP_KERNEL);
+		if (!nsh)
+			break;
+
+		memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev));
+
+		nsh->raid_conf = conf;
+		spin_lock_init(&nsh->lock);
+
+		list_add(&nsh->lru, &newstripes);
+	}
+	if (i) {
+		/* didn't get enough, give up */
+		while (!list_empty(&newstripes)) {
+			nsh = list_entry(newstripes.next, struct stripe_head, lru);
+			list_del(&nsh->lru);
+			kmem_cache_free(sc, nsh);
+		}
+		kmem_cache_destroy(sc);
+		return -ENOMEM;
+	}
+	/* OK, we have enough stripes, start collecting inactive
+	 * stripes and copying them over
+	 */
+	INIT_LIST_HEAD(&oldstripes);
+	list_for_each_entry(nsh, &newstripes, lru) {
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_stripe,
+				    !list_empty(&conf->inactive_list),
+				    conf->device_lock,
+				    unplug_slaves(conf->mddev);
+			);
+		osh = get_free_stripe(conf);
+		spin_unlock_irq(&conf->device_lock);
+		atomic_set(&nsh->count, 1);
+		for(i=0; i<conf->pool_size; i++)
+			nsh->dev[i].page = osh->dev[i].page;
+		for( ; i<newsize; i++)
+			nsh->dev[i].page = NULL;
+		list_add(&osh->lru, &oldstripes);
+	}
+	/* Got them all.
+	 * Return the new ones and free the old ones.
+	 * At this point, we are holding all the stripes so the array
+	 * is completely stalled, so now is a good time to resize
+	 * conf->disks.
+	 */
+	ndisks = kzalloc(newsize * sizeof(struct disk_info), GFP_KERNEL);
+	if (ndisks) {
+		for (i=0; i<conf->raid_disks; i++)
+			ndisks[i] = conf->disks[i];
+		kfree(conf->disks);
+		conf->disks = ndisks;
+	} else
+		err = -ENOMEM;
+	while(!list_empty(&newstripes)) {
+		nsh = list_entry(newstripes.next, struct stripe_head, lru);
+		list_del_init(&nsh->lru);
+		for (i=conf->raid_disks; i < newsize; i++)
+			if (nsh->dev[i].page == NULL) {
+				struct page *p = alloc_page(GFP_KERNEL);
+				nsh->dev[i].page = p;
+				if (!p)
+					err = -ENOMEM;
+			}
+		release_stripe(nsh);
+	}
+	while(!list_empty(&oldstripes)) {
+		osh = list_entry(oldstripes.next, struct stripe_head, lru);
+		list_del(&osh->lru);
+		kmem_cache_free(conf->slab_cache, osh);
+	}
+	kmem_cache_destroy(conf->slab_cache);
+	conf->slab_cache = sc;
+	conf->active_name = 1-conf->active_name;
+	conf->pool_size = newsize;
+	return err;
+}
+
+
 static int drop_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
@@ -339,7 +447,7 @@ static int drop_one_stripe(raid5_conf_t 
 		return 0;
 	if (atomic_read(&sh->count))
 		BUG();
-	shrink_buffers(sh, conf->raid_disks);
+	shrink_buffers(sh, conf->pool_size);
 	kmem_cache_free(conf->slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
 	return 1;

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2006-01-17 17:33:09.000000000 +1100
+++ ./drivers/md/raid6main.c	2006-01-17 17:33:23.000000000 +1100
@@ -308,9 +308,9 @@ static int grow_stripes(raid6_conf_t *co
 	kmem_cache_t *sc;
 	int devs = conf->raid_disks;
 
-	sprintf(conf->cache_name, "raid6/%s", mdname(conf->mddev));
+	sprintf(conf->cache_name[0], "raid6/%s", mdname(conf->mddev));
 
-	sc = kmem_cache_create(conf->cache_name,
+	sc = kmem_cache_create(conf->cache_name[0],
 			       sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev),
 			       0, 0, NULL, NULL);
 	if (!sc)

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-01-17 17:33:09.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-01-17 17:33:23.000000000 +1100
@@ -216,7 +216,11 @@ struct raid5_private_data {
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */
 	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 
-	char			cache_name[20];
+	/* unfortunately we need two cache names as we temporarily have
+	 * two caches.
+	 */
+	int			active_name;
+	char			cache_name[2][20];
 	kmem_cache_t		*slab_cache; /* for allocating stripes */
 
 	int			seq_flush, seq_write;
@@ -238,7 +242,8 @@ struct raid5_private_data {
 	wait_queue_head_t	wait_for_overlap;
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
-							 */        
+							 */
+	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
 	struct disk_info	*disks;
 };

* [PATCH 003 of 5] md: Infrastructure to allow normal IO to continue while array is expanding.
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
  2006-01-17  6:56 ` [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
  2006-01-17  6:56 ` [PATCH 002 of 5] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
@ 2006-01-17  6:56 ` NeilBrown
  2006-01-17  6:56 ` [PATCH 004 of 5] md: Core of raid5 resize process NeilBrown
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2006-01-17  6:56 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Steinar H. Gunderson


We need to allow for different stripes having different effective
sizes, and to use the appropriate size in each case.
Also, when a stripe is being expanded, we must block any IO attempts
until the stripe is stable again.

Key elements in this change are:
 - each stripe_head gets a 'disks' field which is part of the hash key,
   so there can sometimes be two stripe_heads for the same area of the
   array, covering different numbers of devices.  One of these will be
   marked STRIPE_EXPANDING and so won't accept new requests.
 - conf->expand_progress tracks how the expansion is progressing and
   is used to determine whether the target part of the array has been
   expanded yet or not.
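
The second of these boils down to a per-request choice of geometry.
Stripped of locking it is roughly the following (a sketch only;
MAX_SECTOR stands in for the kernel's MaxSector marker):

#include <stdint.h>

#define MAX_SECTOR	(~(uint64_t)0)	/* "no expansion in progress" */

struct geometry {
	int raid_disks;			/* new number of devices */
	int previous_raid_disks;	/* old number of devices */
	uint64_t expand_progress;	/* MAX_SECTOR when idle */
};

/* Requests for data the reshape has not reached yet must still be
 * mapped with the old geometry; everything already behind
 * expand_progress (or everything, when no reshape is running) uses
 * the new one.
 */
static int disks_for_sector(const struct geometry *g, uint64_t logical_sector)
{
	if (g->expand_progress == MAX_SECTOR)
		return g->raid_disks;
	if (logical_sector >= g->expand_progress)
		return g->previous_raid_disks;
	return g->raid_disks;
}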


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |   71 ++++++++++++++++++++++++-------------------
 ./include/linux/raid/raid5.h |    6 +++
 2 files changed, 47 insertions(+), 30 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-01-17 17:33:23.000000000 +1100
+++ ./drivers/md/raid5.c	2006-01-17 17:35:36.000000000 +1100
@@ -178,10 +178,10 @@ static int grow_buffers(struct stripe_he
 
 static void raid5_build_block (struct stripe_head *sh, int i);
 
-static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx)
+static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int i;
 
 	if (atomic_read(&sh->count) != 0)
 		BUG();
@@ -198,7 +198,9 @@ static void init_stripe(struct stripe_he
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
 
-	for (i=disks; i--; ) {
+	sh->disks = disks;
+
+	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
 		if (dev->toread || dev->towrite || dev->written ||
@@ -215,7 +217,7 @@ static void init_stripe(struct stripe_he
 	insert_hash(conf, sh);
 }
 
-static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector)
+static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, int disks)
 {
 	struct stripe_head *sh;
 	struct hlist_node *hn;
@@ -223,7 +225,7 @@ static struct stripe_head *__find_stripe
 	CHECK_DEVLOCK();
 	PRINTK("__find_stripe, sector %llu\n", (unsigned long long)sector);
 	hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash)
-		if (sh->sector == sector)
+		if (sh->sector == sector && sh->disks == disks)
 			return sh;
 	PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector);
 	return NULL;
@@ -232,8 +234,8 @@ static struct stripe_head *__find_stripe
 static void unplug_slaves(mddev_t *mddev);
 static void raid5_unplug_device(request_queue_t *q);
 
-static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector,
-					     int pd_idx, int noblock) 
+static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks,
+					     int pd_idx, int noblock)
 {
 	struct stripe_head *sh;
 
@@ -245,7 +247,7 @@ static struct stripe_head *get_active_st
 		wait_event_lock_irq(conf->wait_for_stripe,
 				    conf->quiesce == 0,
 				    conf->device_lock, /* nothing */);
-		sh = __find_stripe(conf, sector);
+		sh = __find_stripe(conf, sector, disks);
 		if (!sh) {
 			if (!conf->inactive_blocked)
 				sh = get_free_stripe(conf);
@@ -263,7 +265,7 @@ static struct stripe_head *get_active_st
 					);
 				conf->inactive_blocked = 0;
 			} else
-				init_stripe(sh, sector, pd_idx);
+				init_stripe(sh, sector, pd_idx, disks);
 		} else {
 			if (atomic_read(&sh->count)) {
 				if (!list_empty(&sh->lru))
@@ -300,6 +302,7 @@ static int grow_one_stripe(raid5_conf_t 
 		kmem_cache_free(conf->slab_cache, sh);
 		return 0;
 	}
+	sh->disks = conf->raid_disks;
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
@@ -467,7 +470,7 @@ static int raid5_end_read_request(struct
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
 	if (bi->bi_size)
@@ -565,7 +568,7 @@ static int raid5_end_write_request (stru
 {
  	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks, i;
+	int disks = sh->disks, i;
 	unsigned long flags;
 	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
 
@@ -719,7 +722,7 @@ static sector_t raid5_compute_sector(sec
 static sector_t compute_blocknr(struct stripe_head *sh, int i)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int raid_disks = conf->raid_disks, data_disks = raid_disks - 1;
+	int raid_disks = sh->disks, data_disks = raid_disks - 1;
 	sector_t new_sector = sh->sector, check;
 	int sectors_per_chunk = conf->chunk_size >> 9;
 	sector_t stripe;
@@ -820,8 +823,7 @@ static void copy_data(int frombio, struc
 
 static void compute_block(struct stripe_head *sh, int dd_idx)
 {
-	raid5_conf_t *conf = sh->raid_conf;
-	int i, count, disks = conf->raid_disks;
+	int i, count, disks = sh->disks;
 	void *ptr[MAX_XOR_BLOCKS], *p;
 
 	PRINTK("compute_block, stripe %llu, idx %d\n", 
@@ -851,7 +853,7 @@ static void compute_block(struct stripe_
 static void compute_parity(struct stripe_head *sh, int method)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int i, pd_idx = sh->pd_idx, disks = conf->raid_disks, count;
+	int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
 	void *ptr[MAX_XOR_BLOCKS];
 	struct bio *chosen;
 
@@ -1039,7 +1041,7 @@ static int add_stripe_bio(struct stripe_
 static void handle_stripe(struct stripe_head *sh)
 {
 	raid5_conf_t *conf = sh->raid_conf;
-	int disks = conf->raid_disks;
+	int disks = sh->disks;
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
@@ -1633,12 +1635,10 @@ static inline void raid5_plug_device(rai
 	spin_unlock_irq(&conf->device_lock);
 }
 
-static int make_request (request_queue_t *q, struct bio * bi)
+static int make_request(request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
-	const unsigned int raid_disks = conf->raid_disks;
-	const unsigned int data_disks = raid_disks - 1;
 	unsigned int dd_idx, pd_idx;
 	sector_t new_sector;
 	sector_t logical_sector, last_sector;
@@ -1662,20 +1662,31 @@ static int make_request (request_queue_t
 
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
+		int disks;
 		
-		new_sector = raid5_compute_sector(logical_sector,
-						  raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-
+	retry:
+		if (conf->expand_progress == MaxSector)
+			disks = conf->raid_disks;
+		else {
+			spin_lock_irq(&conf->device_lock);
+			disks = conf->raid_disks;
+			if (logical_sector >= conf->expand_progress)
+				disks = conf->previous_raid_disks;
+			spin_unlock_irq(&conf->device_lock);
+		}
+ 		new_sector = raid5_compute_sector(logical_sector, disks, disks - 1,
+						  &dd_idx, &pd_idx, conf);
 		PRINTK("raid5: make_request, sector %llu logical %llu\n",
 			(unsigned long long)new_sector, 
 			(unsigned long long)logical_sector);
 
-	retry:
 		prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-		sh = get_active_stripe(conf, new_sector, pd_idx, (bi->bi_rw&RWA_MASK));
+		sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK));
 		if (sh) {
-			if (!add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
-				/* Add failed due to overlap.  Flush everything
+			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+			    !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) {
+				/* Stripe is busy expanding or
+				 * add failed due to overlap.  Flush everything
 				 * and wait a while
 				 */
 				raid5_unplug_device(mddev->queue);
@@ -1687,7 +1698,6 @@ static int make_request (request_queue_t
 			raid5_plug_device(conf);
 			handle_stripe(sh);
 			release_stripe(sh);
-
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -1763,9 +1773,9 @@ static sector_t sync_request(mddev_t *md
 
 	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
 		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
-	sh = get_active_stripe(conf, sector_nr, pd_idx, 1);
+	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
 	if (sh == NULL) {
-		sh = get_active_stripe(conf, sector_nr, pd_idx, 0);
+		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);
 		/* make sure we don't swamp the stripe cache if someone else
 		 * is trying to get access 
 		 */
@@ -1982,6 +1992,7 @@ static int run(mddev_t *mddev)
 	conf->level = mddev->level;
 	conf->algorithm = mddev->layout;
 	conf->max_nr_stripes = NR_STRIPES;
+	conf->expand_progress = MaxSector;
 
 	/* device size must be a multiple of chunk size */
 	mddev->size &= ~(mddev->chunk_size/1024 -1);
@@ -2112,7 +2123,7 @@ static void print_sh (struct stripe_head
 	printk("sh %llu,  count %d.\n",
 		(unsigned long long)sh->sector, atomic_read(&sh->count));
 	printk("sh %llu, ", (unsigned long long)sh->sector);
-	for (i = 0; i < sh->raid_conf->raid_disks; i++) {
+	for (i = 0; i < sh->disks; i++) {
 		printk("(cache%d: %p %ld) ", 
 			i, sh->dev[i].page, sh->dev[i].flags);
 	}

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-01-17 17:33:23.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-01-17 17:35:36.000000000 +1100
@@ -135,6 +135,7 @@ struct stripe_head {
 	atomic_t		count;			/* nr of active thread/requests */
 	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
+	int			disks;			/* disks in stripe */
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -174,6 +175,7 @@ struct stripe_head {
 #define	STRIPE_DELAYED		6
 #define	STRIPE_DEGRADED		7
 #define	STRIPE_BIT_DELAY	8
+#define	STRIPE_EXPANDING	9
 
 /*
  * Plugging:
@@ -211,6 +213,10 @@ struct raid5_private_data {
 	int			raid_disks, working_disks, failed_disks;
 	int			max_nr_stripes;
 
+	/* used during an expand */
+	sector_t		expand_progress;	/* MaxSector when no expand happening */
+	int			previous_raid_disks;
+
 	struct list_head	handle_list; /* stripes needing handling */
 	struct list_head	delayed_list; /* stripes that have plugged requests */
 	struct list_head	bitmap_list; /* stripes delaying awaiting bitmap update */

* [PATCH 004 of 5] md: Core of raid5 resize process
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
                   ` (2 preceding siblings ...)
  2006-01-17  6:56 ` [PATCH 003 of 5] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
@ 2006-01-17  6:56 ` NeilBrown
  2006-01-17  6:56 ` [PATCH 005 of 5] md: Final stages of raid5 expand code NeilBrown
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 71+ messages in thread
From: NeilBrown @ 2006-01-17  6:56 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Steinar H. Gunderson


This patch provides the core of the resize/expand process.

sync_request notices if a 'reshape' is happening and acts accordingly.

It allocates new stripe_heads for the next chunk-wide stripe in the
target geometry, marking them STRIPE_EXPANDING.
Then it finds which stripe_heads in the old geometry can provide data
needed by these and marks them STRIPE_EXPAND_SOURCE.  This causes
handle_stripe to read all blocks on those stripes.
Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head have been read,
any that are needed are copied into the corresponding STRIPE_EXPANDING
stripe_head.
Once a STRIPE_EXPANDING stripe_head is full, it is marked
STRIPE_EXPAND_READY and then written out and released.
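
One piece of bookkeeping worth spelling out: sector_nr counts sectors
per device in the new geometry, while conf->expand_progress counts
sectors of array data, so after each chunk the progress advances
roughly as below (a sketch of the arithmetic used in the sync_request
hunk that follows):

/* After the destination stripes for the chunk starting at per-device
 * offset 'sector_nr' have been set up, the reshape has laid claim to
 * every array-data sector before this point, so make_request maps new
 * requests for them with the new geometry.
 */
static unsigned long long expand_progress_after(unsigned long long sector_nr,
						unsigned int chunk_sectors,
						int new_raid_disks)
{
	return (sector_nr + chunk_sectors) * (new_raid_disks - 1);
}

With a 64KiB chunk (128 sectors) and a grow to 5 devices, for example,
finishing the first chunk leaves expand_progress at 128 * 4 = 512
sectors of array data.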


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid5.c         |  189 +++++++++++++++++++++++++++++++++++++------
 ./include/linux/raid/md_k.h  |    4 
 ./include/linux/raid/raid5.h |    4 
 3 files changed, 173 insertions(+), 24 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-01-17 17:35:36.000000000 +1100
+++ ./drivers/md/raid5.c	2006-01-17 17:38:47.000000000 +1100
@@ -93,11 +93,13 @@ static void __release_stripe(raid5_conf_
 				if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
 					md_wakeup_thread(conf->mddev->thread);
 			}
-			list_add_tail(&sh->lru, &conf->inactive_list);
 			atomic_dec(&conf->active_stripes);
-			if (!conf->inactive_blocked ||
-			    atomic_read(&conf->active_stripes) < (conf->max_nr_stripes*3/4))
-				wake_up(&conf->wait_for_stripe);
+			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+				list_add_tail(&sh->lru, &conf->inactive_list);
+				if (!conf->inactive_blocked ||
+				    atomic_read(&conf->active_stripes) < (conf->max_nr_stripes*3/4))
+					wake_up(&conf->wait_for_stripe);
+			}
 		}
 	}
 }
@@ -273,9 +275,8 @@ static struct stripe_head *get_active_st
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
-				if (list_empty(&sh->lru))
-					BUG();
-				list_del_init(&sh->lru);
+				if (!list_empty(&sh->lru))
+					list_del_init(&sh->lru);
 			}
 		}
 	} while (sh == NULL);
@@ -1019,6 +1020,18 @@ static int add_stripe_bio(struct stripe_
 	return 0;
 }
 
+int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
+{
+	int sectors_per_chunk = conf->chunk_size >> 9;
+	sector_t x = stripe;
+	int pd_idx, dd_idx;
+	int chunk_offset = sector_div(x, sectors_per_chunk);
+	stripe = x;
+	raid5_compute_sector(stripe*(disks-1)*sectors_per_chunk
+			     + chunk_offset, disks, disks-1, &dd_idx, &pd_idx, conf);
+	return pd_idx;
+}
+
 
 /*
  * handle_stripe - do things to a stripe.
@@ -1045,7 +1058,7 @@ static void handle_stripe(struct stripe_
 	struct bio *return_bi= NULL;
 	struct bio *bi;
 	int i;
-	int syncing;
+	int syncing, expanding, expanded;
 	int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
 	int non_overwrite = 0;
 	int failed_num=0;
@@ -1060,6 +1073,8 @@ static void handle_stripe(struct stripe_
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	syncing = test_bit(STRIPE_SYNCING, &sh->state);
+	expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+	expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
 	rcu_read_lock();
@@ -1252,13 +1267,14 @@ static void handle_stripe(struct stripe_
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
 	 */
-	if (to_read || non_overwrite || (syncing && (uptodate < disks))) {
+	if (to_read || non_overwrite || (syncing && (uptodate < disks)) || expanding) {
 		for (i=disks; i--;) {
 			dev = &sh->dev[i];
 			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
 			    (dev->toread ||
 			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
 			     syncing ||
+			     expanding ||
 			     (failed && (sh->dev[failed_num].toread ||
 					 (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
 				    )
@@ -1448,13 +1464,76 @@ static void handle_stripe(struct stripe_
 			set_bit(R5_Wantwrite, &dev->flags);
 			set_bit(R5_ReWrite, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
+			locked++;
 		} else {
 			/* let's read it back */
 			set_bit(R5_Wantread, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
+			locked++;
 		}
 	}
 
+	if (expanded) {
+		/* Need to write out all blocks after computing parity */
+		sh->disks = conf->raid_disks;
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
+		compute_parity(sh, RECONSTRUCT_WRITE);
+		for (i= conf->raid_disks; i--;) {
+			set_bit(R5_LOCKED, &sh->dev[i].flags);
+			locked++;
+			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+		}
+		clear_bit(STRIPE_EXPAND_READY, &sh->state);
+		clear_bit(STRIPE_EXPANDING, &sh->state);
+		wake_up(&conf->wait_for_overlap);
+		/* FIXME this shouldn't be called until the writes complete */
+		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
+	}
+
+	if (expanding && locked == 0) {
+		/* We have read all the blocks in this stripe and now we need to
+		 * copy some of them into a target stripe for expand.
+		 */
+		clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+		for (i=0; i< sh->disks; i++)
+			if (i != sh->pd_idx) {
+				int dd_idx, pd_idx, j;
+				struct stripe_head *sh2;
+
+				sector_t bn = compute_blocknr(sh, i);
+				sector_t s = raid5_compute_sector(bn, conf->raid_disks,
+								  conf->raid_disks-1,
+								  &dd_idx, &pd_idx, conf);
+				sh2 = get_active_stripe(conf, s, conf->raid_disks, pd_idx, 1);
+				if (sh2 == NULL)
+					/* so far only the early blocks of this stripe
+					 * have been requested.  When later blocks
+					 * get requested, we will try again
+					 */
+					continue;
+				if(!test_bit(STRIPE_EXPANDING, &sh2->state) ||
+				   test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) {
+					/* must have already done this block */
+					release_stripe(sh2);
+					continue;
+				}
+				memcpy(page_address(sh2->dev[dd_idx].page),
+				       page_address(sh->dev[i].page),
+				       STRIPE_SIZE);
+				set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
+				set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
+				for (j=0; j<conf->raid_disks; j++)
+					if (j != sh2->pd_idx &&
+					    !test_bit(R5_Expanded, &sh2->dev[j].flags))
+						break;
+				if (j == conf->raid_disks) {
+					set_bit(STRIPE_EXPAND_READY, &sh2->state);
+					set_bit(STRIPE_HANDLE, &sh2->state);
+				}
+				release_stripe(sh2);
+			}
+	}
+
 	spin_unlock(&sh->lock);
 
 	while ((bi=return_bi)) {
@@ -1493,7 +1572,7 @@ static void handle_stripe(struct stripe_
 		rcu_read_unlock();
  
 		if (rdev) {
-			if (syncing)
+			if (syncing || expanding || expanded)
 				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
 
 			bi->bi_bdev = rdev->bdev;
@@ -1724,12 +1803,8 @@ static sector_t sync_request(mddev_t *md
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
-	int sectors_per_chunk = conf->chunk_size >> 9;
-	sector_t x;
-	unsigned long stripe;
-	int chunk_offset;
-	int dd_idx, pd_idx;
-	sector_t first_sector;
+	int pd_idx;
+	sector_t first_sector, last_sector;
 	int raid_disks = conf->raid_disks;
 	int data_disks = raid_disks-1;
 	sector_t max_sector = mddev->size << 1;
@@ -1748,6 +1823,80 @@ static sector_t sync_request(mddev_t *md
 
 		return 0;
 	}
+
+	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+		/* reshaping is quite different to recovery/resync so it is
+		 * handled quite separately ... here.
+		 *
+		 * On each call to sync_request, we gather one chunk worth of
+		 * destination stripes and flag them as expanding.
+		 * Then we find all the source stripes and request reads.
+		 * As the reads complete, handle_stripe will copy the data
+		 * into the destination stripe and release that stripe.
+		 */
+		int i;
+		int dd_idx;
+		for (i=0; i < conf->chunk_size/512; i+= STRIPE_SECTORS) {
+			int j;
+			int skipped = 0;
+			pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks);
+			sh = get_active_stripe(conf, sector_nr+i,
+					       conf->raid_disks, pd_idx, 0);
+			set_bit(STRIPE_EXPANDING, &sh->state);
+			/* If any of this stripe is beyond the end of the old
+			 * array, then we need to zero those blocks
+			 */
+			for (j=sh->disks; j--;) {
+				sector_t s;
+				if (j == sh->pd_idx)
+					continue;
+				s = compute_blocknr(sh, j);
+				if (s < (mddev->array_size<<1)) {
+					skipped = 1;
+					continue;
+				}
+				memset(page_address(sh->dev[j].page), 0, STRIPE_SIZE);
+				set_bit(R5_Expanded, &sh->dev[j].flags);
+				set_bit(R5_UPTODATE, &sh->dev[j].flags);
+			}
+			if (!skipped) {
+				set_bit(STRIPE_EXPAND_READY, &sh->state);
+				set_bit(STRIPE_HANDLE, &sh->state);
+			}
+			release_stripe(sh);
+		}
+		spin_lock_irq(&conf->device_lock);
+		conf->expand_progress = (sector_nr + i)*(conf->raid_disks-1);
+		spin_unlock_irq(&conf->device_lock);
+		/* Ok, those stripe are ready. We can start scheduling
+		 * reads on the source stripes.
+		 * The source stripes are determined by mapping the first and last
+		 * block on the destination stripes.
+		 */
+		raid_disks = conf->previous_raid_disks;
+		data_disks = raid_disks - 1;
+		first_sector =
+			raid5_compute_sector(sector_nr*(conf->raid_disks-1),
+					     raid_disks, data_disks,
+					     &dd_idx, &pd_idx, conf);
+		last_sector =
+			raid5_compute_sector((sector_nr+conf->chunk_size/512)
+					       *(conf->raid_disks-1) -1,
+					     raid_disks, data_disks,
+					     &dd_idx, &pd_idx, conf);
+		if (last_sector >= (mddev->size<<1))
+			last_sector = (mddev->size<<1)-1;
+		while (first_sector <= last_sector) {
+			pd_idx = stripe_to_pdidx(first_sector, conf, conf->previous_raid_disks);
+			sh = get_active_stripe(conf, first_sector,
+					       conf->previous_raid_disks, pd_idx, 0);
+			set_bit(STRIPE_EXPAND_SOURCE, &sh->state);
+			set_bit(STRIPE_HANDLE, &sh->state);
+			release_stripe(sh);
+			first_sector += STRIPE_SECTORS;
+		}
+		return conf->chunk_size>>9;
+	}
 	/* if there is 1 or more failed drives and we are trying
 	 * to resync, then assert that we are finished, because there is
 	 * nothing we can do.
@@ -1766,13 +1915,7 @@ static sector_t sync_request(mddev_t *md
 		return sync_blocks * STRIPE_SECTORS; /* keep things rounded to whole stripes */
 	}
 
-	x = sector_nr;
-	chunk_offset = sector_div(x, sectors_per_chunk);
-	stripe = x;
-	BUG_ON(x != stripe);
-
-	first_sector = raid5_compute_sector((sector_t)stripe*data_disks*sectors_per_chunk
-		+ chunk_offset, raid_disks, data_disks, &dd_idx, &pd_idx, conf);
+	pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
 	sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1);
 	if (sh == NULL) {
 		sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0);

diff ./include/linux/raid/md_k.h~current~ ./include/linux/raid/md_k.h
--- ./include/linux/raid/md_k.h~current~	2006-01-17 17:33:05.000000000 +1100
+++ ./include/linux/raid/md_k.h	2006-01-17 17:38:47.000000000 +1100
@@ -157,6 +157,9 @@ struct mddev_s
 	 * DONE:     thread is done and is waiting to be reaped
 	 * REQUEST:  user-space has requested a sync (used with SYNC)
 	 * CHECK:    user-space request for for check-only, no repair
+	 * RESHAPE:  A reshape is happening
+	 *
+	 * If neither SYNC or RESHAPE are set, then it is a recovery.
 	 */
 #define	MD_RECOVERY_RUNNING	0
 #define	MD_RECOVERY_SYNC	1
@@ -166,6 +169,7 @@ struct mddev_s
 #define	MD_RECOVERY_NEEDED	5
 #define	MD_RECOVERY_REQUESTED	6
 #define	MD_RECOVERY_CHECK	7
+#define MD_RECOVERY_RESHAPE	8
 	unsigned long			recovery;
 
 	int				in_sync;	/* know to not need resync */

diff ./include/linux/raid/raid5.h~current~ ./include/linux/raid/raid5.h
--- ./include/linux/raid/raid5.h~current~	2006-01-17 17:35:36.000000000 +1100
+++ ./include/linux/raid/raid5.h	2006-01-17 17:38:47.000000000 +1100
@@ -157,6 +157,7 @@ struct stripe_head {
 #define	R5_ReadError	8	/* seen a read error here recently */
 #define	R5_ReWrite	9	/* have tried to over-write the readerror */
 
+#define	R5_Expanded	10	/* This block now has post-expand data */
 /*
  * Write method
  */
@@ -176,7 +177,8 @@ struct stripe_head {
 #define	STRIPE_DEGRADED		7
 #define	STRIPE_BIT_DELAY	8
 #define	STRIPE_EXPANDING	9
-
+#define	STRIPE_EXPAND_SOURCE	10
+#define	STRIPE_EXPAND_READY	11
 /*
  * Plugging:
  *

* [PATCH 005 of 5] md: Final stages of raid5 expand code.
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
                   ` (3 preceding siblings ...)
  2006-01-17  6:56 ` [PATCH 004 of 5] md: Core of raid5 resize process NeilBrown
@ 2006-01-17  6:56 ` NeilBrown
  2006-01-17  9:55   ` Sander
  2006-01-17  8:17 ` [PATCH 000 of 5] md: Introduction Michael Tokarev
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 71+ messages in thread
From: NeilBrown @ 2006-01-17  6:56 UTC (permalink / raw)
  To: linux-raid, linux-kernel; +Cc: Steinar H. Gunderson


This patch adds raid5_reshape and end_reshape, which start and
finish the reshape process respectively.

raid5_reshape is only enabled if CONFIG_MD_RAID5_RESHAPE is set,
to discourage accidental use.

Don't use this on valuable data.  Read the 'help' for the
CONFIG_MD_RAID5_RESHAPE entry, and make sure to avoid use on
valuable data.
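
For reviewers, the preconditions raid5_reshape enforces before starting
the reshape thread amount to roughly the following (a sketch of the
checks only, not the function itself):

#include <errno.h>

static int reshape_allowed(int degraded, int recovery_running,
			   int cur_disks, int new_disks, int spares)
{
	if (degraded || recovery_running)
		return -EBUSY;	/* array must be healthy and idle */
	if (new_disks < cur_disks)
		return -EINVAL;	/* cannot shrink an array yet */
	if (new_disks == cur_disks)
		return 0;	/* nothing to do */
	if (cur_disks + spares < new_disks - 1)
		return -EINVAL;	/* not enough devices even for a degraded array */
	return 0;
}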

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/Kconfig      |   24 +++++++++++
 ./drivers/md/md.c         |    7 +--
 ./drivers/md/raid5.c      |  100 ++++++++++++++++++++++++++++++++++++++++++++++
 ./include/linux/raid/md.h |    3 -
 4 files changed, 130 insertions(+), 4 deletions(-)

diff ./drivers/md/Kconfig~current~ ./drivers/md/Kconfig
--- ./drivers/md/Kconfig~current~	2006-01-17 17:45:08.000000000 +1100
+++ ./drivers/md/Kconfig	2006-01-17 17:42:31.000000000 +1100
@@ -127,6 +127,30 @@ config MD_RAID5
 
 	  If unsure, say Y.
 
+config MD_RAID5_RESHAPE
+	bool "Support adding drives to a raid-5 array (highly experimental)"
+	depends on MD_RAID5 && EXPERIMENTAL
+	---help---
+	  A RAID-5 set can be expanded by adding extra drives. This
+	  requires "restriping" the array which means (almost) every
+	  block must be written to a different place.
+
+          This option allows this restiping to be done while the array
+	  is online.  However it is VERY EARLY EXPERIMENTAL code.
+	  In particular, if anything goes wrong while the restriping
+	  is happening, such as a power failure or a crash, all the
+	  data on the array will be LOST beyond any reasonable hope
+	  of recovery.
+
+	  This option is provided for experimentation and testing.
+	  Please to NOT use it on valuable data with good, tested, backups.
+
+	  Any reasonable current version of 'mdadm' can start an expansion
+	  with e.g.  mdadm --grow /dev/md0 --raid-disks=6
+	  Note: The array can only be expanded, not contracted.
+	  There should be enough spares already present to make the new
+	  array workable.
+
 config MD_RAID6
 	tristate "RAID-6 mode"
 	depends on BLK_DEV_MD

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-01-17 17:45:08.000000000 +1100
+++ ./drivers/md/md.c	2006-01-17 17:42:31.000000000 +1100
@@ -158,12 +158,12 @@ static int start_readonly;
  */
 static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters);
 static atomic_t md_event_count;
-static void md_new_event(mddev_t *mddev)
+void md_new_event(mddev_t *mddev)
 {
 	atomic_inc(&md_event_count);
 	wake_up(&md_event_waiters);
 }
-
+EXPORT_SYMBOL_GPL(md_new_event);
 /*
  * Enables to iterate over all existing md arrays
  * all_mddevs_lock protects this list.
@@ -4440,7 +4440,7 @@ static DECLARE_WAIT_QUEUE_HEAD(resync_wa
 
 #define SYNC_MARKS	10
 #define	SYNC_MARK_STEP	(3*HZ)
-static void md_do_sync(mddev_t *mddev)
+void md_do_sync(mddev_t *mddev)
 {
 	mddev_t *mddev2;
 	unsigned int currspeed = 0,
@@ -4673,6 +4673,7 @@ static void md_do_sync(mddev_t *mddev)
 	set_bit(MD_RECOVERY_DONE, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);
 }
+EXPORT_SYMBOL_GPL(md_do_sync);
 
 
 /*

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-01-17 17:45:08.000000000 +1100
+++ ./drivers/md/raid5.c	2006-01-17 17:42:31.000000000 +1100
@@ -1020,6 +1020,8 @@ static int add_stripe_bio(struct stripe_
 	return 0;
 }
 
+static void end_reshape(raid5_conf_t *conf);
+
 int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
 {
 	int sectors_per_chunk = conf->chunk_size >> 9;
@@ -1813,6 +1815,10 @@ static sector_t sync_request(mddev_t *md
 	if (sector_nr >= max_sector) {
 		/* just being told to finish up .. nothing much to do */
 		unplug_slaves(mddev);
+		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+			end_reshape(conf);
+			return 0;
+		}
 
 		if (mddev->curr_resync < max_sector) /* aborted */
 			bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
@@ -2433,6 +2439,97 @@ static int raid5_resize(mddev_t *mddev, 
 	return 0;
 }
 
+static int raid5_reshape(mddev_t *mddev, int raid_disks)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	int err;
+	mdk_rdev_t *rdev;
+	struct list_head *rtmp;
+	int spares = 0;
+
+	if (mddev->degraded ||
+	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		return -EBUSY;
+	if (conf->raid_disks > raid_disks)
+		return -EINVAL; /* Cannot shrink array yet */
+	if (conf->raid_disks == raid_disks)
+		return 0; /* nothing to do */
+
+	ITERATE_RDEV(mddev, rdev, rtmp)
+		if (rdev->raid_disk < 0 &&
+		    !test_bit(Faulty, &rdev->flags))
+			spares++;
+	if (conf->raid_disks + spares < raid_disks-1)
+		/* Not enough devices even to make a degraded array
+		 * of that size
+		 */
+		return -EINVAL;
+
+	err = resize_stripes(conf, raid_disks);
+	if (err)
+		return err;
+
+	spin_lock_irq(&conf->device_lock);
+	conf->previous_raid_disks = conf->raid_disks;
+	mddev->raid_disks = conf->raid_disks = raid_disks;
+	conf->expand_progress = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* Add some new drives, as many as will fit.
+	 * We know there are enough to make the newly sized array work.
+	 */
+	ITERATE_RDEV(mddev, rdev, rtmp)
+		if (rdev->raid_disk < 0 &&
+		    !test_bit(Faulty, &rdev->flags)) {
+			if (raid5_add_disk(mddev, rdev)) {
+				char nm[20];
+				set_bit(In_sync, &rdev->flags);
+				conf->working_disks++;
+				sprintf(nm, "rd%d", rdev->raid_disk);
+				sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+			} else
+				break;
+		}
+
+	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+	clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+	set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+	set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+	mddev->sync_thread = md_register_thread(md_do_sync, mddev,
+						"%s_reshape");
+	if (!mddev->sync_thread) {
+		mddev->recovery = 0;
+		spin_lock_irq(&conf->device_lock);
+		mddev->raid_disks = conf->raid_disks = conf->previous_raid_disks;
+		conf->expand_progress = MaxSector;
+		spin_unlock_irq(&conf->device_lock);
+		return -EAGAIN;
+	}
+	md_wakeup_thread(mddev->sync_thread);
+	md_new_event(mddev);
+	return 0;
+}
+
+static void end_reshape(raid5_conf_t *conf)
+{
+	struct block_device *bdev;
+
+	conf->mddev->array_size = conf->mddev->size * (conf->mddev->raid_disks-1);
+	set_capacity(conf->mddev->gendisk, conf->mddev->array_size << 1);
+	conf->mddev->changed = 1;
+
+	bdev = bdget_disk(conf->mddev->gendisk, 0);
+	if (bdev) {
+		mutex_lock(&bdev->bd_inode->i_mutex);
+		i_size_write(bdev->bd_inode, conf->mddev->array_size << 10);
+		mutex_unlock(&bdev->bd_inode->i_mutex);
+		bdput(bdev);
+	}
+	spin_lock_irq(&conf->device_lock);
+	conf->expand_progress = MaxSector;
+	spin_unlock_irq(&conf->device_lock);
+}
+
 static void raid5_quiesce(mddev_t *mddev, int state)
 {
 	raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -2471,6 +2568,9 @@ static struct mdk_personality raid5_pers
 	.spare_active	= raid5_spare_active,
 	.sync_request	= sync_request,
 	.resize		= raid5_resize,
+#if CONFIG_MD_RAID5_RESHAPE
+	.reshape	= raid5_reshape,
+#endif
 	.quiesce	= raid5_quiesce,
 };
 

diff ./include/linux/raid/md.h~current~ ./include/linux/raid/md.h
--- ./include/linux/raid/md.h~current~	2006-01-17 17:45:08.000000000 +1100
+++ ./include/linux/raid/md.h	2006-01-17 17:42:31.000000000 +1100
@@ -92,7 +92,8 @@ extern void md_super_write(mddev_t *mdde
 extern void md_super_wait(mddev_t *mddev);
 extern int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 			struct page *page, int rw);
-
+extern void md_do_sync(mddev_t *mddev);
+extern void md_new_event(mddev_t *mddev);
 
 #define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
 

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
                   ` (4 preceding siblings ...)
  2006-01-17  6:56 ` [PATCH 005 of 5] md: Final stages of raid5 expand code NeilBrown
@ 2006-01-17  8:17 ` Michael Tokarev
  2006-01-17  9:50   ` Sander
  2006-01-17 14:10   ` Steinar H. Gunderson
  2006-01-22  4:42 ` Adam Kropelin
  2006-01-23  1:08 ` John Hendrikx
  7 siblings, 2 replies; 71+ messages in thread
From: Michael Tokarev @ 2006-01-17  8:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

NeilBrown wrote:
> Greetings.
> 
> In line with the principle of "release early", following are 5 patches
> against md in 2.6.latest which implement reshaping of a raid5 array.
> By this I mean adding 1 or more drives to the array and then re-laying
> out all of the data.

Neil, is this online resizing/reshaping really needed?  I understand
all those words mean a lot for marketing persons - zero downtime,
online resizing etc - but it is much safer and easier to do that stuff
'offline', on an inactive array, like raidreconf does - safer, easier,
faster, and one has more possibilities for more complex changes.  It
isn't like you want to add/remove drives to/from your arrays every day...
A lot of good hw raid cards are unable to perform such reshaping too.

/mjt

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17  8:17 ` [PATCH 000 of 5] md: Introduction Michael Tokarev
@ 2006-01-17  9:50   ` Sander
  2006-01-17 11:26     ` Michael Tokarev
  2006-01-17 14:10   ` Steinar H. Gunderson
  1 sibling, 1 reply; 71+ messages in thread
From: Sander @ 2006-01-17  9:50 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: NeilBrown, linux-raid, linux-kernel, Steinar H. Gunderson

Michael Tokarev wrote (ao):
> NeilBrown wrote:
> > Greetings.
> > 
> > In line with the principle of "release early", following are 5
> > patches against md in 2.6.latest which implement reshaping of a
> > raid5 array. By this I mean adding 1 or more drives to the array and
> > then re-laying out all of the data.
> 
> Neil, is this online resizing/reshaping really needed? I understand
> all those words means alot for marketing persons - zero downtime,
> online resizing etc, but it is much safer and easier to do that stuff
> 'offline', on an inactive array, like raidreconf does - safer, easier,
> faster, and one have more possibilities for more complex changes. It
> isn't like you want to add/remove drives to/from your arrays every
> day... Alot of good hw raid cards are unable to perform such reshaping
> too.

I like the feature. Not only marketing prefers zero downtime you know :-)

Actually, I don't understand why you bother at all. One writes the
feature. Another uses it. How would this feature harm you?

	Kind regards, Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net

* Re: [PATCH 005 of 5] md: Final stages of raid5 expand code.
  2006-01-17  6:56 ` [PATCH 005 of 5] md: Final stages of raid5 expand code NeilBrown
@ 2006-01-17  9:55   ` Sander
  2006-01-19  0:32     ` Neil Brown
  0 siblings, 1 reply; 71+ messages in thread
From: Sander @ 2006-01-17  9:55 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

NeilBrown wrote (ao):
> +config MD_RAID5_RESHAPE

Would this also be possible for raid6?

> +	bool "Support adding drives to a raid-5 array (highly experimental)"
> +	depends on MD_RAID5 && EXPERIMENTAL
> +	---help---
> +	  A RAID-5 set can be expanded by adding extra drives. This
> +	  requires "restriping" the array which means (almost) every
> +	  block must be written to a different place.
> +
> +          This option allows this restiping to be done while the array
                                     ^^^^^^^^^
                                     restriping

> +	  is online.  However it is VERY EARLY EXPERIMENTAL code.
> +	  In particular, if anything goes wrong while the restriping
> +	  is happening, such as a power failure or a crash, all the
> +	  data on the array will be LOST beyond any reasonable hope
> +	  of recovery.
> +
> +	  This option is provided for experimentation and testing.
> +	  Please to NOT use it on valuable data with good, tested, backups.
                 ^^                             ^^^^
                 do                             without

Thanks a lot for this feature. I'll try to find a spare computer to test
this on. Thanks!

	Kind regards, Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17  9:50   ` Sander
@ 2006-01-17 11:26     ` Michael Tokarev
  2006-01-17 14:03       ` Kyle Moffett
                         ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Michael Tokarev @ 2006-01-17 11:26 UTC (permalink / raw)
  To: sander; +Cc: NeilBrown, linux-raid, linux-kernel, Steinar H. Gunderson

Sander wrote:
> Michael Tokarev wrote (ao):
[]
>>Neil, is this online resizing/reshaping really needed? I understand
>>all those words means alot for marketing persons - zero downtime,
>>online resizing etc, but it is much safer and easier to do that stuff
>>'offline', on an inactive array, like raidreconf does - safer, easier,
>>faster, and one have more possibilities for more complex changes. It
>>isn't like you want to add/remove drives to/from your arrays every
>>day... Alot of good hw raid cards are unable to perform such reshaping
>>too.
[]
> Actually, I don't understand why you bother at all. One writes the
> feature. Another uses it. How would this feature harm you?

This is about code complexity/bloat.  It's already complex enough.
I rely on the stability of the linux softraid subsystem, and want
it to be reliable.  Adding more features, especially non-trivial
ones, does not buy you a bug-free raid subsystem, just the opposite:
it will have more chances to crash, to eat your data etc, and it
will be harder to find and fix bugs.

Raid code is already too fragile; I'm afraid "simple" I/O errors
(which is what we need raid for) may crash the system already, and
I am waiting for the next whole-system crash due to e.g. a superblock
update error or whatnot.  I have seen all sorts of failures due to
linux softraid already (we use it here a lot), including ones
which required a complete array rebuild with heavy data loss.

Any "unnecessary bloat" (note the quotes: I understand some
people like this and other features) makes whole system even
more fragile than it is already.

Compare this with my statement about an "offline" "reshaper" above:
a separate userspace program (easier to write/debug compared with
kernel space) which operates on an inactive array (no locking
needed, no need to worry about other I/O operations going to the
array at the time of reshaping etc), with the ability to plan its
I/O strategy in a far more efficient and safer way...  Yes, this
approach has one downside: the array has to be inactive.  But in
my opinion it's worth it, compared to more possibilities to lose
your data, even if you do NOT use that feature at all...

/mjt

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 11:26     ` Michael Tokarev
@ 2006-01-17 14:03       ` Kyle Moffett
  2006-01-19  0:28         ` Neil Brown
  2006-01-17 16:08       ` Ross Vandegrift
  2006-01-17 22:38       ` Phillip Susi
  2 siblings, 1 reply; 71+ messages in thread
From: Kyle Moffett @ 2006-01-17 14:03 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: sander, NeilBrown, linux-raid, linux-kernel, Steinar H. Gunderson

On Jan 17, 2006, at 06:26, Michael Tokarev wrote:
> This is about code complexity/bloat.  It's already complex enouth.  
> I rely on the stability of the linux softraid subsystem, and want  
> it to be reliable. Adding more features, especially non-trivial  
> ones, does not buy you bugfree raid subsystem, just the opposite:  
> it will have more chances to crash, to eat your data etc, and will  
> be harder in finding/fixing bugs.

What part of: "You will need to enable the experimental  
MD_RAID5_RESHAPE config option for this to work." isn't obvious?  If
you don't want this feature, either don't turn on  
CONFIG_MD_RAID5_RESHAPE, or don't use the raid5 mdadm reshaping  
command.  This feature might be extremely useful for some people  
(including me on occasion), but I would not trust it even on my  
family's fileserver (let alone a corporate one) until it's been  
through several generations of testing and bugfixing.


Cheers,
Kyle Moffett

--
There is no way to make Linux robust with unreliable memory  
subsystems, sorry.  It would be like trying to make a human more  
robust with an unreliable O2 supply. Memory just has to work.
   -- Andi Kleen



* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17  8:17 ` [PATCH 000 of 5] md: Introduction Michael Tokarev
  2006-01-17  9:50   ` Sander
@ 2006-01-17 14:10   ` Steinar H. Gunderson
  1 sibling, 0 replies; 71+ messages in thread
From: Steinar H. Gunderson @ 2006-01-17 14:10 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: NeilBrown, linux-raid, linux-kernel

On Tue, Jan 17, 2006 at 11:17:15AM +0300, Michael Tokarev wrote:
> Neil, is this online resizing/reshaping really needed?  I understand
> all those words means alot for marketing persons - zero downtime,
> online resizing etc, but it is much safer and easier to do that stuff
> 'offline', on an inactive array, like raidreconf does - safer, easier,
> faster, and one have more possibilities for more complex changes. 

Try the scenario where the resize takes a week, and you don't have enough
spare disks to move it onto another server -- besides, that would take
several days alone... This is the kind of use-case for which I wrote the
original patch, and I'm grateful that Neil has picked it up again so we can
finally get something working in.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

* Re: [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
  2006-01-17  6:56 ` [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
@ 2006-01-17 14:37   ` John Stoffel
  2006-01-19  0:26     ` Neil Brown
  0 siblings, 1 reply; 71+ messages in thread
From: John Stoffel @ 2006-01-17 14:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

>>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:

NeilBrown> Previously the array of disk information was included in
NeilBrown> the raid5 'conf' structure which was allocated to an
NeilBrown> appropriate size.  This makes it awkward to change the size
NeilBrown> of that array.  So we split it off into a separate
NeilBrown> kmalloced array which will require a little extra indexing,
NeilBrown> but is much easier to grow.

Neil,

Instead of setting mddev->private = NULL, should you be doing a kfree
on it as well when you are in an abort state?

John

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 11:26     ` Michael Tokarev
  2006-01-17 14:03       ` Kyle Moffett
@ 2006-01-17 16:08       ` Ross Vandegrift
  2006-01-17 18:12         ` Michael Tokarev
  2006-01-17 22:38       ` Phillip Susi
  2 siblings, 1 reply; 71+ messages in thread
From: Ross Vandegrift @ 2006-01-17 16:08 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-raid, linux-kernel

On Tue, Jan 17, 2006 at 02:26:11PM +0300, Michael Tokarev wrote:
> Raid code is already too fragile; I'm afraid "simple" I/O errors
> (which is what we need raid for) may crash the system already, and
> I am waiting for the next whole system crash due to e.g. a superblock
> update error or whatnot.

I think you've got some other issue if simple I/O errors cause issues.
I've managed hundreds of MD arrays over the past ~ten years.  MD is
rock solid.  I'd guess that I've recovered at least a hundred disk failures
where data was saved by mdadm.

What is your setup like?  It's also possible that you've found a bug.

> I saw all sorts of failures due to
> linux softraid already (we use it here a lot), including ones
> which required complete array rebuild with heavy data loss.

Are you sure?  The one thing that's not always intuitive about MD - a
> failed array often still has your data and you can recover it.  Unlike
hardware RAID solutions, you have a lot of control over how the disks
are assembled and used - this can be a major advantage.

I'd say once a week someone comes on the linux-raid list and says "Oh no!
I accidentally ruined my RAID array!".  Neil almost always responds "Well,
don't do that!  But since you did, this might help...".

-- 
Ross Vandegrift
ross@lug.udel.edu

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 16:08       ` Ross Vandegrift
@ 2006-01-17 18:12         ` Michael Tokarev
  2006-01-18  8:14           ` Sander
  2006-01-19  0:22           ` Neil Brown
  0 siblings, 2 replies; 71+ messages in thread
From: Michael Tokarev @ 2006-01-17 18:12 UTC (permalink / raw)
  To: Ross Vandegrift; +Cc: linux-raid, linux-kernel

Ross Vandegrift wrote:
> On Tue, Jan 17, 2006 at 02:26:11PM +0300, Michael Tokarev wrote:
> 
>>Raid code is already too fragile; I'm afraid "simple" I/O errors
>>(which is what we need raid for) may crash the system already, and
>>I am waiting for the next whole system crash due to e.g. a superblock
>>update error or whatnot.
> 
> I think you've got some other issue if simple I/O errors cause issues.
> I've managed hundreds of MD arrays over the past ~ten years.  MD is
> rock solid.  I'd guess that I've recovered at least a hundred disk failures
> where data was saved by mdadm.
> 
> What is your setup like?  It's also possible that you've found a bug.

We have about 500 systems with raid1, raid5 and raid10, running for about
5 or 6 years (since the 0.90 beta patched into the 2.2 kernel -- I don't think
linux softraid existed before that, or, rather, I can't say it was something
that could be used in production).

Most problematic case so far, which I described numerous times (like,
"why linux raid isn't Raid really, why it can be worse than plain disk")
is when, after single sector read failure, md kicks the whole disk off
the array, and when you start resync (after replacing the "bad" drive or
just remapping that bad sector or even doing nothing, as it will be
remapped in almost all cases during write, on real drives anyway),
you find another "bad sector" on another drive.  After this, the array
can't be started anymore, at least not w/o --force (ie, requires some
user intervention, which is sometimes quite difficult if the server
is several 100s miles away).  More, it's quite difficult to recover
it even manually (after --force'ing it to start), without fixing that
bad sector somehow -- if first drive failure is "recent enouth" we've
a hope that this very sector can be read from that first drive. if
the alot of filesystem activity happened since that time, that chances
are quite small; and with raid5 it's quite difficult to say where the
error is in the filesystem, due to the complex layout of raid5.

But this has been described here numerous times, and - hopefully -
with the current changes (re-writing of bad blocks) this very issue will
go away, at least in its most common scenario (I'd try to keep even a
"bad" drive, even after some write errors, because it still contains
some data which can be read; but that is problematic to say the least,
because one has to store a list of bad blocks somewhere...).

(And no, it's not that all my drives are bad/cheap - it's just that when you
have hundreds or thousands of drives, there is quite a high probability that
some of them will fail at some point, or will develop a bad sector etc.)

>>I saw all sorts of failures due to
>>linux softraid already (we use it here a lot), including ones
>>which required complete array rebuild with heavy data loss.
> 
> Are you sure?  The one thing that's not always intuitive about MD - a
> failed array often still has your data and you can recover it.  Unlike
> hardware RAID solutions, you have a lot of control over how the disks
> are assembled and used - this can be a major advantage.
> 
> I'd say once a week someone comes on the linux-raid list and says "Oh no!
> I accidentally ruined my RAID array!".  Neil almost always responds "Well,
> don't do that!  But since you did, this might help...".

I know that.  And I have quite some experience too, and I have studied the
mdadm source.

There were in fact two cases like that, not one.

The first was mostly due to operator error, or the lack of a better choice in
2.2 (or early 2.4) times -- I relied on raid autodetection (which I
don't do anymore, and I strongly suggest others switch to mdassemble
or something like that) -- a drive failed (for real, not bad blocks)
and needed to be replaced, and I forgot to clear the partition table
on the replacement drive (which had been in our testing box) - as a result,
the kernel assembled a raid5 out of components which belonged to different
arrays.  I only vaguely remember what happened at that time -- maybe
the kernel or I started reconstruction (not noticing it was the wrong array),
or I mounted the filesystem - I can't say for sure anymore, but the
result was that I wasn't able to restore the filesystem, because I
didn't have that filesystem anymore.  (It should have been assembling
the boot raid1 array but assembled a degraded raid5 instead.)

And the second case was when, after an attempt to resync the array (after
that famous 'bad block kicked off the whole disk' scenario) which resulted in
an OOPS (which I didn't notice immediately, as it continued the resync),
it wrote garbage all over, resulting in a badly broken filesystem,
and a somewhat broken nearby partition too (which I was able to recover).
It was at about 2.4.19 or so, and I had that situation only once.
Granted, I can't blame the raid code for all this, because I don't even
know what was in the oops (the machine locked hard, but someone who was
near the server noticed it OOPSed) - it may well be a bug somewhere
else.

As a sort of conclusion.

There are several features that could be implemented in the linux softraid
code to make it real Raid, with data safety as the goal.  One example is to
be able to replace a "to-be-failed" drive (think SMART failure
predictions, for example) with a (hot)spare (or just a replacement) without
ever removing it from the array -- by adding the new drive to the
array *first*, and removing the to-be-replaced one only after the new one is
fully synced.  Another example is to implement some NVRAM-like storage
for metadata (this will require the necessary hardware as well, like
e.g. a flash card -- I don't know how safe it can be).  And so on.

The current MD code is "almost here", almost real.  It still has some
(maybe minor) problems, it still lacks some (again maybe minor) features
wrt data safety.  Ie, it still can fail, but it's almost here.

Meanwhile, current development is going to implement some new and non-trivial
features which are of little use in real life.  Face it: yes, it's good
when you're able to reshape your array online while still servicing your
users, but I'd go for even 12 hours of downtime if I know my data is safe,
instead of unknown downtime after I realize the reshape failed for some
reason and I don't have my data anymore.  And yes, it's very rarely used
(which adds to the problem - rarely used code paths have bugs that stay
unfound for a long time, and bite you at a very unexpected moment, when
you think it's all ok...)

Well, not all is that bad really.  I really appreciate Neil's work, it's
all his baby after all, and I owe him a lot because of all our
machines which, thanks to the raid code, are running fine (most of them anyway).
I had a hopefully small question, whether the new features are really
useful, and just described my point of view on the topic.. And answered
your questions as well, Ross.. ;)

Thank you.

/mjt

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 11:26     ` Michael Tokarev
  2006-01-17 14:03       ` Kyle Moffett
  2006-01-17 16:08       ` Ross Vandegrift
@ 2006-01-17 22:38       ` Phillip Susi
  2006-01-17 22:57         ` Neil Brown
  2 siblings, 1 reply; 71+ messages in thread
From: Phillip Susi @ 2006-01-17 22:38 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: sander, NeilBrown, linux-raid, linux-kernel, Steinar H. Gunderson

Michael Tokarev wrote:
<snip>
> Compare this with my statement about an "offline" "reshaper" above: a
> separate userspace (easier to write/debug compared with kernel
> space) program which operates on an inactive array (no locking
> needed, no need to worry about other I/O operations going to the
> array at the time of reshaping etc), with an ability to plan its
> I/O strategy in a much more efficient and safer way...  Yes, this
> approach has one downside: the array has to be inactive.  But in
> my opinion it's worth it, compared to the added risk of losing
> your data, even if you do NOT use that feature at all...
>
>   
I also like the idea of this kind of thing going in user space.  I was 
also under the impression that md was going to be phased out and 
replaced by the device mapper.  I've been kicking around the idea of a 
user space utility that manipulates the device mapper tables and 
performs block moves itself to reshape a raid array.  It doesn't seem 
like it would be that difficult and would not require modifying the 
kernel at all.  The basic idea is something like this:

/dev/mapper/raid is your raid array, which is mapped to a stripe between 
/dev/sda, /dev/sdb.  When you want to expand the stripe to add /dev/sdc 
to the array, you create three new devices:

/dev/mapper/raid-old:  copy of the old mapper table, striping sda and sdb
/dev/mapper/raid-progress: linear map with size = new stripe width, and 
pointing to raid-new
/dev/mapper/raid-new: what the raid will look like when done, i.e. 
stripe of sda, sdb, and sdc

Then you replace /dev/mapper/raid with a linear map to raid-new, 
raid-progress, and raid-old, in that order.  Initially the length of the
chunks from raid-progress and raid-new is zero, so you will still be
entirely accessing raid-old.  For each stripe in the array, you change 
raid-progress to point to the corresponding blocks in raid-new, but 
suspended, so IO to this stripe will block.  Then you update the raid 
map so raid-progress overlays the stripe you are working on to catch IO 
instead of allowing it to go to raid-old.  After you read that stripe 
from raid-old and write it to raid-new,  resume raid-progress to flush 
any blocked writes to the raid-new stripe.  Finally, update raid so the
previously in-progress stripe now maps to raid-new.

Repeat for each stripe in the array, and finally replace the raid table 
with raid-new's table, and delete the 3 temporary devices. 
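
To make the plumbing concrete, one step of that loop might look roughly
like this with dmsetup.  All of the names, sizes and offsets below are
made up for illustration, and a real tool would have to compute them,
drive the loop, and deal with the fact that the old and new layouts
overlap on the same disks:

  # two disks with 983040 usable sectors each, striped with 64k
  # (128-sector) chunks, being reshaped onto three disks
  dmsetup create raid-old --table "0 1966080 striped 2 128 /dev/sda 0 /dev/sdb 0"
  dmsetup create raid-new --table "0 2949120 striped 3 128 /dev/sda 0 /dev/sdb 0 /dev/sdc 0"

  # raid-progress covers the window currently being copied; keep it
  # suspended so IO to that window blocks until the copy is finished
  dmsetup create raid-progress --table "0 384 linear /dev/mapper/raid-new 98304"
  dmsetup suspend raid-progress

  # splice the three pieces into the visible device: everything below
  # sector 98304 is already reshaped, 98304-98688 is in progress, and
  # the rest still reads from the old layout
  printf '%s\n' \
      "0 98304 linear /dev/mapper/raid-new 0" \
      "98304 384 linear /dev/mapper/raid-progress 0" \
      "98688 1867392 linear /dev/mapper/raid-old 98688" | dmsetup load raid
  dmsetup suspend raid
  dmsetup resume raid

  # copy the window from the old layout to the new one, then let any
  # writes that blocked on raid-progress through to raid-new
  dd if=/dev/mapper/raid-old of=/dev/mapper/raid-new bs=512 \
      skip=98304 seek=98304 count=384
  dmsetup resume raid-progress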


Adding transaction logging to the user mode utility wouldn't be very 
hard either. 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 22:38       ` Phillip Susi
@ 2006-01-17 22:57         ` Neil Brown
  0 siblings, 0 replies; 71+ messages in thread
From: Neil Brown @ 2006-01-17 22:57 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Michael Tokarev, sander, linux-raid, linux-kernel, Steinar H. Gunderson

On Tuesday January 17, psusi@cfl.rr.com wrote:
>                                                                  I was 
> also under the impression that md was going to be phased out and 
> replaced by the device mapper.

I wonder where this sort of idea comes from....

Obviously individual distributions are free to support or not support
whatever bits of code they like.  And developers are free to add
duplicate functionality to the kernel (I believe someone is working on
a raid5 target for dm).  But that doesn't mean that anything is going
to be 'phased out'.

md and dm, while similar, are quite different.  They can both
comfortably co-exist even if they have similar functionality.
What I expect will happen (in line with what normally happens in
Linux) is that both will continue to evolve as long as there is
interest and developer support.  They will quite possibly borrow ideas
from each other where that is relevant.  Parts of one may lose
support and eventually die (as md/multipath is on the way to doing)
but there is no wholesale 'phasing out' going to happen in either
direction. 

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 18:12         ` Michael Tokarev
@ 2006-01-18  8:14           ` Sander
  2006-01-18  9:03             ` Alan Cox
  2006-01-19  0:22           ` Neil Brown
  1 sibling, 1 reply; 71+ messages in thread
From: Sander @ 2006-01-18  8:14 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Ross Vandegrift, linux-raid, linux-kernel

Michael Tokarev wrote (ao):
> Most problematic case so far, which I described numerous times (like,
> "why linux raid isn't Raid really, why it can be worse than plain
> disk") is when, after single sector read failure, md kicks the whole
> disk off the array, and when you start resync (after replacing the
> "bad" drive or just remapping that bad sector or even doing nothing,
> as it will be remapped in almost all cases during write, on real
> drives anyway),

If the (harddisk internal) remap succeeded, the OS doesn't see the bad
sector at all I believe.

If you (the OS) do see a bad sector, the disk couldn't remap, and goes
downhill from there, right?

	Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-18  8:14           ` Sander
@ 2006-01-18  9:03             ` Alan Cox
  0 siblings, 0 replies; 71+ messages in thread
From: Alan Cox @ 2006-01-18  9:03 UTC (permalink / raw)
  To: sander; +Cc: Michael Tokarev, Ross Vandegrift, linux-raid, linux-kernel

On Mer, 2006-01-18 at 09:14 +0100, Sander wrote:
> If the (harddisk internal) remap succeeded, the OS doesn't see the bad
> sector at all I believe.

True for ATA; in the SCSI case you may be told about the remap having
occurred, but it's a "by the way" type of message, not an error proper.

> If you (the OS) do see a bad sector, the disk couldn't remap, and goes
> downhill from there, right?

If a hot spare is configured it will be dropped into the configuration
at that point.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 18:12         ` Michael Tokarev
  2006-01-18  8:14           ` Sander
@ 2006-01-19  0:22           ` Neil Brown
  2006-01-19  9:01             ` Jakob Oestergaard
  1 sibling, 1 reply; 71+ messages in thread
From: Neil Brown @ 2006-01-19  0:22 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Ross Vandegrift, linux-raid, linux-kernel

On Tuesday January 17, mjt@tls.msk.ru wrote:
> 
> As a sort of conclusion.
> 
> There are several features that could be implemented in the linux softraid
> code to make it real Raid, with data safety as the goal.  One example is to
> be able to replace a "to-be-failed" drive (think SMART failure
> predictions, for example) with a (hot)spare (or just a replacement) without
> ever removing it from the array -- by adding the new drive to the
> array *first*, and removing the to-be-replaced one only after the new one is
> fully synced.  Another example is to implement some NVRAM-like storage
> for metadata (this will require the necessary hardware as well, like
> e.g. a flash card -- I don't know how safe it can be).  And so on.

proactive replacement before complete failure is a good idea and is
(just recently) on my todo list.  It shouldn't be too hard.

> 
> The current MD code is "almost here", almost real.  It still has some
> (maybe minor) problems, it still lacks some (again maybe minor) features
> wrt data safety.  Ie, it still can fail, but it's almost here.

concrete suggestions are always welcome (though sometimes you might
have to put some effort into convincing me...)

> 
> Meanwhile, current development is going to implement some new and non-trivial
> features which are of little use in real life.  Face it: yes, it's good
> when you're able to reshape your array online while still servicing your
> users, but I'd go for even 12 hours of downtime if I know my data is safe,
> instead of unknown downtime after I realize the reshape failed for some
> reason and I don't have my data anymore.  And yes, it's very rarely used
> (which adds to the problem - rarely used code paths have bugs that stay
> unfound for a long time, and bite you at a very unexpected moment, when
> you think it's all ok...)

If you look at the amount of code in the 'reshape raid5' patch you
will notice that it isn't really very much.  It reuses a lot of the
infrastructure that is already present in md/raid5.  So a reshape
actually uses a lot of code that is used very often.

Compare this to an offline solution (raidreconf) where all the code
is only used occasionally.  You could argue that the online version
has more code safety than the offline version....

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
  2006-01-17 14:37   ` John Stoffel
@ 2006-01-19  0:26     ` Neil Brown
  2006-01-21  3:37       ` John Stoffel
  0 siblings, 1 reply; 71+ messages in thread
From: Neil Brown @ 2006-01-19  0:26 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

On Tuesday January 17, john@stoffel.org wrote:
> >>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:
> 
> NeilBrown> Previously the array of disk information was included in
> NeilBrown> the raid5 'conf' structure which was allocated to an
> NeilBrown> appropriate size.  This makes it awkward to change the size
> NeilBrown> of that array.  So we split it off into a separate
> NeilBrown> kmalloced array which will require a little extra indexing,
> NeilBrown> but is much easier to grow.
> 
> Neil,
> 
> Instead of setting mddev->private = NULL, should you be doing a kfree
> on it as well when you are in an abort state?

The only times I set 
  mddev->private = NULL
it is immediately after
   kfree(conf)
and as conf is the thing that is assigned to mddev->private, this
should be doing exactly what you suggest.

Does that make sense?

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17 14:03       ` Kyle Moffett
@ 2006-01-19  0:28         ` Neil Brown
  0 siblings, 0 replies; 71+ messages in thread
From: Neil Brown @ 2006-01-19  0:28 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Michael Tokarev, sander, linux-raid, linux-kernel, Steinar H. Gunderson

On Tuesday January 17, mrmacman_g4@mac.com wrote:
> On Jan 17, 2006, at 06:26, Michael Tokarev wrote:
> > This is about code complexity/bloat.  It's already complex enough.
> > I rely on the stability of the linux softraid subsystem, and want
> > it to be reliable.  Adding more features, especially non-trivial
> > ones, does not buy you a bug-free raid subsystem, just the opposite:
> > it will have more chances to crash, to eat your data etc., and will
> > make bugs harder to find and fix.
> 
> What part of: "You will need to enable the experimental  
> MD_RAID5_RESHAPE config option for this to work." isn't obvious?  If
> you don't want this feature, either don't turn on  
> CONFIG_MD_RAID5_RESHAPE, or don't use the raid5 mdadm reshaping  
> command.

This isn't really a fair comment.  CONFIG_MD_RAID5_RESHAPE just
enables the code.  All the code is included whether this config option
is set or not.  So if code-bloat were an issue, the config option
wouldn't answer it.

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 005 of 5] md: Final stages of raid5 expand code.
  2006-01-17  9:55   ` Sander
@ 2006-01-19  0:32     ` Neil Brown
  0 siblings, 0 replies; 71+ messages in thread
From: Neil Brown @ 2006-01-19  0:32 UTC (permalink / raw)
  To: sander; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

On Tuesday January 17, sander@humilis.net wrote:
> NeilBrown wrote (ao):
> > +config MD_RAID5_RESHAPE
> 
> Would this also be possible for raid6?


Yes.  That will follow once raid5 is reasonably reliable.  It is
essentially the same change to a different file.
(One day we will merge raid5 and raid6 together into the one module,
but not today).

> > +          This option allows this restiping to be done while the array
>                                      ^^^^^^^^^
>                                      restriping
> > +	  Please to NOT use it on valuable data with good, tested, backups.
>                  ^^                             ^^^^
>                  do                             without

Thanks * 3.

> 
> Thanks a lot for this feature. I'll try to find a spare computer to test
> this on. Thanks!

That would be great!

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19  0:22           ` Neil Brown
@ 2006-01-19  9:01             ` Jakob Oestergaard
  0 siblings, 0 replies; 71+ messages in thread
From: Jakob Oestergaard @ 2006-01-19  9:01 UTC (permalink / raw)
  To: Neil Brown; +Cc: Michael Tokarev, Ross Vandegrift, linux-raid, linux-kernel

On Thu, Jan 19, 2006 at 11:22:31AM +1100, Neil Brown wrote:
...
> Compare this to an offline solution (raidreconf) where all the code
> is only used occasionally.  You could argue that the online version
> has more code safety than the offline version....

Correct.

raidreconf, however, can convert a 2 disk RAID-0 to a 4 disk RAID-5 for
example - the whole design of raidreconf is fundamentally different (of
course) from the on-line reshape.  The on-line reshape can be (and
should be) much simpler.

Now, back when I wrote raidreconf, my thoughts were that md would be
merged into dm, and that raidreconf should evolve into something like
'pvmove' - a user-space tool that moves blocks around, interfacing with
the kernel as much as strictly necessary, allowing hot reconfiguration
of RAID setups.

That was the idea.

Reality, however, seems to be that MD is not moving quickly into DM (for
whatever reasons). Also, I haven't had the time to actually just move on
this myself. Today, raidreconf is used by some, but it is not
maintained, and it is often too slow for comfortable off-line usage
(reconfiguration of TB sized arrays is slow - not so much because of
raidreconf, but because there simply is a lot of data that needs to be
moved around).

I still think that putting MD into DM and extending pvmove to include
raidreconf functionality, would be the way to go. The final solution
should also be tolerant (like pvmove is today) of power cycles during
reconfiguration - the operation should be re-startable.

Anyway - this is just me dreaming - I don't have time to do this and it
seems that currently no one else has either.

Great initiative with the reshape Neil - hot reconfiguration is much
needed - personally I still hope to see MD move into DM and pvmove
including raidreconf functionality, but I guess that when we're eating
an elephant we should be satisfied with taking one bite at a time  :)

-- 

 / jakob


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
  2006-01-19  0:26     ` Neil Brown
@ 2006-01-21  3:37       ` John Stoffel
  2006-01-22 22:57         ` Neil Brown
  0 siblings, 1 reply; 71+ messages in thread
From: John Stoffel @ 2006-01-21  3:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: John Stoffel, linux-raid, linux-kernel, Steinar H. Gunderson

>>>>> "Neil" == Neil Brown <neilb@suse.de> writes:

Neil> On Tuesday January 17, john@stoffel.org wrote:
>> >>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:
>> 
NeilBrown> Previously the array of disk information was included in
NeilBrown> the raid5 'conf' structure which was allocated to an
NeilBrown> appropriate size.  This makes it awkward to change the size
NeilBrown> of that array.  So we split it off into a separate
NeilBrown> kmalloced array which will require a little extra indexing,
NeilBrown> but is much easier to grow.

>> Instead of setting mddev->private = NULL, should you be doing a kfree
>> on it as well when you are in an abort state?

Neil> The only times I set 
mddev-> private = NULL
Neil> it is immediately after
Neil>    kfree(conf)
Neil> and as conf is the thing that is assigned to mddev->private, this
Neil> should be doing exactly what you suggest.

Neil> Does that make sense?

Now that I've had some time to actually apply your patches to
2.6.16-rc1 and look them over more carefully, I see my mistake.  I
overlooked the assignment of 

	conf = mddev->private 

In the lines just below there, and I see how you do clean it up.  I
guess I would have just done it the other way around:

	conf = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
	if (!conf)
		goto abort;
	.
	.
	.
	mddev->private = conf;

Though now that I look at it, don't we have a circular reference
here?  Let me quote the code section, which starts off with where I was
confused: 

        mddev->private = kzalloc(sizeof (raid5_conf_t), GFP_KERNEL);
        if ((conf = mddev->private) == NULL)
                goto abort;
        conf->disks = kzalloc(mddev->raid_disks * sizeof(struct disk_info),
                              GFP_KERNEL);
        if (!conf->disks)
                goto abort;

        conf->mddev = mddev;

        if ((conf->stripe_hashtbl = kzalloc(PAGE_SIZE, GFP_KERNEL)) == NULL)
                goto abort;


Now we seem to end up with:

	mddev->private = conf;
	conf->mddev = mddev;

which looks a little strange to me, and possibly something that could
lead to endless loops.  But I need to find the time to sit down and
try to understand the code, so don't waste much time educating me
here.

Thanks for all your work on this Neil, I for one really appreciate it!

John

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
                   ` (5 preceding siblings ...)
  2006-01-17  8:17 ` [PATCH 000 of 5] md: Introduction Michael Tokarev
@ 2006-01-22  4:42 ` Adam Kropelin
  2006-01-22 22:52   ` Neil Brown
  2006-01-23  1:08 ` John Hendrikx
  7 siblings, 1 reply; 71+ messages in thread
From: Adam Kropelin @ 2006-01-22  4:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

NeilBrown <neilb@suse.de> wrote:
> In line with the principle of "release early", following are 5 patches
> against md in 2.6.latest which implement reshaping of a raid5 array.
> By this I mean adding 1 or more drives to the array and then re-laying
> out all of the data.

I've been looking forward to a feature like this, so I took the
opportunity to set up a vmware session and give the patches a try. I
encountered both success and failure, and here are the details of both.

On the first try I neglected to read the directions and increased the
number of devices first (which worked) and then attempted to add the
physical device (which didn't work; at least not the way I intended).
The result was an array of size 4, operating in degraded mode, with 
three active drives and one spare. I was unable to find a way to coax
mdadm into adding the 4th drive as an active device instead of a 
spare. I'm not an mdadm guru, so there may be a method I overlooked.
Here's what I did, interspersed with trimmed /proc/mdstat output:

  mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc

    md0 : active raid5 sda[0] sdc[2] sdb[1]
          2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

  mdadm --grow -n4 /dev/md0

    md0 : active raid5 sda[0] sdc[2] sdb[1]
          3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

  mdadm --manage --add /dev/md0 /dev/sdd

    md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
          3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

  mdadm --misc --stop /dev/md0
  mdadm --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd

    md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
          3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

For my second try I actually read the directions and things went much
better, aside from a possible /proc/mdstat glitch shown below.

  mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc

    md0 : active raid5 sda[0] sdc[2] sdb[1]
          2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

  mdadm --manage --add /dev/md0 /dev/sdd

    md0 : active raid5 sdd[3](S) sdc[2] sdb[1] sda[0]
          2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

  mdadm --grow -n4 /dev/md0

    md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
          2097024 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
                                ...should this be... --> [4/3] [UUU_] perhaps?
          [>....................]  recovery =  0.4% (5636/1048512) finish=9.1min speed=1878K/sec

    [...time passes...]

    md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
          3145536 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

My final test was a repeat of #2, but with data actively being written
to the array during the reshape (the previous tests were on an idle,
unmounted array). This one failed pretty hard, with several processes
ending up in the D state. I repeated it twice and sysrq-t dumps can be
found at <http://www.kroptech.com/~adk0212/md-raid5-reshape-wedge.txt>.
The writeout load was a kernel tree untar started shortly before the
'mdadm --grow' command was given. mdadm hung, as did tar. Any process
which subsequently attempted to access the array hung as well. A second
attempt at the same thing hung similarly, although only pdflush shows up
hung in that trace. mdadm and tar are missing for some reason.

I'm happy to do more tests. It's easy to conjure up virtual disks and
load them with irrelevant data (like kernel trees ;)

--Adam


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-22  4:42 ` Adam Kropelin
@ 2006-01-22 22:52   ` Neil Brown
  2006-01-23 23:02     ` Adam Kropelin
  0 siblings, 1 reply; 71+ messages in thread
From: Neil Brown @ 2006-01-22 22:52 UTC (permalink / raw)
  To: Adam Kropelin; +Cc: NeilBrown, linux-raid, linux-kernel, Steinar H. Gunderson

On Saturday January 21, akropel1@rochester.rr.com wrote:
> NeilBrown <neilb@suse.de> wrote:
> > In line with the principle of "release early", following are 5 patches
> > against md in 2.6.latest which implement reshaping of a raid5 array.
> > By this I mean adding 1 or more drives to the array and then re-laying
> > out all of the data.
> 
> I've been looking forward to a feature like this, so I took the
> opportunity to set up a vmware session and give the patches a try. I
> encountered both success and failure, and here are the details of both.
> 
> On the first try I neglected to read the directions and increased the
> number of devices first (which worked) and then attempted to add the
> physical device (which didn't work; at least not the way I intended).
> The result was an array of size 4, operating in degraded mode, with 
> three active drives and one spare. I was unable to find a way to coax
> mdadm into adding the 4th drive as an active device instead of a 
> spare. I'm not an mdadm guru, so there may be a method I overlooked.
> Here's what I did, interspersed with trimmed /proc/mdstat output:

Thanks, this is exactly the sort of feedback I was hoping for - people
testing things that I didn't think to...

> 
>   mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc
> 
>     md0 : active raid5 sda[0] sdc[2] sdb[1]
>           2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> 
>   mdadm --grow -n4 /dev/md0
> 
>     md0 : active raid5 sda[0] sdc[2] sdb[1]
>           3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

I assume that no "resync" started at this point?  It should have done.

> 
>   mdadm --manage --add /dev/md0 /dev/sdd
> 
>     md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
>           3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
> 
>   mdadm --misc --stop /dev/md0
>   mdadm --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc /dev/sdd
> 
>     md0 : active raid5 sdd[3](S) sda[0] sdc[2] sdb[1]
>           3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

This really should have started a recovery.... I'll look into that
too.


> 
> For my second try I actually read the directions and things went much
> better, aside from a possible /proc/mdstat glitch shown below.
> 
>   mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc
> 
>     md0 : active raid5 sda[0] sdc[2] sdb[1]
>           2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> 
>   mdadm --manage --add /dev/md0 /dev/sdd
> 
>     md0 : active raid5 sdd[3](S) sdc[2] sdb[1] sda[0]
>           2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> 
>   mdadm --grow -n4 /dev/md0
> 
>     md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
>           2097024 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>                                 ...should this be... --> [4/3] [UUU_] perhaps?

Well, part of the array is "4/4 UUUU" and part is "3/3 UUU".  How do
you represent that?  I think "4/4 UUUU" is best.


>           [>....................]  recovery =  0.4% (5636/1048512) finish=9.1min speed=1878K/sec
> 
>     [...time passes...]
> 
>     md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
>           3145536 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
> 
> My final test was a repeat of #2, but with data actively being written
> to the array during the reshape (the previous tests were on an idle,
> unmounted array). This one failed pretty hard, with several processes
> ending up in the D state. I repeated it twice and sysrq-t dumps can be
> found at <http://www.kroptech.com/~adk0212/md-raid5-reshape-wedge.txt>.
> The writeout load was a kernel tree untar started shortly before the
> 'mdadm --grow' command was given. mdadm hung, as did tar. Any process
> which subsequently attempted to access the array hung as well. A second
> attempt at the same thing hung similarly, although only pdflush shows up
> hung in that trace. mdadm and tar are missing for some reason.

Hmmm... I tried similar things but didn't get this deadlock.  Somehow
the fact that mdadm is holding the reconfig_sem semaphore means that
some IO cannot proceed and so mdadm cannot grab and resize all the
stripe heads... I'll have to look more deeply into this.

> 
> I'm happy to do more tests. It's easy to conjure up virtual disks and
> load them with irrelevant data (like kernel trees ;)

Great.  I'll probably be putting out a new patch set  late this week
or early next.  Hopefully it will fix the issues you found and you
can try it again..


Thanks again,
NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow.
  2006-01-21  3:37       ` John Stoffel
@ 2006-01-22 22:57         ` Neil Brown
  0 siblings, 0 replies; 71+ messages in thread
From: Neil Brown @ 2006-01-22 22:57 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

On Friday January 20, john@stoffel.org wrote:
> Though now that I look at it, don't we have a circular reference
> here?  Let me quote the code section, which starts of with where I was
> confused: 
..
> 
> Now we seem to end up with:
> 
> 	mddev->private = conf;
> 	conf->mddev = mddev;
> 

This is simply two related structures, each holding a reference to the
other.  It's like the child pointing to the parent, and the parent
pointing to the child, which you get all the time.

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
                   ` (6 preceding siblings ...)
  2006-01-22  4:42 ` Adam Kropelin
@ 2006-01-23  1:08 ` John Hendrikx
  2006-01-23  1:25   ` Neil Brown
  7 siblings, 1 reply; 71+ messages in thread
From: John Hendrikx @ 2006-01-23  1:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

NeilBrown wrote:
> In line with the principle of "release early", following are 5 patches
> against md in 2.6.latest which implement reshaping of a raid5 array.
> By this I mean adding 1 or more drives to the array and then re-laying
> out all of the data.
>   
I think my question is already answered by this, but...

Would this also allow changing the size of each raid device?  Let's say 
I currently have 160 GB x 6, could I change that to 300 GB x 6 or am I 
only allowed to add more 160 GB devices?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23  1:08 ` John Hendrikx
@ 2006-01-23  1:25   ` Neil Brown
  2006-01-23  1:54     ` Kyle Moffett
  0 siblings, 1 reply; 71+ messages in thread
From: Neil Brown @ 2006-01-23  1:25 UTC (permalink / raw)
  To: John Hendrikx; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

On Monday January 23, hjohn@xs4all.nl wrote:
> NeilBrown wrote:
> > In line with the principle of "release early", following are 5 patches
> > against md in 2.6.latest which implement reshaping of a raid5 array.
> > By this I mean adding 1 or more drives to the array and then re-laying
> > out all of the data.
> >   
> I think my question is already answered by this, but...
> 
> Would this also allow changing the size of each raid device?  Let's say 
> I currently have 160 GB x 6, could I change that to 300 GB x 6 or am I 
> only allowed to add more 160 GB devices?

Changing the size of the devices is a separate operation that has been
supported for a while.
For each device in turn, you fail it and replace it with a larger
device. (This means the array runs degraded for a while, which isn't
ideal and might be fixed one day).

Once all the devices in the array are of the desired size, you run
  mdadm --grow /dev/mdX --size=max
and the array (raid1, raid5, raid6) will use up all available space on
the devices, and a resync will start to make sure that extra space is
in-sync.
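
The whole cycle looks something like this (device names purely
illustrative):

  # for each member in turn: fail it, remove it, add the larger
  # replacement, then wait for the recovery to finish before moving
  # on to the next device
  mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
  mdadm /dev/md0 --add /dev/sde1
  # ... watch /proc/mdstat until the rebuild completes ...

  # once every member has been replaced with a larger one
  mdadm --grow /dev/md0 --size=max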

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23  1:25   ` Neil Brown
@ 2006-01-23  1:54     ` Kyle Moffett
  0 siblings, 0 replies; 71+ messages in thread
From: Kyle Moffett @ 2006-01-23  1:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: John Hendrikx, linux-raid, linux-kernel, Steinar H. Gunderson

On Jan 22, 2006, at 20:25, Neil Brown wrote:
> Changing the size of the devices is a separate operation that has  
> been supported for a while. For each device in turn, you fail it  
> and replace it with a larger device. (This means the array runs  
> degraded for a while, which isn't ideal and might be fixed one day).
>
> Once all the devices in the array are of the desired size, you run
>   mdadm --grow /dev/mdX --size=max
> and the array (raid1, raid5, raid6) will use up all available space  
> on the devices, and a resync will start to make sure that extra  
> space is in-sync.

One option I can think of that would make it much safer would be to  
originally set up your RAID like this:

                md3 (RAID-5)
        __________/   |   \__________
       /              |              \
md0 (RAID-1)   md1 (RAID-1)   md2 (RAID-1)

Each of md0-2 would only have a single drive, and therefore provide  
no redundancy.  When you wanted to grow the RAID-5, you would first  
add a new larger disk to each of md0-md2 and trigger each resync.   
Once that is complete, remove the old drives from md0-2 and run:
   mdadm --grow /dev/md0 --size=max
   mdadm --grow /dev/md1 --size=max
   mdadm --grow /dev/md2 --size=max

Then once all that has completed, run:
   mdadm --grow /dev/md3 --size=max

This will enlarge the top-level array.  If you have LVM on the top- 
level, you can allocate new LVs, resize existing ones, etc.

With the newly added code, you could also add new drives dynamically  
by creating a /dev/md4 out of the single drive, and adding that as a  
new member of /dev/md3.
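
Setting that up from scratch might look something like this (disk names
are only an example; listing the second member of each mirror as
"missing" is what gives you the single-drive RAID-1s):

   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 missing
   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 missing
   mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc1 missing
   mdadm --create /dev/md3 --level=5 --raid-devices=3 /dev/md0 /dev/md1 /dev/md2

and the later swap of one disk for a bigger one would be:

   mdadm /dev/md0 --add /dev/sdd1
   # ... wait for the resync to finish ...
   mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1

followed by the --grow commands above.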

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw  
knives at people who weren't supposed to be in your machine room.
   -- Anthony de Boer



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-22 22:52   ` Neil Brown
@ 2006-01-23 23:02     ` Adam Kropelin
  0 siblings, 0 replies; 71+ messages in thread
From: Adam Kropelin @ 2006-01-23 23:02 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

Neil Brown wrote:
> On Saturday January 21, akropel1@rochester.rr.com wrote:
>> On the first try I neglected to read the directions and increased the
>> number of devices first (which worked) and then attempted to add the
>> physical device (which didn't work; at least not the way I intended).
>
> Thanks, this is exactly the sort of feedback I was hoping for - people
> testing things that I didn't think to...
>
>>   mdadm --create -l5 -n3 /dev/md0 /dev/sda /dev/sdb /dev/sdc
>>
>>     md0 : active raid5 sda[0] sdc[2] sdb[1]
>>           2097024 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>>
>>   mdadm --grow -n4 /dev/md0
>>
>>     md0 : active raid5 sda[0] sdc[2] sdb[1]
>>           3145536 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
>
> I assume that no "resync" started at this point?  It should have done.

Actually, it did start a resync. Sorry, I should have mentioned that. I 
waited until the resync completed before I issued the 'mdadm --add' 
command.

>>     md0 : active raid5 sdd[3] sdc[2] sdb[1] sda[0]
>>           2097024 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>>                                 ...should this be... --> [4/3]
>> [UUU_] perhaps?
>
> Well, part of the array is "4/4 UUUU" and part is "3/3 UUU".  How do
> you represent that?  I think "4/4 UUUU" is best.

I see your point. I was expecting some indication that my array was
vulnerable and that the new disk was not fully utilized yet. I guess the 
resync in progress indicator is sufficient.

>> My final test was a repeat of #2, but with data actively being
>> written
>> to the array during the reshape (the previous tests were on an idle,
>> unmounted array). This one failed pretty hard, with several processes
>> ending up in the D state.
>
> Hmmm... I tried similar things but didn't get this deadlock.  Somehow
> the fact that mdadm is holding the reconfig_sem semaphore means that
> some IO cannot proceed and so mdadm cannot grab and resize all the
> stripe heads... I'll have to look more deeply into this.

For what it's worth, I'm using the Buslogic SCSI driver for the disks in 
the array.

>> I'm happy to do more tests. It's easy to conjure up virtual disks and
>> load them with irrelevant data (like kernel trees ;)
>
> Great.  I'll probably be putting out a new patch set  late this week
> or early next.  Hopefully it will fix the issues you found and you
> can try it again..

Looking forward to it...

--Adam


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 12:54                           ` Ville Herva
  2006-01-23 13:00                             ` Steinar H. Gunderson
  2006-01-23 13:54                             ` Heinz Mauelshagen
@ 2006-01-24  2:02                             ` Phillip Susi
  2 siblings, 0 replies; 71+ messages in thread
From: Phillip Susi @ 2006-01-24  2:02 UTC (permalink / raw)
  To: vherva
  Cc: Heinz Mauelshagen, Lars Marowsky-Bree, Neil Brown,
	Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

Ville Herva wrote:
> PS: Speaking of debugging failing initrd init scripts; it would be nice if
> the kernel gave an error message on wrong initrd format rather than silently
> failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.
>   

LOL, yea, that one got me too when I was first getting back into linux a 
few months ago and had to customize my initramfs to include dmraid to 
recognize my hardware fakeraid raid0.  Then I discovered the mkinitramfs 
utility which makes things much nicer ;)



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 13:54                             ` Heinz Mauelshagen
@ 2006-01-23 17:33                               ` Ville Herva
  0 siblings, 0 replies; 71+ messages in thread
From: Ville Herva @ 2006-01-23 17:33 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Lars Marowsky-Bree, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Mon, Jan 23, 2006 at 02:54:28PM +0100, you [Heinz Mauelshagen] wrote:
> > 
> > It is very tedious to have to debug a production system for a few hours in
> > order to get the rootfs mounted after each kernel update. 
> > 
> > The lvm error messages give almost no clue on the problem. 
> > 
> > Worse yet, problem reports on these issues are completely ignored on the lvm
> > mailing list, even when a patch is attached.
> > 
> > (See
> >  http://marc.theaimsgroup.com/?l=linux-lvm&m=113775502821403&w=2
> >  http://linux.msede.com/lvm_mlist/archive/2001/06/0205.html
> >  http://linux.msede.com/lvm_mlist/archive/2001/06/0271.html
> >  for reference.)
> 
> Hrm, those are initscripts related, not lvm directly

With the ancient LVM1 issue, my main problem was indeed that mkinitrd did
not reserve enough space for the initrd. The LVM issue I posted to the LVM
list was that LVM userland (vg_cfgbackup.c) did not check for errors while
writing to the fs. The (ignored) patch added some error checking.

But that's ancient, I think we can forget about that.

The current issue (please see the first link) is about the need to add
a "sleep 5" between 
 lvm vgmknodes
and
 mount -o defaults --ro -t ext3 /dev/root /sysroot 
. 

Otherwise, mounting fails. (Actually, I added "sleep 5" after every lvm
command in the init script and did not narrow it down any more, since this
was a production system, each boot took ages, and I had to get the system up
as soon as possible.)
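
For reference, the relevant part of the initrd init script ends up
looking roughly like this (simplified; the lvm commands before
vgmknodes are whatever the distro's mkinitrd generates, so take those
two lines as an approximation):

 lvm vgscan
 lvm vgchange -ay
 lvm vgmknodes
 sleep 5
 mount -o defaults --ro -t ext3 /dev/root /sysroot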

To me it seemed some kind of problem with the lvm utilities, not with the
initscripts. At least, the correct solution cannot be adding "sleep 5" here
and there in the initscripts...
 
> Alright.
> Is the initscript issue fixed now or still open?

It is still open.

Sadly, the only two systems this currently happens on are production boxes and
I cannot boot them at will for debugging. It is, however, 100% reproducible
and I can try reasonable suggestions when I boot them the next time. Sorry
about this.

> Had you filed a bug against the distro's initscripts?

No, since I wasn't sure the problem actually was in the initscript. Perhaps
it does do something wrong, but the "sleep 5" workaround is pretty
suspicious.

Thanks for the reply.



-- v -- 

v@iki.fi

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 12:54                           ` Ville Herva
  2006-01-23 13:00                             ` Steinar H. Gunderson
@ 2006-01-23 13:54                             ` Heinz Mauelshagen
  2006-01-23 17:33                               ` Ville Herva
  2006-01-24  2:02                             ` Phillip Susi
  2 siblings, 1 reply; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-23 13:54 UTC (permalink / raw)
  To: Ville Herva
  Cc: Heinz Mauelshagen, Lars Marowsky-Bree, Neil Brown, Phillip Susi,
	Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Mon, Jan 23, 2006 at 02:54:20PM +0200, Ville Herva wrote:
> On Mon, Jan 23, 2006 at 10:44:18AM +0100, you [Heinz Mauelshagen] wrote:
> > > 
> > > I use the regularly to play with md and other stuff...
> > 
> > Me too but for production, I want to avoid the
> > additional stacking overhead and complexity.
> > 
> > > So I remain unconvinced that code duplication is worth it for more than
> > > "hark we want it so!" ;-)
> > 
> > Shall I remove you from the list of potential testers of dm-raid45 then ;-)
> 
> Heinz, 
> 
> If you really want the rest of us to convert from md to lvm, you should
> perhaps give some attention to the brittle userland (scripts and
> binaries).

Sure :-)

> 
> It is very tedious to have to debug a production system for a few hours in
> order to get the rootfs mounted after each kernel update. 
> 
> The lvm error messages give almost no clue on the problem. 
> 
> Worse yet, problem reports on these issues are completely ignored on the lvm
> mailing list, even when a patch is attached.
> 
> (See
>  http://marc.theaimsgroup.com/?l=linux-lvm&m=113775502821403&w=2
>  http://linux.msede.com/lvm_mlist/archive/2001/06/0205.html
>  http://linux.msede.com/lvm_mlist/archive/2001/06/0271.html
>  for reference.)

Hrm, those are initscripts related, not lvm directly

> 
> Such experience gives an impression lvm is not yet ready for serious
> production use.

initscripts/initramfs surely need to do the right thing
in case root is on lvm.

> 
> No offense intended; the lvm kernel code (lvm1 or lvm2) has never given me
> trouble, and is probably as solid as anything.

Alright.
Is the initscript issue fixed now or still open?
Had you filed a bug against the distro's initscripts?

> 
> 
> -- v -- 
> 
> v@iki.fi
> 
> PS: Speaking of debugging failing initrd init scripts; it would be nice if
> the kernel gave an error message on wrong initrd format rather than silently
> failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.

-- 

Regards,
Heinz    -- The LVM Guy --

*** Software bugs are stupid.
    Nevertheless it needs not so stupid people to solve them ***

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 12:54                           ` Ville Herva
@ 2006-01-23 13:00                             ` Steinar H. Gunderson
  2006-01-23 13:54                             ` Heinz Mauelshagen
  2006-01-24  2:02                             ` Phillip Susi
  2 siblings, 0 replies; 71+ messages in thread
From: Steinar H. Gunderson @ 2006-01-23 13:00 UTC (permalink / raw)
  To: Ville Herva
  Cc: Heinz Mauelshagen, Lars Marowsky-Bree, Neil Brown, Phillip Susi,
	Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel

On Mon, Jan 23, 2006 at 02:54:20PM +0200, Ville Herva wrote:
> If you really want the rest of us to convert from md to lvm, you should
> perhaps give some attention to the brittle userland (scripts and
> binaries).

If you do not like the LVM userland, you might want to try the EVMS userland,
which uses the same kernel code and (mostly) the same on-disk formats, but
has a different front-end.

> It is very tedious to have to debug a production system for a few hours in
> order to get the rootfs mounted after each kernel update. 

This sounds a bit like an issue with your distribution, which should normally
fix initrd/initramfs issues for you.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23  9:44                         ` Heinz Mauelshagen
  2006-01-23 10:26                           ` Lars Marowsky-Bree
@ 2006-01-23 12:54                           ` Ville Herva
  2006-01-23 13:00                             ` Steinar H. Gunderson
                                               ` (2 more replies)
  1 sibling, 3 replies; 71+ messages in thread
From: Ville Herva @ 2006-01-23 12:54 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Lars Marowsky-Bree, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Mon, Jan 23, 2006 at 10:44:18AM +0100, you [Heinz Mauelshagen] wrote:
> > 
> > I use the regularly to play with md and other stuff...
> 
> Me too but for production, I want to avoid the
> additional stacking overhead and complexity.
> 
> > So I remain unconvinced that code duplication is worth it for more than
> > "hark we want it so!" ;-)
> 
> Shall I remove you from the list of potential testers of dm-raid45 then ;-)

Heinz, 

If you really want the rest of us to convert from md to lvm, you should
perhaps give some attention to the brittle userland (scripts and
binaries).

It is very tedious to have to debug a production system for a few hours in
order to get the rootfs mounted after each kernel update. 

The lvm error messages give almost no clue on the problem. 

Worse yet, problem reports on these issues are completely ignored on the lvm
mailing list, even when a patch is attached.

(See
 http://marc.theaimsgroup.com/?l=linux-lvm&m=113775502821403&w=2
 http://linux.msede.com/lvm_mlist/archive/2001/06/0205.html
 http://linux.msede.com/lvm_mlist/archive/2001/06/0271.html
 for reference.)

Such experience gives an impression lvm is not yet ready for serious
production use.

No offense intended; the lvm kernel code (lvm1 or lvm2) has never given me
trouble, and is probably as solid as anything.


-- v -- 

v@iki.fi

PS: Speaking of debugging failing initrd init scripts; it would be nice if
the kernel gave an error message on wrong initrd format rather than silently
failing... Yes, I forgot to make the cpio with the "-H newc" option :-/.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 10:45                               ` Lars Marowsky-Bree
@ 2006-01-23 11:00                                 ` Heinz Mauelshagen
  0 siblings, 0 replies; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-23 11:00 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Heinz Mauelshagen, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Mon, Jan 23, 2006 at 11:45:22AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-23T11:38:51, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:
> 
> > > Ok, I still didn't get that. I must be slow.
> > > 
> > > Did you implement some DM-internal stacking now to avoid the above
> > > mentioned complexity? 
> > > 
> > > Otherwise, even DM-on-DM is still stacked via the block device
> > > abstraction...
> > 
> > No, not necessary because a single-level raid4/5 mapping will do it.
> > Ie. it supports <offset> parameters in the constructor as other targets
> > do as well (eg. mirror or linear).
> 
> How would a dm-md wrapper not support such a basic feature (which is easily
> added to md too)?
> 
> I mean, "I'm rewriting it because I want to and because I understand and
> own the code then" is a perfectly legitimate reason

Sure :-)

>, but let's please
> not pretend there's really sound and good technical reasons ;-)

Mind you, there's no need to argue about that:
this is based on requests to do it.

> 
> 
> Sincerely,
>     Lars Marowsky-Brée
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

-- 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 10:38                             ` Heinz Mauelshagen
@ 2006-01-23 10:45                               ` Lars Marowsky-Bree
  2006-01-23 11:00                                 ` Heinz Mauelshagen
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-23 10:45 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Neil Brown, Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-23T11:38:51, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:

> > Ok, I still didn't get that. I must be slow.
> > 
> > Did you implement some DM-internal stacking now to avoid the above
> > mentioned complexity? 
> > 
> > Otherwise, even DM-on-DM is still stacked via the block device
> > abstraction...
> 
> No, not necessary because a single-level raid4/5 mapping will do it.
> Ie. it supports <offset> parameters in the constructor as other targets
> do as well (eg. mirror or linear).

How would a dm-md wrapper not support such a basic feature (which is easily
added to md too)?

I mean, "I'm rewriting it because I want to and because I understand and
own the code then" is a perfectly legitimate reason, but let's please
not pretend there's really sound and good technical reasons ;-)


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23 10:26                           ` Lars Marowsky-Bree
@ 2006-01-23 10:38                             ` Heinz Mauelshagen
  2006-01-23 10:45                               ` Lars Marowsky-Bree
  0 siblings, 1 reply; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-23 10:38 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Heinz Mauelshagen, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Mon, Jan 23, 2006 at 11:26:01AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-23T10:44:18, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:
> 
> > > Besides, stacking between dm devices so far (ie, if I look how kpartx
> > > does it, or LVM2 on top of MPIO etc, which works just fine) is via the
> > > block device layer anyway - and nothing stops you from putting md on top
> > > of LVM2 LVs either.
> > > 
> > > I use them regularly to play with md and other stuff...
> > 
> > Me too but for production, I want to avoid the
> > additional stacking overhead and complexity.
> 
> Ok, I still didn't get that. I must be slow.
> 
> Did you implement some DM-internal stacking now to avoid the above
> mentioned complexity? 
> 
> Otherwise, even DM-on-DM is still stacked via the block device
> abstraction...

No, not necessary because a single-level raid4/5 mapping will do it.
Ie. it supports <offset> parameters in the constructor as other targets
do as well (eg. mirror or linear).
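
(For reference — a minimal sketch using the existing linear target, not the
dm-raid45 constructor itself; device name and sector counts are only examples:

    # map 1 GiB of /dev/sdb, starting 2048 sectors in, as /dev/mapper/shifted
    echo "0 2097152 linear /dev/sdb 2048" | dmsetup create shifted

A raid4/5 target that takes the same kind of <dev> <offset> pairs can be laid
over arbitrary regions of its component devices without an extra stacked layer.)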

> 
> 
> Sincerely,
>     Lars Marowsky-Brée
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

-- 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-23  9:44                         ` Heinz Mauelshagen
@ 2006-01-23 10:26                           ` Lars Marowsky-Bree
  2006-01-23 10:38                             ` Heinz Mauelshagen
  2006-01-23 12:54                           ` Ville Herva
  1 sibling, 1 reply; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-23 10:26 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Neil Brown, Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-23T10:44:18, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:

> > Besides, stacking between dm devices so far (ie, if I look how kpartx
> > does it, or LVM2 on top of MPIO etc, which works just fine) is via the
> > block device layer anyway - and nothing stops you from putting md on top
> > of LVM2 LVs either.
> > 
> > I use them regularly to play with md and other stuff...
> 
> Me too but for production, I want to avoid the
> additional stacking overhead and complexity.

Ok, I still didn't get that. I must be slow.

Did you implement some DM-internal stacking now to avoid the above
mentioned complexity? 

Otherwise, even DM-on-DM is still stacked via the block device
abstraction...


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-21  0:13                       ` Lars Marowsky-Bree
@ 2006-01-23  9:44                         ` Heinz Mauelshagen
  2006-01-23 10:26                           ` Lars Marowsky-Bree
  2006-01-23 12:54                           ` Ville Herva
  0 siblings, 2 replies; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-23  9:44 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Heinz Mauelshagen, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Sat, Jan 21, 2006 at 01:13:11AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-21T01:08:06, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:
> 
> > > A dm-md wrapper would give you the same?
> > No, we'd need more complex stacking to achieve such mappings.
> > Think lvm2 and logical volume level raid5.
> 
> How would you not get that if you had a wrapper around md which made it
> into an dm personality/target?

You could with deeper stacking. That's why I mentioned it above.

> 
> Besides, stacking between dm devices so far (ie, if I look how kpartx
> does it, or LVM2 on top of MPIO etc, which works just fine) is via the
> block device layer anyway - and nothing stops you from putting md on top
> of LVM2 LVs either.
> 
> I use them regularly to play with md and other stuff...

Me too but for production, I want to avoid the
additional stacking overhead and complexity.

> 
> So I remain unconvinced that code duplication is worth it for more than
> "hark we want it so!" ;-)

Shall I remove you from the list of potential testers of dm-raid45 then ;-)

> 
> 

-- 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 16:15 ` Christoph Hellwig
@ 2006-01-22  6:45   ` Herbert Poetzl
  0 siblings, 0 replies; 71+ messages in thread
From: Herbert Poetzl @ 2006-01-22  6:45 UTC (permalink / raw)
  To: Christoph Hellwig, Hubert Tonneau, alan, linux-kernel, neilb

On Fri, Jan 20, 2006 at 04:15:50PM +0000, Christoph Hellwig wrote:
> On Fri, Jan 20, 2006 at 05:01:06PM +0000, Hubert Tonneau wrote:
> > In the U160 category, the symbios driver passed all possible stress tests
> > (partly bad drives that require the driver to properly reset and restart),
> > but in the U320 category, neither the Fusion nor the AIC79xx did.
> 
> Please report any fusion problems to Eric Moore at LSI, the Adaptec
> driver must unfortunately be considered unmaintained.

Wasn't Justin T. Gibbs maintaining this driver for
some time, and who is doing the drivers/updates
published on the Adaptec site?

http://www.adaptec.com/worldwide/support/driversbycat.jsp?sess=no&language=English+US&cat=%2FOperating+System%2FLinux+Driver+Source+Code

best,
Herbert


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-21  0:08                     ` Heinz Mauelshagen
@ 2006-01-21  0:13                       ` Lars Marowsky-Bree
  2006-01-23  9:44                         ` Heinz Mauelshagen
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-21  0:13 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Neil Brown, Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-21T01:08:06, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:

> > A dm-md wrapper would give you the same?
> No, we'd need more complex stacking to achieve such mappings.
> Think lvm2 and logical volume level raid5.

How would you not get that if you had a wrapper around md which made it
into an dm personality/target?

Besides, stacking between dm devices so far (ie, if I look how kpartx
does it, or LVM2 on top of MPIO etc, which works just fine) is via the
block device layer anyway - and nothing stops you from putting md on top
of LVM2 LVs either.

I use them regularly to play with md and other stuff...

So I remain unconvinced that code duplication is worth it for more than
"hark we want it so!" ;-)




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-21  0:03                   ` Lars Marowsky-Bree
@ 2006-01-21  0:08                     ` Heinz Mauelshagen
  2006-01-21  0:13                       ` Lars Marowsky-Bree
  0 siblings, 1 reply; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-21  0:08 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Heinz Mauelshagen, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Sat, Jan 21, 2006 at 01:03:44AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-21T01:01:42, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:
> 
> > > Why not provide a dm-md wrapper which could then
> > > load/interface to all md personalities?
> > As we want to enrich the mapping flexibility (ie, multi-segment fine grained
> > mappings) of dm by adding targets as we go, a certain degree and transitional
> > existence of duplicate code is the price to gain that flexibility.
> 
> A dm-md wrapper would give you the same?

No, we'd need more complex stacking to achieve such mappings.
Think lvm2 and logical volume level raid5.

> 
> 
> Sincerely,
>     Lars Marowsky-Brée
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

-- 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 22:09                   ` Lars Marowsky-Bree
@ 2006-01-21  0:06                     ` Heinz Mauelshagen
  0 siblings, 0 replies; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-21  0:06 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Heinz Mauelshagen, Phillip Susi, Neil Brown, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Fri, Jan 20, 2006 at 11:09:51PM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-20T19:38:40, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:
> 
> > > However, rewriting the RAID personalities for DM is a thing only a fool
> > > would do without really good cause.
> > 
> > Thanks Lars ;)
> 
> Well, I assume you have a really good cause then, don't you? ;-)

Well, I'll share your assumption ;-)

> 
> 
> Sincerely,
>     Lars Marowsky-Brée
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

-- 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-21  0:01                 ` Heinz Mauelshagen
@ 2006-01-21  0:03                   ` Lars Marowsky-Bree
  2006-01-21  0:08                     ` Heinz Mauelshagen
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-21  0:03 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Neil Brown, Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-21T01:01:42, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:

> > Why not provide a dm-md wrapper which could then
> > load/interface to all md personalities?
> As we want to enrich the mapping flexibility (ie, multi-segment fine-grained
> mappings) of dm by adding targets as we go, a certain amount of transitional
> duplicate code is the price to gain that flexibility.

A dm-md wrapper would give you the same?


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 22:57               ` Lars Marowsky-Bree
@ 2006-01-21  0:01                 ` Heinz Mauelshagen
  2006-01-21  0:03                   ` Lars Marowsky-Bree
  0 siblings, 1 reply; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-21  0:01 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Heinz Mauelshagen, Neil Brown, Phillip Susi, Jan Engelhardt,
	Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Fri, Jan 20, 2006 at 11:57:24PM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-20T19:36:21, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:
> 
> > > Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> > > raid levels and 'md' interfaces for others.
> > Yes, that's possible but there are recommendations to have a native target
> > for dm to do RAID5, so I started to implement it.
> 
> Can you answer me what the recommendations are based on?

Partner requests.

> 
> I understand wanting to manage both via the same framework, but
> duplicating the code is just ... wrong.
> 
> What's gained by it?
>
> Why not provide a dm-md wrapper which could then
> load/interface to all md personalities?
> 

As we want to enrich the mapping flexibility (ie, multi-segment fine-grained
mappings) of dm by adding targets as we go, a certain amount of transitional
duplicate code is the price to gain that flexibility.

> 
> Sincerely,
>     Lars Marowsky-Brée
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"

Warm regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 18:36             ` Heinz Mauelshagen
@ 2006-01-20 22:57               ` Lars Marowsky-Bree
  2006-01-21  0:01                 ` Heinz Mauelshagen
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-20 22:57 UTC (permalink / raw)
  To: Heinz Mauelshagen, Neil Brown
  Cc: Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-20T19:36:21, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:

> > Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> > raid levels and 'md' interfaces for others.
> Yes, that's possible but there are recommendations to have a native target
> for dm to do RAID5, so I started to implement it.

Can you answer me what the recommendations are based on?

I understand wanting to manage both via the same framework, but
duplicating the code is just ... wrong.

What's gained by it? Why not provide a dm-md wrapper which could then
load/interface to all md personalities?


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 18:38                 ` Heinz Mauelshagen
@ 2006-01-20 22:09                   ` Lars Marowsky-Bree
  2006-01-21  0:06                     ` Heinz Mauelshagen
  0 siblings, 1 reply; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-20 22:09 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: Phillip Susi, Neil Brown, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-20T19:38:40, Heinz Mauelshagen <mauelshagen@redhat.com> wrote:

> > However, rewriting the RAID personalities for DM is a thing only a fool
> > would do without really good cause.
> 
> Thanks Lars ;)

Well, I assume you have a really good cause then, don't you? ;-)


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20  2:17             ` Phillip Susi
  2006-01-20 10:53               ` Lars Marowsky-Bree
@ 2006-01-20 18:41               ` Heinz Mauelshagen
  1 sibling, 0 replies; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-20 18:41 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Neil Brown, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Thu, Jan 19, 2006 at 09:17:12PM -0500, Phillip Susi wrote:
> Neil Brown wrote:
> 
> >Maybe the problem here is thinking of md and dm as different things.
> >Try just not thinking of them at all.  
> >Think about it like this:
> > The linux kernel supports lvm
> > The linux kernel supports multipath
> > The linux kernel supports snapshots
> > The linux kernel supports raid0
> > The linux kernel supports raid1
> > The linux kernel supports raid5
> >
> >Use the bits that you want, and not the bits that you don't.
> >
> >dm and md are just two different interface styles to various bits of
> >this.  Neither is clearly better than the other, partly because
> >different people have different tastes.
> >
> >Maybe what you really want is for all of these functions to be managed
> >under the one umbrella application.  I think that is what EVMS tried to
> >do. 
> >
> > 
> >
> 
> I am under the impression that dm is simpler/cleaner than md.  That 
> impression very well may be wrong, but if it is simpler, then that's a 
> good thing. 
> 
> 
> >One big selling point that 'dm' has is 'dmraid' - a tool that allows
> >you to use a lot of 'fakeraid' cards.  People would like dmraid to
> >work with raid5 as well, and that is a good goal.
> > 
> >
> 
> AFAIK, the hardware fakeraid solutions on the market don't support raid5 
> anyhow ( at least mine doesn't ), so dmraid won't either. 

Well, some do (eg, Nvidia).

> 
> >However it doesn't mean that dm needs to get its own raid5
> >implementation or that md/raid5 needs to be merged with dm.
> >It can be achieved by giving md/raid5 the right interfaces so that
> >metadata can be managed from userspace (and I am nearly there).
> >Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> >raid levels and 'md' interfaces for others.
> >
> 
> Having two sets of interfaces and retrofitting a new interface onto a
> system that wasn't designed for it seems likely to bloat the kernel with 
> complex code.  I don't really know if that is the case because I have 
> not studied the code, but that's the impression I get, and if it's 
> right, then I'd say it is better to stick with dm rather than retrofit 
> md.  In either case, it seems overly complex to have to deal with both. 

I agree, but dm will need to mature before it'll be able to replace md.

> 
> 

-- 

Regards,
Heinz    -- The LVM Guy --

*** Software bugs are stupid.
    Nevertheless it needs not so stupid people to solve them ***

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 10:53               ` Lars Marowsky-Bree
  2006-01-20 12:06                 ` Jens Axboe
@ 2006-01-20 18:38                 ` Heinz Mauelshagen
  2006-01-20 22:09                   ` Lars Marowsky-Bree
  1 sibling, 1 reply; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-20 18:38 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Phillip Susi, Neil Brown, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Fri, Jan 20, 2006 at 11:53:06AM +0100, Lars Marowsky-Bree wrote:
> On 2006-01-19T21:17:12, Phillip Susi <psusi@cfl.rr.com> wrote:
> 
> > I am under the impression that dm is simpler/cleaner than md.  That 
> > impression very well may be wrong, but if it is simpler, then that's a 
> > good thing. 
> 
> That impression is wrong in that general form. Both have advantages and
> disadvantages.
> 
> I've been an advocate of seeing both of them merged, mostly because I
> think it would be beneficial if they'd share the same interface to
> user-space to make the tools easier to write and maintain.
> 
> However, rewriting the RAID personalities for DM is a thing only a fool
> would do without really good cause.

Thanks Lars ;)

> Sure, everybody can write a
> RAID5/RAID6 parity algorithm. But getting the failure/edge cases stable
> is not trivial and requires years of maturing.
> 
> Which is why I think gentle evolution of both source bases towards some
> common API (for example) is much preferable to reinventing one within
> the other.
> 
> Oversimplifying to "dm is better than md" is just stupid.
> 
> 
> 
> Sincerely,
>     Lars Marowsky-Brée
> 
> -- 
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
> "Ignorance more frequently begets confidence than does knowledge"
> 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 23:43           ` Neil Brown
  2006-01-20  2:17             ` Phillip Susi
  2006-01-20 17:29             ` Ross Vandegrift
@ 2006-01-20 18:36             ` Heinz Mauelshagen
  2006-01-20 22:57               ` Lars Marowsky-Bree
  2 siblings, 1 reply; 71+ messages in thread
From: Heinz Mauelshagen @ 2006-01-20 18:36 UTC (permalink / raw)
  To: Neil Brown
  Cc: Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Fri, Jan 20, 2006 at 10:43:13AM +1100, Neil Brown wrote:
> On Thursday January 19, psusi@cfl.rr.com wrote:
> > Neil Brown wrote:
> > > 
> > > The in-kernel autodetection in md is purely legacy support as far as I
> > > am concerned.  md does volume detection in user space via 'mdadm'.
> > > 
> > > What other "things like" were you thinking of.
> > > 
> > 
> > Oh, I suppose that's true.  Well, another thing is your new mods to 
> > support on the fly reshaping, which dm could do from user space.  Then 
> > of course, there's multipath and snapshots and other lvm things which 
> > you need dm for, so why use both when one will do?  That's my take on it.
> 
> Maybe the problem here is thinking of md and dm as different things.
> Try just not thinking of them at all.  
> Think about it like this:
>   The linux kernel supports lvm
>   The linux kernel supports multipath
>   The linux kernel supports snapshots
>   The linux kernel supports raid0
>   The linux kernel supports raid1
>   The linux kernel supports raid5
> 
> Use the bits that you want, and not the bits that you don't.
> 
> dm and md are just two different interface styles to various bits of
> this.  Neither is clearly better than the other, partly because
> different people have different tastes.
> 
> Maybe what you really want is for all of these functions to be managed
> under the one umbrella application.  I think that is what EVMS tried to
> do. 
> 
> One big selling point that 'dm' has is 'dmraid' - a tool that allows
> you to use a lot of 'fakeraid' cards.  People would like dmraid to
> work with raid5 as well, and that is a good goal.
> However it doesn't mean that dm needs to get its own raid5
> implementation or that md/raid5 needs to be merged with dm.

That's a valid point to make but it can ;)

> It can be achieved by giving md/raid5 the right interfaces so that
> metadata can be managed from userspace (and I am nearly there).

Yeah, and I'm nearly there with a RAID4 and RAID5 target for dm
(which took advantage of the raid address calculation and the bio-to-stripe-cache
copy code of md raid5).

See http://people.redhat.com/heinzm/sw/dm/dm-raid45/dm-raid45_2.6.15_200601201914.patch.bz2 (no Makefile / no Kconfig changes) for early code reference.

> Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
> raid levels and 'md' interfaces for others.

Yes, that's possible but there are recommendations to have a native target
for dm to do RAID5, so I started to implement it.

> 
> NeilBrown

-- 

Regards,
Heinz    -- The LVM Guy --

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Cluster and Storage Development                   56242 Marienrachdorf
                                                  Germany
Mauelshagen@RedHat.com                            +49 2626 141200
                                                       FAX 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
@ 2006-01-20 18:05 Hubert Tonneau
  0 siblings, 0 replies; 71+ messages in thread
From: Hubert Tonneau @ 2006-01-20 18:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: alan, linux-kernel, neilb

Christoph Hellwig wrote:
>
> Please report any fusion problems to Eric Moore at LSI, the Adaptec driver
> must unfortunately be considered unmaintained.

Done several times over more than two years, but new versions did not solve
the problem, even if in 2.6.13 it forwards the problem to MD instead of just
locking the bus.
Also, production quality is something really hard to verify,
since even on production servers it takes several months to trigger some
very rare situations, and on the production server I've finally replaced the
partially faulty disk, so the problem may well never happen again on
that box.
So the problem is probably still there in the driver, maybe not, but I have
no way to validate new drivers.

To be more precise, the last 2.4.xx kernels have a production-quality fusion driver;
only the 2.6.xx driver has problems.

The last point is that if you look at the latest changes in the fusion driver,
they are moving everything around to introduce SAS support, so after more than a
year of unsuccessful reports about a single bug that happens on a production
server, you can understand that my willingness to run tests with potential
production consequences has vanished.

The fusion maintainer is responsive and did his best, but could not achieve
the result, so he may not have received enough help from general kernel
maintainers, or the kernel or the fusion drivers might be getting too
complicated. I stop here because I do not want to start flames. Just a report.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 23:43           ` Neil Brown
  2006-01-20  2:17             ` Phillip Susi
@ 2006-01-20 17:29             ` Ross Vandegrift
  2006-01-20 18:36             ` Heinz Mauelshagen
  2 siblings, 0 replies; 71+ messages in thread
From: Ross Vandegrift @ 2006-01-20 17:29 UTC (permalink / raw)
  To: Neil Brown
  Cc: Phillip Susi, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Fri, Jan 20, 2006 at 10:43:13AM +1100, Neil Brown wrote:
> dm and md are just two different interface styles to various bits of
> this.  Neither is clearly better than the other, partly because
> different people have different tastes.

Here's why it's great to have both: they have different toolkits.  I'm
really familiar with md's toolkit.  I can do most anything I need.
But I'll bet that I've never gotten a pvmove to finish successfully
because I am doing something wrong and I don't know it.

Because we're talking about data integrity, the toolkit issue alone
makes it worth keeping both code paths.  md does 90% of what I need,
so why should I spend the time to learn a new system that doesn't
offer any advantages?

[1] I'm intentionally neglecting the 4k stack issue

-- 
Ross Vandegrift
ross@lug.udel.edu

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
@ 2006-01-20 17:01 Hubert Tonneau
  2006-01-20 16:15 ` Christoph Hellwig
  0 siblings, 1 reply; 71+ messages in thread
From: Hubert Tonneau @ 2006-01-20 17:01 UTC (permalink / raw)
  To: alan, linux-kernel; +Cc: neilb

Neil Brown wrote:
>
> These can be mixed together quite effectively:
> You can have dm/lvm over md/raid1 over dm/multipath
> with no problems.
>
> If there is functionality missing from any of these recommended
> components, then make a noise about it, preferably but not necessarily
> with code, and it will quite possibly be fixed.

Also, it's not Neil's direct problem, but since we are at it, the weakest point
about Linux MD is currently that ...
there is no production-quality U320 SCSI driver for Linux to run MD over!

In the U160 category, the symbios driver passed all possible stress tests
(partly bad drives that require the driver to properly reset and restart),
but in the U320 category, neither the Fusion nor the AIC79xx did.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
@ 2006-01-20 16:48 Hubert Tonneau
  0 siblings, 0 replies; 71+ messages in thread
From: Hubert Tonneau @ 2006-01-20 16:48 UTC (permalink / raw)
  To: neilb; +Cc: linux-kernel

Neil Brown wrote:
>
> These can be mixed together quite effectively:
> You can have dm/lvm over md/raid1 over dm/multipath
> with no problems.
>
> If there is functionality missing from any of these recommended
> components, then make a noise about it, preferably but not necessarily
> with code, and it will quite possibly be fixed.

Cheapest high capacity is now provided through USB-connected external disks.
Of course, it's for very low load.

So, what would be helpful is to have, let's say, 7 useful disks plus 1 for parity
(just like RAID4), but with the result being not one large partition but
seven partitions, one on each disk.

So, in case of one disk failure, you lose no data,
in case of two disk failures, you lose 1/7 of the partitions,
in case of three disk failures, you lose 2/7 of the partitions,
etc., because if the RAID4 is unusable, you can still read each partition
as a non-raid partition.

Somebody suggested that it could be done through LVM, but I failed to find
a way to configure LVM on top of RAID4 or RAID5 that guarantees each
partition's sectors are all consecutive on a single physical disk.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 17:01 Hubert Tonneau
@ 2006-01-20 16:15 ` Christoph Hellwig
  2006-01-22  6:45   ` Herbert Poetzl
  0 siblings, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2006-01-20 16:15 UTC (permalink / raw)
  To: Hubert Tonneau; +Cc: alan, linux-kernel, neilb

On Fri, Jan 20, 2006 at 05:01:06PM +0000, Hubert Tonneau wrote:
> In the U160 category, the symbios driver passed all possible stress tests
> (partly bad drives that require the driver to properly reset and restart),
> but in the U320 category, neither the Fusion nor the AIC79xx did.

Please report any fusion problems to Eric Moore at LSI, the Adaptec driver
must unfortunately be considered unmaintained.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20 10:53               ` Lars Marowsky-Bree
@ 2006-01-20 12:06                 ` Jens Axboe
  2006-01-20 18:38                 ` Heinz Mauelshagen
  1 sibling, 0 replies; 71+ messages in thread
From: Jens Axboe @ 2006-01-20 12:06 UTC (permalink / raw)
  To: Lars Marowsky-Bree
  Cc: Phillip Susi, Neil Brown, Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Fri, Jan 20 2006, Lars Marowsky-Bree wrote:
> Oversimplifying to "dm is better than md" is just stupid.

Indeed. But "generally" md is faster and more efficient in the way it
handles I/Os; it doesn't do any splitting unless it has to.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-20  2:17             ` Phillip Susi
@ 2006-01-20 10:53               ` Lars Marowsky-Bree
  2006-01-20 12:06                 ` Jens Axboe
  2006-01-20 18:38                 ` Heinz Mauelshagen
  2006-01-20 18:41               ` Heinz Mauelshagen
  1 sibling, 2 replies; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-20 10:53 UTC (permalink / raw)
  To: Phillip Susi, Neil Brown
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-19T21:17:12, Phillip Susi <psusi@cfl.rr.com> wrote:

> I am under the impression that dm is simpler/cleaner than md.  That 
> impression very well may be wrong, but if it is simpler, then that's a 
> good thing. 

That impression is wrong in that general form. Both have advantages and
disadvantages.

I've been an advocate of seeing both of them merged, mostly because I
think it would be beneficial if they'd share the same interface to
user-space to make the tools easier to write and maintain.

However, rewriting the RAID personalities for DM is a thing only a fool
would do without really good cause. Sure, everybody can write a
RAID5/RAID6 parity algorithm. But getting the failure/edge cases stable
is not trivial and requires years of maturing.

Which is why I think gentle evolution of both source bases towards some
common API (for example) is much preferable to reinventing one within
the other.

Oversimplifying to "dm is better than md" is just stupid.



Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 23:43           ` Neil Brown
@ 2006-01-20  2:17             ` Phillip Susi
  2006-01-20 10:53               ` Lars Marowsky-Bree
  2006-01-20 18:41               ` Heinz Mauelshagen
  2006-01-20 17:29             ` Ross Vandegrift
  2006-01-20 18:36             ` Heinz Mauelshagen
  2 siblings, 2 replies; 71+ messages in thread
From: Phillip Susi @ 2006-01-20  2:17 UTC (permalink / raw)
  To: Neil Brown
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

Neil Brown wrote:

>Maybe the problem here is thinking of md and dm as different things.
>Try just not thinking of them at all.  
>Think about it like this:
>  The linux kernel supports lvm
>  The linux kernel supports multipath
>  The linux kernel supports snapshots
>  The linux kernel supports raid0
>  The linux kernel supports raid1
>  The linux kernel supports raid5
>
>Use the bits that you want, and not the bits that you don't.
>
>dm and md are just two different interface styles to various bits of
>this.  Neither is clearly better than the other, partly because
>different people have different tastes.
>
>Maybe what you really want is for all of these functions to be managed
>under the one umbrella application.  I think that is what EVMS tried to
>do. 
>
>  
>

I am under the impression that dm is simpler/cleaner than md.  That 
impression very well may be wrong, but if it is simpler, then that's a 
good thing. 


>One big selling point that 'dm' has is 'dmraid' - a tool that allows
>you to use a lot of 'fakeraid' cards.  People would like dmraid to
>work with raid5 as well, and that is a good goal.
>  
>

AFAIK, the hardware fakeraid solutions on the market don't support raid5 
anyhow ( at least mine doesn't ), so dmraid won't either. 

>However it doesn't mean that dm needs to get its own raid5
>implementation or that md/raid5 needs to be merged with dm.
>It can be achieved by giving md/raid5 the right interfaces so that
>metadata can be managed from userspace (and I am nearly there).
>Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
>raid levels and 'md' interfaces for others.
>

Having two sets of interfaces and retrofitting a new interface onto a 
system that wasn't designed for it seems likely to bloat the kernel with 
complex code.  I don't really know if that is the case because I have 
not studied the code, but that's the impression I get, and if it's 
right, then I'd say it is better to stick with dm rather than retrofit 
md.  In either case, it seems overly complex to have to deal with both. 



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 23:26         ` Phillip Susi
@ 2006-01-19 23:43           ` Neil Brown
  2006-01-20  2:17             ` Phillip Susi
                               ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Neil Brown @ 2006-01-19 23:43 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Thursday January 19, psusi@cfl.rr.com wrote:
> Neil Brown wrote:
> > 
> > The in-kernel autodetection in md is purely legacy support as far as I
> > am concerned.  md does volume detection in user space via 'mdadm'.
> > 
> > What other "things like" were you thinking of.
> > 
> 
> Oh, I suppose that's true.  Well, another thing is your new mods to 
> support on the fly reshaping, which dm could do from user space.  Then 
> of course, there's multipath and snapshots and other lvm things which 
> you need dm for, so why use both when one will do?  That's my take on it.

Maybe the problem here is thinking of md and dm as different things.
Try just not thinking of them at all.  
Think about it like this:
  The linux kernel supports lvm
  The linux kernel supports multipath
  The linux kernel supports snapshots
  The linux kernel supports raid0
  The linux kernel supports raid1
  The linux kernel supports raid5

Use the bits that you want, and not the bits that you don't.

dm and md are just two different interface styles to various bits of
this.  Neither is clearly better than the other, partly because
different people have different tastes.

Maybe what you really want is for all of these functions to be managed
under the one umbrella application.  I think that is what EVMS tried to
do. 

One big selling point that 'dm' has is 'dmraid' - a tool that allows
you to use a lot of 'fakeraid' cards.  People would like dmraid to
work with raid5 as well, and that is a good goal.
However it doesn't mean that dm needs to get its own raid5
implementation or that md/raid5 needs to be merged with dm.
It can be achieved by giving md/raid5 the right interfaces so that
metadata can be managed from userspace (and I am nearly there).
Then 'dmraid' (or a similar tool) can use 'dm' interfaces for some
raid levels and 'md' interfaces for others.

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 22:32       ` Neil Brown
@ 2006-01-19 23:26         ` Phillip Susi
  2006-01-19 23:43           ` Neil Brown
  0 siblings, 1 reply; 71+ messages in thread
From: Phillip Susi @ 2006-01-19 23:26 UTC (permalink / raw)
  To: Neil Brown
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

Neil Brown wrote:
> 
> The in-kernel autodetection in md is purely legacy support as far as I
> am concerned.  md does volume detection in user space via 'mdadm'.
> 
> What other "things like" were you thinking of.
> 

Oh, I suppose that's true.  Well, another thing is your new mods to 
support on the fly reshaping, which dm could do from user space.  Then 
of course, there's multipath and snapshots and other lvm things which 
you need dm for, so why use both when one will do?  That's my take on it.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 22:17     ` Phillip Susi
@ 2006-01-19 22:32       ` Neil Brown
  2006-01-19 23:26         ` Phillip Susi
  0 siblings, 1 reply; 71+ messages in thread
From: Neil Brown @ 2006-01-19 22:32 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Thursday January 19, psusi@cfl.rr.com wrote:
> I'm currently of the opinion that dm needs a raid5 and raid6 module 
> added, then the user land lvm tools fixed to use them, and then you 
> could use dm instead of md.  The benefit being that dm pushes things 
> like volume autodetection and management out of the kernel to user space 
> where it belongs.  But that's just my opinion...

The in-kernel autodetection in md is purely legacy support as far as I
am concerned.  md does volume detection in user space via 'mdadm'.

What other "things like" were you thinking of.

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-18 23:19   ` Neil Brown
  2006-01-19 15:33     ` Mark Hahn
  2006-01-19 20:12     ` Jan Engelhardt
@ 2006-01-19 22:17     ` Phillip Susi
  2006-01-19 22:32       ` Neil Brown
  2 siblings, 1 reply; 71+ messages in thread
From: Phillip Susi @ 2006-01-19 22:17 UTC (permalink / raw)
  To: Neil Brown
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

I'm currently of the opinion that dm needs a raid5 and raid6 module 
added, then the user land lvm tools fixed to use them, and then you 
could use dm instead of md.  The benefit being that dm pushes things 
like volume autodetection and management out of the kernel to user space 
where it belongs.  But that's just my opinion...


I'm using dm at home because I have a SATA hardware fakeraid RAID-0 
between two WD 10,000 RPM Raptors, and the dmraid utility correctly 
recognizes that and configures device-mapper to use it. 
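
(A minimal sketch of that workflow, assuming a stock dmraid install; the
activated device names depend on the vendor metadata found on the disks:

    dmraid -r     # list block devices carrying fakeraid (vendor RAID) metadata
    dmraid -ay    # activate all discovered RAID sets as /dev/mapper/* devices

After activation the set is just another device-mapper block device that can
be partitioned and mounted as usual.)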


Neil Brown wrote:
> Which bits?
> Why?
>
> My current opinion is that you should:
>
>  Use md for raid1, raid5, raid6 - anything with redundancy.
>  Use dm for multipath, crypto, linear, LVM, snapshot
>  Use either for raid0 (I don't think dm has particular advantages
>      over md or md over dm).
>
> These can be mixed together quite effectively:
>   You can have dm/lvm over md/raid1 over dm/multipath
> with no problems.
>
> If there is functionality missing from any of these recommended
> components, then make a noise about it, preferably but not necessarily
> with code, and it will quite possibly be fixed.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 000 of 5] md: Introduction
  2006-01-19 20:12     ` Jan Engelhardt
@ 2006-01-19 21:22       ` Lars Marowsky-Bree
  0 siblings, 0 replies; 71+ messages in thread
From: Lars Marowsky-Bree @ 2006-01-19 21:22 UTC (permalink / raw)
  To: Jan Engelhardt, Neil Brown
  Cc: Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On 2006-01-19T21:12:02, Jan Engelhardt <jengelh@linux01.gwdg.de> wrote:

> > Use md for raid1, raid5, raid6 - anything with redundancy.
> > Use dm for multipath, crypto, linear, LVM, snapshot
> There are pairs of files that look like they would do the same thing:
> 
>   raid1.c  <-> dm-raid1.c
>   linear.c <-> dm-linear.c

Sure there's some historical overlap. It'd make sense if DM used the md
raid personalities, yes.


Sincerely,
    Lars Marowsky-Brée

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business	 -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
  2006-01-18 23:19   ` Neil Brown
  2006-01-19 15:33     ` Mark Hahn
@ 2006-01-19 20:12     ` Jan Engelhardt
  2006-01-19 21:22       ` Lars Marowsky-Bree
  2006-01-19 22:17     ` Phillip Susi
  2 siblings, 1 reply; 71+ messages in thread
From: Jan Engelhardt @ 2006-01-19 20:12 UTC (permalink / raw)
  To: Neil Brown
  Cc: Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

>> >personally, I think this is useful functionality, but my personal
>> >preference is that this would be in DM/LVM2 rather than MD.  but given
>> >Neil is the MD author/maintainer, I can see why he'd prefer to do it in
>> >MD. :)
>> 
>> Why don't MD and DM merge some bits?
>
>Which bits?
>Why?
>
>My current opinion is that you should:
>
> Use md for raid1, raid5, raid6 - anything with redundancy.
> Use dm for multipath, crypto, linear, LVM, snapshot

There are pairs of files that look like they would do the same thing:

  raid1.c  <-> dm-raid1.c
  linear.c <-> dm-linear.c



Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
  2006-01-18 23:19   ` Neil Brown
@ 2006-01-19 15:33     ` Mark Hahn
  2006-01-19 20:12     ` Jan Engelhardt
  2006-01-19 22:17     ` Phillip Susi
  2 siblings, 0 replies; 71+ messages in thread
From: Mark Hahn @ 2006-01-19 15:33 UTC (permalink / raw)
  To: Neil Brown
  Cc: Jan Engelhardt, Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

>  Use either for raid0 (I don't think dm has particular advantages
>      over md or md over dm).

I measured this a few months ago, and was surprised to find that 
DM raid0 was very noticeably slower than MD raid0.  Same machine,
same disks/controller/kernel/settings/stripe-size.  I didn't try
to find out why, since I usually need redundancy...

regards, mark hahn.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
  2006-01-18 13:27 ` Jan Engelhardt
@ 2006-01-18 23:19   ` Neil Brown
  2006-01-19 15:33     ` Mark Hahn
                       ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Neil Brown @ 2006-01-18 23:19 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Lincoln Dale (ltd),
	Michael Tokarev, linux-raid, linux-kernel, Steinar H. Gunderson

On Wednesday January 18, jengelh@linux01.gwdg.de wrote:
> 
> >personally, I think this is useful functionality, but my personal
> >preference is that this would be in DM/LVM2 rather than MD.  but given
> >Neil is the MD author/maintainer, I can see why he'd prefer to do it in
> >MD. :)
> 
> Why don't MD and DM merge some bits?
> 

Which bits?
Why?

My current opinion is that you should:

 Use md for raid1, raid5, raid6 - anything with redundancy.
 Use dm for multipath, crypto, linear, LVM, snapshot
 Use either for raid0 (I don't think dm has particular advantages
     over md or md over dm).

These can be mixed together quite effectively:
  You can have dm/lvm over md/raid1 over dm/multipath
with no problems.
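
(A minimal sketch of that stack, assuming two multipath maps named
mpatha/mpathb and made-up sizes; the real device names depend on your
multipath configuration:

    # md raid1 mirror across the two multipath devices
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/mapper/mpatha /dev/mapper/mpathb
    # dm/lvm on top of the md mirror
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 10G -n data vg0

Each layer just sees an ordinary block device below it, which is why the
combination works without any special glue.)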

If there is functionality missing from any of these recommended
components, then make a noise about it, preferably but not necessarily
with code, and it will quite possibly be fixed.

NeilBrown

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
  2006-01-17 21:38 Lincoln Dale (ltd)
@ 2006-01-18 13:27 ` Jan Engelhardt
  2006-01-18 23:19   ` Neil Brown
  0 siblings, 1 reply; 71+ messages in thread
From: Jan Engelhardt @ 2006-01-18 13:27 UTC (permalink / raw)
  To: Lincoln Dale (ltd)
  Cc: Michael Tokarev, NeilBrown, linux-raid, linux-kernel,
	Steinar H. Gunderson


>personally, I think this is useful functionality, but my personal
>preference is that this would be in DM/LVM2 rather than MD.  but given
>Neil is the MD author/maintainer, I can see why he'd prefer to do it in
>MD. :)

Why don't MD and DM merge some bits?



Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* RE: [PATCH 000 of 5] md: Introduction
@ 2006-01-17 21:38 Lincoln Dale (ltd)
  2006-01-18 13:27 ` Jan Engelhardt
  0 siblings, 1 reply; 71+ messages in thread
From: Lincoln Dale (ltd) @ 2006-01-17 21:38 UTC (permalink / raw)
  To: Michael Tokarev, NeilBrown; +Cc: linux-raid, linux-kernel, Steinar H. Gunderson

> Neil, is this online resizing/reshaping really needed?  I understand
> all those words mean a lot for marketing people - zero downtime,
> online resizing etc, but it is much safer and easier to do that stuff
> 'offline', on an inactive array, like raidreconf does - safer, easier,
> faster, and one has more possibilities for more complex changes.  It
> isn't like you want to add/remove drives to/from your arrays every
> day...
> A lot of good hw raid cards are unable to perform such reshaping too.

RAID resize/restripe may not be so common with cheap / PC-based RAID
systems, but it is common with midrange and enterprise storage
subsystems from vendors such as EMC, HDS, IBM & HP.
In fact, I'd say it's the exception to the rule _if_ a
midrange/enterprise storage subsystem doesn't have an _online_ resize
capability.

Personally, I think this is useful functionality, but my personal
preference is that this would be in DM/LVM2 rather than MD.  But given
Neil is the MD author/maintainer, I can see why he'd prefer to do it in
MD. :)


cheers,

lincoln.

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2006-01-24  2:02 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-17  6:56 [PATCH 000 of 5] md: Introduction NeilBrown
2006-01-17  6:56 ` [PATCH 001 of 5] md: Split disks array out of raid5 conf structure so it is easier to grow NeilBrown
2006-01-17 14:37   ` John Stoffel
2006-01-19  0:26     ` Neil Brown
2006-01-21  3:37       ` John Stoffel
2006-01-22 22:57         ` Neil Brown
2006-01-17  6:56 ` [PATCH 002 of 5] md: Allow stripes to be expanded in preparation for expanding an array NeilBrown
2006-01-17  6:56 ` [PATCH 003 of 5] md: Infrastructure to allow normal IO to continue while array is expanding NeilBrown
2006-01-17  6:56 ` [PATCH 004 of 5] md: Core of raid5 resize process NeilBrown
2006-01-17  6:56 ` [PATCH 005 of 5] md: Final stages of raid5 expand code NeilBrown
2006-01-17  9:55   ` Sander
2006-01-19  0:32     ` Neil Brown
2006-01-17  8:17 ` [PATCH 000 of 5] md: Introduction Michael Tokarev
2006-01-17  9:50   ` Sander
2006-01-17 11:26     ` Michael Tokarev
2006-01-17 14:03       ` Kyle Moffett
2006-01-19  0:28         ` Neil Brown
2006-01-17 16:08       ` Ross Vandegrift
2006-01-17 18:12         ` Michael Tokarev
2006-01-18  8:14           ` Sander
2006-01-18  9:03             ` Alan Cox
2006-01-19  0:22           ` Neil Brown
2006-01-19  9:01             ` Jakob Oestergaard
2006-01-17 22:38       ` Phillip Susi
2006-01-17 22:57         ` Neil Brown
2006-01-17 14:10   ` Steinar H. Gunderson
2006-01-22  4:42 ` Adam Kropelin
2006-01-22 22:52   ` Neil Brown
2006-01-23 23:02     ` Adam Kropelin
2006-01-23  1:08 ` John Hendrikx
2006-01-23  1:25   ` Neil Brown
2006-01-23  1:54     ` Kyle Moffett
2006-01-17 21:38 Lincoln Dale (ltd)
2006-01-18 13:27 ` Jan Engelhardt
2006-01-18 23:19   ` Neil Brown
2006-01-19 15:33     ` Mark Hahn
2006-01-19 20:12     ` Jan Engelhardt
2006-01-19 21:22       ` Lars Marowsky-Bree
2006-01-19 22:17     ` Phillip Susi
2006-01-19 22:32       ` Neil Brown
2006-01-19 23:26         ` Phillip Susi
2006-01-19 23:43           ` Neil Brown
2006-01-20  2:17             ` Phillip Susi
2006-01-20 10:53               ` Lars Marowsky-Bree
2006-01-20 12:06                 ` Jens Axboe
2006-01-20 18:38                 ` Heinz Mauelshagen
2006-01-20 22:09                   ` Lars Marowsky-Bree
2006-01-21  0:06                     ` Heinz Mauelshagen
2006-01-20 18:41               ` Heinz Mauelshagen
2006-01-20 17:29             ` Ross Vandegrift
2006-01-20 18:36             ` Heinz Mauelshagen
2006-01-20 22:57               ` Lars Marowsky-Bree
2006-01-21  0:01                 ` Heinz Mauelshagen
2006-01-21  0:03                   ` Lars Marowsky-Bree
2006-01-21  0:08                     ` Heinz Mauelshagen
2006-01-21  0:13                       ` Lars Marowsky-Bree
2006-01-23  9:44                         ` Heinz Mauelshagen
2006-01-23 10:26                           ` Lars Marowsky-Bree
2006-01-23 10:38                             ` Heinz Mauelshagen
2006-01-23 10:45                               ` Lars Marowsky-Bree
2006-01-23 11:00                                 ` Heinz Mauelshagen
2006-01-23 12:54                           ` Ville Herva
2006-01-23 13:00                             ` Steinar H. Gunderson
2006-01-23 13:54                             ` Heinz Mauelshagen
2006-01-23 17:33                               ` Ville Herva
2006-01-24  2:02                             ` Phillip Susi
2006-01-20 16:48 Hubert Tonneau
2006-01-20 17:01 Hubert Tonneau
2006-01-20 16:15 ` Christoph Hellwig
2006-01-22  6:45   ` Herbert Poetzl
2006-01-20 18:05 Hubert Tonneau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).