linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE
@ 2007-12-14  6:26 NeilBrown
  2007-12-14  6:26 ` [PATCH 001 of 7] md: Support 'external' metadata for md arrays NeilBrown
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Jens Axboe

Following are 7 md related patches are suitable for the next -mm
and maybe for 2.6.25.

They move towards giving user-space programs more fine control of an
array so that we can add support for more complex metadata formats
(e.g. DDF) without bothering the kernel with such things.

The last patch isn't strictly md related.  It adds an ioctl which
allows mapping from an open file descriptor on a block device to
a name in /sys.  This makes finding name of things in /sys more practical.
As I put this in block-layer code, I have Cc:ed Jens Axboe.

 [PATCH 001 of 7] md: Support 'external' metadata for md arrays.
 [PATCH 002 of 7] md: Give userspace control over removing failed devices when external metdata in use
 [PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing.
 [PATCH 004 of 7] md: Allow devices to be shared between md arrays.
 [PATCH 005 of 7] md: Lock address when changing attributes of component devices.
 [PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has external metadata.
 [PATCH 007 of 7] md: Get name for block device in sysfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 001 of 7] md: Support 'external' metadata for md arrays.
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-25 22:03   ` Andrew Morton
  2007-12-14  6:26 ` [PATCH 002 of 7] md: Give userspace control over removing failed devices when external metdata in use NeilBrown
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


- Add a state flag 'external' to indicate that the metadata is managed
  externally (by user-space) so important changes need to be 
  left of user-space to handle.
  Alternates are non-persistant ('none') where there is no stable metadata -
  after the  array is stopped there is no record of it's status - and 
  internal which can be version 0.90 or version 1.x
  These are selected by writing to the 'metadata' attribute.



- move the updating of superblocks (sync_sbs) to after we have checked if
  there are any superblocks or not.

- New array state 'write_pending'.  This means that the metadata records
  the array as 'clean', but a write has been requested, so the metadata has
  to be updated to record a 'dirty' array before the write can continue.
  This change is reported to md by writing 'active' to the array_state
  attribute.

- tidy up marking of sb_dirty:
   - don't set sb_dirty when resync finishes as md_check_recovery
     calls md_update_sb when the sync thread finishes anyway.
   - Don't set sb_dirty in multipath_run as the array might not be dirty.
   - don't mark superblock dirty when switching to 'clean' if there
     is no internal superblock (if external, userspace can choose to
     update the superblock whenever it chooses to).

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   77 +++++++++++++++++++++++++++++++++-----------
 ./include/linux/raid/md_k.h |    3 +
 2 files changed, 61 insertions(+), 19 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-12-14 16:07:51.000000000 +1100
+++ ./drivers/md/md.c	2007-12-14 16:08:28.000000000 +1100
@@ -778,7 +778,8 @@ static int super_90_validate(mddev_t *md
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
 		mddev->patch_version = sb->patch_version;
-		mddev->persistent = ! sb->not_persistent;
+		mddev->persistent = 1;
+		mddev->external = 0;
 		mddev->chunk_size = sb->chunk_size;
 		mddev->ctime = sb->ctime;
 		mddev->utime = sb->utime;
@@ -904,7 +905,7 @@ static void super_90_sync(mddev_t *mddev
 	sb->size  = mddev->size;
 	sb->raid_disks = mddev->raid_disks;
 	sb->md_minor = mddev->md_minor;
-	sb->not_persistent = !mddev->persistent;
+	sb->not_persistent = 0;
 	sb->utime = mddev->utime;
 	sb->state = 0;
 	sb->events_hi = (mddev->events>>32);
@@ -1158,6 +1159,7 @@ static int super_1_validate(mddev_t *mdd
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
 		mddev->persistent = 1;
+		mddev->external = 0;
 		mddev->chunk_size = le32_to_cpu(sb->chunksize) << 9;
 		mddev->ctime = le64_to_cpu(sb->ctime) & ((1ULL << 32)-1);
 		mddev->utime = le64_to_cpu(sb->utime) & ((1ULL << 32)-1);
@@ -1699,18 +1701,20 @@ repeat:
 		MD_BUG();
 		mddev->events --;
 	}
-	sync_sbs(mddev, nospares);
 
 	/*
 	 * do not write anything to disk if using
 	 * nonpersistent superblocks
 	 */
 	if (!mddev->persistent) {
-		clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+		if (!mddev->external)
+			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
+
 		spin_unlock_irq(&mddev->write_lock);
 		wake_up(&mddev->sb_wait);
 		return;
 	}
+	sync_sbs(mddev, nospares);
 	spin_unlock_irq(&mddev->write_lock);
 
 	dprintk(KERN_INFO 
@@ -2430,6 +2434,8 @@ array_state_show(mddev_t *mddev, char *p
 		case 0:
 			if (mddev->in_sync)
 				st = clean;
+			else if (test_bit(MD_CHANGE_CLEAN, &mddev->flags))
+				st = write_pending;
 			else if (mddev->safemode)
 				st = active_idle;
 			else
@@ -2460,11 +2466,9 @@ array_state_store(mddev_t *mddev, const 
 		break;
 	case clear:
 		/* stopping an active array */
-		if (mddev->pers) {
-			if (atomic_read(&mddev->active) > 1)
-				return -EBUSY;
-			err = do_md_stop(mddev, 0);
-		}
+		if (atomic_read(&mddev->active) > 1)
+			return -EBUSY;
+		err = do_md_stop(mddev, 0);
 		break;
 	case inactive:
 		/* stopping an active array */
@@ -2472,7 +2476,8 @@ array_state_store(mddev_t *mddev, const 
 			if (atomic_read(&mddev->active) > 1)
 				return -EBUSY;
 			err = do_md_stop(mddev, 2);
-		}
+		} else
+			err = 0; /* already inactive */
 		break;
 	case suspended:
 		break; /* not supported yet */
@@ -2500,9 +2505,15 @@ array_state_store(mddev_t *mddev, const 
 			restart_array(mddev);
 			spin_lock_irq(&mddev->write_lock);
 			if (atomic_read(&mddev->writes_pending) == 0) {
-				mddev->in_sync = 1;
-				set_bit(MD_CHANGE_CLEAN, &mddev->flags);
-			}
+				if (mddev->in_sync == 0) {
+					mddev->in_sync = 1;
+					if (mddev->persistent)
+						set_bit(MD_CHANGE_CLEAN,
+							&mddev->flags);
+				}
+				err = 0;
+			} else
+				err = -EBUSY;
 			spin_unlock_irq(&mddev->write_lock);
 		} else {
 			mddev->ro = 0;
@@ -2513,7 +2524,8 @@ array_state_store(mddev_t *mddev, const 
 	case active:
 		if (mddev->pers) {
 			restart_array(mddev);
-			clear_bit(MD_CHANGE_CLEAN, &mddev->flags);
+			if (mddev->external)
+				clear_bit(MD_CHANGE_CLEAN, &mddev->flags);
 			wake_up(&mddev->sb_wait);
 			err = 0;
 		} else {
@@ -2664,7 +2676,9 @@ __ATTR(component_size, S_IRUGO|S_IWUSR, 
 
 
 /* Metdata version.
- * This is either 'none' for arrays with externally managed metadata,
+ * This is one of
+ *   'none' for arrays with no metadata (good luck...)
+ *   'external' for arrays with externally managed metadata,
  * or N.M for internally known formats
  */
 static ssize_t
@@ -2673,6 +2687,8 @@ metadata_show(mddev_t *mddev, char *page
 	if (mddev->persistent)
 		return sprintf(page, "%d.%d\n",
 			       mddev->major_version, mddev->minor_version);
+	else if (mddev->external)
+		return sprintf(page, "external:%s\n", mddev->metadata_type);
 	else
 		return sprintf(page, "none\n");
 }
@@ -2687,6 +2703,21 @@ metadata_store(mddev_t *mddev, const cha
 
 	if (cmd_match(buf, "none")) {
 		mddev->persistent = 0;
+		mddev->external = 0;
+		mddev->major_version = 0;
+		mddev->minor_version = 90;
+		return len;
+	}
+	if (strncmp(buf, "external:", 9) == 0) {
+		int namelen = len-9;
+		if (namelen >= sizeof(mddev->metadata_type))
+			namelen = sizeof(mddev->metadata_type)-1;
+		strncpy(mddev->metadata_type, buf+9, namelen);
+		mddev->metadata_type[namelen] = 0;
+		if (namelen && mddev->metadata_type[namelen-1] == '\n')
+			mddev->metadata_type[--namelen] = 0;
+		mddev->persistent = 0;
+		mddev->external = 1;
 		mddev->major_version = 0;
 		mddev->minor_version = 90;
 		return len;
@@ -2703,6 +2734,7 @@ metadata_store(mddev_t *mddev, const cha
 	mddev->major_version = major;
 	mddev->minor_version = minor;
 	mddev->persistent = 1;
+	mddev->external = 0;
 	return len;
 }
 
@@ -3527,6 +3559,7 @@ static int do_md_stop(mddev_t * mddev, i
 		mddev->raid_disks = 0;
 		mddev->recovery_cp = 0;
 		mddev->reshape_position = MaxSector;
+		mddev->external = 0;
 
 	} else if (mddev->pers)
 		printk(KERN_INFO "md: %s switched to read-only mode.\n",
@@ -4168,13 +4201,15 @@ static int set_array_info(mddev_t * mdde
 	else
 		mddev->recovery_cp = 0;
 	mddev->persistent    = ! info->not_persistent;
+	mddev->external	     = 0;
 
 	mddev->layout        = info->layout;
 	mddev->chunk_size    = info->chunk_size;
 
 	mddev->max_disks     = MD_SB_DISKS;
 
-	mddev->flags         = 0;
+	if (mddev->persistent)
+		mddev->flags         = 0;
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 
 	mddev->default_bitmap_offset = MD_SB_BYTES >> 9;
@@ -4985,7 +5020,10 @@ static int md_seq_show(struct seq_file *
 					   mddev->major_version,
 					   mddev->minor_version);
 			}
-		} else
+		} else if (mddev->external)
+			seq_printf(seq, " super external:%s",
+				   mddev->metadata_type);
+		else
 			seq_printf(seq, " super non-persistent");
 
 		if (mddev->pers) {
@@ -5591,7 +5629,7 @@ void md_check_recovery(mddev_t *mddev)
 	}
 
 	if ( ! (
-		mddev->flags ||
+		(mddev->flags && !mddev->external) ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		(mddev->safemode == 1) ||
@@ -5607,7 +5645,8 @@ void md_check_recovery(mddev_t *mddev)
 		if (mddev->safemode && !atomic_read(&mddev->writes_pending) &&
 		    !mddev->in_sync && mddev->recovery_cp == MaxSector) {
 			mddev->in_sync = 1;
-			set_bit(MD_CHANGE_CLEAN, &mddev->flags);
+			if (mddev->persistent)
+				set_bit(MD_CHANGE_CLEAN, &mddev->flags);
 		}
 		if (mddev->safemode == 1)
 			mddev->safemode = 0;

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2007-12-14 16:07:51.000000000 +1100
+++ ./include/linux/raid/md_k.h	2007-12-14 16:07:54.000000000 +1100
@@ -130,6 +130,9 @@ struct mddev_s
 					minor_version,
 					patch_version;
 	int				persistent;
+	int 				external;	/* metadata is
+							 * managed externally */
+	char				metadata_type[17]; /* externally set*/
 	int				chunk_size;
 	time_t				ctime, utime;
 	int				level, layout;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 002 of 7] md: Give userspace control over removing failed devices when external metdata in use
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
  2007-12-14  6:26 ` [PATCH 001 of 7] md: Support 'external' metadata for md arrays NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-14  6:26 ` [PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing NeilBrown
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


When a device fails, we must not allow an further writes to the array
until the device failure has been recorded in array metadata.
When metadata is managed externally, this requires some synchronisation...

Allow/require userspace to explicitly remove failed devices
from active service in the array by writing 'none' to the 'slot'
attribute.  If this reduces the number of failed devices to 0,
the write block will automatically be lowered.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |   43 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 34 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-12-14 16:08:28.000000000 +1100
+++ ./drivers/md/md.c	2007-12-14 16:08:52.000000000 +1100
@@ -1894,20 +1894,44 @@ static ssize_t
 slot_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
 	char *e;
+	int err;
+	char nm[20];
 	int slot = simple_strtoul(buf, &e, 10);
 	if (strncmp(buf, "none", 4)==0)
 		slot = -1;
 	else if (e==buf || (*e && *e!= '\n'))
 		return -EINVAL;
-	if (rdev->mddev->pers)
-		/* Cannot set slot in active array (yet) */
-		return -EBUSY;
-	if (slot >= rdev->mddev->raid_disks)
-		return -ENOSPC;
-	rdev->raid_disk = slot;
-	/* assume it is working */
-	rdev->flags = 0;
-	set_bit(In_sync, &rdev->flags);
+	if (rdev->mddev->pers) {
+		/* Setting 'slot' on an active array requires also
+		 * updating the 'rd%d' link, and communicating
+		 * with the personality with ->hot_*_disk.
+		 * For now we only support removing
+		 * failed/spare devices.  This normally happens automatically,
+		 * but not when the metadata is externally managed.
+		 */
+		if (slot != -1)
+			return -EBUSY;
+		if (rdev->raid_disk == -1)
+			return -EEXIST;
+		/* personality does all needed checks */
+		if (rdev->mddev->pers->hot_add_disk == NULL)
+			return -EINVAL;
+		err = rdev->mddev->pers->
+			hot_remove_disk(rdev->mddev, rdev->raid_disk);
+		if (err)
+			return err;
+		sprintf(nm, "rd%d", rdev->raid_disk);
+		sysfs_remove_link(&rdev->mddev->kobj, nm);
+		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
+		md_wakeup_thread(rdev->mddev->thread);
+	} else {
+		if (slot >= rdev->mddev->raid_disks)
+			return -ENOSPC;
+		rdev->raid_disk = slot;
+		/* assume it is working */
+		rdev->flags = 0;
+		set_bit(In_sync, &rdev->flags);
+	}
 	return len;
 }
 
@@ -5551,6 +5575,7 @@ static int remove_and_add_spares(mddev_t
 
 	ITERATE_RDEV(mddev,rdev,rtmp)
 		if (rdev->raid_disk >= 0 &&
+		    !mddev->external &&
 		    (test_bit(Faulty, &rdev->flags) ||
 		     ! test_bit(In_sync, &rdev->flags)) &&
 		    atomic_read(&rdev->nr_pending)==0) {

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing.
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
  2007-12-14  6:26 ` [PATCH 001 of 7] md: Support 'external' metadata for md arrays NeilBrown
  2007-12-14  6:26 ` [PATCH 002 of 7] md: Give userspace control over removing failed devices when external metdata in use NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-14  6:26 ` [PATCH 004 of 7] md: Allow devices to be shared between md arrays NeilBrown
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


This allows userspace to control resync/reshape progress and
synchronise it with other activities, such as shared access in a SAN,
or backing up critical sections during a tricky reshape.

Writing a number of sectors (which must be a multiple of the chunk
size if such is meaningful) causes a resync to pause when it
gets to that point.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./Documentation/md.txt      |   10 +++++
 ./drivers/md/md.c           |   75 ++++++++++++++++++++++++++++++++++++++------
 ./drivers/md/raid1.c        |    2 +
 ./drivers/md/raid10.c       |    3 +
 ./drivers/md/raid5.c        |   25 ++++++++++++++
 ./include/linux/raid/md_k.h |    2 +
 6 files changed, 107 insertions(+), 10 deletions(-)

diff .prev/Documentation/md.txt ./Documentation/md.txt
--- .prev/Documentation/md.txt	2007-12-14 16:07:50.000000000 +1100
+++ ./Documentation/md.txt	2007-12-14 16:08:57.000000000 +1100
@@ -416,6 +416,16 @@ also have
      sectors in total that could need to be processed.  The two
      numbers are separated by a '/'  thus effectively showing one
      value, a fraction of the process that is complete.
+     A 'select' on this attribute will return when resync completes,
+     when it reaches the current sync_max (below) and possibly at
+     other times.
+
+   sync_max
+     This is a number of sectors at which point a resync/recovery
+     process will pause.  When a resync is active, the value can
+     only ever be increased, never decreased.  The value of 'max'
+     effectively disables the limit.
+
 
    sync_speed
      This shows the current actual speed, in K/sec, of the current

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-12-14 16:08:52.000000000 +1100
+++ ./drivers/md/md.c	2007-12-14 16:08:57.000000000 +1100
@@ -275,6 +275,7 @@ static mddev_t * mddev_find(dev_t unit)
 	spin_lock_init(&new->write_lock);
 	init_waitqueue_head(&new->sb_wait);
 	new->reshape_position = MaxSector;
+	new->resync_max = MaxSector;
 
 	new->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!new->queue) {
@@ -2926,6 +2927,43 @@ sync_completed_show(mddev_t *mddev, char
 static struct md_sysfs_entry md_sync_completed = __ATTR_RO(sync_completed);
 
 static ssize_t
+max_sync_show(mddev_t *mddev, char *page)
+{
+	if (mddev->resync_max == MaxSector)
+		return sprintf(page, "max\n");
+	else
+		return sprintf(page, "%llu\n",
+			       (unsigned long long)mddev->resync_max);
+}
+static ssize_t
+max_sync_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	if (strncmp(buf, "max", 3) == 0)
+		mddev->resync_max = MaxSector;
+	else {
+		char *ep;
+		unsigned long long max = simple_strtoull(buf, &ep, 10);
+		if (ep == buf || (*ep != 0 && *ep != '\n'))
+			return -EINVAL;
+		if (max < mddev->resync_max &&
+		    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+			return -EBUSY;
+
+		/* Must be a multiple of chunk_size */
+		if (mddev->chunk_size) {
+			if (max & (sector_t)((mddev->chunk_size>>9)-1))
+				return -EINVAL;
+		}
+		mddev->resync_max = max;
+	}
+	wake_up(&mddev->recovery_wait);
+	return len;
+}
+
+static struct md_sysfs_entry md_max_sync =
+__ATTR(sync_max, S_IRUGO|S_IWUSR, max_sync_show, max_sync_store);
+
+static ssize_t
 suspend_lo_show(mddev_t *mddev, char *page)
 {
 	return sprintf(page, "%llu\n", (unsigned long long)mddev->suspend_lo);
@@ -3035,6 +3073,7 @@ static struct attribute *md_redundancy_a
 	&md_sync_max.attr,
 	&md_sync_speed.attr,
 	&md_sync_completed.attr,
+	&md_max_sync.attr,
 	&md_suspend_lo.attr,
 	&md_suspend_hi.attr,
 	&md_bitmap.attr,
@@ -3582,6 +3621,7 @@ static int do_md_stop(mddev_t * mddev, i
 		mddev->size = 0;
 		mddev->raid_disks = 0;
 		mddev->recovery_cp = 0;
+		mddev->resync_max = MaxSector;
 		mddev->reshape_position = MaxSector;
 		mddev->external = 0;
 
@@ -5445,8 +5485,16 @@ void md_do_sync(mddev_t *mddev)
 		sector_t sectors;
 
 		skipped = 0;
+		if (j >= mddev->resync_max) {
+			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
+			wait_event(mddev->recovery_wait,
+				   mddev->resync_max > j
+				   || kthread_should_stop());
+		}
+		if (kthread_should_stop())
+			goto interrupted;
 		sectors = mddev->pers->sync_request(mddev, j, &skipped,
-					    currspeed < speed_min(mddev));
+						  currspeed < speed_min(mddev));
 		if (sectors == 0) {
 			set_bit(MD_RECOVERY_ERR, &mddev->recovery);
 			goto out;
@@ -5488,15 +5536,9 @@ void md_do_sync(mddev_t *mddev)
 		}
 
 
-		if (kthread_should_stop()) {
-			/*
-			 * got a signal, exit.
-			 */
-			printk(KERN_INFO 
-				"md: md_do_sync() got signal ... exiting\n");
-			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-			goto out;
-		}
+		if (kthread_should_stop())
+			goto interrupted;
+
 
 		/*
 		 * this loop exits only if either when we are slower than
@@ -5560,9 +5602,22 @@ void md_do_sync(mddev_t *mddev)
 
  skip:
 	mddev->curr_resync = 0;
+	mddev->resync_max = MaxSector;
+	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	wake_up(&resync_wait);
 	set_bit(MD_RECOVERY_DONE, &mddev->recovery);
 	md_wakeup_thread(mddev->thread);
+	return;
+
+ interrupted:
+	/*
+	 * got a signal, exit.
+	 */
+	printk(KERN_INFO
+	       "md: md_do_sync() got signal ... exiting\n");
+	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+	goto out;
+
 }
 EXPORT_SYMBOL_GPL(md_do_sync);
 

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2007-12-14 16:07:50.000000000 +1100
+++ ./drivers/md/raid10.c	2007-12-14 16:08:57.000000000 +1100
@@ -1657,6 +1657,9 @@ static sector_t sync_request(mddev_t *md
 		return (max_sector - sector_nr) + sectors_skipped;
 	}
 
+	if (max_sector > mddev->resync_max)
+		max_sector = mddev->resync_max; /* Don't do IO beyond here */
+
 	/* make sure whole request will fit in a chunk - if chunks
 	 * are meaningful
 	 */

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2007-12-14 16:07:50.000000000 +1100
+++ ./drivers/md/raid1.c	2007-12-14 16:08:57.000000000 +1100
@@ -1784,6 +1784,8 @@ static sector_t sync_request(mddev_t *md
 		return rv;
 	}
 
+	if (max_sector > mddev->resync_max)
+		max_sector = mddev->resync_max; /* Don't do IO beyond here */
 	nr_sectors = 0;
 	sync_blocks = 0;
 	do {

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2007-12-14 16:07:50.000000000 +1100
+++ ./drivers/md/raid5.c	2007-12-14 16:08:57.000000000 +1100
@@ -4277,6 +4277,25 @@ static sector_t reshape_request(mddev_t 
 		release_queue(sq);
 		first_sector += STRIPE_SECTORS;
 	}
+	/* If this takes us to the resync_max point where we have to pause,
+	 * then we need to write out the superblock.
+	 */
+	sector_nr += conf->chunk_size>>9;
+	if (sector_nr >= mddev->resync_max) {
+		/* Cannot proceed until we've updated the superblock... */
+		wait_event(conf->wait_for_overlap,
+			   atomic_read(&conf->reshape_stripes) == 0);
+		mddev->reshape_position = conf->expand_progress;
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_DEVS, &mddev->flags)
+			   || kthread_should_stop());
+		spin_lock_irq(&conf->device_lock);
+		conf->expand_lo = mddev->reshape_position;
+		spin_unlock_irq(&conf->device_lock);
+		wake_up(&conf->wait_for_overlap);
+	}
 	return conf->chunk_size>>9;
 }
 
@@ -4314,6 +4333,12 @@ static inline sector_t sync_request(mdde
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		return reshape_request(mddev, sector_nr, skipped);
 
+	/* No need to check resync_max as we never do more than one
+	 * stripe, and as resync_max will always be on a chunk boundary,
+	 * if the check in md_do_sync didn't fire, there is no chance
+	 * of overstepping resync_max here
+	 */
+
 	/* if there is too many failed drives and we are trying
 	 * to resync, then assert that we are finished, because there is
 	 * nothing we can do.

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2007-12-14 16:07:54.000000000 +1100
+++ ./include/linux/raid/md_k.h	2007-12-14 16:08:57.000000000 +1100
@@ -219,6 +219,8 @@ struct mddev_s
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
+	sector_t			resync_max;	/* resync should pause
+							 * when it gets here */
 
 	spinlock_t			write_lock;
 	wait_queue_head_t		sb_wait;	/* for waiting on superblock updates */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 004 of 7] md: Allow devices to be shared between md arrays.
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (2 preceding siblings ...)
  2007-12-14  6:26 ` [PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-25 22:04   ` Andrew Morton
  2007-12-14  6:26 ` [PATCH 005 of 7] md: Lock address when changing attributes of component devices NeilBrown
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Currently, a given device is "claimed" by a particular array so
that it cannot be used by other arrays.

This is not ideal for DDF and other metadata schemes which have
their own partitioning concept.

So for externally managed metadata, just claim the device for
md in general, require that "offset" and "size" are set
properly for each device, and make sure that if a device is
included in different arrays then the active sections do
not overlap.

This involves adding another flag to the rdev which makes it awkward
to set "->flags = 0" to clear certain flags.  So now clear flags
explicitly by name when we want to clear things.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   93 ++++++++++++++++++++++++++++++++++++++------
 ./include/linux/raid/md_k.h |    2 
 2 files changed, 84 insertions(+), 11 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-12-14 16:08:57.000000000 +1100
+++ ./drivers/md/md.c	2007-12-14 16:09:01.000000000 +1100
@@ -774,7 +774,11 @@ static int super_90_validate(mddev_t *md
 	__u64 ev1 = md_event(sb);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
@@ -1155,7 +1159,11 @@ static int super_1_validate(mddev_t *mdd
 	__u64 ev1 = le64_to_cpu(sb->events);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
@@ -1407,7 +1415,7 @@ static int bind_rdev_to_array(mdk_rdev_t
 		goto fail;
 	}
 	list_add(&rdev->same_set, &mddev->disks);
-	bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
+	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
 	return 0;
 
  fail:
@@ -1447,7 +1455,7 @@ static void unbind_rdev_from_array(mdk_r
  * otherwise reused by a RAID array (or any other kernel
  * subsystem), by bd_claiming the device.
  */
-static int lock_rdev(mdk_rdev_t *rdev, dev_t dev)
+static int lock_rdev(mdk_rdev_t *rdev, dev_t dev, int shared)
 {
 	int err = 0;
 	struct block_device *bdev;
@@ -1459,13 +1467,15 @@ static int lock_rdev(mdk_rdev_t *rdev, d
 			__bdevname(dev, b));
 		return PTR_ERR(bdev);
 	}
-	err = bd_claim(bdev, rdev);
+	err = bd_claim(bdev, shared ? (mdk_rdev_t *)lock_rdev : rdev);
 	if (err) {
 		printk(KERN_ERR "md: could not bd_claim %s.\n",
 			bdevname(bdev, b));
 		blkdev_put(bdev);
 		return err;
 	}
+	if (!shared)
+		set_bit(AllReserved, &rdev->flags);
 	rdev->bdev = bdev;
 	return err;
 }
@@ -1930,7 +1940,8 @@ slot_store(mdk_rdev_t *rdev, const char 
 			return -ENOSPC;
 		rdev->raid_disk = slot;
 		/* assume it is working */
-		rdev->flags = 0;
+		clear_bit(Faulty, &rdev->flags);
+		clear_bit(WriteMostly, &rdev->flags);
 		set_bit(In_sync, &rdev->flags);
 	}
 	return len;
@@ -1955,6 +1966,10 @@ offset_store(mdk_rdev_t *rdev, const cha
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
+	if (rdev->size && rdev->mddev->external)
+		/* Must set offset before size, so overlap checks
+		 * can be sane */
+		return -EBUSY;
 	rdev->data_offset = offset;
 	return len;
 }
@@ -1968,16 +1983,69 @@ rdev_size_show(mdk_rdev_t *rdev, char *p
 	return sprintf(page, "%llu\n", (unsigned long long)rdev->size);
 }
 
+static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
+{
+	/* check if two start/length pairs overlap */
+	if (s1+l1 <= s2)
+		return 0;
+	if (s2+l2 <= s1)
+		return 0;
+	return 1;
+}
+
 static ssize_t
 rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
 	char *e;
 	unsigned long long size = simple_strtoull(buf, &e, 10);
+	unsigned long long oldsize = rdev->size;
 	if (e==buf || (*e && *e != '\n'))
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
 	rdev->size = size;
+	if (size > oldsize && rdev->mddev->external) {
+		/* need to check that all other rdevs with the same ->bdev
+		 * do not overlap.  We need to unlock the mddev to avoid
+		 * a deadlock.  We have already changed rdev->size, and if
+		 * we have to change it back, we will have the lock again.
+		 */
+		mddev_t *mddev;
+		int overlap = 0;
+		struct list_head *tmp, *tmp2;
+
+		mddev_unlock(rdev->mddev);
+		ITERATE_MDDEV(mddev, tmp) {
+			mdk_rdev_t *rdev2;
+
+			mddev_lock(mddev);
+			ITERATE_RDEV(mddev, rdev2, tmp2)
+				if (test_bit(AllReserved, &rdev2->flags) ||
+				    (rdev->bdev == rdev2->bdev &&
+				     rdev != rdev2 &&
+				     overlaps(rdev->data_offset, rdev->size,
+					    rdev2->data_offset, rdev2->size))) {
+					overlap = 1;
+					break;
+				}
+			mddev_unlock(mddev);
+			if (overlap) {
+				mddev_put(mddev);
+				break;
+			}
+		}
+		mddev_lock(rdev->mddev);
+		if (overlap) {
+			/* Someone else could have slipped in a size
+			 * change here, but doing so is just silly.
+			 * We put oldsize back because we *know* it is
+			 * safe, and trust userspace not to race with
+			 * itself
+			 */
+			rdev->size = oldsize;
+			return -EBUSY;
+		}
+	}
 	if (size < rdev->mddev->size || rdev->mddev->size == 0)
 		rdev->mddev->size = size;
 	return len;
@@ -2061,7 +2129,7 @@ static mdk_rdev_t *md_import_device(dev_
 	if ((err = alloc_disk_sb(rdev)))
 		goto abort_free;
 
-	err = lock_rdev(rdev, newdev);
+	err = lock_rdev(rdev, newdev, super_format == -2);
 	if (err)
 		goto abort_free;
 
@@ -2616,7 +2684,9 @@ new_dev_store(mddev_t *mddev, const char
 			if (err < 0)
 				goto out;
 		}
-	} else
+	} else if (mddev->external)
+		rdev = md_import_device(dev, -2, -1);
+	else
 		rdev = md_import_device(dev, -1, -1);
 
 	if (IS_ERR(rdev))
@@ -3216,8 +3286,11 @@ static int do_md_run(mddev_t * mddev)
 	/*
 	 * Analyze all RAID superblock(s)
 	 */
-	if (!mddev->raid_disks)
+	if (!mddev->raid_disks) {
+		if (!mddev->persistent)
+			return -EINVAL;
 		analyze_sbs(mddev);
+	}
 
 	chunk_size = mddev->chunk_size;
 
@@ -4019,8 +4092,6 @@ static int add_new_disk(mddev_t * mddev,
 		else
 			rdev->raid_disk = -1;
 
-		rdev->flags = 0;
-
 		if (rdev->raid_disk < mddev->raid_disks)
 			if (info->state & (1<<MD_DISK_SYNC))
 				set_bit(In_sync, &rdev->flags);

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2007-12-14 16:08:57.000000000 +1100
+++ ./include/linux/raid/md_k.h	2007-12-14 16:09:01.000000000 +1100
@@ -81,6 +81,8 @@ struct mdk_rdev_s
 #define	In_sync		2		/* device is in_sync with rest of array */
 #define	WriteMostly	4		/* Avoid reading if at all possible */
 #define	BarriersNotsupp	5		/* BIO_RW_BARRIER is not supported */
+#define	AllReserved	6		/* If whole device is reserved for
+					 * one array */
 
 	int desc_nr;			/* descriptor index in the superblock */
 	int raid_disk;			/* role of device in array */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 005 of 7] md: Lock address when changing attributes of component devices.
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (3 preceding siblings ...)
  2007-12-14  6:26 ` [PATCH 004 of 7] md: Allow devices to be shared between md arrays NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-14  6:26 ` [PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has external metadata NeilBrown
  2007-12-14  6:26 ` [PATCH 007 of 7] md: Get name for block device in sysfs NeilBrown
  6 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel



Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-12-14 16:09:01.000000000 +1100
+++ ./drivers/md/md.c	2007-12-14 16:09:03.000000000 +1100
@@ -2080,12 +2080,18 @@ rdev_attr_store(struct kobject *kobj, st
 {
 	struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
 	mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+	int rv;
 
 	if (!entry->store)
 		return -EIO;
 	if (!capable(CAP_SYS_ADMIN))
 		return -EACCES;
-	return entry->store(rdev, page, length);
+	rv = mddev_lock(rdev->mddev);
+	if (!rv) {
+		rv = entry->store(rdev, page, length);
+		mddev_unlock(rdev->mddev);
+	}
+	return rv;
 }
 
 static void rdev_free(struct kobject *ko)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has external metadata.
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (4 preceding siblings ...)
  2007-12-14  6:26 ` [PATCH 005 of 7] md: Lock address when changing attributes of component devices NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-14  6:26 ` [PATCH 007 of 7] md: Get name for block device in sysfs NeilBrown
  6 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel



Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-12-14 16:09:03.000000000 +1100
+++ ./drivers/md/md.c	2007-12-14 16:09:09.000000000 +1100
@@ -4650,9 +4650,10 @@ static int md_ioctl(struct inode *inode,
 	 */
 	/* if we are not initialised yet, only ADD_NEW_DISK, STOP_ARRAY,
 	 * RUN_ARRAY, and GET_ and SET_BITMAP_FILE are allowed */
-	if (!mddev->raid_disks && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
-			&& cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
-	    		&& cmd != GET_BITMAP_FILE) {
+	if ((!mddev->raid_disks && !mddev->external)
+	    && cmd != ADD_NEW_DISK && cmd != STOP_ARRAY
+	    && cmd != RUN_ARRAY && cmd != SET_BITMAP_FILE
+	    && cmd != GET_BITMAP_FILE) {
 		err = -ENODEV;
 		goto abort_unlock;
 	}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 007 of 7] md: Get name for block device in sysfs
  2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
                   ` (5 preceding siblings ...)
  2007-12-14  6:26 ` [PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has external metadata NeilBrown
@ 2007-12-14  6:26 ` NeilBrown
  2007-12-15 16:58   ` Kay Sievers
  6 siblings, 1 reply; 17+ messages in thread
From: NeilBrown @ 2007-12-14  6:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, Jens Axboe


Given an fd on a block device, returns a string like

	/block/sda/sda1

which can be used to find related information in /sys.

Ideally we should have an ioctl that works on char devices as well,
but that seems far from trivial, so it seems reasonable to have
this until the later can be implemented.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./block/ioctl.c      |   13 +++++++++++++
 ./include/linux/fs.h |    2 ++
 2 files changed, 15 insertions(+)

diff .prev/block/ioctl.c ./block/ioctl.c
--- .prev/block/ioctl.c	2007-12-14 17:18:50.000000000 +1100
+++ ./block/ioctl.c	2007-12-14 16:15:41.000000000 +1100
@@ -227,8 +227,21 @@ int blkdev_ioctl(struct inode *inode, st
 	struct block_device *bdev = inode->i_bdev;
 	struct gendisk *disk = bdev->bd_disk;
 	int ret, n;
+	char b[BDEVNAME_SIZE*2  + 10];
 
 	switch(cmd) {
+	case BLKGETNAME:
+		strcpy(b, "/block/");
+		bdevname(bdev->bd_contains, b+7);
+		if (bdev->bd_contains != bdev) {
+			char *e = b + strlen(b);
+			*e++ = '/';
+			bdevname(bdev, e);
+		}
+		if (copy_to_user((char __user *)arg, b, strlen(b)+1))
+			return -EFAULT;
+		return 0;
+
 	case BLKFLSBUF:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EACCES;

diff .prev/include/linux/fs.h ./include/linux/fs.h
--- .prev/include/linux/fs.h	2007-12-14 17:18:50.000000000 +1100
+++ ./include/linux/fs.h	2007-12-14 16:13:03.000000000 +1100
@@ -218,6 +218,8 @@ extern int dir_notify_enable;
 #define BLKTRACESTOP _IO(0x12,117)
 #define BLKTRACETEARDOWN _IO(0x12,118)
 
+#define BLKGETNAME _IOR(0x12, 119, char [1024])
+
 #define BMAP_IOCTL 1		/* obsolete - kept for compatibility */
 #define FIBMAP	   _IO(0x00,1)	/* bmap access */
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 007 of 7] md: Get name for block device in sysfs
  2007-12-14  6:26 ` [PATCH 007 of 7] md: Get name for block device in sysfs NeilBrown
@ 2007-12-15 16:58   ` Kay Sievers
  2007-12-16 22:43     ` Neil Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Kay Sievers @ 2007-12-15 16:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel, Jens Axboe, Greg KH

On Dec 14, 2007 7:26 AM, NeilBrown <neilb@suse.de> wrote:
>
> Given an fd on a block device, returns a string like
>
>         /block/sda/sda1
>
> which can be used to find related information in /sys.
>
> Ideally we should have an ioctl that works on char devices as well,
> but that seems far from trivial, so it seems reasonable to have
> this until the later can be implemented.
>
> Cc: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
>  ./block/ioctl.c      |   13 +++++++++++++
>  ./include/linux/fs.h |    2 ++
>  2 files changed, 15 insertions(+)
>
> diff .prev/block/ioctl.c ./block/ioctl.c
> --- .prev/block/ioctl.c 2007-12-14 17:18:50.000000000 +1100
> +++ ./block/ioctl.c     2007-12-14 16:15:41.000000000 +1100
> @@ -227,8 +227,21 @@ int blkdev_ioctl(struct inode *inode, st
>         struct block_device *bdev = inode->i_bdev;
>         struct gendisk *disk = bdev->bd_disk;
>         int ret, n;
> +       char b[BDEVNAME_SIZE*2  + 10];
>
>         switch(cmd) {
> +       case BLKGETNAME:
> +               strcpy(b, "/block/");

As pointed out to when you came up with the idea, we can't do this. A devpath
is a path to the device and will not necessarily start with "/block" for block
devices. It may start with "/devices" and can be much longer than
BDEVNAME_SIZE*2  + 10.

Please do not apply!

Kay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 007 of 7] md: Get name for block device in sysfs
  2007-12-15 16:58   ` Kay Sievers
@ 2007-12-16 22:43     ` Neil Brown
  2007-12-17  2:10       ` Kay Sievers
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Brown @ 2007-12-16 22:43 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Andrew Morton, linux-raid, linux-kernel, Jens Axboe, Greg KH

On Saturday December 15, kay.sievers@vrfy.org wrote:
> On Dec 14, 2007 7:26 AM, NeilBrown <neilb@suse.de> wrote:
> >
> > Given an fd on a block device, returns a string like
> >
> >         /block/sda/sda1
> >
> > which can be used to find related information in /sys.
....
> 
> As pointed out to when you came up with the idea, we can't do this. A devpath
> is a path to the device and will not necessarily start with "/block" for block
> devices. It may start with "/devices" and can be much longer than
> BDEVNAME_SIZE*2  + 10.


When you say "will not necessarily" can I take that to mean that it
currently does, but it might (will) change??
In that case can we have the patch as it stands and when the path to
block devices in /sys changes, the ioctl can be changed at the same
time to match?

Or are you saying that as the kernel is today, some block devices
appear under /devices/..., in which case could you please give an
example?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 007 of 7] md: Get name for block device in sysfs
  2007-12-16 22:43     ` Neil Brown
@ 2007-12-17  2:10       ` Kay Sievers
  2007-12-17  5:29         ` /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs] Michael Tokarev
  0 siblings, 1 reply; 17+ messages in thread
From: Kay Sievers @ 2007-12-17  2:10 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-raid, linux-kernel, Jens Axboe, Greg KH

On Mon, 2007-12-17 at 09:43 +1100, Neil Brown wrote:
> On Saturday December 15, kay.sievers@vrfy.org wrote:
> > On Dec 14, 2007 7:26 AM, NeilBrown <neilb@suse.de> wrote:
> > >
> > > Given an fd on a block device, returns a string like
> > >
> > >         /block/sda/sda1
> > >
> > > which can be used to find related information in /sys.
> ....
> > 
> > As pointed out to when you came up with the idea, we can't do this. A devpath
> > is a path to the device and will not necessarily start with "/block" for block
> > devices. It may start with "/devices" and can be much longer than
> > BDEVNAME_SIZE*2  + 10.
> 
> When you say "will not necessarily" can I take that to mean that it
> currently does, but it might (will) change??

It's in -mm. The devpath for all block devices, like for all other
devices, will start with /devices/* if !SYSFS_DEPRECATED.

> In that case can we have the patch as it stands and when the path to
> block devices in /sys changes, the ioctl can be changed at the same
> time to match?

No, you have to use kobject_get_path() to get the path to the object.
This will also handle devices where the name contains '/' which needs to
be translated to '!', which is broken in this patch.

> Or are you saying that as the kernel is today, some block devices
> appear under /devices/..., in which case could you please give an
> example?

We expect the next kernel to have it.

Btw: BLKGETNAME should probably be renamed to something which contains
DEVPATH, to make clear that it's a path to, and not the name of the
device.

Kay


^ permalink raw reply	[flat|nested] 17+ messages in thread

* /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs]
  2007-12-17  2:10       ` Kay Sievers
@ 2007-12-17  5:29         ` Michael Tokarev
  2007-12-17  8:24           ` Kay Sievers
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Tokarev @ 2007-12-17  5:29 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, Greg KH

Kay Sievers wrote:
> On Mon, 2007-12-17 at 09:43 +1100, Neil Brown wrote:
>> On Saturday December 15, kay.sievers@vrfy.org wrote:
>>> On Dec 14, 2007 7:26 AM, NeilBrown <neilb@suse.de> wrote:
>>>> Given an fd on a block device, returns a string like
>>>>
>>>>         /block/sda/sda1
>>>>
>>>> which can be used to find related information in /sys.
>> ....
>>> As pointed out to when you came up with the idea, we can't do this. A devpath
>>> is a path to the device and will not necessarily start with "/block" for block
>>> devices. It may start with "/devices" and can be much longer than
>>> BDEVNAME_SIZE*2  + 10.
>> When you say "will not necessarily" can I take that to mean that it
>> currently does, but it might (will) change??
> 
> It's in -mm. The devpath for all block devices, like for all other
> devices, will start with /devices/* if !SYSFS_DEPRECATED.

This is the second time I come across this (planned?) change, and for
the second time I can't understand it.

How to distinguish char devices from block devices in sysfs?
Is the only way to read a symlink `subsystem' in the device
directory?

For now, I've a shell code (used heavily in numerous places),
which looks like this:

  function makedev() {
    ...
    case $DEVPATH in
      /block/*) TYPE=b ;;
      *) TYPE=c ;;
    esac
    ...
    mknod /dev/$DEV $TYPE $MAJOR $MINOR
  }

The only external process invocation in there is mknod, all
the rest is done using pure shell constructs.  Is it really
necessary to spawn another process just to read a symlink
now?  It will be almost 2 times slower....

(Sure thing this may be rewritten in C, but using shell it's
MUCH easier to customize if necessary.)

Also, /sys/block/ directory is very easy to use currently, --
unlike other /sys/ stuff which is way too deep and often
placed in unknown/unexpected places (and /sys/class/ and
/sys/bus/ directories are changing all the time).

What's the benefit of moving things from /sys/block/ to
/sys/devices/ ?

Thanks.

/mjt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs]
  2007-12-17  5:29         ` /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs] Michael Tokarev
@ 2007-12-17  8:24           ` Kay Sievers
  2007-12-17  8:32             ` Michael Tokarev
  0 siblings, 1 reply; 17+ messages in thread
From: Kay Sievers @ 2007-12-17  8:24 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux-kernel, Greg KH

On Mon, 2007-12-17 at 08:29 +0300, Michael Tokarev wrote:
> Kay Sievers wrote:
> > On Mon, 2007-12-17 at 09:43 +1100, Neil Brown wrote:
> >> On Saturday December 15, kay.sievers@vrfy.org wrote:
> >>> On Dec 14, 2007 7:26 AM, NeilBrown <neilb@suse.de> wrote:
> >>>> Given an fd on a block device, returns a string like
> >>>>
> >>>>         /block/sda/sda1
> >>>>
> >>>> which can be used to find related information in /sys.
> >> ....
> >>> As pointed out to when you came up with the idea, we can't do this. A devpath
> >>> is a path to the device and will not necessarily start with "/block" for block
> >>> devices. It may start with "/devices" and can be much longer than
> >>> BDEVNAME_SIZE*2  + 10.
> >> When you say "will not necessarily" can I take that to mean that it
> >> currently does, but it might (will) change??
> > 
> > It's in -mm. The devpath for all block devices, like for all other
> > devices, will start with /devices/* if !SYSFS_DEPRECATED.
> 
> This is the second time I come across this (planned?) change, and for
> the second time I can't understand it.
> 
> How to distinguish char devices from block devices in sysfs?
> Is the only way to read a symlink `subsystem' in the device
> directory?

By its subsystem value (block), from the symlink, from the environment,
or from $1.

> For now, I've a shell code (used heavily in numerous places),
> which looks like this:
> 
>   function makedev() {
>     ...
>     case $DEVPATH in
>       /block/*) TYPE=b ;;
>       *) TYPE=c ;;
>     esac
>     ...
>     mknod /dev/$DEV $TYPE $MAJOR $MINOR
>   }
> 
> The only external process invocation in there is mknod, all
> the rest is done using pure shell constructs.  Is it really
> necessary to spawn another process just to read a symlink
> now?  It will be almost 2 times slower....

No need.

> (Sure thing this may be rewritten in C, but using shell it's
> MUCH easier to customize if necessary.)

$SUBSYSTEM == "block"

> Also, /sys/block/ directory is very easy to use currently, --
> unlike other /sys/ stuff which is way too deep and often
> placed in unknown/unexpected places (and /sys/class/ and
> /sys/bus/ directories are changing all the time).

/sys/block is still there and contains symlinks. And all this happens
only for !SYSFS_DEPRECATED.

> What's the benefit of moving things from /sys/block/ to
> /sys/devices/ ?

Unification. Block devices are "struct devices", use a class, use the
common driver core code instead of their own, show up in the tree, and
can be parents for other devices.

Kay


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs]
  2007-12-17  8:24           ` Kay Sievers
@ 2007-12-17  8:32             ` Michael Tokarev
  2007-12-17  9:13               ` Michael Tokarev
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Tokarev @ 2007-12-17  8:32 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, Greg KH

Kay Sievers wrote:
> On Mon, 2007-12-17 at 08:29 +0300, Michael Tokarev wrote:
[]
>> How to distinguish char devices from block devices in sysfs?
>> Is the only way to read a symlink `subsystem' in the device
>> directory?
> 
> By its subsystem value (block), from the symlink, from the environment,
> or from $1.

Environment and $1 comes as arguments for hotplug helper, not
when scanning /sys/.

>> For now, I've a shell code (used heavily in numerous places),
>> which looks like this:
>>
>>   function makedev() {
>>     ...
>>     case $DEVPATH in
>>       /block/*) TYPE=b ;;
>>       *) TYPE=c ;;
>>     esac
>>     ...
>>     mknod /dev/$DEV $TYPE $MAJOR $MINOR
>>   }
>>
>> The only external process invocation in there is mknod, all
>> the rest is done using pure shell constructs.  Is it really
>> necessary to spawn another process just to read a symlink
>> now?  It will be almost 2 times slower....
> 
> No need.

It seems there IS a need now ;)

Thanks for the clarification.

/mjt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs]
  2007-12-17  8:32             ` Michael Tokarev
@ 2007-12-17  9:13               ` Michael Tokarev
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Tokarev @ 2007-12-17  9:13 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Kay Sievers, linux-kernel, Greg KH

Michael Tokarev wrote:
> Kay Sievers wrote:
>> On Mon, 2007-12-17 at 08:29 +0300, Michael Tokarev wrote:
> []
>>> How to distinguish char devices from block devices in sysfs?
>>> Is the only way to read a symlink `subsystem' in the device
>>> directory?
>> By its subsystem value (block), from the symlink, from the environment,
>> or from $1.
> 
> Environment and $1 comes as arguments for hotplug helper, not
> when scanning /sys/.

By the way, that's one of reasons I asked about useful content
in uevent files (but failed to provide a patch so far ;).
In 2.6.23, those files ARE readable finally, but doesn't
contain much info yet.

/mjt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 001 of 7] md: Support 'external' metadata for md arrays.
  2007-12-14  6:26 ` [PATCH 001 of 7] md: Support 'external' metadata for md arrays NeilBrown
@ 2007-12-25 22:03   ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2007-12-25 22:03 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

On Fri, 14 Dec 2007 17:26:08 +1100 NeilBrown <neilb@suse.de> wrote:

> +	if (strncmp(buf, "external:", 9) == 0) {
> +		int namelen = len-9;
> +		if (namelen >= sizeof(mddev->metadata_type))
> +			namelen = sizeof(mddev->metadata_type)-1;
> +		strncpy(mddev->metadata_type, buf+9, namelen);
> +		mddev->metadata_type[namelen] = 0;
> +		if (namelen && mddev->metadata_type[namelen-1] == '\n')
> +			mddev->metadata_type[--namelen] = 0;
> +		mddev->persistent = 0;
> +		mddev->external = 1;

size_t would be a more appropriate type for `namelen'.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 004 of 7] md: Allow devices to be shared between md arrays.
  2007-12-14  6:26 ` [PATCH 004 of 7] md: Allow devices to be shared between md arrays NeilBrown
@ 2007-12-25 22:04   ` Andrew Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2007-12-25 22:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, linux-kernel

On Fri, 14 Dec 2007 17:26:28 +1100 NeilBrown <neilb@suse.de> wrote:

> +		mddev_unlock(rdev->mddev);
> +		ITERATE_MDDEV(mddev, tmp) {
> +			mdk_rdev_t *rdev2;
> +
> +			mddev_lock(mddev);
> +			ITERATE_RDEV(mddev, rdev2, tmp2)
> +				if (test_bit(AllReserved, &rdev2->flags) ||
> +				    (rdev->bdev == rdev2->bdev &&
> +				     rdev != rdev2 &&
> +				     overlaps(rdev->data_offset, rdev->size,
> +					    rdev2->data_offset, rdev2->size))) {
> +					overlap = 1;
> +					break;
> +				}
> +			mddev_unlock(mddev);
> +			if (overlap) {
> +				mddev_put(mddev);
> +				break;
> +			}
> +		}

eww, ITERATE_MDDEV() and ITERATE_RDEV() are an eyesore.

for_each_mddev() and for_each_rdev() would at least mean the reader doesn't
need to check the implementation when wondering what that `break' is
breaking from.

>  #define	In_sync		2		/* device is in_sync with rest of array */
>  #define	WriteMostly	4		/* Avoid reading if at all possible */
>  #define	BarriersNotsupp	5		/* BIO_RW_BARRIER is not supported */
> +#define	AllReserved	6		/* If whole device is reserved for

The naming style here is inconsistent.

A task for the keen would be to convert these to an enum and add some
namespacing prefix to them.  

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2007-12-25 22:05 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-14  6:26 [PATCH 000 of 7] md: Introduction EXPLAIN PATCH SET HERE NeilBrown
2007-12-14  6:26 ` [PATCH 001 of 7] md: Support 'external' metadata for md arrays NeilBrown
2007-12-25 22:03   ` Andrew Morton
2007-12-14  6:26 ` [PATCH 002 of 7] md: Give userspace control over removing failed devices when external metdata in use NeilBrown
2007-12-14  6:26 ` [PATCH 003 of 7] md: Allow a maximum extent to be set for resyncing NeilBrown
2007-12-14  6:26 ` [PATCH 004 of 7] md: Allow devices to be shared between md arrays NeilBrown
2007-12-25 22:04   ` Andrew Morton
2007-12-14  6:26 ` [PATCH 005 of 7] md: Lock address when changing attributes of component devices NeilBrown
2007-12-14  6:26 ` [PATCH 006 of 7] md: Allow an md array to appear with 0 drives if it has external metadata NeilBrown
2007-12-14  6:26 ` [PATCH 007 of 7] md: Get name for block device in sysfs NeilBrown
2007-12-15 16:58   ` Kay Sievers
2007-12-16 22:43     ` Neil Brown
2007-12-17  2:10       ` Kay Sievers
2007-12-17  5:29         ` /sys/block [was: [PATCH 007 of 7] md: Get name for block device in sysfs] Michael Tokarev
2007-12-17  8:24           ` Kay Sievers
2007-12-17  8:32             ` Michael Tokarev
2007-12-17  9:13               ` Michael Tokarev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).