linux-kernel.vger.kernel.org archive mirror
* Re: "Enhanced" MD code avaible for review
       [not found] <459805408.1079547261@aslan.scsiguy.com>
@ 2004-03-17 19:18 ` Jeff Garzik
  2004-03-17 19:32   ` Christoph Hellwig
  2004-03-17 21:18   ` Scott Long
  0 siblings, 2 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-17 19:18 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-raid, justin_gibbs, Linux Kernel

Justin T. Gibbs wrote:
> [ I tried sending this last night from my Adaptec email address and have
>   yet to see it on the list.  Sorry if this is dup for any of you. ]

Included linux-kernel in the CC (and also bounced this post there).


> For the past few months, Adaptec Inc, has been working to enhance MD.

The FAQ from several corners is going to be "why not DM?", so I would 
humbly request that you (or Scott Long) re-post some of that rationale 
here...


> The goals of this project are:
> 
> 	o Allow fully pluggable meta-data modules

yep, needed


> 	o Add support for Adaptec ASR (aka HostRAID) and DDF
> 	  (Disk Data Format) meta-data types.  Both of these
> 	  formats are understood natively by certain vendor
> 	  BIOSes meaning that arrays can be booted from transparently.

yep, needed

For those who don't know, DDF is particularly interesting.  A storage 
industry association, "SNIA", has gotten most of the software and 
hardware RAID folks to agree on a common, vendor-neutral on-disk format. 
  Pretty historic, IMO :)  Since this will be appearing on most of the 
future RAID hardware, Linux users will be left out in a big way if this 
isn't supported.

EARLY DRAFT spec for DDF was posted on snia.org at
http://www.snia.org/tech_activities/ddftwg/DDFTrial-UseDraft_0_45.pdf


> 	o Improve the ability of MD to auto-configure arrays.

hmmmm.  Maybe in my language this means "improve ability for low-level 
drivers to communicate RAID support to upper layers"?


> 	o Support multi-level arrays transparently yet allow
> 	  proper event notification across levels when the
> 	  topology is known to MD.

I'll need to see the code to understand what this means, much less 
whether it is needed ;-)


> 	o Create a more generic "work item" framework which is
> 	  used to support array initialization, rebuild, and
> 	  verify operations as well as miscellaneous tasks that
> 	  a meta-data or RAID personality may need to perform
> 	  from a thread context (e.g. spare activation where
> 	  meta-data records may need to be sequenced carefully).

This is interesting.  (guessing) sort of like a pluggable finite state 
machine?


> 	o Modify the MD ioctl interface to allow the creation
> 	  of management utilities that are meta-data format
> 	  agnostic.

I'm thinking that for 2.6, it is much better to use a more tightly 
defined interface via a Linux character driver.  Userland write(2)'s 
packets of data (h/w raid commands or software raid configuration 
commands), and read(2)'s the responses.

ioctl's are a pain for 32->64-bit translation layers.  Using a 
read/write interface allows one to create an interface that requires no 
translation layer -- a big deal for AMD64 and IA32e processors moving 
forward -- and it also gives one a lot more control over the interface.

See, we need what I described _anyway_, as a chrdev-based interface to 
sending and receiving ATA taskfiles or SCSI cdb's.

It would be IMO simple to extend this to a looks-a-lot-like-ioctl 
raid_op interface.
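
(As a purely illustrative sketch of the "packet" idea -- the struct and names
below are hypothetical, not from any existing driver -- a request header built
only from fixed-width types has the same layout for 32- and 64-bit userland:)

#include <linux/types.h>

/* Hypothetical request header for a raid-control chrdev.  Only
 * fixed-width fields, no pointers or longs, so no 32->64-bit
 * compat translation is ever needed. */
struct raid_ctl_hdr {
	__u32	version;	/* interface version */
	__u32	opcode;		/* create array, query member, ... */
	__u32	payload_len;	/* bytes of payload following this header */
	__u32	status;		/* filled in by the kernel for the reply */
	__u64	cookie;		/* echoed back so replies can be matched */
};

/* Userland usage (sketch):
 *	write(fd, buf, sizeof(struct raid_ctl_hdr) + payload_len);
 *	read(fd, reply, sizeof(reply));
 */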


> A snapshot of this work is now available here:
> 
> 	http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz

Your email didn't say...  this appears to be for 2.6, correct?


> This snapshot includes support for RAID0, RAID1, and the Adaptec
> ASR and DDF meta-data formats.  Additional RAID personalities and
> support for the Super90 and Super 1 meta-data formats will be added
> in the coming weeks, the end goal being to provide a superset of
> the functionality in the current MD.

groovy


> Since the current MD notification scheme does not allow MD to receive
> notifications unless it is statically compiled into the kernel, we
> would like to work with the community to develop a more generic
> notification scheme to which modules, such as MD, can dynamically
> register.  Until that occurs, these EMD snapshots will require at
> least md.c to be a static component of the kernel.

You would just need a small stub that holds a notifier pointer, yes?
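
(Something like the following, I'd imagine -- a hypothetical sketch using a
bare function pointer rather than any particular notifier API; the names are
made up:)

#include <linux/module.h>
#include <linux/spinlock.h>

/* Hypothetical built-in stub: the only piece that must stay statically
 * linked.  A modular md/emd registers its callback at load time and
 * clears it again on unload. */
typedef void (*md_disk_notify_fn)(void *disk_cookie);

static md_disk_notify_fn md_notify_hook;
static DEFINE_SPINLOCK(md_notify_lock);

void md_set_notify_hook(md_disk_notify_fn fn)
{
	spin_lock(&md_notify_lock);
	md_notify_hook = fn;
	spin_unlock(&md_notify_lock);
}
EXPORT_SYMBOL(md_set_notify_hook);

/* Called by the built-in disk/partition code when a new disk shows up. */
void md_notify_disk(void *disk_cookie)
{
	spin_lock(&md_notify_lock);
	if (md_notify_hook)
		md_notify_hook(disk_cookie);
	spin_unlock(&md_notify_lock);
}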


> Architectural Notes
> ===================
> The major areas of change in "EMD" can be categorized into:
> 
> 1) "Object Oriented" Data structure changes 
> 
> 	These changes are the basis for allowing RAID personalities
> 	to transparently operate on "disks" or "arrays" as member
> 	objects.  While it has always been possible to create
> 	multi-level arrays in MD using block layer stacking, our
> 	approach allows MD to also stack internally.  Once a given
> 	RAID or meta-data personality is converted to the new
> 	structures, this "feature" comes at no cost.  The benefit
> 	to stacking internally, which requires a meta-data format
> 	that supports this, is that array state can propagate up
> 	and down the topology without the loss of information
> 	inherent in using the block layer to traverse levels of an
> 	array.

I have a feeling that consensus will prefer that we fix the block layer, 
and then figure out the best way to support "automatic stacking" -- 
since DDF and presumably other RAID formats will require automatic 
setup of raid0+1, etc.

Are there RAID-specific issues here, that do not apply to e.g. 
multipathing, which I've heard needs more information at the block layer?


> 2) Opcode based interfaces.
> 
> 	Rather than add additional method vectors to either the
> 	RAID personality or meta-data personality objects, the new
> 	code uses only a few methods that are parameterized.  This
> 	has allowed us to create a fairly rich interface between
> 	the core and the personalities without overly bloating
> 	personality "classes".

Modulo what I said above, about the chrdev userland interface, we want 
to avoid this.  You're already going down the wrong road by creating 
more untyped interfaces...

static int raid0_raidop(mdk_member_t *member, int op, void *arg)
{
         switch (op) {
         case MDK_RAID_OP_MSTATE_CHANGED:

The preferred model is to create a single marshalling module (a la 
net/core/ethtool.c) that converts the ioctls we must support into a 
fully typed function call interface (a la struct ethtool_ops).
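
(Roughly along these lines -- a hypothetical sketch, not actual emd or ethtool
code; the operation names are invented:)

/* Hypothetical fully-typed ops table: one function pointer per
 * operation with explicit argument types, instead of a single
 * raidop(member, op, void *arg) dispatcher. */
struct raid_member_state;		/* whatever the state payload is */

struct mdk_raid_ops {
	int (*member_state_changed)(mdk_member_t *member,
				    struct raid_member_state *state);
	int (*start_resync)(mdk_member_t *member);
	int (*activate_spare)(mdk_member_t *member, mdk_member_t *spare);
};

/* A small marshalling layer (a la net/core/ethtool.c) then translates
 * the user-visible ioctls into calls through this table, so the
 * compiler checks every argument type. */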


> 3) WorkItems
> 
> 	Workitems provide a generic framework for queuing work to
> 	a thread context.  Workitems include a "control" method as
> 	well as a "handler" method.  This separation allows, for
> 	example, a RAID personality to use the generic sync handler
> 	while trapping the "open", "close", and "free" of any sync
> 	workitems.  Since both handlers can be tailored to the
> 	individual workitem that is queued, this removes the need
> 	to overload one or more interfaces in the personalities.
> 	It also means that any code in MD can make use of this
> 	framework - it is not tied to particular objects or modules
> 	in the system.

Makes sense, though I wonder if we'll want to make this more generic. 
Hardware RAID drivers might want to use this sort of stuff internally?
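
(Guessing at the shape -- a hypothetical, simplified work item, not the actual
EMD structures:)

#include <linux/list.h>

/* Hypothetical work item: "handler" does the work in thread context,
 * "control" lets the owner trap lifecycle events (open/close/free)
 * without having to subclass the handler itself. */
enum witem_ctl { WITEM_OPEN, WITEM_CLOSE, WITEM_FREE };

struct work_item {
	struct list_head queue;				/* pending-work list */
	void (*handler)(struct work_item *wi);		/* runs in thread context */
	void (*control)(struct work_item *wi, enum witem_ctl what);
	void *private;					/* owner's data */
};

/* The worker thread would pop an item off the list and do roughly:
 *	wi->control(wi, WITEM_OPEN);
 *	wi->handler(wi);
 *	wi->control(wi, WITEM_CLOSE);
 */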


> 4) "Syncable Volume" Support
> 
> 	All of the transaction accounting necessary to support
> 	redundant arrays has been abstracted out into a few inline
> 	functions.  With the inclusion of a "sync support" structure
> 	in a RAID personality's private data structure area and the
> 	use of these functions, the generic sync framework is fully
> 	available.  The sync algorithm is also now more like that
> 	in 2.4.X - with some updates to improve performance.  Two
> 	contiguous sync ranges are employed so that sync I/O can
> 	be pending while the lock range is extended and new sync
> 	I/O is stalled waiting for normal I/O writes that might
> 	conflict with the new range complete.  The syncer updates
> 	its stats more frequently than in the past so that it can
> 	more quickly react to changes in the normal I/O load.  Syncer
> 	backoff is also disabled anytime there is pending I/O blocked
> 	on the syncer's locked region.  RAID personalities have
> 	full control over the size of the sync windows used so that
> 	they can be optimized based on RAID layout policy.

interesting.  makes sense on the surface, I'll have to think some more...
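
(To illustrate the two-window idea -- hypothetical and heavily simplified:)

#include <linux/types.h>

/* Hypothetical sketch of the two contiguous sync windows: sync I/O can
 * still be in flight in the "active" range while the "next" range is
 * being locked and waits for conflicting normal writes to drain. */
struct sync_range {
	sector_t start;			/* inclusive */
	sector_t end;			/* exclusive */
	int	 pending;		/* outstanding sync I/Os in this range */
};

struct sync_support {
	struct sync_range active;	/* sync I/O currently being issued */
	struct sync_range next;		/* extension being locked/drained */
	sector_t window_sectors;	/* window size, chosen by the RAID personality */
};

/* A normal write overlapping either range is stalled (and syncer
 * backoff is disabled) until the windows move past it. */
static inline int sync_conflict(struct sync_support *ss,
				sector_t sector, sector_t nr_sectors)
{
	sector_t end = sector + nr_sectors;

	return (sector < ss->active.end && end > ss->active.start) ||
	       (sector < ss->next.end   && end > ss->next.start);
}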


> 5) IOCTL Interface
> 
> 	"EMD" now performs all of its configuration via an "mdctl"
> 	character device.  Since one of our goals is to remove any
> 	knowledge of meta-data type in the user control programs,
> 	initial meta-data stamping and configuration validation
> 	occurs in the kernel.  In general, the meta-data modules
> 	already need this validation code in order to support
> 	auto-configuration, so adding this capability adds little
> 	to the overall size of EMD.  It does, however, require a
> 	few additional ioctls to support things like querying the
> 	maximum "coerced" size of a disk targeted for a new array,
> 	or enumerating the names of installed meta-data modules,
> 	etc.
> 	
> 	This area of EMD is still in very active development and we expect
> 	to provide a drop of an "emdadm" utility later this week.   

I haven't yet evaluated the ioctl interface.  I do understand the need 
to play alongside the existing md interface, but if there are huge 
numbers of additions, it would be preferred to just use the chrdev 
straightaway.  Such a chrdev would be easily portable to 2.4.x kernels 
too :)


> 7) Correction of RAID0 Transform
> 
> 	The RAID0 transform's "merge function" assumes that the
> 	incoming bio's starting sector is the same as what will be
> 	presented to its make_request function.  In the case of a
> 	partitioned MD device, the starting sector is shifted by
> 	the partition offset for the target offset.  Unfortunately,
> 	the merge functions are not notified of the partition
> 	transform, so RAID0 would often reject requests that span
> 	"chunk" boundaries once shifted.  The fix employed here is
> 	to determine if a partition transform will occur and take
> 	this into account in the merge function.

interesting
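
(The arithmetic being described, as a hypothetical sketch -- chunk size assumed
to be a power of two:)

#include <linux/types.h>

/* Hypothetical sketch of the corrected RAID0 chunk-boundary test: check
 * the sector the request will actually hit after the partition offset
 * is applied, not the sector carried in the incoming bio. */
static int raid0_fits_in_chunk(sector_t bio_sector, unsigned int bio_bytes,
			       sector_t partition_offset,
			       unsigned int chunk_sectors)
{
	sector_t real_sector = bio_sector + partition_offset;
	unsigned int offset_in_chunk = real_sector & (chunk_sectors - 1);

	/* non-zero if the request stays inside one chunk after shifting */
	return offset_in_chunk + (bio_bytes >> 9) <= chunk_sectors;
}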


> Adaptec is currently validating EMD through formal testing while
> continuing the build-out of new features.  Our hope is to gather
> feedback from the Linux community and adjust our approach to satisfy
> the community's requirements.  We look forward to your comments,
> suggestions, and review of this project.

Thanks much for working with the Linux community.

One overall comment on merging into 2.6:  the patch will need to be 
broken up into pieces.  It's OK if each piece is dependent on the prior 
one, and it's OK if there are 20, 30, even 100 pieces.  It helps a lot 
for review to see the evolution, and it also helps flush out problems 
you might not have even noticed.  e.g.
	- add concept of member, and related helper functions
	- use member functions/structs in raid drivers raid0.c, etc.
	- fix raid0 transform
	- add ioctls needed in order for DDF to be useful
	- add DDF format
	etc.




* Re: "Enhanced" MD code avaible for review
  2004-03-17 19:18 ` "Enhanced" MD code avaible for review Jeff Garzik
@ 2004-03-17 19:32   ` Christoph Hellwig
  2004-03-17 20:02     ` Jeff Garzik
  2004-03-17 21:18   ` Scott Long
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2004-03-17 19:32 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Justin T. Gibbs, linux-raid, Linux Kernel

On Wed, Mar 17, 2004 at 02:18:25PM -0500, Jeff Garzik wrote:
> > 	o Allow fully pluggable meta-data modules
> 
> yep, needed

Well, this is pretty much the EVMS route we all heavily argued against.
Most of the metadata shouldn't be visible in the kernel at all.

> > 	o Improve the ability of MD to auto-configure arrays.
> 
> hmmmm.  Maybe in my language this means "improve ability for low-level 
> drivers to communicate RAID support to upper layers"?

I think he's talking about the deprecated raid autorun feature.  Again
something that is completely misplaced in the kernel.  (again EVMS light)

> > 	o Support multi-level arrays transparently yet allow
> > 	  proper event notification across levels when the
> > 	  topology is known to MD.
> 
> I'll need to see the code to understand what this means, much less 
> whether it is needed ;-)

I think he means the broken inter-driver raid stacking mentioned below.
Why do I have to think of EVMS for each feature?..



* Re: "Enhanced" MD code avaible for review
  2004-03-17 19:32   ` Christoph Hellwig
@ 2004-03-17 20:02     ` Jeff Garzik
  0 siblings, 0 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-17 20:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Justin T. Gibbs, linux-raid, Linux Kernel

Christoph Hellwig wrote:
> On Wed, Mar 17, 2004 at 02:18:25PM -0500, Jeff Garzik wrote:
> 
>>>	o Allow fully pluggable meta-data modules
>>
>>yep, needed
> 
> 
> Well, this is pretty much the EVMS route we all heavily argued against.
> Most of the metadata shouldn't be visible in the kernel at all.

_some_ metadata is required at runtime, and must be in the kernel.  I 
agree that a lot of configuration doesn't necessarily need to be in the 
kernel.  But stuff like bad sector and event logs, and other bits are 
still needed at runtime.


>>>	o Improve the ability of MD to auto-configure arrays.
>>
>>hmmmm.  Maybe in my language this means "improve ability for low-level 
>>drivers to communicate RAID support to upper layers"?
> 
> 
> I think he's talking about the deprecated raid autorun feature.  Again
> something that is completely misplaced in the kernel.  (again EVMS light)

Indeed, but I'll let him and the code illuminate the meaning :)

	Jeff





* Re: "Enhanced" MD code avaible for review
  2004-03-17 19:18 ` "Enhanced" MD code avaible for review Jeff Garzik
  2004-03-17 19:32   ` Christoph Hellwig
@ 2004-03-17 21:18   ` Scott Long
  2004-03-17 21:35     ` Jeff Garzik
                       ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Scott Long @ 2004-03-17 21:18 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel

Jeff Garzik wrote:
> Justin T. Gibbs wrote:
>  > [ I tried sending this last night from my Adaptec email address and have
>  >   yet to see it on the list.  Sorry if this is dup for any of you. ]
> 
> Included linux-kernel in the CC (and also bounced this post there).
> 
> 
>  > For the past few months, Adaptec Inc, has been working to enhance MD.
> 
> The FAQ from several corners is going to be "why not DM?", so I would
> humbly request that you (or Scott Long) re-post some of that rationale
> here...
> 
> 
>  > The goals of this project are:
>  >
>  >       o Allow fully pluggable meta-data modules
> 
> yep, needed
> 
> 
>  >       o Add support for Adaptec ASR (aka HostRAID) and DDF
>  >         (Disk Data Format) meta-data types.  Both of these
>  >         formats are understood natively by certain vendor
>  >         BIOSes meaning that arrays can be booted from transparently.
> 
> yep, needed
> 
> For those who don't know, DDF is particularly interesting.  A storage
> industry association, "SNIA", has gotten most of the software and
> hardware RAID folks to agree on a common, vendor-neutral on-disk format.
>   Pretty historic, IMO :)  Since this will be appearing on most of the
> future RAID hardware, Linux users will be left out in a big way if this
> isn't supported.
> 
> EARLY DRAFT spec for DDF was posted on snia.org at
> http://www.snia.org/tech_activities/ddftwg/DDFTrial-UseDraft_0_45.pdf
> 
> 
>  >       o Improve the ability of MD to auto-configure arrays.
> 
> hmmmm.  Maybe in my language this means "improve ability for low-level
> drivers to communicate RAID support to upper layers"?
> 

No, this is full auto-configuration support at boot-time, and when
drives are hot-added.  I think that your comment applies to the next
item, and yes, you are correct.

> 
>  >       o Support multi-level arrays transparently yet allow
>  >         proper event notification across levels when the
>  >         topology is known to MD.
> 
> I'll need to see the code to understand what this means, much less
> whether it is needed ;-)
> 
> 
>  >       o Create a more generic "work item" framework which is
>  >         used to support array initialization, rebuild, and
>  >         verify operations as well as miscellaneous tasks that
>  >         a meta-data or RAID personality may need to perform
>  >         from a thread context (e.g. spare activation where
>  >         meta-data records may need to be sequenced carefully).
> 
> This is interesting.  (guessing) sort of like a pluggable finite state
> machine?
> 

More or less, yes.  We needed a way to bridge the gap from an error
being reported in an interrupt context to being able to allocate memory
and do blocking I/O from a thread context.  The md_error() interface
already existed to do this, but was way too primitive for our needs.  It
had no way to handle cascading or compound events.
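
(Roughly the pattern being described -- a hypothetical, heavily simplified
sketch of deferring an error from interrupt context to a thread:)

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Hypothetical event descriptor queued from interrupt context and
 * consumed by a thread that can sleep, allocate memory, and sequence
 * meta-data updates for cascading/compound failures. */
struct md_event {
	struct list_head list;
	int type;			/* member failed, spare needed, ... */
	void *member;			/* which member the event refers to */
};

static LIST_HEAD(md_event_queue);
static DEFINE_SPINLOCK(md_event_lock);
static DECLARE_WAIT_QUEUE_HEAD(md_event_wait);

/* Interrupt context: cheap and non-blocking; the descriptor is
 * preallocated by the caller. */
static void md_post_event(struct md_event *ev)
{
	unsigned long flags;

	spin_lock_irqsave(&md_event_lock, flags);
	list_add_tail(&ev->list, &md_event_queue);
	spin_unlock_irqrestore(&md_event_lock, flags);
	wake_up(&md_event_wait);
}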

> 
>  >       o Modify the MD ioctl interface to allow the creation
>  >         of management utilities that are meta-data format
>  >         agnostic.
> 
> I'm thinking that for 2.6, it is much better to use a more tightly
> defined interface via a Linux character driver.  Userland write(2)'s
> packets of data (h/w raid commands or software raid configuration
> commands), and read(2)'s the responses.
> 
> ioctl's are a pain for 32->64-bit translation layers.  Using a
> read/write interface allows one to create an interface that requires no
> translation layer -- a big deal for AMD64 and IA32e processors moving
> forward -- and it also gives one a lot more control over the interface.
> 

I'm not exactly sure what the difference is here.  Both the ioctl and 
read/write paths copy data in and out of the kernel.  The ioctl
method is a little bit easier since you don't have to stream in a chunk
of data before knowing what to do with it.  And I also don't see how
read/write protects you from endian and 64/32-bit issues any better than
ioctl does.  If you write your code cleanly and correctly, it's a moot point.
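
(For what it's worth, the "clean" version of either interface looks the same
from 32- and 64-bit userland; a hypothetical example -- the ioctl and struct
names are made up:)

#include <linux/types.h>
#include <linux/ioctl.h>

/* Hypothetical ioctl argument written to be compat-clean: fixed-width
 * types only, 64-bit members naturally aligned, no pointers or 'long',
 * so no 32->64 translation layer is required. */
struct emd_query_member {
	__u64	capacity_sectors;	/* out: coerced usable size */
	__u32	member_index;		/* in */
	__u32	state;			/* out */
};

#define EMD_IOC_MAGIC		'E'	/* hypothetical */
#define EMD_IOC_QUERY_MEMBER	_IOWR(EMD_IOC_MAGIC, 0x01, struct emd_query_member)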

> See, we need what I described _anyway_, as a chrdev-based interface to
> sending and receiving ATA taskfiles or SCSI cdb's.
> 
> It would be IMO simple to extend this to a looks-a-lot-like-ioctl
> raid_op interface.
> 
> 
>  > A snapshot of this work is now available here:
>  >
>  >       http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz
> 
> Your email didn't say...  this appears to be for 2.6, correct?
> 
> 
>  > This snapshot includes support for RAID0, RAID1, and the Adaptec
>  > ASR and DDF meta-data formats.  Additional RAID personalities and
>  > support for the Super90 and Super 1 meta-data formats will be added
>  > in the coming weeks, the end goal being to provide a superset of
>  > the functionality in the current MD.
> 
> groovy
> 
> 
>  > Since the current MD notification scheme does not allow MD to receive
>  > notifications unless it is statically compiled into the kernel, we
>  > would like to work with the community to develop a more generic
>  > notification scheme to which modules, such as MD, can dynamically
>  > register.  Until that occurs, these EMD snapshots will require at
>  > least md.c to be a static component of the kernel.
> 
> You would just need a small stub that holds a notifier pointer, yes?
> 

I think that we are flexible on this.  We have an implementation from
several years ago that records partition type information and passes it
around in the notification message so that consumers can register for
distinct types of disks/partitions/etc.  Our needs aren't that complex,
but we would be happy to share it anyway since it is useful.

> 
>  > Architectural Notes
>  > ===================
>  > The major areas of change in "EMD" can be categorized into:
>  >
>  > 1) "Object Oriented" Data structure changes
>  >
>  >       These changes are the basis for allowing RAID personalities
>  >       to transparently operate on "disks" or "arrays" as member
>  >       objects.  While it has always been possible to create
>  >       multi-level arrays in MD using block layer stacking, our
>  >       approach allows MD to also stack internally.  Once a given
>  >       RAID or meta-data personality is converted to the new
>  >       structures, this "feature" comes at no cost.  The benefit
>  >       to stacking internally, which requires a meta-data format
>  >       that supports this, is that array state can propagate up
>  >       and down the topology without the loss of information
>  >       inherent in using the block layer to traverse levels of an
>  >       array.
> 
> I have a feeling that consensus will prefer that we fix the block layer,
> and then figure out the best way to support "automatic stacking" --
> since DDF and presumeably other RAID formats will require automatic
> setup of raid0+1, etc.
> 
> Are there RAID-specific issues here, that do not apply to e.g.
> multipathing, which I've heard needs more information at the block layer?
> 

No, the issue is, how do you propagate events through the block layer?
EIO/EINVAL/etc error codes just don't cut it.  Also, many metadata 
formats are unified, in that even though the arrays are stacked, the
metadata sees the entire picture.  Updates might need to touch every 
disk in the compound array, not just a certain sub-array.

The stacking that we do internal to MD is still fairly clean and doesn't
prevent one from stacking outside of MD.
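
(A hypothetical sketch of the point about internal stacking -- the structures
below are invented for illustration:)

/* When MD stacks internally it can walk the topology directly, so a
 * failure in a sub-array reaches the parent with full information
 * rather than being collapsed into an -EIO seen through the block layer. */
struct emd_member;

struct emd_array {
	struct emd_member *self_member;	/* this array viewed as a member of
					 * its parent array, or NULL at the top */
};

struct emd_member {
	struct emd_array *parent;	/* array this member belongs to */
	int state;
};

static void emd_propagate_state(struct emd_member *m, int new_state)
{
	while (m) {
		m->state = new_state;
		/* the owning array may itself be a member of a higher-level
		 * array; keep walking up the topology. */
		m = m->parent ? m->parent->self_member : NULL;
	}
}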

> 
>  > 2) Opcode based interfaces.
>  >
>  >       Rather than add additional method vectors to either the
>  >       RAID personality or meta-data personality objects, the new
>  >       code uses only a few methods that are parameterized.  This
>  >       has allowed us to create a fairly rich interface between
>  >       the core and the personalities without overly bloating
>  >       personality "classes".
> 
> Modulo what I said above, about the chrdev userland interface, we want
> to avoid this.  You're already going down the wrong road by creating
> more untyped interfaces...
> 
> static int raid0_raidop(mdk_member_t *member, int op, void *arg)
> {
>          switch (op) {
>          case MDK_RAID_OP_MSTATE_CHANGED:
> 
> The preferred model is to create a single marshalling module (a la
> net/core/ethtool.c) that converts the ioctls we must support into a
> fully typed function call interface (a la struct ethtool_ops).
> 

These OPS don't exist solely for the userland app.  They also exist for
communicating between the raid transform and metadata modules.

> 
>  > 3) WorkItems
>  >
>  >       Workitems provide a generic framework for queuing work to
>  >       a thread context.  Workitems include a "control" method as
>  >       well as a "handler" method.  This separation allows, for
>  >       example, a RAID personality to use the generic sync handler
>  >       while trapping the "open", "close", and "free" of any sync
>  >       workitems.  Since both handlers can be tailored to the
>  >       individual workitem that is queued, this removes the need
>  >       to overload one or more interfaces in the personalities.
>  >       It also means that any code in MD can make use of this
>  >       framework - it is not tied to particular objects or modules
>  >       in the system.
> 
> Makes sense, though I wonder if we'll want to make this more generic.
> hardware RAID drivers might want to use this sort of stuff internally?
> 

If you want to make it into a more generic kernel service, that's fine. 
However, I'm not quite sure what kind of work items a hardware raid
driver will need.  The whole point there is to hide what's going on ;-)

> 
>  > 4) "Syncable Volume" Support
>  >
>  >       All of the transaction accounting necessary to support
>  >       redundant arrays has been abstracted out into a few inline
>  >       functions.  With the inclusion of a "sync support" structure
>  >       in a RAID personality's private data structure area and the
>  >       use of these functions, the generic sync framework is fully
>  >       available.  The sync algorithm is also now more like that
>  >       in 2.4.X - with some updates to improve performance.  Two
>  >       contiguous sync ranges are employed so that sync I/O can
>  >       be pending while the lock range is extended and new sync
>  >       I/O is stalled waiting for normal I/O writes that might
>  >       conflict with the new range complete.  The syncer updates
>  >       its stats more frequently than in the past so that it can
>  >       more quickly react to changes in the normal I/O load.  Syncer
>  >       backoff is also disabled anytime there is pending I/O blocked
>  >       on the syncer's locked region.  RAID personalities have
>  >       full control over the size of the sync windows used so that
>  >       they can be optimized based on RAID layout policy.
> 
> interesting.  makes sense on the surface, I'll have to think some more...
> 
> 
>  > 5) IOCTL Interface
>  >
>  >       "EMD" now performs all of its configuration via an "mdctl"
>  >       character device.  Since one of our goals is to remove any
>  >       knowledge of meta-data type in the user control programs,
>  >       initial meta-data stamping and configuration validation
>  >       occurs in the kernel.  In general, the meta-data modules
>  >       already need this validation code in order to support
>  >       auto-configuration, so adding this capability adds little
>  >       to the overall size of EMD.  It does, however, require a
>  >       few additional ioctls to support things like querying the
>  >       maximum "coerced" size of a disk targeted for a new array,
>  >       or enumerating the names of installed meta-data modules,
>  >       etc.
>  >      
>  >       This area of EMD is still in very active development and we expect
>  >       to provide a drop of an "emdadm" utility later this week.  
> 
> I haven't evaluated yet the ioctl interface.  I do understand the need
> to play alongside the existing md interface, but if there are huge
> numbers of additions, it would be preferred to just use the chrdev
> straightaway.  Such a chrdev would be easily portable to 2.4.x kernels
> too :)
> 
> 
>  > 7) Correction of RAID0 Transform
>  >
>  >       The RAID0 transform's "merge function" assumes that the
>  >       incoming bio's starting sector is the same as what will be
>  >       presented to its make_request function.  In the case of a
>  >       partitioned MD device, the starting sector is shifted by
>  >       the partition offset for the target offset.  Unfortunately,
>  >       the merge functions are not notified of the partition
>  >       transform, so RAID0 would often reject requests that span
>  >       "chunk" boundaries once shifted.  The fix employed here is
>  >       to determine if a partition transform will occur and take
>  >       this into account in the merge function.
> 
> interesting
> 
> 
>  > Adaptec is currently validating EMD through formal testing while
>  > continuing the build-out of new features.  Our hope is to gather
>  > feedback from the Linux community and adjust our approach to satisfy
>  > the community's requirements.  We look forward to your comments,
>  > suggestions, and review of this project.
> 
> Thanks much for working with the Linux community.
> 
> One overall comment on merging into 2.6:  the patch will need to be
> broken up into pieces.  It's OK if each piece is dependent on the prior
> one, and it's OK if there are 20, 30, even 100 pieces.  It helps a lot
> for review to see the evolution, and it also helps flush out problems
> you might not have even noticed.  e.g.
>         - add concept of member, and related helper functions
>         - use member functions/structs in raid drivers raid0.c, etc.
>         - fix raid0 transform
>         - add ioctls needed in order for DDF to be useful
>         - add DDF format
>         etc.
> 

We can provide our Perforce changelogs (just like we do for SCSI).

Scott



* Re: "Enhanced" MD code avaible for review
  2004-03-17 21:18   ` Scott Long
@ 2004-03-17 21:35     ` Jeff Garzik
  2004-03-17 21:45     ` Bartlomiej Zolnierkiewicz
  2004-03-18  1:56     ` viro
  2 siblings, 0 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-17 21:35 UTC (permalink / raw)
  To: Scott Long; +Cc: Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel

Scott Long wrote:
> Jeff Garzik wrote:
>> Modulo what I said above, about the chrdev userland interface, we want
>> to avoid this.  You're already going down the wrong road by creating
>> more untyped interfaces...
>>
>> static int raid0_raidop(mdk_member_t *member, int op, void *arg)
>> {
>>          switch (op) {
>>          case MDK_RAID_OP_MSTATE_CHANGED:
>>
>> The preferred model is to create a single marshalling module (a la
>> net/core/ethtool.c) that converts the ioctls we must support into a
>> fully typed function call interface (a la struct ethtool_ops).
>>
> 
> These OPS don't exist soley for the userland ap.  They also exist for
> communicating between the raid transform and metadata modules.

Nod -- kernel internal calls should _especially_ be type-explicit, not 
typeless ioctl-like APIs.


>> One overall comment on merging into 2.6:  the patch will need to be
>> broken up into pieces.  It's OK if each piece is dependent on the prior
>> one, and it's OK if there are 20, 30, even 100 pieces.  It helps a lot
>> for review to see the evolution, and it also helps flush out problems
>> you might not have even noticed.  e.g.
>>         - add concept of member, and related helper functions
>>         - use member functions/structs in raid drivers raid0.c, etc.
>>         - fix raid0 transform
>>         - add ioctls needed in order for DDF to be useful
>>         - add DDF format
>>         etc.
>>
> 
> We can provide our Perforce changelogs (just like we do for SCSI).

What I'm saying is, emd needs to be submitted to the kernel just like 
Neil Brown submits patches to Andrew, etc.  This is how everybody else 
submits and maintains Linux kernel code.  There needs to be N patches, 
one patch per email, that successively introduces new code, or modifies 
existing code.

Absent all other issues, one huge patch that completely updates md 
isn't going to be acceptable, no matter how nifty or well-tested it is...

	Jeff





* Re: "Enhanced" MD code avaible for review
  2004-03-17 21:18   ` Scott Long
  2004-03-17 21:35     ` Jeff Garzik
@ 2004-03-17 21:45     ` Bartlomiej Zolnierkiewicz
  2004-03-18  0:23       ` Scott Long
  2004-03-18  1:56     ` viro
  2 siblings, 1 reply; 56+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2004-03-17 21:45 UTC (permalink / raw)
  To: Scott Long, Jeff Garzik
  Cc: Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel

On Wednesday 17 of March 2004 22:18, Scott Long wrote:
> Jeff Garzik wrote:
> > Justin T. Gibbs wrote:
> >  > [ I tried sending this last night from my Adaptec email address and
> >  > have yet to see it on the list.  Sorry if this is dup for any of you.
> >  > ]
> >
> > Included linux-kernel in the CC (and also bounced this post there).
> >
> >  > For the past few months, Adaptec Inc, has been working to enhance MD.
> >
> > The FAQ from several corners is going to be "why not DM?", so I would
> > humbly request that you (or Scott Long) re-post some of that rationale
> > here...

This is the #1 question, so... why not DM?  8)

Regards,
Bartlomiej



* Re: "Enhanced" MD code avaible for review
  2004-03-17 21:45     ` Bartlomiej Zolnierkiewicz
@ 2004-03-18  0:23       ` Scott Long
  2004-03-18  1:55         ` Bartlomiej Zolnierkiewicz
                           ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Scott Long @ 2004-03-18  0:23 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz
  Cc: Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel

Bartlomiej Zolnierkiewicz wrote:
> On Wednesday 17 of March 2004 22:18, Scott Long wrote:
>  > Jeff Garzik wrote:
>  > > Justin T. Gibbs wrote:
>  > >  > [ I tried sending this last night from my Adaptec email address and
>  > >  > have yet to see it on the list.  Sorry if this is dup for any of 
> you.
>  > >  > ]
>  > >
>  > > Included linux-kernel in the CC (and also bounced this post there).
>  > >
>  > >  > For the past few months, Adaptec Inc, has been working to 
> enhance MD.
>  > >
>  > > The FAQ from several corners is going to be "why not DM?", so I would
>  > > humbly request that you (or Scott Long) re-post some of that rationale
>  > > here...
> 
> This is #1 question so... why not DM?  8)
> 
> Regards,
> Bartlomiej
> 


The primary feature of any RAID implementation is reliability. 
Reliability is a surprisingly hard goal.  Making sure that your
data is available and trustworthy under real-world scenarios is
a lot harder than it sounds.  This has been a significant focus
of ours on MD, and is the primary reason why we chose MD as the
foundation of our work.

Storage is the foundation of everything that you do with your
computer.  It needs to work regardless of what happened to your 
filesystem on the last crash, regardless of whether or not you
have the latest initrd tools, regardless of what rpms you've kept
up to date on, regardless if your userland works, regardless of
what libc you are using this week, etc.

With DM, what happens when your initrd gets accidentally corrupted?
What happens when the kernel and userland pieces get out of sync?
Maybe you are booting off of a single drive and only using DM arrays
for secondary storage, but maybe you're not.  If something goes wrong
with DM, how do you boot?

Secondly, our target here is to interoperate with hardware components
that run outside the scope of Linux.  The HostRAID or DDF BIOS is
going to create an array using its own format.  It's not going to
have any knowledge of DM config files, initrd, ramfs, etc.  However,
the end user is still going to expect to be able to seamlessly install
onto that newly created array, maybe move that array to another system,
whatever, and have it all Just Work.  Has anyone heard of a hardware
RAID card that requires you to run OS-specific commands in order to
access the arrays on it?  Of course not.  The point here is to make
software raid just as easy to the end user.

The third, and arguably most important issue is the need for reliable
error recovery.  With the DM model, error recovery would be done in
userland.  Errors generated during I/O would be kicked to a userland
app that would then drive the recovery-spare activation-rebuild
sequence.  That's fine, but what if something happens that prevents
the userland tool from running?  Maybe it was a daemon that became
idle and got swapped out to disk, but now you can't swap it back in
because your I/O is failing.  Or maybe it needs to activate a helper
module or read a config file, but again it can't because i/o is
failing.  What if it crashes.  What if the source code gets out of sync
with the kernel interface.  What if you upgrade glibc and it stops
working for whatever unknown reason.

Some have suggested in the past that these userland tools get put into
ramfs and locked into memory.  If you do that, then it might as well be
part of the kernel anyway.  It's consuming as much memory as, if not
more than, the equivalent code in the kernel (likely a lot more, since
you'd have to statically link it).  And you still have the downsides of it
possibly getting out of date with the kernel.  So what are the upsides?

MD is not terribly heavy-weight.  As a monolithic module of
DDF+ASR+R0+R1 it's about 65k in size.  That's 1/2 the size of your
average SCSI driver these days, and no one is advocating putting those
into userland.  It just doesn't make sense to sacrifice reliability
for the phantom goal of 'reducing kernel bloat'.

Scott



* Re: "Enhanced" MD code avaible for review
  2004-03-18  0:23       ` Scott Long
@ 2004-03-18  1:55         ` Bartlomiej Zolnierkiewicz
  2004-03-18  6:38         ` Stefan Smietanowski
  2004-03-20 13:07         ` Arjan van de Ven
  2 siblings, 0 replies; 56+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2004-03-18  1:55 UTC (permalink / raw)
  To: Scott Long
  Cc: Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel

On Thursday 18 of March 2004 01:23, Scott Long wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Wednesday 17 of March 2004 22:18, Scott Long wrote:
> >  > Jeff Garzik wrote:
> >  > > Justin T. Gibbs wrote:
> >  > >  > [ I tried sending this last night from my Adaptec email address
> >  > >  > and have yet to see it on the list.  Sorry if this is dup for any
> >  > >  > of
> >
> > you.
> >
> >  > >  > ]
> >  > >
> >  > > Included linux-kernel in the CC (and also bounced this post there).
> >  > >
> >  > >  > For the past few months, Adaptec Inc, has been working to
> >
> > enhance MD.
> >
> >  > > The FAQ from several corners is going to be "why not DM?", so I
> >  > > would humbly request that you (or Scott Long) re-post some of that
> >  > > rationale here...
> >
> > This is #1 question so... why not DM?  8)
> >
> > Regards,
> > Bartlomiej
>
> The primary feature of any RAID implementation is reliability.
> Reliability is a surprisingly hard goal.  Making sure that your
> data is available and trustworthy under real-world scenarios is
> a lot harder than it sounds.  This has been a significant focus
> of ours on MD, and is the primary reason why we chose MD as the
> foundation of our work.

Okay.

> Storage is the foundation of everything that you do with your
> computer.  It needs to work regardless of what happened to your
> filesystem on the last crash, regardless of whether or not you
> have the latest initrd tools, regardless of what rpms you've kept
> up to date on, regardless if your userland works, regardless of
> what libc you are using this week, etc.

I'm thinking about initrd+klibc, not rpms+libc;
the fs sits at a higher level than DM - an fs crash is not a problem here.

> With DM, what happens when your initrd gets accidentally corrupted?

The same thing that happens when your kernel image gets corrupted;
the probability is similar.

> What happens when the kernel and userland pieces get out of sync?

The same thing that happens when your kernel driver gets out of sync.

> Maybe you are booting off of a single drive and only using DM arrays
> for secondary storage, but maybe you're not.  If something goes wrong
> with DM, how do you boot?

The same what happens when "something" wrong goes with kernel.

> Secondly, our target here is to interoperate with hardware components
> that run outside the scope of Linux.  The HostRAID or DDF BIOS is
> going to create an array using it's own format.  It's not going to
> have any knowledge of DM config files, initrd, ramfs, etc.  However,

It doesn't need any knowledge of config files, initrd, ramfs etc.

> the end user is still going to expect to be able to seamlessly install
> onto that newly created array, maybe move that array to another system,
> whatever, and have it all Just Work.  Has anyone heard of a hardware
> RAID card that requires you to run OS-specific commands in order to
> access the arrays on it?  Of course not.  The point here is to make
> software raid just as easy to the end user.

It won't require the user to run any commands.

The RAID card gets detected and initialized -> a hotplug event happens ->
user-land configuration tools are executed, etc.
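
(i.e. something along the lines of the 2.6 /sbin/hotplug mechanism; a
hypothetical userland sketch -- the "emdassemble" tool name is made up:)

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical hotplug helper: the kernel invokes it with the event
 * described in environment variables; it hands new block devices to a
 * metadata-aware assembler in userland. */
int main(void)
{
	const char *action  = getenv("ACTION");		/* "add", "remove", ... */
	const char *devpath = getenv("DEVPATH");	/* sysfs path of the device */

	if (action && devpath && strcmp(action, "add") == 0)
		execl("/sbin/emdassemble", "emdassemble", "--scan", devpath,
		      (char *)NULL);
	return 0;
}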

> The third, and arguably most important issue is the need for reliable
> error recovery.  With the DM model, error recovery would be done in
> userland.  Errors generated during I/O would be kicked to a userland
> app that would then drive the recovery-spare activation-rebuild
> sequence.  That's fine, but what if something happens that prevents
> the userland tool from running?  Maybe it was a daemon that became
> idle and got swapped out to disk, but now you can't swap it back in
> because your I/O is failing.  Or maybe it needs to activate a helper
> module or read a config file, but again it can't because i/o is

I see valid points here, but ramfs can be used, etc.

> failing.  What if it crashes.  What if the source code gets out of sync
> with the kernel interface.  What if you upgrade glibc and it stops
> working for whatever unknown reason.

glibc is not needed/recommended here.

> Some have suggested in the past that these userland tools get put into
> ramfs and locked into memory.  If you do that, then it might as well be
> part of the kernel anyways.  It's consuming the same memory, if not
> more, than the equivalent code in the kernel (likely a lot more since
> you'd  have to static link it).  And you still have the downsides of it
> possibly getting out of date with the kernel.  So what are the upsides?

Faster/easier development - user-space apps don't OOPS. :-)
Someone other than the kernel people has to update user-land. :-)

> MD is not terribly heavy-weight.  As a monolithic module of
> DDF+ASR+R0+R1 it's about 65k in size.  That's 1/2 the size of your
> average SCSI driver these days, and no one is advocating putting those

A SCSI driver is low-level stuff - it needs direct hardware access.

Even 65k is still bloat - think about a vendor kernel including support
for all possible RAID flavors.  If they are modular, they require an initrd,
so they may as well be put in user-land.

> into userland.  It just doesn't make sense to sacrifice reliability
> for the phantom goal of 'reducing kernel bloat'.

ATARAID drivers are just moving in this direction...
ASR+DDF will also follow this way... sooner or later...

Regards,
Bartlomiej



* Re: "Enhanced" MD code avaible for review
  2004-03-17 21:18   ` Scott Long
  2004-03-17 21:35     ` Jeff Garzik
  2004-03-17 21:45     ` Bartlomiej Zolnierkiewicz
@ 2004-03-18  1:56     ` viro
  2 siblings, 0 replies; 56+ messages in thread
From: viro @ 2004-03-18  1:56 UTC (permalink / raw)
  To: Scott Long
  Cc: Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel

On Wed, Mar 17, 2004 at 02:18:01PM -0700, Scott Long wrote:
> >One overall comment on merging into 2.6:  the patch will need to be
> >broken up into pieces.  It's OK if each piece is dependent on the prior
> >one, and it's OK if there are 20, 30, even 100 pieces.  It helps a lot
> >for review to see the evolution, and it also helps flush out problems
> >you might not have even noticed.  e.g.
> >        - add concept of member, and related helper functions
> >        - use member functions/structs in raid drivers raid0.c, etc.
> >        - fix raid0 transform
> >        - add ioctls needed in order for DDF to be useful
> >        - add DDF format
> >        etc.
> >
> 
> We can provide our Perforce changelogs (just like we do for SCSI).

TA: "you must submit a solution, not just an answer"
CALC101 student: "but I've checked the answer, it's OK"
TA: "I'm sorry, it's not enough"
<student hands a pile of paper covered with snippets of text and calculations>
Student: "All right, here are all notes I've made while solving the problem.
Happy now?"
TA: <exasperated sigh> "Not really"


* Re: "Enhanced" MD code avaible for review
  2004-03-18  0:23       ` Scott Long
  2004-03-18  1:55         ` Bartlomiej Zolnierkiewicz
@ 2004-03-18  6:38         ` Stefan Smietanowski
  2004-03-20 13:07         ` Arjan van de Ven
  2 siblings, 0 replies; 56+ messages in thread
From: Stefan Smietanowski @ 2004-03-18  6:38 UTC (permalink / raw)
  To: Scott Long
  Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs,
	linux-raid, Gibbs, Justin, Linux Kernel

Hi.

<snip beginning of discussion about DDF, etc>

> With DM, what happens when your initrd gets accidentally corrupted?
> What happens when the kernel and userland pieces get out of sync?
> Maybe you are booting off of a single drive and only using DM arrays
> for secondary storage, but maybe you're not.  If something goes wrong
> with DM, how do you boot?

Tell me something... Do you guys release a driver for WinXP, for
example? You don't really have to answer that, as it's obvious that
you do. Does your installation program recompile the Windows kernel
so that your driver is monolithic? The answer is presumably no -
that's not how it's done there.

Ok. Your example is "what if the initrd gets corrupted?" and my example
is "what if your driver file(s) get corrupted?" - and my example applies
just as much to a module in Linux as to a driver in Windows.

Now, since you do supply a Windows driver, and that driver is NOT
statically linked to the Windows kernel, why is it that you believe
a meta driver (which MD really is, in a sense) needs special treatment
(static linking into the kernel) when, for instance, a driver for a piece
of hardware doesn't? If you have disk corruption so severe that your
initrd is corrupted, I would seriously suggest NOT booting the OS
that's on that drive, regardless of anything else, and sticking the drive
in another box OR booting from rescue media of some sort.

// Stefan


* Re: "Enhanced" MD code avaible for review
  2004-03-18  0:23       ` Scott Long
  2004-03-18  1:55         ` Bartlomiej Zolnierkiewicz
  2004-03-18  6:38         ` Stefan Smietanowski
@ 2004-03-20 13:07         ` Arjan van de Ven
  2004-03-21 23:42           ` Scott Long
  2 siblings, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2004-03-20 13:07 UTC (permalink / raw)
  To: Scott Long
  Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs,
	linux-raid, Gibbs, Justin, Linux Kernel



> With DM, what happens when your initrd gets accidentally corrupted?

What happens if your vmlinuz accidentally gets corrupted? If your initrd
is toast, the module for your root fs doesn't load either. Duh.

> What happens when the kernel and userland pieces get out of sync?
> Maybe you are booting off of a single drive and only using DM arrays
> for secondary storage, but maybe you're not.  If something goes wrong
> with DM, how do you boot?

If you lose 10 disks out of your RAID array, how do you boot?

> 
> Secondly, our target here is to interoperate with hardware components
> that run outside the scope of Linux.  The HostRAID or DDF BIOS is
> going to create an array using it's own format.  It's not going to
> have any knowledge of DM config files, 

DM doesn't need/use config files.
> initrd, ramfs, etc.  However,
> the end user is still going to expect to be able to seamlessly install
> onto that newly created array, maybe move that array to another system,
> whatever, and have it all Just Work.  Has anyone heard of a hardware
> RAID card that requires you to run OS-specific commands in order to
> access the arrays on it?  Of course not.  The point here is to make
> software raid just as easy to the end user.

And that is an easy task for distribution makers (or actually the people
who make the initrd creation software).

I'm sorry, I'm not buying your arguments and consider this 100% the wrong
direction. I'm hoping that someone with a bit more time than me will
write the DDF device mapper target so that I can use it for my
kernels... ;)




* Re: "Enhanced" MD code avaible for review
  2004-03-20 13:07         ` Arjan van de Ven
@ 2004-03-21 23:42           ` Scott Long
  2004-03-22  9:05             ` Arjan van de Ven
  0 siblings, 1 reply; 56+ messages in thread
From: Scott Long @ 2004-03-21 23:42 UTC (permalink / raw)
  To: arjanv
  Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs,
	linux-raid, Gibbs, Justin, Linux Kernel

Arjan van de Ven wrote:
>>With DM, what happens when your initrd gets accidentally corrupted?
> 
> 
> What happens if your vmlinuz accidentally gets corrupted? If your initrd
> is toast the module for your root fs doesn't load either. Duh.

The point here is to minimize points of failure.

> 
> 
>>What happens when the kernel and userland pieces get out of sync?
>>Maybe you are booting off of a single drive and only using DM arrays
>>for secondary storage, but maybe you're not.  If something goes wrong
>>with DM, how do you boot?
> 
> 
> If you loose 10 disks out of your raid array, how do you boot ?

That's a silly statement and has nothing to do with the argument.

> 
> 
>>Secondly, our target here is to interoperate with hardware components
>>that run outside the scope of Linux.  The HostRAID or DDF BIOS is
>>going to create an array using it's own format.  It's not going to
>>have any knowledge of DM config files, 
> 
> 
> DM doesn't need/use config files.
> 
>>initrd, ramfs, etc.  However,
>>the end user is still going to expect to be able to seamlessly install
>>onto that newly created array, maybe move that array to another system,
>>whatever, and have it all Just Work.  Has anyone heard of a hardware
>>RAID card that requires you to run OS-specific commands in order to
>>access the arrays on it?  Of course not.  The point here is to make
>>software raid just as easy to the end user.
> 
> 
> And that is an easy task for distribution makers (or actually the people
> who make the initrd creation software).
> 
> I'm sorry, I'm not buying your arguments and consider 100% the wrong
> direction. I'm hoping that someone with a bit more time than me will
> write the DDF device mapper target so that I can use it for my
> kernels... ;)
> 

Well, code speaks louder than words, as this group loves to say.  I 
eagerly await your code.  Barring that, I eagerly await a technical
argument, rather than an emotional "you're wrong because I'm right"
argument.

Scott



* Re: "Enhanced" MD code avaible for review
  2004-03-21 23:42           ` Scott Long
@ 2004-03-22  9:05             ` Arjan van de Ven
  2004-03-22 21:59               ` Scott Long
  0 siblings, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2004-03-22  9:05 UTC (permalink / raw)
  To: Scott Long
  Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs,
	linux-raid, Gibbs, Justin, Linux Kernel


On Mon, 2004-03-22 at 00:42, Scott Long wrote:

> Well, code speaks louder than words, as this group loves to say.  I 
> eagerly await your code.  Barring that, I eagerly await a technical
> argument, rather than an emotional "you're wrong because I'm right"
> argument.

I think that all the arguments for using DM are technical arguments, not
emotional ones. Oh well... you're free to write your code, and I'm free to not
use it in my kernels ;)



* Re: "Enhanced" MD code avaible for review
  2004-03-22  9:05             ` Arjan van de Ven
@ 2004-03-22 21:59               ` Scott Long
  2004-03-22 22:22                 ` Lars Marowsky-Bree
  2004-03-23  6:48                 ` Arjan van de Ven
  0 siblings, 2 replies; 56+ messages in thread
From: Scott Long @ 2004-03-22 21:59 UTC (permalink / raw)
  To: arjanv
  Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs,
	linux-raid, Gibbs, Justin, Linux Kernel

Arjan van de Ven wrote:
> On Mon, 2004-03-22 at 00:42, Scott Long wrote:
> 
> 
>>Well, code speaks louder than words, as this group loves to say.  I 
>>eagerly await your code.  Barring that, I eagerly await a technical
>>argument, rather than an emotional "you're wrong because I'm right"
>>argument.
> 
> 
> I think that all the arguments for using DM are techinical arguments not
> emotional ones. oh well.. you're free to write your code I'm free to not
> use it in my kernels ;)

Ok, the technical argument I've heard in favor of the DM approach is
that it reduces kernel bloat.  That's fair, and I certainly agree with not
putting the kitchen sink into the kernel.  Our position on EMD is that
it's a special case because you want to reduce the number of failure
modes, and that it doesn't contribute in a significant way to the kernel
size.  Your response to that is that our arguments don't matter since your
mind is already made up.  That's the barrier I'm trying to break through
so we can have a technical discussion.

Scott



* Re: "Enhanced" MD code avaible for review
  2004-03-22 21:59               ` Scott Long
@ 2004-03-22 22:22                 ` Lars Marowsky-Bree
  2004-03-23  6:48                 ` Arjan van de Ven
  1 sibling, 0 replies; 56+ messages in thread
From: Lars Marowsky-Bree @ 2004-03-22 22:22 UTC (permalink / raw)
  To: Linux Kernel

On 2004-03-22T14:59:29,
   Scott Long <scott_long@adaptec.com> said:

> Ok, the technical arguments I've heard in favor of the DM approach is 
> that it reduces kernel bloat.  That fair, and I certainly agree with not
> putting the kitchen sink into the kernel.  Our position on EMD is that
> it's a special case because you want to reduce the number of failure
> modes, and that it doesn't contribute in a significant way to the kernel
> size.  Your response to that our arguments don't matter since your mind
> is already made up.  That's the barrier I'm trying to break through and
> have a techincal discussion on.

The problematic point is that the failure modes which you want to
protect against all basically amount to -EUSERTOOSTUPID (if he forgot to
update the initrd and thus basically missed a vital part of the kernel
update), or -EFUBAR (in which case even the kernel image itself won't
help you). In those cases, not even being linked into the kernel helps
you any.

All of these cases are well understood, and have been problematic in the
past already, and will fuck the user up whether he has EMD enabled or
not. That EMD is coming up is not going to help him much, because he
won't be able to mount the root filesystem w/o the filesystem module,
or without the LVM2/EVMS2 stuff, etc.  An initrd has long been mostly
mandatory for such scenarios already.

This is the way the kernel has been developing for a while. Your
patch does something different, and the reasons you give are not
convincing.

In particular, if EMD is going to be stacked with other stuff (i.e., EMD
RAID1 on top of multipath or whatever), having the autodiscovery in the
kernel is actually cumbersome. And yes, right now you have only one
format. But bet on it, the spec will change, vendors will not 100%
adhere to it, new formats will be supported by the same code etc, and
thus the discovery logic will become bigger. Having such complexity
outside the kernel is good, and it's also not time-critical, because it
is only done once.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	      \ ever tried. ever failed. no matter.
SUSE Labs			      | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ 	-- Samuel Beckett



* Re: "Enhanced" MD code avaible for review
  2004-03-22 21:59               ` Scott Long
  2004-03-22 22:22                 ` Lars Marowsky-Bree
@ 2004-03-23  6:48                 ` Arjan van de Ven
  1 sibling, 0 replies; 56+ messages in thread
From: Arjan van de Ven @ 2004-03-23  6:48 UTC (permalink / raw)
  To: Scott Long
  Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs,
	linux-raid, Gibbs, Justin, Linux Kernel


On Mon, Mar 22, 2004 at 02:59:29PM -0700, Scott Long wrote:
> >I think that all the arguments for using DM are techinical arguments not
> >emotional ones. oh well.. you're free to write your code I'm free to not
> >use it in my kernels ;)
> 
> Ok, the technical argument I've heard in favor of the DM approach is
> that it reduces kernel bloat.  That's fair, and I certainly agree with not
> putting the kitchen sink into the kernel.  Our position on EMD is that
> it's a special case because you want to reduce the number of failure
> modes, and that it doesn't contribute in a significant way to the kernel
> size.

There are several dozen formats like DDF; should those be put in too?
And then the next step is built-in multipathing or stacking or .. or ....
And pretty soon you're back at the EVMS 1.0 situation. I see the general
kernel direction to be to move such autodetection to early userland (there's
a reason DM and not EVMS 1.0 is in the kernel; afaics even the EVMS guys now
agree that this was the right move); EMD is a step in the opposite direction.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26 19:19               ` Kevin Corry
@ 2004-03-31 17:07                 ` Randy.Dunlap
  0 siblings, 0 replies; 56+ messages in thread
From: Randy.Dunlap @ 2004-03-31 17:07 UTC (permalink / raw)
  To: Kevin Corry; +Cc: linux-kernel, lmb, jgarzik, neilb, gibbs, linux-raid

On Fri, 26 Mar 2004 13:19:28 -0600 Kevin Corry wrote:

| On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote:
| > On 2004-03-25T13:42:12,
| >
| >    Jeff Garzik <jgarzik@pobox.com> said:
| > > >and -5). And we've talked for a long time about wanting to port RAID-1
| > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't
| > > > started on any such work, or even had any significant discussions about
| > > > *how* to do it. I can't
| > >
| > > let's have that discussion :)
| >
| > Nice 2.7 material, and parts I've always wanted to work on. (Including
| > making the entire partition scanning user-space on top of DM too.)
| 
| Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think 
| we've already proved this is possible. We really only need to work on making 
| early-userspace a little easier to use.
| 
| > KS material?
| 
| Sounds good to me.

Ditto.

I didn't see much conclusion to this thread, other than Neil's
good suggestions.  (maybe on some other list that I don't read?)

I wouldn't want this or any other projects to have to wait for the
kernel summit.  Email has worked well for many years...let's
try to keep it working.  :)

--
~Randy
"You can't do anything without having to do something else first."
-- Belefant's Law

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 22:12                               ` Justin T. Gibbs
@ 2004-03-30 22:34                                 ` Jeff Garzik
  0 siblings, 0 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-30 22:34 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

Justin T. Gibbs wrote:
>>>So you are saying that this presents an unrecoverable situation?
>>
>>No, I'm saying that the data phase need not have a bunch of in-kernel
>>checks, it should be generated correctly from the source.
> 
> 
> The SCSI drivers validate the controller's data phase based on the
> expected phase presented to them from an upper layer.  I never talked
> about adding checks that make little sense or are overly expensive.  You
> seem to equate validation with huge expense.  That is just not the
> general case.
> 
> 
>>>Hmm.  I've never had someone tell me that my SCSI drivers are slow.
>>
>>This would be noticed in the CPU utilization area.  Your drivers are
>>probably a long way from being CPU-bound.
> 
> 
> I very much doubt that.  There are perhaps four or five tests in the
> I/O path where some value already in a cache line that has to be accessed
> anyway is compared against a constant.  We're talking about something
> down in the noise of any type of profiling you could perform.  As I said,
> validation makes sense where there is basically no-cost to do it.
> 
> 
>>>I don't think that your statement is true in the general case.  My
>>>belief is that validation should occur where it is cheap and efficient
>>>to do so.  More expensive checks should be pushed into diagnostic code
>>>that is disabled by default, but the code *should be there*.  In any event,
>>>for RAID meta-data, we're talking about code that is *not* in the common
>>>or time critical path of the kernel.  A few dozen lines of validation code
>>>there has almost no impact on the size of the kernel and yields huge
>>>benefits for debugging and maintaining the code.  This is even more
>>>the case in Linux, where the end user is often your test lab.
>>
>>It doesn't scale terribly well, because the checks themselves become a
>>source of bugs.
> 
> 
> So now the complaint is that validation code is somehow harder to write
> and maintain than the rest of the code?

Actually, yes.  Validation of random user input has always been a source 
of bugs (usually in edge cases), in Linux and in other operating 
systems.  It is often the area where security bugs are found.

Basically you want to avoid adding checks for conditions that don't occur
in properly written software, and make sure that the kernel always
generates correct requests.  Obviously that excludes anything on the
target side, but other than that...  in userland, a privileged user is
free to do anything they wish, including violating protocols, cooking their
disk, etc.

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 21:47                             ` Jeff Garzik
@ 2004-03-30 22:12                               ` Justin T. Gibbs
  2004-03-30 22:34                                 ` Jeff Garzik
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 22:12 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

>> So you are saying that this presents an unrecoverable situation?
> 
> No, I'm saying that the data phase need not have a bunch of in-kernel
> checks, it should be generated correctly from the source.

The SCSI drivers validate the controller's data phase based on the
expected phase presented to them from an upper layer.  I never talked
about adding checks that make little sense or are overly expensive.  You
seem to equate validation with huge expense.  That is just not the
general case.

>> Hmm.  I've never had someone tell me that my SCSI drivers are slow.
> 
> This would be noticed in the CPU utilization area.  Your drivers are
> probably a long way from being CPU-bound.

I very much doubt that.  There are perhaps four or five tests in the
I/O path where some value already in a cache line that has to be accessed
anyway is compared against a constant.  We're talking about something
down in the noise of any type of profiling you could perform.  As I said,
validation makes sense where there is basically no-cost to do it.
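
To make the kind of check I'm describing concrete, here's a minimal
sketch (hypothetical structure and field names, not code from any real
driver) of a test whose cost is one compare against state that is
already in a hot cache line:

/*
 * Illustrative only: the expected data phase is cached in the
 * per-command structure at queue time, so validating it costs one
 * compare against memory the completion path touches anyway.
 */
struct xpt_cmd {                        /* hypothetical per-command state */
        unsigned int expected_phase;    /* filled in when the I/O is queued */
        /* ... */
};

static inline int
xpt_phase_ok(const struct xpt_cmd *cmd, unsigned int bus_phase)
{
        if (bus_phase == cmd->expected_phase)
                return 1;

        /* Protocol violation: recover and diagnose rather than corrupt data. */
        return 0;
}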

>> I don't think that your statement is true in the general case.  My
>> belief is that validation should occur where it is cheap and efficient
>> to do so.  More expensive checks should be pushed into diagnostic code
>> that is disabled by default, but the code *should be there*.  In any event,
>> for RAID meta-data, we're talking about code that is *not* in the common
>> or time critical path of the kernel.  A few dozen lines of validation code
>> there has almost no impact on the size of the kernel and yields huge
>> benefits for debugging and maintaining the code.  This is even more
>> the case in Linux, where the end user is often your test lab.
> 
> It doesn't scale terribly well, because the checks themselves become a
> source of bugs.

So now the complaint is that validation code is somehow harder to write
and maintain than the rest of the code?

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 18:04                           ` Justin T. Gibbs
@ 2004-03-30 21:47                             ` Jeff Garzik
  2004-03-30 22:12                               ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-30 21:47 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

Justin T. Gibbs wrote:
>>>That's unfortunate for those using ATA.  A command submitted from userland
>>
>>Required, since one cannot know the data phase of vendor-specific commands.
> 
> 
> So you are saying that this presents an unrecoverable situation?

No, I'm saying that the data phase need not have a bunch of in-kernel 
checks, it should be generated correctly from the source.


>>Particularly, checking whether the kernel is doing something wrong
>>just wastes cycles.  That's not a scalable way to code...  if every driver
>>and Linux subsystem did that, things would be unbearably slow.
> 
> 
> Hmm.  I've never had someone tell me that my SCSI drivers are slow.

This would be noticed in the CPU utilization area.  Your drivers are 
probably a long way from being CPU-bound.


> I don't think that your statement is true in the general case.  My
> belief is that validation should occur where it is cheap and efficient
> to do so.  More expensive checks should be pushed into diagnostic code
> that is disabled by default, but the code *should be there*.  In any event,
> for RAID meta-data, we're talking about code that is *not* in the common
> or time critical path of the kernel.  A few dozen lines of validation code
> there has almost no impact on the size of the kernel and yields huge
> benefits for debugging and maintaining the code.  This is even more
> the case in Linux, where the end user is often your test lab.

It doesn't scale terribly well, because the checks themselves become a 
source of bugs.

	Jeff





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 17:35                       ` Justin T. Gibbs
  2004-03-30 17:46                         ` Jeff Garzik
@ 2004-03-30 18:11                         ` Bartlomiej Zolnierkiewicz
  1 sibling, 0 replies; 56+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2004-03-30 18:11 UTC (permalink / raw)
  To: Justin T. Gibbs, Jeff Garzik
  Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

On Tuesday 30 of March 2004 19:35, Justin T. Gibbs wrote:
> > The kernel should not be validating -trusted- userland inputs.  Root is
> > allowed to scrag the disk, violate limits, and/or crash his own machine.
> >
> > A simple example is requiring userland, when submitting ATA taskfiles via
> > an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
> > If the data phase is specified incorrectly, you kill the OS driver's ATA
> > host state machine, and the results are very unpredictable.   Since this
> > is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get
> > the required details right (just like following a spec).
>
> That's unfortunate for those using ATA.  A command submitted from userland
> to the SCSI drivers I've written that causes a protocol violation will
> be detected, result in appropriate recovery, and a nice diagnostic that
> can be used to diagnose the problem.  Part of this is because I cannot know
> if the protocol violation stems from a target defect, the input from the
> user or, for that matter, from the kernel.  The main reason is for
> robustness and ease of debugging.  In the SCSI case, there is almost no
> run-time cost, and the system will stop before data corruption occurs.  In

In the ATA case, detection of a protocol violation is not possible without
checking every possible command opcode.  Even if implemented (notice that
checking commands coming from the kernel is out of the question, for
performance reasons), this breaks for future and vendor-specific commands.

> the meta-data case we've been discussing in terms of EMD, there is no
> runtime cost, the validation has to occur somewhere anyway, and in many
> cases some validation is already required to avoid races with external
> events.  If the validation is done in the kernel, then you get the benefit
> of nice diagnostics instead of strange crashes that are difficult to debug.

Unless the code that crashes is the one doing the validation. ;-)

Bartlomiej


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 17:46                         ` Jeff Garzik
@ 2004-03-30 18:04                           ` Justin T. Gibbs
  2004-03-30 21:47                             ` Jeff Garzik
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 18:04 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

>> That's unfortunate for those using ATA.  A command submitted from userland
> 
> Required, since one cannot know the data phase of vendor-specific commands.

So you are saying that this presents an unrecoverable situation?

> Particularly, checking whether the kernel is doing something wrong
> just wastes cycles.  That's not a scalable way to code...  if every driver
> and Linux subsystem did that, things would be unbearably slow.

Hmm.  I've never had someone tell me that my SCSI drivers are slow.

I don't think that your statement is true in the general case.  My
belief is that validation should occur where it is cheap and efficient
to do so.  More expensive checks should be pushed into diagnostic code
that is disabled by default, but the code *should be there*.  In any event,
for RAID meta-data, we're talking about code that is *not* in the common
or time critical path of the kernel.  A few dozen lines of validation code
there has almost no impact on the size of the kernel and yields huge
benefits for debugging and maintaining the code.  This is even more
the case in Linux, where the end user is often your test lab.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-28  0:06                   ` Lincoln Dale
@ 2004-03-30 17:54                     ` Justin T. Gibbs
  0 siblings, 0 replies; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 17:54 UTC (permalink / raw)
  To: Lincoln Dale
  Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid

> At 03:43 AM 27/03/2004, Justin T. Gibbs wrote:
>> I posted a rather detailed, technical, analysis of what I believe would
>> be required to make this work correctly using a userland approach.  The
>> only response I've received is from Neil Brown.  Please, point out, in
>> a technical fashion, how you would address the feature set being proposed:
> 
> i'll have a go.
> 
> your position is one of "put it all in the kernel".
> Jeff, Neil, Kevin et al is one of "it can live in userspace".

Please don't misrepresent or oversimplify my statements.  What
I have said is that meta-data reading and writing should occur in
only one place.  Since, as has already been acknowledged by many,
meta-data updates are required in the kernel, that means this support
should be handled in the kernel.  Any other approach adds complexity
and size to the solution.

> to that end, i agree with the userspace approach.
> the way i personally believe that it SHOULD happen is that you tie
> your metadata format (and RAID format, if it's different from others) into DM.

Saying how you think something should happen, without any technical
argument for it, doesn't help me understand the benefits of your
approach.

...

> perhaps that means that you guys could provide enhancements to grub/lilo
> if they are insufficient for things like finding a secondary copy of
> initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things
> the "open source way" and help improve the overall tools, if the end goal
> ends up being the same: enabling YOUR system to work better?)

I don't understand your argument.  We have improved an already-existing
open source driver to provide this functionality.  Is this not the
open source way?

> then answering your other points:

Again, you have presented strategies that may or may not work, but
no technical arguments for their superiority over placing meta-data
in the kernel.

> there may be less lines of code involved in "entirely in kernel" for YOUR
> hardware -- but what about when 4 other storage vendors come out with such
> a card?

There will be less lines of code total for any vendor that decides to
add a new meta-data type.  All the vendor has to do is provide a meta-data
module.  There are no changes to the userland utilities (they know nothing
about specific meta-data formats), to the RAID transform modules, or to
the core of EMD.  If this were not the case, there would be little point
to the EMD work.

> what if someone wants to use your card in conjunction with the storage
> being multipathed or replicated automatically?
> what about when someone wants to create snapshots for backups?
> 
> all that functionality has to then go into your EMD driver.

No.  DM already works on any block device exported to the kernel.
EMD exports its devices as block devices.  Thus, all of the DM
functionality you are talking about is also available for EMD.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 17:35                       ` Justin T. Gibbs
@ 2004-03-30 17:46                         ` Jeff Garzik
  2004-03-30 18:04                           ` Justin T. Gibbs
  2004-03-30 18:11                         ` Bartlomiej Zolnierkiewicz
  1 sibling, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-30 17:46 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

Justin T. Gibbs wrote:
>>The kernel should not be validating -trusted- userland inputs.  Root is
>>allowed to scrag the disk, violate limits, and/or crash his own machine.
>>
>>A simple example is requiring userland, when submitting ATA taskfiles via
>>an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
>>If the data phase is specified incorrectly, you kill the OS driver's ATA
>>host state machine, and the results are very unpredictable.   Since this
>>is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the
>>required details right (just like following a spec).
> 
> 
> That's unfortunate for those using ATA.  A command submitted from userland

Required, since one cannot know the data phase of vendor-specific commands.


> to the SCSI drivers I've written that causes a protocol violation will
> be detected, result in appropriate recovery, and a nice diagnostic that
> can be used to diagnose the problem.  Part of this is because I cannot know
> if the protocol violation stems from a target defect, the input from the
> user or, for that matter, from the kernel.  The main reason is for robustness

Well,
* the target is not _issuing_ commands,
* any user issuing incorrect commands/cdbs is not your bug,
* and kernel code issuing incorrect commands/cdbs isn't your bug either

Particularly, checking whether the kernel is doing something wrong
just wastes cycles.  That's not a scalable way to code...  if
every driver and Linux subsystem did that, things would be unbearably slow.

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 17:15                     ` Jeff Garzik
@ 2004-03-30 17:35                       ` Justin T. Gibbs
  2004-03-30 17:46                         ` Jeff Garzik
  2004-03-30 18:11                         ` Bartlomiej Zolnierkiewicz
  0 siblings, 2 replies; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 17:35 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

> The kernel should not be validating -trusted- userland inputs.  Root is
> allowed to scrag the disk, violate limits, and/or crash his own machine.
> 
> A simple example is requiring userland, when submitting ATA taskfiles via
> an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
> If the data phase is specified incorrectly, you kill the OS driver's ATA
> host state machine, and the results are very unpredictable.   Since this
> is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the
> required details right (just like following a spec).

That's unfortunate for those using ATA.  A command submitted from userland
to the SCSI drivers I've written that causes a protocol violation will
be detected, result in appropriate recovery, and a nice diagnostic that
can be used to diagnose the problem.  Part of this is because I cannot know
if the protocol violation stems from a target defect, the input from the
user or, for that matter, from the kernel.  The main reason is for robustness
and ease of debugging.  In the SCSI case, there is almost no run-time cost, and
the system will stop before data corruption occurs.  In the meta-data case
we've been discussing in terms of EMD, there is no runtime cost, the
validation has to occur somewhere anyway, and in many cases some validation
is already required to avoid races with external events.  If the validation
is done in the kernel, then you get the benefit of nice diagnostics instead
of strange crashes that are difficult to debug.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-30 17:03                   ` Justin T. Gibbs
@ 2004-03-30 17:15                     ` Jeff Garzik
  2004-03-30 17:35                       ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-30 17:15 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

Justin T. Gibbs wrote:
> The dm-raid1 module also appears to intrinsically trust its mapping and the
> contents of its meta-data (simple magic number check).  It seems to me that 
> the kernel should validate all of its inputs regardless of whether the
> ioctls that are used to present them are only supposed to be used by a
> "trusted daemon".

The kernel should not be validating -trusted- userland inputs.  Root is 
allowed to scrag the disk, violate limits, and/or crash his own machine.

A simple example is requiring userland, when submitting ATA taskfiles 
via an ioctl, to specify the data phase (pio read, dma write, no-data, 
etc.).  If the data phase is specified incorrectly, you kill the OS 
driver's ATA host state machine, and the results are very unpredictable. 
  Since this is a trusted operation, requiring CAP_RAW_IO, it's up to 
userland to get the required details right (just like following a spec).
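
For illustration only -- the structure, names, and ioctl number below are
hypothetical, not an actual interface -- such a submission might look
roughly like this:

/* Hypothetical userland sketch of submitting an ATA taskfile with an
 * explicit data phase.  All names and the ioctl number are invented
 * for illustration. */
#include <stdint.h>
#include <sys/ioctl.h>

enum tf_data_phase {
        TF_NO_DATA,
        TF_PIO_READ,
        TF_PIO_WRITE,
        TF_DMA_READ,
        TF_DMA_WRITE,
};

struct tf_request {
        uint8_t  command;               /* ATA command opcode */
        uint8_t  feature;
        uint8_t  nsect;
        uint8_t  lbal, lbam, lbah;
        uint8_t  device;
        int      data_phase;            /* must match what the opcode does */
        void    *buf;
        uint32_t buf_len;
};

#define TF_SUBMIT _IOWR('T', 0x01, struct tf_request)   /* hypothetical */

/* If data_phase does not match the command's real data phase, the
 * driver's host state machine is corrupted; that is why the operation
 * is restricted to trusted, raw-I/O capable callers who are expected
 * to get it right. */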


> I honestly don't care if the final solution is EMD, DM, or XYZ so long
> as that solution is correct, supportable, and covers all of the scenarios
> required for robust RAID support.  That is the crux of the argument, not
> "please love my code".

hehe.  I think we all agree here...

	Jeff





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-27 15:39                 ` Kevin Corry
@ 2004-03-30 17:03                   ` Justin T. Gibbs
  2004-03-30 17:15                     ` Jeff Garzik
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 17:03 UTC (permalink / raw)
  To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel

> Well, there's certainly no guarantee that the "industry" will get it right. In
> this case, it seems that they didn't. But even given that we don't have ideal
> metadata formats, it's still possible to do discovery and a number of other
> management tasks from user-space.

I have never proposed that management activities be performed solely
within the kernel.  My position has been that meta-data parsing and
updating has to be core-resident for any solution that handles advanced
RAID functionality and that splitting out any portion of those roles
to userland just complicates the solution.

>> it is perfectly suited to some types of logical volume management
>> applications.  But that is as far as it goes.  It does not have any
>> support for doing "sync/resync/scrub" type operations or any generic
>> support for doing anything with meta-data.
> 
> The core DM driver would not and should not be handling these operations.
> These are handled in modules specific to one type of mapping. There's no
> need for the DM core to know anything about any metadata. If one particular
> module (e.g. dm-mirror) needs to support one or more metadata formats, it's
> free to do so.

That's unfortunate, considering that the meta-data formats we are talking
about can already express RAID 1(E), 4, 5, and 6.  There has
to be a common meta-data framework in order to avoid this duplication.

>> In all of the examples you 
>> have presented so far, you have not explained how this part of the equation
>> is handled.

...

> Before the new disk is added to the raid1, user-space is responsible for
> writing an initial state to that disk, effectively marking it as completely
> dirty and unsynced. When the new table is loaded, part of the "resume" is for
> the module to read any metadata and do any initial setup that's necessary. In
> this particular example, it means the new disk would start with all of its
> "regions" marked "dirty", and all the regions would need to be synced from
> corresponding "clean" regions on another disk in the set.
> 
> If the previously-existing disks were part-way through a sync when the table
> was switched, their metadata would indicate where the current "sync mark" was
> located. The module could then continue the sync from where it left off,
> including the new disk that was just added. When the sync completed, it might
> have to scan back to the beginning of the new disk to see if it had any remaining
> dirty regions that needed to be synced before that disk was completely clean.
> 
> And of course the I/O-mapping path just has to be smart enough to know which
> regions are dirty and avoid sending live I/O to those.
> 
> (And I'm sure Joe or Alasdair could provide a better in-depth explanation of 
> the current dm-mirror module than I'm trying to. This is obviously a very 
> high-level overview.)

So all of this complexity is still in the kernel.  The only difference is
that the meta-data can *also* be manipulated from userspace.  In order
for this to be safe, the mirror must be suspended (meta-data becomes stable),
the meta-data must be re-read by the userland program, the meta-data must be
updated, the mapping must be updated, the mirror must be resumed, and the
mirror must revalidate all meta-data.  How do you avoid deadlock in this
process?  Does the userland daemon, which must be core resident in this case,
pre-allocate buffers for reading and writing the meta-data?

The dm-raid1 module also appears to intrinsically trust its mapping and the
contents of its meta-data (simple magic number check).  It seems to me that 
the kernel should validate all of its inputs regardless of whether the
ioctls that are used to present them are only supposed to be used by a
"trusted daemon".

All of this adds up to more complexity.  Your argument seems to be that,
since DM avoids this complexity in its core, this is a better solution,
but I am more interested in the least complex, most easily maintained
total solution.

>> The simplicity of DM is part of why it is compelling.  My belief is that
>> merging RAID into DM will compromise this simplicity and divert DM from
>> what it was designed to do - provide LVM transforms.
> 
> I disagree. The simplicity of the core DM driver really isn't at stake here.
> We're only talking about adding a few relatively complex target modules. And
> with DM you get the benefit of a very simple user/kernel interface.

The simplicity of the user/kernel interface is not what is at stake here.
With EMD, you can perform all of the same operations talked about above,
in just as few ioctl calls.  The only difference is that the kernel, and
only the kernel, reads and modifies the metadata.  There are actually
fewer steps for the userland application than before.  This becomes even
more evident as more meta-data modules are added.

> I don't honestly expect to suddenly change your mind on all these issues.
> A lot of work has obviously gone into EMD, and I definitely know how hard it
> can be when the community isn't greeting your suggestions with open arms.

I honestly don't care if the final solution is EMD, DM, or XYZ so long
as that solution is correct, supportable, and covers all of the scenarios
required for robust RAID support.  That is the crux of the argument, not
"please love my code".

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26 17:43                 ` Justin T. Gibbs
  2004-03-28  0:06                   ` Lincoln Dale
@ 2004-03-28  0:30                   ` Jeff Garzik
  1 sibling, 0 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-28  0:30 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid

Justin T. Gibbs wrote:
>  o Rebuilds

	> 90% kernel, AFAICS, otherwise you have races with
	requests that the driver is actively satisfying


>  o Auto-array enumeration

	userspace


>  o Meta-data updates for "safe mode"

	unsure of the definition of safe mode


>  o Array creation/deletion


	of entire arrays?  can mostly be done in userspace, but deletion
	also needs to update controller-wide metadata, which might be
	stored on active arrays.


>  o "Hot member addition"

	userspace prepares, kernel completes

[moved this down in your list]
>  o Meta-data updates for topology changes (failed members, spare activation)

[warning: this is a tangent from the userspace sub-thread/topic]

	the kernel, of course, must manage topology, otherwise things
	Don't Get Done, and requests don't go where they should.  :)

	Part of the value of device mapper is that it provides container
	objects for multi-disk groups, and a common method of messing
	around with those container objects.  You clearly recognized the
	same need in emd... but I don't think we want two different
	pieces of code doing the same basic thing.


	I do think that metadata management needs to be fairly cleanly
	separated (I like what emd did there) such that a user needs
	three in-kernel pieces:
	* device mapper
	* generic raid1 engine
	* personality module

	"personality" would be where the specifics of the metadata
	management lived, and it would be responsible for handling the
	specifics of non-hot-path events that nonetheless still need
	to be in the kernel.
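
	To sketch the idea (the names below are invented for
	illustration, not a proposed API), the generic engine could
	call into a personality through hooks along these lines:

#include <linux/types.h>

struct block_device;                    /* provided by the block layer */

/* Hypothetical personality hooks: the metadata-specific, non-hot-path
 * work the generic RAID engine would delegate. */
struct raid_personality {
        const char *name;               /* e.g. "ddf", "adaptec-asr" */

        /* Recognize and load metadata from one member device. */
        int (*load_super)(struct block_device *bdev, void *md);

        /* Record a topology change (failed member, spare activation). */
        int (*update_topology)(void *md, int member, int new_state);

        /* Persist a rebuild/sync checkpoint. */
        int (*checkpoint_resync)(void *md, sector_t synced_up_to);
};

int register_raid_personality(struct raid_personality *p);     /* hypothetical */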




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26 17:43                 ` Justin T. Gibbs
@ 2004-03-28  0:06                   ` Lincoln Dale
  2004-03-30 17:54                     ` Justin T. Gibbs
  2004-03-28  0:30                   ` Jeff Garzik
  1 sibling, 1 reply; 56+ messages in thread
From: Lincoln Dale @ 2004-03-28  0:06 UTC (permalink / raw)
  To: Justin T. Gibbs
  Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid

At 03:43 AM 27/03/2004, Justin T. Gibbs wrote:
>I posted a rather detailed, technical, analysis of what I believe would
>be required to make this work correctly using a userland approach.  The
>only response I've received is from Neil Brown.  Please, point out, in
>a technical fashion, how you would address the feature set being proposed:

i'll have a go.

your position is one of "put it all in the kernel".
Jeff, Neil, Kevin et al is one of "it can live in userspace".

to that end, i agree with the userspace approach.
the way i personally believe that it SHOULD happen is that you tie your 
metadata format (and RAID format, if it's different from others) into DM.

you boot up using an initrd where you can start some form of userspace 
management daemon from initrd.
you can have your binary (userspace) tools started from initrd which can 
populate the tables for all disks/filesystems, including pivoting to a new 
root filesystem if need-be.

the only thing your BIOS/int13h redirection needs to do is be able to 
provide sufficient information to be capable of loading the kernel and the 
initial ramdisk.
perhaps that means that you guys could provide enhancements to grub/lilo if 
they are insufficient for things like finding a secondary copy of 
initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things 
the "open source way" and help improve the overall tools, if the end goal 
ends up being the same: enabling YOUR system to work better?)

moving forward, perhaps initrd will be deprecated in favour of initramfs - 
but until then, there isn't any downside to this approach that i can see.

with all this in mind, and the basic premise being that, as a minimum, the
kernel has booted and initrd is working, on to answering your other points:

>  o Rebuilds

userspace is running.
rebuilds are simply a process of your userspace tools recognising that
there are disk groups in an inconsistent state, not bringing them online,
but rather doing whatever is necessary to rebuild them.
nothing says that you cannot have a KERNEL-space 'helper' to help do the
rebuild.

>  o Auto-array enumeration

your userspace tool can receive notification (via udev/hotplug) when new 
disks/devices appear.  from there, your userspace tool can read whatever 
metadata exists on the disk, and use that to enumerate whatever block 
devices exist.

perhaps DM needs some hooks to be able to do this - but i believe that the 
DM v4 ioctls cover this already.

>  o Meta-data updates for topology changes (failed members, spare activation)

a failed member may be as a result of a disk being pulled out.  for such an 
event, udev/hotplug should tell your userspace daemon.
a failed member may be as a result of lots of I/O errors.  perhaps there is 
work needed in the linux block layer to indicate some form of hotplug event 
such as 'excessive errors', perhaps it's something needed in the DM
layer.  in either case, it isn't out of the question that userspace can be 
notified.

for a "spare activation", once again, that can be done entirely from userspace.

>  o Meta-data updates for "safe mode"

seems implementation specific to me.

>  o Array creation/deletion

the short answer here is "how does one create or remove DM/LVM/MD 
partitions today?"
it certainly isn't in the kernel ...

>  o "Hot member addition"

this should also be possible today.
i haven't looked too closely at whether there are sufficient interfaces for 
quiescence of I/O or not - but once again, if not, why not implement 
something that can be used for all?

>Only then can a true comparative analysis of which solution is "less
>complex", "more maintainable", and "smaller" be performed.

there may be less lines of code involved in "entirely in kernel" for YOUR 
hardware --
but what about when 4 other storage vendors come out with such a card?
what if someone wants to use your card in conjunction with the storage 
being multipathed or replicated automatically?
what about when someone wants to create snapshots for backups?

all that functionality has to then go into your EMD driver.

Adaptec may decide all that is too hard -- at which point, your product may 
become obsolete as the storage paradigms have moved beyond what your EMD 
driver is capable of.
if you could tie it into DM -- which i believe to be the de facto path
forward for lots of this cool functionality -- you gain this kind of 
functionality gratis -- or at least with minimal effort to integrate.

better yet, Linux as a whole benefits from your involvement -- your 
time/effort isn't put into something specific to your hardware -- but 
rather your time/effort is put into something that can be used by all.

this conversation really sounds like the same one you had with James about
the SCSI mid-layer and why you just have to bypass items there and do your
own proprietary things.  in summary, i don't believe you should be
focussing on a short-term view of "but it's more lines of code", but rather
a more big-picture view of "overall, there will be LESS lines of code" and
"it will fit better into the overall device-mapper/block-remapper
functionality" within the kernel.


cheers,

lincoln.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26 20:45               ` Justin T. Gibbs
@ 2004-03-27 15:39                 ` Kevin Corry
  2004-03-30 17:03                   ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Kevin Corry @ 2004-03-27 15:39 UTC (permalink / raw)
  To: linux-kernel, Justin T. Gibbs
  Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel

On Friday 26 March 2004 2:45 pm, Justin T. Gibbs wrote:
> We don't have control over the meta-data formats being used by the
> industry. Coming up with a solution that only works for "Linux Engineered
> Meta-data formats" removes any possibility of supporting things like DDF,
> Adaptec ASR, and a host of other meta-data formats that can be plugged into
> things like EMD.  In the two cases we are supporting today with EMD, the
> records required for doing discovery reside in the same sectors as those
> that need to be updated at runtime from some "in-core" context.

Well, there's certainly no guarantee that the "industry" will get it right. In
this case, it seems that they didn't. But even given that we don't have ideal
metadata formats, it's still possible to do discovery and a number of other
management tasks from user-space.

> > The main point I'm trying to get across here is that DM provides a simple
> > yet extensible kernel framework for a variety of storage management
> > tasks, including a lot more than just RAID. I think it would be a huge
> > benefit for the RAID drivers to make use of this framework to provide
> > functionality beyond what is currently available.
>
> DM is a transform layer that has the ability to pause I/O while that
> transform is updated from userland.  That's all it provides.

I think the DM developers would disagree with you on this point.

> As such, 
> it is perfectly suited to some types of logical volume management
> applications.  But that is as far as it goes.  It does not have any
> support for doing "sync/resync/scrub" type operations or any generic
> support for doing anything with meta-data.

The core DM driver would not and should not be handling these operations.
These are handled in modules specific to one type of mapping. There's no
need for the DM core to know anything about any metadata. If one particular
module (e.g. dm-mirror) needs to support one or more metadata formats, it's
free to do so.

On the other hand, DM *does* provide services that make "sync/resync" a great
deal simpler for such a module. It provides simple services for performing
synchronous or asynchronous I/O to pages or vm areas. It provides a service
for performing copies from one block-device area to another. The dm-mirror
module uses these for this very purpose. If we need additional "libraries"
for common RAID tasks (e.g. parity calculations) we can certainly add them.
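
As a rough illustration (the helper below is a stand-in with invented
names; the real kcopyd/dm-io interfaces differ in their details), a
mirror resync built on such a copy service reduces to something like:

#include <linux/types.h>

struct block_device;

/* Illustrative only: a hypothetical wrapper standing in for DM's
 * block-copy service. */
struct disk_region {
        struct block_device *bdev;
        sector_t start;                 /* in sectors */
        sector_t count;
};

typedef void (*copy_done_fn)(int err, void *context);

/* Assumed helper: asynchronously copy src to dst, then call done(). */
int dm_region_copy(struct disk_region *src, struct disk_region *dst,
                   copy_done_fn done, void *context);

/* A resync then walks the dirty regions: quiesce writes to a region,
 * call dm_region_copy() from a clean member to the new one, and mark
 * the region clean in the completion callback before moving on. */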

> In all of the examples you 
> have presented so far, you have not explained how this part of the equation
> is handled.  Sure, adding a member to a RAID1 is trivial.  Just pause the
> I/O, update the transform, and let it go.  Unfortunately, that new member
> is not in sync with the rest.  The transform must be aware of this and only
> trust the member below the sync mark.  How is this information communicated
> to the transform?  Who updates the sync mark?  Who copies the data to the
> new member while guaranteeing that an in-flight write does not occur to the
> area being synced?

Before the new disk is added to the raid1, user-space is responsible for
writing an initial state to that disk, effectively marking it as completely
dirty and unsynced. When the new table is loaded, part of the "resume" is for
the module to read any metadata and do any initial setup that's necessary. In
this particular example, it means the new disk would start with all of its
"regions" marked "dirty", and all the regions would need to be synced from
corresponding "clean" regions on another disk in the set.

If the previously-existing disks were part-way through a sync when the table
was switched, their metadata would indicate where the current "sync mark" was
located. The module could then continue the sync from where it left off,
including the new disk that was just added. When the sync completed, it might
have to scan back to the beginning of the new disk to see if it had any remaining
dirty regions that needed to be synced before that disk was completely clean.

And of course the I/O-mapping path just has to be smart enough to know which
regions are dirty and avoid sending live I/O to those.

(And I'm sure Joe or Alasdair could provide a better in-depth explanation of 
the current dm-mirror module than I'm trying to. This is obviously a very 
high-level overview.)
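
A very rough sketch of the region bookkeeping that description implies
(simplified, with invented names; this is not the actual dm-mirror code):

#include <linux/types.h>

/* Simplified per-region state for the resync scheme described above. */
enum region_state { REG_CLEAN, REG_DIRTY, REG_RECOVERING };

struct mirror_state {                   /* hypothetical */
        unsigned int   region_shift;    /* log2(sectors per region) */
        unsigned long  nr_regions;
        unsigned char *region;          /* one region_state per region */
};

static unsigned long to_region(const struct mirror_state *ms, sector_t sector)
{
        return (unsigned long)(sector >> ms->region_shift);
}

/* Map path: live I/O to the new member is only allowed on clean
 * regions; dirty regions must be read from (and synced from) a clean
 * member until they have gone dirty -> recovering -> clean. */
static int region_is_clean(const struct mirror_state *ms, sector_t sector)
{
        return ms->region[to_region(ms, sector)] == REG_CLEAN;
}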

This process is somewhat similar to how dm-snapshot works. If it reads an 
empty header structure, it assumes it's a new snapshot, and starts with an 
empty hash table. If it reads a previously existing header, it continues to 
read the on-disk COW tables and constructs the necessary in-memory hash-table 
to represent that initial state.

> If you intend to add all of this to DM, then it is no 
> longer any "simpler" or more extensible than EMD.

Sure it is. Because very little (if any) of this needs to affect the core DM
driver, that core remains as simple and extensible as it currently is. The
extra complexity only really affects the new modules that would handle RAID.

> Don't take my arguments the wrong way.  I believe that DM is useful
> for what it was designed for: LVM.  It does not, however, provide the
> machinery required for it to replace a generic RAID stack.  Could
> you merge a RAID stack into DM.  Sure.  Its only software.  But for
> it to be robust, the same types of operations MD/EMD perform in kernel
> space will have to be done there too.
>
> The simplicity of DM is part of why it is compelling.  My belief is that
> merging RAID into DM will compromise this simplicity and divert DM from
> what it was designed to do - provide LVM transforms.

I disagree. The simplicity of the core DM driver really isn't at stake here.
We're only talking about adding a few relatively complex target modules. And
with DM you get the benefit of a very simple user/kernel interface.

> As for RAID discovery, this is the trivial portion of RAID.  For an extra
> 10% or less of code in a meta-data module, you get RAID discovery.  You
> also get a single point of access to the meta-data, avoid duplicated code,
> and complex kernel/user interfaces.  There seems to be a consistent feeling
> that it is worth compromising all of these benefits just to push this 10%
> of the meta-data handling code out of the kernel (and inflate it by 5 or
> 6 X duplicating code already in the kernel).  Where are the benefits of
> this userland approach?

I've got to admit, this whole discussion is very ironic. Two years ago I
was exactly where you are today, pushing for in-kernel discovery, a variety of
metadata modules, internal opaque device stacking, etc., etc. I can only
imagine that hch is laughing his ass off now that I'm the one arguing for
moving all this stuff to user-space.

I don't honestly expect to suddenly change your mind on all these issues.
A lot of work has obviously gone into EMD, and I definitely know how hard it
can be when the community isn't greeting your suggestions with open arms. And
I'm certainly not saying the EMD method isn't a potentially viable approach.
But it doesn't seem to be the approach the community is looking for. We faced
the same resistance two years ago. It took months of arguing with the 
community and arguing amongst ourselves before we finally decided to move 
EVMS to user-space and use MD and DM. It was a decision that meant 
essentially throwing away an enormous amount of work from several people. It 
was an incredibly hard choice, but I really believe now that it was the right
decision. It was the direction the community wanted to move in, and the only
way for our project to truly survive was to move with them.

So feel free to continue to develop and promote EMD. I'm not trying to stop
you and I don't mind having competition for finding the best way to do RAID
in Linux. But I can tell you from experience that EMD is going to face a good
bit of opposition based on its current design and you might want to take that
into consideration.

I am interested in discussing if and how RAID could be supported under
Device-Mapper (or some other "merging" of these two drivers). Jeff and Lars
have shown some interest, and I certainly hope we can convince Neil and Joe
that this is a good direction. Maybe it can be done and maybe it can't. I
personally think it can be, and I'd at least like to have that discussion
and find out.

-- 
Kevin Corry
kevcorry@us.ibm.com
http://evms.sourceforge.net/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26 19:15             ` Kevin Corry
@ 2004-03-26 20:45               ` Justin T. Gibbs
  2004-03-27 15:39                 ` Kevin Corry
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-26 20:45 UTC (permalink / raw)
  To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid

>> There is a certain amount of metadata that -must- be updated at runtime,
>> as you recognize.  Over and above what MD already cares about, DDF and
>> its cousins introduce more items along those lines:  event logs, bad
>> sector logs, controller-level metadata...  these are some of the areas I
>> think Justin/Scott are concerned about.
> 
> I'm sure these things could be accommodated within DM. Nothing in DM prevents 
> having some sort of in-kernel metadata knowledge. In fact, other DM modules 
> already do - dm-snapshot and the above mentioned dm-mirror both need to do 
> some amount of in-kernel status updating. But I see this as completely 
> separate from in-kernel device discovery (which we seem to agree is the wrong 
> direction). And IMO, well designed metadata will make this "split" very 
> obvious, so it's clear which parts of the metadata the kernel can use for 
> status, and which parts are purely for identification (which the kernel thus 
> ought to be able to ignore).

We don't have control over the meta-data formats being used by the industry.
Coming up with a solution that only works for "Linux Engineered Meta-data
formats" removes any possibility of supporting things like DDF, Adaptec
ASR, and a host of other meta-data formats that can be plugged into things
like EMD.  In the two cases we are supporting today with EMD, the records
required for doing discovery reside in the same sectors as those that need
to be updated at runtime from some "in-core" context.

> The main point I'm trying to get across here is that DM provides a simple yet 
> extensible kernel framework for a variety of storage management tasks, 
> including a lot more than just RAID. I think it would be a huge benefit for 
> the RAID drivers to make use of this framework to provide functionality 
> beyond what is currently available.

DM is a transform layer that has the ability to pause I/O while that
transform is updated from userland.  That's all it provides.  As such,
it is perfectly suited to some types of logical volume management
applications.  But that is as far as it goes.  It does not have any
support for doing "sync/resync/scrub" type operations or any generic
support for doing anything with meta-data.  In all of the examples you
have presented so far, you have not explained how this part of the equation
is handled.  Sure, adding a member to a RAID1 is trivial.  Just pause the
I/O, update the transform, and let it go.  Unfortunately, that new member
is not in sync with the rest.  The transform must be aware of this and only
trust the member below the sync mark.  How is this information communicated
to the transform?  Who updates the sync mark?  Who copies the data to the
new member while guaranteeing that an in-flight write does not occur to the
area being synced?  If you intend to add all of this to DM, then it is no
longer any "simpler" or more extensible than EMD.

Don't take my arguments the wrong way.  I believe that DM is useful
for what it was designed for: LVM.  It does not, however, provide the
machinery required for it to replace a generic RAID stack.  Could
you merge a RAID stack into DM.  Sure.  Its only software.  But for
it to be robust, the same types of operations MD/EMD perform in kernel
space will have to be done there too.

The simplicity of DM is part of why it is compelling.  My belief is that
merging RAID into DM will compromise this simplicity and divert DM from
what it was designed to do - provide LVM transforms.

As for RAID discovery, this is the trivial portion of RAID.  For an extra
10% or less of code in a meta-data module, you get RAID discovery.  You
also get a single point of access to the meta-data, avoid duplicated code,
and complex kernel/user interfaces.  There seems to be a consistent feeling
that it is worth compromising all of these benefits just to push this 10%
of the meta-data handling code out of the kernel (and inflate it by 5 or
6 X duplicating code already in the kernel).  Where are the benefits of
this userland approach?

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 22:04             ` Lars Marowsky-Bree
@ 2004-03-26 19:19               ` Kevin Corry
  2004-03-31 17:07                 ` Randy.Dunlap
  0 siblings, 1 reply; 56+ messages in thread
From: Kevin Corry @ 2004-03-26 19:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lars Marowsky-Bree, Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid

On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote:
> On 2004-03-25T13:42:12,
>
>    Jeff Garzik <jgarzik@pobox.com> said:
> > >and -5). And we've talked for a long time about wanting to port RAID-1
> > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't
> > > started on any such work, or even had any significant discussions about
> > > *how* to do it. I can't
> >
> > let's have that discussion :)
>
> Nice 2.7 material, and parts I've always wanted to work on. (Including
> making the entire partition scanning user-space on top of DM too.)

Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think 
we've already proved this is possible. We really only need to work on making 
early-userspace a little easier to use.

> KS material?

Sounds good to me.

-- 
Kevin Corry
kevcorry@us.ibm.com
http://evms.sourceforge.net/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:42           ` Jeff Garzik
                               ` (2 preceding siblings ...)
  2004-03-25 23:35             ` Justin T. Gibbs
@ 2004-03-26 19:15             ` Kevin Corry
  2004-03-26 20:45               ` Justin T. Gibbs
  3 siblings, 1 reply; 56+ messages in thread
From: Kevin Corry @ 2004-03-26 19:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid

On Thursday 25 March 2004 12:42 pm, Jeff Garzik wrote:
> > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at
> > some point in the future, primarily for some of the reasons I mentioned
> > above. Obviously linear.c and raid0.c don't really need to be ported. DM
> > provides equivalent functionality, the discovery/activation can be driven
> > from user-space, and no in-kernel status updating is necessary (unlike
> > RAID-1 and -5). And we've talked for a long time about wanting to port
> > RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we
> > haven't started on any such work, or even had any significant discussions
> > about *how* to do it. I can't
>
> let's have that discussion :)

Great! Where do we begin? :)

> I'd like to focus on the "additional requirements" you mention, as I
> think that is a key area for consideration.
>
> There is a certain amount of metadata that -must- be updated at runtime,
> as you recognize.  Over and above what MD already cares about, DDF and
> its cousins introduce more items along those lines:  event logs, bad
> sector logs, controller-level metadata...  these are some of the areas I
> think Justin/Scott are concerned about.

I'm sure these things could be accommodated within DM. Nothing in DM prevents
having some sort of in-kernel metadata knowledge. In fact, other DM modules 
already do - dm-snapshot and the above mentioned dm-mirror both need to do 
some amount of in-kernel status updating. But I see this as completely 
separate from in-kernel device discovery (which we seem to agree is the wrong 
direction). And IMO, well designed metadata will make this "split" very 
obvious, so it's clear which parts of the metadata the kernel can use for 
status, and which parts are purely for identification (which the kernel thus 
ought to be able to ignore).

The main point I'm trying to get across here is that DM provides a simple yet 
extensible kernel framework for a variety of storage management tasks, 
including a lot more than just RAID. I think it would be a huge benefit for 
the RAID drivers to make use of this framework to provide functionality 
beyond what is currently available.

> My take on things...  the configuration of RAID arrays got a lot more
> complex with DDF and "host RAID" in general.  Association of RAID arrays
> based on specific hardware controllers.  Silently building RAID0+1
> stacked arrays out of non-RAID block devices the kernel presents.

By this I assume you mean RAID devices that don't contain any type of on-disk 
metadata (e.g. MD superblocks). I don't see this as a huge hurdle. As long as 
the device drivers (SCSI, IDE, etc.) export the necessary identification info
through sysfs, user-space tools can contain the policies necessary to allow 
them to detect which disks belong together in a RAID device, and then tell 
the kernel to activate said RAID device. This sounds a lot like how 
Christophe Varoqui has been doing things in his new multipath tools.
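
As a minimal user-space sketch (assuming the usual 2.6 sysfs layout, in
which SCSI-class disks expose vendor/model attributes under
/sys/block/<dev>/device/), a discovery tool can walk the block devices
like this:

/* Minimal sketch: enumerate block devices via sysfs and print the
 * identification attributes a discovery tool might key on. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

static void read_attr(const char *disk, const char *attr,
                      char *buf, size_t len)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/device/%s", disk, attr);
        buf[0] = '\0';
        f = fopen(path, "r");
        if (f) {
                if (fgets(buf, (int)len, f))
                        buf[strcspn(buf, "\n")] = '\0';
                fclose(f);
        }
}

int main(void)
{
        DIR *d = opendir("/sys/block");
        struct dirent *de;
        char vendor[64], model[64];

        while (d && (de = readdir(d)) != NULL) {
                if (de->d_name[0] == '.')
                        continue;
                read_attr(de->d_name, "vendor", vendor, sizeof(vendor));
                read_attr(de->d_name, "model", model, sizeof(model));
                printf("%s: %s %s\n", de->d_name, vendor, model);
        }
        if (d)
                closedir(d);
        return 0;
}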

> Failing over when one of the drives the kernel presents does not respond.
>
> All that just screams "do it in userland".
>
> OTOH, once the devices are up and running, kernel needs update some of
> that configuration itself.  Hot spare lists are an easy example, but any
> time the state of the overall RAID array changes, some host RAID
> formats, more closely tied to hardware than MD, may require
> configuration metadata changes when some hardware condition(s) change.

Certainly. Of course, I see things like adding and removing hot-spares and 
removing stale/faulty disks as something that can be driven from user-space. 
For example, for adding a new hot-spare, with DM it's as simple as loading a 
new mapping that contains the new disk, then telling DM to switch the device 
mapping (which implies a suspend/resume of I/O). And if necessary, such a 
user-space tool can be activated by hotplug events triggered by the insertion 
of a new disk into the system, making the process effectively transparent to 
the user.
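
In terms of the existing user/kernel interface, that amounts to roughly
the following (sketched against libdevmapper's task API; error handling
is omitted, and the mirror parameter string is a placeholder rather than
a literal table line):

#include <stdint.h>
#include <libdevmapper.h>

/* Sketch: stage a new table that includes the added disk, then resume,
 * which quiesces I/O, swaps in the staged table, and resumes I/O. */
static int reload_with_new_member(const char *dm_name,
                                  uint64_t start, uint64_t len,
                                  const char *mirror_params)
{
        struct dm_task *dmt;

        /* Stage the new table; the old mapping keeps running for now. */
        dmt = dm_task_create(DM_DEVICE_RELOAD);
        dm_task_set_name(dmt, dm_name);
        dm_task_add_target(dmt, start, len, "mirror", mirror_params);
        dm_task_run(dmt);
        dm_task_destroy(dmt);

        /* Switch over to the staged table. */
        dmt = dm_task_create(DM_DEVICE_RESUME);
        dm_task_set_name(dmt, dm_name);
        dm_task_run(dmt);
        dm_task_destroy(dmt);

        return 0;
}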

-- 
Kevin Corry
kevcorry@us.ibm.com
http://evms.sourceforge.net/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26  0:13               ` Jeff Garzik
@ 2004-03-26 17:43                 ` Justin T. Gibbs
  2004-03-28  0:06                   ` Lincoln Dale
  2004-03-28  0:30                   ` Jeff Garzik
  0 siblings, 2 replies; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-26 17:43 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid

>>> I respectfully disagree with the EMD folks that a userland approach is
>>> impossible, given all the failure scenarios.
>> 
>> 
>> I've never said that it was impossible, just unwise.  I believe
>> that a userland approach offers no benefit over allowing the kernel
>> to perform all meta-data operations.  The end result of such an
>> approach (given feature and robustness parity with the EMD solution)
>> is a larger resident size, code duplication, and more complicated
>> configuration/management interfaces.
> 
> There is some code duplication, yes.  But the right userspace solution
> does not have a larger RSS, and has _less_ complicated management
> interfaces.
>
> A key benefit of "do it in userland" is a clear gain in flexibility,
> simplicity, and debuggability (if that's a word).

This is just as much hand waving as, 'All that just screams "do it in
userland".' <sigh>

I posted a rather detailed technical analysis of what I believe would
be required to make this work correctly using a userland approach.  The
only response I've received is from Neil Brown.  Please, point out, in
a technical fashion, how you would address the feature set being proposed:

 o Rebuilds
 o Auto-array enumeration
 o Meta-data updates for topology changes (failed members, spare activation)
 o Meta-data updates for "safe mode"
 o Array creation/deletion
 o "Hot member addition"

Only then can a true comparative analysis of which solution is "less
complex", "more maintainable", and "smaller" be performed.

> But it's hard.  It requires some deep thinking.  It's a whole lot easier
> to do everything in the kernel -- but that doesn't offer you the
> protections of userland, particularly separate address spaces from the
> kernel, and having to try harder to crash the kernel.  :)

A crash in any component of a RAID solution that prevents automatic
failover and rebuilds without customer intervention is unacceptable.
Whether it crashes your kernel or not is really not that important other
than the customer will probably notice that their data is no longer
protected *sooner* if the system crashes.  In other words, the solution
must be *correct* regardless of where it resides.  Saying that doing
a portion of it in userland allows it to safely be buggier seems a very
strange argument.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26  0:10                   ` Justin T. Gibbs
@ 2004-03-26  0:14                     ` Jeff Garzik
  0 siblings, 0 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-26  0:14 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid

Justin T. Gibbs wrote:
>>>None of the solutions being talked about perform "failing over" in
>>>userland.  The RAID transforms which perform this operation are kernel
>>>resident in DM, MD, and EMD.  Perhaps you are talking about spare
>>>activation and rebuild?
>>
>>This is precisely why I sent the second email, and made the qualification
>>I did :)
>>
>>For a "do it in userland" solution, an initrd or initramfs piece examines
>>the system configuration, and assembles physical disks into RAID arrays
>>based on the information it finds.  I was mainly implying that an initrd
>>solution would have to provide some primitive failover initially, before
>>the kernel is bootstrapped...  much like a bootloader that supports booting
>>off a RAID1 array would need to do.
> 
> 
> "Failover" (i.e. redirecting a read to a viable member) will not occur
> via userland at all.  The initrd solution just has to present all available
> members to the kernel interface performing the RAID transform.  There
> is no need for "special failover handling" during bootstrap in either
> case.

hmmm, yeah, agreed.

	Jeff





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 23:35             ` Justin T. Gibbs
@ 2004-03-26  0:13               ` Jeff Garzik
  2004-03-26 17:43                 ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-26  0:13 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid

Justin T. Gibbs wrote:
>>I respectfully disagree with the EMD folks that a userland approach is
>>impossible, given all the failure scenarios.
> 
> 
> I've never said that it was impossible, just unwise.  I believe
> that a userland approach offers no benefit over allowing the kernel
> to perform all meta-data operations.  The end result of such an
> approach (given feature and robustness parity with the EMD solution)
> is a larger resident size, code duplication, and more complicated
> configuration/management interfaces.

There is some code duplication, yes.  But the right userspace solution 
does not have a larger RSS, and has _less_ complicated management 
interfaces.  A key benefit of "do it in userland" is a clear gain in 
flexibility, simplicity, and debuggability (if that's a word).

But it's hard.  It requires some deep thinking.  It's a whole lot easier 
to do everything in the kernel -- but that doesn't offer you the 
protections of userland, particularly separate address spaces from the 
kernel, and having to try harder to crash the kernel.  :)

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-26  0:01                 ` Jeff Garzik
@ 2004-03-26  0:10                   ` Justin T. Gibbs
  2004-03-26  0:14                     ` Jeff Garzik
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-26  0:10 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid

>> None of the solutions being talked about perform "failing over" in
>> userland.  The RAID transforms which perform this operation are kernel
>> resident in DM, MD, and EMD.  Perhaps you are talking about spare
>> activation and rebuild?
> 
> This is precisely why I sent the second email, and made the qualification
> I did :)
> 
> For a "do it in userland" solution, an initrd or initramfs piece examines
> the system configuration, and assembles physical disks into RAID arrays
> based on the information it finds.  I was mainly implying that an initrd
> solution would have to provide some primitive failover initially, before
> the kernel is bootstrapped...  much like a bootloader that supports booting
> off a RAID1 array would need to do.

"Failover" (i.e. redirecting a read to a viable member) will not occur
via userland at all.  The initrd solution just has to present all available
members to the kernel interface performing the RAID transform.  There
is no need for "special failover handling" during bootstrap in either
case.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 23:44             ` Lars Marowsky-Bree
@ 2004-03-26  0:03               ` Justin T. Gibbs
  0 siblings, 0 replies; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-26  0:03 UTC (permalink / raw)
  To: Lars Marowsky-Bree, Kevin Corry, linux-kernel
  Cc: Jeff Garzik, Neil Brown, linux-raid

> Uhm. DM sort of does (at least where the morphing amounts to resyncing a
> part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc).
> Freeze, load new mapping, continue.

The point is that these trivial "morphings" can be achieved with limited
effort regardless of whether you do it via EMD or DM.  Implementing this
in EMD could be achieved with perhaps 8 hours' work with no significant
increase in code size or complexity.  This is part of why I find them
"uninteresting".  If we really want to talk about generic morphing,
I think you'll find that DM is no better suited to this task than MD or
its derivatives.

> I agree that more complex morphings (RAID1->RAID5 or vice-versa in
> particular) are more difficult to get right, but are not that often
> needed online - or if they are, typically such scenarios will have
> enough temporary storage to create the new target, RAID1 over,
> disconnect the old part and free it, which will work just fine with DM.

The most common requests that we hear from customers are:

o single -> R1

	Equally possible with MD or DM assuming your singles are
	accessed via a volume manager.  Without that support the
	user will have to dismount and remount storage.

o R1 -> R10

	This should require just double the number of active members.
	This is not possible today with either DM or MD.  Only
	"migration" is possible.

o R1 -> R5
o R5 -> R1

	These typically occur when data access patterns change for
	the customer.  Again not possible with DM or MD today.

All of these are important to some subset of customers and are, to
my mind, required if you want to claim even basic morphing capability.
If you are allowing the "cop-out" of using a volume manager to substitute
data-migration for true morphing, then MD is almost as well suited to
that task as DM.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 23:46               ` Justin T. Gibbs
@ 2004-03-26  0:01                 ` Jeff Garzik
  2004-03-26  0:10                   ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-26  0:01 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid

Justin T. Gibbs wrote:
>>Jeff Garzik wrote:
>>
>>Just so there is no confusion...  the "failing over...in userland" thing I
>>mention is _only_ during discovery of the root disk.
> 
> 
> None of the solutions being talked about perform "failing over" in
> userland.  The RAID transforms which perform this operation are kernel
> resident in DM, MD, and EMD.  Perhaps you are talking about spare
> activation and rebuild?

This is precisely why I sent the second email, and made the 
qualification I did :)

For a "do it in userland" solution, an initrd or initramfs piece 
examines the system configuration, and assembles physical disks into 
RAID arrays based on the information it finds.  I was mainly implying 
that an initrd solution would have to provide some primitive failover 
initially, before the kernel is bootstrapped...  much like a bootloader 
that supports booting off a RAID1 array would need to do.

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:48             ` Jeff Garzik
@ 2004-03-25 23:46               ` Justin T. Gibbs
  2004-03-26  0:01                 ` Jeff Garzik
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-25 23:46 UTC (permalink / raw)
  To: Jeff Garzik, linux-kernel; +Cc: Kevin Corry, Neil Brown, linux-raid

> Jeff Garzik wrote:
> 
> Just so there is no confusion...  the "failing over...in userland" thing I
> mention is _only_ during discovery of the root disk.

None of the solutions being talked about perform "failing over" in
userland.  The RAID transforms which perform this operation are kernel
resident in DM, MD, and EMD.  Perhaps you are talking about spare
activation and rebuild?

> Similar code would need to go into the bootloader, for controllers that do
> not present the entire RAID array as a faked BIOS INT drive.

None of the solutions presented here are attempting to make RAID
transforms operate from the boot loader environment without BIOS
support.  I see this as a completely tangential problem to what is
being discussed.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 22:59           ` Justin T. Gibbs
@ 2004-03-25 23:44             ` Lars Marowsky-Bree
  2004-03-26  0:03               ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Lars Marowsky-Bree @ 2004-03-25 23:44 UTC (permalink / raw)
  To: Justin T. Gibbs, Kevin Corry, linux-kernel
  Cc: Jeff Garzik, Neil Brown, linux-raid

On 2004-03-25T15:59:00,
   "Justin T. Gibbs" <gibbs@scsiguy.com> said:

> The fact of the matter is that neither EMD nor DM provide a generic
> morphing capability.  If this is desirable, we can discuss how it could
> be achieved, but my initial belief is that attempting any type of
> complicated morphing from userland would be slow, prone to deadlocks,
> and thus difficult to achieve in a fashion that guaranteed no loss of
> data in the face of unexpected system restarts.

Uhm. DM sort of does (at least where the morphing amounts to resyncing a
part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc).
Freeze, load new mapping, continue.

I agree that more complex morphings (RAID1->RAID5 or vice-versa in
particular) are more difficult to get right, but are not that often
needed online - or if they are, typically such scenarios will have
enough temporary storage to create the new target, RAID1 over,
disconnect the old part and free it, which will work just fine with DM.


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	      \ ever tried. ever failed. no matter.
SUSE Labs			      | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:42           ` Jeff Garzik
  2004-03-25 18:48             ` Jeff Garzik
  2004-03-25 22:04             ` Lars Marowsky-Bree
@ 2004-03-25 23:35             ` Justin T. Gibbs
  2004-03-26  0:13               ` Jeff Garzik
  2004-03-26 19:15             ` Kevin Corry
  3 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-25 23:35 UTC (permalink / raw)
  To: Jeff Garzik, Kevin Corry; +Cc: linux-kernel, Neil Brown, linux-raid

> I respectfully disagree with the EMD folks that a userland approach is
> impossible, given all the failure scenarios.

I've never said that it was impossible, just unwise.  I believe
that a userland approach offers no benefit over allowing the kernel
to perform all meta-data operations.  The end result of such an
approach (given feature and robustness parity with the EMD solution)
is a larger resident size, code duplication, and more complicated
configuration/management interfaces.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:00         ` Kevin Corry
  2004-03-25 18:42           ` Jeff Garzik
@ 2004-03-25 22:59           ` Justin T. Gibbs
  2004-03-25 23:44             ` Lars Marowsky-Bree
  1 sibling, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-25 22:59 UTC (permalink / raw)
  To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid

>> Independent DM efforts have already started supporting MD raid0/1
>> metadata from what I understand, though these efforts don't seem to post
>> to linux-kernel or linux-raid much at all.  :/
> 
> I post on lkml.....occasionally. :)

...

> This decision was not based on any real dislike of the MD driver, but rather 
> for the benefits that are gained by using Device-Mapper. In particular, 
> Device-Mapper provides the ability to change out the device mapping on the 
> fly, by temporarily suspending I/O, changing the table, and resuming the I/O. 
> I'm sure many of you know this already. But I'm not sure everyone fully 
> understands how powerful a feature this is. For instance, it means EVMS can 
> now expand RAID-linear devices online. While that particular example may not 
> sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to 
> Device-Mapper, this feature would then allow you to do stuff like add new 
> "active" members to a RAID-1 online (think changing from 2-way mirror to 
> 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online 
> simply by adding a new disk (assuming other limitations, e.g. a single 
> stripe-zone). Unfortunately, these are things the MD driver can't do online, 
> because you need to completely stop the MD device before making such changes 
> (to prevent the kernel and user-space from trampling on the same metadata), 
> and MD won't stop the device if it's open (i.e. if it's mounted or if you 
> have other devices (LVM) built on top of MD). Oftentimes this means you need 
> to boot to a rescue-CD to make these types of configuration changes.

We should be clear about your argument here.  It is not that DM makes
generic morphing easy and possible, it is that with DM the most basic
types of morphing (no data striping or de-striping) are easily accomplished.
You cite two examples:

1) Adding another member to a RAID-1.  While MD may not allow this to
   occur while the array is operational, EMD does.  This is possible
   because there is only one entity controlling the meta-data.

2) Converting a RAID0 to a RAID4, while possible with DM, is not particularly
   interesting from an end-user perspective.

The fact of the matter is that neither EMD nor DM provide a generic
morphing capability.  If this is desirable, we can discuss how it could
be achieved, but my initial belief is that attempting any type of
complicated morphing from userland would be slow, prone to deadlocks,
and thus difficult to achieve in a fashion that guaranteed no loss of
data in the face of unexpected system restarts.

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:42           ` Jeff Garzik
  2004-03-25 18:48             ` Jeff Garzik
@ 2004-03-25 22:04             ` Lars Marowsky-Bree
  2004-03-26 19:19               ` Kevin Corry
  2004-03-25 23:35             ` Justin T. Gibbs
  2004-03-26 19:15             ` Kevin Corry
  3 siblings, 1 reply; 56+ messages in thread
From: Lars Marowsky-Bree @ 2004-03-25 22:04 UTC (permalink / raw)
  To: Jeff Garzik, Kevin Corry
  Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid

On 2004-03-25T13:42:12,
   Jeff Garzik <jgarzik@pobox.com> said:

> >and -5). And we've talked for a long time about wanting to port RAID-1 and 
> >RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started 
> >on any such work, or even had any significant discussions about *how* to 
> >do it. I can't 
> let's have that discussion :)

Nice 2.7 material, and parts I've always wanted to work on. (Including
making partition scanning entirely user-space on top of DM too.)

KS material?

> My take on things...  the configuration of RAID arrays got a lot more 
> complex with DDF and "host RAID" in general.

And then add all the other stuff, like scenarios where half of your RAID
is "somewhere" on the network via nbd, iSCSI or whatever and all the
other possible stackings... Definitely user-space material, and partly
because it /needs/ to have the input from the volume managers to do the
sane things.

The point about this implying that the superblock parsing/updating logic
needs to be duplicated between userspace and kernel land is valid too
though, and I'm keen on resolving this in a way which doesn't suck...


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
High Availability & Clustering	      \ ever tried. ever failed. no matter.
SUSE Labs			      | try again. fail again. fail better.
Research & Development, SUSE LINUX AG \ 	-- Samuel Beckett


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:42           ` Jeff Garzik
@ 2004-03-25 18:48             ` Jeff Garzik
  2004-03-25 23:46               ` Justin T. Gibbs
  2004-03-25 22:04             ` Lars Marowsky-Bree
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-25 18:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Kevin Corry, Neil Brown, Justin T. Gibbs, linux-raid

Jeff Garzik wrote:
> My take on things...  the configuration of RAID arrays got a lot more 
> complex with DDF and "host RAID" in general.  Association of RAID arrays 
> based on specific hardware controllers.  Silently building RAID0+1 
> stacked arrays out of non-RAID block devices the kernel presents. 
> Failing over when one of the drives the kernel presents does not respond.
> 
> All that just screams "do it in userland".

Just so there is no confusion...  the "failing over...in userland" thing 
I mention is _only_ during discovery of the root disk.

Similar code would need to go into the bootloader, for controllers that 
do not present the entire RAID array as a faked BIOS INT drive.

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25 18:00         ` Kevin Corry
@ 2004-03-25 18:42           ` Jeff Garzik
  2004-03-25 18:48             ` Jeff Garzik
                               ` (3 more replies)
  2004-03-25 22:59           ` Justin T. Gibbs
  1 sibling, 4 replies; 56+ messages in thread
From: Jeff Garzik @ 2004-03-25 18:42 UTC (permalink / raw)
  To: Kevin Corry; +Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid

Kevin Corry wrote:
> I'm guessing you're referring to EVMS in that comment, since we have done 
> *part* of what you just described. EVMS has always had a plugin to recognize 
> MD devices, and has been using the MD driver for quite some time (along with 
> using Device-Mapper for non-MD stuff). However, as of our most recent release 
> (earlier this month), we switched to using Device-Mapper for MD RAID-linear 
> and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" 
> module (both required to support LVM volumes), and it was a rather trivial 
> exercise to switch to activating these RAID devices using DM instead of MD.

nod


> This decision was not based on any real dislike of the MD driver, but rather 
> for the benefits that are gained by using Device-Mapper. In particular, 
> Device-Mapper provides the ability to change out the device mapping on the 
> fly, by temporarily suspending I/O, changing the table, and resuming the I/O. 
> I'm sure many of you know this already. But I'm not sure everyone fully 
> understands how powerful a feature this is. For instance, it means EVMS can 
> now expand RAID-linear devices online. While that particular example may not 
[...]

Sounds interesting but is mainly an implementation detail for the 
purposes of this discussion...

Some of this emd may want to use, for example.


> As for not posting this information on lkml and/or linux-raid, I do apologize 
> if this is something you would like to have been informed of. Most of the 
> recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken 
> that to mean the folks on the list aren't terribly interested in EVMS 
> developments. And since EVMS is a completely user-space tool and this 
> decision didn't affect any kernel components, I didn't think it was really 
> relevant to mention here. We usually discuss such things on 
> evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to 
> cross-post to lkml more often if it's something that might be pertinent.

Understandable...  for the stuff that impacts MD some mention of the 
work, on occasion, to linux-raid and/or linux-kernel would be useful.

I'm mainly looking at it from a standpoint of making sure that all the 
various RAID efforts are not independent of each other.


> We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some 
> point in the future, primarily for some of the reasons I mentioned above. 
> Obviously linear.c and raid0.c don't really need to be ported. DM provides 
> equivalent functionality, the discovery/activation can be driven from 
> user-space, and no in-kernel status updating is necessary (unlike RAID-1 and 
> -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 
> (and now RAID-6) to Device-Mapper targets, but we haven't started on any such 
> work, or even had any significant discussions about *how* to do it. I can't 

let's have that discussion :)

> imagine we would try this without at least involving Neil and other folks 
> from linux-raid, since it would be nice to actually reuse as much of the 
> existing MD code as possible (especially for RAID-5 and -6). I have no desire 
> to try to rewrite those from scratch.

<cheers>


> Device-Mapper does currently contain a mirroring module (still just in Joe's 
> -udm tree), which has primarily been used to provide online-move 
> functionality in LVM2 and EVMS. They've recently added support for persistent 
> logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 
> has some additional requirements for updating status in its superblock at 
> runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring 
> module could still be used, with the possibility of either adding 
> MD-RAID-1-specific information to the persistent-log module, or simply as an 
> additional log type.

WRT specific implementation, I would hope for the reverse -- that the 
existing, known, well-tested MD raid1 code would be used.  But perhaps 
that's a naive impression...  Folks with more knowledge of the 
implementation can make that call better than I.


I'd like to focus on the "additional requirements" you mention, as I 
think that is a key area for consideration.

There is a certain amount of metadata that -must- be updated at runtime, 
as you recognize.  Over and above what MD already cares about, DDF and 
its cousins introduce more items along those lines:  event logs, bad 
sector logs, controller-level metadata...  these are some of the areas I 
think Justin/Scott are concerned about.

My take on things...  the configuration of RAID arrays got a lot more 
complex with DDF and "host RAID" in general.  Association of RAID arrays 
based on specific hardware controllers.  Silently building RAID0+1 
stacked arrays out of non-RAID block devices the kernel presents. 
Failing over when one of the drives the kernel presents does not respond.

All that just screams "do it in userland".

OTOH, once the devices are up and running, the kernel needs to update some of 
that configuration itself.  Hot spare lists are an easy example, but any 
time the state of the overall RAID array changes, some host RAID 
formats, more closely tied to hardware than MD, may require 
configuration metadata changes when some hardware condition(s) change.

I respectfully disagree with the EMD folks that a userland approach is 
impossible, given all the failure scenarios.  In a userland approach, 
there -will- be some duplicated metadata-management code between 
userland and the kernel.  But for configuration _above_ the 
single-raid-array level, I think that's best left to userspace.

There will certainly be a bit of intra-raid-array management code in the 
kernel, including configuration updating.  I agree to its necessity... 
but that doesn't mean that -all- configuration/autorun stuff needs to be 
in the kernel.

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-25  2:21       ` Jeff Garzik
@ 2004-03-25 18:00         ` Kevin Corry
  2004-03-25 18:42           ` Jeff Garzik
  2004-03-25 22:59           ` Justin T. Gibbs
  0 siblings, 2 replies; 56+ messages in thread
From: Kevin Corry @ 2004-03-25 18:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid

On Wednesday 24 March 2004 8:21 pm, Jeff Garzik wrote:
> Neil Brown wrote:
> > Choice is good.  Competition is good.  I would not try to interfere
> > with you creating a new "emd" driver that didn't interfere with "md".
> > What Linus would think of it I really don't know.  It is certainly not
> > impossible that he would accept it.
>
> Agreed.
>
> Independent DM efforts have already started supporting MD raid0/1
> metadata from what I understand, though these efforts don't seem to post
> to linux-kernel or linux-raid much at all.  :/

I post on lkml.....occasionally. :)

I'm guessing you're referring to EVMS in that comment, since we have done 
*part* of what you just described. EVMS has always had a plugin to recognize 
MD devices, and has been using the MD driver for quite some time (along with 
using Device-Mapper for non-MD stuff). However, as of our most recent release 
(earlier this month), we switched to using Device-Mapper for MD RAID-linear 
and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" 
module (both required to support LVM volumes), and it was a rather trivial 
exercise to switch to activating these RAID devices using DM instead of MD.

This decision was not based on any real dislike of the MD driver, but rather 
for the benefits that are gained by using Device-Mapper. In particular, 
Device-Mapper provides the ability to change out the device mapping on the 
fly, by temporarily suspending I/O, changing the table, and resuming the I/O. 
I'm sure many of you know this already. But I'm not sure everyone fully 
understands how powerful a feature this is. For instance, it means EVMS can 
now expand RAID-linear devices online. While that particular example may not 
sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to 
Device-Mapper, this feature would then allow you to do stuff like add new 
"active" members to a RAID-1 online (think changing from 2-way mirror to 
3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online 
simply by adding a new disk (assuming other limitations, e.g. a single 
stripe-zone). Unfortunately, these are things the MD driver can't do online, 
because you need to completely stop the MD device before making such changes 
(to prevent the kernel and user-space from trampling on the same metadata), 
and MD won't stop the device if it's open (i.e. if it's mounted or if you 
have other devices (LVM) built on top of MD). Oftentimes this means you need 
to boot to a rescue-CD to make these types of configuration changes.

As for not posting this information on lkml and/or linux-raid, I do apologize 
if this is something you would like to have been informed of. Most of the 
recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken 
that to mean the folks on the list aren't terribly interested in EVMS 
developments. And since EVMS is a completely user-space tool and this 
decision didn't affect any kernel components, I didn't think it was really 
relevant to mention here. We usually discuss such things on 
evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to 
cross-post to lkml more often if it's something that might be pertinent.

> > However I'm not sure that having three separate device-array systems
> > (dm, md, emd) is actually a good idea.  It would probably be really
> > good to unite md and dm somehow, but no-one seems really keen on
> > actually doing the work.
>
> I would be disappointed if all the work that has gone into the MD driver
> is simply obsoleted by new DM targets.  Particularly RAID 1/5/6.
>
> You pretty much echoed my sentiments exactly...  ideally md and dm can
> be bound much more tightly to each other.  For example, convert md's
> raid[0156].c into device mapper targets...  but indeed, nobody has
> stepped up to do that so far.

We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some 
point in the future, primarily for some of the reasons I mentioned above. 
Obviously linear.c and raid0.c don't really need to be ported. DM provides 
equivalent functionality, the discovery/activation can be driven from 
user-space, and no in-kernel status updating is necessary (unlike RAID-1 and 
-5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 
(and now RAID-6) to Device-Mapper targets, but we haven't started on any such 
work, or even had any significant discussions about *how* to do it. I can't 
imagine we would try this without at least involving Neil and other folks 
from linux-raid, since it would be nice to actually reuse as much of the 
existing MD code as possible (especially for RAID-5 and -6). I have no desire 
to try to rewrite those from scratch.

Device-Mapper does currently contain a mirroring module (still just in Joe's 
-udm tree), which has primarily been used to provide online-move 
functionality in LVM2 and EVMS. They've recently added support for persistent 
logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 
has some additional requirements for updating status in its superblock at 
runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring 
module could still be used, with the possibility of either adding 
MD-RAID-1-specific information to the persistent-log module, or simply as an 
additional log type.

So, if this is the direction everyone else would like to see MD and DM take, 
we'd be happy to help out.

-- 
Kevin Corry
kevcorry@us.ibm.com
http://evms.sourceforge.net/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-24  2:26     ` Neil Brown
  2004-03-24 19:09       ` Matt Domsch
@ 2004-03-25  2:21       ` Jeff Garzik
  2004-03-25 18:00         ` Kevin Corry
  1 sibling, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-25  2:21 UTC (permalink / raw)
  To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel

Neil Brown wrote:
> Choice is good.  Competition is good.  I would not try to interfere
> with you creating a new "emd" driver that didn't interfere with "md". 
> What Linus would think of it I really don't know.  It is certainly not
> impossible that he would accept it.

Agreed.

Independent DM efforts have already started supporting MD raid0/1 
metadata from what I understand, though these efforts don't seem to post 
to linux-kernel or linux-raid much at all.  :/


> However I'm not sure that having three separate device-array systems
> (dm, md, emd) is actually a good idea.  It would probably be really
> good to unite md and dm somehow, but no-one seems really keen on
> actually doing the work.

I would be disappointed if all the work that has gone into the MD driver 
is simply obsoleted by new DM targets.  Particularly RAID 1/5/6.

You pretty much echoed my sentiments exactly...  ideally md and dm can 
be bound much more tightly to each other.  For example, convert md's 
raid[0156].c into device mapper targets...  but indeed, nobody has 
stepped up to do that so far.

	Jeff




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-24  2:26     ` Neil Brown
@ 2004-03-24 19:09       ` Matt Domsch
  2004-03-25  2:21       ` Jeff Garzik
  1 sibling, 0 replies; 56+ messages in thread
From: Matt Domsch @ 2004-03-24 19:09 UTC (permalink / raw)
  To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel

On Wed, Mar 24, 2004 at 01:26:47PM +1100, Neil Brown wrote:
> On Monday March 22, gibbs@scsiguy.com wrote:
> > One suggestion that was recently raised was to present these changes
> > in the form of an alternate "EMD" driver to avoid any potential
> > breakage of the existing MD.  Do you have any opinion on this?
> 
> I seriously think the best long-term approach for your emd work is to
> get it integrated into md.  I do listen to reason and I am not
> completely head-strong, but I do have opinions, and you would need to
> put in the effort to convince me.

I completely agree that long-term, md and emd need to be the same.
However, watching the pain that the IDE changes took in early 2.5, I'd
like to see emd be merged alongside md for the short-term while the
kinks get worked out, keeping in mind the desire to merge them
together again as soon as that happens. 

Thanks,
Matt

-- 
Matt Domsch
Sr. Software Engineer, Lead Engineer
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-23  6:23   ` Justin T. Gibbs
@ 2004-03-24  2:26     ` Neil Brown
  2004-03-24 19:09       ` Matt Domsch
  2004-03-25  2:21       ` Jeff Garzik
  0 siblings, 2 replies; 56+ messages in thread
From: Neil Brown @ 2004-03-24  2:26 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel

On Monday March 22, gibbs@scsiguy.com wrote:
> >> o Any successful solution will have to have "meta-data modules" for
> >>   active arrays "core resident" in order to be robust.  This
> 
> ...
> 
> > I agree.
> > 'Linear' and 'raid0' arrays don't really need metadata support in the
> > kernel as their metadata is essentially read-only.
> > There are interesting applications for raid1 without metadata, but I
> > think that for all raid personalities where metadata might need to be
> > updated in an error condition to preserve data integrity, the kernel
> > should know enough about the metadata to perform that update.
> > 
> > It would be nice to keep the in-kernel knowledge to a minimum, though
> > some metadata formats probably make that hard.
> 
> Can you further explain why you want to limit the kernel's knowledge
> and where you would separate the roles between kernel and userland?

General caution.
It is generally harder to change mistakes in the kernel than it is to
change mistakes in userspace, and similarly it is easier to add
functionality and configurability in userspace.  A design that puts
the control in userspace is therefore preferred.  A design that ties
you to working through a narrow user-kernel interface is disliked.
A design that gives easy control to user-space, and allows the kernel
to do simple things simply is probably best.

I'm not particularly concerned with code size and code duplication.  A
clean, expressive design is paramount.

> 2) Solution Complexity
> 
> Two entities understand how to read and manipulate the meta-data.
> Policies and APIs must be created to ensure that only one entity
> is performing operations on the meta-data at a time.  This is true
> even if one entity is primarily a read-only "client".  For example,
> a meta-data module may defer meta-data updates in some instances
> (e.g. rebuild checkpointing) until the meta-data is closed (writing
> the checkpoint sooner doesn't make sense considering that you should
> restart your scrub, rebuild or verify if the system is not safely
> shutdown).  How does the userland client get the most up-to-date
> information?  This is just one of the problems in this area.

If the kernel and userspace both need to know about metadata, then the
design must make clear how they communicate.  

> 
> > Currently partitions are (sufficiently) needs-driven.  It is true that
> > any partitionable device has its partitions presented.  However the
> > existence of partitions does not affect access to the whole device at
> > all.  Only once the partitions are claimed is the whole-device
> > blocked. 
> 
> This seems a slight digression from your earlier argument.  Is your
> concern that the arrays are auto-enumerated, or that the act of enumerating
> them prevents the component devices from being accessed (due to
> bd_claim)?

Primarily the latter.  But also that the act of enumerating them may
cause an update to an underlying device (e.g. a metadata update or
resync).  That is what I am particularly uncomfortable about.

> 
> > Providing that auto-assembly of arrays works the same way (needs
> > driven), I am happy for arrays to auto-assemble.
> > I happen to think this most easily done in user-space.
> 
> I don't know how to reconcile a needs based approach with system
> features that require arrays to be exported as soon as they are
> detected.
> 

Maybe if arrays were auto-assembled in a read-only mode that guaranteed
not to write to the devices *at all* and did not bd_claim them.

When they are needed (either through some explicit set-writable command
or through an implicit first-write) then the underlying components are
bd_claimed.  If that succeeds, the array becomes "live".  If it fails,
it stays read-only.
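
A rough fragment of what that transition could look like against the current
md internals (simplified: the helper name is invented, locking and error paths
are trimmed, and it leans on the 2.6 bd_claim/bd_release interfaces):

/* Invented helper: claim every member on the read-only -> writable
 * transition; if any member is already owned elsewhere, back out and
 * leave the array read-only. */
static int array_make_writable(mddev_t *mddev)
{
	mdk_rdev_t *rdev, *failed = NULL;
	struct list_head *tmp;

	ITERATE_RDEV(mddev, rdev, tmp) {
		if (bd_claim(rdev->bdev, mddev)) {
			failed = rdev;
			break;
		}
	}
	if (failed) {
		/* release only the members claimed before the failure */
		ITERATE_RDEV(mddev, rdev, tmp) {
			if (rdev == failed)
				break;
			bd_release(rdev->bdev);
		}
		return -EBUSY;
	}
	mddev->ro = 0;	/* array goes live */
	return 0;
}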

> 
> > But back to your original post:  I suspect there is lots of valuable
> > stuff in your emd patch, but as you have probably gathered, big
> > patches are not the way we work around here, and with good reason.
> > 
> > If you would like to identify isolated pieces of functionality, create
> > patches to implement them, and submit them for review I will be quite
> > happy to review them and, when appropriate, forward them to
> > Andrew/Linus.
> > I suggest you start with less controversial changes and work your way
> > forward.
> 
> One suggestion that was recently raised was to present these changes
> in the form of an alternate "EMD" driver to avoid any potential
> breakage of the existing MD.  Do you have any opinion on this?

Choice is good.  Competition is good.  I would not try to interfere
with you creating a new "emd" driver that didn't interfere with "md". 
What Linus would think of it I really don't know.  It is certainly not
impossible that he would accept it.

However I'm not sure that having three separate device-array systems
(dm, md, emd) is actually a good idea.  It would probably be really
good to unite md and dm somehow, but no-one seems really keen on
actually doing the work.

I seriously think the best long-term approach for your emd work is to
get it integrated into md.  I do listen to reason and I am not
completely head-strong, but I do have opinions, and you would need to
put in the effort to convince me.

NeilBrown


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-23  5:05 ` Neil Brown
@ 2004-03-23  6:23   ` Justin T. Gibbs
  2004-03-24  2:26     ` Neil Brown
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-23  6:23 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, linux-kernel

>> o Any successful solution will have to have "meta-data modules" for
>>   active arrays "core resident" in order to be robust.  This

...

> I agree.
> 'Linear' and 'raid0' arrays don't really need metadata support in the
> kernel as their metadata is essentially read-only.
> There are interesting applications for raid1 without metadata, but I
> think that for all raid personalities where metadata might need to be
> updated in an error condition to preserve data integrity, the kernel
> should know enough about the metadata to perform that update.
> 
> It would be nice to keep the in-kernel knowledge to a minimum, though
> some metadata formats probably make that hard.

Can you further explain why you want to limit the kernel's knowledge
and where you would separate the roles between kernel and userland?

In reviewing one of our typical metadata modules, perhaps 80% of the code
is generic meta-data record parsing and state conversion logic that would
have to be retained in the kernel to perform "minimal meta-data updates".
Some high portion of this 80% (less the portion that builds the in-kernel
data structures to manipulate and update meta-data) would also need to be
replicated into a user-land utility for any type of separation of labor to
be possible.  The remaining 20% of the kernel code deals with validation of
user meta-data creation requests.  This code is relatively small since
it leverages all of the other routines that are already required for
the operational requirements of the module.

Splitting the roles brings up some important issues:

1) Code duplication.

Depending on the complexity of the meta-data format being supported,
the amount of code duplication between userland and kernel modules
may be quite large.  Any time code is duplicated, the solution is
prone to getting out of sync - bugs are fixed in one copy of the code
but not another.

2) Solution Complexity

Two entities understand how to read and manipulate the meta-data.
Policies and APIs must be created to ensure that only one entity
is performing operations on the meta-data at a time.  This is true
even if one entity is primarily a read-only "client".  For example,
a meta-data module may defer meta-data updates in some instances
(e.g. rebuild checkpointing) until the meta-data is closed (writing
the checkpoint sooner doesn't make sense considering that you should
restart your scrub, rebuild or verify if the system is not safely
shutdown).  How does the userland client get the most up-to-date
information?  This is just one of the problems in this area.

3) Size

Due to code duplication, the total solution will be larger in
code size.

What benefits of operating in userland outweigh these issues?

>> o It is desirable for arrays to auto-assemble based on recorded
>>   meta-data.  This includes the ability to have a user hot-insert
>>   a "cold spare", have the system recognize it as a spare (based
>>   on the meta-data resident on it) and activate it if necessary to
>>   restore a degraded array.
> 
> Certainly.  It doesn't follow that the auto-assembly has to happen
> within the kernel.  Having it all done in user-space makes it much
> easier to control/configure.
> 
> I think the best way to describe my attitude to auto-assembly is that
> it could be needs-driven rather than availability-driven.
> 
> needs-driven means: if the user asks to access an array that doesn't
>   exist, then try to find the bits and assemble it.
> availability driven means: find all the devices that could be part of
>   an array, and combine as many of them as possible together into
>   arrays.
> 
> Currently filesystems are needs-driven.  At boot time, only the root
> filesystem, which has been identified somehow, gets mounted. 
> Then the init scripts mount any others that are needed.
> We don't have any hunting around for filesystem superblocks and
> mounting the filesystems just in case they are needed.

Are filesystems the correct analogy?  Consider that a user's attempt
to mount a filesystem by label requires that all of the "block devices"
that might contain that filesystem be enumerated automatically by
the system.  In this respect, the system is treating an MD device in
exactly the same way as a SCSI or IDE disk.  The array must be exported
to the system on an "availability basis" in order for the "needs-driven"
features of the system to behave as expected.

> Currently partitions are (sufficiently) needs-driven.  It is true that
> any partitionable device has its partitions presented.  However the
> existence of partitions does not affect access to the whole device at
> all.  Only once the partitions are claimed is the whole-device
> blocked. 

This seems a slight digression from your earlier argument.  Is your
concern that the arrays are auto-enumerated, or that the act of enumerating
them prevents the component devices from being accessed (due to bd_claim)?

> Providing that auto-assembly of arrays works the same way (needs
> driven), I am happy for arrays to auto-assemble.
> I happen to think this most easily done in user-space.

I don't know how to reconcile a needs based approach with system
features that require arrays to be exported as soon as they are
detected.

> With DDF format metadata, there is a concept of 'imported' arrays,
> which basically means arrays from some other controller that have been
> attached to the current controller.
> 
> Part of my desire for needs-driven assembly is that I don't want to
> inadvertently assemble 'imported' arrays.
> A DDF controller has NVRAM or a hardcoded serial number to help avoid
> this.  A generic Linux machine doesn't.
> 
> I could possibly be happy with auto-assembly where a kernel parameter
> of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF
> arrays that have a controller-id (or whatever it is) of xx.yy.zz.
> 
> This is probably simple enough to live entirely in the kernel.

The concept of "importing" an array doesn't really make sense in
the case of MD's DDF.  To fully take advantage of features like
a controller BIOS's ability to natively boot an array, the disks
for that domain must remain in that controller's domain.  Determining
the domain to assign to new arrays will require input from the user
since there is limited topology information available to MD.  The
user will also have the ability to assign newly created  arrays to
the "MD Domain" which is not tied to any particular controller domain.

...

> But back to your original post:  I suspect there is lots of valuable
> stuff in your emd patch, but as you have probably gathered, big
> patches are not the way we work around here, and with good reason.
> 
> If you would like to identify isolated pieces of functionality, create
> patches to implement them, and submit them for review I will be quite
> happy to review them and, when appropriate, forward them to
> Andrew/Linus.
> I suggest you start with less controversial changes and work your way
> forward.

One suggestion that was recently raised was to present these changes
in the form of an alternate "EMD" driver to avoid any potential
breakage of the existing MD.  Do you have any opinion on this?

--
Justin


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-19 20:19 Justin T. Gibbs
@ 2004-03-23  5:05 ` Neil Brown
  2004-03-23  6:23   ` Justin T. Gibbs
  0 siblings, 1 reply; 56+ messages in thread
From: Neil Brown @ 2004-03-23  5:05 UTC (permalink / raw)
  To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel

On Friday March 19, gibbs@scsiguy.com wrote:
> [ CC trimmed since all those on the CC line appear to be on the lists ... ]
> 
> Lets take a step back and focus on a few of the points to which we can
> hopefully all agree:
> 
> o Any successful solution will have to have "meta-data modules" for
>   active arrays "core resident" in order to be robust.  This
>   requirement stems from the need to avoid deadlock during error
>   recovery scenarios that must block "normal I/O" to the array while
>   meta-data operations take place.

I agree.
'Linear' and 'raid0' arrays don't really need metadata support in the
kernel as their metadata is essentially read-only.
There are interesting applications for raid1 without metadata, but I
think that for all raid personalities where metadata might need to be
updated in an error condition to preserve data integrity, the kernel
should know enough about the metadata to perform that update.

It would be nice to keep the in-kernel knowledge to a minimum, though
some metadata formats probably make that hard.

> 
> o It is desirable for arrays to auto-assemble based on recorded
>   meta-data.  This includes the ability to have a user hot-insert
>   a "cold spare", have the system recognize it as a spare (based
>   on the meta-data resident on it) and activate it if necessary to
>   restore a degraded array.

Certainly.  It doesn't follow that the auto-assembly has to happen
within the kernel.  Having it all done in user-space makes it much
easier to control/configure.

I think the best way to describe my attitude to auto-assembly is that
it could be needs-driven rather than availability-driven.

needs-driven means: if the user asks to access an array that doesn't
  exist, then try to find the bits and assemble it.
availability driven means: find all the devices that could be part of
  an array, and combine as many of them as possible together into
  arrays.

Currently filesystems are needs-driven.  At boot time, only the root
filesystem, which has been identified somehow, gets mounted. 
Then the init scripts mount any others that are needed.
We don't have any hunting around for filesystem superblocks and
mounting the filesystems just in case they are needed.

Currently partitions are (sufficiently) needs-driven.  It is true that
any partitionable device has its partitions presented.  However the
existence of partitions does not affect access to the whole device at
all.  Only once the partitions are claimed is the whole-device
blocked. 

Providing that auto-assembly of arrays works the same way (needs
driven), I am happy for arrays to auto-assemble.
I happen to think this most easily done in user-space.

With DDF format metadata, there is a concept of 'imported' arrays,
which basically means arrays from some other controller that have been
attached to the current controller.

Part of my desire for needs-driven assembly is that I don't want to
inadvertently assemble 'imported' arrays.
A DDF controller has NVRAM or a hardcoded serial number to help avoid
this.  A generic Linux machine doesn't.

I could possibly be happy with auto-assembly where a kernel parameter
of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF
arrays that have a controller-id (or whatever it is) of xx.yy.zz.

This is probably simple enough to live entirely in the kernel.
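
Something like this would be enough on the kernel side (a fragment; the
variable name and how the DDF module would consume it are placeholders):

#include <linux/init.h>
#include <linux/string.h>

/* Remember the controller id passed as DDF=xx.yy.zz so the DDF
 * metadata code can restrict auto-assembly to that domain. */
static char ddf_assemble_id[64];

static int __init ddf_setup(char *str)
{
	strlcpy(ddf_assemble_id, str, sizeof(ddf_assemble_id));
	return 1;
}
__setup("DDF=", ddf_setup);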

> 
> o Child devices of an array should only be accessible through the
>   array while the array is in a configured state (bd_claim'ed).
>   This avoids situations where a user can subvert the integrity of
>   the array by performing "rogue I/O" to an array member.

bd_claim doesn't and (I believe) shouldn't stop access from
user-space.
It does stop a number of sorts of access that would expect exclusive
access. 


But back to your original post:  I suspect there is lots of valuable
stuff in your emd patch, but as you have probably gathered, big
patches are not the way we work around here, and with good reason.

If you would like to identify isolated pieces of functionality, create
patches to implement them, and submit them for review I will be quite
happy to review them and, when appropriate, forward them to
Andrew/Linus.
I suggest you start with less controversial changes and work your way
forward.

NeilBrown

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
  2004-03-18  2:00     ` Jeff Garzik
@ 2004-03-20  9:58       ` Jamie Lokier
  0 siblings, 0 replies; 56+ messages in thread
From: Jamie Lokier @ 2004-03-20  9:58 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andi Kleen, linux-raid, justin_gibbs, Linux Kernel

Jeff Garzik wrote:
> I'll probably have to illustrate with code, but basically, read/write 
> can be completely ignorant of 32/64-bit architecture, endianness, it can 
> even be network-transparent.  ioctls just can't do that.

Apart from the network transparency, yes they can.

Ioctl is no different from read/write/read-modify-write except
the additional command argument.

You can write architecture-specific ioctls which take and return
structs -- and you can do the same with read/write.  This is what
Andi is thinking of as dangerous: the read/write case is then much
harder to emulate.

Or, you can write architecture-independent read/write, which use fixed
formats, which you seem to have in mind.  That works fine with ioctls too.

It isn't commonly done, because people prefer the convenience of a
struct.  But it does work.  It's slightly easier in the driver to
implement commands this way using an ioctl, because you don't have to
check the read/write length.  It's about the same to use from
userspace: both read/write and ioctl methods using an
architecture-independent data format require the program to lay out
the command bytes and then issue one system call.
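
For instance, a fixed-width, explicitly little-endian command record needs no
translation layer whichever way it is delivered (the opcode numbering and the
/dev node below are invented for illustration):

#include <endian.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* One 64-byte command record, equally usable as a write(2) payload
 * or an ioctl argument. */
struct raid_cmd {
	uint32_t opcode;	/* say, 1 == activate spare */
	uint32_t array_id;
	uint64_t member_dev;	/* device number of the new member */
	uint8_t  pad[48];
} __attribute__((packed));

int main(void)
{
	struct raid_cmd cmd;
	int fd = open("/dev/emd_ctl", O_RDWR);

	if (fd < 0)
		return 1;
	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode     = htole32(1);
	cmd.array_id   = htole32(0);
	cmd.member_dev = htole64(0x800010);	/* illustrative value */
	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd))
		return 1;
	return close(fd);
}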

-- Jamie

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
@ 2004-03-19 20:19 Justin T. Gibbs
  2004-03-23  5:05 ` Neil Brown
  0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-19 20:19 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel

[ CC trimmed since all those on the CC line appear to be on the lists ... ]

Lets take a step back and focus on a few of the points to which we can
hopefully all agree:

o Any successful solution will have to have "meta-data modules" for
  active arrays "core resident" in order to be robust.  This
  requirement stems from the need to avoid deadlock during error
  recovery scenarios that must block "normal I/O" to the array while
  meta-data operations take place.

o It is desirable for arrays to auto-assemble based on recorded
  meta-data.  This includes the ability to have a user hot-insert
  a "cold spare", have the system recognize it as a spare (based
  on the meta-data resident on it) and activate it if necessary to
  restore a degraded array.

o Child devices of an array should only be accessible through the
  array while the array is in a configured state (bd_claim'ed).
  This avoids situations where a user can subvert the integrity of
  the array by performing "rogue I/O" to an array member.

Concentrating on just these three, we come to the conclusion that
whether the solution comes via "early user fs" or kernel modules,
the resident size of the solution *will* include the cost for
meta-data support.  In either case, the user is able to tailor their
system to include only the support necessary for their individual
system to operate.

If we want to argue the merits of either approach based on just the
sheer size of resident code, I have little doubt that the kernel
module approach will prove smaller:

 o No need for "mdadm" or some other daemon to be locked resident in
   memory.  This alone saves you from keeping a locked copy of klibc or
   any other user libraries core resident.  The kernel modules leverage
   kernel APIs that already have to be core resident to satisfy other
   parts of the kernel, which also helps keep their size down.

 o Initial RAM disk data can be discarded after modules are loaded at
   boot time.

Putting the size argument aside for a moment, let's explore how a
userland solution could satisfy just the above three requirements.

How is meta-data updated on child members of an array while that
array is on-line?  Remember that these operations occur with some
frequency.  MD includes "safe-mode" support, where redundant arrays
are marked clean any time writes cease for a predetermined, fairly
short, amount of time.  The userland app cannot access the component
devices directly since they are bd_claim'ed.  Even if that mechanism
is somehow subverted, how do we guarantee that these meta-data
writes do not cause a deadlock?  In the case of a transition from
Read-only to Write mode, all writes to the array are blocked (this
must be the case for the "Dirty" state to be accurate).  It seems to
me that you must then not only pre-allocate buffers for the userland
app to do its work, but also provide a "back-door" interface for
these operations to take place.

The argument has also been made that shifting some of this code out
to a userland app "simplifies" the solution and perhaps even makes
it easier to develop.  Comparing the two approaches we have:

UserFS:
      o Kernel Driver + "enhanced interface to userland daemon"
      o Userland Daemon (core resident)
      o Userland Meta-Data modules
      o Userland Management tool
	 - This tool needs to interface to the daemon and
	   perhaps also the kernel driver.

Kernel:
      o Kernel RAID Transform Drivers
      o Kernel Meta-Data modules
      o Simple Userland Management tool with no meta-data knowledge

So two questions arise from this analysis:

1) Are meta-data modules easier to code up or more robust as user
   or kernel modules?  I believe that doing these outside the kernel
   will make them larger and more complex while also losing the
   ability to have meta-data modules weigh in on rapidly occurring
   events without incurring performance tradeoffs.  Regardless of
   where they reside, these modules must be robust.  A kernel Oops
   or a segfault in the daemon is unacceptable to the end user.
   Saying that a segfault is less harmful in some way than an Oops
   when we're talking about the user's data completely misses the
   point of why people use RAID.

2) What added complexity is incurred by supporting both a core
   resident daemon as well as management interfaces to the daemon
   and potentially the kernel module?  I have not fully thought
   through the corner cases such an approach would expose, so I
   cannot quantify this cost.  There are certainly more components
   to get right and keep synchronized.

In the end, I find it hard to justify inventing all of the userland
machinery necessary to make this work just to keep roughly 2K lines
of code per meta-data module out of the kernel.  The ASR module, for
example, which is only required by those who need support for that
meta-data type, is only 19K unstripped with all of its debugging
printks and code enabled.  Are there benefits to the userland
approach that I'm missing?

--
Justin



* Re: "Enhanced" MD code avaible for review
  2004-03-18  1:33   ` Andi Kleen
@ 2004-03-18  2:00     ` Jeff Garzik
  2004-03-20  9:58       ` Jamie Lokier
  0 siblings, 1 reply; 56+ messages in thread
From: Jeff Garzik @ 2004-03-18  2:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-raid, justin_gibbs, Linux Kernel

Andi Kleen wrote:
> Sorry, Jeff, but that's just not true. While ioctls need an additional
> entry in the conversion table, they can at least easily get a
> translation handler if needed. When they are correctly designed, you
> just need a single line to pass them through the emulation.
> If you don't want to add that line to the generic compat_ioctl.h
> file, you can also do it in your driver.
> 
> read/write has the big disadvantage that if someone gets the emulation
> wrong (and that happens regularly), it is nearly impossible to add an
> emulation handler after the fact, because there is no good way to hook
> into the read/write paths.
> 
> There may be valid reasons to go for read/write, but 32-bit emulation
> is not one of them. In fact, from the emulation perspective, read/write
> should be avoided.

I'll probably have to illustrate with code, but basically, read/write
can be completely ignorant of 32/64-bit architecture and endianness; it
can even be network-transparent.  ioctls just can't do that.
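
For illustration, a rough sketch of the driver side of such an
interface -- hypothetical names, not code from any existing driver.
The driver accepts fixed-layout packets through write(2), so one
handler serves 32- and 64-bit callers with no translation layer:

	#include <linux/module.h>
	#include <linux/fs.h>
	#include <linux/errno.h>
	#include <linux/types.h>
	#include <asm/uaccess.h>

	/* Only explicitly sized fields: identical layout for all callers. */
	struct raidctl_cmd {
		__u32 opcode;
		__u32 raid_level;
		__u64 chunk_bytes;
		__u8  uuid[16];
	} __attribute__((packed));

	static ssize_t raidctl_write(struct file *file, const char __user *buf,
				     size_t count, loff_t *ppos)
	{
		struct raidctl_cmd cmd;

		if (count != sizeof(cmd))
			return -EINVAL;
		if (copy_from_user(&cmd, buf, sizeof(cmd)))
			return -EFAULT;

		/* dispatch on cmd.opcode; results come back via read(2) */

		return count;
	}

	static struct file_operations raidctl_fops = {
		.owner = THIS_MODULE,
		.write = raidctl_write,
	};

The same packets could travel over a socket just as easily, which is
where the network transparency comes from.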

	Jeff





* Re: "Enhanced" MD code avaible for review
       [not found] ` <1AOTW-4Vx-5@gated-at.bofh.it>
@ 2004-03-18  1:33   ` Andi Kleen
  2004-03-18  2:00     ` Jeff Garzik
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2004-03-18  1:33 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-raid, justin_gibbs, Linux Kernel

Jeff Garzik <jgarzik@pobox.com> writes:
>
> ioctl's are a pain for 32->64-bit translation layers.  Using a
> read/write interface allows one to create an interface that requires
> no translation layer -- a big deal for AMD64 and IA32e processors
> moving forward -- and it also gives one a lot more control over the
> interface.

Sorry, Jeff, but that's just not true. While ioctls need an additional
entry in the conversion table, they can at least easily get a
translation handler if needed. When they are correctly designed, you
just need a single line to pass them through the emulation.
If you don't want to add that line to the generic compat_ioctl.h
file, you can also do it in your driver.

read/write has the big disadvantage that if someone gets the emulation
wrong (and that happens regularly), it is nearly impossible to add an
emulation handler after the fact, because there is no good way to hook
into the read/write paths.

There may be valid reasons to go for read/write, but 32-bit emulation
is not one of them. In fact, from the emulation perspective, read/write
should be avoided.
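
For a correctly designed ioctl the whole 32-bit story can be a single
COMPATIBLE_IOCTL(RAIDCTL_SUBMIT) line in the generic list, or the
equivalent registration from the driver itself.  A rough sketch of the
in-driver variant, using a hypothetical RAIDCTL_SUBMIT command -- the
registration calls are the 2.6 ioctl32 interface, the command and
struct are invented for illustration:

	#include <linux/init.h>
	#include <linux/module.h>
	#include <linux/types.h>
	#include <linux/ioctl.h>
	#include <linux/ioctl32.h>

	struct raidctl_cmd {		/* hypothetical, fixed layout */
		__u32 opcode;
		__u32 raid_level;
		__u64 chunk_bytes;
	};
	#define RAIDCTL_SUBMIT _IOW('r', 0x01, struct raidctl_cmd)

	static int __init raidctl_compat_init(void)
	{
		/* A NULL handler registers the command as pass-through,
		 * i.e. the native 64-bit ioctl path is reused as-is. */
		return register_ioctl32_conversion(RAIDCTL_SUBMIT, NULL);
	}

	static void __exit raidctl_compat_exit(void)
	{
		unregister_ioctl32_conversion(RAIDCTL_SUBMIT);
	}

	module_init(raidctl_compat_init);
	module_exit(raidctl_compat_exit);
	MODULE_LICENSE("GPL");

If the argument struct changes layout between 32- and 64-bit userland,
the NULL handler is replaced with a real conversion function -- that is
the hook read/write does not have.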

-Andi



