From: Neil Brown <neilb@suse.de>
To: Christoph Hellwig <hch@infradead.org>
Cc: James Bottomley <James.Bottomley@suse.de>,
	Lars Ellenberg <lars.ellenberg@linbit.com>,
	linux-kernel@vger.kernel.org, drbd-dev@lists.linbit.com,
	Andrew Morton <akpm@linux-foundation.org>,
	Bart Van Assche <bart.vanassche@gmail.com>,
	Dave Jones <davej@redhat.com>, Greg KH <gregkh@suse.de>,
	Jens Axboe <jens.axboe@oracle.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Kyle Moffett <kyle@moffetthome.net>,
	Lars Marowsky-Bree <lmb@suse.de>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>,
	Nikanth Karthikesan <knikanth@suse.de>,
	Philipp Reisner <philipp.reisner@linbit.com>,
	Sam Ravnborg <sam@ravnborg.org>
Subject: Re: [GIT PULL] DRBD for 2.6.32
Date: Fri, 18 Sep 2009 13:32:07 +1000
Message-ID: <19122.65335.126937.476968@notabene.brown>
In-Reply-To: message from Christoph Hellwig on Thursday September 17

On Thursday September 17, hch@infradead.org wrote:
> On Thu, Sep 17, 2009 at 10:02:45AM -0600, James Bottomley wrote:
> > So I think Christoph's NAK is rooted in the fact that we have a
> > proliferation of in-kernel RAID implementations and he's trying to
> > reunify them all again.
> > 
> > As part of the review, reusing the kernel RAID (and actually logging)
> > logic did come up and you added it to your todo list.  Perhaps expanding
> > on the status of that would help, since what's being looked for is that
> > you're not adding more work to the RAID reunification effort and that
> > you do have a plan and preferably a time frame for coming into sync with
> > it.
> 
> Yes.  DRBD has spent tons of time out of tree, and if they want to put
> it in now I think requiring them to do their homework is a good idea.

What homework?

If there was a sensible unifying framework in the kernel that they
could plug in to, then requiring them to do that might make sense.  But
there isn't.  You/I/We haven't created a solution (i.e. there is no
equivalent of the VFS for virtual block devices) and saying that
because we haven't they cannot merge DRBD hardly seems fair.

Indeed, merging DRBD must be seen as a *good* thing as we then have
more examples of differing requirements against which a proposed
solution can be measured and tested.

I thought the current attitude was "merge then fix".  That is what the
drivers/staging tree seems to be all about.  Maybe you could argue
that DRBD should go in to 'staging' first (though I don't think that
is appropriate or required myself), but keeping it out just seems
wrong.

> 
> Note that the in-kernel raid implementation is just a rather small part
> of this, what's much more important is the user interface.  A big part
> of raid unification is that we can support one proper interface to deal
> with raid vs volume management, and DRBD adds another totally
> incompatible one to that.  We'd be much better off adding the drbd in
> the write protocol (at least the most recent version) to DM instead of
> adding another big chunk of framework.

I agree that the interface is very important.  But the 'dm' interface
and the 'md' interface (both imperfect) are not going away any time
soon and there is no reason to expect that the DRBD interface has to
be sacrificed simply because they didn't manage to get it in-kernel
before now.

Let me try to paint a partial picture for you to show how my thoughts
have been going.  I'm looking at this from the perspective of the
driver model, particularly exposed through sysfs.

A 'block device' like 'sda' has a parent in sysfs, which represents
(e.g.) the SCSI device which provides the storage that is exposed
through 'sda'.  e.g.
  .../target0:0:0/0:0:0:0/block/sda
      ^target     ^lun   ^padding ^block-device
Block devices 'md0' or 'mapper/whatever' don't have a real parent and
so live in /sys/devices/virtual/block which is really just a
place-holder because there is no real parent.  There should be.
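
As a rough illustration of the distinction above (a sketch only: the
helper names and the PCI path in the demo are made up for this example),
the resolved sysfs path of a block device shows whether it has a real
parent or sits under the virtual place-holder:

```python
import os


def block_device_path(name, sysfs_root="/sys"):
    """Resolve the class/block symlink to the device's place in the tree."""
    return os.path.realpath(os.path.join(sysfs_root, "class", "block", name))


def is_virtual(device_path, sysfs_root="/sys"):
    """True if the device lives under devices/virtual/block (no real parent)."""
    virtual = os.path.join(sysfs_root, "devices", "virtual", "block")
    return os.path.commonpath([device_path, virtual]) == virtual


if __name__ == "__main__":
    # Hypothetical paths, shaped like the examples in the text above.
    examples = [
        "/sys/devices/pci0000:00/host0/target0:0:0/0:0:0:0/block/sda",
        "/sys/devices/virtual/block/md0",
    ]
    for path in examples:
        print(path, "virtual" if is_virtual(path) else "has a real parent")
```

On a live system, block_device_path("sda") could be passed to
is_virtual() instead of the hard-coded strings.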

So I would propose a 'bus' device which contains virtual block devices
- 'vbd's.  There is probably just one instance of this bus.

A 'vbd' is somewhat like a SCSI target (or maybe 'lun').
The preferred way to create a vbd is to write a device name to a
'scan' file in the 'bus' device. (similar to ....scsi_host/host0/scan).
Legacy interfaces (md,dm,drbd,loop,...) would be able to do the same
thing using an internal interface.

This would make the named vbd appear in the bus and it would have some
attribute files which could be filled in to describe the device.
Writing one of these attributes would activate the device and make a
'block device' come into existence.  The block device would be a child
of the vbd, just like sda is a child of a SCSI target.

When a vbd is being managed by a legacy interface (md, dm, drbd...) it
would probably have a second child device which represents that
interface.

So to be a bit concrete:

  /sys/devices/virtual/vdbus   would be the bus
  /sys/devices/virtual/vdbus/md0  would be the vbd for an md device
  /sys/devices/virtual/vdbus/md0/block/md0 would be the block device
  /sys/devices/virtual/vdbus/md0/md/md0 would be an 'md' device
                           representing the (legacy) md interface.

For compatibility (maybe only temporarily),
  /sys/devices/virtual/vdbus/md0/block/md0/md -> /sys/devices/virtual/vdbus/md0/md/md0
 
so the current /sys/block/mdX/md/ directory still works.
That directory would largely contain symlinks up to the parent,
though possibly with different names.
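
The layout above can be mocked up in a scratch directory to check that
the compatibility link resolves as intended.  This is only a sketch:
nothing here touches the real sysfs, 'root' stands in for /sys, and
build_vbd_layout() is a made-up helper, not an existing interface.

```python
import os
import tempfile


def build_vbd_layout(root, name):
    """Create a mock of the proposed hierarchy for one md-backed vbd."""
    bus = os.path.join(root, "devices", "virtual", "vdbus")
    vbd = os.path.join(bus, name)                # the vbd itself
    block = os.path.join(vbd, "block", name)     # block device, child of the vbd
    legacy = os.path.join(vbd, "md", name)       # legacy 'md' interface device
    os.makedirs(block)
    os.makedirs(legacy)
    # Compatibility link: .../block/<name>/md -> ../../md/<name>
    os.symlink(os.path.relpath(legacy, block), os.path.join(block, "md"))
    return vbd, block, legacy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        vbd, block, legacy = build_vbd_layout(root, "md0")
        compat = os.path.join(block, "md")
        print("link target:", os.readlink(compat))
        print("resolves to legacy device:",
              os.path.realpath(compat) == os.path.realpath(legacy))
```

The link is relative, so it keeps working wherever the tree is mounted.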


The next bit is the messy bit, for which I haven't come up with an
adequate solution yet:
  What is the relationship between the component devices and the vbd
  device?

This is clearly a dependency, and sysfs has a clear model for
representing dependencies:  The child is dependent on the parent.
However with a vbd, the child is dependent on multiple parents and those
dependencies change.
As reported in http://lwn.net/Articles/347573/, other things have
multiple dependencies too, so we should probably try to make sure a
solution is created that fits both needs.
Personally, I would much rather all the dependencies were links, and
the directory hierarchy was
   /sys/subsystem/$SUBSYSTEM/devices/$DEVICE
(where 'subsystem' subsumes both 'class' and 'bus').  But it is
probably 7 years too late for that.
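
To make the "all dependencies are links" idea concrete, here is a small
mock in a scratch directory.  It is a sketch under assumptions: the
'deps' directory name and the helper functions are invented for
illustration, and 'root' stands in for /sys.

```python
import os
import tempfile


def add_device(root, subsystem, device):
    """Create a flat subsystem/<subsystem>/devices/<device> entry."""
    path = os.path.join(root, "subsystem", subsystem, "devices", device)
    os.makedirs(os.path.join(path, "deps"))  # 'deps' is a hypothetical name
    return path


def set_dependencies(device_path, parents):
    """Replace the device's dependency links; the device itself never moves."""
    deps = os.path.join(device_path, "deps")
    for entry in os.listdir(deps):
        os.unlink(os.path.join(deps, entry))
    for parent in parents:
        link = os.path.join(deps, os.path.basename(parent))
        os.symlink(os.path.relpath(parent, deps), link)


def dependencies(device_path):
    """List the names of the devices this device currently depends on."""
    return sorted(os.listdir(os.path.join(device_path, "deps")))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        sda = add_device(root, "block", "sda")
        sdb = add_device(root, "block", "sdb")
        md0 = add_device(root, "block", "md0")
        set_dependencies(md0, [sda])        # starts on a single component
        set_dependencies(md0, [sda, sdb])   # later gains a second one
        print(dependencies(md0))
```

Because the device directory stays put and only the links change, both
multiple parents and changing parents fit without reshaping the tree.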

The other thing I would really like to be able to manage is for a
'class/block' device to be able to be moved from one parent to
another.  This would make it possible to change a block device to a
RAID1 containing the same data while it was mounted.   It isn't too
hard to implement that internally, but making it fit with the sysfs
model is hard.  It requires changeable dependencies again.


So yeah, let's have a discussion and find a good universal interface
which can subsume all the others and provide even more functionality,
but I don't think we can justify using the fact that we haven't
devised such an interface yet as reason to exclude DRBD.

NeilBrown
