From: Lars Ellenberg <lars.ellenberg@linbit.com>
To: Neil Brown <neilb@suse.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>,
	linux-kernel@vger.kernel.org, Jens Axboe <jens.axboe@oracle.com>,
	Greg KH <gregkh@suse.de>,
	James Bottomley <James.Bottomley@HansenPartnership.com>,
	Sam Ravnborg <sam@ravnborg.org>, Dave Jones <davej@redhat.com>,
	Nikanth Karthikesan <knikanth@suse.de>,
	Lars Marowsky-Bree <lmb@suse.de>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>,
	Kyle Moffett <kyle@moffetthome.net>,
	Bart Van Assche <bart.vanassche@gmail.com>
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
Date: Sun, 3 May 2009 10:29:31 +0200
Message-ID: <20090503082931.GD31340@racke>
In-Reply-To: <18941.12645.590037.589600@notabene.brown>

On Sun, May 03, 2009 at 03:53:41PM +1000, Neil Brown wrote:
> I know this is minor, but it bugs me every time I see that phrase
> "shared-nothing". 

> Or maybe "shared-nothing" is an accepted technical term in the
> clustering world??

yes.

> All this should probably be in a patch against Documentation/drbd.txt 

Ok.

> >    1) Think of a two node HA cluster. Node A is active ('primary' in DRBD
> >     speak) has the filesystem mounted and the application running. Node B is
> >     in standby mode ('secondary' in DRBD speak).
> 
> Is there some strong technical reason to only allow 2 nodes?

It "just" has not yet been implemented.
I'm working on that, though.

> >     How do you fit that into a RAID1+NBD model ? NBD is just a block
> >     transport, it does not offer the ability to exchange dirty bitmaps or
> >     data generation identifiers, nor does the RAID1 code has a concept of
> >     that.
> 
> Not 100% true, but I - at least partly -  get your point.
> As md stores bitmaps and data generation identifiers on the block
> device, these can be transferred over NBD just like any other data on
> the block device.

Do you have one dirty bitmap per mirror (yet) ?
Do you _merge_ them?

The "NBD" mirrors are remote, and once you lose communication,
they may be (and in general you have to assume they are) modified
by whichever node they are directly attached to.

> However I think that part of your point is that DRBD can transfer them
> more efficiently (e.g. it compresses the bitmap before transferring it
> -  I assume the compression you use is much more effective than gzip??
> else why bother to code your own).

No, the point was that we have one bitmap per mirror (though currently
the number of mirrors == 2 only), and that we do merge them.
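
To illustrate what "merge" means here, a minimal sketch. This is not DRBD
code; the function name and the list-of-booleans representation are
assumptions made up for illustration. After communication is lost, each
side keeps its own dirty bitmap, and on reconnect the blocks to resync are
the union of both:

```python
def merge_dirty_bitmaps(local, remote):
    """Return the union (logical OR) of two per-mirror dirty bitmaps.

    Hypothetical helper: each bitmap is a list of booleans, one entry
    per block, True meaning "this block was modified while disconnected".
    """
    if len(local) != len(remote):
        raise ValueError("bitmaps must cover the same number of blocks")
    return [a or b for a, b in zip(local, remote)]


# Block 1 changed on one node, block 2 on the other:
# both have to be resynchronized after reconnect.
local_dirty = [False, True, False, False]
remote_dirty = [False, False, True, False]
to_resync = merge_dirty_bitmaps(local_dirty, remote_dirty)
```

A single bitmap kept by only one side cannot express the second kind of
change, which is the point of keeping one per mirror.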

But to answer the question:
why bother to implement our own encoding?
Because we know a lot about the data to be encoded.

We added the compression of the bitmap transfer only very recently.
For a bitmap with large chunks of bits set or unset, it is efficient
to just code the run length.
Using gzip in the kernel would add yet another huge overhead for code
tables and so on.
During testing of this encoding, applying it to an already gzip'ed file
was able to compress it even further, btw.
On plain English text, though, gzip compression is _much_ more effective.
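
A rough sketch of the run-length idea, for illustration only. It does not
reproduce the actual DRBD wire format; the function names and the
(first_bit, runs) representation are assumptions. Because runs in a bitmap
strictly alternate between 0 and 1, it is enough to record the first bit
and then the lengths:

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 values as (first_bit, [run lengths])."""
    if not bits:
        return 0, []
    runs = []
    current = bits[0]
    length = 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            # The run ended: store its length and start the next one.
            runs.append(length)
            current = bit
            length = 1
    runs.append(length)
    return bits[0], runs


def rle_decode(first_bit, runs):
    """Inverse of rle_encode: expand the run lengths back into bits."""
    bits = []
    bit = first_bit
    for length in runs:
        bits.extend([bit] * length)
        bit ^= 1  # runs alternate between 0 and 1
    return bits
```

A bitmap with a million clear bits followed by a handful of set bits
collapses to two numbers this way, without any code tables.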

> You say "nor does the RAID1 code has a concept of that".  It isn't
> clear what you are referring to.

The concept that one of the mirrors (the "nbd" one in that picture)
may have been accessed independently, without MD knowing,
because the node this MD (and its "local" mirror) was living on
suffered a power outage.

The concept of both mirrors being modified _simultaneously_
(e.g. living below a cluster file system).

> >    2) When using DRBD over small bandwidth links, one has to run a resync,
> >     DRBD offers the option to do a "checksum based resync". Similar to rsync
> >     it at first only exchanges a checksum, and transmits the whole data
> >     block only if the checksums differ.
> > 
> >     That again is something that does not fit into the concepts of
> >     NBD or RAID1.
> 
> Interesting idea....  RAID1 does have a mode where it reads both (all)
> devices and compares them to see if they match or not.  Doing this
> compare with checksums rather than memcmp would not be an enormous
> change.
> 
> I'm beginning to imagine an enhanced NBD as a model for what DRBD
> does.  This enhanced NBD not only supports read and write of blocks
> but also:
> 
>    - maintains the local bitmap and sets bits before allowing a write

right.

>    - can return a strong checksum rather than the data of a block

ok.

>    - provides sequence numbers in a way that I don't fully understand
>      yet, but which allows consistent write ordering.

yes, please.

>    - allows reads to be compressed so that the bitmap can be
>      transferred efficiently.

yep.

add to that
     - can exchange data generations on handshake,
     - can refuse the handshake (consistent data,
       but evolved differently than the other copy;
       diverging data sets detected!)
     - is bi-directional, can _push_ writes!

and whatever else I forgot just now.

> I can imagine that md/raid1 could be made to work well with an
> enhanced NBD like this.

of course.
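
For the "return a strong checksum rather than the data" item, a minimal
sketch of a checksum-based resync. Everything here is an assumption for
illustration: SHA-256 as the digest, in-memory lists standing in for the
two devices, and the function names. In a real setup only the digest would
cross the wire, and the block would follow only on a mismatch:

```python
import hashlib


def block_digest(block):
    """Strong checksum of one block (SHA-256 chosen only as an example)."""
    return hashlib.sha256(block).digest()


def checksum_resync(source_blocks, target_blocks):
    """Make target_blocks equal to source_blocks.

    Only blocks whose digests differ are copied. Returns the indices of
    the blocks that had to be transferred in full. Both lists are
    assumed to have the same length.
    """
    transferred = []
    for i, src in enumerate(source_blocks):
        if block_digest(src) != block_digest(target_blocks[i]):
            target_blocks[i] = src
            transferred.append(i)
    return transferred


source = [b"aaaa", b"bbbb", b"cccc"]
target = [b"aaaa", b"XXXX", b"cccc"]
changed = checksum_resync(source, target)
```

On a small-bandwidth link the saving comes from the blocks that turn out
to be identical: for those, only the digest is exchanged.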

> >   DRBD can also be used in dual-Primary mode (device writable on both
> >   nodes), which means it can exhibit shared disk semantics in a
> >   shared-nothing cluster.  Needless to say, on top of dual-Primary
> >   DRBD, utilizing a cluster file system is necessary to maintain
> >   cache coherency.
> > 
> >   More background on this can be found in this paper:
> >     http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> > 
> >   Beyond that, DRBD addresses various issues of cluster partitioning,
> >   which the MD/NBD stack, to the best of our knowledge, does not
> >   solve. The above-mentioned paper goes into some detail about that as
> >   well.
> 
> Agreed - MD/NBD could probably be easily confused by cluster
> partitioning, though I suspect that in many simple cases it would get
> it right.  I haven't given it enough thought to be sure.  I doubt the
> enhancements necessary would be very significant though.

The most significant part is probably the bidirectional nature
and the "refuse it" part of the handshake.
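
A sketch of what a handshake that can "refuse" might look like. This is a
hypothetical model, not DRBD's actual rules: the Node class, the string
generation identifiers and the four result strings are all made up for
illustration. Each node carries a current data generation plus the
generations it went through before:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical per-node state: current generation and its ancestors."""
    current: str
    history: list = field(default_factory=list)


def handshake(local, peer):
    """Decide what to do when two nodes reconnect."""
    if local.current == peer.current:
        return "in-sync"
    if peer.current in local.history:
        # The peer holds an older generation of our data.
        return "local-is-source"
    if local.current in peer.history:
        # We hold an older generation of the peer's data.
        return "peer-is-source"
    # Each side is consistent, but neither descends from the other:
    # diverging data sets detected, so refuse to connect.
    return "refuse"
```

With Node("g3", ["g2", "g1"]) and Node("g2", ["g1"]) the first node
becomes the sync source. With Node("g3a", ["g2"]) and Node("g3b", ["g2"])
both sides evolved independently from "g2", and the handshake is refused
instead of silently picking a winner.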

> >   DRBD can operate in synchronous mode, or in asynchronous mode. I want
> >   to point out that we guarantee not to violate a single possible write
> >   after write dependency when writing on the standby node. More on that
> >   can be found in this paper:
> >     http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf
> 
> I really must read and understand this paper..
> 
> 
> So... what would you think of working towards incorporating all of the
> DRBD functionality into md/raid1??
> I suspect that it would be a mutually beneficial exercise, except for
> the small fact that it would take a significant amount of time and
> effort.  I'd be willing to shuffle some priorities and put in some effort
> if it was a direction that you would be open to exploring.

Sure. But yes, full ack on the time and effort part ;)

> Whether the current DRBD code gets merged or not is possibly a
> separate question, though I would hope that if we followed the path of
> merging DRBD into md/raid1, then any duplicate code would eventually be
> excised from the kernel.

Rumor [http://lwn.net/Articles/326818/] has it that the various
in-kernel RAID implementations are being unified right now anyway?

If you want to stick to "replication is almost identical to RAID1",
best not to forget that this may be a remote mirror, that there may be
more than one entity accessing it, and that it may be part of a
bi-directional (active-active) replication setup.

For further ideas on what could be done with replication (enhancing the
strict "raid1" notion), see also
http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf

 - time shift replication
 - generic point in time recovery of block device data
 - (remote) backup by periodic, round-robin re-sync of
   "raid" members, then "dropping" them again.
 ...

No usable code for those ideas yet,
but a lot of thought. It is not all handwaving.

	Lars
