linux-kernel.vger.kernel.org archive mirror
From: Philipp Reisner <philipp.reisner@linbit.com>
To: Nikanth K <nikanth@gmail.com>
Cc: linux-kernel@vger.kernel.org, gregkh@suse.de,
	jens.axboe@oracle.com, nab@risingtidestorage.com,
	andi@firstfloor.org, Nikanth Karthikesan <knikanth@suse.de>
Subject: Re: [PATCH 00/12] DRBD: a block device for HA clusters
Date: Tue, 7 Apr 2009 17:56:22 +0200
Message-ID: <200904071756.23914.philipp.reisner@linbit.com>
In-Reply-To: <807b3a220904070523t746ad2abx6a46d30e816eb1d6@mail.gmail.com>

On Tuesday 07 April 2009 14:23:14 Nikanth K wrote:
> Hi Philipp,
>
> On Mon, Mar 30, 2009 at 10:17 PM, Philipp Reisner
> <philipp.reisner@linbit.com> wrote:
> > Hi,
> >
> >  This is a repost of DRBD, to keep you updated about the ongoing
> >  cleanups.
> >
> > Description
> >
> >  DRBD is a shared-nothing, synchronously replicated block device. It
> >  is designed to serve as a building block for high availability
> >  clusters and in this context, is a "drop-in" replacement for shared
> >  storage. Simplistically, you could see it as a network RAID 1.
> >
> >  Each minor device has a role, which can be 'primary' or 'secondary'.
> >  On the node with the primary device the application is supposed to
> >  run and to access the device (/dev/drbdX). Every write is sent to
> >  the local 'lower level block device' and, across the network, to the
> >  node with the device in 'secondary' state.  The secondary device
> >  simply writes the data to its lower level block device.
> >
> >  DRBD can also be used in dual-Primary mode (device writable on both
> >  nodes), which means it can exhibit shared disk semantics in a
> >  shared-nothing cluster.  Needless to say, on top of dual-Primary
> >  DRBD, a cluster file system is necessary to maintain cache
> >  coherency.
> >
> >  This is one of the areas where DRBD differs notably from RAID1 (say
> >  md) stacked on top of NBD or iSCSI. DRBD solves the issue of
> >  concurrent writes to the same on disk location. That is an error of
> >  the layer above us -- it usually indicates a broken lock manager in
> >  a cluster file system --, but DRBD has to ensure that both sides
> >  agree on which write came last, and therefore overwrites the other
> >  write.
>
> So this difference to RAID1+NBD is required only if the DLM of the
> clustered fs is buggy?
>

No, DRBD is much more than RAID1+NBD. I had the impression that by writing
"RAID1+NBD" I could quickly communicate the big picture of what DRBD is.

> >  More background on this can be found in this paper:
> >    http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
> >
> >  Beyond that, DRBD addresses various issues of cluster partitioning,
> >  which the MD/NBD stack, to the best of our knowledge, does not
> >  solve. The above-mentioned paper goes into some detail about that as
> >  well.
>
> It would be nice, if you can list those limitations of NBD/RAID here.
>

Ok. I will give you two simple examples:

1)
Think of a two-node HA cluster. Node A is active ('primary' in DRBD speak),
has the filesystem mounted and the application running. Node B is
in standby mode ('secondary' in DRBD speak).

We lose network connectivity. The primary node continues to run, but the
secondary no longer gets updates.

Then we have a complete power failure, and both nodes are down. Then they
power up the data center again, but at first they get only the power circuit
of node B up and running.

  Should node B offer the service right now?
     (DRBD has configurable policies for that.)

Later on they manage to get node A up and running again. Now let's assume
node B was chosen to be the new primary node. What needs to be done?

 Modifications on B since it became primary need to be resynced to A.
 Modifications on A since it lost contact with B need to be taken out.

DRBD does that. 

How do you fit that into a RAID1+NBD model? NBD is just a block transport;
it does not offer the ability to exchange dirty bitmaps or data generation
identifiers, nor does the RAID1 code have a concept of that.
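
To make the role of the dirty bitmaps in example 1 concrete, here is a
minimal sketch in Python. It is not DRBD code: the function name, the use of
Python sets as "bitmaps" and the four-block devices are all assumptions made
for illustration. It only shows why both bitmaps have to be exchanged: the
blocks to copy from the new primary (B) are those that either node modified
while disconnected.

```python
def resync_after_split(source, target, dirty_on_source, dirty_on_target):
    """Make `target` identical to `source` by copying only the blocks that
    either side modified while the nodes were disconnected.

    Hypothetical helper for illustration; the sets stand in for the
    per-node dirty bitmaps."""
    out_of_sync = sorted(set(dirty_on_source) | set(dirty_on_target))
    for block_nr in out_of_sync:
        # Blocks dirtied on the source carry new data to the target;
        # blocks dirtied only on the target get their old content back.
        target[block_nr] = source[block_nr]
    return out_of_sync


# Both nodes start with identical data: four blocks each.
node_a = [b"v0", b"v0", b"v0", b"v0"]
node_b = list(node_a)

# Node A kept running as primary after the link failed and changed block 1.
node_a[1] = b"a1"
dirty_on_a = {1}

# Node B was later promoted to primary and changed block 3.
node_b[3] = b"b1"
dirty_on_b = {3}

# B is the sync source, A is the sync target.
copied = resync_after_split(node_b, node_a, dirty_on_b, dirty_on_a)
```

Only the blocks in the union of the two sets are transferred. Without A's
set, the stale write to block 1 on A would survive the resync unnoticed.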

2)
When one has to run a resync over a low-bandwidth link, DRBD offers the
option of a "checksum based resync". Similar to rsync, it at first exchanges
only a checksum, and transmits the whole data block only if the checksums
differ.

That again is something that does not fit into the concepts of NBD or RAID1.
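
The idea behind example 2 can be sketched in a few lines of Python. Again
this is an illustration under assumptions, not DRBD's implementation: the
function names are made up, SHA-1 is picked arbitrarily as the digest, and
both "devices" are plain lists held in one process, whereas in reality only
the checksum would cross the wire before the comparison.

```python
import hashlib


def block_checksum(data):
    """Digest of one block; SHA-1 is an arbitrary choice for this sketch."""
    return hashlib.sha1(data).digest()


def checksum_based_resync(source, target, out_of_sync):
    """For every block marked out of sync, compare checksums first and copy
    the full block only when they differ. Returns the block numbers whose
    data actually had to be transferred."""
    transferred = []
    for block_nr in sorted(out_of_sync):
        if block_checksum(target[block_nr]) != block_checksum(source[block_nr]):
            target[block_nr] = source[block_nr]
            transferred.append(block_nr)
    return transferred


source = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
target = [b"aaaa", b"XXXX", b"cccc", b"dddd"]

# Blocks 0, 1 and 2 are marked out of sync, but only block 1 really differs.
sent = checksum_based_resync(source, target, {0, 1, 2})
```

Three blocks are checked, but only one full block is sent. The saving grows
with the ratio of block size to checksum size and with the share of marked
blocks that turn out to be identical.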

I will write down more examples if you think you need more justification
for yet another implementation of RAID in the kernel. DRBD does more, but DRBD
is not suitable for RAID1 on a local box.

PS: Lars Marowsky-Bree requested a GIT tree of the DRBD-for-mainline kernel
    patch. I will set that up by Friday, and maintain the code there for
    the merging process.

Best,
 Philipp
-- 
: Dipl-Ing Philipp Reisner
: LINBIT | Your Way to High Availability
: Tel: +43-1-8178292-50, Fax: +43-1-8178292-82
: http://www.linbit.com

DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.

