linux-kernel.vger.kernel.org archive mirror
From: Lars Ellenberg <lars.ellenberg@linbit.com>
To: Neil Brown <neilb@suse.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>,
	linux-kernel@vger.kernel.org, Jens Axboe <jens.axboe@oracle.com>,
	Greg KH <gregkh@suse.de>,
	James Bottomley <James.Bottomley@HansenPartnership.com>,
	Sam Ravnborg <sam@ravnborg.org>, Dave Jones <davej@redhat.com>,
	Nikanth Karthikesan <knikanth@suse.de>,
	Lars Marowsky-Bree <lmb@suse.de>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>,
	Kyle Moffett <kyle@moffetthome.net>,
	Bart Van Assche <bart.vanassche@gmail.com>
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
Date: Sun, 3 May 2009 23:32:31 +0200
Message-ID: <20090503213231.GA6243@racke>
In-Reply-To: <18941.31069.695554.862567@notabene.brown>

On Sun, May 03, 2009 at 09:00:45PM +1000, Neil Brown wrote:
> On Sunday May 3, lars.ellenberg@linbit.com wrote:
> > > Is there some strong technical reason to only allow 2 nodes?
> > 
> > It "just" has not yet been implemented.
> > I'm working on that, though.
> 
> :-)
> 
> > 
> > > >     How do you fit that into a RAID1+NBD model ? NBD is just a block
> > > >     transport, it does not offer the ability to exchange dirty bitmaps or
> > > > >     data generation identifiers, nor does the RAID1 code have a concept of
> > > >     that.
> > > 
> > > Not 100% true, but I - at least partly -  get your point.
> > > As md stores bitmaps and data generation identifiers on the block
> > > device, these can be transferred over NBD just like any other data on
> > > the block device.
> > 
> > Do you have one dirty bitmap per mirror (yet) ?
> > Do you _merge_ them?
> 
> md doesn't merge bitmaps yet.  However if I found a need to, I would
> simply read a bitmap in userspace and feed it into the kernel via
> 	/sys/block/mdX/md/bitmap_set_bits

ah, ok.  right.  that would do it.

> We sort-of have one bitmap per mirror, but only because the one bitmap
> is mirrored...

Which it could not be while the replication link is down,
so once the replication link is back (or the remote node is back,
which is not easily distinguishable at that point, and so on),
you'd need to fetch the remote bitmap, and merge it with the local
bitmap (feeding it into bitmap_set_bits),
then re-attach the "failed" mirror.

The reasoning in the commit 9b1d1dac181d8c1b9492e05cee660a985d035a06,
which adds that feature, exactly describes this use case.

There, again, our simple run-length encoding scheme makes a lot of
sense, as the numbers dropping out of it during decoding are exactly
the run lengths, and could be fed into this almost directly.
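As a rough userspace sketch of that merge step: scan a fetched remote
bitmap and format each run of set bits as text that could then be written
to bitmap_set_bits.  Setting those bits in the local bitmap amounts to a
bitwise OR of the two bitmaps.  Everything here is an assumption for
illustration: the helper names are hypothetical, the bitmap is taken to be
a byte array with LSB-first bit order, both sides are assumed to use the
same bitmap chunk size, and the "start-end" pair syntax is how I recall
the md sysfs documentation describing the attribute.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Test bit number 'bit' in a byte-array bitmap (LSB-first order assumed). */
static int test_bit_in(const uint8_t *bm, size_t bit)
{
	return (bm[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Hypothetical helper: scan 'nbits' bits of 'bm' and write every run of
 * set bits to 'out' as "start-end" (or just "start" for a single bit),
 * separated by spaces.  Returns the number of characters written
 * (excluding the terminating NUL), or -1 if 'out' is too small.
 */
static int format_set_ranges(const uint8_t *bm, size_t nbits,
			     char *out, size_t outlen)
{
	size_t pos = 0, bit = 0;

	if (outlen == 0)
		return -1;
	out[0] = '\0';

	while (bit < nbits) {
		size_t start, end;
		int n;

		if (!test_bit_in(bm, bit)) {
			bit++;
			continue;
		}
		/* Found a set bit: extend the run as far as it goes. */
		start = bit;
		while (bit < nbits && test_bit_in(bm, bit))
			bit++;
		end = bit - 1;

		if (start == end)
			n = snprintf(out + pos, outlen - pos, "%s%zu",
				     pos ? " " : "", start);
		else
			n = snprintf(out + pos, outlen - pos, "%s%zu-%zu",
				     pos ? " " : "", start, end);
		if (n < 0 || (size_t)n >= outlen - pos)
			return -1;
		pos += (size_t)n;
	}
	return (int)pos;
}
```

The resulting string would be written to the sysfs attribute before
re-attaching the "failed" mirror.  A run-length decoder yields start and
length directly, so the scan loop above would not even be needed in that
case.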

> > the "NBD" mirrors are remote, and once you lose communication,
> > they may be (and in general, you have to assume they are) modified
> > by whichever node they are directly attached to.
> > 
> > > However I think that part of your point is that DRBD can transfer them
> > > more efficiently (e.g. it compresses the bitmap before transferring it
> > > -  I assume the compression you use is much more effective than gzip??
> > > else why bother to code your own).
> > 
> > No, the point was that we have one bitmap per mirror (though currently
> > number of mirrors == 2, only), and that we do merge them.
> 
> Right.  I imagine much of the complexity of that could be handled in
> user-space while setting up a DRBD instance (??).

possibly.
you'd need to involve these steps on each and every communication loss
and network handshake.  I think that would make the system slower to
react to e.g. "flaky" replication links.

you are thinking in the "MD" paradigm: at any point in time, there is
only one MD instance involved, the mirror transports (currently dumb
block devices) simply do what they are told.

in DRBD, we have multiple (ok, two) instances talking to each other,
and I think that is the better approach for (remote) replication.

> > but to answer the question:
> > why bother to implement our own encoding?
> > because we know a lot about the data to be encoded.
> > 
> > the compression of the bitmap transfer we just added very recently.
> > for a bitmap, with large chunks of bits set or unset, it is efficient
> > to just code the runlength.
> > to use gzip in kernel would add yet another huge overhead for code
> > tables and so on.
> > during testing of this encoding, applying it to an already gzip'ed file
> > was able to compress it even further, btw.
> > though on English plain text, gzip compression is _much_ more effective.
> 
> I just tried a little experiment.
> I created a 128meg file and randomly set 1000 bits in it.
> I compressed it with "gzip --best" and the result was 4Meg.  Not
> particularly impressive.
> I then tried to compress it with bzip2 and got 3452 bytes.
> Now *that* is impressive.  I suspect your encoding might do a little
> better, but I wonder if it is worth the effort.

The effort is minimal.
The cpu overhead is negligible (compared with bzip2, or any other
generic compression scheme), and the memory overhead is next to none
(just a small scratch buffer, to assemble the network packet).
No tables or anything involved.
Especially the _decoding_ part has this nice property:
	chunk = 0;
	while (!eof) {
		vli_decode_bits(&rl, input); /* number of unset bits */
		chunk += rl;
		vli_decode_bits(&rl, input); /* number of set bits */
		bitmap_dirty_bits(bitmap, chunk, chunk + rl);
		chunk += rl;
	}

The source code is there.
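
For readers who want something compilable to play with, here is a
minimal userspace sketch of the same decoding loop.  It is not the DRBD
code: it assumes the variable-length integers have already been decoded
into a plain array of run lengths, that the first run always describes
clear bits, and that the bitmap is a byte array with LSB-first bit order.
The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Set bits [first, first + count) in a byte-array bitmap (LSB-first assumed). */
static void set_bit_range(uint8_t *bm, uint64_t first, uint64_t count)
{
	for (uint64_t i = first; i < first + count; i++)
		bm[i / 8] |= (uint8_t)(1u << (i % 8));
}

/*
 * Decode alternating run lengths into 'bm', which holds 'nbits' bits.
 * runs[0] is a run of clear bits, runs[1] a run of set bits, and so on.
 * Bits that are already set in 'bm' stay set, so decoding into a
 * non-empty bitmap merges the two.
 * Returns the number of bits covered, or -1 if the runs overflow the bitmap.
 */
static int64_t rle_decode(const uint64_t *runs, size_t nruns,
			  uint8_t *bm, uint64_t nbits)
{
	uint64_t chunk = 0;
	int set = 0;	/* the first run describes clear bits */

	for (size_t i = 0; i < nruns; i++) {
		uint64_t rl = runs[i];

		if (rl > nbits - chunk)
			return -1;
		if (set)
			set_bit_range(bm, chunk, rl);
		chunk += rl;
		set = !set;
	}
	return (int64_t)chunk;
}
```

No tables and no scratch state beyond a position and a toggle are needed,
which is the property the loop above is meant to illustrate.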

For your example, on average you'd have (128 << 23) / 1000 "clear" bits,
then one set bit. The encoding transfers
"first bit unset -- ca. (1<<20), 1, ca. (1<<20), 1, ca. (1<<20), 1, ...",
using 2 bits for the "1", and up to 29 bits for the "ca. 1<<20".
That should be in the very same ballpark as your bzip2 result.
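
The back-of-the-envelope numbers can be checked with a few lines of C.
This is only a sketch of the estimate, assuming the worst case of 29 bits
for every clear run and 2 bits for every single set bit, and ignoring the
small header that says whether the first bit is set.  The function names
are made up for illustration.

```c
#include <stdint.h>

/*
 * Average length of a run of clear bits when 'set_bits' bits are set at
 * random in a file of 'file_bytes' bytes (the set bits themselves are
 * ignored, as in the estimate above).
 */
static uint64_t avg_clear_run(uint64_t file_bytes, uint64_t set_bits)
{
	return (file_bytes * 8) / set_bits;
}

/*
 * Upper bound, in bytes (rounded up), for encoding 'set_bits' pairs of
 * (clear run, set run) when a clear run costs at most 'clear_code_bits'
 * and a set run costs 'set_code_bits'.
 */
static uint64_t rle_size_bound_bytes(uint64_t set_bits,
				     unsigned clear_code_bits,
				     unsigned set_code_bits)
{
	uint64_t bits = set_bits * (clear_code_bits + set_code_bits);

	return (bits + 7) / 8;
}
```

With 128 MiB and 1000 set bits, avg_clear_run() gives 1073741, which is
roughly 1<<20, and rle_size_bound_bytes(1000, 29, 2) gives 3875 bytes as
an upper bound, against the 3452 bytes reported for bzip2.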

> I'm not certain that my test file is entirely realistic, but it is
> still an interesting experiment.

It is not ;) but still...
If you are interested, I can dig up my throwaway userland code
that has been used to evaluate various such schemes.
But it is so ugly that I won't post it to lkml.

> Why do you do this compression in the kernel?  It seems to me that it
> would be quite practical to do it all in user-space, thus making it
> really easy to use pre-existing libraries.

Because the bitmap exchange happens in kernel.

If one were to rewrite a replication solution from scratch,
one could reconsider such design choices.

But DRBD as of now does the connection handshake and bitmap exchange in
kernel.  We wanted to have a fast compression scheme suitable for
bitmaps, without cpu or memory overhead.  This does it quite nicely.

I can dig up my userland throwaway code used during evaluation
of various encoding schemes again, if you are interested.

> BTW, the kernel already contains various compression code as part of
> the crypto API.

Of course I know.  But you are not really suggesting that I should do
bzip2 in kernel to exchange the bitmap?  And on decoding, I want those
run lengths, not the actual plain bitmap.

> > > You say "nor does the RAID1 code have a concept of that".  It isn't
> > > clear what you are referring to.
> > 
> > The concept that one of the mirrors (the "nbd" one in that picture)
> > may have been accessed independently, without MD knowing,
> > because the node this MD (and its "local" mirror) was living on
> > suffered a power outage.

or the link has been down,
and the remote side decided to go active with it.

or the link has been taken down,
to activate the other side, knowingly creating a data set divergence,
to do some off-site processing.

> > The concept of both mirrors being modified _simultaneously_,
> > (e.g. living below a cluster file system).
> 
> Yes, that is an important concept.  Certainly one of the bits that
> would need to be added to md.
> 
> > > Whether the current DRBD code gets merged or not is possibly a
> > > separate question, though I would hope that if we followed the path of
> > > merging DRBD into md/raid1, then any duplicate code would eventually be
> > > excised from the kernel.
> > 
> > Rumor [http://lwn.net/Articles/326818/] has it, that the various in
> > kernel raid implementations are being unified right now, anyways?
> 
> I'm not holding my breath on that one...  
> I think that merging DRBD with md/raid1 would be significantly easier
> than any sort of merge between md and dm.  But (in either case) I'll
> do what I can to assist any effort that is technically sound.

D'accord.

> > If you want to stick to "replication is almost identical to RAID1",
> > best not to forget "this may be a remote mirror", there may be more than
> > one entity accessing it, this may be part of a bi-directional
> > (active-active) replication setup.
> > 
> > For further ideas on what could be done with replication (enhancing the
> > strict "raid1" notion), see also
> > http://www.drbd.org/fileadmin/drbd/publications/drbd9.linux-kongress.2008.pdf
> > 
> >  - time shift replication
> >  - generic point in time recovery of block device data
> >  - (remote) backup by periodically, round-robin re-sync of
> >    "raid" members, then "dropping" them again.
> >  ...
> > 
> > No useable code on those ideas, yet,
> > but a lot of thought. It is not all handwaving.
> 
> :-)
> 
> I'll have to do a bit of reading I see.  I'll then try to rough out a
> design and plan for merging DRBD functionality with md/raid1.  At the
> very least that would give me enough background understanding to be
> able to sensibly review your code submission.

Thanks.  Please give particular attention to the "taxonomy paper"
referenced therein, so we are going to use the same terms.

	Lars
