From: Sage Weil <sage@newdream.net>
To: Igor Fedotov <ifedotov@mirantis.com>
Cc: allen.samuels@sandisk.com, ceph-devel@vger.kernel.org
Subject: Re: 2 related bluestore questions
Date: Wed, 11 May 2016 09:57:03 -0400 (EDT)
Message-ID: <alpine.DEB.2.11.1605110951570.15518@cpach.fuggernut.com>
In-Reply-To: <8b077a20-ace3-7824-4039-7b8e9adf88ce@mirantis.com>

On Wed, 11 May 2016, Igor Fedotov wrote:
> On 11.05.2016 16:10, Sage Weil wrote:
> > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > > I took a stab at a revised wal_op_t here:
> > > > 
> > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
> > > > 
> > > > This is enough to implement the basic wal overwrite case here:
> > > > 
> > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
> > > > 
> > > > It's overkill for that, but something like this ought to be sufficiently
> > > > general to express the more complicated wal (and compaction/gc/cleanup)
> > > > operations, where we are reading bits of data from lots of different
> > > > previous blobs, verifying checksums, and then assembling the results into
> > > > a new buffer that gets written somewhere else.  The read_extent_map and
> > > > write_map offsets are logical offsets in a buffer we assemble and then
> > > > write to b_off~b_len in the specific blob.  I didn't get to the _do_wal_op
> > > > part that actually does it, but it would do the final write, csum
> > > > calculation, and metadata update.  Probably... the allocation would happen
> > > > then too, if the specified blob didn't already have pextents.  That way
> > > > we can do compression at that stage as well?
> > > > 
> > > > What do you think?
> > > Not completely sure that it's a good idea to have the read stage description
> > > stored in the WAL record. Wouldn't that produce conflicts/inconsistencies when
> > > multiple WAL records deal with the same or nearby lextents and a previous WAL
> > > op updates the lextents to be read? Maybe it's better to prepare such a
> > > description exactly when the WAL is applied, and have the WAL record carry
> > > just basic write info?
> > Yeah, I think this is a problem.  I see two basic paths:
> > 
> >   - We do a wal flush before queueing a new wal event to avoid races like
> > this. Or perhaps we only do it when the wal event(s) touch the same
> > blob(s).  That's simple to reason about, but means that a series
> > of small IOs to the same object (or blob) will serialize the kv commit and
> > wal r/m/w operations.  (Note that this is no worse than the naive approach
> > of doing the read part up front, and it only happens when you have
> > successive wal ops on the same object (or blob)).
> > 
> >   - We describe the wal read-side in terms of the current onode state.  For
> > example, 'read object offset 0..100, use provided buffer for 100..4096,
> > overwrite block'.  That can be pipelined.  But there are other
> > operations that would require we flush the wal events, like a truncate or
> > zero or other write that clobbers that region of the object.
> > Maybe/hopefully in those cases we don't care (it no longer matters that
> > this wal event do the write we originally intended) but we'd need
> > to think pretty carefully about it.  FWIW, truncate already does an
> > o->flush().
> I'd prefer the second approach. Probably with some modification...
> As far as I understand, with the approach above you are trying to put all
> write logic in a single place and have the WAL machinery be a straightforward
> executor for already prepared tasks. Not sure this is beneficial enough. But
> it's definitely more complex and error-prone. And potentially you will need to
> extend the WAL machinery's task description from time to time...
> As an alternative, one can eliminate the read description from the WAL record
> altogether. Let's simply record what loffset we are going to write to and the
> data itself. Thus we have a simple write request description.
> And when the WAL is applied, the corresponding code should determine how to do
> the write properly using the current state of the lextent/blob maps. This way
> applying a write op can be just regular write handling that performs sync RMW
> or any other implementation depending on the current state, some policy, or
> whatever else fits best at that moment.

I like that way better!  We can just add a force_sync argument to 
_do_write.  That also lets us trivially disable wal (by forcing sync w/ 
a config option or whatever).
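
To make that concrete, here is a rough sketch of the shape I have in mind.
It is a toy model, not BlueStore code: SimpleWalWrite, ToyObject, do_write
and flush_wal are made-up names, the "object" is just a string, and the
real _do_write takes very different arguments.  The wal record carries
only the logical offset and the data, and replaying it goes through the
same write path with force_sync set.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; none of these are real BlueStore types.
struct SimpleWalWrite {
  uint64_t loffset;   // logical object offset to write to
  std::string data;   // payload; no read-side description is stored
};

struct ToyObject {
  std::string content;               // stands in for the object's data
  std::vector<SimpleWalWrite> wal;   // queued (deferred) writes, in order
};

inline void flush_wal(ToyObject& o);

// The "regular" write path: decide how to do the write from current state.
// Here that is trivial: grow the object if needed and overwrite in place.
inline void apply_write(ToyObject& o, uint64_t off, const std::string& data) {
  if (o.content.size() < off + data.size())
    o.content.resize(off + data.size(), '\0');
  o.content.replace(off, data.size(), data);
}

// force_sync=true does the write immediately; otherwise it is queued.
// Passing force_sync=true everywhere would effectively disable the wal.
inline void do_write(ToyObject& o, uint64_t off, const std::string& data,
                     bool force_sync) {
  if (force_sync) {
    if (!o.wal.empty())
      flush_wal(o);                  // keep ordering with earlier queued writes
    apply_write(o, off, data);
  } else {
    o.wal.push_back({off, data});
  }
}

// Replay queued writes through the same write path, forcing sync.
inline void flush_wal(ToyObject& o) {
  std::vector<SimpleWalWrite> pending;
  pending.swap(o.wal);               // o.wal is empty while we replay
  for (const auto& w : pending)
    do_write(o, w.loffset, w.data, true);
  assert(o.wal.empty());             // replay must not requeue anything
}
```

Usage would be something like do_write(o, 2, "XY", false) to queue a
small overwrite, then flush_wal(o) later to apply it against whatever
the object looks like at that point.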

The downside is that any logically conflicting request (an overlapping 
write or truncate or zero) needs to drain the wal events, whereas with a 
lower-level wal description there might be cases where we can ignore the 
wal operation.  I suspect the trivial solution of o->flush() on 
write/truncate/zero will be pretty visible in benchmarks.  Tracking 
in-flight wal ops with an interval_set would probably work well enough.
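
A minimal sketch of that tracking, under some loud assumptions: this is a
toy class built on std::map, not Ceph's interval_set, and
InflightWalTracker and needs_drain are hypothetical names.  It also
assumes the whole set is cleared once the wal has been drained, rather
than removing ranges one by one as individual ops complete.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <iterator>
#include <map>

// Toy tracker of byte ranges covered by in-flight wal ops (hypothetical).
class InflightWalTracker {
  // start -> end (exclusive); ranges are kept non-overlapping and merged
  std::map<uint64_t, uint64_t> ranges_;

public:
  // Record that a queued wal op covers [off, off+len).
  void add(uint64_t off, uint64_t len) {
    assert(len > 0);
    uint64_t start = off, end = off + len;
    auto it = ranges_.lower_bound(start);
    // Step back if the previous range touches or overlaps the new one.
    if (it != ranges_.begin()) {
      auto prev = std::prev(it);
      if (prev->second >= start)
        it = prev;
    }
    // Absorb every range that touches or overlaps [start, end).
    while (it != ranges_.end() && it->first <= end) {
      start = std::min(start, it->first);
      end = std::max(end, it->second);
      it = ranges_.erase(it);
    }
    ranges_[start] = end;
  }

  // True if [off, off+len) overlaps any tracked range.
  bool intersects(uint64_t off, uint64_t len) const {
    if (len == 0)
      return false;
    const uint64_t end = off + len;
    auto it = ranges_.upper_bound(off);   // first range starting after off
    if (it != ranges_.begin() && std::prev(it)->second > off)
      return true;
    return it != ranges_.end() && it->first < end;
  }

  void clear() { ranges_.clear(); }
  std::size_t size() const { return ranges_.size(); }
};

// A write/truncate/zero on [off, off+len) only has to drain the wal when
// it actually overlaps something in flight.
inline bool needs_drain(const InflightWalTracker& t, uint64_t off,
                        uint64_t len) {
  return t.intersects(off, len);
}
```

The point is just that a non-overlapping write/truncate/zero would skip
the flush entirely, and only a real conflict pays for the drain.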

sage
