From: Allen Samuels <Allen.Samuels@sandisk.com>
To: Sage Weil <sage@newdream.net>, Igor Fedotov <ifedotov@mirantis.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: RE: 2 related bluestore questions
Date: Wed, 11 May 2016 21:38:04 +0000
Message-ID: <BLUPR0201MB152437E90BED4AE3CA22310EE8720@BLUPR0201MB1524.namprd02.prod.outlook.com>
In-Reply-To: <alpine.DEB.2.11.1605111636390.15518@cpach.fuggernut.com>

Sorry, still on vacation and I haven't really wrapped my head around everything that's being discussed. However, w.r.t. wal operations, I would strongly favor an approach that minimizes the amount of "future" operations that are recorded (which I'll call intentions -- i.e., non-binding hints about extra work that needs to get done). Much of the complexity here arises because the intentions, after being recorded, will need to be altered based on subsequent operations. Hence every write operation will need to digest the historical intentions and potentially update them -- this is VERY complex, potentially much more complex than code that simply examines the current state and re-determines the correct next operation (i.e., de-wal, gc, etc.).
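To make the contrast concrete, a purely hypothetical sketch (nothing below is BlueStore code; all types are made up):

#include <cstdint>
#include <vector>

struct Write  { uint64_t off = 0, len = 0; };
struct Intent { uint64_t off = 0, len = 0; };   // recorded "future work"

// Intention-based: every new write must first digest and possibly
// rewrite the recorded intentions that overlap it -- two sets of state
// to keep consistent.
void write_with_intentions(const Write& w, std::vector<Intent>& log) {
  for (auto& i : log) {
    bool overlaps = w.off < i.off + i.len && i.off < w.off + w.len;
    if (overlaps) { i.off = w.off; i.len = w.len; }  // mutate the plan
  }
  log.push_back({w.off, w.len});   // record new future work
  // ... then apply the write itself ...
}

// State-based: record nothing extra; a housekeeping pass simply
// re-derives the correct next operation (de-wal, gc, etc.) from the
// current state alone.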

Additional complexity arises because you're recording two sets of state that require consistency checking -- in my experience, this road leads to perdition....

Maybe I've missed something critical (vacation-headed); if so, please ignore.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, May 12, 2016 6:54 AM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-
> devel@vger.kernel.org
> Subject: Re: 2 related bluestore questions
> 
> On Wed, 11 May 2016, Sage Weil wrote:
> > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > On 11.05.2016 16:10, Sage Weil wrote:
> > > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > > > > I took a stab at a revised wal_op_t here:
> > > > > >
> > > > > >
> > > > > > https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/
> > > > > > os/bluestore/bluestore_types.h#L595-L605
> > > > > >
> > > > > > This is enough to implement the basic wal overwrite case here:
> > > > > >
> > > > > >
> > > > > > https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/
> > > > > > os/bluestore/BlueStore.cc#L5522-L5578
> > > > > >
> > > > > > It's overkill for that, but something like this ought to be
> > > > > > sufficiently general to express the more complicated wal (and
> > > > > > compaction/gc/cleanup) operations, where we are reading bits
> > > > > > of data from lots of different previous blobs, verifying
> > > > > > checksums, and then assembling the results into a new buffer
> > > > > > that gets written somewhere else.  The read_extent_map and
> > > > > > write_map offsets are logical offsets in a buffer we assemble
> > > > > > and then write to b_off~b_len in the specific blob.  I didn't
> > > > > > get to the _do_wal_op part that actually does it, but it would
> > > > > > do the final write, csum calculation, and metadata update.
> > > > > > Probably... the allocation would happen then too, if the
> > > > > > specified blob didn't already have pextents.  That way we can
> > > > > > do compression at that stage as well?
> > > > > >
> > > > > > What do you think?
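For concreteness, the shape being described is roughly the following (a sketch with guessed field names -- the real definition is at the bluestore_types.h link above):

#include <cstdint>
#include <map>
#include <vector>

struct wal_op_sketch_t {
  // source extent in a previous blob, read (and csum-verified) into the
  // assembly buffer at the map key's logical offset
  struct src_extent { int64_t blob; uint64_t b_off, b_len; };
  std::map<uint64_t, src_extent> read_extent_map;

  // logical offset in the assembly buffer -> new data spliced in
  std::map<uint64_t, std::vector<uint8_t>> write_map;

  // the assembled buffer is finally written to b_off~b_len of this blob
  // (allocated at _do_wal_op time if it has no pextents yet)
  int64_t  blob = -1;
  uint64_t b_off = 0, b_len = 0;
};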
> > > > > I'm not completely sure it's a good idea to have the read-stage
> > > > > description stored in the WAL record. Wouldn't that produce
> > > > > conflicts/inconsistencies when multiple WAL records deal with
> > > > > the same or nearby lextents, and a previous WAL update changes
> > > > > the lextents to read?  Maybe it's better to prepare such a
> > > > > description exactly when the WAL is applied, and have the WAL
> > > > > record carry just the basic write info?
> > > > Yeah, I think this is a problem.  I see two basic paths:
> > > >
> > > >   - We do a wal flush before queueing a new wal event to avoid
> > > > races like this. Or perhaps we only do it when the wal event(s)
> > > > touch the same blob(s).  That's simple to reason about, but means
> > > > that a series of small IOs to the same object (or blob) will
> > > > serialize the kv commit and wal r/m/w operations.  (Note that this
> > > > is no worse than the naive approach of doing the read part up
> > > > front, and it only happens when you have successive wal ops on
> > > > the same object (or blob)).
> > > >
> > > >   - We describe the wal read-side in terms of the current onode
> > > > state.  For example, 'read object offset 0..100, use provided
> > > > buffer for 100..4096, overwrite block'.  That can be pipelined.
> > > > But there are other operations that would require we flush the
> > > > wal events, like a truncate or zero or other write that clobbers
> > > > that region of the object.
> > > > Maybe/hopefully in those cases we don't care (it no longer matters
> > > > that this wal event do the write we originally intended) but we'd
> > > > need to think pretty carefully about it.  FWIW, truncate already
> > > > does an
> > > > o->flush().
> > > I'd prefer the second approach, probably with some modification...
> > > As far as I understand, with the approach above you are trying to
> > > locate all the write logic in a single place and have the WAL
> > > machinery act as a straightforward executor for already-prepared
> > > tasks. I'm not sure that's beneficial enough, and it's definitely
> > > more complex and error-prone. And you will potentially need to
> > > extend the WAL machinery's task description from time to time...
> > > As an alternative, one could eliminate the read description from
> > > the WAL record entirely. Let's simply record what loffset we are
> > > going to write to, and the data itself; thus we have a simple write
> > > request description. When the WAL is applied, the corresponding
> > > code should determine how to do the write properly using the
> > > current lextent/blob map state. This way a WAL op apply can be just
> > > regular write handling that performs a sync RMW or any other
> > > implementation, depending on the current state, some policy, or
> > > whatever else fits best at that specific moment.
> >
> > I like that way better!  We can just add a force_sync argument to
> > _do_write.  That also lets us trivially disable wal (by forcing sync
> > w/ a config option or whatever).
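A minimal sketch of what that record could reduce to under this scheme (names are hypothetical, not the actual BlueStore interfaces):

#include <cstdint>
#include <vector>

// The wal record carries only the logical write -- no read-side plan.
struct wal_logical_write_t {
  uint64_t loffset;            // logical offset within the object
  std::vector<uint8_t> data;   // the data itself; nothing else recorded
};

// Replay then just re-enters the normal write path synchronously, so it
// is evaluated against the *current* lextent/blob state, e.g.:
//   _do_write(txc, o, rec.loffset, rec.data.size(), rec.data,
//             /*force_sync=*/true);
// Forcing sync unconditionally (via a config option) disables wal.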
> >
> > The downside is that any logically conflicting request (an overlapping
> > write or truncate or zero) needs to drain the wal events, whereas with
> > a lower-level wal description there might be cases where we can ignore
> > the wal operation.  I suspect the trivial solution of o->flush() on
> > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > in-flight wal ops with an interval_set would probably work well enough.
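Something like the following could work for that tracking (a sketch using Ceph's interval_set from include/interval_set.h; the struct and method names are hypothetical):

#include "include/interval_set.h"
#include <cstdint>

struct wal_range_tracker_t {
  interval_set<uint64_t> inflight;   // logical ranges with queued wal ops

  void queued(uint64_t off, uint64_t len)   { inflight.insert(off, len); }
  void finished(uint64_t off, uint64_t len) { inflight.erase(off, len); }

  // A new write/truncate/zero only needs to drain if it overlaps a
  // queued wal op, instead of an unconditional o->flush().
  bool must_drain(uint64_t off, uint64_t len) const {
    return inflight.intersects(off, len);
  }
};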
> 
> Hmm, I'm not sure this will pan out.  The main problem is that if we call back
> into the write code (with a sync flag), we will have to do write IO, and this
> wreaks havoc on our otherwise (mostly) orderly state machine.
> I think it can be done if we build in a guard similar to _txc_finish_io, so
> that we wait for the wal events to also complete IO in order before
> committing them.  I think.
> 
> But the other problem is the checksum thing that came up in another thread,
> where the read-side of a read/modify/write might fail the checksum because
> the wal write hit disk but the kv portion didn't commit.  I see a few options:
> 
>  1) If there are checksums and we're doing a sub-block overwrite, we have to
> write/cow it elsewhere.  This probably means min_alloc_size cow operations
> for small writes.  In which case we needn't bother doing a wal event in the
> first place--the whole point is to enable an overwrite.
> 
>  2) We do loose checksum validation that will accept either the old checksum
> or the expected new checksum for the read stage.  This handles these two
> crash cases:
> 
>  * kv commit of op + wal event
>    <crash here, or>
>  * do wal io (completely)
>    <crash before cleaning up wal event>
>  * kv cleanup of wal event
> 
> but not the case where we only partially complete the wal io.  Which means
> there is a small probability we "corrupt" ourselves on crash (not really
> corrupt, but confuse ourselves such that we refuse to replay the wal events
> on startup).
> 
>  3) Same as 2, but simply warn if we fail that read-side checksum on replay.
> This basically introduces a *very* small window which could allow an ondisk
> corruption to get absorbed into our checksum.  This could just be #2 + a
> config option so we warn instead of erroring out.
> 
>  4) Same as 2, but we try every combination of old and new data on
> block/sector boundaries to find a valid checksum on the read-side.
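Sketching options 2-4's read-side validation at replay time (csum() is a stand-in for whatever checksum the blob uses; all names are hypothetical):

#include <algorithm>
#include <cstdint>
#include <vector>

using buf = std::vector<uint8_t>;
uint32_t csum(const buf& b);   // placeholder for the blob's checksum fn

// Options 2/3: loose validation -- accept either the pre-wal or the
// expected post-wal checksum (for #3, warn rather than error when both
// fail).
bool loose_verify(const buf& got, uint32_t old_csum, uint32_t new_csum) {
  uint32_t c = csum(got);
  return c == old_csum || c == new_csum;
}

// Option 4: if both fail, test whether the chunk is explainable as a
// partially completed wal write.  Only the sectors the wal op overwrites
// can differ between old and new state, so splice the wal event's data
// into every subset of those sectors and test against the new checksum;
// a match means the remaining sectors simply weren't written yet and the
// wal io can safely be redone.  (2^n combinations -- small for a 4K csum
// chunk of 512-byte sectors.)
bool explain_partial(const buf& disk, const buf& wal_new,
                     uint64_t write_off, uint32_t new_csum,
                     uint64_t sector = 512) {
  size_t n = wal_new.size() / sector;
  for (uint64_t mask = 0; mask < (1ull << n); ++mask) {
    buf cand = disk;
    for (size_t s = 0; s < n; ++s)
      if ((mask >> s) & 1)
        std::copy(wal_new.begin() + s * sector,
                  wal_new.begin() + (s + 1) * sector,
                  cand.begin() + write_off + s * sector);
    if (csum(cand) == new_csum)
      return true;
  }
  return false;
}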
> 
> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on HDD,
> which would risk very bad fragmentation.
> 
> Which makes me want #3 (initially) and then #4.  But... if we do the "wal is
> just a logical write" approach, that means this weird replay handling logic
> creeps into the normal write path.
> 
> I'm currently leaning toward keeping the wal events special (lower-level), but
> doing what we can to make it work with the same mid- to low-level helper
> functions (for reading and verifying blobs, etc.).
> 
> sage

Thread overview: 28+ messages
2016-05-09 18:31 2 related bluestore questions Sage Weil
2016-05-10 12:17 ` Igor Fedotov
2016-05-10 12:53   ` Sage Weil
2016-05-10 14:41     ` Igor Fedotov
2016-05-10 15:39       ` Sage Weil
2016-05-11  1:10         ` Sage Weil
2016-05-11 12:11           ` Igor Fedotov
2016-05-11 13:10             ` Sage Weil
2016-05-11 13:45               ` Igor Fedotov
2016-05-11 13:57                 ` Sage Weil
2016-05-11 20:54                   ` Sage Weil
2016-05-11 21:38                     ` Allen Samuels [this message]
2016-05-12  2:58                       ` Sage Weil
2016-05-12 11:54                         ` Allen Samuels
2016-05-12 14:47                           ` Igor Fedotov
2016-05-12 14:38                         ` Igor Fedotov
2016-05-12 16:37                         ` Igor Fedotov
2016-05-12 16:43                           ` Sage Weil
2016-05-12 16:45                             ` Igor Fedotov
2016-05-12 16:48                               ` Sage Weil
2016-05-12 16:52                                 ` Igor Fedotov
2016-05-12 17:09                                   ` Sage Weil
2016-05-13 17:07                                     ` Igor Fedotov
2016-05-12 14:29                       ` Igor Fedotov
2016-05-12 14:27                     ` Igor Fedotov
2016-05-12 15:06                       ` Sage Weil
2016-05-11 12:39           ` Igor Fedotov
2016-05-11 14:35             ` Sage Weil
