From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sage Weil <sage@newdream.net>
Subject: Re: 2 related bluestore questions
Date: Wed, 11 May 2016 16:54:16 -0400 (EDT)
Message-ID: <alpine.DEB.2.11.1605111636390.15518@cpach.fuggernut.com>
References: <alpine.DEB.2.11.1605091417590.336@cpach.fuggernut.com> <6168022b-e3c0-b8f2-e8c7-3b4b82f9dc6e@mirantis.com> <alpine.DEB.2.11.1605100841400.15518@cpach.fuggernut.com> <b6240191-0849-d60c-ebb6-147b5785e3e7@mirantis.com>
 <alpine.DEB.2.11.1605101121090.15518@cpach.fuggernut.com> <alpine.DEB.2.11.1605102105260.15518@cpach.fuggernut.com> <2b5ebbd8-3e89-1fff-37f1-c6eb00bdcb1a@mirantis.com> <alpine.DEB.2.11.1605110901551.15518@cpach.fuggernut.com> <8b077a20-ace3-7824-4039-7b8e9adf88ce@mirantis.com>
 <alpine.DEB.2.11.1605110951570.15518@cpach.fuggernut.com>
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from cobra.newdream.net ([66.33.216.30]:57315 "EHLO
	cobra.newdream.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751462AbcEKUyE (ORCPT
	<rfc822;ceph-devel@vger.kernel.org>); Wed, 11 May 2016 16:54:04 -0400
In-Reply-To: <alpine.DEB.2.11.1605110951570.15518@cpach.fuggernut.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: Igor Fedotov <ifedotov@mirantis.com>
Cc: allen.samuels@sandisk.com, ceph-devel@vger.kernel.org

On Wed, 11 May 2016, Sage Weil wrote:
> On Wed, 11 May 2016, Igor Fedotov wrote:
> > On 11.05.2016 16:10, Sage Weil wrote:
> > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > > > I took a stab at a revised wal_op_t here:
> > > > > 
> > > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
> > > > > 
> > > > > This is enough to implement the basic wal overwrite case here:
> > > > > 
> > > > > 	https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
> > > > > 
> > > > > It's overkill for that, but something like this ought to be sufficiently
> > > > > general to express the more complicated wal (and compaction/gc/cleanup)
> > > > > operations, where we are reading bits of data from lots of different
> > > > > previous blobs, verifying checksums, and then assembling the results
> > > > > into
> > > > > a new buffer that gets written somewhere else.  The read_extent_map and
> > > > > write_map offsets are logical offsets in a buffer we assemble and then
> > > > > write to b_off~b_len in the specific blob.  I didn't get to the
> > > > > _do_wal_op
> > > > > part that actually does it, but it would do the final write, csum
> > > > > calculation, and metadata update.  Probably... the allocation would
> > > > > happen
> > > > > then too, if the specified blob didn't already have pextents.  Tha way
> > > > > we can do compression at that stage as well?
> > > > > 
> > > > > What do you think?
> > > > Not completely sure that it's a good idea to have read stage description
> > > > stored in WAL record? Wouldn't that produce any conflicts/inconsistencies
> > > > when
> > > > multiple WAL records deal with the same or close lextents and previous WAL
> > > > updates lextents to read. May be it's better to prepare such a description
> > > > exactly when WAL is applied? And WAL record to have just a basic write
> > > > info?
> > > Yeah, I think this is a problem.  I see two basic paths:
> > > 
> > >   - We do a wal flush before queueing a new wal event to avoid races like
> > > this. Or perhaps we only do it when the wal event(s) touch the same
> > > blob(s).  That's simple to reason about, but means that a series
> > > of small IOs to the same object (or blob) will serialize the kv commit and
> > > wal r/m/w operations.  (Note that this is no worse than the naive approach
> > > of doing the read part up front, and it only happens when you have
> > > successive wal ops on the same object (or blob)).
> > > 
> > >   - We describe the wal read-side in terms of the current onode state.  For
> > > example, 'read object offset 0..100, use provided buffer for 100..4096,
> > > overwrite block'.  That can be pipelined.  But there are other
> > > operations that would require we flush the wal events, like a truncate or
> > > zero or other write that clobbers that region of the object.
> > > Maybe/hopefully in those cases we don't care (it no longer matters that
> > > this wal event do the write we originally intended) but we'd need
> > > to think pretty carefully about it.  FWIW, truncate already does an
> > > o->flush().
> > I'd prefer the second approach. Probably with some modification...
> > As far as I understand with the approach above you are trying to locate all
> > write logic at a single place and have WAL machinery as a straightforward
> > executor for already prepared tasks. Not sure this is beneficial enough. But
> > definitely it's more complex and error-prone. And potentially you will need
> > extend WAL machinery task description from time to time...
> > As an alternative one can eliminate that read description in WAL record at
> > all. Let's simply record what loffset we are going to write to and data
> > itself. Thus we have simple write request description.
> > And when WAL is applied corresponding code should determine how to do the
> > write properly using the current lextent/blob maps state. This way Write Op
> > apply can be just a regular write handling that performs sync RMW or any other
> > implementation depending on the current state, some policy, or whatever else
> > that fits the best at the specific moment.
> 
> I like that way better!  We can just add a force_sync argument to 
> _do_write.  That also lets us trivially disable wal (by forcing sync w/ 
> a config option or whatever).
> 
> The downside is that any logically conflicting request (an overlapping 
> write or truncate or zero) needs to drain the wal events, whereas with a 
> lower-level wal description there might be cases where we can ignore the 
> wal operation.  I suspect the trivial solution of o->flush() on 
> write/truncate/zero will be pretty visible in benchmarks.  Tracking 
> in-flight wal ops with an interval_set would probably work well enough.

Hmm, I'm not sure this will pan out.  The main problem is that if we call 
back into the write code (with a sync flag), we will have to do write 
IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.  
I think it can be done if we build in a similar guard like _txc_finish_io 
so that we wait for the wal events to also complete IO in order before 
committing them.  I think.

But the other problem is the checksum thing that came up in another 
thread, where the read-side of a read/modify/write might fail teh checksum 
because the wal write hit disk but the kv portion didn't commit. I see a 
few options:

 1) If there are checksums and we're doing a sub-block overwrite, we 
have to write/cow it elsewhere.  This probably means min_alloc_size cow 
operations for small writes.  In which case we needn't bother doing a wal 
even in the first place--the whole point is to enable an overwrite.

 2) We do loose checksum validation that will accept either the old 
checksum or the expected new checksum for the read stage.  This handles 
these two crash cases:

 * kv commit of op + wal event
   <crash here, or>
 * do wal io (completely)
   <crash before cleaning up wal event>
 * kv cleanup of wal event

but not the case where we only partially complete the wal io.  Which means 
there is a small probability is "corrupt" ourselves on crash (not really 
corrupt, but confuse ourselves such that we refuse to replay the 
wal events on startup).

 3) Same as 2, but simply warn if we fail that read-side checksum on 
replay.  This basically introduces a *very* small window which could allow 
an ondisk corruption to get absorbed into our checksum.  This could just 
be #2 + a config option so we warn instead of erroring out.

 4) Same as 2, but we try every combination of old and new data on 
block/sector boundaries to find a valid checksum on the read-side.

I think #1 is a non-starter because it turns a 4K write into a 64K read + 
seek + 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on 
HDD, which would risk very bad fragmentation.

Which makes we want #3 (initially) and then #4.  But... if we do the "wal 
is just a logical write", that means this weird replay handling logic 
creeps into the normal write path.

I'm currently leaning toward keeping the wal events special (lower-level), 
but doing what we can to make it work with the same mid- to low-level 
helper functions (for reading and verifying blobs, etc.).

sage