Re: 2 related bluestore questions

From: Sage Weil <sage@newdream.net>
To: Igor Fedotov <ifedotov@mirantis.com>
Cc: allen.samuels@sandisk.com, ceph-devel@vger.kernel.org
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 11:06:48 -0400 (EDT)
Message-ID: <alpine.DEB.2.11.1605121103440.23446@cpach.fuggernut.com>
In-Reply-To: <fd43404b-907c-a84f-a353-74d0a66c2aa1@mirantis.com>

On Thu, 12 May 2016, Igor Fedotov wrote:
> On 11.05.2016 23:54, Sage Weil wrote:
> > On Wed, 11 May 2016, Sage Weil wrote:
> > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > 
> > > I like that way better!  We can just add a force_sync argument to
> > > _do_write.  That also lets us trivially disable wal (by forcing sync w/
> > > a config option or whatever).
> > > 
> > > The downside is that any logically conflicting request (an overlapping
> > > write or truncate or zero) needs to drain the wal events, whereas with a
> > > lower-level wal description there might be cases where we can ignore the
> > > wal operation.  I suspect the trivial solution of o->flush() on
> > > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > > in-flight wal ops with an interval_set would probably work well enough.
> > Hmm, I'm not sure this will pan out.  The main problem is that if we call
> > back into the write code (with a sync flag), we will have to do write
> > IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.
> > I think it can be done if we build in a similar guard like _txc_finish_io
> > so that we wait for the wal events to also complete IO in order before
> > committing them.  I think.
> > 
> > But the other problem is the checksum thing that came up in another
> > thread, where the read-side of a read/modify/write might fail the checksum
> > because the wal write hit disk but the kv portion didn't commit. I see a
> > few options:
> > 
> >   1) If there are checksums and we're doing a sub-block overwrite, we
> > have to write/cow it elsewhere.  This probably means min_alloc_size cow
> > operations for small writes.  In which case we needn't bother doing a wal
> > even in the first place--the whole point is to enable an overwrite.
> > 
> >   2) We do loose checksum validation that will accept either the old
> > checksum or the expected new checksum for the read stage.  This handles
> > these two crash cases:
> Probably I missed something, but it seems to me that we don't have any
> 'expected new checksum' for the whole new block after the crash.
> What we can have is the old block checksum in KV and a checksum for the
> overwritten portion of the block in the WAL. To have the full new checksum,
> one has to do the read and store the new checksum to KV afterwards.
> Or do you mean write+KV update under 'do wal io'?

Yeah, you're right, I'm speaking nonsense!

Bottom line, we can't do partial checksum-block r/m/w overwrites (unless 
the read part of the r/m/w happens beforehand, turning it into a full 
block overwrite).
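
To make that concrete, here is a rough toy sketch (Python, not bluestore 
code). The names BLOCK, csum, merge and the two scenario functions are 
all made up for illustration, and CRC32 just stands in for whatever 
checksum is configured. It assumes the crash happens after the wal write 
hits disk but before the kv update commits.

```python
import zlib

BLOCK = 4096  # hypothetical checksum block size, in bytes


def csum(data: bytes) -> int:
    """CRC32 standing in for whatever checksum the store uses."""
    return zlib.crc32(data) & 0xFFFFFFFF


def merge(block: bytes, offset: int, new_bytes: bytes) -> bytes:
    """Apply a sub-block overwrite to an in-memory copy of the block."""
    return block[:offset] + new_bytes + block[offset + len(new_bytes):]


def crash_after_partial_wal() -> bool:
    """WAL record carries only the overwritten bytes.

    Returns whether the block found on disk after the crash still
    validates against the checksum committed in KV.
    """
    old = b"\xaa" * BLOCK
    kv_csum = csum(old)                    # last checksum committed to KV
    new_bytes = b"\x55" * 1024             # sub-block overwrite payload
    on_disk = merge(old, 512, new_bytes)   # wal write landed, kv did not
    # The only other checksum available covers new_bytes alone, which is
    # a different byte range, so it cannot validate the whole block.
    return csum(on_disk) == kv_csum


def replay_full_block_wal() -> bool:
    """Read and merge happen up front; WAL record carries the full block.

    Returns whether the block validates after replaying the record.
    """
    old = b"\xaa" * BLOCK
    full = merge(old, 512, b"\x55" * 1024)  # read part done beforehand
    wal_csum = csum(full)                   # checksum of the whole new block
    on_disk = full                          # replay rewrites the whole block
    return csum(on_disk) == wal_csum
```

In the first case the read side of the r/m/w has nothing to validate the 
block against. In the second, replay never needs to read the block at 
all, so it no longer matters how far the original write got.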

sage