From: Sage Weil <sage@newdream.net>
To: Igor Fedotov <ifedotov@mirantis.com>
Cc: allen.samuels@sandisk.com, ceph-devel@vger.kernel.org
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 11:06:48 -0400 (EDT)
Message-ID: <alpine.DEB.2.11.1605121103440.23446@cpach.fuggernut.com>
In-Reply-To: <fd43404b-907c-a84f-a353-74d0a66c2aa1@mirantis.com>

On Thu, 12 May 2016, Igor Fedotov wrote:
> On 11.05.2016 23:54, Sage Weil wrote:
> > On Wed, 11 May 2016, Sage Weil wrote:
> > > On Wed, 11 May 2016, Igor Fedotov wrote:
> > > 
> > > I like that way better!  We can just add a force_sync argument to
> > > _do_write.  That also lets us trivially disable wal (by forcing sync w/
> > > a config option or whatever).
> > > 
> > > The downside is that any logically conflicting request (an overlapping
> > > write or truncate or zero) needs to drain the wal events, whereas with a
> > > lower-level wal description there might be cases where we can ignore the
> > > wal operation.  I suspect the trivial solution of o->flush() on
> > > write/truncate/zero will be pretty visible in benchmarks.  Tracking
> > > in-flight wal ops with an interval_set would probably work well enough.
> > Hmm, I'm not sure this will pan out.  The main problem is that if we call
> > back into the write code (with a sync flag), we will have to do write
> > IO, and this wreaks havoc on our otherwise (mostly) orderly state machine.
> > I think it can be done if we build in a similar guard like _txc_finish_io
> > so that we wait for the wal events to also complete IO in order before
> > committing them.  I think.
> > 
> > But the other problem is the checksum thing that came up in another
> > thread, where the read-side of a read/modify/write might fail the checksum
> > because the wal write hit disk but the kv portion didn't commit. I see a
> > few options:
> > 
> >   1) If there are checksums and we're doing a sub-block overwrite, we
> > have to write/cow it elsewhere.  This probably means min_alloc_size cow
> > operations for small writes.  In which case we needn't bother doing a wal
> > even in the first place--the whole point is to enable an overwrite.
> > 
> >   2) We do loose checksum validation that will accept either the old
> > checksum or the expected new checksum for the read stage.  This handles
> > these two crash cases:
> Probably I missed something, but it seems to me that we don't have any
> 'expected new checksum' for the whole new block after the crash.
> What we can have are the old block checksum in KV and a checksum for the
> overwritten portion of the block in the WAL. To have the full new checksum
> one has to do the read and store the new checksum to KV afterwards.
> Or do you mean write+KV update under 'do wal io'?

Yeah, you're right, I'm speaking nonsense!

Bottom line, we can't do partial checksum-block r/m/w overwrites (unless 
the read part of the r/m/w happens beforehand, turning it into a full 
block overwrite).
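
To make that concrete, here is a rough sketch of the failure mode. It is
not BlueStore code: toy_csum (FNV-1a), apply_overwrite,
read_validates_after_crash and can_precompute_new_csum are all made-up
names for illustration, and the model assumes one checksum per block stored
in kv.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy checksum (FNV-1a), standing in for the real per-block checksum.
uint32_t toy_csum(const std::string& data) {
  uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Overwrite `patch` at byte offset `off` inside a single checksum block.
// The patch must fit entirely within the block.
std::string apply_overwrite(std::string block, std::size_t off,
                            const std::string& patch) {
  assert(off + patch.size() <= block.size());
  block.replace(off, patch.size(), patch);
  return block;
}

// Hypothetical post-crash read: the wal write reached disk, but the kv
// commit was lost, so kv still holds the checksum of the old block.
bool read_validates_after_crash(const std::string& on_disk,
                                uint32_t csum_in_kv) {
  return toy_csum(on_disk) == csum_in_kv;
}

// The whole-block checksum is known up front only if the writer has the
// full new contents: either the write covers the block, or the read half
// of the r/m/w was already done.
bool can_precompute_new_csum(std::size_t off, std::size_t len,
                             std::size_t block_size,
                             bool old_block_already_read) {
  bool covers_whole_block = (off == 0 && len == block_size);
  return covers_whole_block || old_block_already_read;
}
```

With a 16-byte block of 'a' and a 1-byte overwrite at offset 4, the
on-disk block after the crash no longer matches the checksum in kv, and
nothing in the wal record alone is enough to compute the checksum the
whole block should have.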

sage
