From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sage@newdream.net>
Cc: Allen Samuels <Allen.Samuels@sandisk.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 19:52:24 +0300
Message-ID: <5441ddd3-c37b-4594-632c-58aafd354fb6@mirantis.com>
In-Reply-To: <alpine.DEB.2.11.1605121247520.23446@cpach.fuggernut.com>

Well, it goes to the new space and updates all the maps.

Then the WAL comes into action: where will it write? To the new location?
And overwrite the new data?


On 12.05.2016 19:48, Sage Weil wrote:
> On Thu, 12 May 2016, Igor Fedotov wrote:
>> The second write in my example isn't processed through WAL - it's large and
>> overwrites the whole blob...
> If it's large, it wouldn't overwrite--it would go to newly allocated
> space.  We can *never* overwrite without wal or else we corrupt previous
> data...
>
> sage
>
>
>>
>> On 12.05.2016 19:43, Sage Weil wrote:
>>> On Thu, 12 May 2016, Igor Fedotov wrote:
>>>> Yet another potential issue with WAL I can imagine:
>>>>
>>>> Let's have some small write going to WAL followed by a larger aligned
>>>> overwrite to the same extent that bypasses WAL. Is it possible that the
>>>> first write is processed later and overwrites the second one? I think so.
>>> Yeah, that would be chaos.  The wal ops are already ordered by the
>>> sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this
>>> can't happen.
>>>
>>> sage
>>>
>>>
>>>> This way we can probably come to the conclusion that all requests should
>>>> be processed in sequence. One should prohibit multiple flows for request
>>>> processing, as this may break their ordering.
>>>>
>>>> Yeah - I'm attacking the WAL concept this way...
>>>>
>>>>
>>>> Thanks,
>>>> Igor
>>>>
>>>> On 12.05.2016 5:58, Sage Weil wrote:
>>>>> On Wed, 11 May 2016, Allen Samuels wrote:
>>>>>> Sorry, still on vacation and I haven't really wrapped my head around
>>>>>> everything that's being discussed. However, w.r.t. wal operations, I
>>>>>> would strongly favor an approach that minimizes the amount of "future"
>>>>>> operations that are recorded (which I'll call intentions -- i.e.,
>>>>>> non-binding hints about extra work that needs to get done). Much of the
>>>>>> complexity here is because the intentions -- after being recorded --
>>>>>> will need to be altered based on subsequent operations. Hence every
>>>>>> write operation will need to digest the historical intentions and
>>>>>> potentially update them -- this is VERY complex, potentially much more
>>>>>> complex than code that simply examines the current state and
>>>>>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>>>>>
>>>>>> Additional complexity arises because you're recording two sets of state
>>>>>> that require consistency checking -- in my experience, this road leads
>>>>>> to perdition....
>>>>> I agree it has to be something manageable that we can reason about.  I
>>>>> think the question for me is mostly about which path minimizes the
>>>>> complexity while still getting us a reasonable level of performance.
>>>>>
>>>>> I had one new thought, see below...
>>>>>
>>>>>>>> The downside is that any logically conflicting request (an overlapping
>>>>>>>> write or truncate or zero) needs to drain the wal events, whereas with
>>>>>>>> a lower-level wal description there might be cases where we can ignore
>>>>>>>> the wal operation.  I suspect the trivial solution of o->flush() on
>>>>>>>> write/truncate/zero will be pretty visible in benchmarks.  Tracking
>>>>>>>> in-flight wal ops with an interval_set would probably work well enough.
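To make the interval-tracking idea above concrete, here is a minimal sketch in Python. It is only an illustration: the InflightWal class and must_drain_before() are hypothetical names, and a plain list of (offset, length) pairs stands in for Ceph's interval_set. The point is that only a request overlapping an in-flight wal range has to drain, instead of every write/truncate/zero calling o->flush().

```python
class InflightWal:
    """Tracks byte ranges of one object that still have unapplied wal ops.

    Hypothetical stand-in for an interval_set; not BlueStore code.
    """

    def __init__(self):
        self.ranges = []  # list of (offset, length) tuples

    def add(self, offset, length):
        # Called when a wal op covering this range is queued.
        self.ranges.append((offset, length))

    def complete(self, offset, length):
        # Called when the wal op has been applied and cleaned up.
        self.ranges.remove((offset, length))

    def overlaps(self, offset, length):
        # Half-open interval intersection test against every tracked range.
        end = offset + length
        return any(o < end and offset < o + l for o, l in self.ranges)


def must_drain_before(inflight, offset, length):
    # Only a logically conflicting (overlapping) request waits for wal events.
    return inflight.overlaps(offset, length)
```

For example, with one in-flight wal op at (4096, 512), a write to (0, 4096) would not need to drain, while a write to (4000, 200) would.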
>>>>>>> Hmm, I'm not sure this will pan out.  The main problem is that if we
>>>>>>> call back into the write code (with a sync flag), we will have to do
>>>>>>> write IO, and this wreaks havoc on our otherwise (mostly) orderly state
>>>>>>> machine.  I think it can be done if we build in a similar guard like
>>>>>>> _txc_finish_io so that we wait for the wal events to also complete IO
>>>>>>> in order before committing them.  I think.
>>>>>>>
>>>>>>> But the other problem is the checksum thing that came up in another
>>>>>>> thread, where the read-side of a read/modify/write might fail the
>>>>>>> checksum because the wal write hit disk but the kv portion didn't
>>>>>>> commit.  I see a few options:
>>>>>>>
>>>>>>>     1) If there are checksums and we're doing a sub-block overwrite, we
>>>>>>> have to write/cow it elsewhere.  This probably means min_alloc_size cow
>>>>>>> operations for small writes.  In which case we needn't bother doing a
>>>>>>> wal even in the first place--the whole point is to enable an overwrite.
>>>>>>>
>>>>>>>     2) We do loose checksum validation that will accept either the old
>>>>>>> checksum or the expected new checksum for the read stage.  This handles
>>>>>>> these two crash cases:
>>>>>>>
>>>>>>>     * kv commit of op + wal event
>>>>>>>       <crash here, or>
>>>>>>>     * do wal io (completely)
>>>>>>>       <crash before cleaning up wal event>
>>>>>>>     * kv cleanup of wal event
>>>>>>>
>>>>>>> but not the case where we only partially complete the wal io.  Which
>>>>>>> means there is a small probability we "corrupt" ourselves on crash (not
>>>>>>> really corrupt, but confuse ourselves such that we refuse to replay the
>>>>>>> wal events on startup).
>>>>>>>
>>>>>>>     3) Same as 2, but simply warn if we fail that read-side checksum on
>>>>>>> replay.  This basically introduces a *very* small window which could
>>>>>>> allow an ondisk corruption to get absorbed into our checksum.  This
>>>>>>> could just be #2 + a config option so we warn instead of erroring out.
>>>>>>>
>>>>>>>     4) Same as 2, but we try every combination of old and new data on
>>>>>>> block/sector boundaries to find a valid checksum on the read-side.
>>>>>>>
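Options 2 and 3 above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not BlueStore code: loose_verify(), ChecksumMismatch and the warn_only flag are hypothetical names, and zlib.crc32 merely stands in for whatever checksum the store uses. It accepts a block whose checksum matches either the old or the expected new value; warn_only models the config option from #3 that warns instead of erroring out.

```python
import logging
import zlib

log = logging.getLogger("wal_replay")


class ChecksumMismatch(Exception):
    """Raised when a block matches neither the old nor the new checksum."""


def loose_verify(block, old_csum, new_csum, warn_only=False):
    """Read-side check for wal replay (hypothetical helper).

    Returns "old" if the wal io has not touched the block yet, "new" if the
    wal io completed, and "unknown" only when warn_only is set and neither
    checksum matches (e.g. a partially completed wal io).
    """
    actual = zlib.crc32(block)
    if actual == old_csum:
        return "old"
    if actual == new_csum:
        return "new"
    if warn_only:
        # Option 3: warn and carry on with the replay.
        log.warning("csum 0x%08x matches neither old nor new", actual)
        return "unknown"
    # Option 2: refuse to replay.
    raise ChecksumMismatch(hex(actual))
```

A block left in a mixed old/new state by a partially completed wal io matches neither checksum, which is exactly the case where #2 refuses to replay and #3 only warns.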
>>>>>>> I think #1 is a non-starter because it turns a 4K write into a 64K read
>>>>>>> + seek + 64K write on an HDD.  Or forces us to run with
>>>>>>> min_alloc_size=4K on HDD, which would risk very bad fragmentation.
>>>>>>>
>>>>>>> Which makes me want #3 (initially) and then #4.  But... if we do the
>>>>>>> "wal is just a logical write", that means this weird replay handling
>>>>>>> logic creeps into the normal write path.
>>>>>>>
>>>>>>> I'm currently leaning toward keeping the wal events special
>>>>>>> (lower-level), but doing what we can to make it work with the same mid-
>>>>>>> to low-level helper functions (for reading and verifying blobs, etc.).
>>>>> It occurred to me that this checksum consistency issue only comes up when
>>>>> we are updating something that is smaller than the csum block size.  And
>>>>> the real source of the problem is that you have a sequence of
>>>>>
>>>>>     1- journal intent (kv wal item)
>>>>>     2- do read io
>>>>>     3- verify csum
>>>>>     4- do write io
>>>>>     5- cancel intent (remove kv wal item)
>>>>>
>>>>> If we have an order like
>>>>>
>>>>>     1- do read io
>>>>>     2- journal intent for entire csum chunk (kv wal item)
>>>>>     3- do write io
>>>>>     4- cancel intent
>>>>>
>>>>> Then the issue goes away.  And I'm thinking if the csum chunk is big
>>>>> enough that the #2 step is too big of a wal item to perform well, then
>>>>> the problem is your choice of csum block size, not the approach.  I.e.,
>>>>> use a 4kb csum block size for rbd images, and use large blocks (128k,
>>>>> 512k, whatever) only for things that never see random overwrites (rgw
>>>>> data).
>>>>> If that is good enough, then it might also mean that we can make the wal
>>>>> operations never do reads--just (over)writes, further simplifying things
>>>>> on that end.  In the jewel bluestore the only times we do reads is for
>>>>> partial block updates (do we really care about these?  a buffer cache
>>>>> could absorb them when it matters) and for copy/cow operations
>>>>> post-clone (which i think are simple enough to be dealt with separately).
>>>>>
>>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>

