Re: 2 related bluestore questions

From: Igor Fedotov <ifedotov@mirantis.com>
To: Sage Weil <sage@newdream.net>
Cc: Allen Samuels <Allen.Samuels@sandisk.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 19:52:24 +0300	[thread overview]
Message-ID: <5441ddd3-c37b-4594-632c-58aafd354fb6@mirantis.com> (raw)
In-Reply-To: <alpine.DEB.2.11.1605121247520.23446@cpach.fuggernut.com>

Well, it goes to the new space and updates all the maps.

Then WAL comes to action - where will it write? To the new location? And 
overwrite new data?

On 12.05.2016 19:48, Sage Weil wrote:
> On Thu, 12 May 2016, Igor Fedotov wrote:
>> The second write in my example isn't processed through WAL - it's large and
>> overwrites the whole blob...
> If it's large, it wouldn't overwrite--it would go to newly allocated
> space.  We can *never* overwrite without wal or else we corrupt previous
> data...
>
> sage
>
>
>>
>> On 12.05.2016 19:43, Sage Weil wrote:
>>> On Thu, 12 May 2016, Igor Fedotov wrote:
>>>> Yet another potential issue with WAL I can imagine:
>>>>
>>>> Let's have some small write going to WAL followed by an larger aligned
>>>> overwrite to the same extent that bypasses WAL. Is it possible if the
>>>> first
>>>> write is processed later and overwrites the second one? I think so.
>>> Yeah, that would be chaos.  The wal ops are already ordered by the
>>> sequencer (or ordered globally, if bluestore_sync_wal_apply=true), so this
>>> can't happen.
>>>
>>> sage
>>>
>>>
>>>> This way we can probably come to the conclusion that all requests should
>>>> be
>>>> processed in-sequence. One should prohibit multiple flows for requests
>>>> processing as this may eliminate their order.
>>>>
>>>> Yeah - I'm attacking WAL concept this way...
>>>>
>>>>
>>>> Thanks,
>>>> Igor
>>>>
>>>> On 12.05.2016 5:58, Sage Weil wrote:
>>>>> On Wed, 11 May 2016, Allen Samuels wrote:
>>>>>> Sorry, still on vacation and I haven't really wrapped my head around
>>>>>> everything that's being discussed. However, w.r.t. wal operations, I
>>>>>> would strongly favor an approach that minimizes the amount of "future"
>>>>>> operations that are recorded (which I'll call intentions -- i.e.,
>>>>>> non-binding hints about extra work that needs to get done). Much of
>>>>>> the
>>>>>> complexity here is because the intentions -- after being recorded --
>>>>>> will need to be altered based on subsequent operations. Hence every
>>>>>> write operation will need to digest the historical intentions and
>>>>>> potentially update them -- this is VERY complex, potentially much more
>>>>>> complex than code that simply examines the current state and
>>>>>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>>>>>
>>>>>> Additional complexity arises because you're recording two sets of
>>>>>> state
>>>>>> that require consistency checking -- in my experience, this road leads
>>>>>> to perdition....
>>>>> I agree is has to be something manageable that we can reason about.  I
>>>>> think the question for me is mostly about which path minimizes the
>>>>> complexity while still getting us a reasonable level of performance.
>>>>>
>>>>> I had one new thought, see below...
>>>>>
>>>>>>>> The downside is that any logically conflicting request (an
>>>>>>>> overlapping
>>>>>>>> write or truncate or zero) needs to drain the wal events, whereas
>>>>>>>> with
>>>>>>>> a lower-level wal description there might be cases where we can
>>>>>>>> ignore
>>>>>>>> the wal operation.  I suspect the trivial solution of o->flush()
>>>>>>>> on
>>>>>>>> write/truncate/zero will be pretty visible in benchmarks.
>>>>>>>> Tracking
>>>>>>>> in-flight wal ops with an interval_set would probably work well
>>>>>>>> enough.
>>>>>>> Hmm, I'm not sure this will pan out.  The main problem is that if we
>>>>>>> call back
>>>>>>> into the write code (with a sync flag), we will have to do write IO,
>>>>>>> and
>>>>>>> this
>>>>>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>>>>>> I think it can be done if we build in a similar guard like
>>>>>>> _txc_finish_io so that
>>>>>>> we wait for the wal events to also complete IO in order before
>>>>>>> committing
>>>>>>> them.  I think.
>>>>>>>
>>>>>>> But the other problem is the checksum thing that came up in another
>>>>>>> thread,
>>>>>>> where the read-side of a read/modify/write might fail teh checksum
>>>>>>> because
>>>>>>> the wal write hit disk but the kv portion didn't commit. I see a few
>>>>>>> options:
>>>>>>>
>>>>>>>     1) If there are checksums and we're doing a sub-block overwrite,
>>>>>>> we
>>>>>>> have to
>>>>>>> write/cow it elsewhere.  This probably means min_alloc_size cow
>>>>>>> operations
>>>>>>> for small writes.  In which case we needn't bother doing a wal even
>>>>>>> in
>>>>>>> the
>>>>>>> first place--the whole point is to enable an overwrite.
>>>>>>>
>>>>>>>     2) We do loose checksum validation that will accept either the
>>>>>>> old
>>>>>>> checksum
>>>>>>> or the expected new checksum for the read stage.  This handles these
>>>>>>> two
>>>>>>> crash cases:
>>>>>>>
>>>>>>>     * kv commit of op + wal event
>>>>>>>       <crash here, or>
>>>>>>>     * do wal io (completely)
>>>>>>>       <crash before cleaning up wal event>
>>>>>>>     * kv cleanup of wal event
>>>>>>>
>>>>>>> but not the case where we only partially complete the wal io.  Which
>>>>>>> means
>>>>>>> there is a small probability is "corrupt" ourselves on crash (not
>>>>>>> really
>>>>>>> corrupt,
>>>>>>> but confuse ourselves such that we refuse to replay the wal events
>>>>>>> on
>>>>>>> startup).
>>>>>>>
>>>>>>>     3) Same as 2, but simply warn if we fail that read-side checksum
>>>>>>> on
>>>>>>> replay.
>>>>>>> This basically introduces a *very* small window which could allow an
>>>>>>> ondisk
>>>>>>> corruption to get absorbed into our checksum.  This could just be #2
>>>>>>> + a
>>>>>>> config option so we warn instead of erroring out.
>>>>>>>
>>>>>>>     4) Same as 2, but we try every combination of old and new data on
>>>>>>> block/sector boundaries to find a valid checksum on the read-side.
>>>>>>>
>>>>>>> I think #1 is a non-starter because it turns a 4K write into a 64K
>>>>>>> read
>>>>>>> + seek +
>>>>>>> 64K write on an HDD.  Or forces us to run with min_alloc_size=4K on
>>>>>>> HDD,
>>>>>>> which would risk very bad fragmentation.
>>>>>>>
>>>>>>> Which makes we want #3 (initially) and then #4.  But... if we do the
>>>>>>> "wal is
>>>>>>> just a logical write", that means this weird replay handling logic
>>>>>>> creeps into
>>>>>>> the normal write path.
>>>>>>>
>>>>>>> I'm currently leaning toward keeping the wal events special
>>>>>>> (lower-level), but
>>>>>>> doing what we can to make it work with the same mid- to low-level
>>>>>>> helper
>>>>>>> functions (for reading and verifying blobs, etc.).
>>>>> It occured to me that this checksum consistency issue only comes up when
>>>>> we are updating something that is smaller than the csum block size.  And
>>>>> the real source of the problem is that you have a sequence of
>>>>>
>>>>>     1- journal intent (kv wal item)
>>>>>     2- do read io
>>>>>     3- verify csum
>>>>>     4- do write io
>>>>>     5- cancel intent (remove kv wal item)
>>>>>
>>>>> If we have an order like
>>>>>
>>>>>     1- do read io
>>>>>     2- journal intent for entire csum chunk (kv wal item)
>>>>>     3- do write io
>>>>>     4- cancel intent
>>>>>
>>>>> Then the issue goes away.  And I'm thinking if the csum chunk is big
>>>>> enough that the #2 step is too big of a wal item to perform well, then
>>>>> the
>>>>> problem is your choice of csum block size, not the approach.  I.e., use
>>>>> a
>>>>> 4kb csum block size for rbd images, and use large blocks (128k, 512k,
>>>>> whatever) only for things that never see random overwrites (rgw data).
>>>>>
>>>>> If that is good enough, then it might also mean that we can make the wal
>>>>> operations never do reads--just (over)writes, further simplifying things
>>>>> on that end.  In the jewel bluestore the only times we do reads is for
>>>>> partial block updates (do we really care about these?  a buffer cache
>>>>> could absorb them when it matters) and for copy/cow operations
>>>>> post-clone
>>>>> (which i think are simple enough to be deal with separately).
>>>>>
>>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>