From: Igor Fedotov
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 17:29:23 +0300
Message-ID: <8ddf3e2a-11b7-0f5c-fc8d-92b2252b013f@mirantis.com>
References: <6168022b-e3c0-b8f2-e8c7-3b4b82f9dc6e@mirantis.com> <2b5ebbd8-3e89-1fff-37f1-c6eb00bdcb1a@mirantis.com> <8b077a20-ace3-7824-4039-7b8e9adf88ce@mirantis.com>
To: Allen Samuels, Sage Weil
Cc: "ceph-devel@vger.kernel.org"

On 12.05.2016 0:38, Allen Samuels wrote:
> Sorry, still on vacation and I haven't really wrapped my head around everything that's being discussed. However, w.r.t. wal operations, I would strongly favor an approach that minimizes the number of "future" operations that are recorded (which I'll call intentions -- i.e., non-binding hints about extra work that needs to get done). Much of the complexity here is because the intentions -- after being recorded -- will need to be altered based on subsequent operations. Hence every write operation will need to digest the historical intentions and potentially update them -- this is VERY complex, potentially much more complex than code that simply examines the current state and re-determines the correct next operation (i.e., de-wal, gc, etc.).
>
> Additional complexity arises because you're recording two sets of state that require consistency checking -- in my experience, this road leads to perdition....

Completely agree.

> Maybe I've missed something critical (vacation-headed); if so, please ignore.
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Thursday, May 12, 2016 6:54 AM
>> To: Igor Fedotov
>> Cc: Allen Samuels ; ceph-devel@vger.kernel.org
>> Subject: Re: 2 related bluestore questions
>>
>> On Wed, 11 May 2016, Sage Weil wrote:
>>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>>> On 11.05.2016 16:10, Sage Weil wrote:
>>>>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>>>>>> I took a stab at a revised wal_op_t here:
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
>>>>>>>
>>>>>>> This is enough to implement the basic wal overwrite case here:
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
>>>>>>>
>>>>>>> It's overkill for that, but something like this ought to be sufficiently general to express the more complicated wal (and compaction/gc/cleanup) operations, where we are reading bits of data from lots of different previous blobs, verifying checksums, and then assembling the results into a new buffer that gets written somewhere else. The read_extent_map and write_map offsets are logical offsets in a buffer we assemble and then write to b_off~b_len in the specific blob.
>>>>>>> I didn't get to the _do_wal_op part that actually does it, but it would do the final write, csum calculation, and metadata update. Probably... the allocation would happen then too, if the specified blob didn't already have pextents. That way we can do compression at that stage as well?
>>>>>>>
>>>>>>> What do you think?
>>>>>> Not completely sure that it's a good idea to have the read-stage description stored in the WAL record. Wouldn't that produce conflicts/inconsistencies when multiple WAL records deal with the same or close lextents and a previous WAL record updates the lextents to read? Maybe it's better to prepare such a description exactly when the WAL is applied, and have the WAL record carry just basic write info?
>>>>> Yeah, I think this is a problem. I see two basic paths:
>>>>>
>>>>> - We do a wal flush before queueing a new wal event to avoid races like this. Or perhaps we only do it when the wal event(s) touch the same blob(s). That's simple to reason about, but means that a series of small IOs to the same object (or blob) will serialize the kv commit and wal r/m/w operations. (Note that this is no worse than the naive approach of doing the read part up front, and it only happens when you have successive wal ops on the same object (or blob)).
>>>>>
>>>>> - We describe the wal read-side in terms of the current onode state. For example, 'read object offset 0..100, use provided buffer for 100..4096, overwrite block'. That can be pipelined. But there are other operations that would require we flush the wal events, like a truncate or zero or other write that clobbers that region of the object. Maybe/hopefully in those cases we don't care (it no longer matters that this wal event do the write we originally intended) but we'd need to think pretty carefully about it. FWIW, truncate already does an o->flush().
>>>> I'd prefer the second approach, probably with some modification... As far as I understand, with the approach above you are trying to locate all write logic in a single place and have the WAL machinery act as a straightforward executor of already-prepared tasks. Not sure this is beneficial enough. But it's definitely more complex and error-prone. And potentially you will need to extend the WAL machinery task description from time to time...
>>>>
>>>> As an alternative, one can eliminate that read description in the WAL record altogether. Let's simply record what loffset we are going to write to, and the data itself. Thus we have a simple write request description. And when the WAL is applied, the corresponding code should determine how to do the write properly using the current lextent/blob map state. This way, applying a write op can be just regular write handling that performs a sync RMW or any other implementation depending on the current state, some policy, or whatever else fits best at that specific moment.
>>> I like that way better! We can just add a force_sync argument to _do_write. That also lets us trivially disable wal (by forcing sync w/ a config option or whatever).
>>>
>>> The downside is that any logically conflicting request (an overlapping write or truncate or zero) needs to drain the wal events, whereas with a lower-level wal description there might be cases where we can ignore the wal operation.
>>> I suspect the trivial solution of o->flush() on write/truncate/zero will be pretty visible in benchmarks. Tracking in-flight wal ops with an interval_set would probably work well enough.
>> Hmm, I'm not sure this will pan out. The main problem is that if we call back into the write code (with a sync flag), we will have to do write IO, and this wreaks havoc on our otherwise (mostly) orderly state machine. I think it can be done if we build in a guard similar to _txc_finish_io, so that we wait for the wal events to also complete IO in order before committing them. I think.
>>
>> But the other problem is the checksum thing that came up in another thread, where the read-side of a read/modify/write might fail the checksum because the wal write hit disk but the kv portion didn't commit. I see a few options:
>>
>> 1) If there are checksums and we're doing a sub-block overwrite, we have to write/cow it elsewhere. This probably means min_alloc_size cow operations for small writes. In which case we needn't bother with a wal event in the first place--the whole point is to enable an overwrite.
>>
>> 2) We do loose checksum validation that will accept either the old checksum or the expected new checksum for the read stage. This handles the two crash cases that fall between these steps:
>>
>> * kv commit of op + wal event
>>
>> * do wal io (completely)
>>
>> * kv cleanup of wal event
>>
>> but not the case where we only partially complete the wal io. Which means there is a small probability we "corrupt" ourselves on crash (not really corrupt, but confuse ourselves such that we refuse to replay the wal events on startup).
>>
>> 3) Same as 2, but simply warn if we fail that read-side checksum on replay. This basically introduces a *very* small window which could allow an on-disk corruption to get absorbed into our checksum. This could just be #2 + a config option so we warn instead of erroring out.
>>
>> 4) Same as 2, but we try every combination of old and new data on block/sector boundaries to find a valid checksum on the read-side.
>>
>> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek + 64K write on an HDD. Or forces us to run with min_alloc_size=4K on HDD, which would risk very bad fragmentation.
>>
>> Which makes me want #3 (initially) and then #4. But... if we do the "wal is just a logical write", that means this weird replay handling logic creeps into the normal write path.
>>
>> I'm currently leaning toward keeping the wal events special (lower-level), but doing what we can to make it work with the same mid- to low-level helper functions (for reading and verifying blobs, etc.).
>>
>> sage
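
To make the "WAL record is just a plain logical write" alternative discussed above concrete, here is a minimal sketch. All names and types are hypothetical: std::vector stands in for bufferlist, and the force_sync flag on _do_write is the proposed change, not code that exists yet.

  #include <cstdint>
  #include <vector>

  // Hypothetical minimal wal record: only where to write and what to write.
  // No read-side plan is stored, so nothing has to be patched up when later
  // operations change the lextent/blob maps.
  struct wal_write_t {
    uint64_t loffset = 0;        // logical offset within the object
    std::vector<char> data;      // payload (bufferlist in the real tree)
  };

  struct StoreSketch {
    // Stand-in for the normal write path; force_sync=true would make it do the
    // read/modify/write synchronously against the current onode state.
    int _do_write(uint64_t loffset, const std::vector<char>& data, bool force_sync) {
      (void)loffset; (void)data; (void)force_sync;  // stubbed out: illustration only
      return 0;
    }

    // Applying a wal event then reduces to re-running the write against the
    // *current* state, instead of executing a pre-recorded low-level plan.
    int _do_wal_op(const wal_write_t& op) {
      return _do_write(op.loffset, op.data, /*force_sync=*/true);
    }
  };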
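
And a minimal sketch of the "track in-flight wal ops with an interval_set" idea, so that only a write/truncate/zero that actually overlaps a pending wal extent has to drain it, rather than doing an unconditional o->flush(). Names are hypothetical and a plain std::map stands in for ceph's interval_set:

  #include <cstdint>
  #include <map>

  // Hypothetical per-onode record of extents touched by queued wal events.
  struct WalExtentTracker {
    // offset -> length of each in-flight wal write; assumes queued extents do
    // not overlap (a real interval_set would merge them).
    std::map<uint64_t, uint64_t> inflight;

    void queued(uint64_t off, uint64_t len) { inflight[off] = len; }  // wal event committed to kv
    void completed(uint64_t off)            { inflight.erase(off); }  // wal io + kv cleanup done

    // True if [off, off+len) overlaps any in-flight wal extent; the caller
    // would then wait for (just) those events before proceeding.
    bool conflicts(uint64_t off, uint64_t len) const {
      for (const auto& e : inflight) {
        if (off < e.first + e.second && e.first < off + len)
          return true;
      }
      return false;
    }
  };

A wal event would register its extent when it is committed to the kv store and drop it once the wal io and cleanup complete; the write/truncate/zero paths would only flush when conflicts() returns true.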