From: Igor Fedotov
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 17:29:23 +0300
Message-ID: <8ddf3e2a-11b7-0f5c-fc8d-92b2252b013f@mirantis.com>
References: <6168022b-e3c0-b8f2-e8c7-3b4b82f9dc6e@mirantis.com> <2b5ebbd8-3e89-1fff-37f1-c6eb00bdcb1a@mirantis.com> <8b077a20-ace3-7824-4039-7b8e9adf88ce@mirantis.com>
To: Allen Samuels, Sage Weil
Cc: "ceph-devel@vger.kernel.org"

On 12.05.2016 0:38, Allen Samuels wrote:
> Sorry, still on vacation and I haven't really wrapped my head around everything that's being discussed. However, w.r.t. wal operations, I would strongly favor an approach that minimizes the number of "future" operations that are recorded (which I'll call intentions -- i.e., non-binding hints about extra work that needs to get done). Much of the complexity here is because the intentions -- after being recorded -- will need to be altered based on subsequent operations. Hence every write operation will need to digest the historical intentions and potentially update them -- this is VERY complex, potentially much more complex than code that simply examines the current state and re-determines the correct next operation (i.e., de-wal, gc, etc.).
>
> Additional complexity arises because you're recording two sets of state that require consistency checking -- in my experience, this road leads to perdition....

Completely agree.

> Maybe I've missed something critical (vacation-headed); if so, please ignore.
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Thursday, May 12, 2016 6:54 AM
>> To: Igor Fedotov
>> Cc: Allen Samuels ; ceph-devel@vger.kernel.org
>> Subject: Re: 2 related bluestore questions
>>
>> On Wed, 11 May 2016, Sage Weil wrote:
>>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>>> On 11.05.2016 16:10, Sage Weil wrote:
>>>>> On Wed, 11 May 2016, Igor Fedotov wrote:
>>>>>>> I took a stab at a revised wal_op_t here:
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/bluestore_types.h#L595-L605
>>>>>>>
>>>>>>> This is enough to implement the basic wal overwrite case here:
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/blob/wip-bluestore-write/src/os/bluestore/BlueStore.cc#L5522-L5578
>>>>>>>
>>>>>>> It's overkill for that, but something like this ought to be sufficiently general to express the more complicated wal (and compaction/gc/cleanup) operations, where we are reading bits of data from lots of different previous blobs, verifying checksums, and then assembling the results into a new buffer that gets written somewhere else. The read_extent_map and write_map offsets are logical offsets in a buffer we assemble and then write to b_off~b_len in the specific blob.
>>>>>>> I didn't get to the _do_wal_op part that actually does it, but it would do the final write, csum calculation, and metadata update. Probably... the allocation would happen then too, if the specified blob didn't already have pextents. That way we can do compression at that stage as well?
>>>>>>>
>>>>>>> What do you think?
>>>>>> Not completely sure that it's a good idea to have the read-stage description stored in the WAL record. Wouldn't that produce conflicts/inconsistencies when multiple WAL records deal with the same or close lextents and a previous WAL record updates the lextents to read? Maybe it's better to prepare such a description exactly when the WAL is applied, and have the WAL record carry just basic write info?
>>>>> Yeah, I think this is a problem. I see two basic paths:
>>>>>
>>>>> - We do a wal flush before queueing a new wal event to avoid races like this. Or perhaps we only do it when the wal event(s) touch the same blob(s). That's simple to reason about, but means that a series of small IOs to the same object (or blob) will serialize the kv commit and wal r/m/w operations. (Note that this is no worse than the naive approach of doing the read part up front, and it only happens when you have successive wal ops on the same object (or blob)).
>>>>>
>>>>> - We describe the wal read-side in terms of the current onode state. For example, 'read object offset 0..100, use provided buffer for 100..4096, overwrite block'. That can be pipelined. But there are other operations that would require we flush the wal events, like a truncate or zero or other write that clobbers that region of the object. Maybe/hopefully in those cases we don't care (it no longer matters that this wal event do the write we originally intended) but we'd need to think pretty carefully about it. FWIW, truncate already does an o->flush().
>>>> I'd prefer the second approach, probably with some modification... As far as I understand, with the approach above you are trying to locate all write logic in a single place and have the WAL machinery act as a straightforward executor of already-prepared tasks. Not sure this is beneficial enough. But it's definitely more complex and error-prone. And potentially you will need to extend the WAL machinery task description from time to time...
>>>>
>>>> As an alternative, one can eliminate that read description in the WAL record altogether. Let's simply record what loffset we are going to write to, and the data itself. Thus we have a simple write request description. And when the WAL is applied, the corresponding code should determine how to do the write properly using the current lextent/blob map state. This way, applying a write op can be just regular write handling that performs a sync RMW or any other implementation depending on the current state, some policy, or whatever else fits best at that specific moment.
>>> I like that way better! We can just add a force_sync argument to _do_write. That also lets us trivially disable wal (by forcing sync w/ a config option or whatever).
>>>
>>> The downside is that any logically conflicting request (an overlapping write or truncate or zero) needs to drain the wal events, whereas with a lower-level wal description there might be cases where we can ignore the wal operation.
>>> I suspect the trivial solution of o->flush() on write/truncate/zero will be pretty visible in benchmarks. Tracking in-flight wal ops with an interval_set would probably work well enough.
>> Hmm, I'm not sure this will pan out. The main problem is that if we call back into the write code (with a sync flag), we will have to do write IO, and this wreaks havoc on our otherwise (mostly) orderly state machine. I think it can be done if we build in a guard similar to _txc_finish_io, so that we wait for the wal events to also complete IO in order before committing them. I think.
>>
>> But the other problem is the checksum thing that came up in another thread, where the read-side of a read/modify/write might fail the checksum because the wal write hit disk but the kv portion didn't commit. I see a few options:
>>
>> 1) If there are checksums and we're doing a sub-block overwrite, we have to write/cow it elsewhere. This probably means min_alloc_size cow operations for small writes. In which case we needn't bother with a wal event in the first place--the whole point is to enable an overwrite.
>>
>> 2) We do loose checksum validation that will accept either the old checksum or the expected new checksum for the read stage. This handles the two crash cases that fall between these steps:
>>
>> * kv commit of op + wal event
>>
>> * do wal io (completely)
>>
>> * kv cleanup of wal event
>>
>> but not the case where we only partially complete the wal io. Which means there is a small probability we "corrupt" ourselves on crash (not really corrupt, but confuse ourselves such that we refuse to replay the wal events on startup).
>>
>> 3) Same as 2, but simply warn if we fail that read-side checksum on replay. This basically introduces a *very* small window which could allow an on-disk corruption to get absorbed into our checksum. This could just be #2 + a config option so we warn instead of erroring out.
>>
>> 4) Same as 2, but we try every combination of old and new data on block/sector boundaries to find a valid checksum on the read-side.
>>
>> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek + 64K write on an HDD. Or forces us to run with min_alloc_size=4K on HDD, which would risk very bad fragmentation.
>>
>> Which makes me want #3 (initially) and then #4. But... if we do the "wal is just a logical write", that means this weird replay handling logic creeps into the normal write path.
>>
>> I'm currently leaning toward keeping the wal events special (lower-level), but doing what we can to make it work with the same mid- to low-level helper functions (for reading and verifying blobs, etc.).
>>
>> sage
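
To make the "WAL record is just a plain logical write" alternative discussed above concrete, here is a minimal sketch. All names and types are hypothetical: std::vector stands in for bufferlist, and the force_sync flag on _do_write is the proposed change, not code that exists yet.

  #include <cstdint>
  #include <vector>

  // Hypothetical minimal wal record: only where to write and what to write.
  // No read-side plan is stored, so nothing has to be patched up when later
  // operations change the lextent/blob maps.
  struct wal_write_t {
    uint64_t loffset = 0;        // logical offset within the object
    std::vector<char> data;      // payload (bufferlist in the real tree)
  };

  struct StoreSketch {
    // Stand-in for the normal write path; force_sync=true would make it do the
    // read/modify/write synchronously against the current onode state.
    int _do_write(uint64_t loffset, const std::vector<char>& data, bool force_sync) {
      (void)loffset; (void)data; (void)force_sync;  // stubbed out: illustration only
      return 0;
    }

    // Applying a wal event then reduces to re-running the write against the
    // *current* state, instead of executing a pre-recorded low-level plan.
    int _do_wal_op(const wal_write_t& op) {
      return _do_write(op.loffset, op.data, /*force_sync=*/true);
    }
  };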
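
And a minimal sketch of the "track in-flight wal ops with an interval_set" idea, so that only a write/truncate/zero that actually overlaps a pending wal extent has to drain it, rather than doing an unconditional o->flush(). Names are hypothetical and a plain std::map stands in for ceph's interval_set:

  #include <cstdint>
  #include <map>

  // Hypothetical per-onode record of extents touched by queued wal events.
  struct WalExtentTracker {
    // offset -> length of each in-flight wal write; assumes queued extents do
    // not overlap (a real interval_set would merge them).
    std::map<uint64_t, uint64_t> inflight;

    void queued(uint64_t off, uint64_t len) { inflight[off] = len; }  // wal event committed to kv
    void completed(uint64_t off)            { inflight.erase(off); }  // wal io + kv cleanup done

    // True if [off, off+len) overlaps any in-flight wal extent; the caller
    // would then wait for (just) those events before proceeding.
    bool conflicts(uint64_t off, uint64_t len) const {
      for (const auto& e : inflight) {
        if (off < e.first + e.second && e.first < off + len)
          return true;
      }
      return false;
    }
  };

A wal event would register its extent when it is committed to the kv store and drop it once the wal io and cleanup complete; the write/truncate/zero paths would only flush when conflicts() returns true.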