From: Igor Fedotov
Subject: Re: 2 related bluestore questions
Date: Thu, 12 May 2016 17:38:47 +0300
To: Sage Weil, Allen Samuels
Cc: "ceph-devel@vger.kernel.org"

On 12.05.2016 5:58, Sage Weil wrote:
> On Wed, 11 May 2016, Allen Samuels wrote:
>> Sorry, still on vacation and I haven't really wrapped my head around
>> everything that's being discussed. However, w.r.t. wal operations, I
>> would strongly favor an approach that minimizes the amount of "future"
>> operations that are recorded (which I'll call intentions -- i.e.,
>> non-binding hints about extra work that needs to get done). Much of the
>> complexity here is because the intentions -- after being recorded --
>> will need to be altered based on subsequent operations. Hence every
>> write operation will need to digest the historical intentions and
>> potentially update them -- this is VERY complex, potentially much more
>> complex than code that simply examines the current state and
>> re-determines the correct next operation (i.e., de-wal, gc, etc.)
>>
>> Additional complexity arises because you're recording two sets of state
>> that require consistency checking -- in my experience, this road leads
>> to perdition....
> I agree it has to be something manageable that we can reason about. I
> think the question for me is mostly about which path minimizes the
> complexity while still getting us a reasonable level of performance.
>
> I had one new thought, see below...
>
>>>> The downside is that any logically conflicting request (an overlapping
>>>> write or truncate or zero) needs to drain the wal events, whereas with
>>>> a lower-level wal description there might be cases where we can ignore
>>>> the wal operation. I suspect the trivial solution of o->flush() on
>>>> write/truncate/zero will be pretty visible in benchmarks. Tracking
>>>> in-flight wal ops with an interval_set would probably work well enough.
>>> Hmm, I'm not sure this will pan out. The main problem is that if we call back
>>> into the write code (with a sync flag), we will have to do write IO, and this
>>> wreaks havoc on our otherwise (mostly) orderly state machine.
>>> I think it can be done if we build in a similar guard like _txc_finish_io so that
>>> we wait for the wal events to also complete IO in order before committing
>>> them. I think.
>>>
>>> But the other problem is the checksum thing that came up in another thread,
>>> where the read-side of a read/modify/write might fail the checksum because
>>> the wal write hit disk but the kv portion didn't commit. I see a few options:
>>>
>>> 1) If there are checksums and we're doing a sub-block overwrite, we have to
>>> write/cow it elsewhere. This probably means min_alloc_size cow operations
>>> for small writes.
>>> In which case we needn't bother doing a wal even in the
>>> first place--the whole point is to enable an overwrite.
>>>
>>> 2) We do loose checksum validation that will accept either the old checksum
>>> or the expected new checksum for the read stage. This handles these two
>>> crash cases:
>>>
>>> * kv commit of op + wal event
>>>
>>> * do wal io (completely)
>>>
>>> * kv cleanup of wal event
>>>
>>> but not the case where we only partially complete the wal io. Which means
>>> there is a small probability we "corrupt" ourselves on crash (not really corrupt,
>>> but confuse ourselves such that we refuse to replay the wal events on
>>> startup).
>>>
>>> 3) Same as 2, but simply warn if we fail that read-side checksum on replay.
>>> This basically introduces a *very* small window which could allow an ondisk
>>> corruption to get absorbed into our checksum. This could just be #2 + a
>>> config option so we warn instead of erroring out.
>>>
>>> 4) Same as 2, but we try every combination of old and new data on
>>> block/sector boundaries to find a valid checksum on the read-side.
>>>
>>> I think #1 is a non-starter because it turns a 4K write into a 64K read + seek +
>>> 64K write on an HDD. Or forces us to run with min_alloc_size=4K on HDD,
>>> which would risk very bad fragmentation.
>>>
>>> Which makes me want #3 (initially) and then #4. But... if we do the "wal is
>>> just a logical write", that means this weird replay handling logic creeps into
>>> the normal write path.
>>>
>>> I'm currently leaning toward keeping the wal events special (lower-level), but
>>> doing what we can to make it work with the same mid- to low-level helper
>>> functions (for reading and verifying blobs, etc.).
> It occurred to me that this checksum consistency issue only comes up when
> we are updating something that is smaller than the csum block size. And
> the real source of the problem is that you have a sequence of
>
> 1- journal intent (kv wal item)
> 2- do read io
> 3- verify csum
> 4- do write io
> 5- cancel intent (remove kv wal item)
>
> If we have an order like
>
> 1- do read io
> 2- journal intent for entire csum chunk (kv wal item)
> 3- do write io
> 4- cancel intent

I suspect this will cause consistency issues when handling multiple writes
to the same extent if a subsequent write doesn't wait for WAL apply
completion. E.g. we have a block <1,2,3> and two writes, <4,,> & <,,5>.
In your case the second WAL item will contain the block <1,2,5> instead of
<4,2,5>. And remember: you have o->flush for reading but don't have one for
writing. With your ordering you're effectively introducing o->flush for
writing as well, just to perform the read... (see the sketch below)

> Then the issue goes away. And I'm thinking if the csum chunk is big
> enough that the #2 step is too big of a wal item to perform well, then the
> problem is your choice of csum block size, not the approach. I.e., use a
> 4kb csum block size for rbd images, and use large blocks (128k, 512k,
> whatever) only for things that never see random overwrites (rgw data).
>
> If that is good enough, then it might also mean that we can make the wal
> operations never do reads--just (over)writes, further simplifying things
> on that end. In the jewel bluestore the only times we do reads is for
> partial block updates (do we really care about these? a buffer cache
> could absorb them when it matters) and for copy/cow operations post-clone
> (which I think are simple enough to deal with separately).
>
> sage
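
To make the reordering concern concrete, here is a minimal standalone sketch
(illustration only, not BlueStore code) of the lost-update case, assuming
whole-csum-chunk WAL intents are built from a read that does not wait for
earlier intents on the same extent to apply:

#include <array>
#include <iostream>
#include <vector>

using Chunk = std::array<int, 3>;

int main() {
  Chunk disk = {1, 2, 3};        // on-disk csum chunk contents
  std::vector<Chunk> wal;        // journaled intents (whole-chunk copies)

  // Write A: overwrite index 0 with 4.
  Chunk a = disk;                // 1) do read io
  a[0] = 4;
  wal.push_back(a);              // 2) journal intent <4,2,3>

  // Write B arrives before A's intent has been applied to disk.
  Chunk b = disk;                // 1) do read io -- still sees <1,2,3>
  b[2] = 5;
  wal.push_back(b);              // 2) journal intent <1,2,5>, A's update is lost

  // 3) do write io: apply intents in order.
  for (const Chunk& intent : wal)
    disk = intent;

  // Prints 1,2,5 -- the expected result was 4,2,5.
  std::cout << disk[0] << "," << disk[1] << "," << disk[2] << std::endl;
  return 0;
}

If write B's read had to wait for A's intent on the overlapping extent to
apply, the <4,2,5> result would be restored, but that is exactly the extra
flush-before-write ordering cost mentioned above.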
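
And on the earlier o->flush() point: a toy stand-in for the "track in-flight
wal ops with an interval_set" idea (again just an illustration, using a plain
std::map rather than the actual interval_set type), so an overlapping
write/truncate/zero only has to drain the wal events it actually conflicts
with, instead of flushing everything:

#include <cstdint>
#include <map>

// Toy tracker for in-flight wal extents; offset -> length.
struct WalExtentTracker {
  std::map<uint64_t, uint64_t> extents;

  void add(uint64_t off, uint64_t len) { extents[off] = len; }
  void remove(uint64_t off) { extents.erase(off); }

  // True if [off, off+len) overlaps any in-flight wal extent, i.e. the
  // caller must wait for (drain) those wal events before doing its own IO.
  bool conflicts(uint64_t off, uint64_t len) const {
    for (const auto& [e_off, e_len] : extents) {
      if (off < e_off + e_len && e_off < off + len)
        return true;
    }
    return false;
  }
};

int main() {
  WalExtentTracker t;
  t.add(0, 4096);                      // wal event pending on [0, 4096)
  bool c1 = t.conflicts(1024, 512);    // overlaps -> must drain first
  bool c2 = t.conflicts(8192, 4096);   // disjoint -> can proceed
  return (c1 && !c2) ? 0 : 1;
}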