From mboxrd@z Thu Jan 1 00:00:00 1970
From: Igor Fedotov
Subject: Re: Adding compression support for bluestore.
Date: Mon, 21 Mar 2016 21:31:19 +0300
Message-ID: <56F03DF7.7030708@mirantis.com>
To: Allen Samuels, Sage Weil
Cc: ceph-devel

On 21.03.2016 20:14, Allen Samuels wrote:
> After reading your proposal and Sage's response, I have a hybrid proposal that I think is BOBW (best of both worlds)...
>
> This proposal is based on Sage's comments about there being two different structures.
>
> Leveraging my proposal for a unified extent (included by reference). For our purpose each extent is fully self-contained, i.e., logical address/size, physical address/size, checksums, flags, etc....
It seems we don't need the logical addr/size in this structure any more - it should be within the logical block map. See below...
> The onode value (i.e., the one in the KV storage) contains only a LIST of extents (yes a list, read on!). The list is ordered temporally; this provides the necessary ordering information to disambiguate between two extents that overlap/occlude the same logical address range.
And I'm not sure we need any temporal order in this extent list if the logical block map is maintained properly, i.e. proper entries are inserted on overwrite and overwritten ones are updated. We just need an (implicit?) collection of such extents and an ability to reference its specific entries.
By an "implicit" collection I mean we don't need to collect them within a single data structure (map, list, array, whatever...) - we just need some reference to the entry from the logical block map and corresponding reference-counting mechanics. Serialization still needs to be considered though...
> The in-memory onode contains an auxiliary map that's ordered by logical address. Entries in the map point to the extent that provides the data for that logical address range. There can be multiple entries in the map that reference the same element of the list. This map cheaply disambiguates between multiple extents that overlap in logical address space. This is the case that Sage pointed out causes problems for the data structure that you propose. In my proposal, you'll have two entries in the logical map pointing to the same entry in the extent map.
>
> We reference count between the logical map and the extent list to know when an extent no longer contributes data to this object (and hence can be removed from the list and potentially have its space recovered).
Yes. And such a map (AKA the logical block map) has to contain a (logical?) address/size (I'd prefer to call it a pointer or descriptor for a region within a physical extent) and a reference to an extent. Exactly as in Sage's comment.
Thus Sage's proposal simply needs to be extended with reference counting for raw extents.
> Costs for this dual data structure are:
>
> Lookup, insert and delete are all O(log2 N) and are essentially equivalent to the interval_set (with some extra logic for the reference-counted extents).
>
> Construction of the logical map when an onode is de-serialized is just O(N) inserts into an empty list -- something like O(log N+1), but not exactly. I think it's something like log(1) + log(2) + log(3) + log(4)... Not really sure what that adds up to, but it's waaay less than O(N).
>
> There may be additional potential benefits to having the extent list be temporally ordered. You might be able to infer all sorts of behavioral information by examining that list...
>
>
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Monday, March 21, 2016 9:36 AM
>> To: Allen Samuels; Sage Weil
>> Cc: ceph-devel
>> Subject: Re: Adding compression support for bluestore.
>>
>>
>>
>> On 21.03.2016 18:14, Allen Samuels wrote:
>>>> That's an interesting proposal but I can see the following caveats here (I beg pardon if I misunderstood something):
>>>> 1) Potentially uncontrolled extent map growth when extensive (over)writing takes place.
>>> Yes, a naïve insertion policy could lead to uncontrolled growth, but I don't think this needs to be the case. I assume that when you add an "extent", you won't increase the size of the array unnecessarily, i.e., if the new extent doesn't overlap an existing extent then there's no reason to increase the size of the map array -- actually you want to insert the new extent at the array index that doesn't overlap, only increasing the array size when that's not possible. I'm not 100% certain of the worst case, but I believe that it's limited to the ratio between the largest extent and the smallest extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256. Which is ugly but not awful -- since this is probably a contrived case. This might be a reason to limit the largest extent size to something a bit smaller (say 256K)...
>> It looks like I misunderstood something... It seemed to me that your array grows depending on the maximum number of block versions.
>> Imagine you have 1000 writes to 0~4K and 1000 writes to 8K~4K. I supposed that this would create the following array:
>> [
>>   0: <0:{...,4K}, 8K:{...,4K}>,
>>   ...
>>   999: <0:{...,4K}, 8K:{...,4K}>,
>> ]
>>
>> What's happening in your case?
>>
>>>> 2) Read/lookup algorithmic complexity. To find a valid block (or detect an overwrite) one should sequentially enumerate the full array. Given 1) that might be very ineffective.
>>> Only requires one log2 lookup for each index of the array.
>> This depends on 1), thus still unclear at the moment.
>>>> 3) It's not dealing with unaligned overwrites. What happens when some block is partially overwritten?
>>> I'm not sure I understand what cases you're referring to. Can you give an example?
>> Well, as far as I understand, in the proposal above you were operating on entire blocks (i.e. 4K of data). Thus overwriting a block is a simple case - you just need to create a new block "version" and insert it into an array.
>> But real user writes seem to be unaligned to the block size.
>> E.g.
>> write 0~2048
>> write 1024~3072
>>
>> you have to either track both blocks or merge them. The latter is a bit tricky for the compression case.