From: Allen Samuels
Subject: Re: bluestore blobs REVISITED
Date: Sun, 21 Aug 2016 17:27:09 +0000
To: Sage Weil
Cc: "ceph-devel@vger.kernel.org"

I wonder how hard it would be to add a "lower-bound" fetch, like the STL's. That would allow the KV store to do the fetch without incurring the overhead of a snapshot for the iteration scan.

Shared blobs were always going to trigger an extra KV fetch no matter what.

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 21, 2016, at 12:08 PM, Sage Weil wrote:
>
>> On Sat, 20 Aug 2016, Allen Samuels wrote:
>> I have another proposal (it just occurred to me, so it might not survive more scrutiny).
>>
>> Yes, we should remove the blob-map from the oNode.
>>
>> But I believe we should also remove the lextent map from the oNode and make each lextent an independent KV value.
>>
>> However, in the special case where each extent --exactly-- maps onto a blob AND the blob is not referenced by any other extent (which is the typical case, unless you're doing compression with strange-ish overlaps) -- then you encode the blob in the lextent itself and there's no separate blob entry.
>>
>> This is pretty much exactly the same number of KV fetches as what you proposed before when the blob isn't shared (the typical case) -- except the oNode is MUCH MUCH smaller now.
>
> I think this approach makes a lot of sense! The only thing I'm worried about is that the lextent keys are no longer known when they are being fetched (since they will be a logical offset), which means we'll have to use an iterator instead of a simple get. The former is quite a bit slower than the latter (which can make use of the rocksdb caches and/or key bloom filters more easily).
>
> We could augment your approach by keeping *just* the lextent offsets in the onode, so that we know exactly which lextent key to fetch, but then I'm not sure we'll get much benefit (lextent metadata size goes down by ~1/3, but then we have an extra get for cloned objects).
>
> Hmm, the other thing to keep in mind is that for RBD the common case is that lots of objects have clones, and many of those objects' blobs will be shared.
>
> sage
>
>> So for the non-shared case, you fetch the oNode, which is dominated by the xattrs now (so figure a couple of hundred bytes and not much CPU cost to deserialize). Then you fetch the lextent from the KV store (which is 1 fetch -- unless it overlaps two previous lextents). In the optimal case, the KV fetch is small (10-15 bytes) and trivial to deserialize. If it's an unshared/local blob then you're ready to go. If the blob is shared (locally or globally) then you'll have to go fetch that one too.
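
(To make the iterator-vs-get point concrete: a minimal sketch, not BlueStore code, assuming lextent keys are an object prefix followed by a big-endian logical offset so lexicographic order matches numeric order. It shows how the covering lextent would be located against raw RocksDB today, and where a first-class "lower-bound" point lookup on the KV interface would slot in.)

// Hypothetical sketch only -- the key layout is an assumption.
#include <cstdint>
#include <memory>
#include <string>
#include "rocksdb/db.h"

static void append_be64(std::string& key, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8)
    key.push_back(static_cast<char>((v >> shift) & 0xff));
}

// Fill *value with the encoded lextent whose start offset is the greatest
// one <= logical_off (the caller still checks the extent's length to see
// whether it actually covers the offset).
bool find_covering_lextent(rocksdb::DB* db,
                           const std::string& object_prefix,
                           uint64_t logical_off,
                           std::string* value) {
  std::string target = object_prefix;
  append_be64(target, logical_off);

  // Today this needs an iterator: Seek() gives lower_bound semantics
  // (first key >= target), and stepping back yields the preceding lextent.
  // A lower_bound-style point lookup on the KV interface could answer the
  // same question without instantiating an iterator/snapshot.
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  it->Seek(target);
  if (!it->Valid() || it->key().ToString() != target) {
    // Not an exact hit; the covering extent (if any) starts earlier.
    if (it->Valid())
      it->Prev();
    else
      it->SeekToLast();
  }
  if (!it->Valid() || !it->key().starts_with(object_prefix))
    return false;  // ran off the front of this object's lextent keys
  *value = it->value().ToString();
  return true;
}
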
>>
>> This might lead to the elimination of the local/global blob thing (I think you've talked about that before), as now the only "local" blobs are the unshared single-extent blobs, which are stored inline with the lextent entry. You'll still have the special case of promoting unshared (inline) blobs to global blobs -- which is probably similar to the current promotion "magic" on a clone operation.
>>
>> The current refmap concept may require some additional work. I believe that we'll have to do a reconstruction of the refmap, but fortunately only for the range of the current I/O. That will be a bit more expensive, but still less expensive than reconstructing the entire refmap for every oNode deserialization. Fortunately, I believe the refmap is only really needed for compression cases or RBD cases without "trim" (this is the case to optimize -- it'll make trim really important for performance).
>>
>> Best of both worlds????
>>
>> Allen Samuels
>> SanDisk | a Western Digital brand
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Friday, August 19, 2016 7:16 AM
>>> To: Sage Weil
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: bluestore blobs
>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>> Sent: Friday, August 19, 2016 6:53 AM
>>>> To: Allen Samuels
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: bluestore blobs
>>>>
>>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
>>>>>> -----Original Message-----
>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>> Sent: Thursday, August 18, 2016 8:10 AM
>>>>>> To: Allen Samuels
>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>> Subject: RE: bluestore blobs
>>>>>>
>>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
>>>>>>>> To: ceph-devel@vger.kernel.org
>>>>>>>> Subject: bluestore blobs
>>>>>>>>
>>>>>>>> I think we need to look at other changes in addition to the encoding performance improvements. Even if they end up being good enough, these changes are somewhat orthogonal, and at least one of them should give us something that is even faster.
>>>>>>>>
>>>>>>>> 1. I mentioned this before, but we should keep the encoded bluestore_blob_t around when we load the blob map. If it's not changed, don't reencode it. There are no blockers for implementing this currently. It may be difficult to ensure the blobs are properly marked dirty... I'll see if we can use proper accessors for the blob to enforce this at compile time. We should do that anyway.
>>>>>>>
>>>>>>> If it's not changed, then why are we re-writing it? I'm having a hard time thinking of a case worth optimizing where I want to re-write the oNode but the blob_map is unchanged. Am I missing something obvious?
>>>>>>
>>>>>> An onode's blob_map might have 300 blobs, and a single write only updates one of them. The other 299 blobs need not be reencoded, just memcpy'd.
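
(A minimal sketch of that keep-the-encoded-bytes idea, with hypothetical types standing in for bluestore_blob_t and the blob map -- not the actual BlueStore structures.)

// Hypothetical sketch only -- not the real bluestore_blob_t.
#include <cstdint>
#include <map>
#include <string>

struct CachedBlob {
  std::string encoded;   // raw bytes this blob was decoded from at load time
  bool dirty = false;    // set by every mutating accessor

  std::string encode() const {
    if (!dirty)
      return encoded;    // clean: reuse the cached encoding verbatim
    std::string fresh;
    // ... real field-by-field encoding would go here ...
    return fresh;
  }
};

// Re-encoding a 300-blob map touches only the dirty entries; the other
// 299 reduce to a byte copy of their cached encodings.
std::string encode_blob_map(const std::map<uint64_t, CachedBlob>& blob_map) {
  std::string out;
  for (const auto& entry : blob_map) {
    // (the blob id would be encoded too; omitted for brevity)
    out += entry.second.encode();
  }
  return out;
}

(The accessor discipline Sage mentions is what guarantees dirty is set on every mutation, so a stale cached encoding can never be written back.)
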
>>>>>
>>>>> As long as we're just appending, that's a good optimization. How often does that happen? It's certainly not going to help the RBD 4K random write problem.
>>>>
>>>> It won't help the (l)extent_map encoding, but it avoids almost all of the blob reencoding. A 4k random write will update one blob out of ~100 (or whatever it is).
>>>>
>>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages: one to assemble the bufferlist (lots of bufferptrs to each untouched blob) into a single rocksdb::Slice, and another memcpy somewhere inside rocksdb to copy this into the write buffer. We could extend the rocksdb interface to take an iovec so that the first memcpy isn't needed (and rocksdb will instead iterate over our buffers and copy them directly into its write buffer). This is probably a pretty small piece of the overall time... should verify with a profiler before investing too much effort here.
>>>>>>>
>>>>>>> I doubt it's the memcpy that's really the expensive part. I'll bet it's that we're transcoding from an internal to an external representation on an element-by-element basis. If the iovec scheme is going to help, it presumes that the internal data structure essentially matches the external data structure so that only an iovec copy is required. I'm wondering how compatible this is with the current concepts of lextent/blob/pextent.
>>>>>>
>>>>>> I'm thinking of the xattr case (we have a bunch of strings to copy verbatim) and the updated-one-blob-and-kept-99-unchanged case: instead of memcpy'ing them into a big contiguous buffer and having rocksdb memcpy *that* into its larger buffer, give rocksdb an iovec so that the smaller buffers are assembled only once.
>>>>>>
>>>>>> These buffers will be on the order of many 10s to a couple 100s of bytes. I'm not sure where the crossover point for constructing and then traversing an iovec vs. just copying twice would be...
>>>>>
>>>>> Yes, this will eliminate the "extra" copy, but the real problem is that the oNode itself is just too large. I doubt removing one extra copy is going to suddenly "solve" this problem. I think we're going to end up rejiggering things so that this will be much less of a problem than it is now -- time will tell.
>>>>
>>>> Yeah, leaving this one for last, I think... until we see memcpy show up in the profile.
>>>>
>>>>>>>> 3. Even if we do the above, we're still setting a big (~4k or more?) key into rocksdb every time we touch an object, even when a tiny
>>>>>
>>>>> See my analysis: you're looking at 8-10K for the RBD random write case -- which I think everybody cares a lot about.
>>>>>
>>>>>>>> amount of metadata is getting changed. This is a consequence of embedding all of the blobs into the onode (or bnode). That seemed like a good idea early on when they were tiny (i.e., just an extent), but now I'm not so sure. I see a couple of different options:
>>>>>>>>
>>>>>>>> a) Store each blob as ($onode_key+$blobid). When we load the onode, load the blobs too. They will hopefully be sequential in rocksdb (or definitely sequential in zs). Probably go back to using an iterator.
>>>>>>>>
>>>>>>>> b) Go all in on the "bnode"-like concept.
>>>>>>>> Assign blob ids so that they are unique for any given hash value. Then store the blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is now). Then when clone happens there is no onode->bnode migration magic happening -- we've already committed to storing blobs in separate keys. When we load the onode, keep the conditional bnode loading we already have... but when the bnode is loaded, load up all the blobs for the hash key. (Okay, we could fault in blobs individually, but that code will be more complicated.)
>>>>>
>>>>> I like this direction. I think you'll still end up demand-loading the blobs in order to speed up the random read case. This scheme will result in some space amplification, both in the lextent and in the blob-map; it's worth a bit of study to see how bad the metadata/data ratio becomes (just as a guess, $shard.$poolid.$hash.$blobid is probably 16 + 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each blob -- unless your KV store does path compression. My reading of the RocksDB sst file format seems to indicate that it doesn't; I *believe* that ZS does [need to confirm]). I'm wondering if the current notion of local vs. global blobs isn't actually beneficial, in that we can give local blobs different names that sort with their associated oNode (which probably makes the space-amp worse), which is an important optimization. We do need to watch the space amp; we're going to be burning DRAM to make KV accesses cheap, and the amount of DRAM is proportional to the space amp.
>>>>
>>>> I got this mostly working last night... just need to sort out the clone case (and clean up a bunch of code). It was a relatively painless transition to make, although in its current form the blobs all belong to the bnode, and the bnode is ephemeral but remains in memory until all referencing onodes go away. Mostly fine, except it means that odd combinations of clone could leave lots of blobs in cache that don't get trimmed. Will address that later.
>>>>
>>>> I'll try to finish it up this morning and get it passing tests and posted.
>>>>
>>>>>>>> In both these cases, a write will dirty the onode (which is back to being pretty small... just xattrs and the lextent map) and 1-3 blobs (also now small keys).
>>>>>
>>>>> I'm not sure the oNode is going to be that small. Looking at the RBD random 4K write case, you're going to have 1K entries, each of which has an offset, size, and a blob-id reference in it. In my current oNode compression scheme this compresses to about 1 byte per entry. However, this optimization relies on being able to cheaply renumber the blob-ids, which is no longer possible when the blob-ids become parts of a key (see above). So now you'll have a minimum of 1.5-3 bytes extra for each blob-id (because you can't assume that the blob-ids become "dense" anymore). So you're looking at 2.5-4 bytes per entry, or about 2.5-4K bytes of lextent table. Worse, because of the variable-length encoding you'll have to scan the entire table to deserialize it (yes, we could do differential editing when we write, but that's another discussion). Oh, and I forgot to add the 200-300 bytes of oNode and xattrs :).
>>>>> So while this looks small compared to the current ~30K for the entire oNode/lextent/blobmap, it's NOT a huge gain over the 8-10K of the compressed oNode/lextent/blobmap scheme that I published earlier.
>>>>>
>>>>> If we want to do better, we will need to separate the lextent from the oNode also. It's relatively easy to move the lextents into the KV store itself (there are two obvious ways to deal with this: either use the native offset/size from the lextent itself, OR create 'N' buckets of logical offset into which we pour entries -- both of these would add somewhere between 1 and 2 KV look-ups per operation -- here is where an iterator would probably help).
>>>>>
>>>>> Unfortunately, if you only process a portion of the lextent (because you've made it into multiple keys and you don't want to load all of them), you can no longer re-generate the refmap on the fly (another key space optimization). The lack of a refmap screws up a number of other important algorithms -- for example the overlapping blob-map thing, etc. Not sure if these are easy to rewrite or not -- too complicated to think about at this hour of the evening.
>>>>
>>>> Yeah, I forgot about the extent_map and how big it gets. I think, though, that if we can get a 4MB object with 1024 4k lextents to encode the whole onode and extent_map in under 4K, that will be good enough. The blob update that goes with it will be ~200 bytes, and benchmarks aside, the 4k random write of a 100% fragmented object is a worst case.
>>>
>>> Yes, it's a worst case. But it's a "worst-case-that-everybody-looks-at" vs. a "worst-case-that-almost-nobody-looks-at".
>>>
>>> I'm still concerned about having an oNode that's larger than a 4K block.
>>>
>>>> Anyway, I'll get the blob separation branch working and we can go from there...
>>>>
>>>> sage
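
(To make the key-size arithmetic in the thread concrete: a hypothetical sketch of the two key layouts under discussion -- shared blobs keyed by $shard.$poolid.$hash.$blobid, and per-object lextents keyed by logical offset. The field widths are assumptions, chosen as fixed-width big-endian integers so lexicographic order matches numeric order; the 16 + 16 + 8 + 16 (~60-byte) estimate above corresponds to wider, stringified fields.)

// Hypothetical key builders -- not BlueStore's actual key encoding.
#include <cstdint>
#include <string>

// Append v as a fixed-width big-endian integer so that byte-wise
// (lexicographic) key order matches numeric order.
static void put_be(std::string& k, uint64_t v, int bytes) {
  for (int i = bytes - 1; i >= 0; --i)
    k.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Shared-blob key: $shard.$poolid.$hash.$blobid.  With binary fields this
// is 1 + 8 + 4 + 8 = 21 bytes per blob key.
std::string blob_key(uint8_t shard, uint64_t poolid,
                     uint32_t hash, uint64_t blobid) {
  std::string k;
  put_be(k, shard, 1);
  put_be(k, poolid, 8);
  put_be(k, hash, 4);
  put_be(k, blobid, 8);
  return k;
}

// Per-lextent key: the onode's key plus the lextent's logical offset, so
// that a lower-bound seek (see the earlier sketch) lands on the right
// entry and a scan visits one object's lextents in offset order.
std::string lextent_key(const std::string& onode_key, uint64_t logical_off) {
  std::string k = onode_key;
  put_be(k, logical_off, 8);
  return k;
}

(The "'N' buckets of logical offset" variant mentioned above would just coarsen logical_off, e.g. divide it by a bucket size, trading fewer keys against having to decode a whole bucket per lookup.)
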