From: Allen Samuels
Subject: Re: bluestore blobs REVISITED
Date: Sun, 21 Aug 2016 17:27:09 +0000
To: Sage Weil
Cc: "ceph-devel@vger.kernel.org"

I wonder how hard it would be to add a "lower-bound" fetch, like the STL's. That would allow the KV store to do the fetch without incurring the overhead of a snapshot for the iteration scan.

Shared blobs were always going to trigger an extra KV fetch no matter what.

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 21, 2016, at 12:08 PM, Sage Weil wrote:
>
>> On Sat, 20 Aug 2016, Allen Samuels wrote:
>> I have another proposal (it just occurred to me, so it might not survive more scrutiny).
>>
>> Yes, we should remove the blob-map from the oNode.
>>
>> But I believe we should also remove the lextent map from the oNode and make each lextent an independent KV value.
>>
>> However, in the special case where each extent --exactly-- maps onto a blob AND the blob is not referenced by any other extent (which is the typical case, unless you're doing compression with strange-ish overlaps) -- then you encode the blob in the lextent itself and there's no separate blob entry.
>>
>> This is pretty much exactly the same number of KV fetches as what you proposed before when the blob isn't shared (the typical case) -- except the oNode is MUCH MUCH smaller now.
>
> I think this approach makes a lot of sense! The only thing I'm worried about is that the lextent keys are no longer known when they are being fetched (since they will be a logical offset), which means we'll have to use an iterator instead of a simple get. The former is quite a bit slower than the latter (which can make use of the rocksdb caches and/or key bloom filters more easily).
>
> We could augment your approach by keeping *just* the lextent offsets in the onode, so that we know exactly which lextent key to fetch, but then I'm not sure we'll get much benefit (lextent metadata size goes down by ~1/3, but then we have an extra get for cloned objects).
>
> Hmm, the other thing to keep in mind is that for RBD the common case is that lots of objects have clones, and many of those objects' blobs will be shared.
>
> sage
>
>> So for the non-shared case, you fetch the oNode, which is dominated by the xattrs now (so figure a couple of hundred bytes and not much CPU cost to deserialize). Then you fetch the lextent from the KV store (which is 1 fetch -- unless it overlaps two previous lextents). In the optimal case, the KV fetch is small (10-15 bytes) and trivial to deserialize. If it's an unshared/local blob then you're ready to go. If the blob is shared (locally or globally) then you'll have to go fetch that one too.
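
(To make the iterator-vs-get point concrete: a minimal sketch, not BlueStore code, assuming lextent keys are an object prefix followed by a big-endian logical offset so lexicographic order matches numeric order. It shows how the covering lextent would be located against raw RocksDB today, and where a first-class "lower-bound" point lookup on the KV interface would slot in.)

// Hypothetical sketch only -- the key layout is an assumption.
#include <cstdint>
#include <memory>
#include <string>
#include "rocksdb/db.h"

static void append_be64(std::string& key, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8)
    key.push_back(static_cast<char>((v >> shift) & 0xff));
}

// Fill *value with the encoded lextent whose start offset is the greatest
// one <= logical_off (the caller still checks the extent's length to see
// whether it actually covers the offset).
bool find_covering_lextent(rocksdb::DB* db,
                           const std::string& object_prefix,
                           uint64_t logical_off,
                           std::string* value) {
  std::string target = object_prefix;
  append_be64(target, logical_off);

  // Today this needs an iterator: Seek() gives lower_bound semantics
  // (first key >= target), and stepping back yields the preceding lextent.
  // A lower_bound-style point lookup on the KV interface could answer the
  // same question without instantiating an iterator/snapshot.
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  it->Seek(target);
  if (!it->Valid() || it->key().ToString() != target) {
    // Not an exact hit; the covering extent (if any) starts earlier.
    if (it->Valid())
      it->Prev();
    else
      it->SeekToLast();
  }
  if (!it->Valid() || !it->key().starts_with(object_prefix))
    return false;  // ran off the front of this object's lextent keys
  *value = it->value().ToString();
  return true;
}
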
>>
>> This might lead to the elimination of the local/global blob thing (I think you've talked about that before), as now the only "local" blobs are the unshared single-extent blobs, which are stored inline with the lextent entry. You'll still have the special case of promoting unshared (inline) blobs to global blobs -- which is probably similar to the current promotion "magic" on a clone operation.
>>
>> The current refmap concept may require some additional work. I believe that we'll have to do a reconstruction of the refmap, but fortunately only for the range of the current I/O. That will be a bit more expensive, but still less expensive than reconstructing the entire refmap for every oNode deserialization. Fortunately, I believe the refmap is only really needed for compression cases or RBD cases without "trim" (this is the case to optimize -- it'll make trim really important for performance).
>>
>> Best of both worlds????
>>
>> Allen Samuels
>> SanDisk | a Western Digital brand
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Friday, August 19, 2016 7:16 AM
>>> To: Sage Weil
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: bluestore blobs
>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>> Sent: Friday, August 19, 2016 6:53 AM
>>>> To: Allen Samuels
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: bluestore blobs
>>>>
>>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
>>>>>> -----Original Message-----
>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>> Sent: Thursday, August 18, 2016 8:10 AM
>>>>>> To: Allen Samuels
>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>> Subject: RE: bluestore blobs
>>>>>>
>>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
>>>>>>>> To: ceph-devel@vger.kernel.org
>>>>>>>> Subject: bluestore blobs
>>>>>>>>
>>>>>>>> I think we need to look at other changes in addition to the encoding performance improvements. Even if they end up being good enough, these changes are somewhat orthogonal, and at least one of them should give us something that is even faster.
>>>>>>>>
>>>>>>>> 1. I mentioned this before, but we should keep the encoded bluestore_blob_t around when we load the blob map. If it's not changed, don't reencode it. There are no blockers for implementing this currently. It may be difficult to ensure the blobs are properly marked dirty... I'll see if we can use proper accessors for the blob to enforce this at compile time. We should do that anyway.
>>>>>>>
>>>>>>> If it's not changed, then why are we re-writing it? I'm having a hard time thinking of a case worth optimizing where I want to re-write the oNode but the blob_map is unchanged. Am I missing something obvious?
>>>>>>
>>>>>> An onode's blob_map might have 300 blobs, and a single write only updates one of them. The other 299 blobs need not be reencoded, just memcpy'd.
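
(A minimal sketch of that keep-the-encoded-bytes idea, with hypothetical types standing in for bluestore_blob_t and the blob map -- not the actual BlueStore structures.)

// Hypothetical sketch only -- not the real bluestore_blob_t.
#include <cstdint>
#include <map>
#include <string>

struct CachedBlob {
  std::string encoded;   // raw bytes this blob was decoded from at load time
  bool dirty = false;    // set by every mutating accessor

  std::string encode() const {
    if (!dirty)
      return encoded;    // clean: reuse the cached encoding verbatim
    std::string fresh;
    // ... real field-by-field encoding would go here ...
    return fresh;
  }
};

// Re-encoding a 300-blob map touches only the dirty entries; the other
// 299 reduce to a byte copy of their cached encodings.
std::string encode_blob_map(const std::map<uint64_t, CachedBlob>& blob_map) {
  std::string out;
  for (const auto& entry : blob_map) {
    // (the blob id would be encoded too; omitted for brevity)
    out += entry.second.encode();
  }
  return out;
}

(The accessor discipline Sage mentions is what guarantees dirty is set on every mutation, so a stale cached encoding can never be written back.)
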
>>>>>
>>>>> As long as we're just appending, that's a good optimization. How often does that happen? It's certainly not going to help the RBD 4K random write problem.
>>>>
>>>> It won't help the (l)extent_map encoding, but it avoids almost all of the blob reencoding. A 4k random write will update one blob out of ~100 (or whatever it is).
>>>>
>>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages: one to assemble the bufferlist (lots of bufferptrs to each untouched blob) into a single rocksdb::Slice, and another memcpy somewhere inside rocksdb to copy this into the write buffer. We could extend the rocksdb interface to take an iovec so that the first memcpy isn't needed (and rocksdb will instead iterate over our buffers and copy them directly into its write buffer). This is probably a pretty small piece of the overall time... should verify with a profiler before investing too much effort here.
>>>>>>>
>>>>>>> I doubt it's the memcpy that's really the expensive part. I'll bet it's that we're transcoding from an internal to an external representation on an element-by-element basis. If the iovec scheme is going to help, it presumes that the internal data structure essentially matches the external data structure so that only an iovec copy is required. I'm wondering how compatible this is with the current concepts of lextent/blob/pextent.
>>>>>>
>>>>>> I'm thinking of the xattr case (we have a bunch of strings to copy verbatim) and the updated-one-blob-and-kept-99-unchanged case: instead of memcpy'ing them into a big contiguous buffer and having rocksdb memcpy *that* into its larger buffer, give rocksdb an iovec so that the smaller buffers are assembled only once.
>>>>>>
>>>>>> These buffers will be on the order of many 10s to a couple 100s of bytes. I'm not sure where the crossover point for constructing and then traversing an iovec vs. just copying twice would be...
>>>>>
>>>>> Yes, this will eliminate the "extra" copy, but the real problem is that the oNode itself is just too large. I doubt removing one extra copy is going to suddenly "solve" this problem. I think we're going to end up rejiggering things so that this will be much less of a problem than it is now -- time will tell.
>>>>
>>>> Yeah, leaving this one for last, I think... until we see memcpy show up in the profile.
>>>>
>>>>>>>> 3. Even if we do the above, we're still setting a big (~4k or more?) key into rocksdb every time we touch an object, even when a tiny
>>>>>
>>>>> See my analysis: you're looking at 8-10K for the RBD random write case -- which I think everybody cares a lot about.
>>>>>
>>>>>>>> amount of metadata is getting changed. This is a consequence of embedding all of the blobs into the onode (or bnode). That seemed like a good idea early on when they were tiny (i.e., just an extent), but now I'm not so sure. I see a couple of different options:
>>>>>>>>
>>>>>>>> a) Store each blob as ($onode_key+$blobid). When we load the onode, load the blobs too. They will hopefully be sequential in rocksdb (or definitely sequential in zs). Probably go back to using an iterator.
>>>>>>>>
>>>>>>>> b) Go all in on the "bnode"-like concept.
>>>>>>>> Assign blob ids so that they are unique for any given hash value. Then store the blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is now). Then when clone happens there is no onode->bnode migration magic happening -- we've already committed to storing blobs in separate keys. When we load the onode, keep the conditional bnode loading we already have... but when the bnode is loaded, load up all the blobs for the hash key. (Okay, we could fault in blobs individually, but that code will be more complicated.)
>>>>>
>>>>> I like this direction. I think you'll still end up demand-loading the blobs in order to speed up the random read case. This scheme will result in some space amplification, both in the lextent and in the blob-map; it's worth a bit of study to see how bad the metadata/data ratio becomes (just as a guess, $shard.$poolid.$hash.$blobid is probably 16 + 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each blob -- unless your KV store does path compression. My reading of the RocksDB sst file format seems to indicate that it doesn't; I *believe* that ZS does [need to confirm]). I'm wondering if the current notion of local vs. global blobs isn't actually beneficial, in that we can give local blobs different names that sort with their associated oNode (which probably makes the space-amp worse), which is an important optimization. We do need to watch the space amp; we're going to be burning DRAM to make KV accesses cheap, and the amount of DRAM is proportional to the space amp.
>>>>
>>>> I got this mostly working last night... just need to sort out the clone case (and clean up a bunch of code). It was a relatively painless transition to make, although in its current form the blobs all belong to the bnode, and the bnode is ephemeral but remains in memory until all referencing onodes go away. Mostly fine, except it means that odd combinations of clone could leave lots of blobs in cache that don't get trimmed. Will address that later.
>>>>
>>>> I'll try to finish it up this morning and get it passing tests and posted.
>>>>
>>>>>>>> In both these cases, a write will dirty the onode (which is back to being pretty small... just xattrs and the lextent map) and 1-3 blobs (also now small keys).
>>>>>
>>>>> I'm not sure the oNode is going to be that small. Looking at the RBD random 4K write case, you're going to have 1K entries, each of which has an offset, size, and a blob-id reference in it. In my current oNode compression scheme this compresses to about 1 byte per entry. However, this optimization relies on being able to cheaply renumber the blob-ids, which is no longer possible when the blob-ids become parts of a key (see above). So now you'll have a minimum of 1.5-3 bytes extra for each blob-id (because you can't assume that the blob-ids become "dense" anymore). So you're looking at 2.5-4 bytes per entry, or about 2.5-4K bytes of lextent table. Worse, because of the variable-length encoding you'll have to scan the entire table to deserialize it (yes, we could do differential editing when we write, but that's another discussion). Oh, and I forgot to add the 200-300 bytes of oNode and xattrs :).
>>>>> So while this looks small compared to the current ~30K for the entire oNode/lextent/blobmap, it's NOT a huge gain over the 8-10K of the compressed oNode/lextent/blobmap scheme that I published earlier.
>>>>>
>>>>> If we want to do better, we will need to separate the lextent from the oNode also. It's relatively easy to move the lextents into the KV store itself (there are two obvious ways to deal with this: either use the native offset/size from the lextent itself, OR create 'N' buckets of logical offset into which we pour entries -- both of these would add somewhere between 1 and 2 KV look-ups per operation -- here is where an iterator would probably help).
>>>>>
>>>>> Unfortunately, if you only process a portion of the lextent (because you've made it into multiple keys and you don't want to load all of them), you can no longer re-generate the refmap on the fly (another key space optimization). The lack of a refmap screws up a number of other important algorithms -- for example the overlapping blob-map thing, etc. Not sure if these are easy to rewrite or not -- too complicated to think about at this hour of the evening.
>>>>
>>>> Yeah, I forgot about the extent_map and how big it gets. I think, though, that if we can get a 4MB object with 1024 4k lextents to encode the whole onode and extent_map in under 4K, that will be good enough. The blob update that goes with it will be ~200 bytes, and benchmarks aside, the 4k random write of a 100% fragmented object is a worst case.
>>>
>>> Yes, it's a worst case. But it's a "worst-case-that-everybody-looks-at" vs. a "worst-case-that-almost-nobody-looks-at".
>>>
>>> I'm still concerned about having an oNode that's larger than a 4K block.
>>>
>>>> Anyway, I'll get the blob separation branch working and we can go from there...
>>>>
>>>> sage
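
(To make the key-size arithmetic in the thread concrete: a hypothetical sketch of the two key layouts under discussion -- shared blobs keyed by $shard.$poolid.$hash.$blobid, and per-object lextents keyed by logical offset. The field widths are assumptions, chosen as fixed-width big-endian integers so lexicographic order matches numeric order; the 16 + 16 + 8 + 16 (~60-byte) estimate above corresponds to wider, stringified fields.)

// Hypothetical key builders -- not BlueStore's actual key encoding.
#include <cstdint>
#include <string>

// Append v as a fixed-width big-endian integer so that byte-wise
// (lexicographic) key order matches numeric order.
static void put_be(std::string& k, uint64_t v, int bytes) {
  for (int i = bytes - 1; i >= 0; --i)
    k.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Shared-blob key: $shard.$poolid.$hash.$blobid.  With binary fields this
// is 1 + 8 + 4 + 8 = 21 bytes per blob key.
std::string blob_key(uint8_t shard, uint64_t poolid,
                     uint32_t hash, uint64_t blobid) {
  std::string k;
  put_be(k, shard, 1);
  put_be(k, poolid, 8);
  put_be(k, hash, 4);
  put_be(k, blobid, 8);
  return k;
}

// Per-lextent key: the onode's key plus the lextent's logical offset, so
// that a lower-bound seek (see the earlier sketch) lands on the right
// entry and a scan visits one object's lextents in offset order.
std::string lextent_key(const std::string& onode_key, uint64_t logical_off) {
  std::string k = onode_key;
  put_be(k, logical_off, 8);
  return k;
}

(The "'N' buckets of logical offset" variant mentioned above would just coarsen logical_off, e.g. divide it by a bucket size, trading fewer keys against having to decode a whole bucket per lookup.)
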