From: Allen Samuels
Subject: bluestore blobs REVISITED
Date: Sat, 20 Aug 2016 05:33:12 +0000
To: Sage Weil
Cc: "ceph-devel@vger.kernel.org"

I have another proposal (it just occurred to me, so it might not survive
more scrutiny).

Yes, we should remove the blob-map from the oNode.

But I believe we should also remove the lextent map from the oNode and make
each lextent be an independent KV value.

However, in the special case where each extent --exactly-- maps onto a blob
AND the blob is not referenced by any other extent (which is the typical
case, unless you're doing compression with strange-ish overlaps) -- then you
encode the blob in the lextent itself and there's no separate blob entry.

This is pretty much exactly the same number of KV fetches as what you
proposed before when the blob isn't shared (the typical case) -- except the
oNode is MUCH MUCH smaller now.

So for the non-shared case, you fetch the oNode, which is dominated by the
xattrs now (so figure a couple of hundred bytes and not much CPU cost to
deserialize). And then you fetch the lextent from the KV store (which is 1
fetch -- unless it overlaps two previous lextents). In the optimal case, the
KV fetch is small (10-15 bytes) and trivial to deserialize. If it's an
unshared/local blob then you're ready to go. If the blob is shared (locally
or globally) then you'll have to go fetch that one too.

This might lead to the elimination of the local/global blob thing (I think
you've talked about that before), as now the only "local" blobs are the
unshared single-extent blobs, which are stored inline with the lextent
entry. You'll still have the special case of promoting unshared (inline)
blobs to global blobs -- which is probably similar to the current promotion
"magic" on a clone operation.

The current refmap concept may require some additional work. I believe that
we'll have to do a reconstruction of the refmap, but fortunately only for
the range of the current I/O. That will be a bit more expensive, but still
less expensive than reconstructing the entire refmap for every oNode
deserialization. Fortunately, I believe the refmap is only really needed for
compression cases or RBD cases without "trim" (this is the case to optimize
-- it'll make trim really important for performance).

Best of both worlds????
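To make the shape of this concrete, here is a minimal sketch of what a
standalone lextent key/value might look like. Everything here is
illustrative only -- the struct name, the key layout, and the field names
are stand-ins, not the actual BlueStore types:

  #include <cstdint>
  #include <utility>
  #include <vector>

  // Key:   <onode_key> + <logical_offset, big-endian>   (sorts by offset)
  // Value: one lextent; in the common unshared 1:1 case the blob is inline.
  struct lextent_value_t {
    uint32_t length;                 // logical bytes covered by this lextent

    // Inline (unshared) blob: no second KV fetch needed.
    bool inline_blob = true;
    std::vector<std::pair<uint64_t, uint32_t>> pextents;  // (offset, len) on disk

    // Shared blob: one more fetch to $shard.$poolid.$hash.$blobid.
    uint64_t blob_id = 0;            // only meaningful when !inline_blob
    uint32_t blob_offset = 0;        // where this lextent lands in that blob
  };

On disk the integers would be varint-encoded, so the unshared case really
is only on the order of 10-15 bytes per entry.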
Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Friday, August 19, 2016 7:16 AM
> To: Sage Weil
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs
>
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, August 19, 2016 6:53 AM
> > To: Allen Samuels
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs
> >
> > On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Thursday, August 18, 2016 8:10 AM
> > > > To: Allen Samuels
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: RE: bluestore blobs
> > > >
> > > > On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > > To: ceph-devel@vger.kernel.org
> > > > > > Subject: bluestore blobs
> > > > > >
> > > > > > I think we need to look at other changes in addition to the
> > > > > > encoding performance improvements. Even if they end up being
> > > > > > good enough, these changes are somewhat orthogonal and at
> > > > > > least one of them should give us something that is even faster.
> > > > > >
> > > > > > 1. I mentioned this before, but we should keep the encoded
> > > > > > bluestore_blob_t around when we load the blob map. If it's
> > > > > > not changed, don't reencode it. There are no blockers for
> > > > > > implementing this currently.
> > > > > > It may be difficult to ensure the blobs are properly marked dirty...
> > > > > > I'll see if we can use proper accessors for the blob to
> > > > > > enforce this at compile time. We should do that anyway.
> > > > >
> > > > > If it's not changed, then why are we re-writing it? I'm having a
> > > > > hard time thinking of a case worth optimizing where I want to
> > > > > re-write the oNode but the blob_map is unchanged. Am I missing
> > > > > something obvious?
> > > >
> > > > An onode's blob_map might have 300 blobs, and a single write only
> > > > updates one of them. The other 299 blobs need not be reencoded,
> > > > just memcpy'd.
> > >
> > > As long as we're just appending, that's a good optimization. How
> > > often does that happen? It's certainly not going to help the RBD 4K
> > > random write problem.
> >
> > It won't help the (l)extent_map encoding, but it avoids almost all of
> > the blob reencoding. A 4k random write will update one blob out of
> > ~100 (or whatever it is).
> >
> > > > > > 2. This turns the blob Put into rocksdb into two memcpy stages:
> > > > > > one to assemble the bufferlist (lots of bufferptrs to each
> > > > > > untouched blob) into a single rocksdb::Slice, and another
> > > > > > memcpy somewhere inside rocksdb to copy this into the write
> > > > > > buffer. We could extend the rocksdb interface to take an iovec
> > > > > > so that the first memcpy isn't needed (and rocksdb will instead
> > > > > > iterate over our buffers and copy them directly into its write
> > > > > > buffer). This is probably a pretty small piece of the overall
> > > > > > time... should verify with a profiler before investing too much
> > > > > > effort here.
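To make point 1 concrete, something along these lines is what's being
described -- a minimal sketch with made-up names (BlobSlot,
assemble_blob_map), not the actual bluestore_blob_t code:

  #include <string>
  #include <vector>

  // Hypothetical cache entry: remember the encoding captured at onode load
  // time and only re-encode a blob that was actually modified.
  struct BlobSlot {
    std::string encoded;   // bytes as loaded from the KV store
    bool dirty = false;    // set by any mutating accessor on the blob
  };

  // Build the value handed to the KV store: unchanged blobs are a straight
  // copy of the cached encoding; only dirty blobs pay the re-encode cost.
  std::string assemble_blob_map(std::vector<BlobSlot>& blobs) {
    std::string out;
    for (auto& b : blobs) {
      if (b.dirty) {
        // b.encoded = reencode(b);   // stand-in for the real encode path
        b.dirty = false;
      }
      out.append(b.encoded);          // 299 of 300 blobs take this path
    }
    return out;
  }

And on point 2, if memory serves rocksdb already exposes something close to
an iovec via the SliceParts overloads on WriteBatch::Put, though it's worth
checking whether that actually avoids the extra copy on this path.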
> > > > >
> > > > > I doubt it's the memcpy that's really the expensive part. I'll
> > > > > bet it's that we're transcoding from an internal to an external
> > > > > representation on an element by element basis. If the iovec
> > > > > scheme is going to help, it presumes that the internal data
> > > > > structure essentially matches the external data structure so
> > > > > that only an iovec copy is required. I'm wondering how
> > > > > compatible this is with the current concepts of lextent/blob/pextent.
> > > >
> > > > I'm thinking of the xattr case (we have a bunch of strings to copy
> > > > verbatim) and the updated-one-blob-and-kept-99-unchanged case:
> > > > instead of memcpy'ing them into a big contiguous buffer and having
> > > > rocksdb memcpy *that* into its larger buffer, give rocksdb an iovec
> > > > so that the smaller buffers are assembled only once.
> > > >
> > > > These buffers will be on the order of many 10s to a couple 100s of
> > > > bytes. I'm not sure where the crossover point for constructing and
> > > > then traversing an iovec vs just copying twice would be...
> > >
> > > Yes, this will eliminate the "extra" copy, but the real problem is
> > > that the oNode itself is just too large. I doubt removing one extra
> > > copy is going to suddenly "solve" this problem. I think we're going
> > > to end up rejiggering things so that this will be much less of a
> > > problem than it is now -- time will tell.
> >
> > Yeah, leaving this one for last I think... until we see memcpy show up
> > in the profile.
> >
> > > > > > 3. Even if we do the above, we're still setting a big (~4k or
> > > > > > more?) key into rocksdb every time we touch an object, even
> > > > > > when a tiny
> > >
> > > See my analysis: you're looking at 8-10K for the RBD random write
> > > case -- which I think everybody cares a lot about.
> > >
> > > > > > amount of metadata is getting changed. This is a consequence
> > > > > > of embedding all of the blobs into the onode (or bnode). That
> > > > > > seemed like a good idea early on when they were tiny (i.e.,
> > > > > > just an extent), but now I'm not so sure. I see a couple of
> > > > > > different options:
> > > > > >
> > > > > > a) Store each blob as ($onode_key+$blobid). When we load the
> > > > > > onode, load the blobs too. They will hopefully be sequential
> > > > > > in rocksdb (or definitely sequential in zs). Probably go back
> > > > > > to using an iterator.
> > > > > >
> > > > > > b) Go all in on the "bnode" like concept. Assign blob ids so
> > > > > > that they are unique for any given hash value. Then store the
> > > > > > blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode
> > > > > > is now). Then when clone happens there is no onode->bnode
> > > > > > migration magic happening -- we've already committed to storing
> > > > > > blobs in separate keys. When we load the onode, keep the
> > > > > > conditional bnode loading we already have... but when the bnode
> > > > > > is loaded, load up all the blobs for the hash key. (Okay, we
> > > > > > could fault in blobs individually, but that code will be more
> > > > > > complicated.)
> > >
> > > I like this direction. I think you'll still end up demand loading
> > > the blobs in order to speed up the random read case.
> > > This scheme will result in some space-amplification, both in the
> > > lextent and in the blob-map; it's worth a bit of study to see how bad
> > > the metadata/data ratio becomes (just as a guess,
> > > $shard.$poolid.$hash.$blobid is probably 16 + 16 + 8 + 16 bytes in
> > > size, that's ~60 bytes of key for each Blob -- unless your KV store
> > > does path compression. My reading of the RocksDB sst file seems to
> > > indicate that it doesn't; I *believe* that ZS does [need to confirm]).
> > > I'm wondering if the current notion of local vs. global blobs isn't
> > > actually beneficial in that we can give local blobs different names
> > > that sort with their associated oNode (which probably makes the
> > > space-amp worse), which is an important optimization. We do need to
> > > watch the space amp: we're going to be burning DRAM to make KV
> > > accesses cheap and the amount of DRAM is proportional to the space amp.
> >
> > I got this mostly working last night... just need to sort out the
> > clone case (and clean up a bunch of code). It was a relatively
> > painless transition to make, although in its current form the blobs
> > all belong to the bnode, and the bnode is ephemeral but remains in
> > memory until all referencing onodes go away.
> > Mostly fine, except it means that odd combinations of clone could
> > leave lots of blobs in cache that don't get trimmed. Will address that
> > later.
> >
> > I'll try to finish it up this morning and get it passing tests and posted.
> >
> > > > > > In both these cases, a write will dirty the onode (which is
> > > > > > back to being pretty small... just xattrs and the lextent map)
> > > > > > and 1-3 blobs (also now small keys).
> > >
> > > I'm not sure the oNode is going to be that small. Looking at the RBD
> > > random 4K write case, you're going to have 1K entries, each of which
> > > has an offset, size and a blob-id reference in them. In my current
> > > oNode compression scheme this compresses to about 1 byte per entry.
> > > However, this optimization relies on being able to cheaply renumber
> > > the blob-ids, which is no longer possible when the blob-ids become
> > > parts of a key (see above). So now you'll have a minimum of 1.5-3
> > > bytes extra for each blob-id (because you can't assume that the
> > > blob-ids become "dense" anymore). So you're looking at 2.5-4 bytes
> > > per entry, or about 2.5-4K bytes of lextent table. Worse, because of
> > > the variable length encoding you'll have to scan the entire table to
> > > deserialize it (yes, we could do differential editing when we write,
> > > but that's another discussion). Oh, and I forgot to add the 200-300
> > > bytes of oNode and xattrs :). So while this looks small compared to
> > > the current ~30K for the entire oNode/lextent/blobmap thing, it's NOT
> > > a huge gain over the 8-10K of the compressed oNode/lextent/blobmap
> > > scheme that I published earlier.
> > >
> > > If we want to do better we will need to separate the lextent from
> > > the oNode also. It's relatively easy to move the lextents into the
> > > KV store itself (there are two obvious ways to deal with this: either
> > > use the native offset/size from the lextent itself OR create 'N'
> > > buckets of logical offset into which we pour entries); both of these
> > > would add somewhere between 1 and 2 KV look-ups per operation --
> > > here is where an iterator would probably help.
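To illustrate the second option (fixed buckets of logical offset), here is
a rough sketch; the 512 KB bucket size and the key format are arbitrary
choices for illustration, not a proposal for the actual encoding:

  #include <cstdint>
  #include <cstdio>
  #include <string>

  // Hypothetical key layout: one KV entry per 512 KB bucket of logical
  // offset, so a 4 KB random write touches exactly one lextent key.
  constexpr uint64_t kBucketBytes = 512 * 1024;

  std::string lextent_bucket_key(const std::string& onode_key,
                                 uint64_t logical_offset) {
    char suffix[32];
    // fixed-width hex suffix so buckets sort in logical-offset order
    std::snprintf(suffix, sizeof(suffix), ".%016llx",
                  (unsigned long long)(logical_offset / kBucketBytes));
    return onode_key + suffix;
  }

With something like this, a read or write that crosses a bucket boundary
needs at most two lookups (or one short iterator scan), which is where the
1-2 extra KV look-ups per operation mentioned above come from.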
> > >
> > > Unfortunately, if you only process a portion of the lextent (because
> > > you've made it into multiple keys and you don't want to load all of
> > > them) you no longer can re-generate the refmap on the fly (another
> > > key space optimization). The lack of refmap screws up a number of
> > > other important algorithms -- for example the overlapping blob-map
> > > thing, etc. Not sure if these are easy to rewrite or not -- too
> > > complicated to think about at this hour of the evening.
> >
> > Yeah, I forgot about the extent_map and how big it gets. I think,
> > though, that if we can get a 4mb object with 1024 4k lextents to
> > encode the whole onode and extent_map in under 4K that will be good
> > enough. The blob update that goes with it will be ~200 bytes, and
> > benchmarks aside, the 4k random write of a 100% fragmented object is a
> > worst case.
>
> Yes, it's a worst-case. But it's a "worst-case-that-everybody-looks-at"
> vs. a "worst-case-that-almost-nobody-looks-at".
>
> I'm still concerned about having an oNode that's larger than a 4K block.
>
> > Anyway, I'll get the blob separation branch working and we can go from
> > there...
> >
> > sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html