* bluestore blobs REVISITED
@ 2016-08-20  5:33 Allen Samuels
  2016-08-21 16:08 ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-20  5:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I have another proposal (it just occurred to me, so it might not survive more scrutiny).

Yes, we should remove the blob-map from the oNode.

But I believe we should also remove the lextent map from the oNode and make each lextent be an independent KV value.

However, in the special case where each extent --exactly-- maps onto a blob AND the blob is not referenced by any other extent (which is the typical case, unless you're doing compression with strange-ish overlaps) -- then you encode the blob in the lextent itself and there's no separate blob entry.
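
To make this concrete, here is a rough sketch of how a per-lextent key could be formed so that an object's lextents sort together by logical offset. The 'L' infix and the helper name are invented for illustration; this is not existing BlueStore code. In the common 1:1 unshared case the value for that key would carry the blob encoded inline; only a shared blob would carry a reference to a separate blob key.

  #include <cstdint>
  #include <string>

  // Sketch only: per-lextent key = <onode key> + 'L' + big-endian offset,
  // so all lextents of one object sort contiguously by logical offset.
  static std::string lextent_key(const std::string& onode_key,
                                 uint64_t logical_offset) {
    std::string k = onode_key;
    k.push_back('L');
    for (int shift = 56; shift >= 0; shift -= 8)          // big-endian so that
      k.push_back(char((logical_offset >> shift) & 0xff)); // byte order == numeric order
    return k;
  }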

This is pretty much exactly the same number of KV fetches as what you proposed before when the blob isn't shared (the typical case) -- except the oNode is MUCH MUCH smaller now.

So for the non-shared case, you fetch the oNode, which is now dominated by the xattrs (so figure a couple of hundred bytes and not much CPU cost to deserialize), and then fetch the lextent from the KV store (which is 1 fetch -- unless it overlaps two previous lextents). In the optimal case the KV fetch is small (10-15 bytes) and trivial to deserialize. If it's an unshared/local blob then you're ready to go; if the blob is shared (locally or globally) then you'll have to go fetch that one too.
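
To make the fetch accounting concrete, here's a toy model of that path (std::map standing in for the KV store, all names invented): one get for the lextent, plus one more only when the blob is shared.

  #include <cstdint>
  #include <map>
  #include <string>

  // Toy model only -- std::map stands in for the KV store.
  struct LextentValue {
    uint32_t length = 0;
    bool blob_inline = true;     // typical unshared 1:1 extent/blob case
    uint64_t shared_blob_id = 0; // meaningful only when !blob_inline
  };

  struct ToyStore {
    std::map<std::string, LextentValue> lextents;
    std::map<uint64_t, std::string> shared_blobs;
    int kv_fetches = 0;
  };

  // Returns how many KV fetches the lextent/blob part of a read costs.
  static int read_cost(ToyStore& kv, const std::string& lex_key) {
    ++kv.kv_fetches;                          // fetch the lextent itself
    auto it = kv.lextents.find(lex_key);
    if (it != kv.lextents.end() && !it->second.blob_inline) {
      ++kv.kv_fetches;                        // shared blob: one extra fetch
      (void)kv.shared_blobs.count(it->second.shared_blob_id);
    }
    return kv.kv_fetches;
  }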

This might lead to the elimination of the local/global blob thing (I think you've talked about that before) as now the only "local" blobs are the unshared single extent blobs which are stored inline with the lextent entry. You'll still have the special cases of promoting unshared (inline) blobs to global blobs -- which is probably similar to the current promotion "magic" on a clone operation.

The current refmap concept may require some additional work. I believe we'll have to reconstruct the refmap, but fortunately only for the range of the current I/O. That will be a bit more expensive, but still less expensive than reconstructing the entire refmap for every oNode deserialization. Fortunately, I believe the refmap is only really needed for compression cases or for RBD cases without "trim" (this is the case to optimize -- it'll make trim really important for performance).
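
For the range-limited refmap rebuild, something along these lines is what I have in mind (sketch only, invented types): walk just the lextents that overlap the I/O range and accumulate per-blob reference counts from them.

  #include <cstdint>
  #include <map>
  #include <utility>

  // start offset -> (length, blob id); stands in for decoded lextent entries.
  using LextentMap = std::map<uint64_t, std::pair<uint32_t, uint64_t>>;

  // Rebuild blob reference counts only for lextents overlapping [off, off+len).
  static std::map<uint64_t, unsigned>
  build_refmap_for_range(const LextentMap& lex, uint64_t off, uint64_t len) {
    std::map<uint64_t, unsigned> blob_refs;   // blob id -> refs within range
    auto it = lex.upper_bound(off);
    if (it != lex.begin())
      --it;                                   // the lextent that may start before 'off'
    for (; it != lex.end() && it->first < off + len; ++it) {
      uint64_t lex_end = it->first + it->second.first;
      if (lex_end <= off)
        continue;                             // ends before the range starts
      ++blob_refs[it->second.second];
    }
    return blob_refs;
  }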

Best of both worlds????

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Friday, August 19, 2016 7:16 AM
> To: Sage Weil <sweil@redhat.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, August 19, 2016 6:53 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs
> >
> > On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Thursday, August 18, 2016 8:10 AM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: RE: bluestore blobs
> > > >
> > > > On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > > To: ceph-devel@vger.kernel.org
> > > > > > Subject: bluestore blobs
> > > > > >
> > > > > > I think we need to look at other changes in addition to the
> > > > > > encoding performance improvements.  Even if they end up being
> > > > > > good enough, these changes are somewhat orthogonal and at
> > > > > > least one of them should give us something that is even faster.
> > > > > >
> > > > > > 1. I mentioned this before, but we should keep the encoded
> > > > > > bluestore_blob_t around when we load the blob map.  If it's
> > > > > > not changed, don't reencode it.  There are no blockers for
> > > > > > implementing this
> > > > currently.
> > > > > > It may be difficult to ensure the blobs are properly marked dirty...
> > > > > > I'll see if we can use proper accessors for the blob to
> > > > > > enforce this at compile time.  We should do that anyway.
> > > > >
> > > > > If it's not changed, then why are we re-writing it? I'm having a
> > > > > hard time thinking of a case worth optimizing where I want to
> > > > > re-write the oNode but the blob_map is unchanged. Am I missing
> > something obvious?
> > > >
> > > > An onode's blob_map might have 300 blobs, and a single write only
> > > > updates one of them.  The other 299 blobs need not be reencoded,
> > > > just
> > memcpy'd.
> > >
> > > As long as we're just appending that's a good optimization. How
> > > often does that happen? It's certainly not going to help the RBD 4K
> > > random write problem.
> >
> > It won't help the (l)extent_map encoding, but it avoids almost all of
> > the blob reencoding.  A 4k random write will update one blob out of
> > ~100 (or whatever it is).
> >
> > > > > > 2. This turns the blob Put into rocksdb into two memcpy stages:
> > > > > > one to assemble the bufferlist (lots of bufferptrs to each
> > > > > > untouched
> > > > > > blob) into a single rocksdb::Slice, and another memcpy
> > > > > > somewhere inside rocksdb to copy this into the write buffer.
> > > > > > We could extend the rocksdb interface to take an iovec so that
> > > > > > the first memcpy isn't needed (and rocksdb will instead
> > > > > > iterate over our buffers and copy them directly into its write
> > > > > > buffer).  This is probably a pretty small piece of the overall
> > > > > > time... should verify with a profiler
> > > > before investing too much effort here.
> > > > >
> > > > > I doubt it's the memcpy that's really the expensive part. I'll
> > > > > bet it's that we're transcoding from an internal to an external
> > > > > representation on an element by element basis. If the iovec
> > > > > scheme is going to help, it presumes that the internal data
> > > > > structure essentially matches the external data structure so
> > > > > that only an iovec copy is required. I'm wondering how
> > > > > > compatible this is with the current concepts of lextent/blob/pextent.
> > > >
> > > > I'm thinking of the xattr case (we have a bunch of strings to copy
> > > > verbatim) and updated-one-blob-and-kept-99-unchanged case: instead
> > > > of memcpy'ing them into a big contiguous buffer and having rocksdb
> > > > memcpy
> > > > *that* into its larger buffer, give rocksdb an iovec so that the
> > > > smaller buffers are assembled only once.
> > > >
> > > > These buffers will be on the order of many 10s to a couple 100s of
> bytes.
> > > > I'm not sure where the crossover point for constructing and then
> > > > traversing an iovec vs just copying twice would be...
> > > >
> > >
> > > Yes this will eliminate the "extra" copy, but the real problem is
> > > that the oNode itself is just too large. I doubt removing one extra
> > > copy is going to suddenly "solve" this problem. I think we're going
> > > to end up rejiggering things so that this will be much less of a
> > > problem than it is now -- time will tell.
> >
> > Yeah, leaving this one for last I think... until we see memcpy show up
> > in the profile.
> >
> > > > > > 3. Even if we do the above, we're still setting a big (~4k or
> > > > > > more?) key into rocksdb every time we touch an object, even
> > > > > > when a tiny
> > >
> > > See my analysis, you're looking at 8-10K for the RBD random write
> > > case
> > > -- which I think everybody cares a lot about.
> > >
> > > > > > amount of metadata is getting changed.  This is a consequence
> > > > > > of embedding all of the blobs into the onode (or bnode).  That
> > > > > > seemed like a good idea early on when they were tiny (i.e.,
> > > > > > just an extent), but now I'm not so sure.  I see a couple of
> > > > > > different
> > options:
> > > > > >
> > > > > > a) Store each blob as ($onode_key+$blobid).  When we load the
> > > > > > onode, load the blobs too.  They will hopefully be sequential
> > > > > > in rocksdb (or definitely sequential in zs).  Probably go back
> > > > > > to using an
> > iterator.
> > > > > >
> > > > > > b) Go all in on the "bnode" like concept.  Assign blob ids so
> > > > > > that they are unique for any given hash value.  Then store the
> > > > > > blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode
> > > > > > is now).  Then when clone happens there is no onode->bnode
> > > > > > migration magic happening--we've already committed to storing
> > > > > > blobs in separate keys.  When we load the onode, keep the
> > > > > > conditional bnode loading we already have.. but when the bnode
> > > > > > is loaded load up all the blobs for the hash key.  (Okay, we
> > > > > > could fault in blobs individually, but that code will be more
> > > > > > complicated.)
> > >
> > > I like this direction. I think you'll still end up demand loading
> > > the blobs in order to speed up the random read case. This scheme
> > > will result in some space-amplification, both in the lextent and in
> > > the blob-map, it's worth a bit of study to see how bad the
> > > metadata/data ratio becomes (just as a guess,
> > > $shard.$poolid.$hash.$blobid is probably 16 +
> > > 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob --
> > > unless your KV store does path compression. My reading of RocksDB
> > > sst file seems to indicate that it doesn't, I *believe* that ZS does
> > > [need to confirm]). I'm wondering if the current notion of local vs.
> > > global blobs isn't actually beneficial in that we can give local
> > > blobs different names that sort with their associated oNode (which
> > > probably makes the space-amp worse) which is an important
> > > optimization. We do need to watch the space amp, we're going to be
> > > burning DRAM to make KV accesses cheap and the amount of DRAM is
> proportional to the space amp.
> >
> > I got this mostly working last night... just need to sort out the
> > clone case (and clean up a bunch of code).  It was a relatively
> > painless transition to make, although in its current form the blobs
> > all belong to the bnode, and the bnode is ephemeral but remains in
> memory until all referencing onodes go away.
> > Mostly fine, except it means that odd combinations of clone could
> > leave lots of blobs in cache that don't get trimmed.  Will address that later.
> >
> > I'll try to finish it up this morning and get it passing tests and posted.
> >
> > > > > > In both these cases, a write will dirty the onode (which is
> > > > > > back to being pretty small.. just xattrs and the lextent map)
> > > > > > and 1-3 blobs (also
> > > > now small keys).
> > >
> > > I'm not sure the oNode is going to be that small. Looking at the RBD
> > > random 4K write case, you're going to have 1K entries each of which
> > > has an offset, size and a blob-id reference in them. In my current
> > > oNode compression scheme this compresses to about 1 byte per entry.
> > > However, this optimization relies on being able to cheaply renumber
> > > the blob-ids, which is no longer possible when the blob-ids become
> > > parts of a key (see above). So now you'll have a minimum of 1.5-3
> > > bytes extra for each blob-id (because you can't assume that the
> > > blob-ids
> > become "dense"
> > > anymore). So you're looking at 2.5-4 bytes per entry or about 2.5-4K
> > > Bytes of lextent table. Worse, because of the variable length
> > > encoding you'll have to scan the entire table to deserialize it
> > > (yes, we could do differential editing when we write but that's another
> discussion).
> > > Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). So
> > > while this looks small compared to the current ~30K for the entire
> > > thing oNode/lextent/blobmap, it's NOT a huge gain over 8-10K of the
> > > compressed oNode/lextent/blobmap scheme that I published earlier.
> > >
> > > If we want to do better we will need to separate the lextent from
> > > the oNode also. It's relatively easy to move the lextents into the
> > > KV store itself (there are two obvious ways to deal with this,
> > > either use the native offset/size from the lextent itself OR create
> > > 'N' buckets of logical offset into which we pour entries -- both of
> > > these would add somewhere between 1 and 2 KV look-ups per operation
> > > -- here is where an iterator would probably help.)
> > >
> > > Unfortunately, if you only process a portion of the lextent (because
> > > you've made it into multiple keys and you don't want to load all of
> > > them) you no longer can re-generate the refmap on the fly (another
> > > key space optimization). The lack of refmap screws up a number of
> > > other important algorithms -- for example the overlapping blob-map
> thing, etc.
> > > Not sure if these are easy to rewrite or not -- too complicated to
> > > think about at this hour of the evening.
> >
> > Yeah, I forgot about the extent_map and how big it gets.  I think,
> > though, that if we can get a 4mb object with 1024 4k lextents to
> > encode the whole onode and extent_map in under 4K that will be good
> > enough.  The blob update that goes with it will be ~200 bytes, and
> > benchmarks aside, the 4k random write 100% fragmented object is a worst
> case.
> 
> Yes, it's a worst-case. But it's a "worst-case-that-everybody-looks-at" vs. a
> "worst-case-that-almost-nobody-looks-at".
> 
> I'm still concerned about having an oNode that's larger than a 4K block.
> 
> 
> >
> > Anyway, I'll get the blob separation branch working and we can go from
> > there...
> >
> > sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


* Re: bluestore blobs REVISITED
  2016-08-20  5:33 bluestore blobs REVISITED Allen Samuels
@ 2016-08-21 16:08 ` Sage Weil
  2016-08-21 17:27   ` Allen Samuels
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2016-08-21 16:08 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Sat, 20 Aug 2016, Allen Samuels wrote:
> I have another proposal (it just occurred to me, so it might not survive 
> more scrutiny).
> 
> Yes, we should remove the blob-map from the oNode.
> 
> But I believe we should also remove the lextent map from the oNode and 
> make each lextent be an independent KV value.
> 
> However, in the special case where each extent --exactly-- maps onto a 
> blob AND the blob is not referenced by any other extent (which is the 
> typical case, unless you're doing compression with strange-ish overlaps) 
> -- then you encode the blob in the lextent itself and there's no 
> separate blob entry.
> 
> This is pretty much exactly the same number of KV fetches as what you 
> proposed before when the blob isn't shared (the typical case) -- except 
> the oNode is MUCH MUCH smaller now.

I think this approach makes a lot of sense!  The only thing I'm worried 
about is that the lextent keys are no longer known when they are being 
fetched (since they will be a logical offset), which means we'll have to 
use an iterator instead of a simple get.  The former is quite a bit slower 
than the latter (which can make use of the rocksdb caches and/or key bloom 
filters more easily).
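
Concretely, the difference I'm worried about is roughly the one below (illustrative RocksDB snippet only; key construction is assumed and error handling is omitted): a point Get when the exact key is known, versus an iterator positioned at the last key at or before the logical offset when it isn't.

  #include <memory>
  #include <string>
  #include <rocksdb/db.h>

  // (a) exact key known: cheap point lookup, benefits from block cache/bloom filters
  std::string get_lextent(rocksdb::DB* db, const std::string& exact_key) {
    std::string value;
    db->Get(rocksdb::ReadOptions(), exact_key, &value);
    return value;
  }

  // (b) only the logical offset known: need an iterator to land on the
  // last key <= search_key (the lextent that covers the offset)
  std::string find_covering_lextent(rocksdb::DB* db, const std::string& search_key) {
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
    it->SeekForPrev(search_key);
    return it->Valid() ? it->value().ToString() : std::string();
  }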

We could augment your approach by keeping *just* the lextent offsets in 
the onode, so that we know exactly which lextent key to fetch, but then 
I'm not sure we'll get much benefit (lextent metadata size goes down by 
~1/3, but then we have an extra get for cloned objects).
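
i.e. something like keeping a sorted vector of lextent start offsets in the onode and doing the lower_bound in memory, so the KV access stays a plain get (sketch, invented names):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // The onode keeps only the sorted lextent start offsets; the covering
  // lextent's exact key can then be computed and fetched with a point get.
  static bool covering_lextent_start(const std::vector<uint64_t>& starts,
                                     uint64_t logical_offset,
                                     uint64_t* start_out) {
    auto it = std::upper_bound(starts.begin(), starts.end(), logical_offset);
    if (it == starts.begin())
      return false;                   // offset precedes every lextent
    *start_out = *(it - 1);           // feed this into the lextent key builder
    return true;
  }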
 
Hmm, the other thing to keep in mind is that for RBD the common case is 
that lots of objects have clones, and many of those objects' blobs will be 
shared.

sage

> So for the non-shared case, you fetch the oNode which is dominated by 
> the xattrs now (so figure a couple of hundred bytes and not much CPU 
> cost to deserialize). And then fetch from the KV for the lextent (which 
> is 1 fetch -- unless it overlaps two previous lextents). If it's the 
> optimal case, the KV fetch is small (10-15 bytes) and trivial to 
> deserialize. If it's an unshared/local blob then you're ready to go. If 
> the blob is shared (locally or globally) then you'll have to go fetch 
> that one too.
> 
> This might lead to the elimination of the local/global blob thing (I 
> think you've talked about that before) as now the only "local" blobs are 
> the unshared single extent blobs which are stored inline with the 
> lextent entry. You'll still have the special cases of promoting unshared 
> (inline) blobs to global blobs -- which is probably similar to the 
> current promotion "magic" on a clone operation.
> 
> The current refmap concept may require some additional work. I believe 
> that we'll have to do a reconstruction of the refmap, but fortunately 
> only for the range of the current I/O. That will be a bit more 
> expensive, but still less expensive than reconstructing the entire 
> refmap for every oNode deserialization, Fortunately I believe the refmap 
> is only really needed for compression cases or RBD cases without "trim" 
> (this is the case to optimize -- it'll make trim really important for 
> performance).
> 
> Best of both worlds????
> 
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > Sent: Friday, August 19, 2016 7:16 AM
> > To: Sage Weil <sweil@redhat.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs
> > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, August 19, 2016 6:53 AM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore blobs
> > >
> > > On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Thursday, August 18, 2016 8:10 AM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > Cc: ceph-devel@vger.kernel.org
> > > > > Subject: RE: bluestore blobs
> > > > >
> > > > > On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > > > To: ceph-devel@vger.kernel.org
> > > > > > > Subject: bluestore blobs
> > > > > > >
> > > > > > > I think we need to look at other changes in addition to the
> > > > > > > encoding performance improvements.  Even if they end up being
> > > > > > > good enough, these changes are somewhat orthogonal and at
> > > > > > > least one of them should give us something that is even faster.
> > > > > > >
> > > > > > > 1. I mentioned this before, but we should keep the encoding
> > > > > > > bluestore_blob_t around when we load the blob map.  If it's
> > > > > > > not changed, don't reencode it.  There are no blockers for
> > > > > > > implementing this
> > > > > currently.
> > > > > > > It may be difficult to ensure the blobs are properly marked dirty...
> > > > > > > I'll see if we can use proper accessors for the blob to
> > > > > > > enforce this at compile time.  We should do that anyway.
> > > > > >
> > > > > > If it's not changed, then why are we re-writing it? I'm having a
> > > > > > hard time thinking of a case worth optimizing where I want to
> > > > > > re-write the oNode but the blob_map is unchanged. Am I missing
> > > something obvious?
> > > > >
> > > > > An onode's blob_map might have 300 blobs, and a single write only
> > > > > updates one of them.  The other 299 blobs need not be reencoded,
> > > > > just
> > > memcpy'd.
> > > >
> > > > As long as we're just appending that's a good optimization. How
> > > > often does that happen? It's certainly not going to help the RBD 4K
> > > > random write problem.
> > >
> > > It won't help the (l)extent_map encoding, but it avoids almost all of
> > > the blob reencoding.  A 4k random write will update one blob out of
> > > ~100 (or whatever it is).
> > >
> > > > > > > 2. This turns the blob Put into rocksdb into two memcpy stages:
> > > > > > > one to assemble the bufferlist (lots of bufferptrs to each
> > > > > > > untouched
> > > > > > > blob) into a single rocksdb::Slice, and another memcpy
> > > > > > > somewhere inside rocksdb to copy this into the write buffer.
> > > > > > > We could extend the rocksdb interface to take an iovec so that
> > > > > > > the first memcpy isn't needed (and rocksdb will instead
> > > > > > > iterate over our buffers and copy them directly into its write
> > > > > > > buffer).  This is probably a pretty small piece of the overall
> > > > > > > time... should verify with a profiler
> > > > > before investing too much effort here.
> > > > > >
> > > > > > I doubt it's the memcpy that's really the expensive part. I'll
> > > > > > bet it's that we're transcoding from an internal to an external
> > > > > > representation on an element by element basis. If the iovec
> > > > > > scheme is going to help, it presumes that the internal data
> > > > > > structure essentially matches the external data structure so
> > > > > > that only an iovec copy is required. I'm wondering how
> > > > > > compatible this is with the current concepts of lextext/blob/pextent.
> > > > >
> > > > > I'm thinking of the xattr case (we have a bunch of strings to copy
> > > > > verbatim) and updated-one-blob-and-kept-99-unchanged case: instead
> > > > > of memcpy'ing them into a big contiguous buffer and having rocksdb
> > > > > memcpy
> > > > > *that* into it's larger buffer, give rocksdb an iovec so that they
> > > > > smaller buffers are assembled only once.
> > > > >
> > > > > These buffers will be on the order of many 10s to a couple 100s of
> > bytes.
> > > > > I'm not sure where the crossover point for constructing and then
> > > > > traversing an iovec vs just copying twice would be...
> > > > >
> > > >
> > > > Yes this will eliminate the "extra" copy, but the real problem is
> > > > that the oNode itself is just too large. I doubt removing one extra
> > > > copy is going to suddenly "solve" this problem. I think we're going
> > > > to end up rejiggering things so that this will be much less of a
> > > > problem than it is now -- time will tell.
> > >
> > > Yeah, leaving this one for last I think... until we see memcpy show up
> > > in the profile.
> > >
> > > > > > > 3. Even if we do the above, we're still setting a big (~4k or
> > > > > > > more?) key into rocksdb every time we touch an object, even
> > > > > > > when a tiny
> > > >
> > > > See my analysis, you're looking at 8-10K for the RBD random write
> > > > case
> > > > -- which I think everybody cares a lot about.
> > > >
> > > > > > > amount of metadata is getting changed.  This is a consequence
> > > > > > > of embedding all of the blobs into the onode (or bnode).  That
> > > > > > > seemed like a good idea early on when they were tiny (i.e.,
> > > > > > > just an extent), but now I'm not so sure.  I see a couple of
> > > > > > > different
> > > options:
> > > > > > >
> > > > > > > a) Store each blob as ($onode_key+$blobid).  When we load the
> > > > > > > onode, load the blobs too.  They will hopefully be sequential
> > > > > > > in rocksdb (or definitely sequential in zs).  Probably go back
> > > > > > > to using an
> > > iterator.
> > > > > > >
> > > > > > > b) Go all in on the "bnode" like concept.  Assign blob ids so
> > > > > > > that they are unique for any given hash value.  Then store the
> > > > > > > blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode
> > > > > > > is now).  Then when clone happens there is no onode->bnode
> > > > > > > migration magic happening--we've already committed to storing
> > > > > > > blobs in separate keys.  When we load the onode, keep the
> > > > > > > conditional bnode loading we already have.. but when the bnode
> > > > > > > is loaded load up all the blobs for the hash key.  (Okay, we
> > > > > > > could fault in blobs individually, but that code will be more
> > > > > > > complicated.)
> > > >
> > > > I like this direction. I think you'll still end up demand loading
> > > > the blobs in order to speed up the random read case. This scheme
> > > > will result in some space-amplification, both in the lextent and in
> > > > the blob-map, it's worth a bit of study too see how bad the
> > > > metadata/data ratio becomes (just as a guess,
> > > > $shard.$poolid.$hash.$blobid is probably 16 +
> > > > 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob --
> > > > unless your KV store does path compression. My reading of RocksDB
> > > > sst file seems to indicate that it doesn't, I *believe* that ZS does
> > > > [need to confirm]). I'm wondering if the current notion of local vs.
> > > > global blobs isn't actually beneficial in that we can give local
> > > > blobs different names that sort with their associated oNode (which
> > > > probably makes the space-amp worse) which is an important
> > > > optimization. We do need to watch the space amp, we're going to be
> > > > burning DRAM to make KV accesses cheap and the amount of DRAM is
> > proportional to the space amp.
> > >
> > > I got this mostly working last night... just need to sort out the
> > > clone case (and clean up a bunch of code).  It was a relatively
> > > painless transition to make, although in its current form the blobs
> > > all belong to the bnode, and the bnode if ephemeral but remains in
> > memory until all referencing onodes go away.
> > > Mostly fine, except it means that odd combinations of clone could
> > > leave lots of blobs in cache that don't get trimmed.  Will address that later.
> > >
> > > I'll try to finish it up this morning and get it passing tests and posted.
> > >
> > > > > > > In both these cases, a write will dirty the onode (which is
> > > > > > > back to being pretty small.. just xattrs and the lextent map)
> > > > > > > and 1-3 blobs (also
> > > > > now small keys).
> > > >
> > > > I'm not sure the oNode is going to be that small. Looking at the RBD
> > > > random 4K write case, you're going to have 1K entries each of which
> > > > has an offset, size and a blob-id reference in them. In my current
> > > > oNode compression scheme this compresses to about 1 byte per entry.
> > > > However, this optimization relies on being able to cheaply renumber
> > > > the blob-ids, which is no longer possible when the blob-ids become
> > > > parts of a key (see above). So now you'll have a minimum of 1.5-3
> > > > bytes extra for each blob-id (because you can't assume that the
> > > > blob-ids
> > > become "dense"
> > > > anymore) So you're looking at 2.5-4 bytes per entry or about 2.5-4K
> > > > Bytes of lextent table. Worse, because of the variable length
> > > > encoding you'll have to scan the entire table to deserialize it
> > > > (yes, we could do differential editing when we write but that's another
> > discussion).
> > > > Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). So
> > > > while this looks small compared to the current ~30K for the entire
> > > > thing oNode/lextent/blobmap, it's NOT a huge gain over 8-10K of the
> > > > compressed oNode/lextent/blobmap scheme that I published earlier.
> > > >
> > > > If we want to do better we will need to separate the lextent from
> > > > the oNode also. It's relatively easy to move the lextents into the
> > > > KV store itself (there are two obvious ways to deal with this,
> > > > either use the native offset/size from the lextent itself OR create
> > > > 'N' buckets of logical offset into which we pour entries -- both of
> > > > these would add somewhere between 1 and 2 KV look-ups per operation
> > > > -- here is where an iterator would probably help.
> > > >
> > > > Unfortunately, if you only process a portion of the lextent (because
> > > > you've made it into multiple keys and you don't want to load all of
> > > > them) you no longer can re-generate the refmap on the fly (another
> > > > key space optimization). The lack of refmap screws up a number of
> > > > other important algorithms -- for example the overlapping blob-map
> > thing, etc.
> > > > Not sure if these are easy to rewrite or not -- too complicated to
> > > > think about at this hour of the evening.
> > >
> > > Yeah, I forgot about the extent_map and how big it gets.  I think,
> > > though, that if we can get a 4mb object with 1024 4k lextents to
> > > encode the whole onode and extent_map in under 4K that will be good
> > > enough.  The blob update that goes with it will be ~200 bytes, and
> > > benchmarks aside, the 4k random write 100% fragmented object is a worst
> > case.
> > 
> > Yes, it's a worst-case. But it's a "worst-case-that-everybody-looks-at" vs. a
> > "worst-case-that-almost-nobody-looks-at".
> > 
> > I'm still concerned about having an oNode that's larger than a 4K block.
> > 
> > 
> > >
> > > Anyway, I'll get the blob separation branch working and we can go from
> > > there...
> > >
> > > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> > body of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: bluestore blobs REVISITED
  2016-08-21 16:08 ` Sage Weil
@ 2016-08-21 17:27   ` Allen Samuels
  2016-08-22  1:41     ` Allen Samuels
  0 siblings, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-21 17:27 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I wonder how hard it would be to add a "lower-bound" fetch like STL's. That would allow the KV store to do the fetch without incurring the overhead of a snapshot for the iteration scan.
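
Something like the following on the KV abstraction is what I'm picturing (purely hypothetical interface and names, not the existing one):

  #include <string>

  // Hypothetical extension to the KV-store abstraction: return the first
  // key/value at or after 'key' in 'prefix' without creating an
  // iterator/snapshot.  An "at or before" variant would serve the lextent
  // lookup the same way.
  class LowerBoundKV {
  public:
    virtual ~LowerBoundKV() = default;

    // Fills *found_key (may equal 'key') and *value on success; returns a
    // negative error code if no such key exists.
    virtual int lower_bound_get(const std::string& prefix,
                                const std::string& key,
                                std::string* found_key,
                                std::string* value) = 0;
  };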

Shared blobs were always going to trigger an extra kv fetch no matter what. 

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com> wrote:
> 
>> On Sat, 20 Aug 2016, Allen Samuels wrote:
>> I have another proposal (it just occurred to me, so it might not survive 
>> more scrutiny).
>> 
>> Yes, we should remove the blob-map from the oNode.
>> 
>> But I believe we should also remove the lextent map from the oNode and 
>> make each lextent be an independent KV value.
>> 
>> However, in the special case where each extent --exactly-- maps onto a 
>> blob AND the blob is not referenced by any other extent (which is the 
>> typical case, unless you're doing compression with strange-ish overlaps) 
>> -- then you encode the blob in the lextent itself and there's no 
>> separate blob entry.
>> 
>> This is pretty much exactly the same number of KV fetches as what you 
>> proposed before when the blob isn't shared (the typical case) -- except 
>> the oNode is MUCH MUCH smaller now.
> 
> I think this approach makes a lot of sense!  The only thing I'm worried 
> about is that the lextent keys are no longer known when they are being 
> fetched (since they will be a logical offset), which means we'll have to 
> use an iterator instead of a simple get.  The former is quite a bit slower 
> than the latter (which can make use of the rocksdb caches and/or key bloom 
> filters more easily).
> 
> We could augment your approach by keeping *just* the lextent offsets in 
> the onode, so that we know exactly which lextent key to fetch, but then 
> I'm not sure we'll get much benefit (lextent metadata size goes down by 
> ~1/3, but then we have an extra get for cloned objects).
> 
> Hmm, the other thing to keep in mind is that for RBD the common case is 
> that lots of objects have clones, and many of those objects' blobs will be 
> shared.
> 
> sage
> 
>> So for the non-shared case, you fetch the oNode which is dominated by 
>> the xattrs now (so figure a couple of hundred bytes and not much CPU 
>> cost to deserialize). And then fetch from the KV for the lextent (which 
>> is 1 fetch -- unless it overlaps two previous lextents). If it's the 
>> optimal case, the KV fetch is small (10-15 bytes) and trivial to 
>> deserialize. If it's an unshared/local blob then you're ready to go. If 
>> the blob is shared (locally or globally) then you'll have to go fetch 
>> that one too.
>> 
>> This might lead to the elimination of the local/global blob thing (I 
>> think you've talked about that before) as now the only "local" blobs are 
>> the unshared single extent blobs which are stored inline with the 
>> lextent entry. You'll still have the special cases of promoting unshared 
>> (inline) blobs to global blobs -- which is probably similar to the 
>> current promotion "magic" on a clone operation.
>> 
>> The current refmap concept may require some additional work. I believe 
>> that we'll have to do a reconstruction of the refmap, but fortunately 
>> only for the range of the current I/O. That will be a bit more 
>> expensive, but still less expensive than reconstructing the entire 
>> refmap for every oNode deserialization, Fortunately I believe the refmap 
>> is only really needed for compression cases or RBD cases without "trim" 
>> (this is the case to optimize -- it'll make trim really important for 
>> performance).
>> 
>> Best of both worlds????
>> 
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>> 
>> 
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Friday, August 19, 2016 7:16 AM
>>> To: Sage Weil <sweil@redhat.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: bluestore blobs
>>> 
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>> Sent: Friday, August 19, 2016 6:53 AM
>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: bluestore blobs
>>>> 
>>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
>>>>>> -----Original Message-----
>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>> Sent: Thursday, August 18, 2016 8:10 AM
>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>> Subject: RE: bluestore blobs
>>>>>> 
>>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
>>>>>>>> To: ceph-devel@vger.kernel.org
>>>>>>>> Subject: bluestore blobs
>>>>>>>> 
>>>>>>>> I think we need to look at other changes in addition to the
>>>>>>>> encoding performance improvements.  Even if they end up being
>>>>>>>> good enough, these changes are somewhat orthogonal and at
>>>>>>>> least one of them should give us something that is even faster.
>>>>>>>> 
>>>>>>>> 1. I mentioned this before, but we should keep the encoding
>>>>>>>> bluestore_blob_t around when we load the blob map.  If it's
>>>>>>>> not changed, don't reencode it.  There are no blockers for
>>>>>>>> implementing this
>>>>>> currently.
>>>>>>>> It may be difficult to ensure the blobs are properly marked dirty...
>>>>>>>> I'll see if we can use proper accessors for the blob to
>>>>>>>> enforce this at compile time.  We should do that anyway.
>>>>>>> 
>>>>>>> If it's not changed, then why are we re-writing it? I'm having a
>>>>>>> hard time thinking of a case worth optimizing where I want to
>>>>>>> re-write the oNode but the blob_map is unchanged. Am I missing
>>>> something obvious?
>>>>>> 
>>>>>> An onode's blob_map might have 300 blobs, and a single write only
>>>>>> updates one of them.  The other 299 blobs need not be reencoded,
>>>>>> just
>>>> memcpy'd.
>>>>> 
>>>>> As long as we're just appending that's a good optimization. How
>>>>> often does that happen? It's certainly not going to help the RBD 4K
>>>>> random write problem.
>>>> 
>>>> It won't help the (l)extent_map encoding, but it avoids almost all of
>>>> the blob reencoding.  A 4k random write will update one blob out of
>>>> ~100 (or whatever it is).
>>>> 
>>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages:
>>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each
>>>>>>>> untouched
>>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
>>>>>>>> somewhere inside rocksdb to copy this into the write buffer.
>>>>>>>> We could extend the rocksdb interface to take an iovec so that
>>>>>>>> the first memcpy isn't needed (and rocksdb will instead
>>>>>>>> iterate over our buffers and copy them directly into its write
>>>>>>>> buffer).  This is probably a pretty small piece of the overall
>>>>>>>> time... should verify with a profiler
>>>>>> before investing too much effort here.
>>>>>>> 
>>>>>>> I doubt it's the memcpy that's really the expensive part. I'll
>>>>>>> bet it's that we're transcoding from an internal to an external
>>>>>>> representation on an element by element basis. If the iovec
>>>>>>> scheme is going to help, it presumes that the internal data
>>>>>>> structure essentially matches the external data structure so
>>>>>>> that only an iovec copy is required. I'm wondering how
>>>>>>> compatible this is with the current concepts of lextext/blob/pextent.
>>>>>> 
>>>>>> I'm thinking of the xattr case (we have a bunch of strings to copy
>>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case: instead
>>>>>> of memcpy'ing them into a big contiguous buffer and having rocksdb
>>>>>> memcpy
>>>>>> *that* into it's larger buffer, give rocksdb an iovec so that they
>>>>>> smaller buffers are assembled only once.
>>>>>> 
>>>>>> These buffers will be on the order of many 10s to a couple 100s of
>>> bytes.
>>>>>> I'm not sure where the crossover point for constructing and then
>>>>>> traversing an iovec vs just copying twice would be...
>>>>> 
>>>>> Yes this will eliminate the "extra" copy, but the real problem is
>>>>> that the oNode itself is just too large. I doubt removing one extra
>>>>> copy is going to suddenly "solve" this problem. I think we're going
>>>>> to end up rejiggering things so that this will be much less of a
>>>>> problem than it is now -- time will tell.
>>>> 
>>>> Yeah, leaving this one for last I think... until we see memcpy show up
>>>> in the profile.
>>>> 
>>>>>>>> 3. Even if we do the above, we're still setting a big (~4k or
>>>>>>>> more?) key into rocksdb every time we touch an object, even
>>>>>>>> when a tiny
>>>>> 
>>>>> See my analysis, you're looking at 8-10K for the RBD random write
>>>>> case
>>>>> -- which I think everybody cares a lot about.
>>>>> 
>>>>>>>> amount of metadata is getting changed.  This is a consequence
>>>>>>>> of embedding all of the blobs into the onode (or bnode).  That
>>>>>>>> seemed like a good idea early on when they were tiny (i.e.,
>>>>>>>> just an extent), but now I'm not so sure.  I see a couple of
>>>>>>>> different
>>>> options:
>>>>>>>> 
>>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we load the
>>>>>>>> onode, load the blobs too.  They will hopefully be sequential
>>>>>>>> in rocksdb (or definitely sequential in zs).  Probably go back
>>>>>>>> to using an
>>>> iterator.
>>>>>>>> 
>>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids so
>>>>>>>> that they are unique for any given hash value.  Then store the
>>>>>>>> blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode
>>>>>>>> is now).  Then when clone happens there is no onode->bnode
>>>>>>>> migration magic happening--we've already committed to storing
>>>>>>>> blobs in separate keys.  When we load the onode, keep the
>>>>>>>> conditional bnode loading we already have.. but when the bnode
>>>>>>>> is loaded load up all the blobs for the hash key.  (Okay, we
>>>>>>>> could fault in blobs individually, but that code will be more
>>>>>>>> complicated.)
>>>>> 
>>>>> I like this direction. I think you'll still end up demand loading
>>>>> the blobs in order to speed up the random read case. This scheme
>>>>> will result in some space-amplification, both in the lextent and in
>>>>> the blob-map, it's worth a bit of study too see how bad the
>>>>> metadata/data ratio becomes (just as a guess,
>>>>> $shard.$poolid.$hash.$blobid is probably 16 +
>>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob --
>>>>> unless your KV store does path compression. My reading of RocksDB
>>>>> sst file seems to indicate that it doesn't, I *believe* that ZS does
>>>>> [need to confirm]). I'm wondering if the current notion of local vs.
>>>>> global blobs isn't actually beneficial in that we can give local
>>>>> blobs different names that sort with their associated oNode (which
>>>>> probably makes the space-amp worse) which is an important
>>>>> optimization. We do need to watch the space amp, we're going to be
>>>>> burning DRAM to make KV accesses cheap and the amount of DRAM is
>>> proportional to the space amp.
>>>> 
>>>> I got this mostly working last night... just need to sort out the
>>>> clone case (and clean up a bunch of code).  It was a relatively
>>>> painless transition to make, although in its current form the blobs
>>>> all belong to the bnode, and the bnode if ephemeral but remains in
>>> memory until all referencing onodes go away.
>>>> Mostly fine, except it means that odd combinations of clone could
>>>> leave lots of blobs in cache that don't get trimmed.  Will address that later.
>>>> 
>>>> I'll try to finish it up this morning and get it passing tests and posted.
>>>> 
>>>>>>>> In both these cases, a write will dirty the onode (which is
>>>>>>>> back to being pretty small.. just xattrs and the lextent map)
>>>>>>>> and 1-3 blobs (also
>>>>>> now small keys).
>>>>> 
>>>>> I'm not sure the oNode is going to be that small. Looking at the RBD
>>>>> random 4K write case, you're going to have 1K entries each of which
>>>>> has an offset, size and a blob-id reference in them. In my current
>>>>> oNode compression scheme this compresses to about 1 byte per entry.
>>>>> However, this optimization relies on being able to cheaply renumber
>>>>> the blob-ids, which is no longer possible when the blob-ids become
>>>>> parts of a key (see above). So now you'll have a minimum of 1.5-3
>>>>> bytes extra for each blob-id (because you can't assume that the
>>>>> blob-ids
>>>> become "dense"
>>>>> anymore) So you're looking at 2.5-4 bytes per entry or about 2.5-4K
>>>>> Bytes of lextent table. Worse, because of the variable length
>>>>> encoding you'll have to scan the entire table to deserialize it
>>>>> (yes, we could do differential editing when we write but that's another
>>> discussion).
>>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). So
>>>>> while this looks small compared to the current ~30K for the entire
>>>>> thing oNode/lextent/blobmap, it's NOT a huge gain over 8-10K of the
>>>>> compressed oNode/lextent/blobmap scheme that I published earlier.
>>>>> 
>>>>> If we want to do better we will need to separate the lextent from
>>>>> the oNode also. It's relatively easy to move the lextents into the
>>>>> KV store itself (there are two obvious ways to deal with this,
>>>>> either use the native offset/size from the lextent itself OR create
>>>>> 'N' buckets of logical offset into which we pour entries -- both of
>>>>> these would add somewhere between 1 and 2 KV look-ups per operation
>>>>> -- here is where an iterator would probably help.
>>>>> 
>>>>> Unfortunately, if you only process a portion of the lextent (because
>>>>> you've made it into multiple keys and you don't want to load all of
>>>>> them) you no longer can re-generate the refmap on the fly (another
>>>>> key space optimization). The lack of refmap screws up a number of
>>>>> other important algorithms -- for example the overlapping blob-map
>>> thing, etc.
>>>>> Not sure if these are easy to rewrite or not -- too complicated to
>>>>> think about at this hour of the evening.
>>>> 
>>>> Yeah, I forgot about the extent_map and how big it gets.  I think,
>>>> though, that if we can get a 4mb object with 1024 4k lextents to
>>>> encode the whole onode and extent_map in under 4K that will be good
>>>> enough.  The blob update that goes with it will be ~200 bytes, and
>>>> benchmarks aside, the 4k random write 100% fragmented object is a worst
>>> case.
>>> 
>>> Yes, it's a worst-case. But it's a "worst-case-that-everybody-looks-at" vs. a
>>> "worst-case-that-almost-nobody-looks-at".
>>> 
>>> I'm still concerned about having an oNode that's larger than a 4K block.
>>> 
>>> 
>>>> 
>>>> Anyway, I'll get the blob separation branch working and we can go from
>>>> there...
>>>> 
>>>> sage
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>>> body of a message to majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 


* RE: bluestore blobs REVISITED
  2016-08-21 17:27   ` Allen Samuels
@ 2016-08-22  1:41     ` Allen Samuels
  2016-08-22 22:08       ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-22  1:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Another possibility is to "bin" the lextent table into known, fixed offset ranges. Suppose each oNode had a fixed set of LBA-range keys associated with the lextent table: say [0..128K), [128K..256K), ...

That eliminates the need to put a "lower_bound" into the KV store directly, though it'll likely be a bit more complex above and somewhat less efficient.
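
Concretely (sketch, invented names): with 128K bins the bin index is just offset >> 17, and the lextent entries for a range live under a key derived from the oNode key plus the bin number, so a read computes exactly which one or two keys to get.

  #include <cstdint>
  #include <string>

  static constexpr unsigned BIN_SHIFT = 17;            // 128 KiB bins

  // Entries for [bin*128K, (bin+1)*128K) all live under one KV key, so no
  // lower_bound is needed in the KV store itself.
  static std::string lextent_bin_key(const std::string& onode_key,
                                     uint64_t logical_offset) {
    uint32_t bin = uint32_t(logical_offset >> BIN_SHIFT);
    std::string k = onode_key;
    k.push_back('S');                                  // arbitrary infix
    for (int shift = 24; shift >= 0; shift -= 8)       // big-endian bin index
      k.push_back(char((bin >> shift) & 0xff));
    return k;
  }

  // An I/O spanning a bin boundary touches two keys:
  //   lextent_bin_key(o, off) and lextent_bin_key(o, off + len - 1)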


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Sunday, August 21, 2016 10:27 AM
> To: Sage Weil <sweil@redhat.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: bluestore blobs REVISITED
> 
> I wonder how hard it would be to add a "lower-bound" fetch like stl. That
> would allow the kv store to do the fetch without incurring the overhead of a
> snapshot for the iteration scan.
> 
> Shared blobs were always going to trigger an extra kv fetch no matter what.
> 
> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
> > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com> wrote:
> >
> >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> >> I have another proposal (it just occurred to me, so it might not
> >> survive more scrutiny).
> >>
> >> Yes, we should remove the blob-map from the oNode.
> >>
> >> But I believe we should also remove the lextent map from the oNode
> >> and make each lextent be an independent KV value.
> >>
> >> However, in the special case where each extent --exactly-- maps onto
> >> a blob AND the blob is not referenced by any other extent (which is
> >> the typical case, unless you're doing compression with strange-ish
> >> overlaps)
> >> -- then you encode the blob in the lextent itself and there's no
> >> separate blob entry.
> >>
> >> This is pretty much exactly the same number of KV fetches as what you
> >> proposed before when the blob isn't shared (the typical case) --
> >> except the oNode is MUCH MUCH smaller now.
> >
> > I think this approach makes a lot of sense!  The only thing I'm
> > worried about is that the lextent keys are no longer known when they
> > are being fetched (since they will be a logical offset), which means
> > we'll have to use an iterator instead of a simple get.  The former is
> > quite a bit slower than the latter (which can make use of the rocksdb
> > caches and/or key bloom filters more easily).
> >
> > We could augment your approach by keeping *just* the lextent offsets
> > in the onode, so that we know exactly which lextent key to fetch, but
> > then I'm not sure we'll get much benefit (lextent metadata size goes
> > down by ~1/3, but then we have an extra get for cloned objects).
> >
> > Hmm, the other thing to keep in mind is that for RBD the common case
> > is that lots of objects have clones, and many of those objects' blobs
> > will be shared.
> >
> > sage
> >
> >> So for the non-shared case, you fetch the oNode which is dominated by
> >> the xattrs now (so figure a couple of hundred bytes and not much CPU
> >> cost to deserialize). And then fetch from the KV for the lextent
> >> (which is 1 fetch -- unless it overlaps two previous lextents). If
> >> it's the optimal case, the KV fetch is small (10-15 bytes) and
> >> trivial to deserialize. If it's an unshared/local blob then you're
> >> ready to go. If the blob is shared (locally or globally) then you'll
> >> have to go fetch that one too.
> >>
> >> This might lead to the elimination of the local/global blob thing (I
> >> think you've talked about that before) as now the only "local" blobs
> >> are the unshared single extent blobs which are stored inline with the
> >> lextent entry. You'll still have the special cases of promoting
> >> unshared
> >> (inline) blobs to global blobs -- which is probably similar to the
> >> current promotion "magic" on a clone operation.
> >>
> >> The current refmap concept may require some additional work. I
> >> believe that we'll have to do a reconstruction of the refmap, but
> >> fortunately only for the range of the current I/O. That will be a bit
> >> more expensive, but still less expensive than reconstructing the
> >> entire refmap for every oNode deserialization, Fortunately I believe
> >> the refmap is only really needed for compression cases or RBD cases
> without "trim"
> >> (this is the case to optimize -- it'll make trim really important for
> >> performance).
> >>
> >> Best of both worlds????
> >>
> >> Allen Samuels
> >> SanDisk |a Western Digital brand
> >> 2880 Junction Avenue, Milpitas, CA 95134
> >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> >>> Sent: Friday, August 19, 2016 7:16 AM
> >>> To: Sage Weil <sweil@redhat.com>
> >>> Cc: ceph-devel@vger.kernel.org
> >>> Subject: RE: bluestore blobs
> >>>
> >>>> -----Original Message-----
> >>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>> Sent: Friday, August 19, 2016 6:53 AM
> >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: RE: bluestore blobs
> >>>>
> >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> >>>>>> Cc: ceph-devel@vger.kernel.org
> >>>>>> Subject: RE: bluestore blobs
> >>>>>>
> >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> >>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> >>>>>>>> To: ceph-devel@vger.kernel.org
> >>>>>>>> Subject: bluestore blobs
> >>>>>>>>
> >>>>>>>> I think we need to look at other changes in addition to the
> >>>>>>>> encoding performance improvements.  Even if they end up being
> >>>>>>>> good enough, these changes are somewhat orthogonal and at
> least
> >>>>>>>> one of them should give us something that is even faster.
> >>>>>>>>
> >>>>>>>> 1. I mentioned this before, but we should keep the encoding
> >>>>>>>> bluestore_blob_t around when we load the blob map.  If it's not
> >>>>>>>> changed, don't reencode it.  There are no blockers for
> >>>>>>>> implementing this
> >>>>>> currently.
> >>>>>>>> It may be difficult to ensure the blobs are properly marked dirty...
> >>>>>>>> I'll see if we can use proper accessors for the blob to enforce
> >>>>>>>> this at compile time.  We should do that anyway.
> >>>>>>>
> >>>>>>> If it's not changed, then why are we re-writing it? I'm having a
> >>>>>>> hard time thinking of a case worth optimizing where I want to
> >>>>>>> re-write the oNode but the blob_map is unchanged. Am I missing
> >>>> something obvious?
> >>>>>>
> >>>>>> An onode's blob_map might have 300 blobs, and a single write only
> >>>>>> updates one of them.  The other 299 blobs need not be reencoded,
> >>>>>> just
> >>>> memcpy'd.
> >>>>>
> >>>>> As long as we're just appending that's a good optimization. How
> >>>>> often does that happen? It's certainly not going to help the RBD
> >>>>> 4K random write problem.
> >>>>
> >>>> It won't help the (l)extent_map encoding, but it avoids almost all
> >>>> of the blob reencoding.  A 4k random write will update one blob out
> >>>> of
> >>>> ~100 (or whatever it is).
> >>>>
> >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages:
> >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each
> >>>>>>>> untouched
> >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
> >>>>>>>> somewhere inside rocksdb to copy this into the write buffer.
> >>>>>>>> We could extend the rocksdb interface to take an iovec so that
> >>>>>>>> the first memcpy isn't needed (and rocksdb will instead iterate
> >>>>>>>> over our buffers and copy them directly into its write buffer).
> >>>>>>>> This is probably a pretty small piece of the overall time...
> >>>>>>>> should verify with a profiler
> >>>>>> before investing too much effort here.
> >>>>>>>
> >>>>>>> I doubt it's the memcpy that's really the expensive part. I'll
> >>>>>>> bet it's that we're transcoding from an internal to an external
> >>>>>>> representation on an element by element basis. If the iovec
> >>>>>>> scheme is going to help, it presumes that the internal data
> >>>>>>> structure essentially matches the external data structure so
> >>>>>>> that only an iovec copy is required. I'm wondering how
> >>>>>>> compatible this is with the current concepts of
> lextext/blob/pextent.
> >>>>>>
> >>>>>> I'm thinking of the xattr case (we have a bunch of strings to
> >>>>>> copy
> >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case:
> >>>>>> instead of memcpy'ing them into a big contiguous buffer and
> >>>>>> having rocksdb memcpy
> >>>>>> *that* into it's larger buffer, give rocksdb an iovec so that
> >>>>>> they smaller buffers are assembled only once.
> >>>>>>
> >>>>>> These buffers will be on the order of many 10s to a couple 100s
> >>>>>> of
> >>> bytes.
> >>>>>> I'm not sure where the crossover point for constructing and then
> >>>>>> traversing an iovec vs just copying twice would be...
> >>>>>
> >>>>> Yes this will eliminate the "extra" copy, but the real problem is
> >>>>> that the oNode itself is just too large. I doubt removing one
> >>>>> extra copy is going to suddenly "solve" this problem. I think
> >>>>> we're going to end up rejiggering things so that this will be much
> >>>>> less of a problem than it is now -- time will tell.
> >>>>
> >>>> Yeah, leaving this one for last I think... until we see memcpy show
> >>>> up in the profile.
> >>>>
> >>>>>>>> 3. Even if we do the above, we're still setting a big (~4k or
> >>>>>>>> more?) key into rocksdb every time we touch an object, even
> >>>>>>>> when a tiny
> >>>>>
> >>>>> See my analysis, you're looking at 8-10K for the RBD random write
> >>>>> case
> >>>>> -- which I think everybody cares a lot about.
> >>>>>
> >>>>>>>> amount of metadata is getting changed.  This is a consequence
> >>>>>>>> of embedding all of the blobs into the onode (or bnode).  That
> >>>>>>>> seemed like a good idea early on when they were tiny (i.e.,
> >>>>>>>> just an extent), but now I'm not so sure.  I see a couple of
> >>>>>>>> different
> >>>> options:
> >>>>>>>>
> >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we load the
> >>>>>>>> onode, load the blobs too.  They will hopefully be sequential
> >>>>>>>> in rocksdb (or definitely sequential in zs).  Probably go back
> >>>>>>>> to using an
> >>>> iterator.
> >>>>>>>>
> >>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids so
> >>>>>>>> that they are unique for any given hash value.  Then store the
> >>>>>>>> blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is
> >>>>>>>> now).  Then when clone happens there is no onode->bnode
> >>>>>>>> migration magic happening--we've already committed to storing
> >>>>>>>> blobs in separate keys.  When we load the onode, keep the
> >>>>>>>> conditional bnode loading we already have.. but when the bnode
> >>>>>>>> is loaded load up all the blobs for the hash key.  (Okay, we
> >>>>>>>> could fault in blobs individually, but that code will be more
> >>>>>>>> complicated.)
> >>>>>
> >>>>> I like this direction. I think you'll still end up demand loading
> >>>>> the blobs in order to speed up the random read case. This scheme
> >>>>> will result in some space-amplification, both in the lextent and
> >>>>> in the blob-map, it's worth a bit of study to see how bad the
> >>>>> metadata/data ratio becomes (just as a guess,
> >>>>> $shard.$poolid.$hash.$blobid is probably 16 +
> >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob
> >>>>> -- unless your KV store does path compression. My reading of
> >>>>> RocksDB sst file seems to indicate that it doesn't, I *believe*
> >>>>> that ZS does [need to confirm]). I'm wondering if the current notion
> of local vs.
> >>>>> global blobs isn't actually beneficial in that we can give local
> >>>>> blobs different names that sort with their associated oNode (which
> >>>>> probably makes the space-amp worse) which is an important
> >>>>> optimization. We do need to watch the space amp, we're going to be
> >>>>> burning DRAM to make KV accesses cheap and the amount of DRAM
> is
> >>> proportional to the space amp.
> >>>>
> >>>> I got this mostly working last night... just need to sort out the
> >>>> clone case (and clean up a bunch of code).  It was a relatively
> >>>> painless transition to make, although in its current form the blobs
> >>>> all belong to the bnode, and the bnode is ephemeral but remains in
> >>> memory until all referencing onodes go away.
> >>>> Mostly fine, except it means that odd combinations of clone could
> >>>> leave lots of blobs in cache that don't get trimmed.  Will address that
> later.
> >>>>
> >>>> I'll try to finish it up this morning and get it passing tests and posted.
> >>>>
> >>>>>>>> In both these cases, a write will dirty the onode (which is
> >>>>>>>> back to being pretty small.. just xattrs and the lextent map)
> >>>>>>>> and 1-3 blobs (also
> >>>>>> now small keys).
> >>>>>
> >>>>> I'm not sure the oNode is going to be that small. Looking at the
> >>>>> RBD random 4K write case, you're going to have 1K entries each of
> >>>>> which has an offset, size and a blob-id reference in them. In my
> >>>>> current oNode compression scheme this compresses to about 1 byte
> per entry.
> >>>>> However, this optimization relies on being able to cheaply
> >>>>> renumber the blob-ids, which is no longer possible when the
> >>>>> blob-ids become parts of a key (see above). So now you'll have a
> >>>>> minimum of 1.5-3 bytes extra for each blob-id (because you can't
> >>>>> assume that the blob-ids
> >>>> become "dense"
> >>>>> anymore) So you're looking at 2.5-4 bytes per entry or about
> >>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable
> >>>>> length encoding you'll have to scan the entire table to
> >>>>> deserialize it (yes, we could do differential editing when we
> >>>>> write but that's another
> >>> discussion).
> >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :).
> >>>>> So while this looks small compared to the current ~30K for the
> >>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over
> >>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I
> published earlier.
> >>>>>
> >>>>> If we want to do better we will need to separate the lextent from
> >>>>> the oNode also. It's relatively easy to move the lextents into the
> >>>>> KV store itself (there are two obvious ways to deal with this,
> >>>>> either use the native offset/size from the lextent itself OR
> >>>>> create 'N' buckets of logical offset into which we pour entries --
> >>>>> both of these would add somewhere between 1 and 2 KV look-ups
> per
> >>>>> operation
> >>>>> -- here is where an iterator would probably help.
> >>>>>
> >>>>> Unfortunately, if you only process a portion of the lextent
> >>>>> (because you've made it into multiple keys and you don't want to
> >>>>> load all of
> >>>>> them) you no longer can re-generate the refmap on the fly (another
> >>>>> key space optimization). The lack of refmap screws up a number of
> >>>>> other important algorithms -- for example the overlapping blob-map
> >>> thing, etc.
> >>>>> Not sure if these are easy to rewrite or not -- too complicated to
> >>>>> think about at this hour of the evening.
> >>>>
> >>>> Yeah, I forgot about the extent_map and how big it gets.  I think,
> >>>> though, that if we can get a 4mb object with 1024 4k lextents to
> >>>> encode the whole onode and extent_map in under 4K that will be good
> >>>> enough.  The blob update that goes with it will be ~200 bytes, and
> >>>> benchmarks aside, the 4k random write 100% fragmented object is a
> >>>> worst
> >>> case.
> >>>
> >>> Yes, it's a worst-case. But it's a
> >>> "worst-case-that-everybody-looks-at" vs. a "worst-case-that-almost-
> nobody-looks-at".
> >>>
> >>> I'm still concerned about having an oNode that's larger than a 4K block.
> >>>
> >>>
> >>>>
> >>>> Anyway, I'll get the blob separation branch working and we can go
> >>>> from there...
> >>>>
> >>>> sage
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe
> >>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-22  1:41     ` Allen Samuels
@ 2016-08-22 22:08       ` Sage Weil
  2016-08-22 22:20         ` Allen Samuels
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2016-08-22 22:08 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Mon, 22 Aug 2016, Allen Samuels wrote:
> Another possibility is to "bin" the lextent table into known, fixed, 
> offset ranges. Suppose each oNode had a fixed range of LBA keys 
> associated with the lextent table: say [0..128K), [128K..256K), ...

Yeah, I think that's the way to do it.  Either a set<uint32_t> 
lextent_key_offsets or uint64_t lextent_map_chunk_size to specify the 
granularity.
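
A rough sketch of how the fixed-granularity variant could map a logical
offset to a bin key (hypothetical names, not the actual BlueStore key
helpers):

  #include <cstdint>
  #include <string>

  // Append the bin index big-endian so the bins sort in offset order
  // immediately after the onode key.
  std::string lextent_bin_key(const std::string& onode_key,
                              uint64_t logical_offset,
                              uint64_t lextent_map_chunk_size) {
    uint64_t bin = logical_offset / lextent_map_chunk_size;  // e.g. 128K chunks
    std::string key = onode_key;
    key.push_back('L');                 // separate lextent keys from the onode
    char buf[sizeof(uint64_t)];
    for (int i = sizeof(buf) - 1; i >= 0; --i) {
      buf[i] = static_cast<char>(bin & 0xff);
      bin >>= 8;
    }
    key.append(buf, sizeof(buf));
    return key;
  }

A 4k overwrite at offset X then only rewrites the one bin value at
lextent_bin_key(onode_key, X, chunk_size), plus the onode if anything in
it changed.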

> It eliminates the need to put a "lower_bound" into the KV Store 
> directly. Though it'll likely be a bit more complex above and somewhat 
> less efficient.

FWIW this is basically what the iterator does, I think.  It's a separate 
rocksdb operation to create a snapshot to iterate over (and we don't rely 
on that feature anywhere).  It's still more expensive than raw gets just 
because it has to have fingers in all the levels to ensure that it finds 
all the keys for the given range, while get can stop once it finds a 
key or a tombstone (either in a cache or a higher level of the tree).
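
For reference, the two access patterns look roughly like this (sketch
only, error handling omitted; the keys are whatever we settle on above):

  #include <memory>
  #include <string>
  #include <rocksdb/db.h>

  void fetch_examples(rocksdb::DB* db, const rocksdb::Slice& exact_key,
                      const rocksdb::Slice& prefix) {
    // Point lookup: can stop at the first key or tombstone it finds.
    std::string value;
    rocksdb::Status s = db->Get(rocksdb::ReadOptions(), exact_key, &value);
    if (s.ok()) { /* value holds the onode/blob/bin */ }

    // Range scan: the iterator has to merge memtables and every level.
    std::unique_ptr<rocksdb::Iterator> it(
        db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(prefix);
         it->Valid() && it->key().starts_with(prefix);
         it->Next()) {
      // it->key() / it->value() hold one lextent bin or blob record
    }
  }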

But I'm not super thrilled about this complexity.  I still hope 
(wishfully?) we can get the lextent map small enough that we can leave it 
in the onode.  Otherwise we really are going to net +1 kv fetch for every 
operation (3, in the clone case... onode, then lextent, then shared blob).

sage

> 
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > Sent: Sunday, August 21, 2016 10:27 AM
> > To: Sage Weil <sweil@redhat.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: bluestore blobs REVISITED
> > 
> > I wonder how hard it would be to add a "lower-bound" fetch like stl. That
> > would allow the kv store to do the fetch without incurring the overhead of a
> > snapshot for the iteration scan.
> > 
> > Shared blobs were always going to trigger an extra kv fetch no matter what.
> > 
> > Sent from my iPhone. Please excuse all typos and autocorrects.
> > 
> > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com> wrote:
> > >
> > >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> > >> I have another proposal (it just occurred to me, so it might not
> > >> survive more scrutiny).
> > >>
> > >> Yes, we should remove the blob-map from the oNode.
> > >>
> > >> But I believe we should also remove the lextent map from the oNode
> > >> and make each lextent be an independent KV value.
> > >>
> > >> However, in the special case where each extent --exactly-- maps onto
> > >> a blob AND the blob is not referenced by any other extent (which is
> > >> the typical case, unless you're doing compression with strange-ish
> > >> overlaps)
> > >> -- then you encode the blob in the lextent itself and there's no
> > >> separate blob entry.
> > >>
> > >> This is pretty much exactly the same number of KV fetches as what you
> > >> proposed before when the blob isn't shared (the typical case) --
> > >> except the oNode is MUCH MUCH smaller now.
> > >
> > > I think this approach makes a lot of sense!  The only thing I'm
> > > worried about is that the lextent keys are no longer known when they
> > > are being fetched (since they will be a logical offset), which means
> > > we'll have to use an iterator instead of a simple get.  The former is
> > > quite a bit slower than the latter (which can make use of the rocksdb
> > > caches and/or key bloom filters more easily).
> > >
> > > We could augment your approach by keeping *just* the lextent offsets
> > > in the onode, so that we know exactly which lextent key to fetch, but
> > > then I'm not sure we'll get much benefit (lextent metadata size goes
> > > down by ~1/3, but then we have an extra get for cloned objects).
> > >
> > > Hmm, the other thing to keep in mind is that for RBD the common case
> > > is that lots of objects have clones, and many of those objects' blobs
> > > will be shared.
> > >
> > > sage
> > >
> > >> So for the non-shared case, you fetch the oNode which is dominated by
> > >> the xattrs now (so figure a couple of hundred bytes and not much CPU
> > >> cost to deserialize). And then fetch from the KV for the lextent
> > >> (which is 1 fetch -- unless it overlaps two previous lextents). If
> > >> it's the optimal case, the KV fetch is small (10-15 bytes) and
> > >> trivial to deserialize. If it's an unshared/local blob then you're
> > >> ready to go. If the blob is shared (locally or globally) then you'll
> > >> have to go fetch that one too.
> > >>
> > >> This might lead to the elimination of the local/global blob thing (I
> > >> think you've talked about that before) as now the only "local" blobs
> > >> are the unshared single extent blobs which are stored inline with the
> > >> lextent entry. You'll still have the special cases of promoting
> > >> unshared
> > >> (inline) blobs to global blobs -- which is probably similar to the
> > >> current promotion "magic" on a clone operation.
> > >>
> > >> The current refmap concept may require some additional work. I
> > >> believe that we'll have to do a reconstruction of the refmap, but
> > >> fortunately only for the range of the current I/O. That will be a bit
> > >> more expensive, but still less expensive than reconstructing the
> > >> entire refmap for every oNode deserialization, Fortunately I believe
> > >> the refmap is only really needed for compression cases or RBD cases
> > without "trim"
> > >> (this is the case to optimize -- it'll make trim really important for
> > >> performance).
> > >>
> > >> Best of both worlds????
> > >>
> > >> Allen Samuels
> > >> SanDisk |a Western Digital brand
> > >> 2880 Junction Avenue, Milpitas, CA 95134
> > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > >>> Sent: Friday, August 19, 2016 7:16 AM
> > >>> To: Sage Weil <sweil@redhat.com>
> > >>> Cc: ceph-devel@vger.kernel.org
> > >>> Subject: RE: bluestore blobs
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Sage Weil [mailto:sweil@redhat.com]
> > >>>> Sent: Friday, August 19, 2016 6:53 AM
> > >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > >>>> Cc: ceph-devel@vger.kernel.org
> > >>>> Subject: RE: bluestore blobs
> > >>>>
> > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > >>>>>> -----Original Message-----
> > >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> > >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > >>>>>> Cc: ceph-devel@vger.kernel.org
> > >>>>>> Subject: RE: bluestore blobs
> > >>>>>>
> > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > >>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> > >>>>>>>> To: ceph-devel@vger.kernel.org
> > >>>>>>>> Subject: bluestore blobs
> > >>>>>>>>
> > >>>>>>>> I think we need to look at other changes in addition to the
> > >>>>>>>> encoding performance improvements.  Even if they end up being
> > >>>>>>>> good enough, these changes are somewhat orthogonal and at
> > least
> > >>>>>>>> one of them should give us something that is even faster.
> > >>>>>>>>
> > >>>>>>>> 1. I mentioned this before, but we should keep the encoding
> > >>>>>>>> bluestore_blob_t around when we load the blob map.  If it's not
> > >>>>>>>> changed, don't reencode it.  There are no blockers for
> > >>>>>>>> implementing this
> > >>>>>> currently.
> > >>>>>>>> It may be difficult to ensure the blobs are properly marked dirty...
> > >>>>>>>> I'll see if we can use proper accessors for the blob to enforce
> > >>>>>>>> this at compile time.  We should do that anyway.
> > >>>>>>>
> > >>>>>>> If it's not changed, then why are we re-writing it? I'm having a
> > >>>>>>> hard time thinking of a case worth optimizing where I want to
> > >>>>>>> re-write the oNode but the blob_map is unchanged. Am I missing
> > >>>> something obvious?
> > >>>>>>
> > >>>>>> An onode's blob_map might have 300 blobs, and a single write only
> > >>>>>> updates one of them.  The other 299 blobs need not be reencoded,
> > >>>>>> just
> > >>>> memcpy'd.
> > >>>>>
> > >>>>> As long as we're just appending that's a good optimization. How
> > >>>>> often does that happen? It's certainly not going to help the RBD
> > >>>>> 4K random write problem.
> > >>>>
> > >>>> It won't help the (l)extent_map encoding, but it avoids almost all
> > >>>> of the blob reencoding.  A 4k random write will update one blob out
> > >>>> of
> > >>>> ~100 (or whatever it is).
> > >>>>
> > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages:
> > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each
> > >>>>>>>> untouched
> > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
> > >>>>>>>> somewhere inside rocksdb to copy this into the write buffer.
> > >>>>>>>> We could extend the rocksdb interface to take an iovec so that
> > >>>>>>>> the first memcpy isn't needed (and rocksdb will instead iterate
> > >>>>>>>> over our buffers and copy them directly into its write buffer).
> > >>>>>>>> This is probably a pretty small piece of the overall time...
> > >>>>>>>> should verify with a profiler
> > >>>>>> before investing too much effort here.
> > >>>>>>>
> > >>>>>>> I doubt it's the memcpy that's really the expensive part. I'll
> > >>>>>>> bet it's that we're transcoding from an internal to an external
> > >>>>>>> representation on an element by element basis. If the iovec
> > >>>>>>> scheme is going to help, it presumes that the internal data
> > >>>>>>> structure essentially matches the external data structure so
> > >>>>>>> that only an iovec copy is required. I'm wondering how
> > >>>>>>> compatible this is with the current concepts of
> > lextent/blob/pextent.
> > >>>>>>
> > >>>>>> I'm thinking of the xattr case (we have a bunch of strings to
> > >>>>>> copy
> > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case:
> > >>>>>> instead of memcpy'ing them into a big contiguous buffer and
> > >>>>>> having rocksdb memcpy
> > >>>>>> *that* into its larger buffer, give rocksdb an iovec so that
> > >>>>>> the smaller buffers are assembled only once.
> > >>>>>>
> > >>>>>> These buffers will be on the order of many 10s to a couple 100s
> > >>>>>> of
> > >>> bytes.
> > >>>>>> I'm not sure where the crossover point for constructing and then
> > >>>>>> traversing an iovec vs just copying twice would be...
> > >>>>>
> > >>>>> Yes this will eliminate the "extra" copy, but the real problem is
> > >>>>> that the oNode itself is just too large. I doubt removing one
> > >>>>> extra copy is going to suddenly "solve" this problem. I think
> > >>>>> we're going to end up rejiggering things so that this will be much
> > >>>>> less of a problem than it is now -- time will tell.
> > >>>>
> > >>>> Yeah, leaving this one for last I think... until we see memcpy show
> > >>>> up in the profile.
> > >>>>
> > >>>>>>>> 3. Even if we do the above, we're still setting a big (~4k or
> > >>>>>>>> more?) key into rocksdb every time we touch an object, even
> > >>>>>>>> when a tiny
> > >>>>>
> > >>>>> See my analysis, you're looking at 8-10K for the RBD random write
> > >>>>> case
> > >>>>> -- which I think everybody cares a lot about.
> > >>>>>
> > >>>>>>>> amount of metadata is getting changed.  This is a consequence
> > >>>>>>>> of embedding all of the blobs into the onode (or bnode).  That
> > >>>>>>>> seemed like a good idea early on when they were tiny (i.e.,
> > >>>>>>>> just an extent), but now I'm not so sure.  I see a couple of
> > >>>>>>>> different
> > >>>> options:
> > >>>>>>>>
> > >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we load the
> > >>>>>>>> onode, load the blobs too.  They will hopefully be sequential
> > >>>>>>>> in rocksdb (or definitely sequential in zs).  Probably go back
> > >>>>>>>> to using an
> > >>>> iterator.
> > >>>>>>>>
> > >>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids so
> > >>>>>>>> that they are unique for any given hash value.  Then store the
> > >>>>>>>> blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is
> > >>>>>>>> now).  Then when clone happens there is no onode->bnode
> > >>>>>>>> migration magic happening--we've already committed to storing
> > >>>>>>>> blobs in separate keys.  When we load the onode, keep the
> > >>>>>>>> conditional bnode loading we already have.. but when the bnode
> > >>>>>>>> is loaded load up all the blobs for the hash key.  (Okay, we
> > >>>>>>>> could fault in blobs individually, but that code will be more
> > >>>>>>>> complicated.)
> > >>>>>
> > >>>>> I like this direction. I think you'll still end up demand loading
> > >>>>> the blobs in order to speed up the random read case. This scheme
> > >>>>> will result in some space-amplification, both in the lextent and
> > >>>>> in the blob-map, it's worth a bit of study to see how bad the
> > >>>>> metadata/data ratio becomes (just as a guess,
> > >>>>> $shard.$poolid.$hash.$blobid is probably 16 +
> > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob
> > >>>>> -- unless your KV store does path compression. My reading of
> > >>>>> RocksDB sst file seems to indicate that it doesn't, I *believe*
> > >>>>> that ZS does [need to confirm]). I'm wondering if the current notion
> > of local vs.
> > >>>>> global blobs isn't actually beneficial in that we can give local
> > >>>>> blobs different names that sort with their associated oNode (which
> > >>>>> probably makes the space-amp worse) which is an important
> > >>>>> optimization. We do need to watch the space amp, we're going to be
> > >>>>> burning DRAM to make KV accesses cheap and the amount of DRAM
> > is
> > >>> proportional to the space amp.
> > >>>>
> > >>>> I got this mostly working last night... just need to sort out the
> > >>>> clone case (and clean up a bunch of code).  It was a relatively
> > >>>> painless transition to make, although in its current form the blobs
> > >>>> all belong to the bnode, and the bnode is ephemeral but remains in
> > >>> memory until all referencing onodes go away.
> > >>>> Mostly fine, except it means that odd combinations of clone could
> > >>>> leave lots of blobs in cache that don't get trimmed.  Will address that
> > later.
> > >>>>
> > >>>> I'll try to finish it up this morning and get it passing tests and posted.
> > >>>>
> > >>>>>>>> In both these cases, a write will dirty the onode (which is
> > >>>>>>>> back to being pretty small.. just xattrs and the lextent map)
> > >>>>>>>> and 1-3 blobs (also
> > >>>>>> now small keys).
> > >>>>>
> > >>>>> I'm not sure the oNode is going to be that small. Looking at the
> > >>>>> RBD random 4K write case, you're going to have 1K entries each of
> > >>>>> which has an offset, size and a blob-id reference in them. In my
> > >>>>> current oNode compression scheme this compresses to about 1 byte
> > per entry.
> > >>>>> However, this optimization relies on being able to cheaply
> > >>>>> renumber the blob-ids, which is no longer possible when the
> > >>>>> blob-ids become parts of a key (see above). So now you'll have a
> > >>>>> minimum of 1.5-3 bytes extra for each blob-id (because you can't
> > >>>>> assume that the blob-ids
> > >>>> become "dense"
> > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or about
> > >>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable
> > >>>>> length encoding you'll have to scan the entire table to
> > >>>>> deserialize it (yes, we could do differential editing when we
> > >>>>> write but that's another
> > >>> discussion).
> > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :).
> > >>>>> So while this looks small compared to the current ~30K for the
> > >>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over
> > >>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I
> > published earlier.
> > >>>>>
> > >>>>> If we want to do better we will need to separate the lextent from
> > >>>>> the oNode also. It's relatively easy to move the lextents into the
> > >>>>> KV store itself (there are two obvious ways to deal with this,
> > >>>>> either use the native offset/size from the lextent itself OR
> > >>>>> create 'N' buckets of logical offset into which we pour entries --
> > >>>>> both of these would add somewhere between 1 and 2 KV look-ups
> > per
> > >>>>> operation
> > >>>>> -- here is where an iterator would probably help.
> > >>>>>
> > >>>>> Unfortunately, if you only process a portion of the lextent
> > >>>>> (because you've made it into multiple keys and you don't want to
> > >>>>> load all of
> > >>>>> them) you no longer can re-generate the refmap on the fly (another
> > >>>>> key space optimization). The lack of refmap screws up a number of
> > >>>>> other important algorithms -- for example the overlapping blob-map
> > >>> thing, etc.
> > >>>>> Not sure if these are easy to rewrite or not -- too complicated to
> > >>>>> think about at this hour of the evening.
> > >>>>
> > >>>> Yeah, I forgot about the extent_map and how big it gets.  I think,
> > >>>> though, that if we can get a 4mb object with 1024 4k lextents to
> > >>>> encode the whole onode and extent_map in under 4K that will be good
> > >>>> enough.  The blob update that goes with it will be ~200 bytes, and
> > >>>> benchmarks aside, the 4k random write 100% fragmented object is a
> > >>>> worst
> > >>> case.
> > >>>
> > >>> Yes, it's a worst-case. But it's a
> > >>> "worst-case-that-everybody-looks-at" vs. a "worst-case-that-almost-
> > nobody-looks-at".
> > >>>
> > >>> I'm still concerned about having an oNode that's larger than a 4K block.
> > >>>
> > >>>
> > >>>>
> > >>>> Anyway, I'll get the blob separation branch working and we can go
> > >>>> from there...
> > >>>>
> > >>>> sage
> > >>> --
> > >>> To unsubscribe from this list: send the line "unsubscribe
> > >>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > >> in the body of a message to majordomo@vger.kernel.org More
> > majordomo
> > >> info at  http://vger.kernel.org/majordomo-info.html
> > >>
> > >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> > body of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-22 22:08       ` Sage Weil
@ 2016-08-22 22:20         ` Allen Samuels
  2016-08-22 22:55           ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-22 22:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 6:09 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs REVISITED
> 
> On Mon, 22 Aug 2016, Allen Samuels wrote:
> > Another possibility is to "bin" the lextent table into known, fixed,
> > offset ranges. Suppose each oNode had a fixed range of LBA keys
> > associated with the lextent table: say [0..128K), [128K..256K), ...
> 
> Yeah, I think that's the way to do it.  Either a set<uint32_t>
> lextent_key_offsets or uint64_t lextent_map_chunk_size to specify the
> granularity.

Need to do some actual estimates on this scheme to make sure we're actually landing on a real solution and not just another band-aid that we have to rip off (painfully) at some future time.
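
A first back-of-the-envelope, reusing the numbers already floated in this
thread (1024 4K lextents in a 4M object, 2.5-4 bytes per encoded entry,
128K bins):

  4M object / 128K per bin     = 32 bins per oNode
  1024 lextents / 32 bins      = 32 entries per bin
  32 entries * 2.5-4 bytes     = ~80-128 bytes per bin value

So a 4K random write rewrites one ~100-byte bin value instead of a 2.5-4K
lextent table, at the cost of ~32 extra keys per oNode, each key being
tens of bytes (on the order of the oNode key itself). Whether that key
overhead is acceptable is exactly the space-amp question.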

> 
> > It eliminates the need to put a "lower_bound" into the KV Store
> > directly. Though it'll likely be a bit more complex above and somewhat
> > less efficient.
> 
> FWIW this is basically wat the iterator does, I think.  It's a separate rocksdb
> operation to create a snapshot to iterate over (and we don't rely on that
> feature anywhere).  It's still more expensive than raw gets just because it has
> to have fingers in all the levels to ensure that it finds all the keys for the given
> range, while get can stop once it finds a key or a tombstone (either in a cache
> or a higher level of the tree).
> 
> But I'm not super thrilled about this complexity.  I still hope
> (wishfully?) we can get the lextent map small enough that we can leave it in
> the onode.  Otherwise we really are going to net +1 kv fetch for every
> operation (3, in the clone case... onode, then lextent, then shared blob).

I don't see the feasibility of leaving the lextent map in the oNode. It's just too big for the 4K random write case. I know that's not indicative of real-world usage. But it's what people use to measure systems....

BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff suggests that it would be relatively trivial to ensure that bloom-filters properly ignore the "offset" portion of an lextent key. Meaning that I believe that a "lower_bound" operator ought to be relatively easy to implement without triggering the overhead of a snapshot, again, provided that you supply a custom bloom-filter implementation that does the right thing.
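
Something along these lines -- a sketch only, the <onode key> + 8-byte
offset suffix layout is an assumption, not the real key encoding:

  #include <rocksdb/options.h>
  #include <rocksdb/slice_transform.h>

  // Hand only the onode-key part of an lextent key to the prefix bloom
  // filter, so a Seek() to (onode, offset) can still use the filter and
  // the block/memtable caches.
  class LextentPrefixExtractor : public rocksdb::SliceTransform {
   public:
    const char* Name() const override { return "LextentPrefixExtractor"; }
    rocksdb::Slice Transform(const rocksdb::Slice& key) const override {
      return rocksdb::Slice(key.data(), key.size() - 8);  // drop the offset
    }
    bool InDomain(const rocksdb::Slice& key) const override {
      return key.size() > 8;        // only keys shaped like lextent keys
    }
    bool InRange(const rocksdb::Slice& dst) const override {
      return true;                  // conservative; used by internal checks only
    }
  };

  // rocksdb::Options opts;
  // opts.prefix_extractor.reset(new LextentPrefixExtractor);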

So the question is do we want to avoid monkeying with RocksDB and go with a binning approach (TBD ensuring that this is a real solution to the problem) OR do we bite the bullet and solve the lower-bound lookup problem?

BTW, on the "binning" scheme, perhaps the solution is just put the current "binning value" [presumeably some log2 thing] into the oNode itself -- it'll just be a byte. Then you're only stuck with the complexity of deciding when to do a "split" if the current bin has gotten too large (some arbitrary limit on size of the encoded lexent-bin)

> 
> sage
> 
> >
> >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > Sent: Sunday, August 21, 2016 10:27 AM
> > > To: Sage Weil <sweil@redhat.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: Re: bluestore blobs REVISITED
> > >
> > > I wonder how hard it would be to add a "lower-bound" fetch like stl.
> > > That would allow the kv store to do the fetch without incurring the
> > > overhead of a snapshot for the iteration scan.
> > >
> > > Shared blobs were always going to trigger an extra kv fetch no matter
> what.
> > >
> > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > >
> > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com> wrote:
> > > >
> > > >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> > > >> I have another proposal (it just occurred to me, so it might not
> > > >> survive more scrutiny).
> > > >>
> > > >> Yes, we should remove the blob-map from the oNode.
> > > >>
> > > >> But I believe we should also remove the lextent map from the
> > > >> oNode and make each lextent be an independent KV value.
> > > >>
> > > >> However, in the special case where each extent --exactly-- maps
> > > >> onto a blob AND the blob is not referenced by any other extent
> > > >> (which is the typical case, unless you're doing compression with
> > > >> strange-ish
> > > >> overlaps)
> > > >> -- then you encode the blob in the lextent itself and there's no
> > > >> separate blob entry.
> > > >>
> > > >> This is pretty much exactly the same number of KV fetches as what
> > > >> you proposed before when the blob isn't shared (the typical case)
> > > >> -- except the oNode is MUCH MUCH smaller now.
> > > >
> > > > I think this approach makes a lot of sense!  The only thing I'm
> > > > worried about is that the lextent keys are no longer known when
> > > > they are being fetched (since they will be a logical offset),
> > > > which means we'll have to use an iterator instead of a simple get.
> > > > The former is quite a bit slower than the latter (which can make
> > > > use of the rocksdb caches and/or key bloom filters more easily).
> > > >
> > > > We could augment your approach by keeping *just* the lextent
> > > > offsets in the onode, so that we know exactly which lextent key to
> > > > fetch, but then I'm not sure we'll get much benefit (lextent
> > > > metadata size goes down by ~1/3, but then we have an extra get for
> cloned objects).
> > > >
> > > > Hmm, the other thing to keep in mind is that for RBD the common
> > > > case is that lots of objects have clones, and many of those
> > > > objects' blobs will be shared.
> > > >
> > > > sage
> > > >
> > > >> So for the non-shared case, you fetch the oNode which is
> > > >> dominated by the xattrs now (so figure a couple of hundred bytes
> > > >> and not much CPU cost to deserialize). And then fetch from the KV
> > > >> for the lextent (which is 1 fetch -- unless it overlaps two
> > > >> previous lextents). If it's the optimal case, the KV fetch is
> > > >> small (10-15 bytes) and trivial to deserialize. If it's an
> > > >> unshared/local blob then you're ready to go. If the blob is
> > > >> shared (locally or globally) then you'll have to go fetch that one too.
> > > >>
> > > >> This might lead to the elimination of the local/global blob thing
> > > >> (I think you've talked about that before) as now the only "local"
> > > >> blobs are the unshared single extent blobs which are stored
> > > >> inline with the lextent entry. You'll still have the special
> > > >> cases of promoting unshared
> > > >> (inline) blobs to global blobs -- which is probably similar to
> > > >> the current promotion "magic" on a clone operation.
> > > >>
> > > >> The current refmap concept may require some additional work. I
> > > >> believe that we'll have to do a reconstruction of the refmap, but
> > > >> fortunately only for the range of the current I/O. That will be a
> > > >> bit more expensive, but still less expensive than reconstructing
> > > >> the entire refmap for every oNode deserialization, Fortunately I
> > > >> believe the refmap is only really needed for compression cases or
> > > >> RBD cases
> > > without "trim"
> > > >> (this is the case to optimize -- it'll make trim really important
> > > >> for performance).
> > > >>
> > > >> Best of both worlds????
> > > >>
> > > >> Allen Samuels
> > > >> SanDisk |a Western Digital brand
> > > >> 2880 Junction Avenue, Milpitas, CA 95134
> > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > >>> Sent: Friday, August 19, 2016 7:16 AM
> > > >>> To: Sage Weil <sweil@redhat.com>
> > > >>> Cc: ceph-devel@vger.kernel.org
> > > >>> Subject: RE: bluestore blobs
> > > >>>
> > > >>>> -----Original Message-----
> > > >>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > >>>> Sent: Friday, August 19, 2016 6:53 AM
> > > >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > >>>> Cc: ceph-devel@vger.kernel.org
> > > >>>> Subject: RE: bluestore blobs
> > > >>>>
> > > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > >>>>>> -----Original Message-----
> > > >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> > > >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > >>>>>> Cc: ceph-devel@vger.kernel.org
> > > >>>>>> Subject: RE: bluestore blobs
> > > >>>>>>
> > > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > >>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
> devel-
> > > >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> > > >>>>>>>> To: ceph-devel@vger.kernel.org
> > > >>>>>>>> Subject: bluestore blobs
> > > >>>>>>>>
> > > >>>>>>>> I think we need to look at other changes in addition to the
> > > >>>>>>>> encoding performance improvements.  Even if they end up
> > > >>>>>>>> being good enough, these changes are somewhat orthogonal
> > > >>>>>>>> and at
> > > least
> > > >>>>>>>> one of them should give us something that is even faster.
> > > >>>>>>>>
> > > >>>>>>>> 1. I mentioned this before, but we should keep the encoding
> > > >>>>>>>> bluestore_blob_t around when we load the blob map.  If it's
> > > >>>>>>>> not changed, don't reencode it.  There are no blockers for
> > > >>>>>>>> implementing this
> > > >>>>>> currently.
> > > >>>>>>>> It may be difficult to ensure the blobs are properly marked
> dirty...
> > > >>>>>>>> I'll see if we can use proper accessors for the blob to
> > > >>>>>>>> enforce this at compile time.  We should do that anyway.
> > > >>>>>>>
> > > >>>>>>> If it's not changed, then why are we re-writing it? I'm
> > > >>>>>>> having a hard time thinking of a case worth optimizing where
> > > >>>>>>> I want to re-write the oNode but the blob_map is unchanged.
> > > >>>>>>> Am I missing
> > > >>>> something obvious?
> > > >>>>>>
> > > >>>>>> An onode's blob_map might have 300 blobs, and a single write
> > > >>>>>> only updates one of them.  The other 299 blobs need not be
> > > >>>>>> reencoded, just
> > > >>>> memcpy'd.
> > > >>>>>
> > > >>>>> As long as we're just appending that's a good optimization.
> > > >>>>> How often does that happen? It's certainly not going to help
> > > >>>>> the RBD 4K random write problem.
> > > >>>>
> > > >>>> It won't help the (l)extent_map encoding, but it avoids almost
> > > >>>> all of the blob reencoding.  A 4k random write will update one
> > > >>>> blob out of
> > > >>>> ~100 (or whatever it is).
> > > >>>>
> > > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy
> stages:
> > > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each
> > > >>>>>>>> untouched
> > > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
> > > >>>>>>>> somewhere inside rocksdb to copy this into the write buffer.
> > > >>>>>>>> We could extend the rocksdb interface to take an iovec so
> > > >>>>>>>> that the first memcpy isn't needed (and rocksdb will
> > > >>>>>>>> instead iterate over our buffers and copy them directly into its
> write buffer).
> > > >>>>>>>> This is probably a pretty small piece of the overall time...
> > > >>>>>>>> should verify with a profiler
> > > >>>>>> before investing too much effort here.
> > > >>>>>>>
> > > >>>>>>> I doubt it's the memcpy that's really the expensive part.
> > > >>>>>>> I'll bet it's that we're transcoding from an internal to an
> > > >>>>>>> external representation on an element by element basis. If
> > > >>>>>>> the iovec scheme is going to help, it presumes that the
> > > >>>>>>> internal data structure essentially matches the external
> > > >>>>>>> data structure so that only an iovec copy is required. I'm
> > > >>>>>>> wondering how compatible this is with the current concepts
> > > >>>>>>> of
> > > lextent/blob/pextent.
> > > >>>>>>
> > > >>>>>> I'm thinking of the xattr case (we have a bunch of strings to
> > > >>>>>> copy
> > > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case:
> > > >>>>>> instead of memcpy'ing them into a big contiguous buffer and
> > > >>>>>> having rocksdb memcpy
> > > >>>>>> *that* into its larger buffer, give rocksdb an iovec so that
> > > >>>>>> the smaller buffers are assembled only once.
> > > >>>>>>
> > > >>>>>> These buffers will be on the order of many 10s to a couple
> > > >>>>>> 100s of
> > > >>> bytes.
> > > >>>>>> I'm not sure where the crossover point for constructing and
> > > >>>>>> then traversing an iovec vs just copying twice would be...
> > > >>>>>
> > > >>>>> Yes this will eliminate the "extra" copy, but the real problem
> > > >>>>> is that the oNode itself is just too large. I doubt removing
> > > >>>>> one extra copy is going to suddenly "solve" this problem. I
> > > >>>>> think we're going to end up rejiggering things so that this
> > > >>>>> will be much less of a problem than it is now -- time will tell.
> > > >>>>
> > > >>>> Yeah, leaving this one for last I think... until we see memcpy
> > > >>>> show up in the profile.
> > > >>>>
> > > >>>>>>>> 3. Even if we do the above, we're still setting a big (~4k
> > > >>>>>>>> or
> > > >>>>>>>> more?) key into rocksdb every time we touch an object, even
> > > >>>>>>>> when a tiny
> > > >>>>>
> > > >>>>> See my analysis, you're looking at 8-10K for the RBD random
> > > >>>>> write case
> > > >>>>> -- which I think everybody cares a lot about.
> > > >>>>>
> > > >>>>>>>> amount of metadata is getting changed.  This is a
> > > >>>>>>>> consequence of embedding all of the blobs into the onode
> > > >>>>>>>> (or bnode).  That seemed like a good idea early on when
> > > >>>>>>>> they were tiny (i.e., just an extent), but now I'm not so
> > > >>>>>>>> sure.  I see a couple of different
> > > >>>> options:
> > > >>>>>>>>
> > > >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we load
> > > >>>>>>>> the onode, load the blobs too.  They will hopefully be
> > > >>>>>>>> sequential in rocksdb (or definitely sequential in zs).
> > > >>>>>>>> Probably go back to using an
> > > >>>> iterator.
> > > >>>>>>>>
> > > >>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids
> > > >>>>>>>> so that they are unique for any given hash value.  Then
> > > >>>>>>>> store the blobs as $shard.$poolid.$hash.$blobid (i.e.,
> > > >>>>>>>> where the bnode is now).  Then when clone happens there is
> > > >>>>>>>> no onode->bnode migration magic happening--we've already
> > > >>>>>>>> committed to storing blobs in separate keys.  When we load
> > > >>>>>>>> the onode, keep the conditional bnode loading we already
> > > >>>>>>>> have.. but when the bnode is loaded load up all the blobs
> > > >>>>>>>> for the hash key.  (Okay, we could fault in blobs
> > > >>>>>>>> individually, but that code will be more
> > > >>>>>>>> complicated.)
> > > >>>>>
> > > >>>>> I like this direction. I think you'll still end up demand
> > > >>>>> loading the blobs in order to speed up the random read case.
> > > >>>>> This scheme will result in some space-amplification, both in
> > > >>>>> the lextent and in the blob-map, it's worth a bit of study to
> > > >>>>> see how bad the metadata/data ratio becomes (just as a guess,
> > > >>>>> $shard.$poolid.$hash.$blobid is probably 16 +
> > > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each
> > > >>>>> Blob
> > > >>>>> -- unless your KV store does path compression. My reading of
> > > >>>>> RocksDB sst file seems to indicate that it doesn't, I
> > > >>>>> *believe* that ZS does [need to confirm]). I'm wondering if
> > > >>>>> the current notion
> > > of local vs.
> > > >>>>> global blobs isn't actually beneficial in that we can give
> > > >>>>> local blobs different names that sort with their associated
> > > >>>>> oNode (which probably makes the space-amp worse) which is an
> > > >>>>> important optimization. We do need to watch the space amp,
> > > >>>>> we're going to be burning DRAM to make KV accesses cheap and
> > > >>>>> the amount of DRAM
> > > is
> > > >>> proportional to the space amp.
> > > >>>>
> > > >>>> I got this mostly working last night... just need to sort out
> > > >>>> the clone case (and clean up a bunch of code).  It was a
> > > >>>> relatively painless transition to make, although in its current
> > > >>>> form the blobs all belong to the bnode, and the bnode is
> > > >>>> ephemeral but remains in
> > > >>> memory until all referencing onodes go away.
> > > >>>> Mostly fine, except it means that odd combinations of clone
> > > >>>> could leave lots of blobs in cache that don't get trimmed.
> > > >>>> Will address that
> > > later.
> > > >>>>
> > > >>>> I'll try to finish it up this morning and get it passing tests and posted.
> > > >>>>
> > > >>>>>>>> In both these cases, a write will dirty the onode (which is
> > > >>>>>>>> back to being pretty small.. just xattrs and the lextent
> > > >>>>>>>> map) and 1-3 blobs (also
> > > >>>>>> now small keys).
> > > >>>>>
> > > >>>>> I'm not sure the oNode is going to be that small. Looking at
> > > >>>>> the RBD random 4K write case, you're going to have 1K entries
> > > >>>>> each of which has an offset, size and a blob-id reference in
> > > >>>>> them. In my current oNode compression scheme this compresses
> > > >>>>> to about 1 byte
> > > per entry.
> > > >>>>> However, this optimization relies on being able to cheaply
> > > >>>>> renumber the blob-ids, which is no longer possible when the
> > > >>>>> blob-ids become parts of a key (see above). So now you'll have
> > > >>>>> a minimum of 1.5-3 bytes extra for each blob-id (because you
> > > >>>>> can't assume that the blob-ids
> > > >>>> become "dense"
> > > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or about
> > > >>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable
> > > >>>>> length encoding you'll have to scan the entire table to
> > > >>>>> deserialize it (yes, we could do differential editing when we
> > > >>>>> write but that's another
> > > >>> discussion).
> > > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :).
> > > >>>>> So while this looks small compared to the current ~30K for the
> > > >>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over
> > > >>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I
> > > published earlier.
> > > >>>>>
> > > >>>>> If we want to do better we will need to separate the lextent
> > > >>>>> from the oNode also. It's relatively easy to move the lextents
> > > >>>>> into the KV store itself (there are two obvious ways to deal
> > > >>>>> with this, either use the native offset/size from the lextent
> > > >>>>> itself OR create 'N' buckets of logical offset into which we
> > > >>>>> pour entries -- both of these would add somewhere between 1
> > > >>>>> and 2 KV look-ups
> > > per
> > > >>>>> operation
> > > >>>>> -- here is where an iterator would probably help.
> > > >>>>>
> > > >>>>> Unfortunately, if you only process a portion of the lextent
> > > >>>>> (because you've made it into multiple keys and you don't want
> > > >>>>> to load all of
> > > >>>>> them) you no longer can re-generate the refmap on the fly
> > > >>>>> (another key space optimization). The lack of refmap screws up
> > > >>>>> a number of other important algorithms -- for example the
> > > >>>>> overlapping blob-map
> > > >>> thing, etc.
> > > >>>>> Not sure if these are easy to rewrite or not -- too
> > > >>>>> complicated to think about at this hour of the evening.
> > > >>>>
> > > >>>> Yeah, I forgot about the extent_map and how big it gets.  I
> > > >>>> think, though, that if we can get a 4mb object with 1024 4k
> > > >>>> lextents to encode the whole onode and extent_map in under 4K
> > > >>>> that will be good enough.  The blob update that goes with it
> > > >>>> will be ~200 bytes, and benchmarks aside, the 4k random write
> > > >>>> 100% fragmented object is a worst
> > > >>> case.
> > > >>>
> > > >>> Yes, it's a worst-case. But it's a
> > > >>> "worst-case-that-everybody-looks-at" vs. a
> > > >>> "worst-case-that-almost-
> > > nobody-looks-at".
> > > >>>
> > > >>> I'm still concerned about having an oNode that's larger than a 4K
> block.
> > > >>>
> > > >>>
> > > >>>>
> > > >>>> Anyway, I'll get the blob separation branch working and we can
> > > >>>> go from there...
> > > >>>>
> > > >>>> sage
> > > >>> --
> > > >>> To unsubscribe from this list: send the line "unsubscribe
> > > >>> ceph-devel" in the body of a message to
> > > >>> majordomo@vger.kernel.org More majordomo info at
> > > >>> http://vger.kernel.org/majordomo-info.html
> > > >> --
> > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > >> in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > >> info at  http://vger.kernel.org/majordomo-info.html
> > > >>
> > > >>
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-22 22:20         ` Allen Samuels
@ 2016-08-22 22:55           ` Sage Weil
  2016-08-22 23:09             ` Allen Samuels
  2016-08-23  5:21             ` Varada Kari
  0 siblings, 2 replies; 29+ messages in thread
From: Sage Weil @ 2016-08-22 22:55 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Mon, 22 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Monday, August 22, 2016 6:09 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs REVISITED
> > 
> > On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > Another possibility is to "bin" the lextent table into known, fixed,
> > > offset ranges. Suppose each oNode had a fixed range of LBA keys
> > > associated with the lextent table: say [0..128K), [128K..256K), ...
> > 
> > Yeah, I think that's the way to do it.  Either a set<uint32_t>
> > lextent_key_offsets or uint64_t lextent_map_chunk_size to specify the
> > granularity.
> 
> Need to do some actual estimates on this scheme to make sure we're 
> actually landing on a real solution and not just another band-aid that 
> we have to rip off (painfully) at some future time.

Yeah

> > > It eliminates the need to put a "lower_bound" into the KV Store
> > > directly. Though it'll likely be a bit more complex above and somewhat
> > > less efficient.
> > 
> > FWIW this is basically what the iterator does, I think.  It's a separate rocksdb
> > operation to create a snapshot to iterate over (and we don't rely on that
> > feature anywhere).  It's still more expensive than raw gets just because it has
> > to have fingers in all the levels to ensure that it finds all the keys for the given
> > range, while get can stop once it finds a key or a tombstone (either in a cache
> > or a higher level of the tree).
> > 
> > But I'm not super thrilled about this complexity.  I still hope
> > (wishfully?) we can get the lextent map small enough that we can leave it in
> > the onode.  Otherwise we really are going to net +1 kv fetch for every
> > operation (3, in the clone case... onode, then lextent, then shared blob).
> 
> I don't see the feasibility of leaving the lextent map in the oNode. 
> It's just too big for the 4K random write case. I know that's not 
> indicative of real-world usage. But it's what people use to measure 
> systems....

How small do you think it would need to be to be acceptable (in the 
worst-case, 1024 4k lextents in a 4m object)?  2k?  3k?  1k?

You're probably right, but I think I can still cut the full map encoding 
further.  It was 6500 bytes; a quick tweak got it to 5500, and I think 
I can drop another 1-2 bytes per entry for the common cases (offset of 0, 
length == previous lextent length), which would get it under the 4K mark.
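
To make the common case concrete, a sketch of the kind of flag/delta
encoding that gets there (the flag layout and helper are made up, not the
actual bluestore encoding):

  #include <cstdint>
  #include <string>

  // LEB128-style varint, standing in for whatever helper we end up with.
  static void put_varint(std::string* out, uint64_t v) {
    do {
      uint8_t b = v & 0x7f;
      v >>= 7;
      if (v) b |= 0x80;
      out->push_back(static_cast<char>(b));
    } while (v);
  }

  enum {
    FLAG_BLOB_OFFSET_ZERO = 1,  // omit the offset into the blob
    FLAG_SAME_LENGTH      = 2,  // omit the length, reuse the previous one
  };

  // The common 4k-random-write entry (offset 0, same length as the
  // previous lextent) costs a single small varint: flags + blob-id delta.
  void encode_lextent(std::string* out, uint64_t blob_id_delta,
                      uint64_t blob_offset, uint64_t length,
                      uint64_t prev_length) {
    uint8_t flags = 0;
    if (blob_offset == 0)      flags |= FLAG_BLOB_OFFSET_ZERO;
    if (length == prev_length) flags |= FLAG_SAME_LENGTH;
    put_varint(out, (blob_id_delta << 2) | flags);
    if (!(flags & FLAG_BLOB_OFFSET_ZERO)) put_varint(out, blob_offset);
    if (!(flags & FLAG_SAME_LENGTH))      put_varint(out, length);
  }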

> BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff 
> suggests that it would be relatively trivial to ensure that bloom-filters 
> properly ignore the "offset" portion of an lextent key. Meaning that I 
> believe that a "lower_bound" operator ought to be relatively easy to 
> implement without triggering the overhead of a snapshot, again, provided 
> that you provided a custom bloom-filter implementation that did the 
> right thing.

Hmm, that might help.  It may be that this isn't super significant anyway... 
as I mentioned, there is no snapshot involved with a vanilla iterator.

> So the question is do we want to avoid monkeying with RocksDB and go 
> with a binning approach (TBD ensuring that this is a real solution to 
> the problem) OR do we bite the bullet and solve the lower-bound lookup 
> problem?
> 
> > BTW, on the "binning" scheme, perhaps the solution is just to put the 
> current "binning value" [presumeably some log2 thing] into the oNode 
> itself -- it'll just be a byte. Then you're only stuck with the 
> complexity of deciding when to do a "split" if the current bin has 
> gotten too large (some arbitrary limit on size of the encoded 
> > lextent-bin).

I think simple is better, but it's annoying because one big bin means 
splitting all other bins, and we don't know when to merge without 
seeing all bin sizes.

sage


> 
> > 
> > sage
> > 
> > >
> > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > Sent: Sunday, August 21, 2016 10:27 AM
> > > > To: Sage Weil <sweil@redhat.com>
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: Re: bluestore blobs REVISITED
> > > >
> > > > I wonder how hard it would be to add a "lower-bound" fetch like stl.
> > > > That would allow the kv store to do the fetch without incurring the
> > > > overhead of a snapshot for the iteration scan.
> > > >
> > > > Shared blobs were always going to trigger an extra kv fetch no matter
> > what.
> > > >
> > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > >
> > > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com> wrote:
> > > > >
> > > > >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> > > > >> I have another proposal (it just occurred to me, so it might not
> > > > >> survive more scrutiny).
> > > > >>
> > > > >> Yes, we should remove the blob-map from the oNode.
> > > > >>
> > > > >> But I believe we should also remove the lextent map from the
> > > > >> oNode and make each lextent be an independent KV value.
> > > > >>
> > > > >> However, in the special case where each extent --exactly-- maps
> > > > >> onto a blob AND the blob is not referenced by any other extent
> > > > >> (which is the typical case, unless you're doing compression with
> > > > >> strange-ish
> > > > >> overlaps)
> > > > >> -- then you encode the blob in the lextent itself and there's no
> > > > >> separate blob entry.
> > > > >>
> > > > >> This is pretty much exactly the same number of KV fetches as what
> > > > >> you proposed before when the blob isn't shared (the typical case)
> > > > >> -- except the oNode is MUCH MUCH smaller now.
> > > > >
> > > > > I think this approach makes a lot of sense!  The only thing I'm
> > > > > worried about is that the lextent keys are no longer known when
> > > > > they are being fetched (since they will be a logical offset),
> > > > > which means we'll have to use an iterator instead of a simple get.
> > > > > The former is quite a bit slower than the latter (which can make
> > > > > use of the rocksdb caches and/or key bloom filters more easily).
> > > > >
> > > > > We could augment your approach by keeping *just* the lextent
> > > > > offsets in the onode, so that we know exactly which lextent key to
> > > > > fetch, but then I'm not sure we'll get much benefit (lextent
> > > > > metadata size goes down by ~1/3, but then we have an extra get for
> > cloned objects).
> > > > >
> > > > > Hmm, the other thing to keep in mind is that for RBD the common
> > > > > case is that lots of objects have clones, and many of those
> > > > > objects' blobs will be shared.
> > > > >
> > > > > sage
> > > > >
> > > > >> So for the non-shared case, you fetch the oNode which is
> > > > >> dominated by the xattrs now (so figure a couple of hundred bytes
> > > > >> and not much CPU cost to deserialize). And then fetch from the KV
> > > > >> for the lextent (which is 1 fetch -- unless it overlaps two
> > > > >> previous lextents). If it's the optimal case, the KV fetch is
> > > > >> small (10-15 bytes) and trivial to deserialize. If it's an
> > > > >> unshared/local blob then you're ready to go. If the blob is
> > > > >> shared (locally or globally) then you'll have to go fetch that one too.
> > > > >>
> > > > >> This might lead to the elimination of the local/global blob thing
> > > > >> (I think you've talked about that before) as now the only "local"
> > > > >> blobs are the unshared single extent blobs which are stored
> > > > >> inline with the lextent entry. You'll still have the special
> > > > >> cases of promoting unshared
> > > > >> (inline) blobs to global blobs -- which is probably similar to
> > > > >> the current promotion "magic" on a clone operation.
> > > > >>
> > > > >> The current refmap concept may require some additional work. I
> > > > >> believe that we'll have to do a reconstruction of the refmap, but
> > > > >> fortunately only for the range of the current I/O. That will be a
> > > > >> bit more expensive, but still less expensive than reconstructing
> > > > >> the entire refmap for every oNode deserialization, Fortunately I
> > > > >> believe the refmap is only really needed for compression cases or
> > > > >> RBD cases
> > > > without "trim"
> > > > >> (this is the case to optimize -- it'll make trim really important
> > > > >> for performance).
> > > > >>
> > > > >> Best of both worlds????
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, Milpitas, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > >>> Sent: Friday, August 19, 2016 7:16 AM
> > > > >>> To: Sage Weil <sweil@redhat.com>
> > > > >>> Cc: ceph-devel@vger.kernel.org
> > > > >>> Subject: RE: bluestore blobs
> > > > >>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > >>>> Sent: Friday, August 19, 2016 6:53 AM
> > > > >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > >>>> Cc: ceph-devel@vger.kernel.org
> > > > >>>> Subject: RE: bluestore blobs
> > > > >>>>
> > > > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > >>>>>> -----Original Message-----
> > > > >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> > > > >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > >>>>>> Cc: ceph-devel@vger.kernel.org
> > > > >>>>>> Subject: RE: bluestore blobs
> > > > >>>>>>
> > > > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > >>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
> > devel-
> > > > >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> > > > >>>>>>>> To: ceph-devel@vger.kernel.org
> > > > >>>>>>>> Subject: bluestore blobs
> > > > >>>>>>>>
> > > > >>>>>>>> I think we need to look at other changes in addition to the
> > > > >>>>>>>> encoding performance improvements.  Even if they end up
> > > > >>>>>>>> being good enough, these changes are somewhat orthogonal
> > > > >>>>>>>> and at
> > > > least
> > > > >>>>>>>> one of them should give us something that is even faster.
> > > > >>>>>>>>
> > > > >>>>>>>> 1. I mentioned this before, but we should keep the encoding
> > > > >>>>>>>> bluestore_blob_t around when we load the blob map.  If it's
> > > > >>>>>>>> not changed, don't reencode it.  There are no blockers for
> > > > >>>>>>>> implementing this
> > > > >>>>>> currently.
> > > > >>>>>>>> It may be difficult to ensure the blobs are properly marked
> > dirty...
> > > > >>>>>>>> I'll see if we can use proper accessors for the blob to
> > > > >>>>>>>> enforce this at compile time.  We should do that anyway.
> > > > >>>>>>>
> > > > >>>>>>> If it's not changed, then why are we re-writing it? I'm
> > > > >>>>>>> having a hard time thinking of a case worth optimizing where
> > > > >>>>>>> I want to re-write the oNode but the blob_map is unchanged.
> > > > >>>>>>> Am I missing
> > > > >>>> something obvious?
> > > > >>>>>>
> > > > >>>>>> An onode's blob_map might have 300 blobs, and a single write
> > > > >>>>>> only updates one of them.  The other 299 blobs need not be
> > > > >>>>>> reencoded, just
> > > > >>>> memcpy'd.
> > > > >>>>>
> > > > >>>>> As long as we're just appending that's a good optimization.
> > > > >>>>> How often does that happen? It's certainly not going to help
> > > > >>>>> the RBD 4K random write problem.
> > > > >>>>
> > > > >>>> It won't help the (l)extent_map encoding, but it avoids almost
> > > > >>>> all of the blob reencoding.  A 4k random write will update one
> > > > >>>> blob out of
> > > > >>>> ~100 (or whatever it is).
> > > > >>>>
> > > > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy
> > stages:
> > > > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each
> > > > >>>>>>>> untouched
> > > > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
> > > > >>>>>>>> somewhere inside rocksdb to copy this into the write buffer.
> > > > >>>>>>>> We could extend the rocksdb interface to take an iovec so
> > > > >>>>>>>> that the first memcpy isn't needed (and rocksdb will
> > > > >>>>>>>> instead iterate over our buffers and copy them directly into its
> > write buffer).
> > > > >>>>>>>> This is probably a pretty small piece of the overall time...
> > > > >>>>>>>> should verify with a profiler
> > > > >>>>>> before investing too much effort here.
> > > > >>>>>>>
> > > > >>>>>>> I doubt it's the memcpy that's really the expensive part.
> > > > >>>>>>> I'll bet it's that we're transcoding from an internal to an
> > > > >>>>>>> external representation on an element by element basis. If
> > > > >>>>>>> the iovec scheme is going to help, it presumes that the
> > > > >>>>>>> internal data structure essentially matches the external
> > > > >>>>>>> data structure so that only an iovec copy is required. I'm
> > > > >>>>>>> wondering how compatible this is with the current concepts
> > > > >>>>>>> of
> > > > lextext/blob/pextent.
> > > > >>>>>>
> > > > >>>>>> I'm thinking of the xattr case (we have a bunch of strings to
> > > > >>>>>> copy
> > > > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case:
> > > > >>>>>> instead of memcpy'ing them into a big contiguous buffer and
> > > > >>>>>> having rocksdb memcpy
> > > > >>>>>> *that* into it's larger buffer, give rocksdb an iovec so that
> > > > >>>>>> they smaller buffers are assembled only once.
> > > > >>>>>>
> > > > >>>>>> These buffers will be on the order of many 10s to a couple
> > > > >>>>>> 100s of
> > > > >>> bytes.
> > > > >>>>>> I'm not sure where the crossover point for constructing and
> > > > >>>>>> then traversing an iovec vs just copying twice would be...
> > > > >>>>>
> > > > >>>>> Yes this will eliminate the "extra" copy, but the real problem
> > > > >>>>> is that the oNode itself is just too large. I doubt removing
> > > > >>>>> one extra copy is going to suddenly "solve" this problem. I
> > > > >>>>> think we're going to end up rejiggering things so that this
> > > > >>>>> will be much less of a problem than it is now -- time will tell.
> > > > >>>>
> > > > >>>> Yeah, leaving this one for last I think... until we see memcpy
> > > > >>>> show up in the profile.
> > > > >>>>
> > > > >>>>>>>> 3. Even if we do the above, we're still setting a big (~4k
> > > > >>>>>>>> or
> > > > >>>>>>>> more?) key into rocksdb every time we touch an object, even
> > > > >>>>>>>> when a tiny
> > > > >>>>>
> > > > >>>>> See my analysis, you're looking at 8-10K for the RBD random
> > > > >>>>> write case
> > > > >>>>> -- which I think everybody cares a lot about.
> > > > >>>>>
> > > > >>>>>>>> amount of metadata is getting changed.  This is a
> > > > >>>>>>>> consequence of embedding all of the blobs into the onode
> > > > >>>>>>>> (or bnode).  That seemed like a good idea early on when
> > > > >>>>>>>> they were tiny (i.e., just an extent), but now I'm not so
> > > > >>>>>>>> sure.  I see a couple of different
> > > > >>>> options:
> > > > >>>>>>>>
> > > > >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we load
> > > > >>>>>>>> the onode, load the blobs too.  They will hopefully be
> > > > >>>>>>>> sequential in rocksdb (or definitely sequential in zs).
> > > > >>>>>>>> Probably go back to using an
> > > > >>>> iterator.
> > > > >>>>>>>>
> > > > >>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids
> > > > >>>>>>>> so that they are unique for any given hash value.  Then
> > > > >>>>>>>> store the blobs as $shard.$poolid.$hash.$blobid (i.e.,
> > > > >>>>>>>> where the bnode is now).  Then when clone happens there is
> > > > >>>>>>>> no onode->bnode migration magic happening--we've already
> > > > >>>>>>>> committed to storing blobs in separate keys.  When we load
> > > > >>>>>>>> the onode, keep the conditional bnode loading we already
> > > > >>>>>>>> have.. but when the bnode is loaded load up all the blobs
> > > > >>>>>>>> for the hash key.  (Okay, we could fault in blobs
> > > > >>>>>>>> individually, but that code will be more
> > > > >>>>>>>> complicated.)
> > > > >>>>>
> > > > >>>>> I like this direction. I think you'll still end up demand
> > > > >>>>> loading the blobs in order to speed up the random read case.
> > > > >>>>> This scheme will result in some space-amplification, both in
> > > > >>>>> the lextent and in the blob-map, it's worth a bit of study too
> > > > >>>>> see how bad the metadata/data ratio becomes (just as a guess,
> > > > >>>>> $shard.$poolid.$hash.$blobid is probably 16 +
> > > > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each
> > > > >>>>> Blob
> > > > >>>>> -- unless your KV store does path compression. My reading of
> > > > >>>>> RocksDB sst file seems to indicate that it doesn't, I
> > > > >>>>> *believe* that ZS does [need to confirm]). I'm wondering if
> > > > >>>>> the current notion
> > > > of local vs.
> > > > >>>>> global blobs isn't actually beneficial in that we can give
> > > > >>>>> local blobs different names that sort with their associated
> > > > >>>>> oNode (which probably makes the space-amp worse) which is an
> > > > >>>>> important optimization. We do need to watch the space amp,
> > > > >>>>> we're going to be burning DRAM to make KV accesses cheap and
> > > > >>>>> the amount of DRAM
> > > > is
> > > > >>> proportional to the space amp.
> > > > >>>>
> > > > >>>> I got this mostly working last night... just need to sort out
> > > > >>>> the clone case (and clean up a bunch of code).  It was a
> > > > >>>> relatively painless transition to make, although in its current
> > > > >>>> form the blobs all belong to the bnode, and the bnode if
> > > > >>>> ephemeral but remains in
> > > > >>> memory until all referencing onodes go away.
> > > > >>>> Mostly fine, except it means that odd combinations of clone
> > > > >>>> could leave lots of blobs in cache that don't get trimmed.
> > > > >>>> Will address that
> > > > later.
> > > > >>>>
> > > > >>>> I'll try to finish it up this morning and get it passing tests and posted.
> > > > >>>>
> > > > >>>>>>>> In both these cases, a write will dirty the onode (which is
> > > > >>>>>>>> back to being pretty small.. just xattrs and the lextent
> > > > >>>>>>>> map) and 1-3 blobs (also
> > > > >>>>>> now small keys).
> > > > >>>>>
> > > > >>>>> I'm not sure the oNode is going to be that small. Looking at
> > > > >>>>> the RBD random 4K write case, you're going to have 1K entries
> > > > >>>>> each of which has an offset, size and a blob-id reference in
> > > > >>>>> them. In my current oNode compression scheme this compresses
> > > > >>>>> to about 1 byte
> > > > per entry.
> > > > >>>>> However, this optimization relies on being able to cheaply
> > > > >>>>> renumber the blob-ids, which is no longer possible when the
> > > > >>>>> blob-ids become parts of a key (see above). So now you'll have
> > > > >>>>> a minimum of 1.5-3 bytes extra for each blob-id (because you
> > > > >>>>> can't assume that the blob-ids
> > > > >>>> become "dense"
> > > > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or about
> > > > >>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable
> > > > >>>>> length encoding you'll have to scan the entire table to
> > > > >>>>> deserialize it (yes, we could do differential editing when we
> > > > >>>>> write but that's another
> > > > >>> discussion).
> > > > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :).
> > > > >>>>> So while this looks small compared to the current ~30K for the
> > > > >>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over
> > > > >>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I
> > > > published earlier.
> > > > >>>>>
> > > > >>>>> If we want to do better we will need to separate the lextent
> > > > >>>>> from the oNode also. It's relatively easy to move the lextents
> > > > >>>>> into the KV store itself (there are two obvious ways to deal
> > > > >>>>> with this, either use the native offset/size from the lextent
> > > > >>>>> itself OR create 'N' buckets of logical offset into which we
> > > > >>>>> pour entries -- both of these would add somewhere between 1
> > > > >>>>> and 2 KV look-ups
> > > > per
> > > > >>>>> operation
> > > > >>>>> -- here is where an iterator would probably help.
> > > > >>>>>
> > > > >>>>> Unfortunately, if you only process a portion of the lextent
> > > > >>>>> (because you've made it into multiple keys and you don't want
> > > > >>>>> to load all of
> > > > >>>>> them) you no longer can re-generate the refmap on the fly
> > > > >>>>> (another key space optimization). The lack of refmap screws up
> > > > >>>>> a number of other important algorithms -- for example the
> > > > >>>>> overlapping blob-map
> > > > >>> thing, etc.
> > > > >>>>> Not sure if these are easy to rewrite or not -- too
> > > > >>>>> complicated to think about at this hour of the evening.
> > > > >>>>
> > > > >>>> Yeah, I forgot about the extent_map and how big it gets.  I
> > > > >>>> think, though, that if we can get a 4mb object with 1024 4k
> > > > >>>> lextents to encode the whole onode and extent_map in under 4K
> > > > >>>> that will be good enough.  The blob update that goes with it
> > > > >>>> will be ~200 bytes, and benchmarks aside, the 4k random write
> > > > >>>> 100% fragmented object is a worst
> > > > >>> case.
> > > > >>>
> > > > >>> Yes, it's a worst-case. But it's a
> > > > >>> "worst-case-that-everybody-looks-at" vs. a
> > > > >>> "worst-case-that-almost-
> > > > nobody-looks-at".
> > > > >>>
> > > > >>> I'm still concerned about having an oNode that's larger than a 4K
> > block.
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>> Anyway, I'll get the blob separation branch working and we can
> > > > >>>> go from there...
> > > > >>>>
> > > > >>>> sage
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe
> > > > >>> ceph-devel" in the body of a message to
> > > > >>> majordomo@vger.kernel.org More majordomo info at
> > > > >>> http://vger.kernel.org/majordomo-info.html
> > > > >> --
> > > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >> in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo
> > > > >> info at  http://vger.kernel.org/majordomo-info.html
> > > > >>
> > > > >>
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > >
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-22 22:55           ` Sage Weil
@ 2016-08-22 23:09             ` Allen Samuels
  2016-08-23 16:02               ` Sage Weil
  2016-08-23  5:21             ` Varada Kari
  1 sibling, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-22 23:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 6:55 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs REVISITED
> 
> On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 6:09 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore blobs REVISITED
> > >
> > > On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > > Another possibility is to "bin" the lextent table into known,
> > > > fixed, offset ranges. Suppose each oNode had a fixed range of LBA
> > > > keys associated with the lextent table: say [0..128K), [128K..256K), ...
> > >
> > > Yeah, I think that's the way to do it.  Either a set<uint32_t>
> > > lextent_key_offsets or uint64_t lextent_map_chunk_size to specify
> > > the granularity.
> >
> > Need to do some actual estimates on this scheme to make sure we're
> > actually landing on a real solution and not just another band-aid that
> > we have to rip off (painfully) at some future time.
> 
> Yeah
> 
> > > > It eliminates the need to put a "lower_bound" into the KV Store
> > > > directly. Though it'll likely be a bit more complex above and
> > > > somewhat less efficient.
> > >
> > > FWIW this is basically wat the iterator does, I think.  It's a
> > > separate rocksdb operation to create a snapshot to iterate over (and
> > > we don't rely on that feature anywhere).  It's still more expensive
> > > than raw gets just because it has to have fingers in all the levels
> > > to ensure that it finds all the keys for the given range, while get
> > > can stop once it finds a key or a tombstone (either in a cache or a higher
> level of the tree).
> > >
> > > But I'm not super thrilled about this complexity.  I still hope
> > > (wishfully?) we can get the lextent map small enough that we can
> > > leave it in the onode.  Otherwise we really are going to net +1 kv
> > > fetch for every operation (3, in the clone case... onode, then lextent,
> then shared blob).
> >
> > I don't see the feasibility of leaving the lextent map in the oNode.
> > It's just too big for the 4K random write case. I know that's not
> > indicative of real-world usage. But it's what people use to measure
> > systems....
> 
> How small do you think it would need to be to be acceptable (in teh worst-
> case, 1024 4k lextents in a 4m object)?  2k?  3k?  1k?
> 
> You're probably right, but I think I can still cut the full map encoding further.
> It was 6500, I did a quick tweak to get it to 5500, and I think I can drop another
> 1-2 bytes per entry for common cases (offset of 0, length == previous lextent
> length), which would get it under the 4k mark.
> 

I don't think there's a hard number on the size, but the fancy differential encoding is going to really chew up CPU time re-inflating it -- all just to do a lower-bound lookup on the inflated data.
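
Just to illustrate that cost with a toy sketch (not the real oNode encoding; the names are made up): with delta-encoded offsets, even a lower_bound has to decode every entry up to the target to rebuild the absolute offsets.

// Toy sketch only: offsets are stored as deltas from the previous
// lextent, so locating the entry covering a logical offset is a
// linear decode over everything before it.
#include <cstdint>
#include <vector>

struct EncodedLextent { uint32_t delta_off; uint32_t length; };

// Index of the first lextent whose absolute end is past 'target';
// every preceding entry must be inflated (prefix-summed) to get there.
size_t lower_bound_delta(const std::vector<EncodedLextent>& m, uint64_t target)
{
  uint64_t off = 0;
  for (size_t i = 0; i < m.size(); ++i) {
    off += m[i].delta_off;
    if (off + m[i].length > target)
      return i;
  }
  return m.size();
}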
 
> > BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff
> > suggests that it's would relatively trivial to ensure that
> > bloom-filters properly ignore the "offset" portion of an lextent key.
> > Meaning that I believe that a "lower_bound" operator ought to be
> > relatively easy to implement without triggering the overhead of a
> > snapshot, again, provided that you provided a custom bloom-filter
> > implementation that did the right thing.
> 
> Hmm, that might help.  It may be that this isn't super significant, though,
> too... as I mentioned there is no snapshot involved with a vanilla iterator.

Technically correct. However, the iterator is documented as providing a consistent view of the data, meaning it has some kind of micro-snapshot equivalent associated with it -- which I suspect is where the extra expense comes in (it must bump an internal ref-count on the .sst files covering the iterator's range, plus some fiddling with the memtable). The lower_bound that I'm talking about ought not to be any more expensive than a regular Get would be (assuming the correct bloom filter behavior); it's just a tweak of the existing search logic for a regular Get operation. There are probably some edge cases where you land exactly on the end of an .sst range, but perhaps by providing only one of lower_bound or upper_bound (not both) that edge case could be eliminated.
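
For what it's worth, the rocksdb side might not need much. Something along these lines (a sketch only; the fixed 32-byte prefix is just for illustration -- real onode keys are variable length, so we'd need a custom SliceTransform that strips the trailing offset bytes):

#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

// Treat the leading bytes of a lextent key (the onode portion) as the
// prefix, so bloom filters ignore the trailing logical-offset bytes.
rocksdb::Options make_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(32));

  rocksdb::BlockBasedTableOptions t;
  t.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  t.whole_key_filtering = false;   // build filters on the prefix instead
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
  return opts;
}

// "lower_bound" within one onode: Seek() with prefix_same_as_start set,
// so only SSTs whose prefix bloom may contain this onode are consulted.
std::string lower_bound_lextent(rocksdb::DB* db,
                                const std::string& onode_prefix,
                                const std::string& encoded_offset) {
  rocksdb::ReadOptions ro;
  ro.prefix_same_as_start = true;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  it->Seek(onode_prefix + encoded_offset);
  return it->Valid() ? it->key().ToString() : std::string();
}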

> 
> > So the question is do we want to avoid monkeying with RocksDB and go
> > with a binning approach (TBD ensuring that this is a real solution to
> > the problem) OR do we bite the bullet and solve the lower-bound lookup
> > problem?
> >
> > BTW, on the "binning" scheme, perhaps the solution is just put the
> > current "binning value" [presumeably some log2 thing] into the oNode
> > itself -- it'll just be a byte. Then you're only stuck with the
> > complexity of deciding when to do a "split" if the current bin has
> > gotten too large (some arbitrary limit on size of the encoded
> > lexent-bin)
> 
> I think simple is better, but it's annoying because one big bin means splitting
> all other bins, and we don't know when to merge without seeing all bin sizes.

I was thinking of something simple: one binning value for the entire oNode. If you split, you have to read all of the lextents, re-bin them, and re-write them -- which shouldn't be --that-- difficult to do. Maybe we just ignore the un-split case for now [do it later?].
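
Concretely, something like this (just a sketch; the names and the 128K starting bin size are made up, not actual BlueStore code):

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Lextent { uint64_t logical_off; uint32_t length; /* blob ref, etc. */ };

// One binning value for the whole oNode: bin size = 1 << bin_shift.
struct OnodeLextentBins {
  uint8_t bin_shift = 17;                        // start with 128K bins
  std::map<uint32_t, std::vector<Lextent>> bins; // bin id -> lextents encoded together

  uint32_t bin_of(uint64_t off) const { return uint32_t(off >> bin_shift); }

  // Bins a write to [off, off+len) has to load and rewrite -- usually one,
  // at most two for sane bin sizes.
  std::pair<uint32_t, uint32_t> touched(uint64_t off, uint64_t len) const {
    return std::make_pair(bin_of(off), bin_of(off + len - 1));
  }

  // "Split": some bin's encoding got too big, so halve the bin size for
  // the whole oNode -- read all the lextents, re-bin them, rewrite them.
  // (Un-splitting, i.e. merging bins back under a larger shift, is the
  // part we could defer.)
  void split_all() {
    --bin_shift;
    std::map<uint32_t, std::vector<Lextent>> rebinned;
    for (auto& kv : bins)
      for (auto& le : kv.second)
        rebinned[bin_of(le.logical_off)].push_back(le);
    bins.swap(rebinned);
  }
};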
 
> 
> sage
> 
> 
> >
> > >
> > > sage
> > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > Sent: Sunday, August 21, 2016 10:27 AM
> > > > > To: Sage Weil <sweil@redhat.com>
> > > > > Cc: ceph-devel@vger.kernel.org
> > > > > Subject: Re: bluestore blobs REVISITED
> > > > >
> > > > > I wonder how hard it would be to add a "lower-bound" fetch like stl.
> > > > > That would allow the kv store to do the fetch without incurring
> > > > > the overhead of a snapshot for the iteration scan.
> > > > >
> > > > > Shared blobs were always going to trigger an extra kv fetch no
> > > > > matter
> > > what.
> > > > >
> > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > >
> > > > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com>
> wrote:
> > > > > >
> > > > > >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> > > > > >> I have another proposal (it just occurred to me, so it might
> > > > > >> not survive more scrutiny).
> > > > > >>
> > > > > >> Yes, we should remove the blob-map from the oNode.
> > > > > >>
> > > > > >> But I believe we should also remove the lextent map from the
> > > > > >> oNode and make each lextent be an independent KV value.
> > > > > >>
> > > > > >> However, in the special case where each extent --exactly--
> > > > > >> maps onto a blob AND the blob is not referenced by any other
> > > > > >> extent (which is the typical case, unless you're doing
> > > > > >> compression with strange-ish
> > > > > >> overlaps)
> > > > > >> -- then you encode the blob in the lextent itself and there's
> > > > > >> no separate blob entry.
> > > > > >>
> > > > > >> This is pretty much exactly the same number of KV fetches as
> > > > > >> what you proposed before when the blob isn't shared (the
> > > > > >> typical case)
> > > > > >> -- except the oNode is MUCH MUCH smaller now.
> > > > > >
> > > > > > I think this approach makes a lot of sense!  The only thing
> > > > > > I'm worried about is that the lextent keys are no longer known
> > > > > > when they are being fetched (since they will be a logical
> > > > > > offset), which means we'll have to use an iterator instead of a
> simple get.
> > > > > > The former is quite a bit slower than the latter (which can
> > > > > > make use of the rocksdb caches and/or key bloom filters more
> easily).
> > > > > >
> > > > > > We could augment your approach by keeping *just* the lextent
> > > > > > offsets in the onode, so that we know exactly which lextent
> > > > > > key to fetch, but then I'm not sure we'll get much benefit
> > > > > > (lextent metadata size goes down by ~1/3, but then we have an
> > > > > > extra get for
> > > cloned objects).
> > > > > >
> > > > > > Hmm, the other thing to keep in mind is that for RBD the
> > > > > > common case is that lots of objects have clones, and many of
> > > > > > those objects' blobs will be shared.
> > > > > >
> > > > > > sage
> > > > > >
> > > > > >> So for the non-shared case, you fetch the oNode which is
> > > > > >> dominated by the xattrs now (so figure a couple of hundred
> > > > > >> bytes and not much CPU cost to deserialize). And then fetch
> > > > > >> from the KV for the lextent (which is 1 fetch -- unless it
> > > > > >> overlaps two previous lextents). If it's the optimal case,
> > > > > >> the KV fetch is small (10-15 bytes) and trivial to
> > > > > >> deserialize. If it's an unshared/local blob then you're ready
> > > > > >> to go. If the blob is shared (locally or globally) then you'll have to go
> fetch that one too.
> > > > > >>
> > > > > >> This might lead to the elimination of the local/global blob
> > > > > >> thing (I think you've talked about that before) as now the only
> "local"
> > > > > >> blobs are the unshared single extent blobs which are stored
> > > > > >> inline with the lextent entry. You'll still have the special
> > > > > >> cases of promoting unshared
> > > > > >> (inline) blobs to global blobs -- which is probably similar
> > > > > >> to the current promotion "magic" on a clone operation.
> > > > > >>
> > > > > >> The current refmap concept may require some additional work.
> > > > > >> I believe that we'll have to do a reconstruction of the
> > > > > >> refmap, but fortunately only for the range of the current
> > > > > >> I/O. That will be a bit more expensive, but still less
> > > > > >> expensive than reconstructing the entire refmap for every
> > > > > >> oNode deserialization, Fortunately I believe the refmap is
> > > > > >> only really needed for compression cases or RBD cases
> > > > > without "trim"
> > > > > >> (this is the case to optimize -- it'll make trim really
> > > > > >> important for performance).
> > > > > >>
> > > > > >> Best of both worlds????
> > > > > >>
> > > > > >> Allen Samuels
> > > > > >> SanDisk |a Western Digital brand
> > > > > >> 2880 Junction Avenue, Milpitas, CA 95134
> > > > > >> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > >> allen.samuels@SanDisk.com
> > > > > >>
> > > > > >>
> > > > > >>> -----Original Message-----
> > > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > >>> Sent: Friday, August 19, 2016 7:16 AM
> > > > > >>> To: Sage Weil <sweil@redhat.com>
> > > > > >>> Cc: ceph-devel@vger.kernel.org
> > > > > >>> Subject: RE: bluestore blobs
> > > > > >>>
> > > > > >>>> -----Original Message-----
> > > > > >>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > > >>>> Sent: Friday, August 19, 2016 6:53 AM
> > > > > >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > >>>> Cc: ceph-devel@vger.kernel.org
> > > > > >>>> Subject: RE: bluestore blobs
> > > > > >>>>
> > > > > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > > >>>>>> -----Original Message-----
> > > > > >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> > > > > >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > >>>>>> Cc: ceph-devel@vger.kernel.org
> > > > > >>>>>> Subject: RE: bluestore blobs
> > > > > >>>>>>
> > > > > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > >>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
> > > devel-
> > > > > >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > >>>>>>>> To: ceph-devel@vger.kernel.org
> > > > > >>>>>>>> Subject: bluestore blobs
> > > > > >>>>>>>>
> > > > > >>>>>>>> I think we need to look at other changes in addition to
> > > > > >>>>>>>> the encoding performance improvements.  Even if they
> > > > > >>>>>>>> end up being good enough, these changes are somewhat
> > > > > >>>>>>>> orthogonal and at
> > > > > least
> > > > > >>>>>>>> one of them should give us something that is even faster.
> > > > > >>>>>>>>
> > > > > >>>>>>>> 1. I mentioned this before, but we should keep the
> > > > > >>>>>>>> encoding bluestore_blob_t around when we load the blob
> > > > > >>>>>>>> map.  If it's not changed, don't reencode it.  There
> > > > > >>>>>>>> are no blockers for implementing this
> > > > > >>>>>> currently.
> > > > > >>>>>>>> It may be difficult to ensure the blobs are properly
> > > > > >>>>>>>> marked
> > > dirty...
> > > > > >>>>>>>> I'll see if we can use proper accessors for the blob to
> > > > > >>>>>>>> enforce this at compile time.  We should do that anyway.
> > > > > >>>>>>>
> > > > > >>>>>>> If it's not changed, then why are we re-writing it? I'm
> > > > > >>>>>>> having a hard time thinking of a case worth optimizing
> > > > > >>>>>>> where I want to re-write the oNode but the blob_map is
> unchanged.
> > > > > >>>>>>> Am I missing
> > > > > >>>> something obvious?
> > > > > >>>>>>
> > > > > >>>>>> An onode's blob_map might have 300 blobs, and a single
> > > > > >>>>>> write only updates one of them.  The other 299 blobs need
> > > > > >>>>>> not be reencoded, just
> > > > > >>>> memcpy'd.
> > > > > >>>>>
> > > > > >>>>> As long as we're just appending that's a good optimization.
> > > > > >>>>> How often does that happen? It's certainly not going to
> > > > > >>>>> help the RBD 4K random write problem.
> > > > > >>>>
> > > > > >>>> It won't help the (l)extent_map encoding, but it avoids
> > > > > >>>> almost all of the blob reencoding.  A 4k random write will
> > > > > >>>> update one blob out of
> > > > > >>>> ~100 (or whatever it is).
> > > > > >>>>
> > > > > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy
> > > stages:
> > > > > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to
> > > > > >>>>>>>> each untouched
> > > > > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
> > > > > >>>>>>>> somewhere inside rocksdb to copy this into the write
> buffer.
> > > > > >>>>>>>> We could extend the rocksdb interface to take an iovec
> > > > > >>>>>>>> so that the first memcpy isn't needed (and rocksdb will
> > > > > >>>>>>>> instead iterate over our buffers and copy them directly
> > > > > >>>>>>>> into its
> > > write buffer).
> > > > > >>>>>>>> This is probably a pretty small piece of the overall time...
> > > > > >>>>>>>> should verify with a profiler
> > > > > >>>>>> before investing too much effort here.
> > > > > >>>>>>>
> > > > > >>>>>>> I doubt it's the memcpy that's really the expensive part.
> > > > > >>>>>>> I'll bet it's that we're transcoding from an internal to
> > > > > >>>>>>> an external representation on an element by element
> > > > > >>>>>>> basis. If the iovec scheme is going to help, it presumes
> > > > > >>>>>>> that the internal data structure essentially matches the
> > > > > >>>>>>> external data structure so that only an iovec copy is
> > > > > >>>>>>> required. I'm wondering how compatible this is with the
> > > > > >>>>>>> current concepts of
> > > > > lextext/blob/pextent.
> > > > > >>>>>>
> > > > > >>>>>> I'm thinking of the xattr case (we have a bunch of
> > > > > >>>>>> strings to copy
> > > > > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged
> case:
> > > > > >>>>>> instead of memcpy'ing them into a big contiguous buffer
> > > > > >>>>>> and having rocksdb memcpy
> > > > > >>>>>> *that* into it's larger buffer, give rocksdb an iovec so
> > > > > >>>>>> that they smaller buffers are assembled only once.
> > > > > >>>>>>
> > > > > >>>>>> These buffers will be on the order of many 10s to a
> > > > > >>>>>> couple 100s of
> > > > > >>> bytes.
> > > > > >>>>>> I'm not sure where the crossover point for constructing
> > > > > >>>>>> and then traversing an iovec vs just copying twice would be...
> > > > > >>>>>
> > > > > >>>>> Yes this will eliminate the "extra" copy, but the real
> > > > > >>>>> problem is that the oNode itself is just too large. I
> > > > > >>>>> doubt removing one extra copy is going to suddenly "solve"
> > > > > >>>>> this problem. I think we're going to end up rejiggering
> > > > > >>>>> things so that this will be much less of a problem than it is now
> -- time will tell.
> > > > > >>>>
> > > > > >>>> Yeah, leaving this one for last I think... until we see
> > > > > >>>> memcpy show up in the profile.
> > > > > >>>>
> > > > > >>>>>>>> 3. Even if we do the above, we're still setting a big
> > > > > >>>>>>>> (~4k or
> > > > > >>>>>>>> more?) key into rocksdb every time we touch an object,
> > > > > >>>>>>>> even when a tiny
> > > > > >>>>>
> > > > > >>>>> See my analysis, you're looking at 8-10K for the RBD
> > > > > >>>>> random write case
> > > > > >>>>> -- which I think everybody cares a lot about.
> > > > > >>>>>
> > > > > >>>>>>>> amount of metadata is getting changed.  This is a
> > > > > >>>>>>>> consequence of embedding all of the blobs into the
> > > > > >>>>>>>> onode (or bnode).  That seemed like a good idea early
> > > > > >>>>>>>> on when they were tiny (i.e., just an extent), but now
> > > > > >>>>>>>> I'm not so sure.  I see a couple of different
> > > > > >>>> options:
> > > > > >>>>>>>>
> > > > > >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we
> > > > > >>>>>>>> load the onode, load the blobs too.  They will
> > > > > >>>>>>>> hopefully be sequential in rocksdb (or definitely sequential
> in zs).
> > > > > >>>>>>>> Probably go back to using an
> > > > > >>>> iterator.
> > > > > >>>>>>>>
> > > > > >>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob
> > > > > >>>>>>>> ids so that they are unique for any given hash value.
> > > > > >>>>>>>> Then store the blobs as $shard.$poolid.$hash.$blobid
> > > > > >>>>>>>> (i.e., where the bnode is now).  Then when clone
> > > > > >>>>>>>> happens there is no onode->bnode migration magic
> > > > > >>>>>>>> happening--we've already committed to storing blobs in
> > > > > >>>>>>>> separate keys.  When we load the onode, keep the
> > > > > >>>>>>>> conditional bnode loading we already have.. but when
> > > > > >>>>>>>> the bnode is loaded load up all the blobs for the hash
> > > > > >>>>>>>> key.  (Okay, we could fault in blobs individually, but
> > > > > >>>>>>>> that code will be more
> > > > > >>>>>>>> complicated.)
> > > > > >>>>>
> > > > > >>>>> I like this direction. I think you'll still end up demand
> > > > > >>>>> loading the blobs in order to speed up the random read case.
> > > > > >>>>> This scheme will result in some space-amplification, both
> > > > > >>>>> in the lextent and in the blob-map, it's worth a bit of
> > > > > >>>>> study too see how bad the metadata/data ratio becomes
> > > > > >>>>> (just as a guess, $shard.$poolid.$hash.$blobid is probably
> > > > > >>>>> 16 +
> > > > > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for
> > > > > >>>>> each Blob
> > > > > >>>>> -- unless your KV store does path compression. My reading
> > > > > >>>>> of RocksDB sst file seems to indicate that it doesn't, I
> > > > > >>>>> *believe* that ZS does [need to confirm]). I'm wondering
> > > > > >>>>> if the current notion
> > > > > of local vs.
> > > > > >>>>> global blobs isn't actually beneficial in that we can give
> > > > > >>>>> local blobs different names that sort with their
> > > > > >>>>> associated oNode (which probably makes the space-amp
> > > > > >>>>> worse) which is an important optimization. We do need to
> > > > > >>>>> watch the space amp, we're going to be burning DRAM to
> > > > > >>>>> make KV accesses cheap and the amount of DRAM
> > > > > is
> > > > > >>> proportional to the space amp.
> > > > > >>>>
> > > > > >>>> I got this mostly working last night... just need to sort
> > > > > >>>> out the clone case (and clean up a bunch of code).  It was
> > > > > >>>> a relatively painless transition to make, although in its
> > > > > >>>> current form the blobs all belong to the bnode, and the
> > > > > >>>> bnode if ephemeral but remains in
> > > > > >>> memory until all referencing onodes go away.
> > > > > >>>> Mostly fine, except it means that odd combinations of clone
> > > > > >>>> could leave lots of blobs in cache that don't get trimmed.
> > > > > >>>> Will address that
> > > > > later.
> > > > > >>>>
> > > > > >>>> I'll try to finish it up this morning and get it passing tests and
> posted.
> > > > > >>>>
> > > > > >>>>>>>> In both these cases, a write will dirty the onode
> > > > > >>>>>>>> (which is back to being pretty small.. just xattrs and
> > > > > >>>>>>>> the lextent
> > > > > >>>>>>>> map) and 1-3 blobs (also
> > > > > >>>>>> now small keys).
> > > > > >>>>>
> > > > > >>>>> I'm not sure the oNode is going to be that small. Looking
> > > > > >>>>> at the RBD random 4K write case, you're going to have 1K
> > > > > >>>>> entries each of which has an offset, size and a blob-id
> > > > > >>>>> reference in them. In my current oNode compression scheme
> > > > > >>>>> this compresses to about 1 byte
> > > > > per entry.
> > > > > >>>>> However, this optimization relies on being able to cheaply
> > > > > >>>>> renumber the blob-ids, which is no longer possible when
> > > > > >>>>> the blob-ids become parts of a key (see above). So now
> > > > > >>>>> you'll have a minimum of 1.5-3 bytes extra for each
> > > > > >>>>> blob-id (because you can't assume that the blob-ids
> > > > > >>>> become "dense"
> > > > > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or
> > > > > >>>>> about 2.5-4K Bytes of lextent table. Worse, because of the
> > > > > >>>>> variable length encoding you'll have to scan the entire
> > > > > >>>>> table to deserialize it (yes, we could do differential
> > > > > >>>>> editing when we write but that's another
> > > > > >>> discussion).
> > > > > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs
> :).
> > > > > >>>>> So while this looks small compared to the current ~30K for
> > > > > >>>>> the entire thing oNode/lextent/blobmap, it's NOT a huge
> > > > > >>>>> gain over 8-10K of the compressed oNode/lextent/blobmap
> > > > > >>>>> scheme that I
> > > > > published earlier.
> > > > > >>>>>
> > > > > >>>>> If we want to do better we will need to separate the
> > > > > >>>>> lextent from the oNode also. It's relatively easy to move
> > > > > >>>>> the lextents into the KV store itself (there are two
> > > > > >>>>> obvious ways to deal with this, either use the native
> > > > > >>>>> offset/size from the lextent itself OR create 'N' buckets
> > > > > >>>>> of logical offset into which we pour entries -- both of
> > > > > >>>>> these would add somewhere between 1 and 2 KV look-ups
> > > > > per
> > > > > >>>>> operation
> > > > > >>>>> -- here is where an iterator would probably help.
> > > > > >>>>>
> > > > > >>>>> Unfortunately, if you only process a portion of the
> > > > > >>>>> lextent (because you've made it into multiple keys and you
> > > > > >>>>> don't want to load all of
> > > > > >>>>> them) you no longer can re-generate the refmap on the fly
> > > > > >>>>> (another key space optimization). The lack of refmap
> > > > > >>>>> screws up a number of other important algorithms -- for
> > > > > >>>>> example the overlapping blob-map
> > > > > >>> thing, etc.
> > > > > >>>>> Not sure if these are easy to rewrite or not -- too
> > > > > >>>>> complicated to think about at this hour of the evening.
> > > > > >>>>
> > > > > >>>> Yeah, I forgot about the extent_map and how big it gets.  I
> > > > > >>>> think, though, that if we can get a 4mb object with 1024 4k
> > > > > >>>> lextents to encode the whole onode and extent_map in under
> > > > > >>>> 4K that will be good enough.  The blob update that goes
> > > > > >>>> with it will be ~200 bytes, and benchmarks aside, the 4k
> > > > > >>>> random write 100% fragmented object is a worst
> > > > > >>> case.
> > > > > >>>
> > > > > >>> Yes, it's a worst-case. But it's a
> > > > > >>> "worst-case-that-everybody-looks-at" vs. a
> > > > > >>> "worst-case-that-almost-
> > > > > nobody-looks-at".
> > > > > >>>
> > > > > >>> I'm still concerned about having an oNode that's larger than
> > > > > >>> a 4K
> > > block.
> > > > > >>>
> > > > > >>>
> > > > > >>>>
> > > > > >>>> Anyway, I'll get the blob separation branch working and we
> > > > > >>>> can go from there...
> > > > > >>>>
> > > > > >>>> sage
> > > > > >>> --
> > > > > >>> To unsubscribe from this list: send the line "unsubscribe
> > > > > >>> ceph-devel" in the body of a message to
> > > > > >>> majordomo@vger.kernel.org More majordomo info at
> > > > > >>> http://vger.kernel.org/majordomo-info.html
> > > > > >> --
> > > > > >> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > > > > >> in the body of a message to majordomo@vger.kernel.org More
> > > > > majordomo
> > > > > >> info at  http://vger.kernel.org/majordomo-info.html
> > > > > >>
> > > > > >>
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > ceph-devel" in the body of a message to
> > > > > majordomo@vger.kernel.org More majordomo info at
> > > > > http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> >
> >

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: bluestore blobs REVISITED
  2016-08-22 22:55           ` Sage Weil
  2016-08-22 23:09             ` Allen Samuels
@ 2016-08-23  5:21             ` Varada Kari
  1 sibling, 0 replies; 29+ messages in thread
From: Varada Kari @ 2016-08-23  5:21 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel

On Tuesday 23 August 2016 04:27 AM, Sage Weil wrote:
> On Mon, 22 Aug 2016, Allen Samuels wrote:
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Monday, August 22, 2016 6:09 PM
>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: bluestore blobs REVISITED
>>>
>>> On Mon, 22 Aug 2016, Allen Samuels wrote:
>>>> Another possibility is to "bin" the lextent table into known, fixed,
>>>> offset ranges. Suppose each oNode had a fixed range of LBA keys
>>>> associated with the lextent table: say [0..128K), [128K..256K), ...
>>> Yeah, I think that's the way to do it.  Either a set<uint32_t>
>>> lextent_key_offsets or uint64_t lextent_map_chunk_size to specify the
>>> granularity.
>> Need to do some actual estimates on this scheme to make sure we're
>> actually landing on a real solution and not just another band-aid that
>> we have to rip off (painfully) at some future time.
> Yeah
>
>>>> It eliminates the need to put a "lower_bound" into the KV Store
>>>> directly. Though it'll likely be a bit more complex above and somewhat
>>>> less efficient.
>>> FWIW this is basically wat the iterator does, I think.  It's a separate rocksdb
>>> operation to create a snapshot to iterate over (and we don't rely on that
>>> feature anywhere).  It's still more expensive than raw gets just because it has
>>> to have fingers in all the levels to ensure that it finds all the keys for the given
>>> range, while get can stop once it finds a key or a tombstone (either in a cache
>>> or a higher level of the tree).
>>>
>>> But I'm not super thrilled about this complexity.  I still hope
>>> (wishfully?) we can get the lextent map small enough that we can leave it in
>>> the onode.  Otherwise we really are going to net +1 kv fetch for every
>>> operation (3, in the clone case... onode, then lextent, then shared blob).
>> I don't see the feasibility of leaving the lextent map in the oNode.
>> It's just too big for the 4K random write case. I know that's not
>> indicative of real-world usage. But it's what people use to measure
>> systems....
> How small do you think it would need to be to be acceptable (in teh
> worst-case, 1024 4k lextents in a 4m object)?  2k?  3k?  1k?
>
> You're probably right, but I think I can still cut the full map encoding
> further.  It was 6500, I did a quick tweak to get it to 5500, and I think
> I can drop another 1-2 bytes per entry for common cases (offset of 0,
> length == previous lextent length), which would get it under the 4k mark.
I am also working on some optimizations along these lines. I introduced
one flag byte that indicates whether the offset is zero and whether the
lextent length is the same as the blob pextent length (this fails for
the superblock and the incremental osdmap; I don't have a good way to
address that yet), provided the pextent vector has only one extent.
Another one: if we know there is only one extent in the pextent vector
(usually the case), maybe we can serialize that extent directly rather
than going through the vector, which saves a couple of bytes for the
vector size. Similar changes apply to the lextents as well. I'm making
the changes now to see whether that yields some savings.
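
Roughly the shape of the encoding I'm experimenting with (a sketch only,
names made up, not the actual patch):

#include <cstdint>
#include <string>

// One flag byte per lextent lets the common fields be omitted entirely.
enum : uint8_t {
  FLAG_OFFSET_ZERO    = 0x01,  // blob offset == 0, don't encode it
  FLAG_LEN_EQ_PEXTENT = 0x02,  // lextent length == the single pextent length
  FLAG_SINGLE_PEXTENT = 0x04,  // exactly one pextent: skip the vector size
};

static void put_varint(std::string& out, uint64_t v) {
  while (v >= 0x80) { out.push_back(char((v & 0x7f) | 0x80)); v >>= 7; }
  out.push_back(char(v));
}

struct Pextent { uint64_t offset; uint64_t length; };

// Encode one lextent that maps onto a single unshared pextent.
void encode_lextent(std::string& out, uint64_t blob_off, uint64_t len,
                    const Pextent& pe) {
  uint8_t flags = FLAG_SINGLE_PEXTENT;
  if (blob_off == 0)    flags |= FLAG_OFFSET_ZERO;
  if (len == pe.length) flags |= FLAG_LEN_EQ_PEXTENT;
  out.push_back(char(flags));
  if (!(flags & FLAG_OFFSET_ZERO))    put_varint(out, blob_off);
  if (!(flags & FLAG_LEN_EQ_PEXTENT)) put_varint(out, len);
  put_varint(out, pe.offset);
  put_varint(out, pe.length);
}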


Varada

>> BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff
>> suggests that it's would relatively trivial to ensure that bloom-filters
>> properly ignore the "offset" portion of an lextent key. Meaning that I
>> believe that a "lower_bound" operator ought to be relatively easy to
>> implement without triggering the overhead of a snapshot, again, provided
>> that you provided a custom bloom-filter implementation that did the
>> right thing.
> Hmm, that might help.  It may be that this isn't super significant,
> though, too... as I mentioned there is no snapshot involved with a vanilla
> iterator.
>
>> So the question is do we want to avoid monkeying with RocksDB and go
>> with a binning approach (TBD ensuring that this is a real solution to
>> the problem) OR do we bite the bullet and solve the lower-bound lookup
>> problem?
>>
>> BTW, on the "binning" scheme, perhaps the solution is just put the
>> current "binning value" [presumeably some log2 thing] into the oNode
>> itself -- it'll just be a byte. Then you're only stuck with the
>> complexity of deciding when to do a "split" if the current bin has
>> gotten too large (some arbitrary limit on size of the encoded
>> lexent-bin)
> I think simple is better, but it's annoying because one big bin means
> splitting all other bins, and we don't know when to merge without
> seeing all bin sizes.
>
> sage
>
>
>>> sage
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>>>> Sent: Sunday, August 21, 2016 10:27 AM
>>>>> To: Sage Weil <sweil@redhat.com>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: Re: bluestore blobs REVISITED
>>>>>
>>>>> I wonder how hard it would be to add a "lower-bound" fetch like stl.
>>>>> That would allow the kv store to do the fetch without incurring the
>>>>> overhead of a snapshot for the iteration scan.
>>>>>
>>>>> Shared blobs were always going to trigger an extra kv fetch no matter
>>> what.
>>>>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>>>>
>>>>>> On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>>>
>>>>>>> On Sat, 20 Aug 2016, Allen Samuels wrote:
>>>>>>> I have another proposal (it just occurred to me, so it might not
>>>>>>> survive more scrutiny).
>>>>>>>
>>>>>>> Yes, we should remove the blob-map from the oNode.
>>>>>>>
>>>>>>> But I believe we should also remove the lextent map from the
>>>>>>> oNode and make each lextent be an independent KV value.
>>>>>>>
>>>>>>> However, in the special case where each extent --exactly-- maps
>>>>>>> onto a blob AND the blob is not referenced by any other extent
>>>>>>> (which is the typical case, unless you're doing compression with
>>>>>>> strange-ish
>>>>>>> overlaps)
>>>>>>> -- then you encode the blob in the lextent itself and there's no
>>>>>>> separate blob entry.
>>>>>>>
>>>>>>> This is pretty much exactly the same number of KV fetches as what
>>>>>>> you proposed before when the blob isn't shared (the typical case)
>>>>>>> -- except the oNode is MUCH MUCH smaller now.
>>>>>> I think this approach makes a lot of sense!  The only thing I'm
>>>>>> worried about is that the lextent keys are no longer known when
>>>>>> they are being fetched (since they will be a logical offset),
>>>>>> which means we'll have to use an iterator instead of a simple get.
>>>>>> The former is quite a bit slower than the latter (which can make
>>>>>> use of the rocksdb caches and/or key bloom filters more easily).
>>>>>>
>>>>>> We could augment your approach by keeping *just* the lextent
>>>>>> offsets in the onode, so that we know exactly which lextent key to
>>>>>> fetch, but then I'm not sure we'll get much benefit (lextent
>>>>>> metadata size goes down by ~1/3, but then we have an extra get for
>>> cloned objects).
>>>>>> Hmm, the other thing to keep in mind is that for RBD the common
>>>>>> case is that lots of objects have clones, and many of those
>>>>>> objects' blobs will be shared.
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>> So for the non-shared case, you fetch the oNode which is
>>>>>>> dominated by the xattrs now (so figure a couple of hundred bytes
>>>>>>> and not much CPU cost to deserialize). And then fetch from the KV
>>>>>>> for the lextent (which is 1 fetch -- unless it overlaps two
>>>>>>> previous lextents). If it's the optimal case, the KV fetch is
>>>>>>> small (10-15 bytes) and trivial to deserialize. If it's an
>>>>>>> unshared/local blob then you're ready to go. If the blob is
>>>>>>> shared (locally or globally) then you'll have to go fetch that one too.
>>>>>>>
>>>>>>> This might lead to the elimination of the local/global blob thing
>>>>>>> (I think you've talked about that before) as now the only "local"
>>>>>>> blobs are the unshared single extent blobs which are stored
>>>>>>> inline with the lextent entry. You'll still have the special
>>>>>>> cases of promoting unshared
>>>>>>> (inline) blobs to global blobs -- which is probably similar to
>>>>>>> the current promotion "magic" on a clone operation.
>>>>>>>
>>>>>>> The current refmap concept may require some additional work. I
>>>>>>> believe that we'll have to do a reconstruction of the refmap, but
>>>>>>> fortunately only for the range of the current I/O. That will be a
>>>>>>> bit more expensive, but still less expensive than reconstructing
>>>>>>> the entire refmap for every oNode deserialization, Fortunately I
>>>>>>> believe the refmap is only really needed for compression cases or
>>>>>>> RBD cases
>>>>> without "trim"
>>>>>>> (this is the case to optimize -- it'll make trim really important
>>>>>>> for performance).
>>>>>>>
>>>>>>> Best of both worlds????
>>>>>>>
>>>>>>> Allen Samuels
>>>>>>> SanDisk |a Western Digital brand
>>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>>>>>>> Sent: Friday, August 19, 2016 7:16 AM
>>>>>>>> To: Sage Weil <sweil@redhat.com>
>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>> Subject: RE: bluestore blobs
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>> Sent: Friday, August 19, 2016 6:53 AM
>>>>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>> Subject: RE: bluestore blobs
>>>>>>>>>
>>>>>>>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>>>> Sent: Thursday, August 18, 2016 8:10 AM
>>>>>>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>>> Subject: RE: bluestore blobs
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
>>>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
>>> devel-
>>>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>>>>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
>>>>>>>>>>>>> To: ceph-devel@vger.kernel.org
>>>>>>>>>>>>> Subject: bluestore blobs
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we need to look at other changes in addition to the
>>>>>>>>>>>>> encoding performance improvements.  Even if they end up
>>>>>>>>>>>>> being good enough, these changes are somewhat orthogonal
>>>>>>>>>>>>> and at
>>>>> least
>>>>>>>>>>>>> one of them should give us something that is even faster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. I mentioned this before, but we should keep the encoding
>>>>>>>>>>>>> bluestore_blob_t around when we load the blob map.  If it's
>>>>>>>>>>>>> not changed, don't reencode it.  There are no blockers for
>>>>>>>>>>>>> implementing this
>>>>>>>>>>> currently.
>>>>>>>>>>>>> It may be difficult to ensure the blobs are properly marked
>>> dirty...
>>>>>>>>>>>>> I'll see if we can use proper accessors for the blob to
>>>>>>>>>>>>> enforce this at compile time.  We should do that anyway.
>>>>>>>>>>>> If it's not changed, then why are we re-writing it? I'm
>>>>>>>>>>>> having a hard time thinking of a case worth optimizing where
>>>>>>>>>>>> I want to re-write the oNode but the blob_map is unchanged.
>>>>>>>>>>>> Am I missing
>>>>>>>>> something obvious?
>>>>>>>>>>> An onode's blob_map might have 300 blobs, and a single write
>>>>>>>>>>> only updates one of them.  The other 299 blobs need not be
>>>>>>>>>>> reencoded, just
>>>>>>>>> memcpy'd.
>>>>>>>>>> As long as we're just appending that's a good optimization.
>>>>>>>>>> How often does that happen? It's certainly not going to help
>>>>>>>>>> the RBD 4K random write problem.
>>>>>>>>> It won't help the (l)extent_map encoding, but it avoids almost
>>>>>>>>> all of the blob reencoding.  A 4k random write will update one
>>>>>>>>> blob out of
>>>>>>>>> ~100 (or whatever it is).
>>>>>>>>>
>>>>>>>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy
>>> stages:
>>>>>>>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each
>>>>>>>>>>>>> untouched
>>>>>>>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
>>>>>>>>>>>>> somewhere inside rocksdb to copy this into the write buffer.
>>>>>>>>>>>>> We could extend the rocksdb interface to take an iovec so
>>>>>>>>>>>>> that the first memcpy isn't needed (and rocksdb will
>>>>>>>>>>>>> instead iterate over our buffers and copy them directly into its
>>> write buffer).
>>>>>>>>>>>>> This is probably a pretty small piece of the overall time...
>>>>>>>>>>>>> should verify with a profiler
>>>>>>>>>>> before investing too much effort here.
>>>>>>>>>>>> I doubt it's the memcpy that's really the expensive part.
>>>>>>>>>>>> I'll bet it's that we're transcoding from an internal to an
>>>>>>>>>>>> external representation on an element by element basis. If
>>>>>>>>>>>> the iovec scheme is going to help, it presumes that the
>>>>>>>>>>>> internal data structure essentially matches the external
>>>>>>>>>>>> data structure so that only an iovec copy is required. I'm
>>>>>>>>>>>> wondering how compatible this is with the current concepts
>>>>>>>>>>>> of
>>>>> lextext/blob/pextent.
>>>>>>>>>>> I'm thinking of the xattr case (we have a bunch of strings to
>>>>>>>>>>> copy
>>>>>>>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case:
>>>>>>>>>>> instead of memcpy'ing them into a big contiguous buffer and
>>>>>>>>>>> having rocksdb memcpy
>>>>>>>>>>> *that* into it's larger buffer, give rocksdb an iovec so that
>>>>>>>>>>> they smaller buffers are assembled only once.
>>>>>>>>>>>
>>>>>>>>>>> These buffers will be on the order of many 10s to a couple
>>>>>>>>>>> 100s of
>>>>>>>> bytes.
>>>>>>>>>>> I'm not sure where the crossover point for constructing and
>>>>>>>>>>> then traversing an iovec vs just copying twice would be...
>>>>>>>>>> Yes this will eliminate the "extra" copy, but the real problem
>>>>>>>>>> is that the oNode itself is just too large. I doubt removing
>>>>>>>>>> one extra copy is going to suddenly "solve" this problem. I
>>>>>>>>>> think we're going to end up rejiggering things so that this
>>>>>>>>>> will be much less of a problem than it is now -- time will tell.
>>>>>>>>> Yeah, leaving this one for last I think... until we see memcpy
>>>>>>>>> show up in the profile.
>>>>>>>>>
>>>>>>>>>>>>> 3. Even if we do the above, we're still setting a big (~4k
>>>>>>>>>>>>> or
>>>>>>>>>>>>> more?) key into rocksdb every time we touch an object, even
>>>>>>>>>>>>> when a tiny
>>>>>>>>>> See my analysis, you're looking at 8-10K for the RBD random
>>>>>>>>>> write case
>>>>>>>>>> -- which I think everybody cares a lot about.
>>>>>>>>>>
>>>>>>>>>>>>> amount of metadata is getting changed.  This is a
>>>>>>>>>>>>> consequence of embedding all of the blobs into the onode
>>>>>>>>>>>>> (or bnode).  That seemed like a good idea early on when
>>>>>>>>>>>>> they were tiny (i.e., just an extent), but now I'm not so
>>>>>>>>>>>>> sure.  I see a couple of different
>>>>>>>>> options:
>>>>>>>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we load
>>>>>>>>>>>>> the onode, load the blobs too.  They will hopefully be
>>>>>>>>>>>>> sequential in rocksdb (or definitely sequential in zs).
>>>>>>>>>>>>> Probably go back to using an
>>>>>>>>> iterator.
>>>>>>>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids
>>>>>>>>>>>>> so that they are unique for any given hash value.  Then
>>>>>>>>>>>>> store the blobs as $shard.$poolid.$hash.$blobid (i.e.,
>>>>>>>>>>>>> where the bnode is now).  Then when clone happens there is
>>>>>>>>>>>>> no onode->bnode migration magic happening--we've already
>>>>>>>>>>>>> committed to storing blobs in separate keys.  When we load
>>>>>>>>>>>>> the onode, keep the conditional bnode loading we already
>>>>>>>>>>>>> have.. but when the bnode is loaded load up all the blobs
>>>>>>>>>>>>> for the hash key.  (Okay, we could fault in blobs
>>>>>>>>>>>>> individually, but that code will be more
>>>>>>>>>>>>> complicated.)
>>>>>>>>>> I like this direction. I think you'll still end up demand
>>>>>>>>>> loading the blobs in order to speed up the random read case.
>>>>>>>>>> This scheme will result in some space-amplification, both in
>>>>>>>>>> the lextent and in the blob-map, it's worth a bit of study too
>>>>>>>>>> see how bad the metadata/data ratio becomes (just as a guess,
>>>>>>>>>> $shard.$poolid.$hash.$blobid is probably 16 +
>>>>>>>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each
>>>>>>>>>> Blob
>>>>>>>>>> -- unless your KV store does path compression. My reading of
>>>>>>>>>> RocksDB sst file seems to indicate that it doesn't, I
>>>>>>>>>> *believe* that ZS does [need to confirm]). I'm wondering if
>>>>>>>>>> the current notion
>>>>> of local vs.
>>>>>>>>>> global blobs isn't actually beneficial in that we can give
>>>>>>>>>> local blobs different names that sort with their associated
>>>>>>>>>> oNode (which probably makes the space-amp worse) which is an
>>>>>>>>>> important optimization. We do need to watch the space amp,
>>>>>>>>>> we're going to be burning DRAM to make KV accesses cheap and
>>>>>>>>>> the amount of DRAM
>>>>> is
>>>>>>>> proportional to the space amp.
>>>>>>>>> I got this mostly working last night... just need to sort out
>>>>>>>>> the clone case (and clean up a bunch of code).  It was a
>>>>>>>>> relatively painless transition to make, although in its current
>>>>>>>>> form the blobs all belong to the bnode, and the bnode if
>>>>>>>>> ephemeral but remains in
>>>>>>>> memory until all referencing onodes go away.
>>>>>>>>> Mostly fine, except it means that odd combinations of clone
>>>>>>>>> could leave lots of blobs in cache that don't get trimmed.
>>>>>>>>> Will address that
>>>>> later.
>>>>>>>>> I'll try to finish it up this morning and get it passing tests and posted.
>>>>>>>>>
>>>>>>>>>>>>> In both these cases, a write will dirty the onode (which is
>>>>>>>>>>>>> back to being pretty small.. just xattrs and the lextent
>>>>>>>>>>>>> map) and 1-3 blobs (also
>>>>>>>>>>> now small keys).
>>>>>>>>>> I'm not sure the oNode is going to be that small. Looking at
>>>>>>>>>> the RBD random 4K write case, you're going to have 1K entries
>>>>>>>>>> each of which has an offset, size and a blob-id reference in
>>>>>>>>>> them. In my current oNode compression scheme this compresses
>>>>>>>>>> to about 1 byte
>>>>> per entry.
>>>>>>>>>> However, this optimization relies on being able to cheaply
>>>>>>>>>> renumber the blob-ids, which is no longer possible when the
>>>>>>>>>> blob-ids become parts of a key (see above). So now you'll have
>>>>>>>>>> a minimum of 1.5-3 bytes extra for each blob-id (because you
>>>>>>>>>> can't assume that the blob-ids
>>>>>>>>> become "dense"
>>>>>>>>>> anymore) So you're looking at 2.5-4 bytes per entry or about
>>>>>>>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable
>>>>>>>>>> length encoding you'll have to scan the entire table to
>>>>>>>>>> deserialize it (yes, we could do differential editing when we
>>>>>>>>>> write but that's another
>>>>>>>> discussion).
>>>>>>>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :).
>>>>>>>>>> So while this looks small compared to the current ~30K for the
>>>>>>>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over
>>>>>>>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I
>>>>> published earlier.
>>>>>>>>>> If we want to do better we will need to separate the lextent
>>>>>>>>>> from the oNode also. It's relatively easy to move the lextents
>>>>>>>>>> into the KV store itself (there are two obvious ways to deal
>>>>>>>>>> with this, either use the native offset/size from the lextent
>>>>>>>>>> itself OR create 'N' buckets of logical offset into which we
>>>>>>>>>> pour entries -- both of these would add somewhere between 1
>>>>>>>>>> and 2 KV look-ups
>>>>> per
>>>>>>>>>> operation
>>>>>>>>>> -- here is where an iterator would probably help.
>>>>>>>>>>
>>>>>>>>>> Unfortunately, if you only process a portion of the lextent
>>>>>>>>>> (because you've made it into multiple keys and you don't want
>>>>>>>>>> to load all of
>>>>>>>>>> them) you no longer can re-generate the refmap on the fly
>>>>>>>>>> (another key space optimization). The lack of refmap screws up
>>>>>>>>>> a number of other important algorithms -- for example the
>>>>>>>>>> overlapping blob-map
>>>>>>>> thing, etc.
>>>>>>>>>> Not sure if these are easy to rewrite or not -- too
>>>>>>>>>> complicated to think about at this hour of the evening.
>>>>>>>>> Yeah, I forgot about the extent_map and how big it gets.  I
>>>>>>>>> think, though, that if we can get a 4mb object with 1024 4k
>>>>>>>>> lextents to encode the whole onode and extent_map in under 4K
>>>>>>>>> that will be good enough.  The blob update that goes with it
>>>>>>>>> will be ~200 bytes, and benchmarks aside, the 4k random write
>>>>>>>>> 100% fragmented object is a worst
>>>>>>>> case.
>>>>>>>>
>>>>>>>> Yes, it's a worst-case. But it's a
>>>>>>>> "worst-case-that-everybody-looks-at" vs. a
>>>>>>>> "worst-case-that-almost-
>>>>> nobody-looks-at".
>>>>>>>> I'm still concerned about having an oNode that's larger than a 4K
>>> block.
>>>>>>>>
>>>>>>>>> Anyway, I'll get the blob separation branch working and we can
>>>>>>>>> go from there...
>>>>>>>>>
>>>>>>>>> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-22 23:09             ` Allen Samuels
@ 2016-08-23 16:02               ` Sage Weil
  2016-08-23 16:44                 ` Mark Nelson
  2016-08-23 16:46                 ` Allen Samuels
  0 siblings, 2 replies; 29+ messages in thread
From: Sage Weil @ 2016-08-23 16:02 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

I just got the onode, including the full lextent map, down to ~1500 bytes.  
The lextent map encoding optimizations went something like this:

- 1 blobid bit to indicate that this lextent starts where the last one 
ended. (6500->5500)

- 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate length is 
same as previous lextent.  (5500->3500)

- make blobid signed (1 bit) and delta encode relative to previous blob.  
(3500->1500).  In practice we'd get something between 1500 and 3500 
because blobids won't have as much temporal locality as my test workload.  
OTOH, this is really important because blobids will also get big over time 
(we don't (yet?) have a way to reuse blobids and/or keep them unique to a 
hash key, so they grow as the osd ages).

https://github.com/liewegas/ceph/blob/wip-bluestore-blobwise/src/os/bluestore/bluestore_types.cc#L826
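
For anyone skimming, here is a minimal sketch of the per-entry encoding
described above (made-up struct and helper names, simplified relative to the
code behind that link): the three flag bits ride in the low bits of the
blobid varint, and anything a flag implies is simply omitted.

#include <cstdint>
#include <string>
#include <vector>

struct LExtent {            // one logical extent entry
  uint64_t logical_offset;  // offset within the object
  uint64_t blob_offset;     // offset within the blob
  uint64_t length;
  int64_t  blobid;
};

static void put_varint(std::string& out, uint64_t v) {
  do {
    uint8_t b = v & 0x7f;
    v >>= 7;
    if (v) b |= 0x80;
    out.push_back((char)b);
  } while (v);
}

// zigzag-encode a signed delta so small negative deltas stay small
static uint64_t zigzag(int64_t v) {
  return ((uint64_t)v << 1) ^ (uint64_t)(v >> 63);
}

enum {
  FLAG_CONTIGUOUS  = 1,  // starts where the previous lextent ended
  FLAG_ZERO_OFFSET = 2,  // blob_offset == 0
  FLAG_SAME_LENGTH = 4,  // length == previous lextent's length
};

void encode_lextent_map(const std::vector<LExtent>& map, std::string& out) {
  uint64_t prev_end = 0, prev_len = 0;
  int64_t prev_blobid = 0;
  put_varint(out, map.size());
  for (const auto& e : map) {
    uint64_t flags = 0;
    if (e.logical_offset == prev_end) flags |= FLAG_CONTIGUOUS;
    if (e.blob_offset == 0)           flags |= FLAG_ZERO_OFFSET;
    if (e.length == prev_len)         flags |= FLAG_SAME_LENGTH;
    // flags live in the low 3 bits; the rest is the signed blobid delta
    put_varint(out, (zigzag(e.blobid - prev_blobid) << 3) | flags);
    if (!(flags & FLAG_CONTIGUOUS))  put_varint(out, e.logical_offset);
    if (!(flags & FLAG_ZERO_OFFSET)) put_varint(out, e.blob_offset);
    if (!(flags & FLAG_SAME_LENGTH)) put_varint(out, e.length);
    prev_end = e.logical_offset + e.length;
    prev_len = e.length;
    prev_blobid = e.blobid;
  }
}

Decoding just reverses this, which is where the CPU concern about
re-inflating the whole map comes from (see Allen's comment below).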

This makes the metadata update for a 4k write look like

  1500 byte onode update
  20 byte blob update
  182 byte + 847(*) byte omap updates (pg log)

  * pg _info key... def some room for optimization here I think!
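
Just adding those up for scale: 1500 + 20 + 182 + 847 = ~2550 bytes of KV
update per 4 KiB of client data, i.e. roughly 0.6 bytes of metadata per data
byte, before any rocksdb WAL/compaction amplification.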

In any case, this is pretty encouraging.  I think we have a few options:

1) keep extent map in onode and (re)encode fully each time (what I have 
now).  blobs live in their own keys.

2) keep extent map in onode but shard it in memory and only reencode the 
part(s) that get modified.  this will alleviate the cpu concerns around a 
more complex encoding.  no change to on-disk format.

3) join lextents and blobs (Allen's proposal) and dynamically bin based on 
the encoded size.

4) #3, but let shard_size=0 in onode (or whatever) put it inline with 
onode, so that simple objects avoid any additional kv op.

I currently like #2 for its simplicity, but I suspect we'll need to try 
#3/#4 too.
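
To make #2 a bit more concrete, here is a rough sketch of the in-memory
sharding idea (hypothetical type names, not what would actually land in
BlueStore): each shard covers a fixed logical range, caches its own encoded
bytes, and only dirty shards get re-encoded when the onode is written back.

#include <cstdint>
#include <string>
#include <vector>

struct LExtent { uint64_t logical_offset, blob_offset, length; int64_t blobid; };

struct ExtentShard {
  std::vector<LExtent> extents;  // lextents whose logical_offset falls here
  std::string encoded;           // cached encoding of this shard
  bool dirty = false;
};

class ShardedExtentMap {
  uint64_t shard_size;              // logical bytes per shard, e.g. 128K
  std::vector<ExtentShard> shards;

  static void encode_shard(ExtentShard& sh) {
    // stand-in for the real (delta) encoding: fixed-width records
    sh.encoded.clear();
    for (const auto& e : sh.extents)
      sh.encoded.append(reinterpret_cast<const char*>(&e), sizeof(e));
    sh.dirty = false;
  }

public:
  ShardedExtentMap(uint64_t object_size, uint64_t shard_bytes)
    : shard_size(shard_bytes),
      shards((object_size + shard_bytes - 1) / shard_bytes) {}

  void update(const LExtent& e) {    // a write touched this lextent
    ExtentShard& sh = shards[e.logical_offset / shard_size];
    sh.extents.push_back(e);         // (real code would replace overlaps)
    sh.dirty = true;
  }

  // Only dirty shards pay the encode cost; clean shards reuse their cached
  // bytes.  The concatenation is still written as one onode value, so the
  // on-disk format is unchanged.
  std::string encode_full() {
    std::string out;
    for (auto& sh : shards) {
      if (sh.dirty) encode_shard(sh);
      out += sh.encoded;
    }
    return out;
  }
};

A 4k overwrite then dirties a single shard and re-encodes maybe 1/32nd of the
map instead of all of it.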

sage


On Mon, 22 Aug 2016, Allen Samuels wrote:

> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Monday, August 22, 2016 6:55 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs REVISITED
> > 
> > On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Monday, August 22, 2016 6:09 PM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: RE: bluestore blobs REVISITED
> > > >
> > > > On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > > > Another possibility is to "bin" the lextent table into known,
> > > > > fixed, offset ranges. Suppose each oNode had a fixed range of LBA
> > > > > keys associated with the lextent table: say [0..128K), [128K..256K), ...
> > > >
> > > > Yeah, I think that's the way to do it.  Either a set<uint32_t>
> > > > lextent_key_offsets or uint64_t lextent_map_chunk_size to specify
> > > > the granularity.
> > >
> > > Need to do some actual estimates on this scheme to make sure we're
> > > actually landing on a real solution and not just another band-aid that
> > > we have to rip off (painfully) at some future time.
> > 
> > Yeah
> > 
> > > > > It eliminates the need to put a "lower_bound" into the KV Store
> > > > > directly. Though it'll likely be a bit more complex above and
> > > > > somewhat less efficient.
> > > >
> > > > FWIW this is basically what the iterator does, I think.  It's a
> > > > separate rocksdb operation to create a snapshot to iterate over (and
> > > > we don't rely on that feature anywhere).  It's still more expensive
> > > > than raw gets just because it has to have fingers in all the levels
> > > > to ensure that it finds all the keys for the given range, while get
> > > > can stop once it finds a key or a tombstone (either in a cache or a higher
> > level of the tree).
> > > >
> > > > But I'm not super thrilled about this complexity.  I still hope
> > > > (wishfully?) we can get the lextent map small enough that we can
> > > > leave it in the onode.  Otherwise we really are going to net +1 kv
> > > > fetch for every operation (3, in the clone case... onode, then lextent,
> > then shared blob).
> > >
> > > I don't see the feasibility of leaving the lextent map in the oNode.
> > > It's just too big for the 4K random write case. I know that's not
> > > indicative of real-world usage. But it's what people use to measure
> > > systems....
> > 
> > How small do you think it would need to be to be acceptable (in the worst-
> > case, 1024 4k lextents in a 4m object)?  2k?  3k?  1k?
> > 
> > You're probably right, but I think I can still cut the full map encoding further.
> > It was 6500, I did a quick tweak to get it to 5500, and I think I can drop another
> > 1-2 bytes per entry for common cases (offset of 0, length == previous lextent
> > length), which would get it under the 4k mark.
> > 
> 
> I don't think there's a hard number on the size, but the fancy differential encoding is going to really chew up CPU time re-inflating it. All for the purpose of doing a lower-bound on the inflated data. 
>  
> > > BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff
> > > suggests that it's would relatively trivial to ensure that
> > > bloom-filters properly ignore the "offset" portion of an lextent key.
> > > Meaning that I believe that a "lower_bound" operator ought to be
> > > relatively easy to implement without triggering the overhead of a
> > > snapshot, again, provided that you provided a custom bloom-filter
> > > implementation that did the right thing.
> > 
> > Hmm, that might help.  It may be that this isn't super significant, though,
> > too... as I mentioned there is no snapshot involved with a vanilla iterator.
> 
> Technically correct; however, the iterator is stated as providing a consistent view of the data, meaning that it has some kind of micro-snapshot equivalent associated with it -- which I suspect is where the extra expense comes in (it must bump an internal ref-count on the .sst files in the range of the iterator, as well as do some kind of fiddling with the memtable). The lower_bound that I'm talking about ought not be any more expensive than a regular get would be (assuming the correct bloom filter behavior); it's just a tweak of the existing search logic for a regular Get operation (though I suspect there are some edge cases where you're exactly on the end of an .sst range, blah blah blah -- perhaps by providing only one of lower_bound or upper_bound (but not both) that extra edge case might be eliminated).
> 
> > 
> > > So the question is do we want to avoid monkeying with RocksDB and go
> > > with a binning approach (TBD ensuring that this is a real solution to
> > > the problem) OR do we bite the bullet and solve the lower-bound lookup
> > > problem?
> > >
> > > BTW, on the "binning" scheme, perhaps the solution is just put the
> > > current "binning value" [presumably some log2 thing] into the oNode
> > > itself -- it'll just be a byte. Then you're only stuck with the
> > > complexity of deciding when to do a "split" if the current bin has
> > > gotten too large (some arbitrary limit on size of the encoded
> > > lexent-bin)
> > 
> > I think simple is better, but it's annoying because one big bin means splitting
> > all other bins, and we don't know when to merge without seeing all bin sizes.
> 
> I was thinking of something simple. One binning value for the entire oNode. If you split, you have to read all of the lextents, re-bin them and re-write them -- which shouldn't be --that-- difficult to do.... Maybe we just ignore the un-split case [do it later ?].
>  
> > 
> > sage
> > 
> > 
> > >
> > > >
> > > > sage
> > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > > Sent: Sunday, August 21, 2016 10:27 AM
> > > > > > To: Sage Weil <sweil@redhat.com>
> > > > > > Cc: ceph-devel@vger.kernel.org
> > > > > > Subject: Re: bluestore blobs REVISITED
> > > > > >
> > > > > > I wonder how hard it would be to add a "lower-bound" fetch like stl.
> > > > > > That would allow the kv store to do the fetch without incurring
> > > > > > the overhead of a snapshot for the iteration scan.
> > > > > >
> > > > > > Shared blobs were always going to trigger an extra kv fetch no
> > > > > > matter
> > > > what.
> > > > > >
> > > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > > >
> > > > > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com>
> > wrote:
> > > > > > >
> > > > > > >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> > > > > > >> I have another proposal (it just occurred to me, so it might
> > > > > > >> not survive more scrutiny).
> > > > > > >>
> > > > > > >> Yes, we should remove the blob-map from the oNode.
> > > > > > >>
> > > > > > >> But I believe we should also remove the lextent map from the
> > > > > > >> oNode and make each lextent be an independent KV value.
> > > > > > >>
> > > > > > >> However, in the special case where each extent --exactly--
> > > > > > >> maps onto a blob AND the blob is not referenced by any other
> > > > > > >> extent (which is the typical case, unless you're doing
> > > > > > >> compression with strange-ish
> > > > > > >> overlaps)
> > > > > > >> -- then you encode the blob in the lextent itself and there's
> > > > > > >> no separate blob entry.
> > > > > > >>
> > > > > > >> This is pretty much exactly the same number of KV fetches as
> > > > > > >> what you proposed before when the blob isn't shared (the
> > > > > > >> typical case)
> > > > > > >> -- except the oNode is MUCH MUCH smaller now.
> > > > > > >
> > > > > > > I think this approach makes a lot of sense!  The only thing
> > > > > > > I'm worried about is that the lextent keys are no longer known
> > > > > > > when they are being fetched (since they will be a logical
> > > > > > > offset), which means we'll have to use an iterator instead of a
> > simple get.
> > > > > > > The former is quite a bit slower than the latter (which can
> > > > > > > make use of the rocksdb caches and/or key bloom filters more
> > easily).
> > > > > > >
> > > > > > > We could augment your approach by keeping *just* the lextent
> > > > > > > offsets in the onode, so that we know exactly which lextent
> > > > > > > key to fetch, but then I'm not sure we'll get much benefit
> > > > > > > (lextent metadata size goes down by ~1/3, but then we have an
> > > > > > > extra get for
> > > > cloned objects).
> > > > > > >
> > > > > > > Hmm, the other thing to keep in mind is that for RBD the
> > > > > > > common case is that lots of objects have clones, and many of
> > > > > > > those objects' blobs will be shared.
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > >> So for the non-shared case, you fetch the oNode which is
> > > > > > >> dominated by the xattrs now (so figure a couple of hundred
> > > > > > >> bytes and not much CPU cost to deserialize). And then fetch
> > > > > > >> from the KV for the lextent (which is 1 fetch -- unless it
> > > > > > >> overlaps two previous lextents). If it's the optimal case,
> > > > > > >> the KV fetch is small (10-15 bytes) and trivial to
> > > > > > >> deserialize. If it's an unshared/local blob then you're ready
> > > > > > >> to go. If the blob is shared (locally or globally) then you'll have to go
> > fetch that one too.
> > > > > > >>
> > > > > > >> This might lead to the elimination of the local/global blob
> > > > > > >> thing (I think you've talked about that before) as now the only
> > "local"
> > > > > > >> blobs are the unshared single extent blobs which are stored
> > > > > > >> inline with the lextent entry. You'll still have the special
> > > > > > >> cases of promoting unshared
> > > > > > >> (inline) blobs to global blobs -- which is probably similar
> > > > > > >> to the current promotion "magic" on a clone operation.
> > > > > > >>
> > > > > > >> The current refmap concept may require some additional work.
> > > > > > >> I believe that we'll have to do a reconstruction of the
> > > > > > >> refmap, but fortunately only for the range of the current
> > > > > > >> I/O. That will be a bit more expensive, but still less
> > > > > > >> expensive than reconstructing the entire refmap for every
> > > > > > >> oNode deserialization, Fortunately I believe the refmap is
> > > > > > >> only really needed for compression cases or RBD cases
> > > > > > without "trim"
> > > > > > >> (this is the case to optimize -- it'll make trim really
> > > > > > >> important for performance).
> > > > > > >>
> > > > > > >> Best of both worlds????
> > > > > > >>
> > > > > > >> Allen Samuels
> > > > > > >> SanDisk |a Western Digital brand
> > > > > > >> 2880 Junction Avenue, Milpitas, CA 95134
> > > > > > >> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > >> allen.samuels@SanDisk.com
> > > > > > >>
> > > > > > >>
> > > > > > >>> -----Original Message-----
> > > > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > >>> owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > > >>> Sent: Friday, August 19, 2016 7:16 AM
> > > > > > >>> To: Sage Weil <sweil@redhat.com>
> > > > > > >>> Cc: ceph-devel@vger.kernel.org
> > > > > > >>> Subject: RE: bluestore blobs
> > > > > > >>>
> > > > > > >>>> -----Original Message-----
> > > > > > >>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > >>>> Sent: Friday, August 19, 2016 6:53 AM
> > > > > > >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > >>>> Cc: ceph-devel@vger.kernel.org
> > > > > > >>>> Subject: RE: bluestore blobs
> > > > > > >>>>
> > > > > > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > > > >>>>>> -----Original Message-----
> > > > > > >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> > > > > > >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > >>>>>> Cc: ceph-devel@vger.kernel.org
> > > > > > >>>>>> Subject: RE: bluestore blobs
> > > > > > >>>>>>
> > > > > > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > > >>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
> > > > devel-
> > > > > > >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > > >>>>>>>> To: ceph-devel@vger.kernel.org
> > > > > > >>>>>>>> Subject: bluestore blobs
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I think we need to look at other changes in addition to
> > > > > > >>>>>>>> the encoding performance improvements.  Even if they
> > > > > > >>>>>>>> end up being good enough, these changes are somewhat
> > > > > > >>>>>>>> orthogonal and at
> > > > > > least
> > > > > > >>>>>>>> one of them should give us something that is even faster.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> 1. I mentioned this before, but we should keep the
> > > > > > >>>>>>>> encoding bluestore_blob_t around when we load the blob
> > > > > > >>>>>>>> map.  If it's not changed, don't reencode it.  There
> > > > > > >>>>>>>> are no blockers for implementing this
> > > > > > >>>>>> currently.
> > > > > > >>>>>>>> It may be difficult to ensure the blobs are properly
> > > > > > >>>>>>>> marked
> > > > dirty...
> > > > > > >>>>>>>> I'll see if we can use proper accessors for the blob to
> > > > > > >>>>>>>> enforce this at compile time.  We should do that anyway.
> > > > > > >>>>>>>
> > > > > > >>>>>>> If it's not changed, then why are we re-writing it? I'm
> > > > > > >>>>>>> having a hard time thinking of a case worth optimizing
> > > > > > >>>>>>> where I want to re-write the oNode but the blob_map is
> > unchanged.
> > > > > > >>>>>>> Am I missing
> > > > > > >>>> something obvious?
> > > > > > >>>>>>
> > > > > > >>>>>> An onode's blob_map might have 300 blobs, and a single
> > > > > > >>>>>> write only updates one of them.  The other 299 blobs need
> > > > > > >>>>>> not be reencoded, just
> > > > > > >>>> memcpy'd.
> > > > > > >>>>>
> > > > > > >>>>> As long as we're just appending that's a good optimization.
> > > > > > >>>>> How often does that happen? It's certainly not going to
> > > > > > >>>>> help the RBD 4K random write problem.
> > > > > > >>>>
> > > > > > >>>> It won't help the (l)extent_map encoding, but it avoids
> > > > > > >>>> almost all of the blob reencoding.  A 4k random write will
> > > > > > >>>> update one blob out of
> > > > > > >>>> ~100 (or whatever it is).
> > > > > > >>>>
> > > > > > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy
> > > > stages:
> > > > > > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to
> > > > > > >>>>>>>> each untouched
> > > > > > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
> > > > > > >>>>>>>> somewhere inside rocksdb to copy this into the write
> > buffer.
> > > > > > >>>>>>>> We could extend the rocksdb interface to take an iovec
> > > > > > >>>>>>>> so that the first memcpy isn't needed (and rocksdb will
> > > > > > >>>>>>>> instead iterate over our buffers and copy them directly
> > > > > > >>>>>>>> into its
> > > > write buffer).
> > > > > > >>>>>>>> This is probably a pretty small piece of the overall time...
> > > > > > >>>>>>>> should verify with a profiler
> > > > > > >>>>>> before investing too much effort here.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I doubt it's the memcpy that's really the expensive part.
> > > > > > >>>>>>> I'll bet it's that we're transcoding from an internal to
> > > > > > >>>>>>> an external representation on an element by element
> > > > > > >>>>>>> basis. If the iovec scheme is going to help, it presumes
> > > > > > >>>>>>> that the internal data structure essentially matches the
> > > > > > >>>>>>> external data structure so that only an iovec copy is
> > > > > > >>>>>>> required. I'm wondering how compatible this is with the
> > > > > > >>>>>>> current concepts of
> > > > > > lextext/blob/pextent.
> > > > > > >>>>>>
> > > > > > >>>>>> I'm thinking of the xattr case (we have a bunch of
> > > > > > >>>>>> strings to copy
> > > > > > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged
> > case:
> > > > > > >>>>>> instead of memcpy'ing them into a big contiguous buffer
> > > > > > >>>>>> and having rocksdb memcpy
> > > > > > >>>>>> *that* into it's larger buffer, give rocksdb an iovec so
> > > > > > >>>>>> that they smaller buffers are assembled only once.
> > > > > > >>>>>>
> > > > > > >>>>>> These buffers will be on the order of many 10s to a
> > > > > > >>>>>> couple 100s of
> > > > > > >>> bytes.
> > > > > > >>>>>> I'm not sure where the crossover point for constructing
> > > > > > >>>>>> and then traversing an iovec vs just copying twice would be...
> > > > > > >>>>>
> > > > > > >>>>> Yes this will eliminate the "extra" copy, but the real
> > > > > > >>>>> problem is that the oNode itself is just too large. I
> > > > > > >>>>> doubt removing one extra copy is going to suddenly "solve"
> > > > > > >>>>> this problem. I think we're going to end up rejiggering
> > > > > > >>>>> things so that this will be much less of a problem than it is now
> > -- time will tell.
> > > > > > >>>>
> > > > > > >>>> Yeah, leaving this one for last I think... until we see
> > > > > > >>>> memcpy show up in the profile.
> > > > > > >>>>
> > > > > > >>>>>>>> 3. Even if we do the above, we're still setting a big
> > > > > > >>>>>>>> (~4k or
> > > > > > >>>>>>>> more?) key into rocksdb every time we touch an object,
> > > > > > >>>>>>>> even when a tiny
> > > > > > >>>>>
> > > > > > >>>>> See my analysis, you're looking at 8-10K for the RBD
> > > > > > >>>>> random write case
> > > > > > >>>>> -- which I think everybody cares a lot about.
> > > > > > >>>>>
> > > > > > >>>>>>>> amount of metadata is getting changed.  This is a
> > > > > > >>>>>>>> consequence of embedding all of the blobs into the
> > > > > > >>>>>>>> onode (or bnode).  That seemed like a good idea early
> > > > > > >>>>>>>> on when they were tiny (i.e., just an extent), but now
> > > > > > >>>>>>>> I'm not so sure.  I see a couple of different
> > > > > > >>>> options:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we
> > > > > > >>>>>>>> load the onode, load the blobs too.  They will
> > > > > > >>>>>>>> hopefully be sequential in rocksdb (or definitely sequential
> > in zs).
> > > > > > >>>>>>>> Probably go back to using an
> > > > > > >>>> iterator.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob
> > > > > > >>>>>>>> ids so that they are unique for any given hash value.
> > > > > > >>>>>>>> Then store the blobs as $shard.$poolid.$hash.$blobid
> > > > > > >>>>>>>> (i.e., where the bnode is now).  Then when clone
> > > > > > >>>>>>>> happens there is no onode->bnode migration magic
> > > > > > >>>>>>>> happening--we've already committed to storing blobs in
> > > > > > >>>>>>>> separate keys.  When we load the onode, keep the
> > > > > > >>>>>>>> conditional bnode loading we already have.. but when
> > > > > > >>>>>>>> the bnode is loaded load up all the blobs for the hash
> > > > > > >>>>>>>> key.  (Okay, we could fault in blobs individually, but
> > > > > > >>>>>>>> that code will be more
> > > > > > >>>>>>>> complicated.)
> > > > > > >>>>>
> > > > > > >>>>> I like this direction. I think you'll still end up demand
> > > > > > >>>>> loading the blobs in order to speed up the random read case.
> > > > > > >>>>> This scheme will result in some space-amplification, both
> > > > > > >>>>> in the lextent and in the blob-map, it's worth a bit of
> > > > > > >>>>> study too see how bad the metadata/data ratio becomes
> > > > > > >>>>> (just as a guess, $shard.$poolid.$hash.$blobid is probably
> > > > > > >>>>> 16 +
> > > > > > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for
> > > > > > >>>>> each Blob
> > > > > > >>>>> -- unless your KV store does path compression. My reading
> > > > > > >>>>> of RocksDB sst file seems to indicate that it doesn't, I
> > > > > > >>>>> *believe* that ZS does [need to confirm]). I'm wondering
> > > > > > >>>>> if the current notion
> > > > > > of local vs.
> > > > > > >>>>> global blobs isn't actually beneficial in that we can give
> > > > > > >>>>> local blobs different names that sort with their
> > > > > > >>>>> associated oNode (which probably makes the space-amp
> > > > > > >>>>> worse) which is an important optimization. We do need to
> > > > > > >>>>> watch the space amp, we're going to be burning DRAM to
> > > > > > >>>>> make KV accesses cheap and the amount of DRAM
> > > > > > is
> > > > > > >>> proportional to the space amp.
> > > > > > >>>>
> > > > > > >>>> I got this mostly working last night... just need to sort
> > > > > > >>>> out the clone case (and clean up a bunch of code).  It was
> > > > > > >>>> a relatively painless transition to make, although in its
> > > > > > >>>> current form the blobs all belong to the bnode, and the
> > > > > > >>>> bnode if ephemeral but remains in
> > > > > > >>> memory until all referencing onodes go away.
> > > > > > >>>> Mostly fine, except it means that odd combinations of clone
> > > > > > >>>> could leave lots of blobs in cache that don't get trimmed.
> > > > > > >>>> Will address that
> > > > > > later.
> > > > > > >>>>
> > > > > > >>>> I'll try to finish it up this morning and get it passing tests and
> > posted.
> > > > > > >>>>
> > > > > > >>>>>>>> In both these cases, a write will dirty the onode
> > > > > > >>>>>>>> (which is back to being pretty small.. just xattrs and
> > > > > > >>>>>>>> the lextent
> > > > > > >>>>>>>> map) and 1-3 blobs (also
> > > > > > >>>>>> now small keys).
> > > > > > >>>>>
> > > > > > >>>>> I'm not sure the oNode is going to be that small. Looking
> > > > > > >>>>> at the RBD random 4K write case, you're going to have 1K
> > > > > > >>>>> entries each of which has an offset, size and a blob-id
> > > > > > >>>>> reference in them. In my current oNode compression scheme
> > > > > > >>>>> this compresses to about 1 byte
> > > > > > per entry.
> > > > > > >>>>> However, this optimization relies on being able to cheaply
> > > > > > >>>>> renumber the blob-ids, which is no longer possible when
> > > > > > >>>>> the blob-ids become parts of a key (see above). So now
> > > > > > >>>>> you'll have a minimum of 1.5-3 bytes extra for each
> > > > > > >>>>> blob-id (because you can't assume that the blob-ids
> > > > > > >>>> become "dense"
> > > > > > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or
> > > > > > >>>>> about 2.5-4K Bytes of lextent table. Worse, because of the
> > > > > > >>>>> variable length encoding you'll have to scan the entire
> > > > > > >>>>> table to deserialize it (yes, we could do differential
> > > > > > >>>>> editing when we write but that's another
> > > > > > >>> discussion).
> > > > > > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs
> > :).
> > > > > > >>>>> So while this looks small compared to the current ~30K for
> > > > > > >>>>> the entire thing oNode/lextent/blobmap, it's NOT a huge
> > > > > > >>>>> gain over 8-10K of the compressed oNode/lextent/blobmap
> > > > > > >>>>> scheme that I
> > > > > > published earlier.
> > > > > > >>>>>
> > > > > > >>>>> If we want to do better we will need to separate the
> > > > > > >>>>> lextent from the oNode also. It's relatively easy to move
> > > > > > >>>>> the lextents into the KV store itself (there are two
> > > > > > >>>>> obvious ways to deal with this, either use the native
> > > > > > >>>>> offset/size from the lextent itself OR create 'N' buckets
> > > > > > >>>>> of logical offset into which we pour entries -- both of
> > > > > > >>>>> these would add somewhere between 1 and 2 KV look-ups
> > > > > > per
> > > > > > >>>>> operation
> > > > > > >>>>> -- here is where an iterator would probably help.
> > > > > > >>>>>
> > > > > > >>>>> Unfortunately, if you only process a portion of the
> > > > > > >>>>> lextent (because you've made it into multiple keys and you
> > > > > > >>>>> don't want to load all of
> > > > > > >>>>> them) you no longer can re-generate the refmap on the fly
> > > > > > >>>>> (another key space optimization). The lack of refmap
> > > > > > >>>>> screws up a number of other important algorithms -- for
> > > > > > >>>>> example the overlapping blob-map
> > > > > > >>> thing, etc.
> > > > > > >>>>> Not sure if these are easy to rewrite or not -- too
> > > > > > >>>>> complicated to think about at this hour of the evening.
> > > > > > >>>>
> > > > > > >>>> Yeah, I forgot about the extent_map and how big it gets.  I
> > > > > > >>>> think, though, that if we can get a 4mb object with 1024 4k
> > > > > > >>>> lextents to encode the whole onode and extent_map in under
> > > > > > >>>> 4K that will be good enough.  The blob update that goes
> > > > > > >>>> with it will be ~200 bytes, and benchmarks aside, the 4k
> > > > > > >>>> random write 100% fragmented object is a worst
> > > > > > >>> case.
> > > > > > >>>
> > > > > > >>> Yes, it's a worst-case. But it's a
> > > > > > >>> "worst-case-that-everybody-looks-at" vs. a
> > > > > > >>> "worst-case-that-almost-
> > > > > > nobody-looks-at".
> > > > > > >>>
> > > > > > >>> I'm still concerned about having an oNode that's larger than
> > > > > > >>> a 4K
> > > > block.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>>
> > > > > > >>>> Anyway, I'll get the blob separation branch working and we
> > > > > > >>>> can go from there...
> > > > > > >>>>
> > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: bluestore blobs REVISITED
  2016-08-23 16:02               ` Sage Weil
@ 2016-08-23 16:44                 ` Mark Nelson
  2016-08-23 16:46                 ` Allen Samuels
  1 sibling, 0 replies; 29+ messages in thread
From: Mark Nelson @ 2016-08-23 16:44 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 08/23/2016 11:02 AM, Sage Weil wrote:
> I just got the onode, including the full lextent map, down to ~1500 bytes.
> The lextent map encoding optimizations went something like this:
>
> - 1 blobid bit to indicate that this lextent starts where the last one
> ended. (6500->5500)
>
> - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate length is
> same as previous lextent.  (5500->3500)
>
> - make blobid signed (1 bit) and delta encode relative to previous blob.
> (3500->1500).  In practice we'd get something between 1500 and 3500
> because blobids won't have as much temporal locality as my test workload.
> OTOH, this is really important because blobids will also get big over time
> (we don't (yet?) have a way to reuse blobids and/or keep them unique to a
> hash key, so they grow as the osd ages).
>
> https://github.com/liewegas/ceph/blob/wip-bluestore-blobwise/src/os/bluestore/bluestore_types.cc#L826
>
> This makes the metadata update for a 4k write look like
>
>   1500 byte onode update
>   20 byte blob update
>   182 byte + 847(*) byte omap updates (pg log)
>
>   * pg _info key... def some room for optimization here I think!
>
> In any case, this is pretty encouraging.  I think we have a few options:
>
> 1) keep extent map in onode and (re)encode fully each time (what I have
> now).  blobs live in their own keys.
>
> 2) keep extent map in onode but shard it in memory and only reencode the
> part(s) that get modified.  this will alleviate the cpu concerns around a
> more complex encoding.  no change to on-disk format.
>
> 3) join lextents and blobs (Allen's proposal) and dynamically bin based on
> the encoded size.
>
> 4) #3, but let shard_size=0 in onode (or whatever) put it inline with
> onode, so that simple objects avoid any additional kv op.
>
> I currently like #2 for its simplicity, but I suspect we'll need to try
> #3/#4 too.

I was gravitating toward #2 right before I read the above line fwiw.  I 
think the idea to try 3/4 is a good one though.  There's enough 
complexity in all of this that I think we're (at some level) just 
guessing until we actually have some rocksdb and perf data to look at 
regarding actual IO activity and CPU usage.

Mark

>
> sage
>
>
> On Mon, 22 Aug 2016, Allen Samuels wrote:
>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Monday, August 22, 2016 6:55 PM
>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: bluestore blobs REVISITED
>>>
>>> On Mon, 22 Aug 2016, Allen Samuels wrote:
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>> Sent: Monday, August 22, 2016 6:09 PM
>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: RE: bluestore blobs REVISITED
>>>>>
>>>>> On Mon, 22 Aug 2016, Allen Samuels wrote:
>>>>>> Another possibility is to "bin" the lextent table into known,
>>>>>> fixed, offset ranges. Suppose each oNode had a fixed range of LBA
>>>>>> keys associated with the lextent table: say [0..128K), [128K..256K), ...
>>>>>
>>>>> Yeah, I think that's the way to do it.  Either a set<uint32_t>
>>>>> lextent_key_offsets or uint64_t lextent_map_chunk_size to specify
>>>>> the granularity.
>>>>
>>>> Need to do some actual estimates on this scheme to make sure we're
>>>> actually landing on a real solution and not just another band-aid that
>>>> we have to rip off (painfully) at some future time.
>>>
>>> Yeah
>>>
>>>>>> It eliminates the need to put a "lower_bound" into the KV Store
>>>>>> directly. Though it'll likely be a bit more complex above and
>>>>>> somewhat less efficient.
>>>>>
>>>>> FWIW this is basically what the iterator does, I think.  It's a
>>>>> separate rocksdb operation to create a snapshot to iterate over (and
>>>>> we don't rely on that feature anywhere).  It's still more expensive
>>>>> than raw gets just because it has to have fingers in all the levels
>>>>> to ensure that it finds all the keys for the given range, while get
>>>>> can stop once it finds a key or a tombstone (either in a cache or a higher
>>> level of the tree).
>>>>>
>>>>> But I'm not super thrilled about this complexity.  I still hope
>>>>> (wishfully?) we can get the lextent map small enough that we can
>>>>> leave it in the onode.  Otherwise we really are going to net +1 kv
>>>>> fetch for every operation (3, in the clone case... onode, then lextent,
>>> then shared blob).
>>>>
>>>> I don't see the feasibility of leaving the lextent map in the oNode.
>>>> It's just too big for the 4K random write case. I know that's not
>>>> indicative of real-world usage. But it's what people use to measure
>>>> systems....
>>>
>>> How small do you think it would need to be to be acceptable (in the worst-
>>> case, 1024 4k lextents in a 4m object)?  2k?  3k?  1k?
>>>
>>> You're probably right, but I think I can still cut the full map encoding further.
>>> It was 6500, I did a quick tweak to get it to 5500, and I think I can drop another
>>> 1-2 bytes per entry for common cases (offset of 0, length == previous lextent
>>> length), which would get it under the 4k mark.
>>>
>>
>> I don't think there's a hard number on the size, but the fancy differential encoding is going to really chew up CPU time re-inflating it. All for the purpose of doing a lower-bound on the inflated data.
>>
>>>> BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff
>>>> suggests that it would be relatively trivial to ensure that
>>>> bloom-filters properly ignore the "offset" portion of an lextent key.
>>>> Meaning that I believe that a "lower_bound" operator ought to be
>>>> relatively easy to implement without triggering the overhead of a
>>>> snapshot, again, provided that you provided a custom bloom-filter
>>>> implementation that did the right thing.
>>>
>>> Hmm, that might help.  It may be that this isn't super significant, though,
>>> too... as I mentioned there is no snapshot involved with a vanilla iterator.
>>
>> Technically correct; however, the iterator is stated as providing a consistent view of the data, meaning that it has some kind of micro-snapshot equivalent associated with it -- which I suspect is where the extra expense comes in (it must bump an internal ref-count on the .sst files in the range of the iterator, as well as do some kind of fiddling with the memtable). The lower_bound that I'm talking about ought not be any more expensive than a regular get would be (assuming the correct bloom filter behavior); it's just a tweak of the existing search logic for a regular Get operation (though I suspect there are some edge cases where you're exactly on the end of an .sst range, blah blah blah -- perhaps by providing only one of lower_bound or upper_bound (but not both) that extra edge case might be eliminated).
>>
>>>
>>>> So the question is do we want to avoid monkeying with RocksDB and go
>>>> with a binning approach (TBD ensuring that this is a real solution to
>>>> the problem) OR do we bite the bullet and solve the lower-bound lookup
>>>> problem?
>>>>
>>>> BTW, on the "binning" scheme, perhaps the solution is just put the
>>>> current "binning value" [presumably some log2 thing] into the oNode
>>>> itself -- it'll just be a byte. Then you're only stuck with the
>>>> complexity of deciding when to do a "split" if the current bin has
>>>> gotten too large (some arbitrary limit on size of the encoded
>>>> lexent-bin)
>>>
>>> I think simple is better, but it's annoying because one big bin means splitting
>>> all other bins, and we don't know when to merge without seeing all bin sizes.
>>
>> I was thinking of something simple. One binning value for the entire oNode. If you split, you have to read all of the lextents, re-bin them and re-write them -- which shouldn't be --that-- difficult to do.... Maybe we just ignore the un-split case [do it later ?].
>>
>>>
>>> sage
>>>
>>>
>>>>
>>>>>
>>>>> sage
>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>>>>>> Sent: Sunday, August 21, 2016 10:27 AM
>>>>>>> To: Sage Weil <sweil@redhat.com>
>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>> Subject: Re: bluestore blobs REVISITED
>>>>>>>
>>>>>>> I wonder how hard it would be to add a "lower-bound" fetch like stl.
>>>>>>> That would allow the kv store to do the fetch without incurring
>>>>>>> the overhead of a snapshot for the iteration scan.
>>>>>>>
>>>>>>> Shared blobs were always going to trigger an extra kv fetch no
>>>>>>> matter
>>>>> what.
>>>>>>>
>>>>>>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>>>>>>
>>>>>>>> On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com>
>>> wrote:
>>>>>>>>
>>>>>>>>> On Sat, 20 Aug 2016, Allen Samuels wrote:
>>>>>>>>> I have another proposal (it just occurred to me, so it might
>>>>>>>>> not survive more scrutiny).
>>>>>>>>>
>>>>>>>>> Yes, we should remove the blob-map from the oNode.
>>>>>>>>>
>>>>>>>>> But I believe we should also remove the lextent map from the
>>>>>>>>> oNode and make each lextent be an independent KV value.
>>>>>>>>>
>>>>>>>>> However, in the special case where each extent --exactly--
>>>>>>>>> maps onto a blob AND the blob is not referenced by any other
>>>>>>>>> extent (which is the typical case, unless you're doing
>>>>>>>>> compression with strange-ish
>>>>>>>>> overlaps)
>>>>>>>>> -- then you encode the blob in the lextent itself and there's
>>>>>>>>> no separate blob entry.
>>>>>>>>>
>>>>>>>>> This is pretty much exactly the same number of KV fetches as
>>>>>>>>> what you proposed before when the blob isn't shared (the
>>>>>>>>> typical case)
>>>>>>>>> -- except the oNode is MUCH MUCH smaller now.
>>>>>>>>
>>>>>>>> I think this approach makes a lot of sense!  The only thing
>>>>>>>> I'm worried about is that the lextent keys are no longer known
>>>>>>>> when they are being fetched (since they will be a logical
>>>>>>>> offset), which means we'll have to use an iterator instead of a
>>> simple get.
>>>>>>>> The former is quite a bit slower than the latter (which can
>>>>>>>> make use of the rocksdb caches and/or key bloom filters more
>>> easily).
>>>>>>>>
>>>>>>>> We could augment your approach by keeping *just* the lextent
>>>>>>>> offsets in the onode, so that we know exactly which lextent
>>>>>>>> key to fetch, but then I'm not sure we'll get much benefit
>>>>>>>> (lextent metadata size goes down by ~1/3, but then we have an
>>>>>>>> extra get for
>>>>> cloned objects).
>>>>>>>>
>>>>>>>> Hmm, the other thing to keep in mind is that for RBD the
>>>>>>>> common case is that lots of objects have clones, and many of
>>>>>>>> those objects' blobs will be shared.
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>>> So for the non-shared case, you fetch the oNode which is
>>>>>>>>> dominated by the xattrs now (so figure a couple of hundred
>>>>>>>>> bytes and not much CPU cost to deserialize). And then fetch
>>>>>>>>> from the KV for the lextent (which is 1 fetch -- unless it
>>>>>>>>> overlaps two previous lextents). If it's the optimal case,
>>>>>>>>> the KV fetch is small (10-15 bytes) and trivial to
>>>>>>>>> deserialize. If it's an unshared/local blob then you're ready
>>>>>>>>> to go. If the blob is shared (locally or globally) then you'll have to go
>>> fetch that one too.
>>>>>>>>>
>>>>>>>>> This might lead to the elimination of the local/global blob
>>>>>>>>> thing (I think you've talked about that before) as now the only
>>> "local"
>>>>>>>>> blobs are the unshared single extent blobs which are stored
>>>>>>>>> inline with the lextent entry. You'll still have the special
>>>>>>>>> cases of promoting unshared
>>>>>>>>> (inline) blobs to global blobs -- which is probably similar
>>>>>>>>> to the current promotion "magic" on a clone operation.
>>>>>>>>>
>>>>>>>>> The current refmap concept may require some additional work.
>>>>>>>>> I believe that we'll have to do a reconstruction of the
>>>>>>>>> refmap, but fortunately only for the range of the current
>>>>>>>>> I/O. That will be a bit more expensive, but still less
>>>>>>>>> expensive than reconstructing the entire refmap for every
>>>>>>>>> oNode deserialization, Fortunately I believe the refmap is
>>>>>>>>> only really needed for compression cases or RBD cases
>>>>>>> without "trim"
>>>>>>>>> (this is the case to optimize -- it'll make trim really
>>>>>>>>> important for performance).
>>>>>>>>>
>>>>>>>>> Best of both worlds????
>>>>>>>>>
>>>>>>>>> Allen Samuels
>>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
>>>>>>>>> allen.samuels@SanDisk.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>>>>>>>>>> Sent: Friday, August 19, 2016 7:16 AM
>>>>>>>>>> To: Sage Weil <sweil@redhat.com>
>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>> Subject: RE: bluestore blobs
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>>>> Sent: Friday, August 19, 2016 6:53 AM
>>>>>>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>>> Subject: RE: bluestore blobs
>>>>>>>>>>>
>>>>>>>>>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>>>>>> Sent: Thursday, August 18, 2016 8:10 AM
>>>>>>>>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>>>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>>>>>>>> Subject: RE: bluestore blobs
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
>>>>>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
>>>>> devel-
>>>>>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>>>>>>>>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
>>>>>>>>>>>>>>> To: ceph-devel@vger.kernel.org
>>>>>>>>>>>>>>> Subject: bluestore blobs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think we need to look at other changes in addition to
>>>>>>>>>>>>>>> the encoding performance improvements.  Even if they
>>>>>>>>>>>>>>> end up being good enough, these changes are somewhat
>>>>>>>>>>>>>>> orthogonal and at
>>>>>>> least
>>>>>>>>>>>>>>> one of them should give us something that is even faster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. I mentioned this before, but we should keep the
>>>>>>>>>>>>>>> encoding bluestore_blob_t around when we load the blob
>>>>>>>>>>>>>>> map.  If it's not changed, don't reencode it.  There
>>>>>>>>>>>>>>> are no blockers for implementing this
>>>>>>>>>>>>> currently.
>>>>>>>>>>>>>>> It may be difficult to ensure the blobs are properly
>>>>>>>>>>>>>>> marked
>>>>> dirty...
>>>>>>>>>>>>>>> I'll see if we can use proper accessors for the blob to
>>>>>>>>>>>>>>> enforce this at compile time.  We should do that anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it's not changed, then why are we re-writing it? I'm
>>>>>>>>>>>>>> having a hard time thinking of a case worth optimizing
>>>>>>>>>>>>>> where I want to re-write the oNode but the blob_map is
>>> unchanged.
>>>>>>>>>>>>>> Am I missing
>>>>>>>>>>> something obvious?
>>>>>>>>>>>>>
>>>>>>>>>>>>> An onode's blob_map might have 300 blobs, and a single
>>>>>>>>>>>>> write only updates one of them.  The other 299 blobs need
>>>>>>>>>>>>> not be reencoded, just
>>>>>>>>>>> memcpy'd.
>>>>>>>>>>>>
>>>>>>>>>>>> As long as we're just appending that's a good optimization.
>>>>>>>>>>>> How often does that happen? It's certainly not going to
>>>>>>>>>>>> help the RBD 4K random write problem.
>>>>>>>>>>>
>>>>>>>>>>> It won't help the (l)extent_map encoding, but it avoids
>>>>>>>>>>> almost all of the blob reencoding.  A 4k random write will
>>>>>>>>>>> update one blob out of
>>>>>>>>>>> ~100 (or whatever it is).
>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy
>>>>> stages:
>>>>>>>>>>>>>>> one to assemble the bufferlist (lots of bufferptrs to
>>>>>>>>>>>>>>> each untouched
>>>>>>>>>>>>>>> blob) into a single rocksdb::Slice, and another memcpy
>>>>>>>>>>>>>>> somewhere inside rocksdb to copy this into the write
>>> buffer.
>>>>>>>>>>>>>>> We could extend the rocksdb interface to take an iovec
>>>>>>>>>>>>>>> so that the first memcpy isn't needed (and rocksdb will
>>>>>>>>>>>>>>> instead iterate over our buffers and copy them directly
>>>>>>>>>>>>>>> into its
>>>>> write buffer).
>>>>>>>>>>>>>>> This is probably a pretty small piece of the overall time...
>>>>>>>>>>>>>>> should verify with a profiler
>>>>>>>>>>>>> before investing too much effort here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I doubt it's the memcpy that's really the expensive part.
>>>>>>>>>>>>>> I'll bet it's that we're transcoding from an internal to
>>>>>>>>>>>>>> an external representation on an element by element
>>>>>>>>>>>>>> basis. If the iovec scheme is going to help, it presumes
>>>>>>>>>>>>>> that the internal data structure essentially matches the
>>>>>>>>>>>>>> external data structure so that only an iovec copy is
>>>>>>>>>>>>>> required. I'm wondering how compatible this is with the
>>>>>>>>>>>>>> current concepts of
>>>>>>> lextext/blob/pextent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm thinking of the xattr case (we have a bunch of
>>>>>>>>>>>>> strings to copy
>>>>>>>>>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged
>>> case:
>>>>>>>>>>>>> instead of memcpy'ing them into a big contiguous buffer
>>>>>>>>>>>>> and having rocksdb memcpy
>>>>>>>>>>>>> *that* into it's larger buffer, give rocksdb an iovec so
>>>>>>>>>>>>> that they smaller buffers are assembled only once.
>>>>>>>>>>>>>
>>>>>>>>>>>>> These buffers will be on the order of many 10s to a
>>>>>>>>>>>>> couple 100s of
>>>>>>>>>> bytes.
>>>>>>>>>>>>> I'm not sure where the crossover point for constructing
>>>>>>>>>>>>> and then traversing an iovec vs just copying twice would be...
>>>>>>>>>>>>
>>>>>>>>>>>> Yes this will eliminate the "extra" copy, but the real
>>>>>>>>>>>> problem is that the oNode itself is just too large. I
>>>>>>>>>>>> doubt removing one extra copy is going to suddenly "solve"
>>>>>>>>>>>> this problem. I think we're going to end up rejiggering
>>>>>>>>>>>> things so that this will be much less of a problem than it is now
>>> -- time will tell.
>>>>>>>>>>>
>>>>>>>>>>> Yeah, leaving this one for last I think... until we see
>>>>>>>>>>> memcpy show up in the profile.
>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. Even if we do the above, we're still setting a big
>>>>>>>>>>>>>>> (~4k or
>>>>>>>>>>>>>>> more?) key into rocksdb every time we touch an object,
>>>>>>>>>>>>>>> even when a tiny
>>>>>>>>>>>>
>>>>>>>>>>>> See my analysis, you're looking at 8-10K for the RBD
>>>>>>>>>>>> random write case
>>>>>>>>>>>> -- which I think everybody cares a lot about.
>>>>>>>>>>>>
>>>>>>>>>>>>>>> amount of metadata is getting changed.  This is a
>>>>>>>>>>>>>>> consequence of embedding all of the blobs into the
>>>>>>>>>>>>>>> onode (or bnode).  That seemed like a good idea early
>>>>>>>>>>>>>>> on when they were tiny (i.e., just an extent), but now
>>>>>>>>>>>>>>> I'm not so sure.  I see a couple of different
>>>>>>>>>>> options:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> a) Store each blob as ($onode_key+$blobid).  When we
>>>>>>>>>>>>>>> load the onode, load the blobs too.  They will
>>>>>>>>>>>>>>> hopefully be sequential in rocksdb (or definitely sequential
>>> in zs).
>>>>>>>>>>>>>>> Probably go back to using an
>>>>>>>>>>> iterator.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> b) Go all in on the "bnode" like concept.  Assign blob
>>>>>>>>>>>>>>> ids so that they are unique for any given hash value.
>>>>>>>>>>>>>>> Then store the blobs as $shard.$poolid.$hash.$blobid
>>>>>>>>>>>>>>> (i.e., where the bnode is now).  Then when clone
>>>>>>>>>>>>>>> happens there is no onode->bnode migration magic
>>>>>>>>>>>>>>> happening--we've already committed to storing blobs in
>>>>>>>>>>>>>>> separate keys.  When we load the onode, keep the
>>>>>>>>>>>>>>> conditional bnode loading we already have.. but when
>>>>>>>>>>>>>>> the bnode is loaded load up all the blobs for the hash
>>>>>>>>>>>>>>> key.  (Okay, we could fault in blobs individually, but
>>>>>>>>>>>>>>> that code will be more
>>>>>>>>>>>>>>> complicated.)
>>>>>>>>>>>>
>>>>>>>>>>>> I like this direction. I think you'll still end up demand
>>>>>>>>>>>> loading the blobs in order to speed up the random read case.
>>>>>>>>>>>> This scheme will result in some space-amplification, both
>>>>>>>>>>>> in the lextent and in the blob-map, it's worth a bit of
>>>>>>>>>>>> study too see how bad the metadata/data ratio becomes
>>>>>>>>>>>> (just as a guess, $shard.$poolid.$hash.$blobid is probably
>>>>>>>>>>>> 16 +
>>>>>>>>>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for
>>>>>>>>>>>> each Blob
>>>>>>>>>>>> -- unless your KV store does path compression. My reading
>>>>>>>>>>>> of RocksDB sst file seems to indicate that it doesn't, I
>>>>>>>>>>>> *believe* that ZS does [need to confirm]). I'm wondering
>>>>>>>>>>>> if the current notion
>>>>>>> of local vs.
>>>>>>>>>>>> global blobs isn't actually beneficial in that we can give
>>>>>>>>>>>> local blobs different names that sort with their
>>>>>>>>>>>> associated oNode (which probably makes the space-amp
>>>>>>>>>>>> worse) which is an important optimization. We do need to
>>>>>>>>>>>> watch the space amp, we're going to be burning DRAM to
>>>>>>>>>>>> make KV accesses cheap and the amount of DRAM
>>>>>>> is
>>>>>>>>>> proportional to the space amp.
>>>>>>>>>>>
>>>>>>>>>>> I got this mostly working last night... just need to sort
>>>>>>>>>>> out the clone case (and clean up a bunch of code).  It was
>>>>>>>>>>> a relatively painless transition to make, although in its
>>>>>>>>>>> current form the blobs all belong to the bnode, and the
>>>>>>>>>>> bnode if ephemeral but remains in
>>>>>>>>>> memory until all referencing onodes go away.
>>>>>>>>>>> Mostly fine, except it means that odd combinations of clone
>>>>>>>>>>> could leave lots of blobs in cache that don't get trimmed.
>>>>>>>>>>> Will address that
>>>>>>> later.
>>>>>>>>>>>
>>>>>>>>>>> I'll try to finish it up this morning and get it passing tests and
>>> posted.
>>>>>>>>>>>
>>>>>>>>>>>>>>> In both these cases, a write will dirty the onode
>>>>>>>>>>>>>>> (which is back to being pretty small.. just xattrs and
>>>>>>>>>>>>>>> the lextent
>>>>>>>>>>>>>>> map) and 1-3 blobs (also
>>>>>>>>>>>>> now small keys).
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure the oNode is going to be that small. Looking
>>>>>>>>>>>> at the RBD random 4K write case, you're going to have 1K
>>>>>>>>>>>> entries each of which has an offset, size and a blob-id
>>>>>>>>>>>> reference in them. In my current oNode compression scheme
>>>>>>>>>>>> this compresses to about 1 byte
>>>>>>> per entry.
>>>>>>>>>>>> However, this optimization relies on being able to cheaply
>>>>>>>>>>>> renumber the blob-ids, which is no longer possible when
>>>>>>>>>>>> the blob-ids become parts of a key (see above). So now
>>>>>>>>>>>> you'll have a minimum of 1.5-3 bytes extra for each
>>>>>>>>>>>> blob-id (because you can't assume that the blob-ids
>>>>>>>>>>> become "dense"
>>>>>>>>>>>> anymore) So you're looking at 2.5-4 bytes per entry or
>>>>>>>>>>>> about 2.5-4K Bytes of lextent table. Worse, because of the
>>>>>>>>>>>> variable length encoding you'll have to scan the entire
>>>>>>>>>>>> table to deserialize it (yes, we could do differential
>>>>>>>>>>>> editing when we write but that's another
>>>>>>>>>> discussion).
>>>>>>>>>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs
>>> :).
>>>>>>>>>>>> So while this looks small compared to the current ~30K for
>>>>>>>>>>>> the entire thing oNode/lextent/blobmap, it's NOT a huge
>>>>>>>>>>>> gain over 8-10K of the compressed oNode/lextent/blobmap
>>>>>>>>>>>> scheme that I
>>>>>>> published earlier.
>>>>>>>>>>>>
>>>>>>>>>>>> If we want to do better we will need to separate the
>>>>>>>>>>>> lextent from the oNode also. It's relatively easy to move
>>>>>>>>>>>> the lextents into the KV store itself (there are two
>>>>>>>>>>>> obvious ways to deal with this, either use the native
>>>>>>>>>>>> offset/size from the lextent itself OR create 'N' buckets
>>>>>>>>>>>> of logical offset into which we pour entries -- both of
>>>>>>>>>>>> these would add somewhere between 1 and 2 KV look-ups
>>>>>>> per
>>>>>>>>>>>> operation
>>>>>>>>>>>> -- here is where an iterator would probably help.
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, if you only process a portion of the
>>>>>>>>>>>> lextent (because you've made it into multiple keys and you
>>>>>>>>>>>> don't want to load all of
>>>>>>>>>>>> them) you no longer can re-generate the refmap on the fly
>>>>>>>>>>>> (another key space optimization). The lack of refmap
>>>>>>>>>>>> screws up a number of other important algorithms -- for
>>>>>>>>>>>> example the overlapping blob-map
>>>>>>>>>> thing, etc.
>>>>>>>>>>>> Not sure if these are easy to rewrite or not -- too
>>>>>>>>>>>> complicated to think about at this hour of the evening.
>>>>>>>>>>>
>>>>>>>>>>> Yeah, I forgot about the extent_map and how big it gets.  I
>>>>>>>>>>> think, though, that if we can get a 4mb object with 1024 4k
>>>>>>>>>>> lextents to encode the whole onode and extent_map in under
>>>>>>>>>>> 4K that will be good enough.  The blob update that goes
>>>>>>>>>>> with it will be ~200 bytes, and benchmarks aside, the 4k
>>>>>>>>>>> random write 100% fragmented object is a worst
>>>>>>>>>> case.
>>>>>>>>>>
>>>>>>>>>> Yes, it's a worst-case. But it's a
>>>>>>>>>> "worst-case-that-everybody-looks-at" vs. a
>>>>>>>>>> "worst-case-that-almost-
>>>>>>> nobody-looks-at".
>>>>>>>>>>
>>>>>>>>>> I'm still concerned about having an oNode that's larger than
>>>>>>>>>> a 4K
>>>>> block.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I'll get the blob separation branch working and we
>>>>>>>>>>> can go from there...
>>>>>>>>>>>
>>>>>>>>>>> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-23 16:02               ` Sage Weil
  2016-08-23 16:44                 ` Mark Nelson
@ 2016-08-23 16:46                 ` Allen Samuels
  2016-08-23 19:39                   ` Sage Weil
  1 sibling, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-23 16:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, August 23, 2016 12:03 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs REVISITED
> 
> I just got the onode, including the full lextent map, down to ~1500 bytes.
> The lextent map encoding optimizations went something like this:
> 
> - 1 blobid bit to indicate that this lextent starts where the last one ended.
> (6500->5500)
> 
> - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate length is same as
> previous lextent.  (5500->3500)
> 
> - make blobid signed (1 bit) and delta encode relative to previous blob.
> (3500->1500).  In practice we'd get something between 1500 and 3500
> because blobids won't have as much temporal locality as my test workload.
> OTOH, this is really important because blobids will also get big over time (we
> don't (yet?) have a way to reuse blobids and/or keep them unique to a hash
> key, so they grow as the osd ages).

This seems fishy to me; my mental model for the blob_id suggests that it must be at least 9 bits in size for a random write workload (1K entries randomly written lead to an average distance of 512, which means 10 bits to encode -- plus the other optimization bits). Meaning that you're going to have two bytes for each lextent, so at least 2048 bytes of lextent plus the remainder of the oNode. So, probably more like 2500 bytes.

Am I missing something?
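
To make the arithmetic concrete, here is a rough sketch of the kind of flag-bit / signed-delta encoding being discussed (illustrative names only, not the actual wip-bluestore-blobwise code); the per-entry size comes down to how many bits the zigzagged blobid delta needs once the three flag bits are folded in:

#include <cstdint>
#include <string>

// Plain LEB128-style varint; one byte per 7 bits of payload.
static void put_varint(std::string& out, uint64_t v) {
  do {
    uint8_t b = v & 0x7f;
    v >>= 7;
    if (v) b |= 0x80;
    out.push_back(static_cast<char>(b));
  } while (v);
}

struct lextent_enc_state {
  uint64_t prev_end = 0;      // logical end of the previous lextent
  uint64_t prev_len = 0;      // length of the previous lextent
  int64_t  prev_blobid = 0;   // previous blobid (for delta encoding)
};

// Encode one lextent: three flags ride in the low bits of the first varint,
// the rest of that varint is the zigzag-encoded blobid delta; offset/length
// fields are only emitted when the corresponding flag is clear.
void encode_lextent(std::string& out, lextent_enc_state& s,
                    uint64_t logical_off, uint64_t blob_off,
                    uint64_t len, int64_t blobid) {
  uint64_t flags = 0;
  if (logical_off == s.prev_end) flags |= 1;   // starts where the last one ended
  if (blob_off == 0)             flags |= 2;   // in-blob offset is zero
  if (len == s.prev_len)         flags |= 4;   // same length as previous lextent
  int64_t delta = blobid - s.prev_blobid;
  uint64_t zz = (static_cast<uint64_t>(delta) << 1) ^
                static_cast<uint64_t>(delta >> 63);   // zigzag the signed delta
  put_varint(out, (zz << 3) | flags);
  if (!(flags & 1)) put_varint(out, logical_off);
  if (!(flags & 2)) put_varint(out, blob_off);
  if (!(flags & 4)) put_varint(out, len);
  s.prev_end = logical_off + len;
  s.prev_len = len;
  s.prev_blobid = blobid;
}

With all three flags set, the entry is just the leading varint: one byte covers deltas of roughly -8..7, two bytes cover roughly -1024..1023, so a random-write pattern with ~512-average deltas needs two bytes per entry -- which is where the 2048+ byte estimate comes from.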

> 
> https://github.com/liewegas/ceph/blob/wip-bluestore-
> blobwise/src/os/bluestore/bluestore_types.cc#L826
> 
> This makes the metadata update for a 4k write look like
> 
>   1500 byte onode update
>   20 byte blob update
>   182 byte + 847(*) byte omap updates (pg log)
> 
>   * pg _info key... def some room for optimization here I think!
> 
> In any case, this is pretty encouraging.  I think we have a few options:
> 
> 1) keep extent map in onode and (re)encode fully each time (what I have
> now).  blobs live in their own keys.
> 
> 2) keep extent map in onode but shard it in memory and only reencode the
> part(s) that get modified.  this will alleviate the cpu concerns around a more
> complex encoding.  no change to on-disk format.
> 

This will save some CPU, but the write-amp is still pretty bad unless you get the entire commit to < 4K bytes on Rocks.

> 3) join lextents and blobs (Allen's proposal) and dynamically bin based on the
> encoded size.
> 
> 4) #3, but let shard_size=0 in onode (or whatever) put it inline with onode, so
> that simple objects avoid any additional kv op.

Yes, I'm still of the mind that this is the best approach. I'm not sure it's really that hard because most of the "special" cases can be dealt with in a common brute-force way (because they don't happen too often).

> 
> I currently like #2 for its simplicity, but I suspect we'll need to try
> #3/#4 too.
> 
> sage
> 
> 
> On Mon, 22 Aug 2016, Allen Samuels wrote:
> 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 6:55 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore blobs REVISITED
> > >
> > > On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Monday, August 22, 2016 6:09 PM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > Cc: ceph-devel@vger.kernel.org
> > > > > Subject: RE: bluestore blobs REVISITED
> > > > >
> > > > > On Mon, 22 Aug 2016, Allen Samuels wrote:
> > > > > > Another possibility is to "bin" the lextent table into known,
> > > > > > fixed, offset ranges. Suppose each oNode had a fixed range of
> > > > > > LBA keys associated with the lextent table: say [0..128K),
> [128K..256K), ...
> > > > >
> > > > > Yeah, I think that's the way to do it.  Either a set<uint32_t>
> > > > > lextent_key_offsets or uint64_t lextent_map_chunk_size to
> > > > > specify the granularity.
> > > >
> > > > Need to do some actual estimates on this scheme to make sure we're
> > > > actually landing on a real solution and not just another band-aid
> > > > that we have to rip off (painfully) at some future time.
> > >
> > > Yeah
> > >
> > > > > > It eliminates the need to put a "lower_bound" into the KV
> > > > > > Store directly. Though it'll likely be a bit more complex
> > > > > > above and somewhat less efficient.
> > > > >
> > > > > FWIW this is basically wat the iterator does, I think.  It's a
> > > > > separate rocksdb operation to create a snapshot to iterate over
> > > > > (and we don't rely on that feature anywhere).  It's still more
> > > > > expensive than raw gets just because it has to have fingers in
> > > > > all the levels to ensure that it finds all the keys for the
> > > > > given range, while get can stop once it finds a key or a
> > > > > tombstone (either in a cache or a higher
> > > level of the tree).
> > > > >
> > > > > But I'm not super thrilled about this complexity.  I still hope
> > > > > (wishfully?) we can get the lextent map small enough that we can
> > > > > leave it in the onode.  Otherwise we really are going to net +1
> > > > > kv fetch for every operation (3, in the clone case... onode,
> > > > > then lextent,
> > > then shared blob).
> > > >
> > > > I don't see the feasibility of leaving the lextent map in the oNode.
> > > > It's just too big for the 4K random write case. I know that's not
> > > > indicative of real-world usage. But it's what people use to
> > > > measure systems....
> > >
> > > How small do you think it would need to be to be acceptable (in teh
> > > worst- case, 1024 4k lextents in a 4m object)?  2k?  3k?  1k?
> > >
> > > You're probably right, but I think I can still cut the full map encoding
> further.
> > > It was 6500, I did a quick tweak to get it to 5500, and I think I
> > > can drop another
> > > 1-2 bytes per entry for common cases (offset of 0, length ==
> > > previous lextent length), which would get it under the 4k mark.
> > >
> >
> > I don't think there's a hard number on the size, but the fancy differential
> encoding is going to really chew up CPU time re-inflating it. All for the
> purpose of doing a lower-bound on the inflated data.
> >
> > > > BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix
> > > > stuff suggests that it's would relatively trivial to ensure that
> > > > bloom-filters properly ignore the "offset" portion of an lextent key.
> > > > Meaning that I believe that a "lower_bound" operator ought to be
> > > > relatively easy to implement without triggering the overhead of a
> > > > snapshot, again, provided that you provided a custom bloom-filter
> > > > implementation that did the right thing.
> > >
> > > Hmm, that might help.  It may be that this isn't super significant,
> > > though, too... as I mentioned there is no snapshot involved with a vanilla
> iterator.
> >
> > Technically correct, however, the iterator is stated as providing a
> > consistent view of the data, meaning that it has some kind of
> > micro-snapshot equivalent associated with it -- which I suspect is
> > where the extra expense comes in (it must bump an internal ref-count
> > on the .sst files in the range of the iterator as well as some kind of
> > filddling with the memtable). The lower_bound that I'm talking about
> > ought not be any more expensive than a regular get would be (assuming
> > the correct bloom filter behavior) it's just a tweak of the existing
> > search logic for a regular Get operation (though I suspect there are
> > some edge cases where you're exactly on the end of at .sst range, blah
> > blah blah -- perhaps by providing only one of lower_bound or
> > upper_bound (but not both) that extra edge case might be eliminated)
> >
> > >
> > > > So the question is do we want to avoid monkeying with RocksDB and
> > > > go with a binning approach (TBD ensuring that this is a real
> > > > solution to the problem) OR do we bite the bullet and solve the
> > > > lower-bound lookup problem?
> > > >
> > > > BTW, on the "binning" scheme, perhaps the solution is just put the
> > > > current "binning value" [presumeably some log2 thing] into the
> > > > oNode itself -- it'll just be a byte. Then you're only stuck with
> > > > the complexity of deciding when to do a "split" if the current bin
> > > > has gotten too large (some arbitrary limit on size of the encoded
> > > > lexent-bin)
> > >
> > > I think simple is better, but it's annoying because one big bin
> > > means splitting all other bins, and we don't know when to merge without
> seeing all bin sizes.
> >
> > I was thinking of something simple. One binning value for the entire
> oNode. If you split, you have to read all of the lextents, re-bin them and re-
> write them -- which shouldn't be --that-- difficult to do.... Maybe we just
> ignore the un-split case [do it later ?].
> >
> > >
> > > sage
> > >
> > >
> > > >
> > > > >
> > > > > sage
> > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > > > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > > > > > > Sent: Sunday, August 21, 2016 10:27 AM
> > > > > > > To: Sage Weil <sweil@redhat.com>
> > > > > > > Cc: ceph-devel@vger.kernel.org
> > > > > > > Subject: Re: bluestore blobs REVISITED
> > > > > > >
> > > > > > > I wonder how hard it would be to add a "lower-bound" fetch like
> stl.
> > > > > > > That would allow the kv store to do the fetch without
> > > > > > > incurring the overhead of a snapshot for the iteration scan.
> > > > > > >
> > > > > > > Shared blobs were always going to trigger an extra kv fetch
> > > > > > > no matter
> > > > > what.
> > > > > > >
> > > > > > > Sent from my iPhone. Please excuse all typos and autocorrects.
> > > > > > >
> > > > > > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@redhat.com>
> > > wrote:
> > > > > > > >
> > > > > > > >> On Sat, 20 Aug 2016, Allen Samuels wrote:
> > > > > > > >> I have another proposal (it just occurred to me, so it
> > > > > > > >> might not survive more scrutiny).
> > > > > > > >>
> > > > > > > >> Yes, we should remove the blob-map from the oNode.
> > > > > > > >>
> > > > > > > >> But I believe we should also remove the lextent map from
> > > > > > > >> the oNode and make each lextent be an independent KV
> value.
> > > > > > > >>
> > > > > > > >> However, in the special case where each extent
> > > > > > > >> --exactly-- maps onto a blob AND the blob is not
> > > > > > > >> referenced by any other extent (which is the typical
> > > > > > > >> case, unless you're doing compression with strange-ish
> > > > > > > >> overlaps)
> > > > > > > >> -- then you encode the blob in the lextent itself and
> > > > > > > >> there's no separate blob entry.
> > > > > > > >>
> > > > > > > >> This is pretty much exactly the same number of KV fetches
> > > > > > > >> as what you proposed before when the blob isn't shared
> > > > > > > >> (the typical case)
> > > > > > > >> -- except the oNode is MUCH MUCH smaller now.
> > > > > > > >
> > > > > > > > I think this approach makes a lot of sense!  The only
> > > > > > > > thing I'm worried about is that the lextent keys are no
> > > > > > > > longer known when they are being fetched (since they will
> > > > > > > > be a logical offset), which means we'll have to use an
> > > > > > > > iterator instead of a
> > > simple get.
> > > > > > > > The former is quite a bit slower than the latter (which
> > > > > > > > can make use of the rocksdb caches and/or key bloom
> > > > > > > > filters more
> > > easily).
> > > > > > > >
> > > > > > > > We could augment your approach by keeping *just* the
> > > > > > > > lextent offsets in the onode, so that we know exactly
> > > > > > > > which lextent key to fetch, but then I'm not sure we'll
> > > > > > > > get much benefit (lextent metadata size goes down by ~1/3,
> > > > > > > > but then we have an extra get for
> > > > > cloned objects).
> > > > > > > >
> > > > > > > > Hmm, the other thing to keep in mind is that for RBD the
> > > > > > > > common case is that lots of objects have clones, and many
> > > > > > > > of those objects' blobs will be shared.
> > > > > > > >
> > > > > > > > sage
> > > > > > > >
> > > > > > > >> So for the non-shared case, you fetch the oNode which is
> > > > > > > >> dominated by the xattrs now (so figure a couple of
> > > > > > > >> hundred bytes and not much CPU cost to deserialize). And
> > > > > > > >> then fetch from the KV for the lextent (which is 1 fetch
> > > > > > > >> -- unless it overlaps two previous lextents). If it's the
> > > > > > > >> optimal case, the KV fetch is small (10-15 bytes) and
> > > > > > > >> trivial to deserialize. If it's an unshared/local blob
> > > > > > > >> then you're ready to go. If the blob is shared (locally
> > > > > > > >> or globally) then you'll have to go
> > > fetch that one too.
> > > > > > > >>
> > > > > > > >> This might lead to the elimination of the local/global
> > > > > > > >> blob thing (I think you've talked about that before) as
> > > > > > > >> now the only
> > > "local"
> > > > > > > >> blobs are the unshared single extent blobs which are
> > > > > > > >> stored inline with the lextent entry. You'll still have
> > > > > > > >> the special cases of promoting unshared
> > > > > > > >> (inline) blobs to global blobs -- which is probably
> > > > > > > >> similar to the current promotion "magic" on a clone operation.
> > > > > > > >>
> > > > > > > >> The current refmap concept may require some additional
> work.
> > > > > > > >> I believe that we'll have to do a reconstruction of the
> > > > > > > >> refmap, but fortunately only for the range of the current
> > > > > > > >> I/O. That will be a bit more expensive, but still less
> > > > > > > >> expensive than reconstructing the entire refmap for every
> > > > > > > >> oNode deserialization, Fortunately I believe the refmap
> > > > > > > >> is only really needed for compression cases or RBD cases
> > > > > > > without "trim"
> > > > > > > >> (this is the case to optimize -- it'll make trim really
> > > > > > > >> important for performance).
> > > > > > > >>
> > > > > > > >> Best of both worlds????
> > > > > > > >>
> > > > > > > >> Allen Samuels
> > > > > > > >> SanDisk |a Western Digital brand
> > > > > > > >> 2880 Junction Avenue, Milpitas, CA 95134
> > > > > > > >> T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > >> allen.samuels@SanDisk.com
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>> -----Original Message-----
> > > > > > > >>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > >>> [mailto:ceph-devel- owner@vger.kernel.org] On Behalf Of
> > > > > > > >>> Allen Samuels
> > > > > > > >>> Sent: Friday, August 19, 2016 7:16 AM
> > > > > > > >>> To: Sage Weil <sweil@redhat.com>
> > > > > > > >>> Cc: ceph-devel@vger.kernel.org
> > > > > > > >>> Subject: RE: bluestore blobs
> > > > > > > >>>
> > > > > > > >>>> -----Original Message-----
> > > > > > > >>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > >>>> Sent: Friday, August 19, 2016 6:53 AM
> > > > > > > >>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > > >>>> Cc: ceph-devel@vger.kernel.org
> > > > > > > >>>> Subject: RE: bluestore blobs
> > > > > > > >>>>
> > > > > > > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > > > > > >>>>>> -----Original Message-----
> > > > > > > >>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM
> > > > > > > >>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > > >>>>>> Cc: ceph-devel@vger.kernel.org
> > > > > > > >>>>>> Subject: RE: bluestore blobs
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > > > >>>>>>>> From: ceph-devel-owner@vger.kernel.org
> > > > > > > >>>>>>>> [mailto:ceph-
> > > > > devel-
> > > > > > > >>>>>>>> owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > > > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > > > >>>>>>>> To: ceph-devel@vger.kernel.org
> > > > > > > >>>>>>>> Subject: bluestore blobs
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I think we need to look at other changes in
> > > > > > > >>>>>>>> addition to the encoding performance improvements.
> > > > > > > >>>>>>>> Even if they end up being good enough, these
> > > > > > > >>>>>>>> changes are somewhat orthogonal and at
> > > > > > > least
> > > > > > > >>>>>>>> one of them should give us something that is even
> faster.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> 1. I mentioned this before, but we should keep the
> > > > > > > >>>>>>>> encoding bluestore_blob_t around when we load the
> > > > > > > >>>>>>>> blob map.  If it's not changed, don't reencode it.
> > > > > > > >>>>>>>> There are no blockers for implementing this
> > > > > > > >>>>>> currently.
> > > > > > > >>>>>>>> It may be difficult to ensure the blobs are
> > > > > > > >>>>>>>> properly marked
> > > > > dirty...
> > > > > > > >>>>>>>> I'll see if we can use proper accessors for the
> > > > > > > >>>>>>>> blob to enforce this at compile time.  We should do
> that anyway.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> If it's not changed, then why are we re-writing it?
> > > > > > > >>>>>>> I'm having a hard time thinking of a case worth
> > > > > > > >>>>>>> optimizing where I want to re-write the oNode but
> > > > > > > >>>>>>> the blob_map is
> > > unchanged.
> > > > > > > >>>>>>> Am I missing
> > > > > > > >>>> something obvious?
> > > > > > > >>>>>>
> > > > > > > >>>>>> An onode's blob_map might have 300 blobs, and a
> > > > > > > >>>>>> single write only updates one of them.  The other 299
> > > > > > > >>>>>> blobs need not be reencoded, just
> > > > > > > >>>> memcpy'd.
> > > > > > > >>>>>
> > > > > > > >>>>> As long as we're just appending that's a good optimization.
> > > > > > > >>>>> How often does that happen? It's certainly not going
> > > > > > > >>>>> to help the RBD 4K random write problem.
> > > > > > > >>>>
> > > > > > > >>>> It won't help the (l)extent_map encoding, but it avoids
> > > > > > > >>>> almost all of the blob reencoding.  A 4k random write
> > > > > > > >>>> will update one blob out of
> > > > > > > >>>> ~100 (or whatever it is).
> > > > > > > >>>>
> > > > > > > >>>>>>>> 2. This turns the blob Put into rocksdb into two
> > > > > > > >>>>>>>> memcpy
> > > > > stages:
> > > > > > > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs
> > > > > > > >>>>>>>> to each untouched
> > > > > > > >>>>>>>> blob) into a single rocksdb::Slice, and another
> > > > > > > >>>>>>>> memcpy somewhere inside rocksdb to copy this into
> > > > > > > >>>>>>>> the write
> > > buffer.
> > > > > > > >>>>>>>> We could extend the rocksdb interface to take an
> > > > > > > >>>>>>>> iovec so that the first memcpy isn't needed (and
> > > > > > > >>>>>>>> rocksdb will instead iterate over our buffers and
> > > > > > > >>>>>>>> copy them directly into its
> > > > > write buffer).
> > > > > > > >>>>>>>> This is probably a pretty small piece of the overall
> time...
> > > > > > > >>>>>>>> should verify with a profiler
> > > > > > > >>>>>> before investing too much effort here.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I doubt it's the memcpy that's really the expensive part.
> > > > > > > >>>>>>> I'll bet it's that we're transcoding from an
> > > > > > > >>>>>>> internal to an external representation on an element
> > > > > > > >>>>>>> by element basis. If the iovec scheme is going to
> > > > > > > >>>>>>> help, it presumes that the internal data structure
> > > > > > > >>>>>>> essentially matches the external data structure so
> > > > > > > >>>>>>> that only an iovec copy is required. I'm wondering
> > > > > > > >>>>>>> how compatible this is with the current concepts of
> > > > > > > lextext/blob/pextent.
> > > > > > > >>>>>>
> > > > > > > >>>>>> I'm thinking of the xattr case (we have a bunch of
> > > > > > > >>>>>> strings to copy
> > > > > > > >>>>>> verbatim) and updated-one-blob-and-kept-99-
> unchanged
> > > case:
> > > > > > > >>>>>> instead of memcpy'ing them into a big contiguous
> > > > > > > >>>>>> buffer and having rocksdb memcpy
> > > > > > > >>>>>> *that* into it's larger buffer, give rocksdb an iovec
> > > > > > > >>>>>> so that they smaller buffers are assembled only once.
> > > > > > > >>>>>>
> > > > > > > >>>>>> These buffers will be on the order of many 10s to a
> > > > > > > >>>>>> couple 100s of
> > > > > > > >>> bytes.
> > > > > > > >>>>>> I'm not sure where the crossover point for
> > > > > > > >>>>>> constructing and then traversing an iovec vs just copying
> twice would be...
> > > > > > > >>>>>
> > > > > > > >>>>> Yes this will eliminate the "extra" copy, but the real
> > > > > > > >>>>> problem is that the oNode itself is just too large. I
> > > > > > > >>>>> doubt removing one extra copy is going to suddenly
> "solve"
> > > > > > > >>>>> this problem. I think we're going to end up
> > > > > > > >>>>> rejiggering things so that this will be much less of a
> > > > > > > >>>>> problem than it is now
> > > -- time will tell.
> > > > > > > >>>>
> > > > > > > >>>> Yeah, leaving this one for last I think... until we see
> > > > > > > >>>> memcpy show up in the profile.
> > > > > > > >>>>
> > > > > > > >>>>>>>> 3. Even if we do the above, we're still setting a
> > > > > > > >>>>>>>> big (~4k or
> > > > > > > >>>>>>>> more?) key into rocksdb every time we touch an
> > > > > > > >>>>>>>> object, even when a tiny
> > > > > > > >>>>>
> > > > > > > >>>>> See my analysis, you're looking at 8-10K for the RBD
> > > > > > > >>>>> random write case
> > > > > > > >>>>> -- which I think everybody cares a lot about.
> > > > > > > >>>>>
> > > > > > > >>>>>>>> amount of metadata is getting changed.  This is a
> > > > > > > >>>>>>>> consequence of embedding all of the blobs into the
> > > > > > > >>>>>>>> onode (or bnode).  That seemed like a good idea
> > > > > > > >>>>>>>> early on when they were tiny (i.e., just an
> > > > > > > >>>>>>>> extent), but now I'm not so sure.  I see a couple
> > > > > > > >>>>>>>> of different
> > > > > > > >>>> options:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> a) Store each blob as ($onode_key+$blobid).  When
> > > > > > > >>>>>>>> we load the onode, load the blobs too.  They will
> > > > > > > >>>>>>>> hopefully be sequential in rocksdb (or definitely
> > > > > > > >>>>>>>> sequential
> > > in zs).
> > > > > > > >>>>>>>> Probably go back to using an
> > > > > > > >>>> iterator.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> b) Go all in on the "bnode" like concept.  Assign
> > > > > > > >>>>>>>> blob ids so that they are unique for any given hash
> value.
> > > > > > > >>>>>>>> Then store the blobs as
> > > > > > > >>>>>>>> $shard.$poolid.$hash.$blobid (i.e., where the bnode
> > > > > > > >>>>>>>> is now).  Then when clone happens there is no
> > > > > > > >>>>>>>> onode->bnode migration magic happening--we've
> > > > > > > >>>>>>>> already committed to storing blobs in separate
> > > > > > > >>>>>>>> keys.  When we load the onode, keep the conditional
> > > > > > > >>>>>>>> bnode loading we already have.. but when the bnode
> > > > > > > >>>>>>>> is loaded load up all the blobs for the hash key.
> > > > > > > >>>>>>>> (Okay, we could fault in blobs individually, but
> > > > > > > >>>>>>>> that code will be more
> > > > > > > >>>>>>>> complicated.)
> > > > > > > >>>>>
> > > > > > > >>>>> I like this direction. I think you'll still end up
> > > > > > > >>>>> demand loading the blobs in order to speed up the random
> read case.
> > > > > > > >>>>> This scheme will result in some space-amplification,
> > > > > > > >>>>> both in the lextent and in the blob-map, it's worth a
> > > > > > > >>>>> bit of study too see how bad the metadata/data ratio
> > > > > > > >>>>> becomes (just as a guess, $shard.$poolid.$hash.$blobid
> > > > > > > >>>>> is probably
> > > > > > > >>>>> 16 +
> > > > > > > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for
> > > > > > > >>>>> each Blob
> > > > > > > >>>>> -- unless your KV store does path compression. My
> > > > > > > >>>>> reading of RocksDB sst file seems to indicate that it
> > > > > > > >>>>> doesn't, I
> > > > > > > >>>>> *believe* that ZS does [need to confirm]). I'm
> > > > > > > >>>>> wondering if the current notion
> > > > > > > of local vs.
> > > > > > > >>>>> global blobs isn't actually beneficial in that we can
> > > > > > > >>>>> give local blobs different names that sort with their
> > > > > > > >>>>> associated oNode (which probably makes the space-amp
> > > > > > > >>>>> worse) which is an important optimization. We do need
> > > > > > > >>>>> to watch the space amp, we're going to be burning DRAM
> > > > > > > >>>>> to make KV accesses cheap and the amount of DRAM
> > > > > > > is
> > > > > > > >>> proportional to the space amp.
> > > > > > > >>>>
> > > > > > > >>>> I got this mostly working last night... just need to
> > > > > > > >>>> sort out the clone case (and clean up a bunch of code).
> > > > > > > >>>> It was a relatively painless transition to make,
> > > > > > > >>>> although in its current form the blobs all belong to
> > > > > > > >>>> the bnode, and the bnode if ephemeral but remains in
> > > > > > > >>> memory until all referencing onodes go away.
> > > > > > > >>>> Mostly fine, except it means that odd combinations of
> > > > > > > >>>> clone could leave lots of blobs in cache that don't get
> trimmed.
> > > > > > > >>>> Will address that
> > > > > > > later.
> > > > > > > >>>>
> > > > > > > >>>> I'll try to finish it up this morning and get it
> > > > > > > >>>> passing tests and
> > > posted.
> > > > > > > >>>>
> > > > > > > >>>>>>>> In both these cases, a write will dirty the onode
> > > > > > > >>>>>>>> (which is back to being pretty small.. just xattrs
> > > > > > > >>>>>>>> and the lextent
> > > > > > > >>>>>>>> map) and 1-3 blobs (also
> > > > > > > >>>>>> now small keys).
> > > > > > > >>>>>
> > > > > > > >>>>> I'm not sure the oNode is going to be that small.
> > > > > > > >>>>> Looking at the RBD random 4K write case, you're going
> > > > > > > >>>>> to have 1K entries each of which has an offset, size
> > > > > > > >>>>> and a blob-id reference in them. In my current oNode
> > > > > > > >>>>> compression scheme this compresses to about 1 byte
> > > > > > > per entry.
> > > > > > > >>>>> However, this optimization relies on being able to
> > > > > > > >>>>> cheaply renumber the blob-ids, which is no longer
> > > > > > > >>>>> possible when the blob-ids become parts of a key (see
> > > > > > > >>>>> above). So now you'll have a minimum of 1.5-3 bytes
> > > > > > > >>>>> extra for each blob-id (because you can't assume that
> > > > > > > >>>>> the blob-ids
> > > > > > > >>>> become "dense"
> > > > > > > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or
> > > > > > > >>>>> about 2.5-4K Bytes of lextent table. Worse, because of
> > > > > > > >>>>> the variable length encoding you'll have to scan the
> > > > > > > >>>>> entire table to deserialize it (yes, we could do
> > > > > > > >>>>> differential editing when we write but that's another
> > > > > > > >>> discussion).
> > > > > > > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and
> > > > > > > >>>>> xattrs
> > > :).
> > > > > > > >>>>> So while this looks small compared to the current ~30K
> > > > > > > >>>>> for the entire thing oNode/lextent/blobmap, it's NOT a
> > > > > > > >>>>> huge gain over 8-10K of the compressed
> > > > > > > >>>>> oNode/lextent/blobmap scheme that I
> > > > > > > published earlier.
> > > > > > > >>>>>
> > > > > > > >>>>> If we want to do better we will need to separate the
> > > > > > > >>>>> lextent from the oNode also. It's relatively easy to
> > > > > > > >>>>> move the lextents into the KV store itself (there are
> > > > > > > >>>>> two obvious ways to deal with this, either use the
> > > > > > > >>>>> native offset/size from the lextent itself OR create
> > > > > > > >>>>> 'N' buckets of logical offset into which we pour
> > > > > > > >>>>> entries -- both of these would add somewhere between 1
> > > > > > > >>>>> and 2 KV look-ups
> > > > > > > per
> > > > > > > >>>>> operation
> > > > > > > >>>>> -- here is where an iterator would probably help.
> > > > > > > >>>>>
> > > > > > > >>>>> Unfortunately, if you only process a portion of the
> > > > > > > >>>>> lextent (because you've made it into multiple keys and
> > > > > > > >>>>> you don't want to load all of
> > > > > > > >>>>> them) you no longer can re-generate the refmap on the
> > > > > > > >>>>> fly (another key space optimization). The lack of
> > > > > > > >>>>> refmap screws up a number of other important
> > > > > > > >>>>> algorithms -- for example the overlapping blob-map
> > > > > > > >>> thing, etc.
> > > > > > > >>>>> Not sure if these are easy to rewrite or not -- too
> > > > > > > >>>>> complicated to think about at this hour of the evening.
> > > > > > > >>>>
> > > > > > > >>>> Yeah, I forgot about the extent_map and how big it
> > > > > > > >>>> gets.  I think, though, that if we can get a 4mb object
> > > > > > > >>>> with 1024 4k lextents to encode the whole onode and
> > > > > > > >>>> extent_map in under 4K that will be good enough.  The
> > > > > > > >>>> blob update that goes with it will be ~200 bytes, and
> > > > > > > >>>> benchmarks aside, the 4k random write 100% fragmented
> > > > > > > >>>> object is a worst
> > > > > > > >>> case.
> > > > > > > >>>
> > > > > > > >>> Yes, it's a worst-case. But it's a
> > > > > > > >>> "worst-case-that-everybody-looks-at" vs. a
> > > > > > > >>> "worst-case-that-almost-
> > > > > > > nobody-looks-at".
> > > > > > > >>>
> > > > > > > >>> I'm still concerned about having an oNode that's larger
> > > > > > > >>> than a 4K
> > > > > block.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>>
> > > > > > > >>>> Anyway, I'll get the blob separation branch working and
> > > > > > > >>>> we can go from there...
> > > > > > > >>>>
> > > > > > > >>>> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-23 16:46                 ` Allen Samuels
@ 2016-08-23 19:39                   ` Sage Weil
  2016-08-24  3:07                     ` Ramesh Chander
  2016-08-24  3:52                     ` Allen Samuels
  0 siblings, 2 replies; 29+ messages in thread
From: Sage Weil @ 2016-08-23 19:39 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Tue, 23 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 12:03 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs REVISITED
> > 
> > I just got the onode, including the full lextent map, down to ~1500 bytes.
> > The lextent map encoding optimizations went something like this:
> > 
> > - 1 blobid bit to indicate that this lextent starts where the last one ended.
> > (6500->5500)
> > 
> > - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate length is same as
> > previous lextent.  (5500->3500)
> > 
> > - make blobid signed (1 bit) and delta encode relative to previous blob.
> > (3500->1500).  In practice we'd get something between 1500 and 3500
> > because blobids won't have as much temporal locality as my test workload.
> > OTOH, this is really important because blobids will also get big over time (we
> > don't (yet?) have a way to reuse blobids and/or keep them unique to a hash
> > key, so they grow as the osd ages).
> 
> This seems fishy to me, my mental model for the blob_id suggests that it 
> must be at least 9 bits (for a random write workload) in size (1K 
> entries randomly written lead to an average distance of 512, which means 
> 10 bits to encode -- plus the other optimizations bits). Meaning that 
> you're going to have two bytes for each lextent, meaning at least 2048 
> bytes of lextent plus the remainder of the oNode. So, probably more like 
> 2500 bytes.
> 
> Am I missing something?

Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I 
think this is still enough to get us under 4K of metadata if we make 
pg_stat_t encoding space-efficient (and possibly even without that).

> > https://github.com/liewegas/ceph/blob/wip-bluestore-
> > blobwise/src/os/bluestore/bluestore_types.cc#L826
> > 
> > This makes the metadata update for a 4k write look like
> > 
> >   1500 byte onode update
> >   20 byte blob update
> >   182 byte + 847(*) byte omap updates (pg log)
> > 
> >   * pg _info key... def some room for optimization here I think!
> > 
> > In any case, this is pretty encouraging.  I think we have a few options:
> > 
> > 1) keep extent map in onode and (re)encode fully each time (what I have
> > now).  blobs live in their own keys.
> > 
> > 2) keep extent map in onode but shard it in memory and only reencode the
> > part(s) that get modified.  this will alleviate the cpu concerns around a more
> > complex encoding.  no change to on-disk format.
> > 
> 
> This will save some CPU, but the write-amp is still pretty bad unless 
> you get the entire commit to < 4K bytes on Rocks.
> 
> > 3) join lextents and blobs (Allen's proposal) and dynamically bin 
> > based on the encoded size.
> > 
> > 4) #3, but let shard_size=0 in onode (or whatever) put it inline with onode, so
> > that simple objects avoid any additional kv op.
> 
> Yes, I'm still of the mind that this is the best approach. I'm not sure 
> it's really that hard because most of the "special" cases can be dealt 
> with in a common brute-force way (because they don't happen too often).

Yep.  I think my next task is to code this one up.  Before I start, 
though, I want to make sure I'm doing the most reasonable thing.  Please 
review:

Right now the blobwise branch has (unshared and) shared blobs in their own 
key.  Shared blobs only update when they are occluded and their ref_map 
changes.  If it weren't for that, we could go back to cramming them 
together in a single bnode key, but I think we'll do better with them 
as separate blob keys.  This helps small reads (we don't load all blobs), 
hurts large reads (we read them all from separate keys instead of all at 
once), and of course helps writes because we only update a single small 
blob record.

The onode will get an extent_map_shard_size, which is measured in bytes and 
tells us how the map is chunked into keys.  If it's 0, the whole map will 
stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the 
first key, [extent_map_shard_size, extent_map_shard_size*2) is in the 
second, etc.
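
A minimal sketch of the addressing this implies, assuming a fixed shard size and a hypothetical key layout (neither is the actual BlueStore code):

#include <cstdint>
#include <string>
#include <utility>

// Which extent_map shards does [offset, offset+length) touch?
// Returns an inclusive [first, last] pair of shard indices; shard_size == 0
// means the whole map is inline in the onode and nothing extra is fetched.
std::pair<uint64_t, uint64_t> shards_for_range(uint64_t shard_size,
                                               uint64_t offset,
                                               uint64_t length) {
  if (shard_size == 0 || length == 0)
    return {0, 0};
  return {offset / shard_size, (offset + length - 1) / shard_size};
}

// Hypothetical shard key layout: the onode key plus the shard index.
std::string extent_shard_key(const std::string& onode_key, uint64_t index) {
  return onode_key + "." + std::to_string(index);
}

For example, with a 128K shard size a 4K write at offset 1M touches only shard 8, while a 4K write that crosses the 128K boundary touches shards 0 and 1.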

In memory, we can pull this out of the onode_t struct and manage it in 
Onode where we can keep track of which parts are loaded, dirty, etc.

For each map chunk, we can do the inline blob trick.  Unfortunately we can 
only do this for 'unshared' blobs, where 'shared' now means shared across 
extent_map shards and not just across objects.  We can conservatively 
estimate this by just looking at the blob size w/o looking at other 
lextents, at least.

I think we can also do a bit better than Varada's current map<> used 
during encode/decode.  As we iterate over the map, we can store the 
ephemeral blob id *for this encoding* in the Blob (not blob_t), and use 
the low bit of the blob id to indicate it is an ephemeral id vs a real 
one.  On decode we can size a vector at start and fill it with BlobRef 
entries as we go to avoid a dynamic map<> or other structure.  Another 
nice thing here is that we only have to assign global blobids when we are 
stored in a blob key instead of inline in the map, which makes the blob 
ids smaller (with fewer significant bits).
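
A minimal sketch of the low-bit tagging described above (the helper names are made up):

#include <cstdint>

// Blob id tagging sketch: real (persistent) blob ids keep the low bit 0,
// ephemeral ids assigned only for the duration of one encode set it to 1.
// The decoder can then size a vector of BlobRefs up front and index it by
// the ephemeral id instead of building a map<> during decode.
inline uint64_t encode_global_blobid(uint64_t id)    { return id << 1; }
inline uint64_t encode_ephemeral_blobid(uint64_t n)  { return (n << 1) | 1; }
inline bool     blobid_is_ephemeral(uint64_t v)      { return v & 1; }
inline uint64_t blobid_value(uint64_t v)             { return v >> 1; }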

The Bnode then goes from a map of all blobs for the hash to a map of only 
the shared blobs (between shards or objects) for the hash--i.e., those blobs that 
have a blobid.  I think to make this work we then also need to change the 
in-memory lextent map from map<uint64_t,bluestore_lextent_t> to 
map<uint64_t,Lextent> so that we can handle the local vs remote pointer 
cases and, for the local ones, store the BlobRef right there.  This'll 
be marginally more complex to code (no 1:1 mapping of in-memory to encoded 
structures) but it should be faster (one less in-memory map lookup).
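
A rough sketch of what that in-memory structure could look like, assuming BlobRef is a ref-counted pointer to the in-memory Blob (field names are illustrative, not the actual code):

#include <cstdint>
#include <map>
#include <memory>

struct Blob;                               // in-memory blob state
using BlobRef = std::shared_ptr<Blob>;     // stand-in for the real ref type

// One logical extent: either it owns a reference to a local (unshared,
// inline-encoded) blob, or it names a shared blob by id, to be resolved
// through the Bnode's map of shared blobs.
struct Lextent {
  uint32_t blob_offset = 0;    // offset of this extent within the blob
  uint32_t length = 0;
  BlobRef  blob;               // set for local blobs
  uint64_t blob_id = 0;        // non-zero for shared (bnode-resident) blobs
  bool is_local() const { return static_cast<bool>(blob); }
};

// logical offset -> Lextent, replacing map<uint64_t,bluestore_lextent_t>
using lextent_map_t = std::map<uint64_t, Lextent>;

Shared blobs would still be resolved via the Bnode's blob_id -> BlobRef map; only the unshared, shard-local ones carry their BlobRef inline.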

Anyway, that's my current plan. Any thoughts before I start?
sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-23 19:39                   ` Sage Weil
@ 2016-08-24  3:07                     ` Ramesh Chander
  2016-08-24  3:52                     ` Allen Samuels
  1 sibling, 0 replies; 29+ messages in thread
From: Ramesh Chander @ 2016-08-24  3:07 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel

A few things related to shrinking the onode that I have been thinking of, which may or may not already be listed here:
1. Store the minimum allocation size once in the onode and then store offset and length for physical extents in multiples of the minimum allocation size.
  In the case of a 4k block size, we should be able to handle 16T of storage using just 32-bit block numbers and a single 2-byte length (with a limit on the max size of the onode).
  This should save 4-5 bytes compared to 64-bit offsets without incurring much CPU overhead for encoding and decoding (a rough arithmetic sketch follows after this list).

2. The blob id is the identifier used for looking up and sharing blobs. Can we have blobs without a blob id when they are not shared, so that we can store the blob directly in the lextent instead of having a pointer?
   We are rewriting everything anyway, so changes to the lextent and the blob would go in the same write. If we later have to share a blob, the lextent can point to a blob id instead of the inline values.

3. Another point Allen already discussed: having fixed-length lextents, so we don't need an offset-to-lextent mapping.
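
A rough arithmetic sketch for point 1 (illustrative struct, not a proposal for the exact on-disk encoding):

#include <cstdint>

// With min_alloc_size stored once in the onode (say 4 KiB), a physical
// extent can be a 32-bit block number plus a 16-bit length in blocks:
//   2^32 blocks * 4 KiB = 16 TiB of addressable device space
//   2^16 blocks * 4 KiB = 256 MiB maximum extent length
struct compact_pextent {
  uint32_t block;     // physical offset / min_alloc_size
  uint16_t nblocks;   // length / min_alloc_size
};                    // 6 bytes, vs 16 bytes for raw 64-bit offset + length

inline uint64_t pextent_offset(const compact_pextent& e, uint64_t min_alloc) {
  return static_cast<uint64_t>(e.block) * min_alloc;
}
inline uint64_t pextent_length(const compact_pextent& e, uint64_t min_alloc) {
  return static_cast<uint64_t>(e.nblocks) * min_alloc;
}

The fixed form is 6 bytes per physical extent versus 16 bytes for raw 64-bit offset and length; against the existing varint encoding the saving is smaller, in line with the 4-5 bytes estimated above.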

Any thoughts on the feasibility of these in the current design?

-Ramesh

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, August 24, 2016 1:10 AM
> To: Allen Samuels
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs REVISITED
>
> On Tue, 23 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Tuesday, August 23, 2016 12:03 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore blobs REVISITED
> > >
> > > I just got the onode, including the full lextent map, down to ~1500 bytes.
> > > The lextent map encoding optimizations went something like this:
> > >
> > > - 1 blobid bit to indicate that this lextent starts where the last one ended.
> > > (6500->5500)
> > >
> > > - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
> > > length is same as previous lextent.  (5500->3500)
> > >
> > > - make blobid signed (1 bit) and delta encode relative to previous blob.
> > > (3500->1500).  In practice we'd get something between 1500 and 3500
> > > because blobids won't have as much temporal locality as my test
> workload.
> > > OTOH, this is really important because blobids will also get big
> > > over time (we don't (yet?) have a way to reuse blobids and/or keep
> > > them unique to a hash key, so they grow as the osd ages).
> >
> > This seems fishy to me, my mental model for the blob_id suggests that
> > it must be at least 9 bits (for a random write workload) in size (1K
> > entries randomly written lead to an average distance of 512, which
> > means
> > 10 bits to encode -- plus the other optimizations bits). Meaning that
> > you're going to have two bytes for each lextent, meaning at least 2048
> > bytes of lextent plus the remainder of the oNode. So, probably more
> > like
> > 2500 bytes.
> >
> > Am I missing something?
>
> Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I this is still
> enough to get us under 4K of metadata if we make pg_stat_t encoding
> space-efficient (and possibly even without out).
>
> > > https://github.com/liewegas/ceph/blob/wip-bluestore-
> > > blobwise/src/os/bluestore/bluestore_types.cc#L826
> > >
> > > This makes the metadata update for a 4k write look like
> > >
> > >   1500 byte onode update
> > >   20 byte blob update
> > >   182 byte + 847(*) byte omap updates (pg log)
> > >
> > >   * pg _info key... def some room for optimization here I think!
> > >
> > > In any case, this is pretty encouraging.  I think we have a few options:
> > >
> > > 1) keep extent map in onode and (re)encode fully each time (what I
> > > have now).  blobs live in their own keys.
> > >
> > > 2) keep extent map in onode but shard it in memory and only reencode
> > > the
> > > part(s) that get modified.  this will alleviate the cpu concerns
> > > around a more complex encoding.  no change to on-disk format.
> > >
> >
> > This will save some CPU, but the write-amp is still pretty bad unless
> > you get the entire commit to < 4K bytes on Rocks.
> >
> > > 3) join lextents and blobs (Allen's proposal) and dynamically bin
> > > based on the encoded size.
> > >
> > > 4) #3, but let shard_size=0 in onode (or whatever) put it inline
> > > with onode, so that simple objects avoid any additional kv op.
> >
> > Yes, I'm still of the mind that this is the best approach. I'm not
> > sure it's really that hard because most of the "special" cases can be
> > dealt with in a common brute-force way (because they don't happen too
> often).
>
> Yep.  I think my next task is to code this one up.  Before I start, though, I want
> to make sure I'm doing the most reasonable thing.  Please
> review:
>
> Right now the blobwise branch has (unshared and) shared blobs in their own
> key.  Shared blobs only update when they are occluded and their ref_map
> changes.  If it weren't for that, we could go back to cramming them together
> in a single bnode key, but I'm I think we'll do better with them as separate
> blob keys.  This helps small reads (we don't load all blobs), hurts large reads
> (we read them all from separate keys instead of all at once), and of course
> helps writes because we only update a single small blob record.
>
> The onode will get a extent_map_shard_size, which is measured in bytes
> and tells us how the map is chunked into keys.  If it's 0, the whole map will
> stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
> first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
> second, etc.
>
> In memory, we can pull this out of the onode_t struct and manage it in
> Onode where we can keep track of which parts of loaded, dirty, etc.
>
> For each map chunk, we can do the inline blob trick.  Unfortunately we can
> only do this for 'unshared' blobs where shared is now shared across
> extent_map shards and not across objects.  We can conservatively estimate
> this by just looking at the blob size w/o looking at other lextents, at least.
>
> I think we can also do a bit better than Varada's current map<> used during
> encode/decode.  As we iterate over the map, we can store the ephemeral
> blob id *for this encoding* in the Blob (not blob_t), and use the low bit of the
> blob id to indicate it is an ephemeral id vs a real one.  On decode we can size
> a vector at start and fill it with BlobRef entries as we go to avoid a dynamic
> map<> or other structure.  Another nice thing here is that we only have to
> assign global blobids when we are stored in a blob key instead of inline in the
> map, which makes the blob ids smaller (with fewer significant bits).
>
> The Bnode then goes from a map of all blobs for the hash to a map of only
> shared (between shards or objects) for the hash--i.e., those blobs that have
> a blobid.  I think to make this work we then also need to change the in-
> memory lextent map from map<uint64_t,bluestore_lextent_t> to
> map<uint64_t,Lextent> so that we can handle the local vs remote pointer
> cases and for the local ones keep store the BlobRef right there.  This'll be
> marginally more complex to code (no 1:1 mapping of in-memory to encoded
> structures) but it should be faster (one less in-memory map lookup).
>
> Anyway, that's the my current plan. Any thoughts before I start?
> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-23 19:39                   ` Sage Weil
  2016-08-24  3:07                     ` Ramesh Chander
@ 2016-08-24  3:52                     ` Allen Samuels
  2016-08-24 18:15                       ` Sage Weil
  1 sibling, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-24  3:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, August 23, 2016 3:40 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs REVISITED
> 
> On Tue, 23 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Tuesday, August 23, 2016 12:03 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore blobs REVISITED
> > >
> > > I just got the onode, including the full lextent map, down to ~1500 bytes.
> > > The lextent map encoding optimizations went something like this:
> > >
> > > - 1 blobid bit to indicate that this lextent starts where the last one ended.
> > > (6500->5500)
> > >
> > > - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
> > > length is same as previous lextent.  (5500->3500)
> > >
> > > - make blobid signed (1 bit) and delta encode relative to previous blob.
> > > (3500->1500).  In practice we'd get something between 1500 and 3500
> > > because blobids won't have as much temporal locality as my test
> workload.
> > > OTOH, this is really important because blobids will also get big
> > > over time (we don't (yet?) have a way to reuse blobids and/or keep
> > > them unique to a hash key, so they grow as the osd ages).
> >
> > This seems fishy to me, my mental model for the blob_id suggests that
> > it must be at least 9 bits (for a random write workload) in size (1K
> > entries randomly written lead to an average distance of 512, which
> > means
> > 10 bits to encode -- plus the other optimizations bits). Meaning that
> > you're going to have two bytes for each lextent, meaning at least 2048
> > bytes of lextent plus the remainder of the oNode. So, probably more
> > like
> > 2500 bytes.
> >
> > Am I missing something?
> 
> Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I this is still
> enough to get us under 4K of metadata if we make pg_stat_t encoding
> space-efficient (and possibly even without out).
> 
> > > https://github.com/liewegas/ceph/blob/wip-bluestore-
> > > blobwise/src/os/bluestore/bluestore_types.cc#L826
> > >
> > > This makes the metadata update for a 4k write look like
> > >
> > >   1500 byte onode update
> > >   20 byte blob update
> > >   182 byte + 847(*) byte omap updates (pg log)
> > >
> > >   * pg _info key... def some room for optimization here I think!
> > >
> > > In any case, this is pretty encouraging.  I think we have a few options:
> > >
> > > 1) keep extent map in onode and (re)encode fully each time (what I
> > > have now).  blobs live in their own keys.
> > >
> > > 2) keep extent map in onode but shard it in memory and only reencode
> > > the
> > > part(s) that get modified.  this will alleviate the cpu concerns
> > > around a more complex encoding.  no change to on-disk format.
> > >
> >
> > This will save some CPU, but the write-amp is still pretty bad unless
> > you get the entire commit to < 4K bytes on Rocks.
> >
> > > 3) join lextents and blobs (Allen's proposal) and dynamically bin
> > > based on the encoded size.
> > >
> > > 4) #3, but let shard_size=0 in onode (or whatever) put it inline
> > > with onode, so that simple objects avoid any additional kv op.
> >
> > Yes, I'm still of the mind that this is the best approach. I'm not
> > sure it's really that hard because most of the "special" cases can be
> > dealt with in a common brute-force way (because they don't happen too
> often).
> 
> Yep.  I think my next task is to code this one up.  Before I start, though, I want
> to make sure I'm doing the most reasonable thing.  Please
> review:
> 
> Right now the blobwise branch has (unshared and) shared blobs in their own
> key.  Shared blobs only update when they are occluded and their ref_map
> changes.  If it weren't for that, we could go back to cramming them together
> in a single bnode key, but I'm I think we'll do better with them as separate
> blob keys.  This helps small reads (we don't load all blobs), hurts large reads
> (we read them all from separate keys instead of all at once), and of course
> helps writes because we only update a single small blob record.
> 
> The onode will get a extent_map_shard_size, which is measured in bytes
> and tells us how the map is chunked into keys.  If it's 0, the whole map will
> stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
> first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
> second, etc.

I've been thinking about this. I definitely like the '0' case where the lextent and potentially the local blob are inline with the oNode. That'll certainly accelerate lots of Object and CephFS stuff. All options should implement this.

I've been thinking about the lextent sharding/binning/paging problem. Mentally, I see two different ways to do it:

(1) Offset-based fixed binning -- I think this is what you described, i.e., the lextent table is recast into fixed offset ranges.
(2) Dynamic binning -- the lextent map is stored as a set of ranges (an interval-set); each entry in the interval set represents a KV key and contains some arbitrary number of whole lextent entries (the boundaries are a dynamic decision).

The downside of (1) is that you're going to have logical extents that cross the fixed bins (and sometimes MULTIPLE fixed bins). Handling this is just plain ugly in the code -- all sorts of special cases crop up for things like local blobs that used to be unshared but are now sorta-shared (between the two 'split' lextent entries). Then there are the cases where a single lextent crosses 2, 3, 10 shards.... UGLY UGLY UGLY. 

A secondary problem (quite possibly a non-problem) is that the sizes of the chunks of encoded lextent don't necessarily divide up particularly evenly. Hot spots in objects will cause this problem.

Method (2) sounds more complicated, but I actually think it's dramatically LESS complicated to implement -- because we're going to brute-force our way around a lot of the problems. First off, none of the ugly issues of splitting an lextent crop up -- by definition -- since we're chunking the map on whole lextent boundaries. I think from an implementation perspective you can break this down into some relatively simple logic.

Basically, from an implementation perspective, at the start of the operation (read or write) you compare the incoming offset range with the interval-set in the oNode and page in the portions of the lextent table that are required (required => for uncompressed data this is just the range of the incoming operation; for compressed data I think you need the lextent before and after the incoming range as well, to make the "depth limiting" code work correctly).
For a read, you're always done at this point.
For a write, you have the problem of updating the lextent table with whatever changes have been made. There are basically two sub-cases for the update: 
(a) The updated lextent chunk (or chunks!) is sufficiently similar that you want to keep the current "chunking" boundaries (more on this below). This is the easy case: just reserialize the chunk (or chunks) of the lextent map that changed (again, because of over-fetching in the compression case the changed range might not == the fetched range). 
(b) You decide that re-chunking is required. There are lots of possible splits and joins of chunks -- more ugly code. However, the worst-case size of the lextent map isn't really THAT large, and the frequency of re-chunking should be relatively low, so my recommendation is that we ONLY implement the complete re-chunking case. In other words, if you look at the chunking and you don't like it -- rather than spending time figuring out a bunch of splits and joins (sorta reminds me of FileStore directory splits) -- just read the remainder of the lextent map into memory and re-chunk the entire darn thing. As long as the frequency of this is low it should be cheap enough (a max-size lextent table isn't THAT big). Keeping the frequency of re-chunking low should be doable with a policy that has some hysteresis, i.e., something like: if a chunk is > 'N' bytes do a splitting re-chunk, but only do a joining re-chunk if it's smaller than N/2 bytes.... Something like that... 

Nothing in the on-disk structures prevents future code from implementing a more sophisticated split/join chunking scheme. In fact, I recommend that the oNode interval_set of the lextent chunks be decorated with the serialized size of each chunk so as to make that future algorithm much easier to implement (since it'll know the sizes of all of the serialized chunks). In other words, when it comes time to update the oNode/lextent, do the lextent chunks FIRST and put the generated sizes of the serialized lextents into the chunking table in the oNode. Having the sizes of all of the chunks available (without having to consult the KV store) will make the policy decision about when to re-chunk fairly easy and allow the future optimizer to make more intelligent decisions about balancing the chunk boundaries -- the extra data is pretty small and likely worth it (IMO).

I strongly believe that (2) is the way to go.
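
To make (2) a bit more concrete, here is a rough sketch of the bookkeeping I have in mind. All names and the exact policy knobs are hypothetical, just to illustrate the idea -- not a proposal for the actual bluestore types:

// Hypothetical sketch of the dynamic-binning table kept in the oNode.
// Each shard covers a whole-lextent-aligned logical range and records the
// serialized size of its KV value, so the re-chunking policy can run
// without touching the KV store.
#include <cstdint>
#include <utility>
#include <vector>

struct extent_map_shard_t {
  uint64_t logical_offset = 0;  // first logical offset covered by this shard
  uint64_t logical_length = 0;  // length of the covered logical range
  uint32_t encoded_bytes = 0;   // serialized size of this shard's KV value
};

struct extent_map_shard_table_t {
  std::vector<extent_map_shard_t> shards;  // sorted by offset; empty == whole map inline in onode

  // Which shards must be paged in for an operation touching [off, off+len)?
  // Returns a [first, last) index range into 'shards'.
  std::pair<size_t, size_t> shards_for_range(uint64_t off, uint64_t len) const {
    size_t first = 0, last = 0;
    for (size_t i = 0; i < shards.size(); ++i) {
      const auto& s = shards[i];
      if (s.logical_offset + s.logical_length <= off) { first = i + 1; continue; }
      if (s.logical_offset >= off + len) break;
      last = i + 1;
    }
    return {first, last < first ? first : last};
  }

  // Hysteresis policy from above: split-rechunk when a shard grows past
  // max_bytes, join-rechunk only once a shard has shrunk below max_bytes/2.
  // The brute-force approach re-encodes the whole map either way.
  bool needs_rechunk(uint32_t max_bytes) const {
    for (const auto& s : shards) {
      if (s.encoded_bytes > max_bytes) return true;                          // too big: split
      if (shards.size() > 1 && s.encoded_bytes < max_bytes / 2) return true; // too small: join
    }
    return false;
  }
};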

> 
> In memory, we can pull this out of the onode_t struct and manage it in
> Onode where we can keep track of which parts of loaded, dirty, etc.
> 
> For each map chunk, we can do the inline blob trick.  Unfortunately we can
> only do this for 'unshared' blobs where shared is now shared across
> extent_map shards and not across objects.  We can conservatively estimate
> this by just looking at the blob size w/o looking at other lextents, at least.

Some thoughts here. In the PMR and Flash world, isn't it the case that blobs are essentially immutable once they are global?

(In the SMR world, I know they'll be modified in order to "relocate" a chunk -- possibly invalidating the immutable assumption).

Why NOT pull the blob up into the lextent -- as a copy? If it's immutable, what's the harm? Yes, the stored metadata now expands by the reference count of a cloned blob, but if you can eliminate an extra KV fetch, I'm of the opinion that this seems like a desirable time/space tradeoff.... Just a thought :)
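
A rough sketch of the shape I'm imagining for a lextent entry that can either point at a shared blob or carry an immutable inline copy of it (types and names are purely illustrative, not the real bluestore structures):

// Illustrative only: a lextent entry that either references a blob record
// by id or embeds an immutable copy, trading metadata size for one less KV
// fetch on the read path.
#include <cstdint>
#include <optional>

struct blob_copy_t {       // stand-in for the real blob metadata
  uint64_t lba = 0;
  uint32_t length = 0;
  uint32_t flags = 0;      // csum/compression bits would live here
};

struct lextent_entry_t {
  uint64_t logical_offset = 0;
  uint32_t blob_offset = 0;
  uint32_t length = 0;
  uint64_t blob_id = 0;                    // 0 == purely local blob
  std::optional<blob_copy_t> inline_copy;  // copy of an immutable shared blob

  bool needs_blob_fetch() const {
    // Only hit the KV store when we reference a blob and hold no copy.
    return blob_id != 0 && !inline_copy;
  }
};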

BTW, here's a potentially really important thing that I just realized.... For a global blob, you want to make the blob_id something like #HashID.#LBA, so that when you're trying to do the SMR backpointer thing you can minimize the number of bNodes you have to search to find this LBA address (something like that :)) -- naturally this fails for blobs with multiple pextents, but we can outlaw that case in the SMR world (space is basically ALWAYS sequential in the SMR world :)).
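
For illustration only, the #HashID.#LBA packing could be as simple as the following; the 32/32 split is arbitrary and, as noted, only works when the blob has a single pextent:

// Hypothetical packing of a global blob id as <hash, LBA-block>; purely a
// bit-layout sketch, not the actual key format.
#include <cstdint>

inline uint64_t make_smr_blob_id(uint32_t hash, uint64_t lba_block) {
  return (uint64_t(hash) << 32) | (lba_block & 0xffffffffull);
}
inline uint32_t blob_id_hash(uint64_t id) { return uint32_t(id >> 32); }
inline uint64_t blob_id_lba(uint64_t id)  { return id & 0xffffffffull; }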

> 
> I think we can also do a bit better than Varada's current map<> used during
> encode/decode.  As we iterate over the map, we can store the ephemeral
> blob id *for this encoding* in the Blob (not blob_t), and use the low bit of the
> blob id to indicate it is an ephemeral id vs a real one.  On decode we can size
> a vector at start and fill it with BlobRef entries as we go to avoid a dynamic
> map<> or other structure.  Another nice thing here is that we only have to
> assign global blobids when we are stored in a blob key instead of inline in the
> map, which makes the blob ids smaller (with fewer significant bits).

I *think* you're describing exactly the algorithm that I published earlier (Varada's code should only require the map<> on encode). I used somewhat different terminology (i.e., 'positional' encoding of the blob_id for local blobs), but I think it's the same thing. During decode, blob_ids are assigned sequentially, so a vector (or a deque) is what you want to use. Clever coding would allow us to promote the size of the vector to the front of the object. (This is something that I had considered regularizing for the enc_dec framework, i.e., a value computed at the end of an encode that's made available earlier in the byte-stream, before the corresponding decode -- sorry, no varint :) )
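
A minimal sketch of the low-bit scheme being described -- ephemeral (positional) ids for blobs stored inline in the map chunk vs. real ids for blobs with their own key -- assuming ids are assigned sequentially on decode and collected in a pre-sized vector rather than a map<>; all names here are illustrative:

// Sketch: the low bit distinguishes an ephemeral (per-encode, positional)
// blob id from a real, globally assigned one.
#include <cstdint>
#include <vector>

inline uint64_t encode_ephemeral_id(uint32_t position) { return (uint64_t(position) << 1) | 1; }
inline uint64_t encode_global_id(uint64_t blob_id)     { return blob_id << 1; }

struct decoded_blob_ref_t {
  bool ephemeral;   // true: index into this chunk's local blob vector
  uint64_t value;   // position within the chunk, or the global blob id
};

inline decoded_blob_ref_t decode_blob_ref(uint64_t v) {
  return { (v & 1) != 0, v >> 1 };
}

// On decode, the local-blob count (promoted to the front of the encoding)
// lets us size the vector once and fill it as we go -- no dynamic map<>.
struct Blob;
using BlobRef = Blob*;
inline std::vector<BlobRef> make_local_blob_vector(uint32_t count) {
  return std::vector<BlobRef>(count, nullptr);
}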

> 
> The Bnode then goes from a map of all blobs for the hash to a map of only
> shared (between shards or objects) for the hash--i.e., those blobs that have
> a blobid.  I think to make this work we then also need to change the in-
> memory lextent map from map<uint64_t,bluestore_lextent_t> to
> map<uint64_t,Lextent> so that we can handle the local vs remote pointer
> cases and for the local ones keep store the BlobRef right there.  This'll be
> marginally more complex to code (no 1:1 mapping of in-memory to encoded
> structures) but it should be faster (one less in-memory map lookup).
> 
> Anyway, that's the my current plan. Any thoughts before I start?
> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: bluestore blobs REVISITED
  2016-08-24  3:52                     ` Allen Samuels
@ 2016-08-24 18:15                       ` Sage Weil
  2016-08-24 18:50                         ` Allen Samuels
                                           ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Sage Weil @ 2016-08-24 18:15 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Wed, 24 Aug 2016, Allen Samuels wrote:
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Tuesday, August 23, 2016 12:03 PM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: RE: bluestore blobs REVISITED
> > > >
> > > > I just got the onode, including the full lextent map, down to ~1500 bytes.
> > > > The lextent map encoding optimizations went something like this:
> > > >
> > > > - 1 blobid bit to indicate that this lextent starts where the last one ended.
> > > > (6500->5500)
> > > >
> > > > - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
> > > > length is same as previous lextent.  (5500->3500)
> > > >
> > > > - make blobid signed (1 bit) and delta encode relative to previous blob.
> > > > (3500->1500).  In practice we'd get something between 1500 and 3500
> > > > because blobids won't have as much temporal locality as my test
> > workload.
> > > > OTOH, this is really important because blobids will also get big
> > > > over time (we don't (yet?) have a way to reuse blobids and/or keep
> > > > them unique to a hash key, so they grow as the osd ages).
> > >
> > > This seems fishy to me, my mental model for the blob_id suggests that
> > > it must be at least 9 bits (for a random write workload) in size (1K
> > > entries randomly written lead to an average distance of 512, which
> > > means
> > > 10 bits to encode -- plus the other optimizations bits). Meaning that
> > > you're going to have two bytes for each lextent, meaning at least 2048
> > > bytes of lextent plus the remainder of the oNode. So, probably more
> > > like
> > > 2500 bytes.
> > >
> > > Am I missing something?
> > 
> > Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I this is still
> > enough to get us under 4K of metadata if we make pg_stat_t encoding
> > space-efficient (and possibly even without out).

Repeating this test with random IO gives onodes that are more like 6-7k, I 
think just because there are too many significant bits in the blob id.

So, I think we can scratch options #1 and #2 off the list.

> > > > 3) join lextents and blobs (Allen's proposal) and dynamically bin
> > > > based on the encoded size.
> > > >
> > > > 4) #3, but let shard_size=0 in onode (or whatever) put it inline
> > > > with onode, so that simple objects avoid any additional kv op.
> > >
> > > Yes, I'm still of the mind that this is the best approach. I'm not
> > > sure it's really that hard because most of the "special" cases can be
> > > dealt with in a common brute-force way (because they don't happen too
> > often).
> > 
> > Yep.  I think my next task is to code this one up.  Before I start, though, I want
> > to make sure I'm doing the most reasonable thing.  Please
> > review:
> > 
> > Right now the blobwise branch has (unshared and) shared blobs in their own
> > key.  Shared blobs only update when they are occluded and their ref_map
> > changes.  If it weren't for that, we could go back to cramming them together
> > in a single bnode key, but I'm I think we'll do better with them as separate
> > blob keys.  This helps small reads (we don't load all blobs), hurts large reads
> > (we read them all from separate keys instead of all at once), and of course
> > helps writes because we only update a single small blob record.
> > 
> > The onode will get a extent_map_shard_size, which is measured in bytes
> > and tells us how the map is chunked into keys.  If it's 0, the whole map will
> > stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
> > first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
> > second, etc.
> 
> I've been thinking about this. I definitely like the '0' case where the lextent and potentially the local blob are inline with the oNode. That'll certainly accelerate lots of Object and CephFS stuff. All options should implement this.
> 
> I've been thinking about the lextent sharding/binning/paging problem. Mentally, I see two different ways to do it:
> 
> (1) offset-based fixed binning -- I think this is what you described, i.e., the lextent table is recast into fixed offset ranges.
> (2) Dynamic binning. In this case, the ranges of lextent that are stored as a set of ranges (interval-set) each entry in the interval set represents a KV key and contains some arbitrary number of lextent entries within (boundaries are a dynamic decision)
> 
> The downside of (1) is that you're going to have logical extents that cross the fixed bins (and sometimes MULTIPLE fixed bins). Handling this is just plain ugly in the code -- all sorts of special cases crop up for things like local-blobs that used to be unshared, now are sorta-shared (between the two 'split' lextent entries). Then there are the cases where single lextents cross 2, 3, 10 shards.... UGLY UGLY UGLY. 
> 
> A secondary problem (quite possibly a non-problem) is that the sizes of the chunks of encoded lextent don't necessarily divide up particularly evenly. Hot spots in objects will cause this problem.
> 
> Method (2) sounds more complicated, but I actually think it's dramatically LESS complicated to implement -- because we're going to brute force our way around a lot of problem. First off, none of the ugly issues of splitting an lextent crop up -- by definition -- since we're chunking the map on whole lextent boundaries. I think from an implementation perspective you can break this down into some relatively simple logic.
> 
> Basically from an implementation perspective, at the start of the operation (read or write), you compare the incoming offset with the interval-set in the oNode and page in the portions of the lextent table that are required (required=> for uncompressed is just the range in the incoming operation, for compressed I think you need to have the lextent before and after the incoming operation range to make the "depth limiting" code work correctly).
> For a read, you're always done at this point in time.
> For a write, you'll have the problem of updating the lextent table with whatever changes have been made. There are basically two sub-cases for the update: 
> (a) updated lextent chunk (or chunks!) are sufficiently similar that you want to maintain the current "chunking" values (more on this below). This is the easy case, just reserialize the chunk (or chunks) of the lextent that are changed (again, because of over-fetching in the compression case the changed range might not == the fetched range) 
> (b) You decide that re-chunking is required.  There are lots of cases of splits and joins of chunks -- more ugly code. However because the worst-case size of the lextent isn't really THAT large (and the frequency of re-chunking should be relatively low).  My recommendation is that we ONLY implement the complete re-chunking case. In other words, if you look at the chunking and you don't like it -- rather than spending time figuring out a bunch of splits and joins (sorta reminds me of FIleStore directory splits) just read in the remainder of the lextent into memory and re-chunk the entire darn thing. As long as the frequency of this is low it should be cheap enough (a max-size lextent table isn't THAT big). Keeping the frequency of chunking low should be doable by a policy with some hysteresis, i.e., something like if a chunk is > 'N' bytes do a splitting-rechunk, but only if it's smaller than N/2 bytes do a joining-rechunk.... Something like that... 
> 
> Nothing in the odisk structures prevents future code from implementing a more sophisticated split/join chunking scheme. In fact, I recommend that the oNode interval_set of the lextent chunks be decorated with the serialized sizes of each chunk so as to make that future algorithm much easier to implement (since it'll know the sizes of all of the serialized chunks). In other words, when it comes time to update the oNode/lextent, do the lextent chunks FIRST and use the generated sizes of the serialized lextents to put into the chunking table in the oNode. Have the sizes of all of the chunks available (without having to consult the KV store) will make the policy decision about when to re-chunk fairly easy to do and allow the future optimizer to make more intelligent decisions about balancing the chunk-boundaries -- the extra data is pretty small and likely worth it (IMO).
> 
> I strongly believe that (2) is the way to go.

Yep, agreed!

> > In memory, we can pull this out of the onode_t struct and manage it in
> > Onode where we can keep track of which parts of loaded, dirty, etc.
> > 
> > For each map chunk, we can do the inline blob trick.  Unfortunately we can
> > only do this for 'unshared' blobs where shared is now shared across
> > extent_map shards and not across objects.  We can conservatively estimate
> > this by just looking at the blob size w/o looking at other lextents, at least.
> 
> Some thoughts here. In the PMR and Flash world, isn't it the case that blobs are essentially immutable once they are global?
> 
> (In the SMR world, I know they'll be modified in order to "relocate" a chunk -- possibly invalidating the immutable assumption).
> 
> Why NOT pull the blob up into the lextent -- as a copy? If it's immutable, what's the harm? Yes, the stored metadata now expands by the reference count of a cloned blob, but if you can eliminate an extra KV fetch, I'm of the opinion that this seems like a desireable time/space tradeoff.... Just a thought :)
> 
> BTW, here's potentially a really important thing that I just realized.... For a global blob, you want to make the blob_id something like #HashID.#LBA, that way when you're trying to do the SMR backpointer thing, you can minimize the number of bNodes you search for to find this LBA address (something like that :)) -- naturally this fails for blobs with multiple pextents -- But we can outlaw this case in the SMR world (space is basically ALWAYS sequential in the SMR world :)).

Hrm.  This is frustrating.  Being able to do *blob* backpointers and only 
update blobs for SMR compaction is really appealing, but that means a flag 
that prevents blob inline-copying on SMR.  This makes me nervous because 
there will be yet another permutation in the mix: fully inline blob, 
external with inline copy (use external one if updating ref_map), and 
fully external.

And if we do the inline copy, there isn't actually any purpose to most of 
the external copy except the ref_map.

OTOH, with current workloads at least we expect that cloned/shared blobs 
will be pretty rare, so perhaps we should ignore this and keep object hash 
(not blob) backpointers for SMR.

In that case, we should focus instead on sharing the ref_map *only* and 
always inline the forward pointers for the blob.  This is closer to what 
we were originally doing with the enode.  In fact, we could go back to the 
enode approach where it's just a big extent_ref_map and only used to defer 
deallocations until all refs are retired.  The blob is then more ephemeral 
(local to the onode, immutable copy if cloned), and we can more easily 
rejigger how we store it.

We'd still have a "ref map" type structure for the blob, but it would only 
be used for counting the lextents that reference it, and we can 
dynamically build it when we load the extent map.  If we impose the 
restriction that, whatever map sharding approach we take, we never share 
a blob across a shard, then the blobs are always local and "ephemeral" 
in the sense we've been talking about.  The only downside here, I think, 
is that the write path needs to be smart enough to not create any new blob 
that spans whatever the current map sharding is (or, alternatively, 
trigger a resharding if it does so).
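
To illustrate what "dynamically build it when we load the extent map" could look like, here's a tiny sketch that rebuilds per-blob reference counts from a decoded shard instead of persisting a ref_map for local blobs (types and names are illustrative only, not the real structures):

// Rebuild per-blob reference counts from a freshly decoded extent-map shard.
#include <cstdint>
#include <map>
#include <vector>

struct lextent_t {
  uint64_t logical_offset;
  uint32_t blob_offset;
  uint32_t length;
  int64_t  blob_id;       // local/ephemeral id within this shard
};

std::map<int64_t, uint32_t> build_blob_refs(const std::vector<lextent_t>& shard) {
  std::map<int64_t, uint32_t> refs;       // blob_id -> referenced bytes
  for (const auto& le : shard)
    refs[le.blob_id] += le.length;
  return refs;  // a blob whose count drops to zero can release its pextents
}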


Anyway, the main practical item that has me grumbling about this 
is that having extent_map in the onode_t means we have lots of helpers 
and associated unit tests in test_bluestore_types.cc, and making the 
extent map into a more dynamic structure (with pointers instead of ids 
etc) pulls it up a level above _types.{cc,h} and makes unit tests harder 
to write.  OTOH, the more complex structure probably needs its own tests 
to ensure we do the paging properly anyway, so here goes...

sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: bluestore blobs REVISITED
  2016-08-24 18:15                       ` Sage Weil
@ 2016-08-24 18:50                         ` Allen Samuels
  2016-08-24 20:47                         ` Mark Nelson
  2016-08-24 21:10                         ` Allen Samuels
  2 siblings, 0 replies; 29+ messages in thread
From: Allen Samuels @ 2016-08-24 18:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

A few quick thoughts on this lousy keyboard. 

Wrt SMR, a flag that treats all blobs as global is easy and uncomplicated. The performance loss won't be significant if your metadata is on flash. If it's all HDD you don't care about performance at all. 

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 2:16 PM, Sage Weil <sweil@redhat.com> wrote:
> 
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>> Sent: Tuesday, August 23, 2016 12:03 PM
>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: RE: bluestore blobs REVISITED
>>>>> 
>>>>> I just got the onode, including the full lextent map, down to ~1500 bytes.
>>>>> The lextent map encoding optimizations went something like this:
>>>>> 
>>>>> - 1 blobid bit to indicate that this lextent starts where the last one ended.
>>>>> (6500->5500)
>>>>> 
>>>>> - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
>>>>> length is same as previous lextent.  (5500->3500)
>>>>> 
>>>>> - make blobid signed (1 bit) and delta encode relative to previous blob.
>>>>> (3500->1500).  In practice we'd get something between 1500 and 3500
>>>>> because blobids won't have as much temporal locality as my test
>>> workload.
>>>>> OTOH, this is really important because blobids will also get big
>>>>> over time (we don't (yet?) have a way to reuse blobids and/or keep
>>>>> them unique to a hash key, so they grow as the osd ages).
>>>> 
>>>> This seems fishy to me, my mental model for the blob_id suggests that
>>>> it must be at least 9 bits (for a random write workload) in size (1K
>>>> entries randomly written lead to an average distance of 512, which
>>>> means
>>>> 10 bits to encode -- plus the other optimizations bits). Meaning that
>>>> you're going to have two bytes for each lextent, meaning at least 2048
>>>> bytes of lextent plus the remainder of the oNode. So, probably more
>>>> like
>>>> 2500 bytes.
>>>> 
>>>> Am I missing something?
>>> 
>>> Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I this is still
>>> enough to get us under 4K of metadata if we make pg_stat_t encoding
>>> space-efficient (and possibly even without out).
> 
> Repeating this test with random IO gives onodes that are more like 6-7k, I 
> think just because there are too many significant bits in the blob id.
> 
> So, I think we can scratch options #1 and #2 off the list.
> 
>>>>> 3) join lextents and blobs (Allen's proposal) and dynamically bin
>>>>> based on the encoded size.
>>>>> 
>>>>> 4) #3, but let shard_size=0 in onode (or whatever) put it inline
>>>>> with onode, so that simple objects avoid any additional kv op.
>>>> 
>>>> Yes, I'm still of the mind that this is the best approach. I'm not
>>>> sure it's really that hard because most of the "special" cases can be
>>>> dealt with in a common brute-force way (because they don't happen too
>>> often).
>>> 
>>> Yep.  I think my next task is to code this one up.  Before I start, though, I want
>>> to make sure I'm doing the most reasonable thing.  Please
>>> review:
>>> 
>>> Right now the blobwise branch has (unshared and) shared blobs in their own
>>> key.  Shared blobs only update when they are occluded and their ref_map
>>> changes.  If it weren't for that, we could go back to cramming them together
>>> in a single bnode key, but I'm I think we'll do better with them as separate
>>> blob keys.  This helps small reads (we don't load all blobs), hurts large reads
>>> (we read them all from separate keys instead of all at once), and of course
>>> helps writes because we only update a single small blob record.
>>> 
>>> The onode will get a extent_map_shard_size, which is measured in bytes
>>> and tells us how the map is chunked into keys.  If it's 0, the whole map will
>>> stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
>>> first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
>>> second, etc.
>> 
>> I've been thinking about this. I definitely like the '0' case where the lextent and potentially the local blob are inline with the oNode. That'll certainly accelerate lots of Object and CephFS stuff. All options should implement this.
>> 
>> I've been thinking about the lextent sharding/binning/paging problem. Mentally, I see two different ways to do it:
>> 
>> (1) offset-based fixed binning -- I think this is what you described, i.e., the lextent table is recast into fixed offset ranges.
>> (2) Dynamic binning. In this case, the ranges of lextent that are stored as a set of ranges (interval-set) each entry in the interval set represents a KV key and contains some arbitrary number of lextent entries within (boundaries are a dynamic decision)
>> 
>> The downside of (1) is that you're going to have logical extents that cross the fixed bins (and sometimes MULTIPLE fixed bins). Handling this is just plain ugly in the code -- all sorts of special cases crop up for things like local-blobs that used to be unshared, now are sorta-shared (between the two 'split' lextent entries). Then there are the cases where single lextents cross 2, 3, 10 shards.... UGLY UGLY UGLY. 
>> 
>> A secondary problem (quite possibly a non-problem) is that the sizes of the chunks of encoded lextent don't necessarily divide up particularly evenly. Hot spots in objects will cause this problem.
>> 
>> Method (2) sounds more complicated, but I actually think it's dramatically LESS complicated to implement -- because we're going to brute force our way around a lot of problem. First off, none of the ugly issues of splitting an lextent crop up -- by definition -- since we're chunking the map on whole lextent boundaries. I think from an implementation perspective you can break this down into some relatively simple logic.
>> 
>> Basically from an implementation perspective, at the start of the operation (read or write), you compare the incoming offset with the interval-set in the oNode and page in the portions of the lextent table that are required (required=> for uncompressed is just the range in the incoming operation, for compressed I think you need to have the lextent before and after the incoming operation range to make the "depth limiting" code work correctly).
>> For a read, you're always done at this point in time.
>> For a write, you'll have the problem of updating the lextent table with whatever changes have been made. There are basically two sub-cases for the update: 
>> (a) updated lextent chunk (or chunks!) are sufficiently similar that you want to maintain the current "chunking" values (more on this below). This is the easy case, just reserialize the chunk (or chunks) of the lextent that are changed (again, because of over-fetching in the compression case the changed range might not == the fetched range) 
>> (b) You decide that re-chunking is required.  There are lots of cases of splits and joins of chunks -- more ugly code. However because the worst-case size of the lextent isn't really THAT large (and the frequency of re-chunking should be relatively low).  My recommendation is that we ONLY implement the complete re-chunking case. In other words, if you look at the chunking and you don't like it -- rather than spending time figuring out a bunch of splits and joins (sorta reminds me of FIleStore directory splits) just read in the remainder of the lextent into memory and re-chunk the entire darn thing. As long as the frequency of this is low it should be cheap enough (a max-size lextent table isn't THAT big). Keeping the frequency of chunking low should be doable by a policy with some hysteresis, i.e., something like if a chunk is > 'N' bytes do a splitting-rechunk, but only if it's smaller than N/2 bytes do a joining-rechunk.... Something like that... 
>> 
>> Nothing in the odisk structures prevents future code from implementing a more sophisticated split/join chunking scheme. In fact, I recommend that the oNode interval_set of the lextent chunks be decorated with the serialized sizes of each chunk so as to make that future algorithm much easier to implement (since it'll know the sizes of all of the serialized chunks). In other words, when it comes time to update the oNode/lextent, do the lextent chunks FIRST and use the generated sizes of the serialized lextents to put into the chunking table in the oNode. Have the sizes of all of the chunks available (without having to consult the KV store) will make the policy decision about when to re-chunk fairly easy to do and allow the future optimizer to make more intelligent decisions about balancing the chunk-boundaries -- the extra data is pretty small and likely worth it (IMO).
>> 
>> I strongly believe that (2) is the way to go.
> 
> Yep, agreed!
> 
>>> In memory, we can pull this out of the onode_t struct and manage it in
>>> Onode where we can keep track of which parts of loaded, dirty, etc.
>>> 
>>> For each map chunk, we can do the inline blob trick.  Unfortunately we can
>>> only do this for 'unshared' blobs where shared is now shared across
>>> extent_map shards and not across objects.  We can conservatively estimate
>>> this by just looking at the blob size w/o looking at other lextents, at least.
>> 
>> Some thoughts here. In the PMR and Flash world, isn't it the case that blobs are essentially immutable once they are global?
>> 
>> (In the SMR world, I know they'll be modified in order to "relocate" a chunk -- possibly invalidating the immutable assumption).
>> 
>> Why NOT pull the blob up into the lextent -- as a copy? If it's immutable, what's the harm? Yes, the stored metadata now expands by the reference count of a cloned blob, but if you can eliminate an extra KV fetch, I'm of the opinion that this seems like a desireable time/space tradeoff.... Just a thought :)
>> 
>> BTW, here's potentially a really important thing that I just realized.... For a global blob, you want to make the blob_id something like #HashID.#LBA, that way when you're trying to do the SMR backpointer thing, you can minimize the number of bNodes you search for to find this LBA address (something like that :)) -- naturally this fails for blobs with multiple pextents -- But we can outlaw this case in the SMR world (space is basically ALWAYS sequential in the SMR world :)).
> 
> Hrm.  This is frustrating.  Being able to do *blob* backpointers and only 
> update blobs for SMR compaction is really appealing, but that means a flag 
> that prevents blob inline-copying on SMR.  This makes me nervous because 
> there will be yet another permutation in the mix: fully inline blob, 
> external with inline copy (use external one if updating ref_map), and 
> fully external.
> 
> And if we do the inline copy, there isn't actually any purpose to most of 
> the external copy except the ref_map.
> 
> OTOH, with currently workloads at least we expect that cloned/shared blobs 
> will be pretty rare, so perhaps we should ignore this and keep object hash 
> (not blob) backpointers for SMR.
> 
> In that case, we should focus instead on sharing the ref_map *only* and 
> always inline the forward pointers for the blob.  This is closer to what 
> we were originally doing with the enode.  In fact, we could go back to the 
> enode approach were it's just a big extent_ref_map and only used to defer 
> deallocations until all refs are retired.  The blob is then more ephemeral 
> (local to the onode, immutable copy if cloned), and we can more easily 
> rejigger how we store it.
> 
> We'd still have a "ref map" type structure for the blob, but it would only 
> be used for counting the lextents that reference it, and we can 
> dynamically build it when we load the extent map.  If we impose the 
> restriction that whatever the map sharding approach we take we never share 
> a blob across a shard, we the blobs are always local and "ephemeral" 
> in the sense we've been talking about.  The only downside here, I think, 
> is that the write path needs to be smart enough to not create any new blob 
> that spans whatever the current map sharding is (or, alternatively, 
> trigger a resharding if it does so).
> 
> 
> Anyway, the main practical item that has me grumbling about this 
> is that having extent_map in the onode_t means we have lots of helpers 
> nad associated unit tests in test_bluestore_types.cc and making the 
> extent map into a more dynamic structure (with pointers instead of ids 
> etc) pulls it up a level above _types.{cc,h} and makes unit tests harder 
> to write.  OTOH, the more complex structure probably needs its own tests 
> to ensure we do the paging properly anyway, so here goes...
> 
> sage

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: bluestore blobs REVISITED
  2016-08-24 18:15                       ` Sage Weil
  2016-08-24 18:50                         ` Allen Samuels
@ 2016-08-24 20:47                         ` Mark Nelson
  2016-08-24 20:59                           ` Allen Samuels
  2016-08-24 21:10                         ` Allen Samuels
  2 siblings, 1 reply; 29+ messages in thread
From: Mark Nelson @ 2016-08-24 20:47 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 08/24/2016 01:15 PM, Sage Weil wrote:
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>> Sent: Tuesday, August 23, 2016 12:03 PM
>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: RE: bluestore blobs REVISITED
>>>>>
>>>>> I just got the onode, including the full lextent map, down to ~1500 bytes.
>>>>> The lextent map encoding optimizations went something like this:
>>>>>
>>>>> - 1 blobid bit to indicate that this lextent starts where the last one ended.
>>>>> (6500->5500)
>>>>>
>>>>> - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
>>>>> length is same as previous lextent.  (5500->3500)
>>>>>
>>>>> - make blobid signed (1 bit) and delta encode relative to previous blob.
>>>>> (3500->1500).  In practice we'd get something between 1500 and 3500
>>>>> because blobids won't have as much temporal locality as my test
>>> workload.
>>>>> OTOH, this is really important because blobids will also get big
>>>>> over time (we don't (yet?) have a way to reuse blobids and/or keep
>>>>> them unique to a hash key, so they grow as the osd ages).
>>>>
>>>> This seems fishy to me, my mental model for the blob_id suggests that
>>>> it must be at least 9 bits (for a random write workload) in size (1K
>>>> entries randomly written lead to an average distance of 512, which
>>>> means
>>>> 10 bits to encode -- plus the other optimizations bits). Meaning that
>>>> you're going to have two bytes for each lextent, meaning at least 2048
>>>> bytes of lextent plus the remainder of the oNode. So, probably more
>>>> like
>>>> 2500 bytes.
>>>>
>>>> Am I missing something?
>>>
>>> Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I this is still
>>> enough to get us under 4K of metadata if we make pg_stat_t encoding
>>> space-efficient (and possibly even without out).
>
> Repeating this test with random IO gives onodes that are more like 6-7k, I
> think just because there are too many significant bits in the blob id.
>
> So, I think we can scratch options #1 and #2 off the list.

hrmf. I was still holding out hope #2 would be good enough.  Oh well.  I 
still remain very concerned about CPU usage, but we'll see where we are 
at once the dust settles.

Mark

>
>>>>> 3) join lextents and blobs (Allen's proposal) and dynamically bin
>>>>> based on the encoded size.
>>>>>
>>>>> 4) #3, but let shard_size=0 in onode (or whatever) put it inline
>>>>> with onode, so that simple objects avoid any additional kv op.
>>>>
>>>> Yes, I'm still of the mind that this is the best approach. I'm not
>>>> sure it's really that hard because most of the "special" cases can be
>>>> dealt with in a common brute-force way (because they don't happen too
>>> often).
>>>
>>> Yep.  I think my next task is to code this one up.  Before I start, though, I want
>>> to make sure I'm doing the most reasonable thing.  Please
>>> review:
>>>
>>> Right now the blobwise branch has (unshared and) shared blobs in their own
>>> key.  Shared blobs only update when they are occluded and their ref_map
>>> changes.  If it weren't for that, we could go back to cramming them together
>>> in a single bnode key, but I'm I think we'll do better with them as separate
>>> blob keys.  This helps small reads (we don't load all blobs), hurts large reads
>>> (we read them all from separate keys instead of all at once), and of course
>>> helps writes because we only update a single small blob record.
>>>
>>> The onode will get a extent_map_shard_size, which is measured in bytes
>>> and tells us how the map is chunked into keys.  If it's 0, the whole map will
>>> stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
>>> first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
>>> second, etc.
>>
>> I've been thinking about this. I definitely like the '0' case where the lextent and potentially the local blob are inline with the oNode. That'll certainly accelerate lots of Object and CephFS stuff. All options should implement this.
>>
>> I've been thinking about the lextent sharding/binning/paging problem. Mentally, I see two different ways to do it:
>>
>> (1) offset-based fixed binning -- I think this is what you described, i.e., the lextent table is recast into fixed offset ranges.
>> (2) Dynamic binning. In this case, the ranges of lextent that are stored as a set of ranges (interval-set) each entry in the interval set represents a KV key and contains some arbitrary number of lextent entries within (boundaries are a dynamic decision)
>>
>> The downside of (1) is that you're going to have logical extents that cross the fixed bins (and sometimes MULTIPLE fixed bins). Handling this is just plain ugly in the code -- all sorts of special cases crop up for things like local-blobs that used to be unshared, now are sorta-shared (between the two 'split' lextent entries). Then there are the cases where single lextents cross 2, 3, 10 shards.... UGLY UGLY UGLY.
>>
>> A secondary problem (quite possibly a non-problem) is that the sizes of the chunks of encoded lextent don't necessarily divide up particularly evenly. Hot spots in objects will cause this problem.
>>
>> Method (2) sounds more complicated, but I actually think it's dramatically LESS complicated to implement -- because we're going to brute force our way around a lot of problem. First off, none of the ugly issues of splitting an lextent crop up -- by definition -- since we're chunking the map on whole lextent boundaries. I think from an implementation perspective you can break this down into some relatively simple logic.
>>
>> Basically from an implementation perspective, at the start of the operation (read or write), you compare the incoming offset with the interval-set in the oNode and page in the portions of the lextent table that are required (required=> for uncompressed is just the range in the incoming operation, for compressed I think you need to have the lextent before and after the incoming operation range to make the "depth limiting" code work correctly).
>> For a read, you're always done at this point in time.
>> For a write, you'll have the problem of updating the lextent table with whatever changes have been made. There are basically two sub-cases for the update:
>> (a) updated lextent chunk (or chunks!) are sufficiently similar that you want to maintain the current "chunking" values (more on this below). This is the easy case, just reserialize the chunk (or chunks) of the lextent that are changed (again, because of over-fetching in the compression case the changed range might not == the fetched range)
>> (b) You decide that re-chunking is required.  There are lots of cases of splits and joins of chunks -- more ugly code. However because the worst-case size of the lextent isn't really THAT large (and the frequency of re-chunking should be relatively low).  My recommendation is that we ONLY implement the complete re-chunking case. In other words, if you look at the chunking and you don't like it -- rather than spending time figuring out a bunch of splits and joins (sorta reminds me of FIleStore directory splits) just read in the remainder of the lextent into memory and re-chunk the entire darn thing. As long as the frequency of this is low it should be cheap enough (a max-size lextent table isn't THAT big). Keeping the frequency of chunking low should be doable by a policy with some hysteresis, i.e., something like if a chunk is > 'N' bytes do a splitting-rechunk, but only if it's smaller than N/2 bytes do a joining-rechunk.... Something like that...
>>
>> Nothing in the odisk structures prevents future code from implementing a more sophisticated split/join chunking scheme. In fact, I recommend that the oNode interval_set of the lextent chunks be decorated with the serialized sizes of each chunk so as to make that future algorithm much easier to implement (since it'll know the sizes of all of the serialized chunks). In other words, when it comes time to update the oNode/lextent, do the lextent chunks FIRST and use the generated sizes of the serialized lextents to put into the chunking table in the oNode. Have the sizes of all of the chunks available (without having to consult the KV store) will make the policy decision about when to re-chunk fairly easy to do and allow the future optimizer to make more intelligent decisions about balancing the chunk-boundaries -- the extra data is pretty small and likely worth it (IMO).
>>
>> I strongly believe that (2) is the way to go.
>
> Yep, agreed!
>
>>> In memory, we can pull this out of the onode_t struct and manage it in
>>> Onode where we can keep track of which parts of loaded, dirty, etc.
>>>
>>> For each map chunk, we can do the inline blob trick.  Unfortunately we can
>>> only do this for 'unshared' blobs where shared is now shared across
>>> extent_map shards and not across objects.  We can conservatively estimate
>>> this by just looking at the blob size w/o looking at other lextents, at least.
>>
>> Some thoughts here. In the PMR and Flash world, isn't it the case that blobs are essentially immutable once they are global?
>>
>> (In the SMR world, I know they'll be modified in order to "relocate" a chunk -- possibly invalidating the immutable assumption).
>>
>> Why NOT pull the blob up into the lextent -- as a copy? If it's immutable, what's the harm? Yes, the stored metadata now expands by the reference count of a cloned blob, but if you can eliminate an extra KV fetch, I'm of the opinion that this seems like a desireable time/space tradeoff.... Just a thought :)
>>
>> BTW, here's potentially a really important thing that I just realized.... For a global blob, you want to make the blob_id something like #HashID.#LBA, that way when you're trying to do the SMR backpointer thing, you can minimize the number of bNodes you search for to find this LBA address (something like that :)) -- naturally this fails for blobs with multiple pextents -- But we can outlaw this case in the SMR world (space is basically ALWAYS sequential in the SMR world :)).
>
> Hrm.  This is frustrating.  Being able to do *blob* backpointers and only
> update blobs for SMR compaction is really appealing, but that means a flag
> that prevents blob inline-copying on SMR.  This makes me nervous because
> there will be yet another permutation in the mix: fully inline blob,
> external with inline copy (use external one if updating ref_map), and
> fully external.
>
> And if we do the inline copy, there isn't actually any purpose to most of
> the external copy except the ref_map.
>
> OTOH, with currently workloads at least we expect that cloned/shared blobs
> will be pretty rare, so perhaps we should ignore this and keep object hash
> (not blob) backpointers for SMR.
>
> In that case, we should focus instead on sharing the ref_map *only* and
> always inline the forward pointers for the blob.  This is closer to what
> we were originally doing with the enode.  In fact, we could go back to the
> enode approach were it's just a big extent_ref_map and only used to defer
> deallocations until all refs are retired.  The blob is then more ephemeral
> (local to the onode, immutable copy if cloned), and we can more easily
> rejigger how we store it.
>
> We'd still have a "ref map" type structure for the blob, but it would only
> be used for counting the lextents that reference it, and we can
> dynamically build it when we load the extent map.  If we impose the
> restriction that whatever the map sharding approach we take we never share
> a blob across a shard, we the blobs are always local and "ephemeral"
> in the sense we've been talking about.  The only downside here, I think,
> is that the write path needs to be smart enough to not create any new blob
> that spans whatever the current map sharding is (or, alternatively,
> trigger a resharding if it does so).
>
>
> Anyway, the main practical item that has me grumbling about this
> is that having extent_map in the onode_t means we have lots of helpers
> nad associated unit tests in test_bluestore_types.cc and making the
> extent map into a more dynamic structure (with pointers instead of ids
> etc) pulls it up a level above _types.{cc,h} and makes unit tests harder
> to write.  OTOH, the more complex structure probably needs its own tests
> to ensure we do the paging properly anyway, so here goes...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: bluestore blobs REVISITED
  2016-08-24 20:47                         ` Mark Nelson
@ 2016-08-24 20:59                           ` Allen Samuels
  0 siblings, 0 replies; 29+ messages in thread
From: Allen Samuels @ 2016-08-24 20:59 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-devel

Yea I totally botched the math on the random write estimation. It's a probabilistic exponential. Doesn't matter. 

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 4:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
> 
> 
> 
>> On 08/24/2016 01:15 PM, Sage Weil wrote:
>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>> Sent: Tuesday, August 23, 2016 12:03 PM
>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>> Subject: RE: bluestore blobs REVISITED
>>>>>> 
>>>>>> I just got the onode, including the full lextent map, down to ~1500 bytes.
>>>>>> The lextent map encoding optimizations went something like this:
>>>>>> 
>>>>>> - 1 blobid bit to indicate that this lextent starts where the last one ended.
>>>>>> (6500->5500)
>>>>>> 
>>>>>> - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
>>>>>> length is same as previous lextent.  (5500->3500)
>>>>>> 
>>>>>> - make blobid signed (1 bit) and delta encode relative to previous blob.
>>>>>> (3500->1500).  In practice we'd get something between 1500 and 3500
>>>>>> because blobids won't have as much temporal locality as my test
>>>> workload.
>>>>>> OTOH, this is really important because blobids will also get big
>>>>>> over time (we don't (yet?) have a way to reuse blobids and/or keep
>>>>>> them unique to a hash key, so they grow as the osd ages).
>>>>> 
>>>>> This seems fishy to me, my mental model for the blob_id suggests that
>>>>> it must be at least 9 bits (for a random write workload) in size (1K
>>>>> entries randomly written lead to an average distance of 512, which
>>>>> means
>>>>> 10 bits to encode -- plus the other optimizations bits). Meaning that
>>>>> you're going to have two bytes for each lextent, meaning at least 2048
>>>>> bytes of lextent plus the remainder of the oNode. So, probably more
>>>>> like
>>>>> 2500 bytes.
>>>>> 
>>>>> Am I missing something?
>>>> 
>>>> Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I this is still
>>>> enough to get us under 4K of metadata if we make pg_stat_t encoding
>>>> space-efficient (and possibly even without out).
>> 
>> Repeating this test with random IO gives onodes that are more like 6-7k, I
>> think just because there are too many significant bits in the blob id.
>> 
>> So, I think we can scratch options #1 and #2 off the list.
> 
> hrmf. I was still holding out hope #2 would be good enough.  Oh well.  I still remain very concerned about CPU usage, but we'll see where we are at once the dust settles.
> 
> Mark
> 
>> 
>>>>>> 3) join lextents and blobs (Allen's proposal) and dynamically bin
>>>>>> based on the encoded size.
>>>>>> 
>>>>>> 4) #3, but let shard_size=0 in onode (or whatever) put it inline
>>>>>> with onode, so that simple objects avoid any additional kv op.
>>>>> 
>>>>> Yes, I'm still of the mind that this is the best approach. I'm not
>>>>> sure it's really that hard because most of the "special" cases can be
>>>>> dealt with in a common brute-force way (because they don't happen too
>>>> often).
>>>> 
>>>> Yep.  I think my next task is to code this one up.  Before I start, though, I want
>>>> to make sure I'm doing the most reasonable thing.  Please
>>>> review:
>>>> 
>>>> Right now the blobwise branch has (unshared and) shared blobs in their own
>>>> key.  Shared blobs only update when they are occluded and their ref_map
>>>> changes.  If it weren't for that, we could go back to cramming them together
>>>> in a single bnode key, but I'm I think we'll do better with them as separate
>>>> blob keys.  This helps small reads (we don't load all blobs), hurts large reads
>>>> (we read them all from separate keys instead of all at once), and of course
>>>> helps writes because we only update a single small blob record.
>>>> 
>>>> The onode will get a extent_map_shard_size, which is measured in bytes
>>>> and tells us how the map is chunked into keys.  If it's 0, the whole map will
>>>> stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
>>>> first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
>>>> second, etc.
>>> 
>>> I've been thinking about this. I definitely like the '0' case where the lextent and potentially the local blob are inline with the oNode. That'll certainly accelerate lots of Object and CephFS stuff. All options should implement this.
>>> 
>>> I've been thinking about the lextent sharding/binning/paging problem. Mentally, I see two different ways to do it:
>>> 
>>> (1) offset-based fixed binning -- I think this is what you described, i.e., the lextent table is recast into fixed offset ranges.
>>> (2) Dynamic binning. In this case, the ranges of lextent that are stored as a set of ranges (interval-set) each entry in the interval set represents a KV key and contains some arbitrary number of lextent entries within (boundaries are a dynamic decision)
>>> 
>>> The downside of (1) is that you're going to have logical extents that cross the fixed bins (and sometimes MULTIPLE fixed bins). Handling this is just plain ugly in the code -- all sorts of special cases crop up for things like local-blobs that used to be unshared, now are sorta-shared (between the two 'split' lextent entries). Then there are the cases where single lextents cross 2, 3, 10 shards.... UGLY UGLY UGLY.
>>> 
>>> A secondary problem (quite possibly a non-problem) is that the sizes of the chunks of encoded lextent don't necessarily divide up particularly evenly. Hot spots in objects will cause this problem.
>>> 
>>> Method (2) sounds more complicated, but I actually think it's dramatically LESS complicated to implement -- because we're going to brute force our way around a lot of problem. First off, none of the ugly issues of splitting an lextent crop up -- by definition -- since we're chunking the map on whole lextent boundaries. I think from an implementation perspective you can break this down into some relatively simple logic.
>>> 
>>> Basically from an implementation perspective, at the start of the operation (read or write), you compare the incoming offset with the interval-set in the oNode and page in the portions of the lextent table that are required (required=> for uncompressed is just the range in the incoming operation, for compressed I think you need to have the lextent before and after the incoming operation range to make the "depth limiting" code work correctly).
>>> For a read, you're always done at this point in time.
>>> For a write, you'll have the problem of updating the lextent table with whatever changes have been made. There are basically two sub-cases for the update:
>>> (a) updated lextent chunk (or chunks!) are sufficiently similar that you want to maintain the current "chunking" values (more on this below). This is the easy case, just reserialize the chunk (or chunks) of the lextent that are changed (again, because of over-fetching in the compression case the changed range might not == the fetched range)
>>> (b) You decide that re-chunking is required.  There are lots of cases of splits and joins of chunks -- more ugly code. However because the worst-case size of the lextent isn't really THAT large (and the frequency of re-chunking should be relatively low).  My recommendation is that we ONLY implement the complete re-chunking case. In other words, if you look at the chunking and you don't like it -- rather than spending time figuring out a bunch of splits and joins (sorta reminds me of FIleStore directory splits) just read in the remainder of the lextent into memory and re-chunk the entire darn thing. As long as the frequency of this is low it should be cheap enough (a max-size lextent table isn't THAT big). Keeping the frequency of chunking low should be doable by a policy with some hysteresis, i.e., something like if a chunk is > 'N' bytes do a splitting-rechunk, but only if it's smaller than N/2 bytes do a joining-rechunk.... Something like that...
>>> 
>>> Nothing in the odisk structures prevents future code from implementing a more sophisticated split/join chunking scheme. In fact, I recommend that the oNode interval_set of the lextent chunks be decorated with the serialized sizes of each chunk so as to make that future algorithm much easier to implement (since it'll know the sizes of all of the serialized chunks). In other words, when it comes time to update the oNode/lextent, do the lextent chunks FIRST and use the generated sizes of the serialized lextents to put into the chunking table in the oNode. Have the sizes of all of the chunks available (without having to consult the KV store) will make the policy decision about when to re-chunk fairly easy to do and allow the future optimizer to make more intelligent decisions about balancing the chunk-boundaries -- the extra data is pretty small and likely worth it (IMO).
>>> 
>>> I strongly believe that (2) is the way to go.
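
A minimal sketch of the dynamic binning described above (the type and member names are invented for illustration, not actual BlueStore code): the onode-side shard index is decorated with encoded sizes, a range-to-shards lookup drives paging, and a simple hysteresis test drives re-chunking.

#include <cstdint>
#include <map>
#include <vector>

struct ExtentMapShardInfo {
  uint32_t length = 0;         // logical bytes covered by this shard
  uint32_t encoded_bytes = 0;  // serialized size of the shard's lextent chunk
};

struct OnodeShardIndex {
  // key = logical offset where the shard starts; chunk boundaries always
  // fall on whole-lextent boundaries, so no lextent is ever split
  std::map<uint64_t, ExtentMapShardInfo> shards;

  // which shard keys must be paged in to serve [offset, offset + length)?
  // (for compressed data the caller widens the range to pull in the
  // neighbouring lextents, as described above)
  std::vector<uint64_t> shards_for_range(uint64_t offset, uint64_t length) const {
    std::vector<uint64_t> keys;
    auto p = shards.upper_bound(offset);
    if (p != shards.begin())
      --p;  // step back to the shard that contains 'offset'
    for (; p != shards.end() && p->first < offset + length; ++p)
      keys.push_back(p->first);
    return keys;
  }

  // hysteresis: split-rechunk only when a chunk grows past 'target' bytes,
  // join-rechunk only when it shrinks below half of that
  bool needs_rechunk(const ExtentMapShardInfo& s, uint32_t target) const {
    return s.encoded_bytes > target || s.encoded_bytes < target / 2;
  }
};

Keeping the index keyed by starting logical offset means shards_for_range() is a couple of map operations even for large objects, and the decorated encoded_bytes lets the re-chunk policy run without touching the KV store.
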
>> 
>> Yep, agreed!
>> 
>>>> In memory, we can pull this out of the onode_t struct and manage it in
>>>> Onode where we can keep track of which parts are loaded, dirty, etc.
>>>> 
>>>> For each map chunk, we can do the inline blob trick.  Unfortunately we can
>>>> only do this for 'unshared' blobs, where 'shared' now means shared across
>>>> extent_map shards rather than across objects.  We can conservatively estimate
>>>> this by just looking at the blob size w/o looking at other lextents, at least.
>>> 
>>> Some thoughts here. In the PMR and Flash world, isn't it the case that blobs are essentially immutable once they are global?
>>> 
>>> (In the SMR world, I know they'll be modified in order to "relocate" a chunk -- possibly invalidating the immutable assumption).
>>> 
>>> Why NOT pull the blob up into the lextent -- as a copy? If it's immutable, what's the harm? Yes, the stored metadata now expands by the reference count of a cloned blob, but if you can eliminate an extra KV fetch, that seems like a desirable time/space tradeoff to me.... Just a thought :)
>>> 
>>> BTW, here's potentially a really important thing that I just realized.... For a global blob, you want to make the blob_id something like #HashID.#LBA; that way, when you're trying to do the SMR backpointer thing, you can minimize the number of bNodes you have to search to find this LBA address (something like that :)) -- naturally this fails for blobs with multiple pextents, but we can outlaw that case in the SMR world (space is basically ALWAYS sequential in the SMR world :)).
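
If a global blob id really were composed as #HashID.#LBA, one way it could be laid out as a KV key is sketched below (the field widths and helper names are guesses for illustration only); big-endian packing keeps all blobs of a hash bucket adjacent and ordered by LBA, which is what the backpointer search wants.

#include <cstdint>
#include <string>

// append integers big-endian so lexicographic key order == numeric order
static void append_be32(std::string& key, uint32_t v) {
  for (int i = 3; i >= 0; --i) key.push_back(char((v >> (8 * i)) & 0xff));
}
static void append_be64(std::string& key, uint64_t v) {
  for (int i = 7; i >= 0; --i) key.push_back(char((v >> (8 * i)) & 0xff));
}

// only valid for single-pextent blobs, per the SMR restriction above
std::string make_global_blob_key(uint32_t oid_hash, uint64_t start_lba) {
  std::string key;
  append_be32(key, oid_hash);   // #HashID -> bNode bucket
  append_be64(key, start_lba);  // #LBA    -> where the blob's data lives
  return key;
}
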
>> 
>> Hrm.  This is frustrating.  Being able to do *blob* backpointers and only
>> update blobs for SMR compaction is really appealing, but that means a flag
>> that prevents blob inline-copying on SMR.  This makes me nervous because
>> there will be yet another permutation in the mix: fully inline blob,
>> external with inline copy (use external one if updating ref_map), and
>> fully external.
>> 
>> And if we do the inline copy, there isn't actually any purpose to most of
>> the external copy except the ref_map.
>> 
>> OTOH, with current workloads at least we expect that cloned/shared blobs
>> will be pretty rare, so perhaps we should ignore this and keep object hash
>> (not blob) backpointers for SMR.
>> 
>> In that case, we should focus instead on sharing the ref_map *only* and
>> always inline the forward pointers for the blob.  This is closer to what
>> we were originally doing with the enode.  In fact, we could go back to the
>> enode approach where it's just a big extent_ref_map and only used to defer
>> deallocations until all refs are retired.  The blob is then more ephemeral
>> (local to the onode, immutable copy if cloned), and we can more easily
>> rejigger how we store it.
>> 
>> We'd still have a "ref map" type structure for the blob, but it would only
>> be used for counting the lextents that reference it, and we can
>> dynamically build it when we load the extent map.  If we impose the
>> restriction that, whatever map sharding approach we take, we never share
>> a blob across a shard, then the blobs are always local and "ephemeral"
>> in the sense we've been talking about.  The only downside here, I think,
>> is that the write path needs to be smart enough to not create any new blob
>> that spans whatever the current map sharding is (or, alternatively,
>> trigger a resharding if it does so).
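
A rough sketch of rebuilding the per-blob ref counts at decode time, assuming the "no blob spans a shard" restriction above (the structure names here are illustrative only, not the real BlueStore types):

#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct BlobSketch {
  // in-blob offset -> number of lextents referencing it; never persisted,
  // only used to know when the blob's space can finally be released
  std::map<uint32_t, unsigned> ref_map;
};

struct LextentSketch {
  uint64_t logical_offset = 0;
  uint32_t blob_offset = 0;
  uint32_t length = 0;
  std::shared_ptr<BlobSketch> blob;  // local ("ephemeral") to this shard
};

// called right after decoding one shard of the extent map
void rebuild_ref_maps(std::vector<LextentSketch>& shard_lextents) {
  for (auto& le : shard_lextents) {
    // every lextent that references this blob lives in the same shard,
    // so the count is complete once the shard is decoded
    le.blob->ref_map[le.blob_offset] += 1;
  }
}
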
>> 
>> 
>> Anyway, the main practical item that has me grumbling about this
>> is that having extent_map in the onode_t means we have lots of helpers
>> and associated unit tests in test_bluestore_types.cc, and making the
>> extent map into a more dynamic structure (with pointers instead of ids
>> etc) pulls it up a level above _types.{cc,h} and makes unit tests harder
>> to write.  OTOH, the more complex structure probably needs its own tests
>> to ensure we do the paging properly anyway, so here goes...
>> 
>> sage


* Re: bluestore blobs REVISITED
  2016-08-24 18:15                       ` Sage Weil
  2016-08-24 18:50                         ` Allen Samuels
  2016-08-24 20:47                         ` Mark Nelson
@ 2016-08-24 21:10                         ` Allen Samuels
  2016-08-24 21:18                           ` Sage Weil
  2 siblings, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-24 21:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 2:16 PM, Sage Weil <sweil@redhat.com> wrote:
> 
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>> Sent: Tuesday, August 23, 2016 12:03 PM
>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: RE: bluestore blobs REVISITED
>>>>> 
>>>>> I just got the onode, including the full lextent map, down to ~1500 bytes.
>>>>> The lextent map encoding optimizations went something like this:
>>>>> 
>>>>> - 1 blobid bit to indicate that this lextent starts where the last one ended.
>>>>> (6500->5500)
>>>>> 
>>>>> - 1 blobid bit to indicate offset is 0; 1 blobid bit to indicate
>>>>> length is same as previous lextent.  (5500->3500)
>>>>> 
>>>>> - make blobid signed (1 bit) and delta encode relative to previous blob.
>>>>> (3500->1500).  In practice we'd get something between 1500 and 3500
>>>>> because blobids won't have as much temporal locality as my test workload.
>>>>> OTOH, this is really important because blobids will also get big
>>>>> over time (we don't (yet?) have a way to reuse blobids and/or keep
>>>>> them unique to a hash key, so they grow as the osd ages).
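
To make the bit-stealing above concrete, one possible shape for such an encoder is sketched below; the flag layout, helper names, and the stand-in varint stream are assumptions for illustration, not the actual encoding.

#include <cstdint>
#include <vector>

enum : uint64_t {
  FLAG_CONTIGUOUS  = 1 << 0,  // logical_offset == end of previous lextent
  FLAG_ZERO_BOFF   = 1 << 1,  // offset into the blob is 0
  FLAG_SAME_LENGTH = 1 << 2,  // length == previous lextent's length
  FLAG_SHIFT       = 3,       // blob id delta occupies the remaining bits
};

inline uint64_t zigzag(int64_t v) {
  // map small +/- deltas to small unsigned values
  return (uint64_t(v) << 1) ^ (v < 0 ? ~uint64_t(0) : uint64_t(0));
}

struct LextentEnc {
  uint64_t logical_offset = 0;
  uint64_t blob_offset = 0;
  uint64_t length = 0;
  int64_t  blob_id = 0;
};

// 'out' stands in for a varint stream; each pushed word would be a varint
void encode_lextent(const LextentEnc& cur, const LextentEnc& prev,
                    std::vector<uint64_t>& out) {
  uint64_t flags = 0;
  if (cur.logical_offset == prev.logical_offset + prev.length)
    flags |= FLAG_CONTIGUOUS;
  if (cur.blob_offset == 0)
    flags |= FLAG_ZERO_BOFF;
  if (cur.length == prev.length)
    flags |= FLAG_SAME_LENGTH;

  out.push_back((zigzag(cur.blob_id - prev.blob_id) << FLAG_SHIFT) | flags);
  if (!(flags & FLAG_CONTIGUOUS))  out.push_back(cur.logical_offset);
  if (!(flags & FLAG_ZERO_BOFF))   out.push_back(cur.blob_offset);
  if (!(flags & FLAG_SAME_LENGTH)) out.push_back(cur.length);
}

In the common sequential case all three flags are set and the blob id delta is tiny, so the whole lextent collapses to a one- or two-byte varint, which is where the 6500 -> 1500 byte reduction comes from.
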
>>>> 
>>>> This seems fishy to me; my mental model for the blob_id suggests that
>>>> it must be at least 9 bits (for a random write workload) in size (1K
>>>> entries randomly written lead to an average distance of 512, which
>>>> means 10 bits to encode -- plus the other optimization bits). Meaning
>>>> that you're going to have two bytes for each lextent, so at least 2048
>>>> bytes of lextent plus the remainder of the oNode -- probably more like
>>>> 2500 bytes.
>>>> 
>>>> Am I missing something?
>>> 
>>> Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I think this is
>>> still enough to get us under 4K of metadata if we make pg_stat_t encoding
>>> space-efficient (and possibly even without that).
> 
> Repeating this test with random IO gives onodes that are more like 6-7k, I 
> think just because there are too many significant bits in the blob id.
> 
> So, I think we can scratch options #1 and #2 off the list.
> 
>>>>> 3) join lextents and blobs (Allen's proposal) and dynamically bin
>>>>> based on the encoded size.
>>>>> 
>>>>> 4) #3, but let shard_size=0 in onode (or whatever) put it inline
>>>>> with onode, so that simple objects avoid any additional kv op.
>>>> 
>>>> Yes, I'm still of the mind that this is the best approach. I'm not
>>>> sure it's really that hard because most of the "special" cases can be
>>>> dealt with in a common brute-force way (because they don't happen too often).
>>> 
>>> Yep.  I think my next task is to code this one up.  Before I start, though, I want
>>> to make sure I'm doing the most reasonable thing.  Please
>>> review:
>>> 
>>> Right now the blobwise branch has (unshared and) shared blobs in their own
>>> key.  Shared blobs only update when they are occluded and their ref_map
>>> changes.  If it weren't for that, we could go back to cramming them together
>>> in a single bnode key, but I think we'll do better with them as separate
>>> blob keys.  This helps small reads (we don't load all blobs), hurts large reads
>>> (we read them all from separate keys instead of all at once), and of course
>>> helps writes because we only update a single small blob record.
>>> 
>>> The onode will get an extent_map_shard_size, which is measured in bytes
>>> and tells us how the map is chunked into keys.  If it's 0, the whole map will
>>> stay inline in the onode.  Otherwise, [0..extent_map_shard_size) is in the
>>> first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
>>> second, etc.
>> 
>> I've been thinking about this. I definitely like the '0' case where the lextent and potentially the local blob are inline with the oNode. That'll certainly accelerate lots of Object and CephFS stuff. All options should implement this.
>> 
>> I've been thinking about the lextent sharding/binning/paging problem. Mentally, I see two different ways to do it:
>> 
>> (1) offset-based fixed binning -- I think this is what you described, i.e., the lextent table is recast into fixed offset ranges.
>> (2) Dynamic binning. In this case, the lextent map is stored as a set of ranges (an interval-set); each entry in the interval set represents a KV key and contains some arbitrary number of lextent entries (the boundaries are a dynamic decision).
>> 
>> The downside of (1) is that you're going to have logical extents that cross the fixed bins (and sometimes MULTIPLE fixed bins). Handling this is just plain ugly in the code -- all sorts of special cases crop up for things like local-blobs that used to be unshared, now are sorta-shared (between the two 'split' lextent entries). Then there are the cases where single lextents cross 2, 3, 10 shards.... UGLY UGLY UGLY. 
>> 
>> A secondary problem (quite possibly a non-problem) is that the sizes of the chunks of encoded lextent don't necessarily divide up particularly evenly. Hot spots in objects will cause this problem.
>> 
>> Method (2) sounds more complicated, but I actually think it's dramatically LESS complicated to implement -- because we're going to brute force our way around a lot of problem. First off, none of the ugly issues of splitting an lextent crop up -- by definition -- since we're chunking the map on whole lextent boundaries. I think from an implementation perspective you can break this down into some relatively simple logic.
>> 
>> Basically from an implementation perspective, at the start of the operation (read or write), you compare the incoming offset with the interval-set in the oNode and page in the portions of the lextent table that are required (required=> for uncompressed is just the range in the incoming operation, for compressed I think you need to have the lextent before and after the incoming operation range to make the "depth limiting" code work correctly).
>> For a read, you're always done at this point in time.
>> For a write, you'll have the problem of updating the lextent table with whatever changes have been made. There are basically two sub-cases for the update: 
>> (a) updated lextent chunk (or chunks!) are sufficiently similar that you want to maintain the current "chunking" values (more on this below). This is the easy case, just reserialize the chunk (or chunks) of the lextent that are changed (again, because of over-fetching in the compression case the changed range might not == the fetched range) 
>> (b) You decide that re-chunking is required.  There are lots of cases of splits and joins of chunks -- more ugly code. However because the worst-case size of the lextent isn't really THAT large (and the frequency of re-chunking should be relatively low).  My recommendation is that we ONLY implement the complete re-chunking case. In other words, if you look at the chunking and you don't like it -- rather than spending time figuring out a bunch of splits and joins (sorta reminds me of FIleStore directory splits) just read in the remainder of the lextent into memory and re-chunk the entire darn thing. As long as the frequency of this is low it should be cheap enough (a max-size lextent table isn't THAT big). Keeping the frequency of chunking low should be doable by a policy with some hysteresis, i.e., something like if a chunk is > 'N' bytes do a splitting-rechunk, but only if it's smaller than N/2 bytes do a joining-rechunk.... Something like that... 
>> 
>> Nothing in the odisk structures prevents future code from implementing a more sophisticated split/join chunking scheme. In fact, I recommend that the oNode interval_set of the lextent chunks be decorated with the serialized sizes of each chunk so as to make that future algorithm much easier to implement (since it'll know the sizes of all of the serialized chunks). In other words, when it comes time to update the oNode/lextent, do the lextent chunks FIRST and use the generated sizes of the serialized lextents to put into the chunking table in the oNode. Have the sizes of all of the chunks available (without having to consult the KV store) will make the policy decision about when to re-chunk fairly easy to do and allow the future optimizer to make more intelligent decisions about balancing the chunk-boundaries -- the extra data is pretty small and likely worth it (IMO).
>> 
>> I strongly believe that (2) is the way to go.
> 
> Yep, agreed!
> 
>>> In memory, we can pull this out of the onode_t struct and manage it in
>>> Onode where we can keep track of which parts of loaded, dirty, etc.
>>> 
>>> For each map chunk, we can do the inline blob trick.  Unfortunately we can
>>> only do this for 'unshared' blobs where shared is now shared across
>>> extent_map shards and not across objects.  We can conservatively estimate
>>> this by just looking at the blob size w/o looking at other lextents, at least.
>> 
>> Some thoughts here. In the PMR and Flash world, isn't it the case that blobs are essentially immutable once they are global?
>> 
>> (In the SMR world, I know they'll be modified in order to "relocate" a chunk -- possibly invalidating the immutable assumption).
>> 
>> Why NOT pull the blob up into the lextent -- as a copy? If it's immutable, what's the harm? Yes, the stored metadata now expands by the reference count of a cloned blob, but if you can eliminate an extra KV fetch, I'm of the opinion that this seems like a desireable time/space tradeoff.... Just a thought :)
>> 
>> BTW, here's potentially a really important thing that I just realized.... For a global blob, you want to make the blob_id something like #HashID.#LBA, that way when you're trying to do the SMR backpointer thing, you can minimize the number of bNodes you search for to find this LBA address (something like that :)) -- naturally this fails for blobs with multiple pextents -- But we can outlaw this case in the SMR world (space is basically ALWAYS sequential in the SMR world :)).
> 
> Hrm.  This is frustrating.  Being able to do *blob* backpointers and only 
> update blobs for SMR compaction is really appealing, but that means a flag 
> that prevents blob inline-copying on SMR.  This makes me nervous because 
> there will be yet another permutation in the mix: fully inline blob, 
> external with inline copy (use external one if updating ref_map), and 
> fully external.
> 
> And if we do the inline copy, there isn't actually any purpose to most of 
> the external copy except the ref_map.
> 
> OTOH, with currently workloads at least we expect that cloned/shared blobs 
> will be pretty rare, so perhaps we should ignore this and keep object hash 
> (not blob) backpointers for SMR.
> 
> In that case, we should focus instead on sharing the ref_map *only* and 
> always inline the forward pointers for the blob.  This is closer to what 
> we were originally doing with the enode.  In fact, we could go back to the 
> enode approach were it's just a big extent_ref_map and only used to defer 
> deallocations until all refs are retired.  The blob is then more ephemeral 
> (local to the onode, immutable copy if cloned), and we can more easily 
> rejigger how we store it.
> 
> We'd still have a "ref map" type structure for the blob, but it would only 
> be used for counting the lextents that reference it, and we can 
> dynamically build it when we load the extent map.  If we impose the 
> restriction that whatever the map sharding approach we take we never share 
> a blob across a shard, we the blobs are always local and "ephemeral" 
> in the sense we've been talking about.  The only downside here, I think, 
> is that the write path needs to be smart enough to not create any new blob 
> that spans whatever the current map sharding is (or, alternatively, 
> trigger a resharding if it does so).

Not just a resharding, but also a possible decompress/recompress cycle.

> 
> 
> Anyway, the main practical item that has me grumbling about this 
> is that having extent_map in the onode_t means we have lots of helpers 
> nad associated unit tests in test_bluestore_types.cc and making the 
> extent map into a more dynamic structure (with pointers instead of ids 
> etc) pulls it up a level above _types.{cc,h} and makes unit tests harder 
> to write.  OTOH, the more complex structure probably needs its own tests 
> to ensure we do the paging properly anyway, so here goes...
> 

The investment of putting the lextent map behind accessor functions will pay off in simplified debugging later, if that's any solace.
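
For what it's worth, a minimal sketch of what such an accessor layer could look like (the interface here is invented for illustration, not the one that actually got written):

#include <cstdint>
#include <functional>

// callers never touch shards directly; paging, dirty tracking and any
// debugging hooks all live behind this one interface
class ExtentMapAccessor {
 public:
  virtual ~ExtentMapAccessor() = default;

  // ensure every shard overlapping [offset, offset + length) is decoded
  virtual void fault_range(uint64_t offset, uint64_t length) = 0;

  // mark the overlapping shards dirty so they are re-encoded at commit
  virtual void dirty_range(uint64_t offset, uint64_t length) = 0;

  // visit the decoded lextents in the range (assumes fault_range was called)
  virtual void for_each_lextent(
      uint64_t offset, uint64_t length,
      const std::function<void(uint64_t logical_offset, uint64_t len)>& fn) = 0;
};
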

> sage


* Re: bluestore blobs REVISITED
  2016-08-24 21:10                         ` Allen Samuels
@ 2016-08-24 21:18                           ` Sage Weil
  2016-08-24 22:13                             ` Allen Samuels
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2016-08-24 21:18 UTC (permalink / raw)
  To: Allen Samuels, jdurgin, dillaman; +Cc: ceph-devel

On Wed, 24 Aug 2016, Allen Samuels wrote:
> > In that case, we should focus instead on sharing the ref_map *only* and 
> > always inline the forward pointers for the blob.  This is closer to what 
> > we were originally doing with the enode.  In fact, we could go back to the 
> > enode approach were it's just a big extent_ref_map and only used to defer 
> > deallocations until all refs are retired.  The blob is then more ephemeral 
> > (local to the onode, immutable copy if cloned), and we can more easily 
> > rejigger how we store it.
> > 
> > We'd still have a "ref map" type structure for the blob, but it would only 
> > be used for counting the lextents that reference it, and we can 
> > dynamically build it when we load the extent map.  If we impose the 
> > restriction that whatever the map sharding approach we take we never share 
> > a blob across a shard, we the blobs are always local and "ephemeral" 
> > in the sense we've been talking about.  The only downside here, I think, 
> > is that the write path needs to be smart enough to not create any new blob 
> > that spans whatever the current map sharding is (or, alternatively, 
> > trigger a resharding if it does so).
> 
> Not just a resharding but also a possible decompress recompress cycle. 

Yeah.

Oh, the other consequence of this is that we lose the unified blob-wise 
cache behavior we added a while back.  That means that if you write a 
bunch of data to an rbd data object, then clone it, then read from the clone, 
it'll re-read the data from disk.  Because it'll be a different blob in 
memory (since we'll be making a copy of the metadata etc).

Josh, Jason, do you have a sense of whether that really matters?  The 
common case is probably someone who creates a snapshot and then backs it 
up, but it's going to be reading gobs of cold data off disk anyway so I'm 
guessing it doesn't matter that a bit of warm data that just preceded the 
snapshot gets re-read.

sage



* Re: bluestore blobs REVISITED
  2016-08-24 21:18                           ` Sage Weil
@ 2016-08-24 22:13                             ` Allen Samuels
  2016-08-24 22:29                               ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-24 22:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: jdurgin, dillaman, ceph-devel

Yikes. You mean that blob ids are escaping the environment of the lextent table. That's scary. What is the key for this cache? We probably need to invalidate it or something. 

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
> 
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>> In that case, we should focus instead on sharing the ref_map *only* and 
>>> always inline the forward pointers for the blob.  This is closer to what 
>>> we were originally doing with the enode.  In fact, we could go back to the 
>>> enode approach were it's just a big extent_ref_map and only used to defer 
>>> deallocations until all refs are retired.  The blob is then more ephemeral 
>>> (local to the onode, immutable copy if cloned), and we can more easily 
>>> rejigger how we store it.
>>> 
>>> We'd still have a "ref map" type structure for the blob, but it would only 
>>> be used for counting the lextents that reference it, and we can 
>>> dynamically build it when we load the extent map.  If we impose the 
>>> restriction that whatever the map sharding approach we take we never share 
>>> a blob across a shard, we the blobs are always local and "ephemeral" 
>>> in the sense we've been talking about.  The only downside here, I think, 
>>> is that the write path needs to be smart enough to not create any new blob 
>>> that spans whatever the current map sharding is (or, alternatively, 
>>> trigger a resharding if it does so).
>> 
>> Not just a resharding but also a possible decompress recompress cycle.
> 
> Yeah.
> 
> Oh, the other consequence of this is that we lose the unified blob-wise 
> cache behavior we added a while back.  That means that if you write a 
> bunch of data to a rbd data object, then clone it, then read of the clone, 
> it'll re-read the data from disk.  Because it'll be a different blob in 
> memory (since we'll be making a copy of the metadata etc).
> 
> Josh, Jason, do you have a sense of whether that really matters?  The 
> common case is probably someone who creates a snapshot and then backs it 
> up, but it's going to be reading gobs of cold data off disk anyway so I'm 
> guessing it doesn't matter that a bit of warm data that just preceded the 
> snapshot gets re-read.
> 
> sage
> 


* Re: bluestore blobs REVISITED
  2016-08-24 22:13                             ` Allen Samuels
@ 2016-08-24 22:29                               ` Sage Weil
  2016-08-24 23:41                                 ` Allen Samuels
  2016-08-25 12:40                                 ` Jason Dillaman
  0 siblings, 2 replies; 29+ messages in thread
From: Sage Weil @ 2016-08-24 22:29 UTC (permalink / raw)
  To: Allen Samuels; +Cc: jdurgin, dillaman, ceph-devel

On Wed, 24 Aug 2016, Allen Samuels wrote:
> Yikes. You mean that blob ids are escaping the environment of the 
> lextent table. That's scary. What is the key for this cache? We probably 
> need to invalidate it or something.

I mean that there will no longer be blob ids (except within the encoding 
of a particular extent map shard).  Which means that when you write to A, 
clone A->B, and then read B, B's blob will no longer be the same as A's 
blob (as it is now in the bnode, or would have been with the -blobwise 
branch) and the cache won't be preserved.

Which I *think* is okay...?

sage


> 
> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
> > On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
> > 
> > On Wed, 24 Aug 2016, Allen Samuels wrote:
> >>> In that case, we should focus instead on sharing the ref_map *only* and 
> >>> always inline the forward pointers for the blob.  This is closer to what 
> >>> we were originally doing with the enode.  In fact, we could go back to the 
> >>> enode approach were it's just a big extent_ref_map and only used to defer 
> >>> deallocations until all refs are retired.  The blob is then more ephemeral 
> >>> (local to the onode, immutable copy if cloned), and we can more easily 
> >>> rejigger how we store it.
> >>> 
> >>> We'd still have a "ref map" type structure for the blob, but it would only 
> >>> be used for counting the lextents that reference it, and we can 
> >>> dynamically build it when we load the extent map.  If we impose the 
> >>> restriction that whatever the map sharding approach we take we never share 
> >>> a blob across a shard, we the blobs are always local and "ephemeral" 
> >>> in the sense we've been talking about.  The only downside here, I think, 
> >>> is that the write path needs to be smart enough to not create any new blob 
> >>> that spans whatever the current map sharding is (or, alternatively, 
> >>> trigger a resharding if it does so).
> >> 
> >> Not just a resharding but also a possible decompress recompress cycle.
> > 
> > Yeah.
> > 
> > Oh, the other consequence of this is that we lose the unified blob-wise 
> > cache behavior we added a while back.  That means that if you write a 
> > bunch of data to a rbd data object, then clone it, then read of the clone, 
> > it'll re-read the data from disk.  Because it'll be a different blob in 
> > memory (since we'll be making a copy of the metadata etc).
> > 
> > Josh, Jason, do you have a sense of whether that really matters?  The 
> > common case is probably someone who creates a snapshot and then backs it 
> > up, but it's going to be reading gobs of cold data off disk anyway so I'm 
> > guessing it doesn't matter that a bit of warm data that just preceded the 
> > snapshot gets re-read.
> > 
> > sage
> > 
> 
> 


* Re: bluestore blobs REVISITED
  2016-08-24 22:29                               ` Sage Weil
@ 2016-08-24 23:41                                 ` Allen Samuels
  2016-08-25 13:55                                   ` Sage Weil
  2016-08-25 12:40                                 ` Jason Dillaman
  1 sibling, 1 reply; 29+ messages in thread
From: Allen Samuels @ 2016-08-24 23:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: jdurgin, dillaman, ceph-devel

You're suggesting a logical-address cache key (oid, offset) rather than a physical one (lba). That seems fine to me, provided that deletes and renames properly purge the cache.

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 6:29 PM, Sage Weil <sweil@redhat.com> wrote:
> 
>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>> Yikes. You mean that blob ids are escaping the environment of the 
>> lextent table. That's scary. What is the key for this cache? We probably 
>> need to invalidate it or something.
> 
> I mean that there will no longer be blob ids (except within the encoding 
> of a particular extent map shard).  Which means that when you write to A, 
> clone A->B, and then read B, B's blob will no longer be the same as A's 
> blob (as it is now in the bnode, or would have been with the -blobwise 
> branch) and the cache won't be preserved.
> 
> Which I *think* is okay...?
> 
> sage
> 
> 
>> 
>> Sent from my iPhone. Please excuse all typos and autocorrects.
>> 
>>> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
>>> 
>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> In that case, we should focus instead on sharing the ref_map *only* and 
>>>>> always inline the forward pointers for the blob.  This is closer to what 
>>>>> we were originally doing with the enode.  In fact, we could go back to the 
>>>>> enode approach were it's just a big extent_ref_map and only used to defer 
>>>>> deallocations until all refs are retired.  The blob is then more ephemeral 
>>>>> (local to the onode, immutable copy if cloned), and we can more easily 
>>>>> rejigger how we store it.
>>>>> 
>>>>> We'd still have a "ref map" type structure for the blob, but it would only 
>>>>> be used for counting the lextents that reference it, and we can 
>>>>> dynamically build it when we load the extent map.  If we impose the 
>>>>> restriction that whatever the map sharding approach we take we never share 
>>>>> a blob across a shard, we the blobs are always local and "ephemeral" 
>>>>> in the sense we've been talking about.  The only downside here, I think, 
>>>>> is that the write path needs to be smart enough to not create any new blob 
>>>>> that spans whatever the current map sharding is (or, alternatively, 
>>>>> trigger a resharding if it does so).
>>>> 
>>>> Not just a resharding but also a possible decompress recompress cycle.
>>> 
>>> Yeah.
>>> 
>>> Oh, the other consequence of this is that we lose the unified blob-wise 
>>> cache behavior we added a while back.  That means that if you write a 
>>> bunch of data to a rbd data object, then clone it, then read of the clone, 
>>> it'll re-read the data from disk.  Because it'll be a different blob in 
>>> memory (since we'll be making a copy of the metadata etc).
>>> 
>>> Josh, Jason, do you have a sense of whether that really matters?  The 
>>> common case is probably someone who creates a snapshot and then backs it 
>>> up, but it's going to be reading gobs of cold data off disk anyway so I'm 
>>> guessing it doesn't matter that a bit of warm data that just preceded the 
>>> snapshot gets re-read.
>>> 
>>> sage
>> 
>> 


* Re: bluestore blobs REVISITED
  2016-08-24 22:29                               ` Sage Weil
  2016-08-24 23:41                                 ` Allen Samuels
@ 2016-08-25 12:40                                 ` Jason Dillaman
  2016-08-25 14:16                                   ` Sage Weil
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Dillaman @ 2016-08-25 12:40 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, jdurgin, ceph-devel

Just so I understand, let's say a user snapshots an RBD image that has
active IO. At this point, are you saying that the "A" data
(pre-snapshot) is still (potentially) in the cache and any write
op-induced creation of clone "B" would not be in the cache?  If that's
the case, it sounds like a re-read would be required after the first
"post snapshot" write op.

On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>> Yikes. You mean that blob ids are escaping the environment of the
>> lextent table. That's scary. What is the key for this cache? We probably
>> need to invalidate it or something.
>
> I mean that there will no longer be blob ids (except within the encoding
> of a particular extent map shard).  Which means that when you write to A,
> clone A->B, and then read B, B's blob will no longer be the same as A's
> blob (as it is now in the bnode, or would have been with the -blobwise
> branch) and the cache won't be preserved.
>
> Which I *think* is okay...?
>
> sage
>
>
>>
>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>
>> > On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
>> >
>> > On Wed, 24 Aug 2016, Allen Samuels wrote:
>> >>> In that case, we should focus instead on sharing the ref_map *only* and
>> >>> always inline the forward pointers for the blob.  This is closer to what
>> >>> we were originally doing with the enode.  In fact, we could go back to the
>> >>> enode approach were it's just a big extent_ref_map and only used to defer
>> >>> deallocations until all refs are retired.  The blob is then more ephemeral
>> >>> (local to the onode, immutable copy if cloned), and we can more easily
>> >>> rejigger how we store it.
>> >>>
>> >>> We'd still have a "ref map" type structure for the blob, but it would only
>> >>> be used for counting the lextents that reference it, and we can
>> >>> dynamically build it when we load the extent map.  If we impose the
>> >>> restriction that whatever the map sharding approach we take we never share
>> >>> a blob across a shard, we the blobs are always local and "ephemeral"
>> >>> in the sense we've been talking about.  The only downside here, I think,
>> >>> is that the write path needs to be smart enough to not create any new blob
>> >>> that spans whatever the current map sharding is (or, alternatively,
>> >>> trigger a resharding if it does so).
>> >>
>> >> Not just a resharding but also a possible decompress recompress cycle.
>> >
>> > Yeah.
>> >
>> > Oh, the other consequence of this is that we lose the unified blob-wise
>> > cache behavior we added a while back.  That means that if you write a
>> > bunch of data to a rbd data object, then clone it, then read of the clone,
>> > it'll re-read the data from disk.  Because it'll be a different blob in
>> > memory (since we'll be making a copy of the metadata etc).
>> >
>> > Josh, Jason, do you have a sense of whether that really matters?  The
>> > common case is probably someone who creates a snapshot and then backs it
>> > up, but it's going to be reading gobs of cold data off disk anyway so I'm
>> > guessing it doesn't matter that a bit of warm data that just preceded the
>> > snapshot gets re-read.
>> >
>> > sage
>> >
>>
>>



-- 
Jason


* Re: bluestore blobs REVISITED
  2016-08-24 23:41                                 ` Allen Samuels
@ 2016-08-25 13:55                                   ` Sage Weil
  0 siblings, 0 replies; 29+ messages in thread
From: Sage Weil @ 2016-08-25 13:55 UTC (permalink / raw)
  To: Allen Samuels; +Cc: jdurgin, dillaman, ceph-devel

On Wed, 24 Aug 2016, Allen Samuels wrote:
> Your suggesting a logical address cache key (oid offset) rather rhan a 
> physical cache (lba). Which seems fine to me. Provided that deletes and 
> renames properly purge the cache.

Right.  This is actually how it's currently implemented.  And the rename 
etc works for free since the cached data is all dangling off the Blob 
structure.

sage


> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
> > On Aug 24, 2016, at 6:29 PM, Sage Weil <sweil@redhat.com> wrote:
> > 
> >> On Wed, 24 Aug 2016, Allen Samuels wrote:
> >> Yikes. You mean that blob ids are escaping the environment of the 
> >> lextent table. That's scary. What is the key for this cache? We probably 
> >> need to invalidate it or something.
> > 
> > I mean that there will no longer be blob ids (except within the encoding 
> > of a particular extent map shard).  Which means that when you write to A, 
> > clone A->B, and then read B, B's blob will no longer be the same as A's 
> > blob (as it is now in the bnode, or would have been with the -blobwise 
> > branch) and the cache won't be preserved.
> > 
> > Which I *think* is okay...?
> > 
> > sage
> > 
> > 
> >> 
> >> Sent from my iPhone. Please excuse all typos and autocorrects.
> >> 
> >>> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
> >>> 
> >>> On Wed, 24 Aug 2016, Allen Samuels wrote:
> >>>>> In that case, we should focus instead on sharing the ref_map *only* and 
> >>>>> always inline the forward pointers for the blob.  This is closer to what 
> >>>>> we were originally doing with the enode.  In fact, we could go back to the 
> >>>>> enode approach were it's just a big extent_ref_map and only used to defer 
> >>>>> deallocations until all refs are retired.  The blob is then more ephemeral 
> >>>>> (local to the onode, immutable copy if cloned), and we can more easily 
> >>>>> rejigger how we store it.
> >>>>> 
> >>>>> We'd still have a "ref map" type structure for the blob, but it would only 
> >>>>> be used for counting the lextents that reference it, and we can 
> >>>>> dynamically build it when we load the extent map.  If we impose the 
> >>>>> restriction that whatever the map sharding approach we take we never share 
> >>>>> a blob across a shard, we the blobs are always local and "ephemeral" 
> >>>>> in the sense we've been talking about.  The only downside here, I think, 
> >>>>> is that the write path needs to be smart enough to not create any new blob 
> >>>>> that spans whatever the current map sharding is (or, alternatively, 
> >>>>> trigger a resharding if it does so).
> >>>> 
> >>>> Not just a resharding but also a possible decompress recompress cycle.
> >>> 
> >>> Yeah.
> >>> 
> >>> Oh, the other consequence of this is that we lose the unified blob-wise 
> >>> cache behavior we added a while back.  That means that if you write a 
> >>> bunch of data to a rbd data object, then clone it, then read of the clone, 
> >>> it'll re-read the data from disk.  Because it'll be a different blob in 
> >>> memory (since we'll be making a copy of the metadata etc).
> >>> 
> >>> Josh, Jason, do you have a sense of whether that really matters?  The 
> >>> common case is probably someone who creates a snapshot and then backs it 
> >>> up, but it's going to be reading gobs of cold data off disk anyway so I'm 
> >>> guessing it doesn't matter that a bit of warm data that just preceded the 
> >>> snapshot gets re-read.
> >>> 
> >>> sage
> >> 
> >> 
> 
> 


* Re: bluestore blobs REVISITED
  2016-08-25 12:40                                 ` Jason Dillaman
@ 2016-08-25 14:16                                   ` Sage Weil
  2016-08-25 19:07                                     ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2016-08-25 14:16 UTC (permalink / raw)
  To: dillaman; +Cc: Allen Samuels, jdurgin, ceph-devel

On Thu, 25 Aug 2016, Jason Dillaman wrote:
> Just so I understand, let's say a user snapshots an RBD image that has
> active IO. At this point, are you saying that the "A" data
> (pre-snapshot) is still (potentially) in the cache and any write
> op-induced creation of clone "B" would not be in the cache?  If that's
> the case, it sounds like a re-read would be required after the first
> "post snapshot" write op.

I mean you could have a sequence like

 write A 0~4096 to disk block X
 clone A -> B
 read A 0~4096   (cache hit, it's still there)
 read B 0~4096   (cache miss, read disk block X.  now 2 copies of X in ram)
 read A 0~4096   (cache hit again, it's still there)

The question is whether the "miss" reading B is concerning.  Or the 
double-caching, I suppose.

sage


> 
> On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 24 Aug 2016, Allen Samuels wrote:
> >> Yikes. You mean that blob ids are escaping the environment of the
> >> lextent table. That's scary. What is the key for this cache? We probably
> >> need to invalidate it or something.
> >
> > I mean that there will no longer be blob ids (except within the encoding
> > of a particular extent map shard).  Which means that when you write to A,
> > clone A->B, and then read B, B's blob will no longer be the same as A's
> > blob (as it is now in the bnode, or would have been with the -blobwise
> > branch) and the cache won't be preserved.
> >
> > Which I *think* is okay...?
> >
> > sage
> >
> >
> >>
> >> Sent from my iPhone. Please excuse all typos and autocorrects.
> >>
> >> > On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >
> >> > On Wed, 24 Aug 2016, Allen Samuels wrote:
> >> >>> In that case, we should focus instead on sharing the ref_map *only* and
> >> >>> always inline the forward pointers for the blob.  This is closer to what
> >> >>> we were originally doing with the enode.  In fact, we could go back to the
> >> >>> enode approach were it's just a big extent_ref_map and only used to defer
> >> >>> deallocations until all refs are retired.  The blob is then more ephemeral
> >> >>> (local to the onode, immutable copy if cloned), and we can more easily
> >> >>> rejigger how we store it.
> >> >>>
> >> >>> We'd still have a "ref map" type structure for the blob, but it would only
> >> >>> be used for counting the lextents that reference it, and we can
> >> >>> dynamically build it when we load the extent map.  If we impose the
> >> >>> restriction that whatever the map sharding approach we take we never share
> >> >>> a blob across a shard, we the blobs are always local and "ephemeral"
> >> >>> in the sense we've been talking about.  The only downside here, I think,
> >> >>> is that the write path needs to be smart enough to not create any new blob
> >> >>> that spans whatever the current map sharding is (or, alternatively,
> >> >>> trigger a resharding if it does so).
> >> >>
> >> >> Not just a resharding but also a possible decompress recompress cycle.
> >> >
> >> > Yeah.
> >> >
> >> > Oh, the other consequence of this is that we lose the unified blob-wise
> >> > cache behavior we added a while back.  That means that if you write a
> >> > bunch of data to a rbd data object, then clone it, then read of the clone,
> >> > it'll re-read the data from disk.  Because it'll be a different blob in
> >> > memory (since we'll be making a copy of the metadata etc).
> >> >
> >> > Josh, Jason, do you have a sense of whether that really matters?  The
> >> > common case is probably someone who creates a snapshot and then backs it
> >> > up, but it's going to be reading gobs of cold data off disk anyway so I'm
> >> > guessing it doesn't matter that a bit of warm data that just preceded the
> >> > snapshot gets re-read.
> >> >
> >> > sage
> >> >
> >>
> >>
> 
> 
> 
> -- 
> Jason
> 
> 


* Re: bluestore blobs REVISITED
  2016-08-25 14:16                                   ` Sage Weil
@ 2016-08-25 19:07                                     ` Sage Weil
  2016-08-25 19:10                                       ` Allen Samuels
  0 siblings, 1 reply; 29+ messages in thread
From: Sage Weil @ 2016-08-25 19:07 UTC (permalink / raw)
  To: dillaman; +Cc: Allen Samuels, jdurgin, ceph-devel

On Thu, 25 Aug 2016, Sage Weil wrote:
> On Thu, 25 Aug 2016, Jason Dillaman wrote:
> > Just so I understand, let's say a user snapshots an RBD image that has
> > active IO. At this point, are you saying that the "A" data
> > (pre-snapshot) is still (potentially) in the cache and any write
> > op-induced creation of clone "B" would not be in the cache?  If that's
> > the case, it sounds like a re-read would be required after the first
> > "post snapshot" write op.
> 
> I mean you could have a sequence like
> 
>  write A 0~4096 to disk block X
>  clone A -> B
>  read A 0~4096   (cache hit, it's still there)
>  read B 0~4096   (cache miss, read disk block X.  now 2 copies of X in ram)
>  read A 0~4096   (cache hit again, it's still there)
> 
> The question is whether the "miss" reading B is concerning.  Or the 
> double-caching, I suppose.

You know what, I take it all back.  We can uniquely identify blobs by 
their starting LBA, so there's no reason we can't unify the caches as 
before.
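
A small sketch of that unification (names invented, payloads reduced to strings for illustration): cached buffers hang off a structure looked up by the blob's starting LBA, so the separate in-memory Blob copies that A and B end up with after a clone resolve to the same cache entry.

#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct BufferSpace {
  // blob offset -> cached bytes (a toy stand-in for the real buffer lists)
  std::map<uint64_t, std::string> buffers;
};

class SharedBufferCache {
 public:
  // both clones' Blob copies call this with the same starting LBA and get
  // the same BufferSpace, so the post-clone read of B is a cache hit
  std::shared_ptr<BufferSpace> get(uint64_t start_lba) {
    auto& slot = by_lba_[start_lba];
    auto bs = slot.lock();
    if (!bs) {
      bs = std::make_shared<BufferSpace>();
      slot = bs;
    }
    return bs;
  }

 private:
  std::map<uint64_t, std::weak_ptr<BufferSpace>> by_lba_;
};
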

sage



> 
> sage
> 
> 
> > 
> > On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@redhat.com> wrote:
> > > On Wed, 24 Aug 2016, Allen Samuels wrote:
> > >> Yikes. You mean that blob ids are escaping the environment of the
> > >> lextent table. That's scary. What is the key for this cache? We probably
> > >> need to invalidate it or something.
> > >
> > > I mean that there will no longer be blob ids (except within the encoding
> > > of a particular extent map shard).  Which means that when you write to A,
> > > clone A->B, and then read B, B's blob will no longer be the same as A's
> > > blob (as it is now in the bnode, or would have been with the -blobwise
> > > branch) and the cache won't be preserved.
> > >
> > > Which I *think* is okay...?
> > >
> > > sage
> > >
> > >
> > >>
> > >> Sent from my iPhone. Please excuse all typos and autocorrects.
> > >>
> > >> > On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
> > >> >
> > >> > On Wed, 24 Aug 2016, Allen Samuels wrote:
> > >> >>> In that case, we should focus instead on sharing the ref_map *only* and
> > >> >>> always inline the forward pointers for the blob.  This is closer to what
> > >> >>> we were originally doing with the enode.  In fact, we could go back to the
> > >> >>> enode approach were it's just a big extent_ref_map and only used to defer
> > >> >>> deallocations until all refs are retired.  The blob is then more ephemeral
> > >> >>> (local to the onode, immutable copy if cloned), and we can more easily
> > >> >>> rejigger how we store it.
> > >> >>>
> > >> >>> We'd still have a "ref map" type structure for the blob, but it would only
> > >> >>> be used for counting the lextents that reference it, and we can
> > >> >>> dynamically build it when we load the extent map.  If we impose the
> > >> >>> restriction that whatever the map sharding approach we take we never share
> > >> >>> a blob across a shard, we the blobs are always local and "ephemeral"
> > >> >>> in the sense we've been talking about.  The only downside here, I think,
> > >> >>> is that the write path needs to be smart enough to not create any new blob
> > >> >>> that spans whatever the current map sharding is (or, alternatively,
> > >> >>> trigger a resharding if it does so).
> > >> >>
> > >> >> Not just a resharding but also a possible decompress recompress cycle.
> > >> >
> > >> > Yeah.
> > >> >
> > >> > Oh, the other consequence of this is that we lose the unified blob-wise
> > >> > cache behavior we added a while back.  That means that if you write a
> > >> > bunch of data to a rbd data object, then clone it, then read of the clone,
> > >> > it'll re-read the data from disk.  Because it'll be a different blob in
> > >> > memory (since we'll be making a copy of the metadata etc).
> > >> >
> > >> > Josh, Jason, do you have a sense of whether that really matters?  The
> > >> > common case is probably someone who creates a snapshot and then backs it
> > >> > up, but it's going to be reading gobs of cold data off disk anyway so I'm
> > >> > guessing it doesn't matter that a bit of warm data that just preceded the
> > >> > snapshot gets re-read.
> > >> >
> > >> > sage
> > >> >
> > >>
> > >>
> > 
> > 
> > 
> > -- 
> > Jason
> > 
> > 
> 
> 


* Re: bluestore blobs REVISITED
  2016-08-25 19:07                                     ` Sage Weil
@ 2016-08-25 19:10                                       ` Allen Samuels
  0 siblings, 0 replies; 29+ messages in thread
From: Allen Samuels @ 2016-08-25 19:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: dillaman, jdurgin, ceph-devel

Yes, a physically indexed cache solves that problem, but you will suffer the translation overhead on a read hit -- still probably the right choice.

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 25, 2016, at 3:08 PM, Sage Weil <sweil@redhat.com> wrote:
> 
>> On Thu, 25 Aug 2016, Sage Weil wrote:
>>> On Thu, 25 Aug 2016, Jason Dillaman wrote:
>>> Just so I understand, let's say a user snapshots an RBD image that has
>>> active IO. At this point, are you saying that the "A" data
>>> (pre-snapshot) is still (potentially) in the cache and any write
>>> op-induced creation of clone "B" would not be in the cache?  If that's
>>> the case, it sounds like a re-read would be required after the first
>>> "post snapshot" write op.
>> 
>> I mean you could have a sequence like
>> 
>> write A 0~4096 to disk block X
>> clone A -> B
>> read A 0~4096   (cache hit, it's still there)
>> read B 0~4096   (cache miss, read disk block X.  now 2 copies of X in ram)
>> read A 0~4096   (cache hit again, it's still there)
>> 
>> The question is whether the "miss" reading B is concerning.  Or the 
>> double-caching, I suppose.
> 
> You know what, I take it all back.  We can uniquely identify blobs by 
> their starting LBA, so there's no reason we can't unify the caches as 
> before.
> 
> sage
> 
> 
> 
>> 
>> sage
>> 
>> 
>>> 
>>>> On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> Yikes. You mean that blob ids are escaping the environment of the
>>>>> lextent table. That's scary. What is the key for this cache? We probably
>>>>> need to invalidate it or something.
>>>> 
>>>> I mean that there will no longer be blob ids (except within the encoding
>>>> of a particular extent map shard).  Which means that when you write to A,
>>>> clone A->B, and then read B, B's blob will no longer be the same as A's
>>>> blob (as it is now in the bnode, or would have been with the -blobwise
>>>> branch) and the cache won't be preserved.
>>>> 
>>>> Which I *think* is okay...?
>>>> 
>>>> sage
>>>> 
>>>> 
>>>>> 
>>>>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>>>> 
>>>>>> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>>> 
>>>>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>>>>> In that case, we should focus instead on sharing the ref_map *only* and
>>>>>>>> always inline the forward pointers for the blob.  This is closer to what
>>>>>>>> we were originally doing with the enode.  In fact, we could go back to the
>>>>>>>> enode approach were it's just a big extent_ref_map and only used to defer
>>>>>>>> deallocations until all refs are retired.  The blob is then more ephemeral
>>>>>>>> (local to the onode, immutable copy if cloned), and we can more easily
>>>>>>>> rejigger how we store it.
>>>>>>>> 
>>>>>>>> We'd still have a "ref map" type structure for the blob, but it would only
>>>>>>>> be used for counting the lextents that reference it, and we can
>>>>>>>> dynamically build it when we load the extent map.  If we impose the
>>>>>>>> restriction that whatever the map sharding approach we take we never share
>>>>>>>> a blob across a shard, we the blobs are always local and "ephemeral"
>>>>>>>> in the sense we've been talking about.  The only downside here, I think,
>>>>>>>> is that the write path needs to be smart enough to not create any new blob
>>>>>>>> that spans whatever the current map sharding is (or, alternatively,
>>>>>>>> trigger a resharding if it does so).
>>>>>>> 
>>>>>>> Not just a resharding but also a possible decompress recompress cycle.
>>>>>> 
>>>>>> Yeah.
>>>>>> 
>>>>>> Oh, the other consequence of this is that we lose the unified blob-wise
>>>>>> cache behavior we added a while back.  That means that if you write a
>>>>>> bunch of data to a rbd data object, then clone it, then read of the clone,
>>>>>> it'll re-read the data from disk.  Because it'll be a different blob in
>>>>>> memory (since we'll be making a copy of the metadata etc).
>>>>>> 
>>>>>> Josh, Jason, do you have a sense of whether that really matters?  The
>>>>>> common case is probably someone who creates a snapshot and then backs it
>>>>>> up, but it's going to be reading gobs of cold data off disk anyway so I'm
>>>>>> guessing it doesn't matter that a bit of warm data that just preceded the
>>>>>> snapshot gets re-read.
>>>>>> 
>>>>>> sage
>>> 
>>> 
>>> 
>>> -- 
>>> Jason
>> 
>> 

