* bluestore blobs
@ 2016-08-26 17:51 Sage Weil
  2016-08-26 18:16 ` Allen Samuels
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-08-26 17:51 UTC (permalink / raw)
  To: allen.samuels; +Cc: ceph-devel

Hi Allen,

The "blobs must be confined to a extent map shard" rule is still somewhat 
unsatisfying to me.  There's another easy possibility, though: we 
allow blobs to span extent map shards, and when they do, we stuff them 
directly in the onode.  The number of such blobs will always be small 
(no more than the number of extent map shards), so I don't think size is a 
concern.  And we'll always already have them loaded up when we bring any 
particular shard in, so we don't need to worry about any additional 
complexity around paging them in.  And we avoid the slightly annoying cut 
points on compressed extents when they cross such boundaries.
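
Something like this, layout-wise -- a sketch only, with hypothetical names 
rather than the actual BlueStore types, just to make the idea concrete:

  #include <cstdint>
  #include <map>
  #include <vector>

  struct BlobSketch { /* csum info, compression flags, pextents, ... */ };

  struct ExtentMapShardSketch {
    uint64_t logical_offset = 0;              // first logical offset covered by this shard
    std::map<uint64_t, BlobSketch> blobs;     // blobs fully contained within the shard
  };

  struct OnodeSketch {
    // Blobs that cross a shard boundary live directly in the onode; there are
    // at most (number of shards - 1) of them, so this stays small and is
    // already in memory whenever any shard is loaded.
    std::map<int64_t, BlobSketch> spanning_blobs;
    std::vector<ExtentMapShardSketch> shards;  // each shard encoded/loaded on its own
  };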

This also avoids some of the tuning practicalities that were annoying me 
(does a global config option control where the enforced cut points are?  
what happens if that changes on an existing store?)

sage


* Re: bluestore blobs
  2016-08-26 17:51 bluestore blobs Sage Weil
@ 2016-08-26 18:16 ` Allen Samuels
  0 siblings, 0 replies; 24+ messages in thread
From: Allen Samuels @ 2016-08-26 18:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I like the simplicity of this approach a lot. When the system isn't fragmented, this should work excellently. However, when it is fragmented, this falls back to the current situation. Nevertheless, in a heavily fragmented environment the onode size might be the least important problem to solve, so I agree this is the approach we should take.

Sent from my iPhone. Please excuse all typos and autocorrect

> On Aug 26, 2016, at 1:52 PM, Sage Weil <sweil@redhat.com> wrote:
> 
> Hi Allen,
> 
> The "blobs must be confined to a extent map shard" rule is still somewhat 
> unsatisfying to me.  There's another easy possibility, though: we 
> allow blobs to span extent map shards, and when they do, we stuff them 
> directly in the onode.  The number of such blobs will always be small 
> (no more than the number of extent map shards), so I don't think size is a 
> concern.  And we'll always already have them loaded up when we bring any 
> particular shard in, so we don't need to worry about any additional 
> complexity around paging them in.  And we avoid the slightly annoying cut 
> points on compressed extents when they cross such boundaries.
> 
> This also avoids some of the tuning practicalities that were annoying me 
> (does a global config option control where the enforced cut points are?  
> what happens if that changes on an existing store?)
> 
> sage


* RE: bluestore blobs
  2016-08-19 13:53       ` Sage Weil
@ 2016-08-19 14:16         ` Allen Samuels
  0 siblings, 0 replies; 24+ messages in thread
From: Allen Samuels @ 2016-08-19 14:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, August 19, 2016 6:53 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs
> 
> On Fri, 19 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Thursday, August 18, 2016 8:10 AM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: bluestore blobs
> > >
> > > On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > > > To: ceph-devel@vger.kernel.org
> > > > > Subject: bluestore blobs
> > > > >
> > > > > I think we need to look at other changes in addition to the
> > > > > encoding performance improvements.  Even if they end up being
> > > > > good enough, these changes are somewhat orthogonal and at least
> > > > > one of them should give us something that is even faster.
> > > > >
> > > > > 1. I mentioned this before, but we should keep the encoding
> > > > > bluestore_blob_t around when we load the blob map.  If it's not
> > > > > changed, don't reencode it.  There are no blockers for
> > > > > implementing this
> > > currently.
> > > > > It may be difficult to ensure the blobs are properly marked dirty...
> > > > > I'll see if we can use proper accessors for the blob to enforce
> > > > > this at compile time.  We should do that anyway.
> > > >
> > > > If it's not changed, then why are we re-writing it? I'm having a
> > > > hard time thinking of a case worth optimizing where I want to
> > > > re-write the oNode but the blob_map is unchanged. Am I missing
> something obvious?
> > >
> > > An onode's blob_map might have 300 blobs, and a single write only
> > > updates one of them.  The other 299 blobs need not be reencoded, just
> memcpy'd.
> >
> > As long as we're just appending that's a good optimization. How often
> > does that happen? It's certainly not going to help the RBD 4K random
> > write problem.
> 
> It won't help the (l)extent_map encoding, but it avoids almost all of the blob
> reencoding.  A 4k random write will update one blob out of ~100 (or
> whatever it is).
> 
> > > > > 2. This turns the blob Put into rocksdb into two memcpy stages:
> > > > > one to assemble the bufferlist (lots of bufferptrs to each
> > > > > untouched
> > > > > blob) into a single rocksdb::Slice, and another memcpy somewhere
> > > > > inside rocksdb to copy this into the write buffer.  We could
> > > > > extend the rocksdb interface to take an iovec so that the first
> > > > > memcpy isn't needed (and rocksdb will instead iterate over our
> > > > > buffers and copy them directly into its write buffer).  This is
> > > > > probably a pretty small piece of the overall time... should
> > > > > verify with a profiler
> > > before investing too much effort here.
> > > >
> > > > I doubt it's the memcpy that's really the expensive part. I'll bet
> > > > it's that we're transcoding from an internal to an external
> > > > representation on an element by element basis. If the iovec scheme
> > > > is going to help, it presumes that the internal data structure
> > > > essentially matches the external data structure so that only an
> > > > iovec copy is required. I'm wondering how compatible this is with
> > > > the current concepts of lextent/blob/pextent.
> > >
> > > I'm thinking of the xattr case (we have a bunch of strings to copy
> > > verbatim) and updated-one-blob-and-kept-99-unchanged case: instead
> > > of memcpy'ing them into a big contiguous buffer and having rocksdb
> > > memcpy
> > > *that* into its larger buffer, give rocksdb an iovec so that the
> > > smaller buffers are assembled only once.
> > >
> > > These buffers will be on the order of many 10s to a couple 100s of bytes.
> > > I'm not sure where the crossover point for constructing and then
> > > traversing an iovec vs just copying twice would be...
> > >
> >
> > Yes this will eliminate the "extra" copy, but the real problem is that
> > the oNode itself is just too large. I doubt removing one extra copy is
> > going to suddenly "solve" this problem. I think we're going to end up
> > rejiggering things so that this will be much less of a problem than it
> > is now -- time will tell.
> 
> Yeah, leaving this one for last I think... until we see memcpy show up in the
> profile.
> 
> > > > > 3. Even if we do the above, we're still setting a big (~4k or
> > > > > more?) key into rocksdb every time we touch an object, even when
> > > > > a tiny
> >
> > See my analysis, you're looking at 8-10K for the RBD random write case
> > -- which I think everybody cares a lot about.
> >
> > > > > amount of metadata is getting changed.  This is a consequence of
> > > > > embedding all of the blobs into the onode (or bnode).  That
> > > > > seemed like a good idea early on when they were tiny (i.e., just
> > > > > an extent), but now I'm not so sure.  I see a couple of different
> options:
> > > > >
> > > > > a) Store each blob as ($onode_key+$blobid).  When we load the
> > > > > onode, load the blobs too.  They will hopefully be sequential in
> > > > > rocksdb (or definitely sequential in zs).  Probably go back to using an
> iterator.
> > > > >
> > > > > b) Go all in on the "bnode" like concept.  Assign blob ids so
> > > > > that they are unique for any given hash value.  Then store the
> > > > > blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is
> > > > > now).  Then when clone happens there is no onode->bnode
> > > > > migration magic happening--we've already committed to storing
> > > > > blobs in separate keys.  When we load the onode, keep the
> > > > > conditional bnode loading we already have.. but when the bnode
> > > > > is loaded load up all the blobs for the hash key.  (Okay, we
> > > > > could fault in blobs individually, but that code will be more
> > > > > complicated.)
> >
> > I like this direction. I think you'll still end up demand loading the
> > blobs in order to speed up the random read case. This scheme will
> > result in some space-amplification, both in the lextent and in the
> > blob-map, it's worth a bit of study to see how bad the metadata/data
> > ratio becomes (just as a guess, $shard.$poolid.$hash.$blobid is
> > probably 16 +
> > 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob --
> > unless your KV store does path compression. My reading of RocksDB sst
> > file seems to indicate that it doesn't, I *believe* that ZS does [need
> > to confirm]). I'm wondering if the current notion of local vs. global
> > blobs isn't actually beneficial in that we can give local blobs
> > different names that sort with their associated oNode (which probably
> > makes the space-amp worse) which is an important optimization. We do
> > need to watch the space amp, we're going to be burning DRAM to make KV
> > accesses cheap and the amount of DRAM is proportional to the space amp.
> 
> I got this mostly working last night... just need to sort out the clone case (and
> clean up a bunch of code).  It was a relatively painless transition to make,
> although in its current form the blobs all belong to the bnode, and the bnode
> is ephemeral but remains in memory until all referencing onodes go away.
> Mostly fine, except it means that odd combinations of clone could leave lots
> of blobs in cache that don't get trimmed.  Will address that later.
> 
> I'll try to finish it up this morning and get it passing tests and posted.
> 
> > > > > In both these cases, a write will dirty the onode (which is back
> > > > > to being pretty small.. just xattrs and the lextent map) and 1-3
> > > > > blobs (also
> > > now small keys).
> >
> > I'm not sure the oNode is going to be that small. Looking at the RBD
> > random 4K write case, you're going to have 1K entries each of which
> > has an offset, size and a blob-id reference in them. In my current
> > oNode compression scheme this compresses to about 1 byte per entry.
> > However, this optimization relies on being able to cheaply renumber
> > the blob-ids, which is no longer possible when the blob-ids become
> > parts of a key (see above). So now you'll have a minimum of 1.5-3
> > bytes extra for each blob-id (because you can't assume that the blob-ids
> > become "dense" anymore). So you're looking at 2.5-4 bytes per entry or about
> > 2.5-4 KB of lextent table. Worse, because of the variable length encoding
> > you'll have to scan the entire table to deserialize it (yes, we could
> > do differential editing when we write but that's another discussion).
> > Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). So
> > while this looks small compared to the current ~30K for the entire
> > thing oNode/lextent/blobmap, it's NOT a huge gain over 8-10K of the
> > compressed oNode/lextent/blobmap scheme that I published earlier.
> >
> > If we want to do better we will need to separate the lextent from the
> > oNode also. It's relatively easy to move the lextents into the KV
> > store itself (there are two obvious ways to deal with this, either use
> > the native offset/size from the lextent itself OR create 'N' buckets
> > of logical offset into which we pour entries -- both of these would
> > add somewhere between 1 and 2 KV look-ups per operation -- here is
> > where an iterator would probably help).
> >
> > Unfortunately, if you only process a portion of the lextent (because
> > you've made it into multiple keys and you don't want to load all of
> > them) you no longer can re-generate the refmap on the fly (another key
> > space optimization). The lack of refmap screws up a number of other
> > important algorithms -- for example the overlapping blob-map thing, etc.
> > Not sure if these are easy to rewrite or not -- too complicated to
> > think about at this hour of the evening.
> 
> Yeah, I forgot about the extent_map and how big it gets.  I think, though,
> that if we can get a 4mb object with 1024 4k lextents to encode the whole
> onode and extent_map in under 4K that will be good enough.  The blob
> update that goes with it will be ~200 bytes, and benchmarks aside, the 4k
> random write 100% fragmented object is a worst case.

Yes, it's a worst-case. But it's a "worst-case-that-everybody-looks-at" vs. a "worst-case-that-almost-nobody-looks-at".

I'm still concerned about having an oNode that's larger than a 4K block.


>
> Anyway, I'll get the blob separation branch working and we can go from
> there...
> 
> sage


* RE: bluestore blobs
  2016-08-19  3:11     ` Allen Samuels
@ 2016-08-19 13:53       ` Sage Weil
  2016-08-19 14:16         ` Allen Samuels
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-08-19 13:53 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Fri, 19 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 18, 2016 8:10 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: bluestore blobs
> > 
> > On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > > To: ceph-devel@vger.kernel.org
> > > > Subject: bluestore blobs
> > > >
> > > > I think we need to look at other changes in addition to the encoding
> > > > performance improvements.  Even if they end up being good enough,
> > > > these changes are somewhat orthogonal and at least one of them
> > > > should give us something that is even faster.
> > > >
> > > > 1. I mentioned this before, but we should keep the encoding
> > > > bluestore_blob_t around when we load the blob map.  If it's not
> > > > changed, don't reencode it.  There are no blockers for implementing this
> > currently.
> > > > It may be difficult to ensure the blobs are properly marked dirty...
> > > > I'll see if we can use proper accessors for the blob to enforce this
> > > > at compile time.  We should do that anyway.
> > >
> > > If it's not changed, then why are we re-writing it? I'm having a hard
> > > time thinking of a case worth optimizing where I want to re-write the
> > > oNode but the blob_map is unchanged. Am I missing something obvious?
> > 
> > An onode's blob_map might have 300 blobs, and a single write only updates
> > one of them.  The other 299 blobs need not be reencoded, just memcpy'd.
> 
> As long as we're just appending that's a good optimization. How often 
> does that happen? It's certainly not going to help the RBD 4K random 
> write problem.

It won't help the (l)extent_map encoding, but it avoids almost all of the 
blob reencoding.  A 4k random write will update one blob out of ~100 (or 
whatever it is).

> > > > 2. This turns the blob Put into rocksdb into two memcpy stages: one
> > > > to assemble the bufferlist (lots of bufferptrs to each untouched
> > > > blob) into a single rocksdb::Slice, and another memcpy somewhere
> > > > inside rocksdb to copy this into the write buffer.  We could extend
> > > > the rocksdb interface to take an iovec so that the first memcpy
> > > > isn't needed (and rocksdb will instead iterate over our buffers and
> > > > copy them directly into its write buffer).  This is probably a
> > > > pretty small piece of the overall time... should verify with a profiler
> > before investing too much effort here.
> > >
> > > I doubt it's the memcpy that's really the expensive part. I'll bet
> > > it's that we're transcoding from an internal to an external
> > > representation on an element by element basis. If the iovec scheme is
> > > going to help, it presumes that the internal data structure
> > > essentially matches the external data structure so that only an iovec
> > > copy is required. I'm wondering how compatible this is with the
> > > current concepts of lextent/blob/pextent.
> > 
> > I'm thinking of the xattr case (we have a bunch of strings to copy
> > verbatim) and updated-one-blob-and-kept-99-unchanged case: instead of
> > memcpy'ing them into a big contiguous buffer and having rocksdb memcpy
> > *that* into its larger buffer, give rocksdb an iovec so that the smaller
> > buffers are assembled only once.
> > 
> > These buffers will be on the order of many 10s to a couple 100s of bytes.
> > I'm not sure where the crossover point for constructing and then traversing
> > an iovec vs just copying twice would be...
> > 
> 
> Yes this will eliminate the "extra" copy, but the real problem is that 
> the oNode itself is just too large. I doubt removing one extra copy is 
> going to suddenly "solve" this problem. I think we're going to end up 
> rejiggering things so that this will be much less of a problem than it 
> is now -- time will tell.

Yeah, leaving this one for last I think... until we see memcpy show up in 
the profile.
 
> > > > 3. Even if we do the above, we're still setting a big (~4k or more?)
> > > > key into rocksdb every time we touch an object, even when a tiny
> 
> See my analysis, you're looking at 8-10K for the RBD random write case 
> -- which I think everybody cares a lot about.
> 
> > > > amount of metadata is getting changed.  This is a consequence of
> > > > embedding all of the blobs into the onode (or bnode).  That seemed
> > > > like a good idea early on when they were tiny (i.e., just an
> > > > extent), but now I'm not so sure.  I see a couple of different options:
> > > >
> > > > a) Store each blob as ($onode_key+$blobid).  When we load the onode,
> > > > load the blobs too.  They will hopefully be sequential in rocksdb
> > > > (or definitely sequential in zs).  Probably go back to using an iterator.
> > > >
> > > > b) Go all in on the "bnode" like concept.  Assign blob ids so that
> > > > they are unique for any given hash value.  Then store the blobs as
> > > > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then
> > > > when clone happens there is no onode->bnode migration magic
> > > > happening--we've already committed to storing blobs in separate
> > > > keys.  When we load the onode, keep the conditional bnode loading we
> > > > already have.. but when the bnode is loaded load up all the blobs
> > > > for the hash key.  (Okay, we could fault in blobs individually, but
> > > > that code will be more complicated.)
> 
> I like this direction. I think you'll still end up demand loading the 
> blobs in order to speed up the random read case. This scheme will result 
> in some space-amplification, both in the lextent and in the blob-map, 
> it's worth a bit of study to see how bad the metadata/data ratio 
> becomes (just as a guess, $shard.$poolid.$hash.$blobid is probably 16 + 
> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob -- 
> unless your KV store does path compression. My reading of RocksDB sst 
> file seems to indicate that it doesn't, I *believe* that ZS does [need 
> to confirm]). I'm wondering if the current notion of local vs. global 
> blobs isn't actually beneficial in that we can give local blobs 
> different names that sort with their associated oNode (which probably 
> makes the space-amp worse) which is an important optimization. We do 
> need to watch the space amp, we're going to be burning DRAM to make KV 
> accesses cheap and the amount of DRAM is proportional to the space amp.

I got this mostly working last night... just need to sort out the clone 
case (and clean up a bunch of code).  It was a relatively painless 
transition to make, although in its current form the blobs all belong to 
the bnode, and the bnode is ephemeral but remains in memory until all 
referencing onodes go away.  Mostly fine, except it means that odd 
combinations of clone could leave lots of blobs in cache that don't get 
trimmed.  Will address that later.

I'll try to finish it up this morning and get it passing tests and posted.

> > > > In both these cases, a write will dirty the onode (which is back to
> > > > being pretty small.. just xattrs and the lextent map) and 1-3 blobs (also
> > now small keys).
> 
> I'm not sure the oNode is going to be that small. Looking at the RBD 
> random 4K write case, you're going to have 1K entries each of which has 
> an offset, size and a blob-id reference in them. In my current oNode 
> compression scheme this compresses to about 1 byte per entry. However, 
> this optimization relies on being able to cheaply renumber the blob-ids, 
> which is no longer possible when the blob-ids become parts of a key (see 
> above). So now you'll have a minimum of 1.5-3 bytes extra for each 
> blob-id (because you can't assume that the blob-ids become "dense" 
> anymore). So you're looking at 2.5-4 bytes per entry or about 2.5-4 KB 
> of lextent table. Worse, because of the variable length encoding 
> you'll have to scan the entire table to deserialize it (yes, we could do 
> differential editing when we write but that's another discussion). Oh 
> and I forgot to add the 200-300 bytes of oNode and xattrs :). So while 
> this looks small compared to the current ~30K for the entire thing 
> oNode/lextent/blobmap, it's NOT a huge gain over 8-10K of the compressed 
> oNode/lextent/blobmap scheme that I published earlier.
> 
> If we want to do better we will need to separate the lextent from the 
> oNode also. It's relatively easy to move the lextents into the KV store 
> itself (there are two obvious ways to deal with this, either use the 
> native offset/size from the lextent itself OR create 'N' buckets of 
> logical offset into which we pour entries -- both of these would add 
> somewhere between 1 and 2 KV look-ups per operation -- here is where an 
> iterator would probably help).
> 
> Unfortunately, if you only process a portion of the lextent (because 
> you've made it into multiple keys and you don't want to load all of 
> them) you no longer can re-generate the refmap on the fly (another key 
> space optimization). The lack of refmap screws up a number of other 
> important algorithms -- for example the overlapping blob-map thing, etc. 
> Not sure if these are easy to rewrite or not -- too complicated to think 
> about at this hour of the evening.

Yeah, I forgot about the extent_map and how big it gets.  I think, though, 
that if we can get a 4mb object with 1024 4k lextents to encode the whole 
onode and extent_map in under 4K that will be good enough.  The blob 
update that goes with it will be ~200 bytes, and benchmarks aside, the 4k 
random write 100% fragmented object is a worst case.
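
Back-of-the-envelope on that 4K target, using only numbers already in this 
thread (nothing measured here, just arithmetic):

  #include <cstdio>

  int main() {
    const unsigned object_size    = 4u << 20;  // 4 MB object
    const unsigned lextent_size   = 4u << 10;  // fully fragmented into 4 KB lextents
    const unsigned n_lextents     = object_size / lextent_size;  // 1024 entries
    const unsigned onode_budget   = 4u << 10;  // goal: onode + extent_map under 4 KB
    const unsigned fixed_overhead = 300;       // ~200-300 bytes of onode + xattrs
    printf("budget per lextent entry: ~%.1f bytes\n",
           (onode_budget - fixed_overhead) / double(n_lextents));  // ~3.7 bytes
    return 0;
  }

So the target works out to roughly 3-4 bytes per lextent entry, which is right 
at the 2.5-4 bytes/entry estimate above once blob ids can no longer be 
renumbered densely.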

Anyway, I'll get the blob separation branch working and we can go from 
there...

sage


* Re: bluestore blobs
  2016-08-18 15:10   ` Sage Weil
  2016-08-19  3:11     ` Allen Samuels
@ 2016-08-19 11:38     ` Mark Nelson
  1 sibling, 0 replies; 24+ messages in thread
From: Mark Nelson @ 2016-08-19 11:38 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 08/18/2016 10:10 AM, Sage Weil wrote:
> On Thu, 18 Aug 2016, Allen Samuels wrote:
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Wednesday, August 17, 2016 7:26 AM
>>> To: ceph-devel@vger.kernel.org
>>> Subject: bluestore blobs
>>>
>>> I think we need to look at other changes in addition to the encoding
>>> performance improvements.  Even if they end up being good enough, these
>>> changes are somewhat orthogonal and at least one of them should give us
>>> something that is even faster.
>>>
>>> 1. I mentioned this before, but we should keep the encoding
>>> bluestore_blob_t around when we load the blob map.  If it's not changed,
>>> don't reencode it.  There are no blockers for implementing this currently.
>>> It may be difficult to ensure the blobs are properly marked dirty... I'll see if
>>> we can use proper accessors for the blob to enforce this at compile time.  We
>>> should do that anyway.
>>
>> If it's not changed, then why are we re-writing it? I'm having a hard
>> time thinking of a case worth optimizing where I want to re-write the
>> oNode but the blob_map is unchanged. Am I missing something obvious?
>
> An onode's blob_map might have 300 blobs, and a single write only updates
> one of them.  The other 299 blobs need not be reencoded, just memcpy'd.
>
>>> 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>>> assemble the bufferlist (lots of bufferptrs to each untouched blob) into a
>>> single rocksdb::Slice, and another memcpy somewhere inside rocksdb to
>>> copy this into the write buffer.  We could extend the rocksdb interface to
>>> take an iovec so that the first memcpy isn't needed (and rocksdb will instead
>>> iterate over our buffers and copy them directly into its write buffer).  This is
>>> probably a pretty small piece of the overall time... should verify with a
>>> profiler before investing too much effort here.
>>
>> I doubt it's the memcpy that's really the expensive part. I'll bet it's
>> that we're transcoding from an internal to an external representation on
>> an element by element basis. If the iovec scheme is going to help, it
>> presumes that the internal data structure essentially matches the
>> external data structure so that only an iovec copy is required. I'm
>> wondering how compatible this is with the current concepts of
>> lextent/blob/pextent.
>
> I'm thinking of the xattr case (we have a bunch of strings to copy
> verbatim) and updated-one-blob-and-kept-99-unchanged case: instead
> of memcpy'ing them into a big contiguous buffer and having rocksdb
> memcpy *that* into its larger buffer, give rocksdb an iovec so that the
> smaller buffers are assembled only once.
>
> These buffers will be on the order of many 10s to a couple 100s of bytes.
> I'm not sure where the crossover point for constructing and then
> traversing an iovec vs just copying twice would be...
>
>>> 3. Even if we do the above, we're still setting a big (~4k or more?) key into
>>> rocksdb every time we touch an object, even when a tiny amount of
>>> metadata is getting changed.  This is a consequence of embedding all of the
>>> blobs into the onode (or bnode).  That seemed like a good idea early on
>>> when they were tiny (i.e., just an extent), but now I'm not so sure.  I see a
>>> couple of different options:
>>>
>>> a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>>> the blobs too.  They will hopefully be sequential in rocksdb (or definitely
>>> sequential in zs).  Probably go back to using an iterator.
>>>
>>> b) Go all in on the "bnode" like concept.  Assign blob ids so that they are
>>> unique for any given hash value.  Then store the blobs as
>>> $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>>> clone happens there is no onode->bnode migration magic happening--we've
>>> already committed to storing blobs in separate keys.  When we load the
>>> onode, keep the conditional bnode loading we already have.. but when the
>>> bnode is loaded load up all the blobs for the hash key.  (Okay, we could fault
>>> in blobs individually, but that code will be more complicated.)
>>>
>>> In both these cases, a write will dirty the onode (which is back to being pretty
>>> small.. just xattrs and the lextent map) and 1-3 blobs (also now small keys).
>>> Updates will generate much lower metadata write traffic, which'll reduce
>>> media wear and compaction overhead.  The cost is that operations (e.g.,
>>> reads) that have to fault in an onode are now fetching several nearby keys
>>> instead of a single key.
>>>
>>>
>>> #1 and #2 are completely orthogonal to any encoding efficiency
>>> improvements we make.  And #1 is simple... I plan to implement this shortly.
>>>
>>> #3 is balancing (re)encoding efficiency against the cost of separate keys, and
>>> that tradeoff will change as encoding efficiency changes, so it'll be difficult to
>>> properly evaluate without knowing where we'll land with the (re)encode
>>> times.  I think it's a design decision made early on that is worth revisiting,
>>> though!
>>
>> It's not just the encoding efficiency, it's the cost of KV accesses. For
>> example, we could move the lextent map into the KV world similarly to
>> the way that you're suggesting the blob_maps be moved. You could do it
>> for the xattrs also. Now you've almost completely eliminated any
>> serialization/deserialization costs for the LARGE oNodes that we have
>> today but have replaced that with several KV lookups (one small Onode,
>> probably an xAttr, an lextent and a blob_map).
>>
>> I'm guessing that the "right" point is in between. I doubt that
>> separating the oNode from the xattrs pays off (especially since the
>> current code pretty much assumes that they are all cheap to get at).
>
> Yep.. this is why it'll be a hard call to make, esp when the encoding
> efficiency is changing at the same time.  I'm calling out blobs here
> because they are biggish (lextents are tiny) and nontrivial to encode
> (xattrs are just strings).
>
>> I'm wondering if it pays off to make each lextent entry a separate
>> key/value vs encoding the entire extent table (several KB) as a single
>> value. Same for the blobmap (though I suspect they have roughly the same
>> behavior w.r.t. this particular parameter)
>
> I'm guessing no because they are so small that the kv overhead will dwarf
> the encoding cost, but who knows.  I think implementing the blob case
> won't be so bad and will give us a better idea (i.e., blobs are bigger and
> more expensive and if it's not a win there then certainly don't bother
> with lextents).

This is certainly what I'm seeing in perf while I walk through and 
change the existing encoding in bluestore to use safe_appender. 
lextents are way down on the list.

>
>> We need to temper this experiment with the notion that we change the
>> lextent/blob_map encoding to something that doesn't require transcoding
>> -- if possible.
>
> Right.  I don't have any bright ideas here, though.  The variable length
> encoding makes this really hard and we still care about keeping things
> small.

Back in the onode diet thread I was wondering about the way Cap'n Proto 
does encoding.  It's basically focused on speed first, 
compression 2nd.  It still does a reasonably good job with the common 
cases where you are just trying to avoid a bunch of empty space, and 
optionally uses compression to deal with the rest.

https://capnproto.org/encoding.html

FWIW, the guy who wrote that used to be the lead on Google's protocol 
buffers.
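
To make the contrast concrete, a tiny sketch of the two styles -- hypothetical 
types, not Ceph or Cap'n Proto code: a fixed-width layout can be written and 
read with a straight memcpy, while a varint-style encoding is denser but has 
to transcode every element on both encode and decode.

  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Fixed layout: 16 bytes per entry, no per-element work.  Assumes matching
  // endianness and padding on writer and reader -- that is the tradeoff.
  struct FixedLextent {
    uint64_t logical_offset;
    uint32_t length;
    uint32_t blob_id;
  };

  std::vector<uint8_t> encode_fixed(const std::vector<FixedLextent>& v) {
    std::vector<uint8_t> out(v.size() * sizeof(FixedLextent));
    if (!v.empty())
      memcpy(out.data(), v.data(), out.size());  // one copy, zero transcoding
    return out;
  }

  // Varint-style encoding (closer to what bluestore does today): smaller on
  // disk, but every field is visited on both encode and decode.
  void put_varint(std::vector<uint8_t>& out, uint64_t n) {
    do {
      uint8_t b = n & 0x7f;
      n >>= 7;
      out.push_back(n ? uint8_t(b | 0x80) : b);
    } while (n);
  }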

>
> sage


* RE: bluestore blobs
  2016-08-18 15:10   ` Sage Weil
@ 2016-08-19  3:11     ` Allen Samuels
  2016-08-19 13:53       ` Sage Weil
  2016-08-19 11:38     ` Mark Nelson
  1 sibling, 1 reply; 24+ messages in thread
From: Allen Samuels @ 2016-08-19  3:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 18, 2016 8:10 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: bluestore blobs
> 
> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > owner@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: bluestore blobs
> > >
> > > I think we need to look at other changes in addition to the encoding
> > > performance improvements.  Even if they end up being good enough,
> > > these changes are somewhat orthogonal and at least one of them
> > > should give us something that is even faster.
> > >
> > > 1. I mentioned this before, but we should keep the encoding
> > > bluestore_blob_t around when we load the blob map.  If it's not
> > > changed, don't reencode it.  There are no blockers for implementing this
> currently.
> > > It may be difficult to ensure the blobs are properly marked dirty...
> > > I'll see if we can use proper accessors for the blob to enforce this
> > > at compile time.  We should do that anyway.
> >
> > If it's not changed, then why are we re-writing it? I'm having a hard
> > time thinking of a case worth optimizing where I want to re-write the
> > oNode but the blob_map is unchanged. Am I missing something obvious?
> 
> An onode's blob_map might have 300 blobs, and a single write only updates
> one of them.  The other 299 blobs need not be reencoded, just memcpy'd.

As long as we're just appending that's a good optimization. How often does that happen? It's certainly not going to help the RBD 4K random write problem.

> 
> > > 2. This turns the blob Put into rocksdb into two memcpy stages: one
> > > to assemble the bufferlist (lots of bufferptrs to each untouched
> > > blob) into a single rocksdb::Slice, and another memcpy somewhere
> > > inside rocksdb to copy this into the write buffer.  We could extend
> > > the rocksdb interface to take an iovec so that the first memcpy
> > > isn't needed (and rocksdb will instead iterate over our buffers and
> > > copy them directly into its write buffer).  This is probably a
> > > pretty small piece of the overall time... should verify with a profiler
> before investing too much effort here.
> >
> > I doubt it's the memcpy that's really the expensive part. I'll bet
> > it's that we're transcoding from an internal to an external
> > representation on an element by element basis. If the iovec scheme is
> > going to help, it presumes that the internal data structure
> > essentially matches the external data structure so that only an iovec
> > copy is required. I'm wondering how compatible this is with the
> > current concepts of lextent/blob/pextent.
> 
> I'm thinking of the xattr case (we have a bunch of strings to copy
> verbatim) and updated-one-blob-and-kept-99-unchanged case: instead of
> memcpy'ing them into a big contiguous buffer and having rocksdb memcpy
> *that* into its larger buffer, give rocksdb an iovec so that the smaller
> buffers are assembled only once.
> 
> These buffers will be on the order of many 10s to a couple 100s of bytes.
> I'm not sure where the crossover point for constructing and then traversing
> an iovec vs just copying twice would be...
> 

Yes this will eliminate the "extra" copy, but the real problem is that the oNode itself is just too large. I doubt removing one extra copy is going to suddenly "solve" this problem. I think we're going to end up rejiggering things so that this will be much less of a problem than it is now -- time will tell.

> > > 3. Even if we do the above, we're still setting a big (~4k or more?)
> > > key into rocksdb every time we touch an object, even when a tiny

See my analysis, you're looking at 8-10K for the RBD random write case -- which I think everybody cares a lot about.

> > > amount of metadata is getting changed.  This is a consequence of
> > > embedding all of the blobs into the onode (or bnode).  That seemed
> > > like a good idea early on when they were tiny (i.e., just an
> > > extent), but now I'm not so sure.  I see a couple of different options:
> > >
> > > a) Store each blob as ($onode_key+$blobid).  When we load the onode,
> > > load the blobs too.  They will hopefully be sequential in rocksdb
> > > (or definitely sequential in zs).  Probably go back to using an iterator.
> > >
> > > b) Go all in on the "bnode" like concept.  Assign blob ids so that
> > > they are unique for any given hash value.  Then store the blobs as
> > > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then
> > > when clone happens there is no onode->bnode migration magic
> > > happening--we've already committed to storing blobs in separate
> > > keys.  When we load the onode, keep the conditional bnode loading we
> > > already have.. but when the bnode is loaded load up all the blobs
> > > for the hash key.  (Okay, we could fault in blobs individually, but
> > > that code will be more complicated.)

I like this direction. I think you'll still end up demand loading the blobs in order to speed up the random read case. This scheme will result in some space-amplification, both in the lextent and in the blob-map, it's worth a bit of study to see how bad the metadata/data ratio becomes (just as a guess, $shard.$poolid.$hash.$blobid is probably 16 + 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob -- unless your KV store does path compression. My reading of RocksDB sst file seems to indicate that it doesn't, I *believe* that ZS does [need to confirm]). I'm wondering if the current notion of local vs. global blobs isn't actually beneficial in that we can give local blobs different names that sort with their associated oNode (which probably makes the space-amp worse) which is an important optimization. We do need to watch the space amp, we're going to be burning DRAM to make KV accesses cheap and the amount of DRAM is proportional to the space amp.
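
Roughly, the arithmetic behind that key-size concern, using the field widths 
guessed above (these are not the actual bluestore key encodings):

  #include <cstdio>

  int main() {
    const unsigned shard = 16, poolid = 16, hash = 8, blobid = 16;  // guessed widths, bytes
    const unsigned key_bytes = shard + poolid + hash + blobid;      // 56, call it ~60
    const unsigned blobs_per_onode = 300;   // the "might have 300 blobs" example above
    printf("per-blob key: %u bytes; key bytes per onode: ~%u KB\n",
           key_bytes, key_bytes * blobs_per_onode / 1024);          // ~16 KB of keys alone
    return 0;
  }

That is key bytes only, before any value data, which is why the metadata/data 
ratio and the DRAM needed to keep KV accesses cheap are worth watching.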


> > >
> > > In both these cases, a write will dirty the onode (which is back to
> > > being pretty small.. just xattrs and the lextent map) and 1-3 blobs (also
> now small keys).

I'm not sure the oNode is going to be that small. Looking at the RBD random 4K write case, you're going to have 1K entries each of which has an offset, size and a blob-id reference in them. In my current oNode compression scheme this compresses to about 1 byte per entry. However, this optimization relies on being able to cheaply renumber the blob-ids, which is no longer possible when the blob-ids become parts of a key (see above). So now you'll have a minimum of 1.5-3 bytes extra for each blob-id (because you can't assume that the blob-ids become "dense" anymore). So you're looking at 2.5-4 bytes per entry or about 2.5-4 KB of lextent table. Worse, because of the variable length encoding you'll have to scan the entire table to deserialize it (yes, we could do differential editing when we write but that's another discussion). Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). So while this looks small compared to the current ~30K for the entire thing oNode/lextent/blobmap, it's NOT a huge gain over 8-10K of the compressed oNode/lextent/blobmap scheme that I published earlier.

If we want to do better we will need to separate the lextent from the oNode also. It's relatively easy to move the lextents into the KV store itself (there are two obvious ways to deal with this, either use the native offset/size from the lextent itself OR create 'N' buckets of logical offset into which we pour entries -- both of these would add somewhere between 1 and 2 KV look-ups per operation -- here is where an iterator would probably help).

Unfortunately, if you only process a portion of the lextent (because you've made it into multiple keys and you don't want to load all of them) you no longer can re-generate the refmap on the fly (another key space optimization). The lack of refmap screws up a number of other important algorithms -- for example the overlapping blob-map thing, etc. Not sure if these are easy to rewrite or not -- too complicated to think about at this hour of the evening.
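
The refmap point is essentially a counting problem; a simplified sketch with 
hypothetical types (the real ref map tracks byte ranges within a blob, this 
just counts whole-blob references):

  #include <cstdint>
  #include <map>
  #include <vector>

  struct LextentSketch { uint64_t logical_off; uint32_t length; int64_t blob_id; };

  // A blob's reference count is derived from every lextent entry in the object
  // that points at it.  If only some of the lextent keys are loaded, the
  // unloaded pieces can hide references, so the map below undercounts and the
  // refmap can no longer be regenerated on the fly.
  std::map<int64_t, unsigned> build_refmap(const std::vector<LextentSketch>& all_lextents) {
    std::map<int64_t, unsigned> refs;
    for (const auto& l : all_lextents)
      refs[l.blob_id] += 1;
    return refs;
  }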
 
 
> > > Updates will generate much lower metadata write traffic, which'll
> > > reduce media wear and compaction overhead.  The cost is that
> > > operations (e.g.,
> > > reads) that have to fault in an onode are now fetching several
> > > nearby keys instead of a single key.
> > >
> > >
> > > #1 and #2 are completely orthogonal to any encoding efficiency
> > > improvements we make.  And #1 is simple... I plan to implement this
> shortly.
> > >
> > > #3 is balancing (re)encoding efficiency against the cost of separate
> > > keys, and that tradeoff will change as encoding efficiency changes,
> > > so it'll be difficult to properly evaluate without knowing where
> > > we'll land with the (re)encode times.  I think it's a design
> > > decision made early on that is worth revisiting, though!
> >
> > It's not just the encoding efficiency, it's the cost of KV accesses.
> > For example, we could move the lextent map into the KV world similarly
> > to the way that you're suggesting the blob_maps be moved. You could do
> > it for the xattrs also. Now you've almost completely eliminated any
> > serialization/deserialization costs for the LARGE oNodes that we have
> > today but have replaced that with several KV lookups (one small Onode,
> > probably an xAttr, an lextent and a blob_map).
> >
> > I'm guessing that the "right" point is in between. I doubt that
> > separating the oNode from the xattrs pays off (especially since the
> > current code pretty much assumes that they are all cheap to get at).
> 
> Yep.. this is why it'll be a hard call to make, esp when the encoding efficiency
> is changing at the same time.  I'm calling out blobs here because they are
> biggish (lextents are tiny) and nontrivial to encode (xattrs are just strings).
> 
> > I'm wondering if it pays off to make each lextent entry a separate
> > key/value vs encoding the entire extent table (several KB) as a single
> > value. Same for the blobmap (though I suspect they have roughly the
> > same behavior w.r.t. this particular parameter)
> 
> I'm guessing no because they are so small that the kv overhead will dwarf the
> encoding cost, but who knows.  I think implementing the blob case won't be
> so bad and will give us a better idea (i.e., blobs are bigger and more
> expensive and if it's not a win there then certainly don't bother with
> lextents).
> 
> > We need to temper this experiment with the notion that we change the
> > lextent/blob_map encoding to something that doesn't require
> > transcoding
> > -- if possible.
> 
> Right.  I don't have any bright ideas here, though.  The variable length
> encoding makes this really hard and we still care about keeping things small.

Without some clear measurements on the KV-get cost vs. object size (copy in/out plus serialize/deserialize) it's going to be difficult to figure out what to do.

> 
> sage


* Re: bluestore blobs
  2016-08-18 16:53                     ` Haomai Wang
@ 2016-08-18 17:09                       ` Haomai Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Haomai Wang @ 2016-08-18 17:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: Varada Kari, ceph-devel

On Fri, Aug 19, 2016 at 12:53 AM, Haomai Wang <haomai@xsky.com> wrote:
> On Thu, Aug 18, 2016 at 11:53 PM, Sage Weil <sweil@redhat.com> wrote:
>> On Thu, 18 Aug 2016, Haomai Wang wrote:
>>> This is my perf program https://github.com/yuyuyu101/ceph/tree/wip-wal
>>
>> Looks right...
>>
>>> It mainly simulates the WAL workload and compares rocksdb wal to filejournal
>>> Summary:
>>>
>>>
>>> iodepth 1 4096 payload:
>>> filejournal: 160 us
>>> rocksdb: 3300 us
>>>
>>> iodepth 1 2048 payload:
>>> filejournal: 180us
>>> rocksdb: 3000 us
>>>
>>> iodepth 1 5124 payload:
>>> filejournal: 240us
>>> rocksdb: 3200us
>>>
>>> iodepth 16 4096 payload:
>>> filejournal: 550us
>>> rocksdb: 27000us
>>>
>>> iodepth 16 5124 payload:
>>> filejournal: 580us
>>> rocksdb: 27100us
>>>
>>> I'm not sure, do we observe outstanding op latency in bluestore
>>> compare to filestore?
>>>
>>> From my logs, BlueFS::_fsync accounts for about half of the latency, which
>>> consists of two aio_writes and two aio_waits (data and metadata).
>>
>> Note that this will change once rocksdb warms up and starts recycling
>> existing log files.  You can force this by writing a few 10s of MB
>> of keys.  After that it will be one aio_write, aio_wait, and flush.
>>
>> Even so, the numbers don't look very good.  Want to repeat with the
>> preconditioning?
>
> Oh, I forgot about this....
>
> To keep it simple, I added "if (0 && old_dirty_seq)" to disable the metadata update.
>
> It's amazing.... Now the iodepth 1 cases are all better than filejournal
> because of the shorter path (filejournal has three threads to handle one
> io).
>
> iodepth 16 shows filejournal 3x better than rocksdb, which is expected...
>
> I'm not sure why disabling _flush_and_sync_log helps so much, or why
> it leaves another ~1ms unaccounted for....

Oh, I know where the other ~1ms comes from: rocksdb will flush the log and
call fsync, so there will be a _flush(false) and an _fsync..... Looks good enough!
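
To summarize the path being measured, with illustrative stubs only (the names 
follow this thread; this is not the real BlueFS code):

  #include <cstdint>
  #include <cstdio>

  static void aio_write_and_wait(const char* what) {
    printf("aio_write + aio_wait: %s\n", what);
  }

  struct FileSketch { uint64_t dirty_seq = 1; };

  static void flush_and_sync_log(uint64_t seq) {
    (void)seq;
    aio_write_and_wait("bluefs log (metadata)");   // the second write/wait pair
  }

  // One write/wait pair for the file data and one for the bluefs log; the
  // "if (0 && old_dirty_seq)" experiment above simply skips the second pair.
  static void fsync_sketch(FileSketch& f, bool disable_metadata_update) {
    aio_write_and_wait("file data");               // the first write/wait pair
    uint64_t old_dirty_seq = disable_metadata_update ? 0 : f.dirty_seq;
    if (old_dirty_seq)
      flush_and_sync_log(old_dirty_seq);
  }

  int main() {
    FileSketch f;
    fsync_sketch(f, false);   // normal path: two aio_write/aio_wait pairs
    fsync_sketch(f, true);    // the experiment: only the data write remains
    return 0;
  }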

>
>>
>>> And I also found that DBImpl::WriteImpl prevents multiple sync writes via
>>> the "log.getting_synced" flag, so multiple rocksdb writers may not make
>>> sense.
>>
>> Hrm, yeah.
>>
>> sage
>>
>>
>>> I can't find where the other 1/2 of the latency goes now. Is my test program
>>> missing something, or does it have a wrong mock of the WAL behavior?
>>>
>>> On Thu, Aug 18, 2016 at 12:42 AM, Sage Weil <sweil@redhat.com> wrote:
>>> > On Thu, 18 Aug 2016, Haomai Wang wrote:
>>> >> On Thu, Aug 18, 2016 at 12:10 AM, Sage Weil <sweil@redhat.com> wrote:
>>> >> > On Thu, 18 Aug 2016, Haomai Wang wrote:
>>> >> >> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
>>> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>>> >> >> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>>> >> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>>> >> >> >> >> another latency perf problem:
>>> >> >> >> >>
>>> >> >> >> >> the rocksdb log is on bluefs and mainly uses the append and fsync
>>> >> >> >> >> interfaces to complete the WAL.
>>> >> >> >> >>
>>> >> >> >> >> I found the latency between kv transaction submissions isn't negligible
>>> >> >> >> >> and limits the transaction throughput.
>>> >> >> >> >>
>>> >> >> >> >> So what if we implement an async transaction submit on the rocksdb side
>>> >> >> >> >> using callbacks? It would decrease the kv in-queue latency and would
>>> >> >> >> >> help rocksdb WAL performance get close to FileJournal. An async interface
>>> >> >> >> >> would also help control each kv transaction's size and make transactions
>>> >> >> >> >> complete smoothly instead of in tps spikes, with microsecond precision.
>>> >> >> >> >
>>> >> >> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
>>> >> >> >> > we have X bytes accumulated (I think there is an option in rocksdb that
>>> >> >> >> > drives this already, actually)?  Changing the interfaces around will
>>> >> >> >> > change the threading model (= work) but doesn't actually change who needs
>>> >> >> >> > to wait and when.
>>> >> >> >>
>>> >> >> >> why we need to wait after interface change?
>>> >> >> >>
>>> >> >> >> 1. kv thread submit transaction with callback.
>>> >> >> >> 2. rocksdb append and call bluefs aio_submit with callback
>>> >> >> >> 3. bluefs submit aio write with callback
>>> >> >> >> 4. KernelDevice will poll linux aio event and execute callback inline
>>> >> >> >> or queue finish
>>> >> >> >> 5. callback will notify we complete the kv transaction
>>> >> >> >>
>>> >> >> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
>>> >> >> >> submit interface....
>>> >> >> >>
>>> >> >> >> Is anything I'm missing?
>>> >> >> >
>>> >> >> > That can all be done with callbacks, but even if we do the kv thread will
>>> >> >> > still need to wait on the callback before doing anything else.
>>> >> >> >
>>> >> >> > Oh, you're suggesting we have multiple batches of transactions in flight.
>>> >> >> > Got it.
>>> >> >>
>>> >> >> I don't think so... because bluefs has a lock for fsync and flush, so
>>> >> >> multiple rocksdb threads will be serialized on flush...
>>> >> >
>>> >> > Oh, this was fixed recently:
>>> >> >
>>> >> >         10d055d65727e47deae4e459bc21aaa243c24a7d
>>> >> >         97699334acd59e9530d36b13d3a8408cabf848ef
>>> >>
>>> >> Hmm, looks better!
>>> >>
>>> >> The only thing I notice is that we don't have a FileWriter lock for "buffer",
>>> >> so won't multiple rocksdb writers result in corruption? I haven't looked at
>>> >> rocksdb to check, but I think with the posix backend rocksdb doesn't need
>>> >> to hold a lock to protect against log append racing.
>>> >
>>> > Hmm, there is this option:
>>> >
>>> >         https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueRocksEnv.cc#L224
>>> >
>>> > but that doesn't say anything about more than one concurrent Append.
>>> > You're probably right and we need some extra locking here...
>>> >
>>> > sage
>>> >
>>> >
>>> >
>>> >>
>>> >> >
>>> >> >> and another thing is that a single thread helps the polling case.....
>>> >> >> from my current perf, compared to the queue-based filejournal class, rocksdb
>>> >> >> shows 1.5x-2x the latency, and under heavy load it will be more.... Yes,
>>> >> >> filejournal really has a good pipeline for a pure linux aio job.
>>> >> >
>>> >> > Yeah, I think you're right.  Even if we do the parallel submission, we
>>> >> > don't want to do parallel blocking (since the callers don't want to
>>> >> > block), so we'll still want async completion/notification of commit.
>>> >> >
>>> >> > No idea if this is something the rocksdb folks are already interested in
>>> >> > or not... want to ask them on their cool facebook group?  :)
>>> >> >
>>> >> >         https://www.facebook.com/groups/rocksdb.dev/
>>> >>
>>> >> sure
>>> >>
>>> >> >
>>> >> > sage
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> >
>>> >> >> > I think we will get some of the benefit by enabling the parallel
>>> >> >> > transaction submits (so we don't funnel everything through
>>> >> >> > _kv_sync_thread).  I think we should get that merged first and see how it
>>> >> >> > behaves before taking the next step.  I forgot to ask Varada is standup
>>> >> >> > this morning what the current status of that is.  Varada?
>>> >> >> >
>>> >> >> > sage
>>> >> >> >
>>> >> >> >>
>>> >> >> >> >
>>> >> >> >> > sage
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>>> >> >> >> >> > I think we need to look at other changes in addition to the encoding
>>> >> >> >> >> > performance improvements.  Even if they end up being good enough, these
>>> >> >> >> >> > changes are somewhat orthogonal and at least one of them should give us
>>> >> >> >> >> > something that is even faster.
>>> >> >> >> >> >
>>> >> >> >> >> > 1. I mentioned this before, but we should keep the encoding
>>> >> >> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
>>> >> >> >> >> > don't reencode it.  There are no blockers for implementing this currently.
>>> >> >> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
>>> >> >> >> >> > see if we can use proper accessors for the blob to enforce this at compile
>>> >> >> >> >> > time.  We should do that anyway.
>>> >> >> >> >> >
>>> >> >> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>>> >> >> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
>>> >> >> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
>>> >> >> >> >> > rocksdb to copy this into the write buffer.  We could extend the
>>> >> >> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
>>> >> >> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
>>> >> >> >> >> > into its write buffer).  This is probably a pretty small piece of the
>>> >> >> >> >> > overall time... should verify with a profiler before investing too much
>>> >> >> >> >> > effort here.
>>> >> >> >> >> >
>>> >> >> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
>>> >> >> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
>>> >> >> >> >> > metadata is getting changed.  This is a consequence of embedding all of
>>> >> >> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
>>> >> >> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>>> >> >> >> >> > see a couple of different options:
>>> >> >> >> >> >
>>> >> >> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>>> >> >> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
>>> >> >> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
>>> >> >> >> >> >
>>> >> >> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>>> >> >> >> >> > are unique for any given hash value.  Then store the blobs as
>>> >> >> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>>> >> >> >> >> > clone happens there is no onode->bnode migration magic happening--we've
>>> >> >> >> >> > already committed to storing blobs in separate keys.  When we load the
>>> >> >> >> >> > onode, keep the conditional bnode loading we already have.. but when the
>>> >> >> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>>> >> >> >> >> > fault in blobs individually, but that code will be more complicated.)
>>> >> >> >> >> >
>>> >> >> >> >> > In both these cases, a write will dirty the onode (which is back to being
>>> >> >> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>>> >> >> >> >> > small keys).  Updates will generate much lower metadata write traffic,
>>> >> >> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
>>> >> >> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
>>> >> >> >> >> > several nearby keys instead of a single key.
>>> >> >> >> >> >
>>> >> >> >> >> >
>>> >> >> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
>>> >> >> >> >> > improvements we make.  And #1 is simple... I plan to implement this
>>> >> >> >> >> > shortly.
>>> >> >> >> >> >
>>> >> >> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
>>> >> >> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
>>> >> >> >> >> > difficult to properly evaluate without knowing where we'll land with the
>>> >> >> >> >> > (re)encode times.  I think it's a design decision made early on that is
>>> >> >> >> >> > worth revisiting, though!
>>> >> >> >> >> >
>>> >> >> >> >> > sage
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>>
>>>


* Re: bluestore blobs
  2016-08-18 15:53                   ` Sage Weil
@ 2016-08-18 16:53                     ` Haomai Wang
  2016-08-18 17:09                       ` Haomai Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Haomai Wang @ 2016-08-18 16:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Varada Kari, ceph-devel

On Thu, Aug 18, 2016 at 11:53 PM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 18 Aug 2016, Haomai Wang wrote:
>> This is my perf program https://github.com/yuyuyu101/ceph/tree/wip-wal
>
> Looks right...
>
>> It mainly simulate WAL workload and compare rocksdb wal to filejournal
>> Summary:
>>
>>
>> iodepth 1 4096 payload:
>> filejournal: 160 us
>> rocksdb: 3300 us
>>
>> iodepth 1 2048 payload:
>> filejournal: 180us
>> rocksdb: 3000 us
>>
>> iodepth 1 5124 payload:
>> filejournal: 240us
>> rocksdb: 3200us
>>
>> iodepth 16 4096 payload:
>> filejournal: 550us
>> rocksdb: 27000us
>>
>> iodepth 16 5124 payload:
>> filejournal: 580us
>> rocksdb: 27100us
>>
>> I'm not sure, do we observe outstanding op latency in bluestore
>> compare to filestore?
>>
>> From my logs, it shows BlueFS::_fsync occur 1/2 latency which contains
>> two aio_write and two aio_wait(data and metadata).
>
> Note that this will change once rocksdb warms up and starts recycling
> existing log files.  You can force this by writing a few 10s of MB
> of keys.  After that it will be one aio_write, aio_wait, and flush.
>
> Even so, the numbers don't look very good.  Want to repeat with the
> preconditioning?

Oh, I forgot about this....

To keep it simple, I added "if (0 && old_dirty_seq)" to disable the metadata update.

It's amazing.... Now the iodepth 1 cases are all better than filejournal
because of the shorter path (filejournal has three threads handling each
io).

iodepth 16 shows filejournal 3x better than rocksdb, which is expected...

I'm not sure why disabling _flush_and_sync_log helps so much, or why
another ~1ms goes missing....
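
The hack just compiles out the second flush/wait pair in fsync.  As a
self-contained sketch of the experiment (stand-in functions, not the
real BlueFS code):

    #include <cstdint>
    #include <iostream>

    // Stand-ins for the two waits: data flush vs. bluefs log/metadata sync.
    static void flush_data()         { std::cout << "aio_write + aio_wait (data)\n"; }
    static void flush_and_sync_log() { std::cout << "aio_write + aio_wait (bluefs log)\n"; }

    // The experiment: compile out the metadata sync to measure its share
    // of the fsync latency.
    static void fsync_sketch(uint64_t old_dirty_seq) {
      flush_data();
      if (0 && old_dirty_seq)   // the "disable metadata update" hack
        flush_and_sync_log();
    }

    int main() { fsync_sketch(42); return 0; }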

>
>> And I also found DBImpl::WriteImpl prevents multi sync writes via
>> "log.getting_synced" flag, so multi rocksdb writers may not make
>> sense.
>
> Hrm, yeah.
>
> sage
>
>
>> I don't find another 1/2 latency now. Is my test program missing
>> something or have a wrong mock for WAL behavior?
>>
>> On Thu, Aug 18, 2016 at 12:42 AM, Sage Weil <sweil@redhat.com> wrote:
>> > On Thu, 18 Aug 2016, Haomai Wang wrote:
>> >> On Thu, Aug 18, 2016 at 12:10 AM, Sage Weil <sweil@redhat.com> wrote:
>> >> > On Thu, 18 Aug 2016, Haomai Wang wrote:
>> >> >> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> >> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> >> >> >> another latency perf problem:
>> >> >> >> >>
>> >> >> >> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
>> >> >> >> >> complete WAL.
>> >> >> >> >>
>> >> >> >> >> I found the latency between kv transaction submitting isn't negligible
>> >> >> >> >> and limit the transaction throughput.
>> >> >> >> >>
>> >> >> >> >> So what if we implement a async transaction submit in rocksdb side
>> >> >> >> >> using callback way? It will decrease kv in queue latency. It would
>> >> >> >> >> help rocksdb WAL performance close to FileJournal. And async interface
>> >> >> >> >> will help control each kv transaction size and make transaction
>> >> >> >> >> complete smoothly instead of tps spike with us precious.
>> >> >> >> >
>> >> >> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
>> >> >> >> > we have X bytes accumulated (I think there is an option in rocksdb that
>> >> >> >> > drives this already, actually)?  Changing the interfaces around will
>> >> >> >> > change the threading model (= work) but doesn't actually change who needs
>> >> >> >> > to wait and when.
>> >> >> >>
>> >> >> >> why we need to wait after interface change?
>> >> >> >>
>> >> >> >> 1. kv thread submit transaction with callback.
>> >> >> >> 2. rocksdb append and call bluefs aio_submit with callback
>> >> >> >> 3. bluefs submit aio write with callback
>> >> >> >> 4. KernelDevice will poll linux aio event and execute callback inline
>> >> >> >> or queue finish
>> >> >> >> 5. callback will notify we complete the kv transaction
>> >> >> >>
>> >> >> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
>> >> >> >> submit interface....
>> >> >> >>
>> >> >> >> Is anything I'm missing?
>> >> >> >
>> >> >> > That can all be done with callbacks, but even if we do the kv thread will
>> >> >> > still need to wait on the callback before doing anything else.
>> >> >> >
>> >> >> > Oh, you're suggesting we have multiple batches of transactions in flight.
>> >> >> > Got it.
>> >> >>
>> >> >> I don't think so.. because bluefs has lock for fsync and flush. So
>> >> >> multi rocksdb thread will be serial to flush...
>> >> >
>> >> > Oh, this was fixed recently:
>> >> >
>> >> >         10d055d65727e47deae4e459bc21aaa243c24a7d
>> >> >         97699334acd59e9530d36b13d3a8408cabf848ef
>> >>
>> >> Hmm, looks better!
>> >>
>> >> The only thing is I notice we don't have FileWriter lock for "buffer",
>> >> so multi rocksdb writer will result in corrupt? I haven't look at
>> >> rocksdb to check, but I think if posix backend, rocksdb don't need to
>> >> have a look to protect log append racing.
>> >
>> > Hmm, there is this option:
>> >
>> >         https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueRocksEnv.cc#L224
>> >
>> > but that doesn't say anything about more than one concurrent Append.
>> > You're probably right and we need some extra locking here...
>> >
>> > sage
>> >
>> >
>> >
>> >>
>> >> >
>> >> >> and another thing is the single thread is help for polling case.....
>> >> >> from my current perf, compared queue filejournal class, rocksdb plays
>> >> >> 1.5x-2x latency, in heavy load it will be more .... Yes, filejournal
>> >> >> exactly has a good pipeline for pure linux aio job.
>> >> >
>> >> > Yeah, I think you're right.  Even if we do the parallel submission, we
>> >> > don't want to do parallel blocking (since the callers don't want to
>> >> > block), so we'll still want async completion/notification of commit.
>> >> >
>> >> > No idea if this is something the rocksdb folks are already interested in
>> >> > or not... want to ask them on their cool facebook group?  :)
>> >> >
>> >> >         https://www.facebook.com/groups/rocksdb.dev/
>> >>
>> >> sure
>> >>
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> > I think we will get some of the benefit by enabling the parallel
>> >> >> > transaction submits (so we don't funnel everything through
>> >> >> > _kv_sync_thread).  I think we should get that merged first and see how it
>> >> >> > behaves before taking the next step.  I forgot to ask Varada is standup
>> >> >> > this morning what the current status of that is.  Varada?
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> > sage
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> >> >> >> > I think we need to look at other changes in addition to the encoding
>> >> >> >> >> > performance improvements.  Even if they end up being good enough, these
>> >> >> >> >> > changes are somewhat orthogonal and at least one of them should give us
>> >> >> >> >> > something that is even faster.
>> >> >> >> >> >
>> >> >> >> >> > 1. I mentioned this before, but we should keep the encoding
>> >> >> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
>> >> >> >> >> > don't reencode it.  There are no blockers for implementing this currently.
>> >> >> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
>> >> >> >> >> > see if we can use proper accessors for the blob to enforce this at compile
>> >> >> >> >> > time.  We should do that anyway.
>> >> >> >> >> >
>> >> >> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>> >> >> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
>> >> >> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
>> >> >> >> >> > rocksdb to copy this into the write buffer.  We could extend the
>> >> >> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
>> >> >> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
>> >> >> >> >> > into its write buffer).  This is probably a pretty small piece of the
>> >> >> >> >> > overall time... should verify with a profiler before investing too much
>> >> >> >> >> > effort here.
>> >> >> >> >> >
>> >> >> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
>> >> >> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
>> >> >> >> >> > metadata is getting changed.  This is a consequence of embedding all of
>> >> >> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
>> >> >> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>> >> >> >> >> > see a couple of different options:
>> >> >> >> >> >
>> >> >> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>> >> >> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
>> >> >> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
>> >> >> >> >> >
>> >> >> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>> >> >> >> >> > are unique for any given hash value.  Then store the blobs as
>> >> >> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>> >> >> >> >> > clone happens there is no onode->bnode migration magic happening--we've
>> >> >> >> >> > already committed to storing blobs in separate keys.  When we load the
>> >> >> >> >> > onode, keep the conditional bnode loading we already have.. but when the
>> >> >> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>> >> >> >> >> > fault in blobs individually, but that code will be more complicated.)
>> >> >> >> >> >
>> >> >> >> >> > In both these cases, a write will dirty the onode (which is back to being
>> >> >> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>> >> >> >> >> > small keys).  Updates will generate much lower metadata write traffic,
>> >> >> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
>> >> >> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
>> >> >> >> >> > several nearby keys instead of a single key.
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
>> >> >> >> >> > improvements we make.  And #1 is simple... I plan to implement this
>> >> >> >> >> > shortly.
>> >> >> >> >> >
>> >> >> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
>> >> >> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
>> >> >> >> >> > difficult to properly evaluate without knowing where we'll land with the
>> >> >> >> >> > (re)encode times.  I think it's a design decision made early on that is
>> >> >> >> >> > worth revisiting, though!
>> >> >> >> >> >
>> >> >> >> >> > sage
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-18 15:49                 ` Haomai Wang
@ 2016-08-18 15:53                   ` Sage Weil
  2016-08-18 16:53                     ` Haomai Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-08-18 15:53 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Varada Kari, ceph-devel

On Thu, 18 Aug 2016, Haomai Wang wrote:
> This is my perf program https://github.com/yuyuyu101/ceph/tree/wip-wal

Looks right...
 
> It mainly simulate WAL workload and compare rocksdb wal to filejournal
> Summary:
> 
> 
> iodepth 1 4096 payload:
> filejournal: 160 us
> rocksdb: 3300 us
> 
> iodepth 1 2048 payload:
> filejournal: 180us
> rocksdb: 3000 us
> 
> iodepth 1 5124 payload:
> filejournal: 240us
> rocksdb: 3200us
> 
> iodepth 16 4096 payload:
> filejournal: 550us
> rocksdb: 27000us
> 
> iodepth 16 5124 payload:
> filejournal: 580us
> rocksdb: 27100us
> 
> I'm not sure, do we observe outstanding op latency in bluestore
> compare to filestore?
> 
> From my logs, it shows BlueFS::_fsync occur 1/2 latency which contains
> two aio_write and two aio_wait(data and metadata).

Note that this will change once rocksdb warms up and starts recycling 
existing log files.  You can force this by writing a few 10s of MB 
of keys.  After that it will be one aio_write, aio_wait, and flush.

Even so, the numbers don't look very good.  Want to repeat with the 
preconditioning?
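
Something like this would do for the preconditioning (sketch only; the
path, option values, and key count are arbitrary, and I'm going from
memory on recycle_log_file_num):

    #include <rocksdb/db.h>
    #include <cassert>
    #include <string>

    int main() {
      rocksdb::Options opts;
      opts.create_if_missing = true;
      opts.recycle_log_file_num = 4;            // let the WAL files get recycled
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/wal-precondition", &db);
      assert(s.ok());
      rocksdb::WriteOptions wo;
      wo.sync = true;
      std::string payload(4096, 'x');
      for (int i = 0; i < 10000; i++)           // ~40 MB of values
        db->Put(wo, "key" + std::to_string(i), payload);
      delete db;
      return 0;
    }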

> And I also found DBImpl::WriteImpl prevents multi sync writes via
> "log.getting_synced" flag, so multi rocksdb writers may not make
> sense.

Hrm, yeah.

sage


> I don't find another 1/2 latency now. Is my test program missing
> something or have a wrong mock for WAL behavior?
> 
> On Thu, Aug 18, 2016 at 12:42 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Thu, 18 Aug 2016, Haomai Wang wrote:
> >> On Thu, Aug 18, 2016 at 12:10 AM, Sage Weil <sweil@redhat.com> wrote:
> >> > On Thu, 18 Aug 2016, Haomai Wang wrote:
> >> >> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> >> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> >> >> >> another latency perf problem:
> >> >> >> >>
> >> >> >> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
> >> >> >> >> complete WAL.
> >> >> >> >>
> >> >> >> >> I found the latency between kv transaction submitting isn't negligible
> >> >> >> >> and limit the transaction throughput.
> >> >> >> >>
> >> >> >> >> So what if we implement a async transaction submit in rocksdb side
> >> >> >> >> using callback way? It will decrease kv in queue latency. It would
> >> >> >> >> help rocksdb WAL performance close to FileJournal. And async interface
> >> >> >> >> will help control each kv transaction size and make transaction
> >> >> >> >> complete smoothly instead of tps spike with us precious.
> >> >> >> >
> >> >> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
> >> >> >> > we have X bytes accumulated (I think there is an option in rocksdb that
> >> >> >> > drives this already, actually)?  Changing the interfaces around will
> >> >> >> > change the threading model (= work) but doesn't actually change who needs
> >> >> >> > to wait and when.
> >> >> >>
> >> >> >> why we need to wait after interface change?
> >> >> >>
> >> >> >> 1. kv thread submit transaction with callback.
> >> >> >> 2. rocksdb append and call bluefs aio_submit with callback
> >> >> >> 3. bluefs submit aio write with callback
> >> >> >> 4. KernelDevice will poll linux aio event and execute callback inline
> >> >> >> or queue finish
> >> >> >> 5. callback will notify we complete the kv transaction
> >> >> >>
> >> >> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
> >> >> >> submit interface....
> >> >> >>
> >> >> >> Is anything I'm missing?
> >> >> >
> >> >> > That can all be done with callbacks, but even if we do the kv thread will
> >> >> > still need to wait on the callback before doing anything else.
> >> >> >
> >> >> > Oh, you're suggesting we have multiple batches of transactions in flight.
> >> >> > Got it.
> >> >>
> >> >> I don't think so.. because bluefs has lock for fsync and flush. So
> >> >> multi rocksdb thread will be serial to flush...
> >> >
> >> > Oh, this was fixed recently:
> >> >
> >> >         10d055d65727e47deae4e459bc21aaa243c24a7d
> >> >         97699334acd59e9530d36b13d3a8408cabf848ef
> >>
> >> Hmm, looks better!
> >>
> >> The only thing is I notice we don't have FileWriter lock for "buffer",
> >> so multi rocksdb writer will result in corrupt? I haven't look at
> >> rocksdb to check, but I think if posix backend, rocksdb don't need to
> >> have a look to protect log append racing.
> >
> > Hmm, there is this option:
> >
> >         https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueRocksEnv.cc#L224
> >
> > but that doesn't say anything about more than one concurrent Append.
> > You're probably right and we need some extra locking here...
> >
> > sage
> >
> >
> >
> >>
> >> >
> >> >> and another thing is the single thread is help for polling case.....
> >> >> from my current perf, compared queue filejournal class, rocksdb plays
> >> >> 1.5x-2x latency, in heavy load it will be more .... Yes, filejournal
> >> >> exactly has a good pipeline for pure linux aio job.
> >> >
> >> > Yeah, I think you're right.  Even if we do the parallel submission, we
> >> > don't want to do parallel blocking (since the callers don't want to
> >> > block), so we'll still want async completion/notification of commit.
> >> >
> >> > No idea if this is something the rocksdb folks are already interested in
> >> > or not... want to ask them on their cool facebook group?  :)
> >> >
> >> >         https://www.facebook.com/groups/rocksdb.dev/
> >>
> >> sure
> >>
> >> >
> >> > sage
> >> >
> >> >
> >> >>
> >> >> >
> >> >> > I think we will get some of the benefit by enabling the parallel
> >> >> > transaction submits (so we don't funnel everything through
> >> >> > _kv_sync_thread).  I think we should get that merged first and see how it
> >> >> > behaves before taking the next step.  I forgot to ask Varada is standup
> >> >> > this morning what the current status of that is.  Varada?
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> > sage
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >> >> >> > I think we need to look at other changes in addition to the encoding
> >> >> >> >> > performance improvements.  Even if they end up being good enough, these
> >> >> >> >> > changes are somewhat orthogonal and at least one of them should give us
> >> >> >> >> > something that is even faster.
> >> >> >> >> >
> >> >> >> >> > 1. I mentioned this before, but we should keep the encoding
> >> >> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> >> >> >> >> > don't reencode it.  There are no blockers for implementing this currently.
> >> >> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
> >> >> >> >> > see if we can use proper accessors for the blob to enforce this at compile
> >> >> >> >> > time.  We should do that anyway.
> >> >> >> >> >
> >> >> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> >> >> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
> >> >> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
> >> >> >> >> > rocksdb to copy this into the write buffer.  We could extend the
> >> >> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
> >> >> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
> >> >> >> >> > into its write buffer).  This is probably a pretty small piece of the
> >> >> >> >> > overall time... should verify with a profiler before investing too much
> >> >> >> >> > effort here.
> >> >> >> >> >
> >> >> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
> >> >> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
> >> >> >> >> > metadata is getting changed.  This is a consequence of embedding all of
> >> >> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
> >> >> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
> >> >> >> >> > see a couple of different options:
> >> >> >> >> >
> >> >> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> >> >> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
> >> >> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
> >> >> >> >> >
> >> >> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
> >> >> >> >> > are unique for any given hash value.  Then store the blobs as
> >> >> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> >> >> >> >> > clone happens there is no onode->bnode migration magic happening--we've
> >> >> >> >> > already committed to storing blobs in separate keys.  When we load the
> >> >> >> >> > onode, keep the conditional bnode loading we already have.. but when the
> >> >> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
> >> >> >> >> > fault in blobs individually, but that code will be more complicated.)
> >> >> >> >> >
> >> >> >> >> > In both these cases, a write will dirty the onode (which is back to being
> >> >> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
> >> >> >> >> > small keys).  Updates will generate much lower metadata write traffic,
> >> >> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
> >> >> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
> >> >> >> >> > several nearby keys instead of a single key.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
> >> >> >> >> > improvements we make.  And #1 is simple... I plan to implement this
> >> >> >> >> > shortly.
> >> >> >> >> >
> >> >> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
> >> >> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
> >> >> >> >> > difficult to properly evaluate without knowing where we'll land with the
> >> >> >> >> > (re)encode times.  I think it's a design decision made early on that is
> >> >> >> >> > worth revisiting, though!
> >> >> >> >> >
> >> >> >> >> > sage
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 16:42               ` Sage Weil
@ 2016-08-18 15:49                 ` Haomai Wang
  2016-08-18 15:53                   ` Sage Weil
  0 siblings, 1 reply; 24+ messages in thread
From: Haomai Wang @ 2016-08-18 15:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Varada Kari, ceph-devel

This is my perf program https://github.com/yuyuyu101/ceph/tree/wip-wal

It mainly simulates a WAL workload and compares the rocksdb WAL to filejournal.
Summary:


iodepth 1 4096 payload:
filejournal: 160 us
rocksdb: 3300 us

iodepth 1 2048 payload:
filejournal: 180us
rocksdb: 3000 us

iodepth 1 5124 payload:
filejournal: 240us
rocksdb: 3200us

iodepth 16 4096 payload:
filejournal: 550us
rocksdb: 27000us

iodepth 16 5124 payload:
filejournal: 580us
rocksdb: 27100us

I'm not sure: do we observe outstanding op latency in bluestore
compared to filestore?

From my logs, BlueFS::_fsync accounts for about half of the latency,
consisting of two aio_writes and two aio_waits (data and metadata).

I also found that DBImpl::WriteImpl prevents multiple sync writes via
the "log.getting_synced" flag, so multiple rocksdb writers may not make
sense.

I can't account for the other half of the latency yet. Is my test
program missing something, or is its mock of the WAL behavior wrong?

On Thu, Aug 18, 2016 at 12:42 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 18 Aug 2016, Haomai Wang wrote:
>> On Thu, Aug 18, 2016 at 12:10 AM, Sage Weil <sweil@redhat.com> wrote:
>> > On Thu, 18 Aug 2016, Haomai Wang wrote:
>> >> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> >> >> another latency perf problem:
>> >> >> >>
>> >> >> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
>> >> >> >> complete WAL.
>> >> >> >>
>> >> >> >> I found the latency between kv transaction submitting isn't negligible
>> >> >> >> and limit the transaction throughput.
>> >> >> >>
>> >> >> >> So what if we implement a async transaction submit in rocksdb side
>> >> >> >> using callback way? It will decrease kv in queue latency. It would
>> >> >> >> help rocksdb WAL performance close to FileJournal. And async interface
>> >> >> >> will help control each kv transaction size and make transaction
>> >> >> >> complete smoothly instead of tps spike with us precious.
>> >> >> >
>> >> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
>> >> >> > we have X bytes accumulated (I think there is an option in rocksdb that
>> >> >> > drives this already, actually)?  Changing the interfaces around will
>> >> >> > change the threading model (= work) but doesn't actually change who needs
>> >> >> > to wait and when.
>> >> >>
>> >> >> why we need to wait after interface change?
>> >> >>
>> >> >> 1. kv thread submit transaction with callback.
>> >> >> 2. rocksdb append and call bluefs aio_submit with callback
>> >> >> 3. bluefs submit aio write with callback
>> >> >> 4. KernelDevice will poll linux aio event and execute callback inline
>> >> >> or queue finish
>> >> >> 5. callback will notify we complete the kv transaction
>> >> >>
>> >> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
>> >> >> submit interface....
>> >> >>
>> >> >> Is anything I'm missing?
>> >> >
>> >> > That can all be done with callbacks, but even if we do the kv thread will
>> >> > still need to wait on the callback before doing anything else.
>> >> >
>> >> > Oh, you're suggesting we have multiple batches of transactions in flight.
>> >> > Got it.
>> >>
>> >> I don't think so.. because bluefs has lock for fsync and flush. So
>> >> multi rocksdb thread will be serial to flush...
>> >
>> > Oh, this was fixed recently:
>> >
>> >         10d055d65727e47deae4e459bc21aaa243c24a7d
>> >         97699334acd59e9530d36b13d3a8408cabf848ef
>>
>> Hmm, looks better!
>>
>> The only thing is I notice we don't have FileWriter lock for "buffer",
>> so multi rocksdb writer will result in corrupt? I haven't look at
>> rocksdb to check, but I think if posix backend, rocksdb don't need to
>> have a look to protect log append racing.
>
> Hmm, there is this option:
>
>         https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueRocksEnv.cc#L224
>
> but that doesn't say anything about more than one concurrent Append.
> You're probably right and we need some extra locking here...
>
> sage
>
>
>
>>
>> >
>> >> and another thing is the single thread is help for polling case.....
>> >> from my current perf, compared queue filejournal class, rocksdb plays
>> >> 1.5x-2x latency, in heavy load it will be more .... Yes, filejournal
>> >> exactly has a good pipeline for pure linux aio job.
>> >
>> > Yeah, I think you're right.  Even if we do the parallel submission, we
>> > don't want to do parallel blocking (since the callers don't want to
>> > block), so we'll still want async completion/notification of commit.
>> >
>> > No idea if this is something the rocksdb folks are already interested in
>> > or not... want to ask them on their cool facebook group?  :)
>> >
>> >         https://www.facebook.com/groups/rocksdb.dev/
>>
>> sure
>>
>> >
>> > sage
>> >
>> >
>> >>
>> >> >
>> >> > I think we will get some of the benefit by enabling the parallel
>> >> > transaction submits (so we don't funnel everything through
>> >> > _kv_sync_thread).  I think we should get that merged first and see how it
>> >> > behaves before taking the next step.  I forgot to ask Varada is standup
>> >> > this morning what the current status of that is.  Varada?
>> >> >
>> >> > sage
>> >> >
>> >> >>
>> >> >> >
>> >> >> > sage
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> >> >> > I think we need to look at other changes in addition to the encoding
>> >> >> >> > performance improvements.  Even if they end up being good enough, these
>> >> >> >> > changes are somewhat orthogonal and at least one of them should give us
>> >> >> >> > something that is even faster.
>> >> >> >> >
>> >> >> >> > 1. I mentioned this before, but we should keep the encoding
>> >> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
>> >> >> >> > don't reencode it.  There are no blockers for implementing this currently.
>> >> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
>> >> >> >> > see if we can use proper accessors for the blob to enforce this at compile
>> >> >> >> > time.  We should do that anyway.
>> >> >> >> >
>> >> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>> >> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
>> >> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
>> >> >> >> > rocksdb to copy this into the write buffer.  We could extend the
>> >> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
>> >> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
>> >> >> >> > into its write buffer).  This is probably a pretty small piece of the
>> >> >> >> > overall time... should verify with a profiler before investing too much
>> >> >> >> > effort here.
>> >> >> >> >
>> >> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
>> >> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
>> >> >> >> > metadata is getting changed.  This is a consequence of embedding all of
>> >> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
>> >> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>> >> >> >> > see a couple of different options:
>> >> >> >> >
>> >> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>> >> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
>> >> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
>> >> >> >> >
>> >> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>> >> >> >> > are unique for any given hash value.  Then store the blobs as
>> >> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>> >> >> >> > clone happens there is no onode->bnode migration magic happening--we've
>> >> >> >> > already committed to storing blobs in separate keys.  When we load the
>> >> >> >> > onode, keep the conditional bnode loading we already have.. but when the
>> >> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>> >> >> >> > fault in blobs individually, but that code will be more complicated.)
>> >> >> >> >
>> >> >> >> > In both these cases, a write will dirty the onode (which is back to being
>> >> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>> >> >> >> > small keys).  Updates will generate much lower metadata write traffic,
>> >> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
>> >> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
>> >> >> >> > several nearby keys instead of a single key.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
>> >> >> >> > improvements we make.  And #1 is simple... I plan to implement this
>> >> >> >> > shortly.
>> >> >> >> >
>> >> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
>> >> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
>> >> >> >> > difficult to properly evaluate without knowing where we'll land with the
>> >> >> >> > (re)encode times.  I think it's a design decision made early on that is
>> >> >> >> > worth revisiting, though!
>> >> >> >> >
>> >> >> >> > sage
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: bluestore blobs
  2016-08-18  0:05 ` Allen Samuels
@ 2016-08-18 15:10   ` Sage Weil
  2016-08-19  3:11     ` Allen Samuels
  2016-08-19 11:38     ` Mark Nelson
  0 siblings, 2 replies; 24+ messages in thread
From: Sage Weil @ 2016-08-18 15:10 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Thu, 18 Aug 2016, Allen Samuels wrote:
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Wednesday, August 17, 2016 7:26 AM
> > To: ceph-devel@vger.kernel.org
> > Subject: bluestore blobs
> > 
> > I think we need to look at other changes in addition to the encoding
> > performance improvements.  Even if they end up being good enough, these
> > changes are somewhat orthogonal and at least one of them should give us
> > something that is even faster.
> > 
> > 1. I mentioned this before, but we should keep the encoding
> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> > don't reencode it.  There are no blockers for implementing this currently.
> > It may be difficult to ensure the blobs are properly marked dirty... I'll see if
> > we can use proper accessors for the blob to enforce this at compile time.  We
> > should do that anyway.
> 
> If it's not changed, then why are we re-writing it? I'm having a hard 
> time thinking of a case worth optimizing where I want to re-write the 
> oNode but the blob_map is unchanged. Am I missing something obvious?

An onode's blob_map might have 300 blobs, and a single write only updates 
one of them.  The other 299 blobs need not be reencoded, just memcpy'd.
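
In other words, something like this (hypothetical types, not the real
bluestore_blob_t/bufferlist code; just the shape of caching the encoded
bytes per blob):

    #include <cstdint>
    #include <map>
    #include <string>

    // Hypothetical blob that keeps its last encoding around and only
    // re-encodes when it has actually been modified.
    struct BlobSketch {
      std::string fields;           // stand-in for the real blob state
      bool dirty = false;
      std::string cached_encoding;  // valid whenever !dirty

      const std::string& encoded() {
        if (dirty || cached_encoding.empty()) {
          cached_encoding = fields; // real code would do the full encode here
          dirty = false;
        }
        return cached_encoding;
      }
    };

    // Rebuilding the onode/bnode value: the 299 untouched blobs are plain
    // copies; only the modified blob pays the encode cost.
    std::string encode_blob_map(std::map<uint64_t, BlobSketch>& blob_map) {
      std::string out;
      for (auto& kv : blob_map)
        out.append(kv.second.encoded());
      return out;
    }

    int main() {
      std::map<uint64_t, BlobSketch> m;
      m[1] = {"blob-one"};
      m[2] = {"blob-two"};
      m[2].dirty = true;            // a write touched blob 2 only
      return encode_blob_map(m).empty() ? 1 : 0;
    }

Marking a blob dirty through an accessor is the piece we'd want the
compiler to help enforce.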

> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> > assemble the bufferlist (lots of bufferptrs to each untouched blob) into a
> > single rocksdb::Slice, and another memcpy somewhere inside rocksdb to
> > copy this into the write buffer.  We could extend the rocksdb interface to
> > take an iovec so that the first memcpy isn't needed (and rocksdb will instead
> > iterate over our buffers and copy them directly into its write buffer).  This is
> > probably a pretty small piece of the overall time... should verify with a
> > profiler before investing too much effort here.
> 
> I doubt it's the memcpy that's really the expensive part. I'll bet it's 
> that we're transcoding from an internal to an external representation on 
> an element by element basis. If the iovec scheme is going to help, it 
> presumes that the internal data structure essentially matches the 
> external data structure so that only an iovec copy is required. I'm 
> wondering how compatible this is with the current concepts of 
> lextent/blob/pextent.

I'm thinking of the xattr case (we have a bunch of strings to copy 
verbatim) and the updated-one-blob-and-kept-99-unchanged case: instead 
of memcpy'ing them into a big contiguous buffer and having rocksdb 
memcpy *that* into its larger buffer, give rocksdb an iovec so that the 
smaller buffers are assembled only once.

These buffers will be on the order of many 10s to a couple 100s of bytes.  
I'm not sure where the crossover point for constructing and then 
traversing an iovec vs just copying twice would be...
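
If I'm reading the rocksdb headers right, WriteBatch already has a
gather-style Put that takes SliceParts (an array of Slices), which is
pretty much the iovec idea.  A sketch, assuming that overload copies the
parts straight into the write buffer:

    #include <rocksdb/slice.h>
    #include <rocksdb/write_batch.h>
    #include <string>
    #include <vector>

    // Hand rocksdb the per-piece buffers (xattrs, lextent map, blobs)
    // instead of flattening them into one contiguous buffer first.
    void put_onode(rocksdb::WriteBatch& batch,
                   const std::string& key,
                   const std::vector<std::string>& encoded_pieces) {
      std::vector<rocksdb::Slice> parts;
      parts.reserve(encoded_pieces.size());
      for (const auto& p : encoded_pieces)
        parts.emplace_back(p);                  // pointers only, no memcpy
      rocksdb::Slice key_slice(key);
      rocksdb::SliceParts kp(&key_slice, 1);
      rocksdb::SliceParts vp(parts.data(), static_cast<int>(parts.size()));
      batch.Put(kp, vp);                        // one copy, into rocksdb's buffer
    }

    int main() {
      rocksdb::WriteBatch batch;
      put_onode(batch, "onode-key",
                {"xattrs...", "lextents...", "blob-0...", "blob-1..."});
      return 0;
    }

That would drop the first memcpy; whether it matters is still a question
for the profiler.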
 
> > 3. Even if we do the above, we're still setting a big (~4k or more?) key into
> > rocksdb every time we touch an object, even when a tiny amount of
> > metadata is getting changed.  This is a consequence of embedding all of the
> > blobs into the onode (or bnode).  That seemed like a good idea early on
> > when they were tiny (i.e., just an extent), but now I'm not so sure.  I see a
> > couple of different options:
> > 
> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> > the blobs too.  They will hopefully be sequential in rocksdb (or definitely
> > sequential in zs).  Probably go back to using an iterator.
> > 
> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they are
> > unique for any given hash value.  Then store the blobs as
> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> > clone happens there is no onode->bnode migration magic happening--we've
> > already committed to storing blobs in separate keys.  When we load the
> > onode, keep the conditional bnode loading we already have.. but when the
> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could fault
> > in blobs individually, but that code will be more complicated.)
> > 
> > In both these cases, a write will dirty the onode (which is back to being pretty
> > small.. just xattrs and the lextent map) and 1-3 blobs (also now small keys).
> > Updates will generate much lower metadata write traffic, which'll reduce
> > media wear and compaction overhead.  The cost is that operations (e.g.,
> > reads) that have to fault in an onode are now fetching several nearby keys
> > instead of a single key.
> > 
> > 
> > #1 and #2 are completely orthogonal to any encoding efficiency
> > improvements we make.  And #1 is simple... I plan to implement this shortly.
> > 
> > #3 is balancing (re)encoding efficiency against the cost of separate keys, and
> > that tradeoff will change as encoding efficiency changes, so it'll be difficult to
> > properly evaluate without knowing where we'll land with the (re)encode
> > times.  I think it's a design decision made early on that is worth revisiting,
> > though!
> 
> It's not just the encoding efficiency, it's the cost of KV accesses. For 
> example, we could move the lextent map into the KV world similarly to 
> the way that you're suggesting the blob_maps be moved. You could do it 
> for the xattrs also. Now you've almost completely eliminated any 
> serialization/deserialization costs for the LARGE oNodes that we have 
> today but have replaced that with several KV lookups (one small Onode, 
> probably an xAttr, an lextent and a blob_map).
> 
> I'm guessing that the "right" point is in between. I doubt that 
> separating the oNode from the xattrs pays off (especially since the 
> current code pretty much assumes that they are all cheap to get at).

Yep.. this is why it'll be a hard call to make, esp when the encoding 
efficiency is changing at the same time.  I'm calling out blobs here 
because they are biggish (lextents are tiny) and nontrivial to encode 
(xattrs are just strings).

> I'm wondering if it pays off to make each lextent entry a separate 
> key/value vs encoding the entire extent table (several KB) as a single 
> value. Same for the blobmap (though I suspect they have roughly the same 
> behavior w.r.t. this particular parameter)

I'm guessing no because they are so small that the kv overhead will dwarf 
the encoding cost, but who knows.  I think implementing the blob case 
won't be so bad and will give us a better idea (i.e., blobs are bigger and 
more expensive and if it's not a win there then certainly don't bother 
with lextents).

> We need to temper this experiment with the notion that we change the 
> lextent/blob_map encoding to something that doesn't require transcoding 
> -- if possible.

Right.  I don't have any bright ideas here, though.  The variable length 
encoding makes this really hard and we still care about keeping things 
small.
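
For context, the tension is that with varint-style encoding the width of
each field depends on its value, so the on-disk bytes can't be reused or
patched in place: change anything and the whole structure has to be
re-encoded.  A generic example of that kind of encoding (not the actual
bluestore encoder):

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <string>

    // LEB128-style varint: small values take fewer bytes.
    void put_varint(std::string& out, uint64_t v) {
      while (v >= 0x80) {
        out.push_back(static_cast<char>((v & 0x7f) | 0x80));
        v >>= 7;
      }
      out.push_back(static_cast<char>(v));
    }

    uint64_t get_varint(const std::string& in, size_t& pos) {
      uint64_t v = 0;
      for (int shift = 0;; shift += 7) {
        uint8_t b = static_cast<uint8_t>(in[pos++]);
        v |= static_cast<uint64_t>(b & 0x7f) << shift;
        if (!(b & 0x80))
          break;
      }
      return v;
    }

    int main() {
      std::string buf;
      put_varint(buf, 300);        // 2 bytes
      put_varint(buf, 0x10000);    // 3 bytes: same field, different width
      size_t pos = 0;
      assert(get_varint(buf, pos) == 300);
      assert(get_varint(buf, pos) == 0x10000);
      return 0;
    }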

sage

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: bluestore blobs
  2016-08-17 14:26 Sage Weil
  2016-08-17 15:00 ` Haomai Wang
@ 2016-08-18  0:05 ` Allen Samuels
  2016-08-18 15:10   ` Sage Weil
  1 sibling, 1 reply; 24+ messages in thread
From: Allen Samuels @ 2016-08-18  0:05 UTC (permalink / raw)
  To: Sage Weil, ceph-devel




Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, August 17, 2016 7:26 AM
> To: ceph-devel@vger.kernel.org
> Subject: bluestore blobs
> 
> I think we need to look at other changes in addition to the encoding
> performance improvements.  Even if they end up being good enough, these
> changes are somewhat orthogonal and at least one of them should give us
> something that is even faster.
> 
> 1. I mentioned this before, but we should keep the encoding
> bluestore_blob_t around when we load the blob map.  If it's not changed,
> don't reencode it.  There are no blockers for implementing this currently.
> It may be difficult to ensure the blobs are properly marked dirty... I'll see if
> we can use proper accessors for the blob to enforce this at compile time.  We
> should do that anyway.

If it's not changed, then why are we re-writing it? I'm having a hard time thinking of a case worth optimizing where I want to re-write the oNode but the blob_map is unchanged. Am I missing something obvious?

> 
> 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> assemble the bufferlist (lots of bufferptrs to each untouched blob) into a
> single rocksdb::Slice, and another memcpy somewhere inside rocksdb to
> copy this into the write buffer.  We could extend the rocksdb interface to
> take an iovec so that the first memcpy isn't needed (and rocksdb will instead
> iterate over our buffers and copy them directly into its write buffer).  This is
> probably a pretty small piece of the overall time... should verify with a
> profiler before investing too much effort here.

I doubt it's the memcpy that's really the expensive part. I'll bet it's that we're transcoding from an internal to an external representation on an element by element basis.
If the iovec scheme is going to help, it presumes that the internal data structure essentially matches the external data structure so that only an iovec copy is required. I'm wondering how compatible this is with the current concepts of lextent/blob/pextent.

> 
> 3. Even if we do the above, we're still setting a big (~4k or more?) key into
> rocksdb every time we touch an object, even when a tiny amount of
> metadata is getting changed.  This is a consequence of embedding all of the
> blobs into the onode (or bnode).  That seemed like a good idea early on
> when they were tiny (i.e., just an extent), but now I'm not so sure.  I see a
> couple of different options:
> 
> a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> the blobs too.  They will hopefully be sequential in rocksdb (or definitely
> sequential in zs).  Probably go back to using an iterator.
> 
> b) Go all in on the "bnode" like concept.  Assign blob ids so that they are
> unique for any given hash value.  Then store the blobs as
> $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> clone happens there is no onode->bnode migration magic happening--we've
> already committed to storing blobs in separate keys.  When we load the
> onode, keep the conditional bnode loading we already have.. but when the
> bnode is loaded load up all the blobs for the hash key.  (Okay, we could fault
> in blobs individually, but that code will be more complicated.)
> 
> In both these cases, a write will dirty the onode (which is back to being pretty
> small.. just xattrs and the lextent map) and 1-3 blobs (also now small keys).
> Updates will generate much lower metadata write traffic, which'll reduce
> media wear and compaction overhead.  The cost is that operations (e.g.,
> reads) that have to fault in an onode are now fetching several nearby keys
> instead of a single key.
> 
> 
> #1 and #2 are completely orthogonal to any encoding efficiency
> improvements we make.  And #1 is simple... I plan to implement this shortly.
> 
> #3 is balancing (re)encoding efficiency against the cost of separate keys, and
> that tradeoff will change as encoding efficiency changes, so it'll be difficult to
> properly evaluate without knowing where we'll land with the (re)encode
> times.  I think it's a design decision made early on that is worth revisiting,
> though!

It's not just the encoding efficiency, it's the cost of KV accesses. For example, we could move the lextent map into the KV world similarly to the way that you're suggesting the blob_maps be moved. You could do it for the xattrs also. Now you've almost completely eliminated any serialization/deserialization costs for the LARGE oNodes that we have today but have replaced that with several KV lookups (one small Onode, probably an xAttr, an lextent and a blob_map).

I'm guessing that the "right" point is in between. I doubt that separating the oNode from the xattrs pays off (especially since the current code pretty much assumes that they are all cheap to get at).

I'm wondering if it pays off to make each lextent entry a separate key/value vs encoding the entire extent table (several KB) as a single value.
Same for the blobmap (though I suspect they have roughly the same behavior w.r.t. this particular parameter)

We need to temper this experiment with the notion that we change the lextent/blob_map encoding to something that doesn't require transcoding -- if possible.
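
To make the per-key experiments concrete, here is a hypothetical key
layout for the "blob per key" direction (nothing about the real
bluestore key schema is implied; fixed-width hex just keeps the keys for
one hash value adjacent):

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Hypothetical $shard.$poolid.$hash.$blobid key builder: a write then
    // dirties the small onode record plus only the touched blob record(s).
    std::string blob_key(uint8_t shard, uint64_t poolid,
                         uint32_t hash, uint64_t blobid) {
      char buf[64];
      snprintf(buf, sizeof(buf), "B%02x.%016llx.%08x.%016llx",
               static_cast<unsigned>(shard),
               static_cast<unsigned long long>(poolid),
               static_cast<unsigned>(hash),
               static_cast<unsigned long long>(blobid));
      return buf;
    }

    int main() {
      // One small write touches the onode key plus 1-3 of these:
      printf("%s\n", blob_key(0, 1, 0x2a3b4c5d, 7).c_str());
      return 0;
    }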

> 
> sage

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 16:32             ` Haomai Wang
@ 2016-08-17 16:42               ` Sage Weil
  2016-08-18 15:49                 ` Haomai Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-08-17 16:42 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Varada Kari, ceph-devel

On Thu, 18 Aug 2016, Haomai Wang wrote:
> On Thu, Aug 18, 2016 at 12:10 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Thu, 18 Aug 2016, Haomai Wang wrote:
> >> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> >> >> another latency perf problem:
> >> >> >>
> >> >> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
> >> >> >> complete WAL.
> >> >> >>
> >> >> >> I found the latency between kv transaction submitting isn't negligible
> >> >> >> and limit the transaction throughput.
> >> >> >>
> >> >> >> So what if we implement a async transaction submit in rocksdb side
> >> >> >> using callback way? It will decrease kv in queue latency. It would
> >> >> >> help rocksdb WAL performance close to FileJournal. And async interface
> >> >> >> will help control each kv transaction size and make transaction
> >> >> >> complete smoothly instead of tps spike with us precious.
> >> >> >
> >> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
> >> >> > we have X bytes accumulated (I think there is an option in rocksdb that
> >> >> > drives this already, actually)?  Changing the interfaces around will
> >> >> > change the threading model (= work) but doesn't actually change who needs
> >> >> > to wait and when.
> >> >>
> >> >> why we need to wait after interface change?
> >> >>
> >> >> 1. kv thread submit transaction with callback.
> >> >> 2. rocksdb append and call bluefs aio_submit with callback
> >> >> 3. bluefs submit aio write with callback
> >> >> 4. KernelDevice will poll linux aio event and execute callback inline
> >> >> or queue finish
> >> >> 5. callback will notify we complete the kv transaction
> >> >>
> >> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
> >> >> submit interface....
> >> >>
> >> >> Is anything I'm missing?
> >> >
> >> > That can all be done with callbacks, but even if we do the kv thread will
> >> > still need to wait on the callback before doing anything else.
> >> >
> >> > Oh, you're suggesting we have multiple batches of transactions in flight.
> >> > Got it.
> >>
> >> I don't think so.. because bluefs has lock for fsync and flush. So
> >> multi rocksdb thread will be serial to flush...
> >
> > Oh, this was fixed recently:
> >
> >         10d055d65727e47deae4e459bc21aaa243c24a7d
> >         97699334acd59e9530d36b13d3a8408cabf848ef
> 
> Hmm, looks better!
> 
> The only thing is I notice we don't have FileWriter lock for "buffer",
> so multi rocksdb writer will result in corrupt? I haven't look at
> rocksdb to check, but I think if posix backend, rocksdb don't need to
> have a look to protect log append racing.

Hmm, there is this option:

	https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueRocksEnv.cc#L224

but that doesn't say anything about more than one concurrent Append.  
You're probably right and we need some extra locking here...
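
Something along these lines, presumably (hypothetical sketch; the real
FileWriter has more state than a single buffer):

    #include <cstddef>
    #include <mutex>
    #include <string>

    // Serialize writers appending to the shared log buffer so two rocksdb
    // log appenders can't interleave their bytes.
    struct FileWriterSketch {
      std::mutex lock;        // protects buffer (and the append position)
      std::string buffer;     // stand-in for the real bufferlist

      void append(const char* data, size_t len) {
        std::lock_guard<std::mutex> l(lock);
        buffer.append(data, len);
      }
    };

    int main() {
      FileWriterSketch fw;
      fw.append("log record", 10);
      return fw.buffer.size() == 10 ? 0 : 1;
    }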

sage



> 
> >
> >> and another thing is the single thread is help for polling case.....
> >> from my current perf, compared queue filejournal class, rocksdb plays
> >> 1.5x-2x latency, in heavy load it will be more .... Yes, filejournal
> >> exactly has a good pipeline for pure linux aio job.
> >
> > Yeah, I think you're right.  Even if we do the parallel submission, we
> > don't want to do parallel blocking (since the callers don't want to
> > block), so we'll still want async completion/notification of commit.
> >
> > No idea if this is something the rocksdb folks are already interested in
> > or not... want to ask them on their cool facebook group?  :)
> >
> >         https://www.facebook.com/groups/rocksdb.dev/
> 
> sure
> 
> >
> > sage
> >
> >
> >>
> >> >
> >> > I think we will get some of the benefit by enabling the parallel
> >> > transaction submits (so we don't funnel everything through
> >> > _kv_sync_thread).  I think we should get that merged first and see how it
> >> > behaves before taking the next step.  I forgot to ask Varada is standup
> >> > this morning what the current status of that is.  Varada?
> >> >
> >> > sage
> >> >
> >> >>
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >> >> > I think we need to look at other changes in addition to the encoding
> >> >> >> > performance improvements.  Even if they end up being good enough, these
> >> >> >> > changes are somewhat orthogonal and at least one of them should give us
> >> >> >> > something that is even faster.
> >> >> >> >
> >> >> >> > 1. I mentioned this before, but we should keep the encoding
> >> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> >> >> >> > don't reencode it.  There are no blockers for implementing this currently.
> >> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
> >> >> >> > see if we can use proper accessors for the blob to enforce this at compile
> >> >> >> > time.  We should do that anyway.
> >> >> >> >
> >> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> >> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
> >> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
> >> >> >> > rocksdb to copy this into the write buffer.  We could extend the
> >> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
> >> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
> >> >> >> > into its write buffer).  This is probably a pretty small piece of the
> >> >> >> > overall time... should verify with a profiler before investing too much
> >> >> >> > effort here.
> >> >> >> >
> >> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
> >> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
> >> >> >> > metadata is getting changed.  This is a consequence of embedding all of
> >> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
> >> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
> >> >> >> > see a couple of different options:
> >> >> >> >
> >> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> >> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
> >> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
> >> >> >> >
> >> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
> >> >> >> > are unique for any given hash value.  Then store the blobs as
> >> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> >> >> >> > clone happens there is no onode->bnode migration magic happening--we've
> >> >> >> > already committed to storing blobs in separate keys.  When we load the
> >> >> >> > onode, keep the conditional bnode loading we already have.. but when the
> >> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
> >> >> >> > fault in blobs individually, but that code will be more complicated.)
> >> >> >> >
> >> >> >> > In both these cases, a write will dirty the onode (which is back to being
> >> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
> >> >> >> > small keys).  Updates will generate much lower metadata write traffic,
> >> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
> >> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
> >> >> >> > several nearby keys instead of a single key.
> >> >> >> >
> >> >> >> >
> >> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
> >> >> >> > improvements we make.  And #1 is simple... I plan to implement this
> >> >> >> > shortly.
> >> >> >> >
> >> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
> >> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
> >> >> >> > difficult to properly evaluate without knowing where we'll land with the
> >> >> >> > (re)encode times.  I think it's a design decision made early on that is
> >> >> >> > worth revisiting, though!
> >> >> >> >
> >> >> >> > sage
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 16:10           ` Sage Weil
@ 2016-08-17 16:32             ` Haomai Wang
  2016-08-17 16:42               ` Sage Weil
  0 siblings, 1 reply; 24+ messages in thread
From: Haomai Wang @ 2016-08-17 16:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: Varada Kari, ceph-devel

On Thu, Aug 18, 2016 at 12:10 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 18 Aug 2016, Haomai Wang wrote:
>> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
>> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> >> another latency perf problem:
>> >> >>
>> >> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
>> >> >> complete WAL.
>> >> >>
>> >> >> I found the latency between kv transaction submitting isn't negligible
>> >> >> and limit the transaction throughput.
>> >> >>
>> >> >> So what if we implement a async transaction submit in rocksdb side
>> >> >> using callback way? It will decrease kv in queue latency. It would
>> >> >> help rocksdb WAL performance close to FileJournal. And async interface
>> >> >> will help control each kv transaction size and make transaction
>> >> >> complete smoothly instead of tps spike with us precious.
>> >> >
>> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
>> >> > we have X bytes accumulated (I think there is an option in rocksdb that
>> >> > drives this already, actually)?  Changing the interfaces around will
>> >> > change the threading model (= work) but doesn't actually change who needs
>> >> > to wait and when.
>> >>
>> >> why we need to wait after interface change?
>> >>
>> >> 1. kv thread submit transaction with callback.
>> >> 2. rocksdb append and call bluefs aio_submit with callback
>> >> 3. bluefs submit aio write with callback
>> >> 4. KernelDevice will poll linux aio event and execute callback inline
>> >> or queue finish
>> >> 5. callback will notify we complete the kv transaction
>> >>
>> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
>> >> submit interface....
>> >>
>> >> Is anything I'm missing?
>> >
>> > That can all be done with callbacks, but even if we do the kv thread will
>> > still need to wait on the callback before doing anything else.
>> >
>> > Oh, you're suggesting we have multiple batches of transactions in flight.
>> > Got it.
>>
>> I don't think so.. because bluefs has lock for fsync and flush. So
>> multi rocksdb thread will be serial to flush...
>
> Oh, this was fixed recently:
>
>         10d055d65727e47deae4e459bc21aaa243c24a7d
>         97699334acd59e9530d36b13d3a8408cabf848ef

Hmm, looks better!

The only thing I notice is that we don't have a FileWriter lock protecting
"buffer", so won't multiple rocksdb writers end up corrupting it? I haven't
looked at rocksdb to check, but I think with the posix backend rocksdb
doesn't need to protect against racing log appends itself.
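
The race I'm worried about is easy to sketch in isolation. A tiny
standalone model of the kind of guard I'd expect around a shared append
buffer -- LogWriterSketch and everything in it are invented for
illustration, not the actual BlueFS FileWriter:

  #include <cassert>
  #include <cstddef>
  #include <mutex>
  #include <string>
  #include <thread>
  #include <vector>

  struct LogWriterSketch {
    std::mutex lock;        // protects buffer
    std::string buffer;     // bytes appended but not yet flushed

    void append(const std::string& rec) {
      std::lock_guard<std::mutex> l(lock);
      buffer.append(rec);   // the whole record lands contiguously
    }

    std::size_t flush() {   // pretend flush: count and drop the bytes
      std::lock_guard<std::mutex> l(lock);
      std::size_t n = buffer.size();
      buffer.clear();
      return n;
    }
  };

  int main() {
    LogWriterSketch w;
    std::vector<std::thread> writers;
    for (int t = 0; t < 4; ++t)
      writers.emplace_back([&w] {
        for (int i = 0; i < 1000; ++i)
          w.append(std::string(64, 'x'));
      });
    for (auto& t : writers) t.join();
    assert(w.flush() == 4u * 1000u * 64u);  // nothing lost or torn
    return 0;
  }

Without the mutex, two concurrent append() calls could interleave bytes
inside the buffer; with it, each record stays contiguous.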

>
>> and another thing is the single thread is help for polling case.....
>> from my current perf, compared queue filejournal class, rocksdb plays
>> 1.5x-2x latency, in heavy load it will be more .... Yes, filejournal
>> exactly has a good pipeline for pure linux aio job.
>
> Yeah, I think you're right.  Even if we do the parallel submission, we
> don't want to do parallel blocking (since the callers don't want to
> block), so we'll still want async completion/notification of commit.
>
> No idea if this is something the rocksdb folks are already interested in
> or not... want to ask them on their cool facebook group?  :)
>
>         https://www.facebook.com/groups/rocksdb.dev/

sure

>
> sage
>
>
>>
>> >
>> > I think we will get some of the benefit by enabling the parallel
>> > transaction submits (so we don't funnel everything through
>> > _kv_sync_thread).  I think we should get that merged first and see how it
>> > behaves before taking the next step.  I forgot to ask Varada is standup
>> > this morning what the current status of that is.  Varada?
>> >
>> > sage
>> >
>> >>
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> >> > I think we need to look at other changes in addition to the encoding
>> >> >> > performance improvements.  Even if they end up being good enough, these
>> >> >> > changes are somewhat orthogonal and at least one of them should give us
>> >> >> > something that is even faster.
>> >> >> >
>> >> >> > 1. I mentioned this before, but we should keep the encoding
>> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
>> >> >> > don't reencode it.  There are no blockers for implementing this currently.
>> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
>> >> >> > see if we can use proper accessors for the blob to enforce this at compile
>> >> >> > time.  We should do that anyway.
>> >> >> >
>> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
>> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
>> >> >> > rocksdb to copy this into the write buffer.  We could extend the
>> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
>> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
>> >> >> > into its write buffer).  This is probably a pretty small piece of the
>> >> >> > overall time... should verify with a profiler before investing too much
>> >> >> > effort here.
>> >> >> >
>> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
>> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
>> >> >> > metadata is getting changed.  This is a consequence of embedding all of
>> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
>> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>> >> >> > see a couple of different options:
>> >> >> >
>> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
>> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
>> >> >> >
>> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>> >> >> > are unique for any given hash value.  Then store the blobs as
>> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>> >> >> > clone happens there is no onode->bnode migration magic happening--we've
>> >> >> > already committed to storing blobs in separate keys.  When we load the
>> >> >> > onode, keep the conditional bnode loading we already have.. but when the
>> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>> >> >> > fault in blobs individually, but that code will be more complicated.)
>> >> >> >
>> >> >> > In both these cases, a write will dirty the onode (which is back to being
>> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>> >> >> > small keys).  Updates will generate much lower metadata write traffic,
>> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
>> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
>> >> >> > several nearby keys instead of a single key.
>> >> >> >
>> >> >> >
>> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
>> >> >> > improvements we make.  And #1 is simple... I plan to implement this
>> >> >> > shortly.
>> >> >> >
>> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
>> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
>> >> >> > difficult to properly evaluate without knowing where we'll land with the
>> >> >> > (re)encode times.  I think it's a design decision made early on that is
>> >> >> > worth revisiting, though!
>> >> >> >
>> >> >> > sage
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 16:00         ` Haomai Wang
@ 2016-08-17 16:10           ` Sage Weil
  2016-08-17 16:32             ` Haomai Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-08-17 16:10 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Varada Kari, ceph-devel

On Thu, 18 Aug 2016, Haomai Wang wrote:
> On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
> >> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> >> another latency perf problem:
> >> >>
> >> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
> >> >> complete WAL.
> >> >>
> >> >> I found the latency between kv transaction submitting isn't negligible
> >> >> and limit the transaction throughput.
> >> >>
> >> >> So what if we implement a async transaction submit in rocksdb side
> >> >> using callback way? It will decrease kv in queue latency. It would
> >> >> help rocksdb WAL performance close to FileJournal. And async interface
> >> >> will help control each kv transaction size and make transaction
> >> >> complete smoothly instead of tps spike with us precious.
> >> >
> >> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
> >> > we have X bytes accumulated (I think there is an option in rocksdb that
> >> > drives this already, actually)?  Changing the interfaces around will
> >> > change the threading model (= work) but doesn't actually change who needs
> >> > to wait and when.
> >>
> >> why we need to wait after interface change?
> >>
> >> 1. kv thread submit transaction with callback.
> >> 2. rocksdb append and call bluefs aio_submit with callback
> >> 3. bluefs submit aio write with callback
> >> 4. KernelDevice will poll linux aio event and execute callback inline
> >> or queue finish
> >> 5. callback will notify we complete the kv transaction
> >>
> >> the main task is implement logics in rocksdb log*.cc and bluefs aio
> >> submit interface....
> >>
> >> Is anything I'm missing?
> >
> > That can all be done with callbacks, but even if we do the kv thread will
> > still need to wait on the callback before doing anything else.
> >
> > Oh, you're suggesting we have multiple batches of transactions in flight.
> > Got it.
> 
> I don't think so.. because bluefs has lock for fsync and flush. So
> multi rocksdb thread will be serial to flush...

Oh, this was fixed recently:

	10d055d65727e47deae4e459bc21aaa243c24a7d
	97699334acd59e9530d36b13d3a8408cabf848ef

> and another thing is the single thread is help for polling case..... 
> from my current perf, compared queue filejournal class, rocksdb plays 
> 1.5x-2x latency, in heavy load it will be more .... Yes, filejournal 
> exactly has a good pipeline for pure linux aio job.

Yeah, I think you're right.  Even if we do the parallel submission, we 
don't want to do parallel blocking (since the callers don't want to 
block), so we'll still want async completion/notification of commit.
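
As a shape for that interface, something like this toy sketch is what I
have in mind (submit_async and the thread-per-transaction committer are
made up for illustration, not the real KeyValueDB API; a real version
would feed an existing committer/finisher thread instead):

  #include <functional>
  #include <future>
  #include <iostream>
  #include <memory>
  #include <thread>
  #include <utility>
  #include <vector>

  // Hands the transaction to a committer thread and returns at once;
  // the future is fulfilled when the "commit" (WAL append + fsync in
  // real life) has happened, so the submitter never blocks.
  std::future<void> submit_async(std::vector<std::thread>& committers,
                                 std::function<void()> txn) {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> f = done->get_future();
    committers.emplace_back([txn = std::move(txn), done] {
      txn();               // stand-in for the actual commit work
      done->set_value();   // async commit notification
    });
    return f;
  }

  int main() {
    std::vector<std::thread> committers;
    std::vector<std::future<void>> pending;
    for (int i = 0; i < 3; ++i)
      pending.push_back(submit_async(committers,
                                     [] { /* encode the transaction */ }));
    // the submitter keeps preparing more transactions here...
    for (auto& f : pending) f.get();       // ...and reaps commits later
    for (auto& t : committers) t.join();
    std::cout << "all transactions committed" << std::endl;
  }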

No idea if this is something the rocksdb folks are already interested in 
or not... want to ask them on their cool facebook group?  :)

	https://www.facebook.com/groups/rocksdb.dev/

sage


> 
> >
> > I think we will get some of the benefit by enabling the parallel
> > transaction submits (so we don't funnel everything through
> > _kv_sync_thread).  I think we should get that merged first and see how it
> > behaves before taking the next step.  I forgot to ask Varada is standup
> > this morning what the current status of that is.  Varada?
> >
> > sage
> >
> >>
> >> >
> >> > sage
> >> >
> >> >
> >> >
> >> >>
> >> >>
> >> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> >> >> > I think we need to look at other changes in addition to the encoding
> >> >> > performance improvements.  Even if they end up being good enough, these
> >> >> > changes are somewhat orthogonal and at least one of them should give us
> >> >> > something that is even faster.
> >> >> >
> >> >> > 1. I mentioned this before, but we should keep the encoding
> >> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> >> >> > don't reencode it.  There are no blockers for implementing this currently.
> >> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
> >> >> > see if we can use proper accessors for the blob to enforce this at compile
> >> >> > time.  We should do that anyway.
> >> >> >
> >> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> >> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
> >> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
> >> >> > rocksdb to copy this into the write buffer.  We could extend the
> >> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
> >> >> > (and rocksdb will instead iterate over our buffers and copy them directly
> >> >> > into its write buffer).  This is probably a pretty small piece of the
> >> >> > overall time... should verify with a profiler before investing too much
> >> >> > effort here.
> >> >> >
> >> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
> >> >> > into rocksdb every time we touch an object, even when a tiny amount of
> >> >> > metadata is getting changed.  This is a consequence of embedding all of
> >> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
> >> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
> >> >> > see a couple of different options:
> >> >> >
> >> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> >> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
> >> >> > definitely sequential in zs).  Probably go back to using an iterator.
> >> >> >
> >> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
> >> >> > are unique for any given hash value.  Then store the blobs as
> >> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> >> >> > clone happens there is no onode->bnode migration magic happening--we've
> >> >> > already committed to storing blobs in separate keys.  When we load the
> >> >> > onode, keep the conditional bnode loading we already have.. but when the
> >> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
> >> >> > fault in blobs individually, but that code will be more complicated.)
> >> >> >
> >> >> > In both these cases, a write will dirty the onode (which is back to being
> >> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
> >> >> > small keys).  Updates will generate much lower metadata write traffic,
> >> >> > which'll reduce media wear and compaction overhead.  The cost is that
> >> >> > operations (e.g., reads) that have to fault in an onode are now fetching
> >> >> > several nearby keys instead of a single key.
> >> >> >
> >> >> >
> >> >> > #1 and #2 are completely orthogonal to any encoding efficiency
> >> >> > improvements we make.  And #1 is simple... I plan to implement this
> >> >> > shortly.
> >> >> >
> >> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
> >> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
> >> >> > difficult to properly evaluate without knowing where we'll land with the
> >> >> > (re)encode times.  I think it's a design decision made early on that is
> >> >> > worth revisiting, though!
> >> >> >
> >> >> > sage
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >>
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 15:43       ` Sage Weil
  2016-08-17 15:55         ` Somnath Roy
  2016-08-17 16:00         ` Haomai Wang
@ 2016-08-17 16:03         ` Varada Kari
  2 siblings, 0 replies; 24+ messages in thread
From: Varada Kari @ 2016-08-17 16:03 UTC (permalink / raw)
  To: Sage Weil, Haomai Wang; +Cc: ceph-devel

I have a couple of branches as wip; I will fix them and send a DNM PR soon.

https://github.com/varadakari/ceph/commits/wip-parallel-tx -->
introduced a new wq, which you suggested not to have.

https://github.com/varadakari/ceph/commits/wip-parallel-aiocb  --> tried to
make the whole transaction sync here; I want to see the latency if we can
do the entire transaction in the sharded_worker context. Still wip.

will fix them up.

Varada

On Wednesday 17 August 2016 09:13 PM, Sage Weil wrote:
> On Wed, 17 Aug 2016, Haomai Wang wrote:
>> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>>> On Wed, 17 Aug 2016, Haomai Wang wrote:
>>>> another latency perf problem:
>>>>
>>>> rocksdb log is on bluefs and mainly uses append and fsync interface to
>>>> complete WAL.
>>>>
>>>> I found the latency between kv transaction submitting isn't negligible
>>>> and limit the transaction throughput.
>>>>
>>>> So what if we implement a async transaction submit in rocksdb side
>>>> using callback way? It will decrease kv in queue latency. It would
>>>> help rocksdb WAL performance close to FileJournal. And async interface
>>>> will help control each kv transaction size and make transaction
>>>> complete smoothly instead of tps spike with us precious.
>>> Can we get the same benefit by calling BlueFS::_flush on the log whenever
>>> we have X bytes accumulated (I think there is an option in rocksdb that
>>> drives this already, actually)?  Changing the interfaces around will
>>> change the threading model (= work) but doesn't actually change who needs
>>> to wait and when.
>> why we need to wait after interface change?
>>
>> 1. kv thread submit transaction with callback.
>> 2. rocksdb append and call bluefs aio_submit with callback
>> 3. bluefs submit aio write with callback
>> 4. KernelDevice will poll linux aio event and execute callback inline
>> or queue finish
>> 5. callback will notify we complete the kv transaction
>>
>> the main task is implement logics in rocksdb log*.cc and bluefs aio
>> submit interface....
>>
>> Is anything I'm missing?
> That can all be done with callbacks, but even if we do the kv thread will
> still need to wait on the callback before doing anything else.
>
> Oh, you're suggesting we have multiple batches of transactions in flight.
> Got it.
>
> I think we will get some of the benefit by enabling the parallel
> transaction submits (so we don't funnel everything through
> _kv_sync_thread).  I think we should get that merged first and see how it
> behaves before taking the next step.  I forgot to ask Varada is standup
> this morning what the current status of that is.  Varada?
>
> sage
>
>>> sage
>>>
>>>
>>>
>>>>
>>>> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> I think we need to look at other changes in addition to the encoding
>>>>> performance improvements.  Even if they end up being good enough, these
>>>>> changes are somewhat orthogonal and at least one of them should give us
>>>>> something that is even faster.
>>>>>
>>>>> 1. I mentioned this before, but we should keep the encoding
>>>>> bluestore_blob_t around when we load the blob map.  If it's not changed,
>>>>> don't reencode it.  There are no blockers for implementing this currently.
>>>>> It may be difficult to ensure the blobs are properly marked dirty... I'll
>>>>> see if we can use proper accessors for the blob to enforce this at compile
>>>>> time.  We should do that anyway.
>>>>>
>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>>>>> assemble the bufferlist (lots of bufferptrs to each untouched blob)
>>>>> into a single rocksdb::Slice, and another memcpy somewhere inside
>>>>> rocksdb to copy this into the write buffer.  We could extend the
>>>>> rocksdb interface to take an iovec so that the first memcpy isn't needed
>>>>> (and rocksdb will instead iterate over our buffers and copy them directly
>>>>> into its write buffer).  This is probably a pretty small piece of the
>>>>> overall time... should verify with a profiler before investing too much
>>>>> effort here.
>>>>>
>>>>> 3. Even if we do the above, we're still setting a big (~4k or more?) key
>>>>> into rocksdb every time we touch an object, even when a tiny amount of
>>>>> metadata is getting changed.  This is a consequence of embedding all of
>>>>> the blobs into the onode (or bnode).  That seemed like a good idea early
>>>>> on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>>>>> see a couple of different options:
>>>>>
>>>>> a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>>>>> the blobs too.  They will hopefully be sequential in rocksdb (or
>>>>> definitely sequential in zs).  Probably go back to using an iterator.
>>>>>
>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>>>>> are unique for any given hash value.  Then store the blobs as
>>>>> $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>>>>> clone happens there is no onode->bnode migration magic happening--we've
>>>>> already committed to storing blobs in separate keys.  When we load the
>>>>> onode, keep the conditional bnode loading we already have.. but when the
>>>>> bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>>>>> fault in blobs individually, but that code will be more complicated.)
>>>>>
>>>>> In both these cases, a write will dirty the onode (which is back to being
>>>>> pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>>>>> small keys).  Updates will generate much lower metadata write traffic,
>>>>> which'll reduce media wear and compaction overhead.  The cost is that
>>>>> operations (e.g., reads) that have to fault in an onode are now fetching
>>>>> several nearby keys instead of a single key.
>>>>>
>>>>>
>>>>> #1 and #2 are completely orthogonal to any encoding efficiency
>>>>> improvements we make.  And #1 is simple... I plan to implement this
>>>>> shortly.
>>>>>
>>>>> #3 is balancing (re)encoding efficiency against the cost of separate keys,
>>>>> and that tradeoff will change as encoding efficiency changes, so it'll be
>>>>> difficult to properly evaluate without knowing where we'll land with the
>>>>> (re)encode times.  I think it's a design decision made early on that is
>>>>> worth revisiting, though!
>>>>>
>>>>> sage
>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 15:43       ` Sage Weil
  2016-08-17 15:55         ` Somnath Roy
@ 2016-08-17 16:00         ` Haomai Wang
  2016-08-17 16:10           ` Sage Weil
  2016-08-17 16:03         ` Varada Kari
  2 siblings, 1 reply; 24+ messages in thread
From: Haomai Wang @ 2016-08-17 16:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: Varada Kari, ceph-devel

On Wed, Aug 17, 2016 at 11:43 PM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 17 Aug 2016, Haomai Wang wrote:
>> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>> > On Wed, 17 Aug 2016, Haomai Wang wrote:
>> >> another latency perf problem:
>> >>
>> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
>> >> complete WAL.
>> >>
>> >> I found the latency between kv transaction submitting isn't negligible
>> >> and limit the transaction throughput.
>> >>
>> >> So what if we implement a async transaction submit in rocksdb side
>> >> using callback way? It will decrease kv in queue latency. It would
>> >> help rocksdb WAL performance close to FileJournal. And async interface
>> >> will help control each kv transaction size and make transaction
>> >> complete smoothly instead of tps spike with us precious.
>> >
>> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
>> > we have X bytes accumulated (I think there is an option in rocksdb that
>> > drives this already, actually)?  Changing the interfaces around will
>> > change the threading model (= work) but doesn't actually change who needs
>> > to wait and when.
>>
>> why we need to wait after interface change?
>>
>> 1. kv thread submit transaction with callback.
>> 2. rocksdb append and call bluefs aio_submit with callback
>> 3. bluefs submit aio write with callback
>> 4. KernelDevice will poll linux aio event and execute callback inline
>> or queue finish
>> 5. callback will notify we complete the kv transaction
>>
>> the main task is implement logics in rocksdb log*.cc and bluefs aio
>> submit interface....
>>
>> Is anything I'm missing?
>
> That can all be done with callbacks, but even if we do the kv thread will
> still need to wait on the callback before doing anything else.
>
> Oh, you're suggesting we have multiple batches of transactions in flight.
> Got it.

I don't think so, because bluefs takes a lock for fsync and flush, so
multiple rocksdb threads will still be serialized on the flush... and
another thing is that a single thread helps for the polling case.

From my current perf numbers, compared with the queued FileJournal path,
rocksdb shows 1.5x-2x the latency, and under heavy load it will be more.
Yes, FileJournal really does have a good pipeline for a pure linux aio job.
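
One general way to take the sting out of that serialization is group
commit: hold the lock only for the in-memory append, and let whichever
thread reaches the flush first sync everything accumulated so far on
behalf of the others. A self-contained toy of the idea (all names are
invented, this is not BlueFS code):

  #include <cassert>
  #include <condition_variable>
  #include <cstdint>
  #include <mutex>
  #include <string>
  #include <thread>
  #include <vector>

  class GroupCommitLog {
    std::mutex lock;
    std::condition_variable cond;
    std::string pending;          // appended but not yet "durable"
    uint64_t appended_seq = 0;    // last byte sequence appended
    uint64_t durable_seq = 0;     // last byte sequence synced
    bool flushing = false;

  public:
    // Returns once this record is durable.
    void append_and_sync(const std::string& rec) {
      std::unique_lock<std::mutex> l(lock);
      pending.append(rec);
      appended_seq += rec.size();
      uint64_t my_seq = appended_seq;
      while (durable_seq < my_seq) {
        if (!flushing) {
          // Become the leader: take the whole batch and "write" it with
          // the lock dropped, so other writers can keep appending.
          flushing = true;
          std::string batch;
          batch.swap(pending);
          uint64_t batch_seq = appended_seq;
          l.unlock();
          // ... device write + fsync of `batch` would go here ...
          l.lock();
          durable_seq = batch_seq;
          flushing = false;
          cond.notify_all();
        } else {
          cond.wait(l);           // a leader is already flushing
        }
      }
    }
    uint64_t durable() const { return durable_seq; }
  };

  int main() {
    GroupCommitLog log;
    std::vector<std::thread> writers;
    for (int t = 0; t < 8; ++t)
      writers.emplace_back([&log] {
        for (int i = 0; i < 100; ++i)
          log.append_and_sync(std::string(32, 'x'));
      });
    for (auto& t : writers) t.join();
    assert(log.durable() == 8u * 100u * 32u);
  }

N concurrent writers then pay roughly one sync per batch instead of N
serial syncs, which is the part of the FileJournal pipeline that seems
to be missing here.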

>
> I think we will get some of the benefit by enabling the parallel
> transaction submits (so we don't funnel everything through
> _kv_sync_thread).  I think we should get that merged first and see how it
> behaves before taking the next step.  I forgot to ask Varada is standup
> this morning what the current status of that is.  Varada?
>
> sage
>
>>
>> >
>> > sage
>> >
>> >
>> >
>> >>
>> >>
>> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>> >> > I think we need to look at other changes in addition to the encoding
>> >> > performance improvements.  Even if they end up being good enough, these
>> >> > changes are somewhat orthogonal and at least one of them should give us
>> >> > something that is even faster.
>> >> >
>> >> > 1. I mentioned this before, but we should keep the encoding
>> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
>> >> > don't reencode it.  There are no blockers for implementing this currently.
>> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
>> >> > see if we can use proper accessors for the blob to enforce this at compile
>> >> > time.  We should do that anyway.
>> >> >
>> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
>> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
>> >> > rocksdb to copy this into the write buffer.  We could extend the
>> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
>> >> > (and rocksdb will instead iterate over our buffers and copy them directly
>> >> > into its write buffer).  This is probably a pretty small piece of the
>> >> > overall time... should verify with a profiler before investing too much
>> >> > effort here.
>> >> >
>> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
>> >> > into rocksdb every time we touch an object, even when a tiny amount of
>> >> > metadata is getting changed.  This is a consequence of embedding all of
>> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
>> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>> >> > see a couple of different options:
>> >> >
>> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
>> >> > definitely sequential in zs).  Probably go back to using an iterator.
>> >> >
>> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>> >> > are unique for any given hash value.  Then store the blobs as
>> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>> >> > clone happens there is no onode->bnode migration magic happening--we've
>> >> > already committed to storing blobs in separate keys.  When we load the
>> >> > onode, keep the conditional bnode loading we already have.. but when the
>> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>> >> > fault in blobs individually, but that code will be more complicated.)
>> >> >
>> >> > In both these cases, a write will dirty the onode (which is back to being
>> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>> >> > small keys).  Updates will generate much lower metadata write traffic,
>> >> > which'll reduce media wear and compaction overhead.  The cost is that
>> >> > operations (e.g., reads) that have to fault in an onode are now fetching
>> >> > several nearby keys instead of a single key.
>> >> >
>> >> >
>> >> > #1 and #2 are completely orthogonal to any encoding efficiency
>> >> > improvements we make.  And #1 is simple... I plan to implement this
>> >> > shortly.
>> >> >
>> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
>> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
>> >> > difficult to properly evaluate without knowing where we'll land with the
>> >> > (re)encode times.  I think it's a design decision made early on that is
>> >> > worth revisiting, though!
>> >> >
>> >> > sage
>> >>
>> >>
>>
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 15:55         ` Somnath Roy
@ 2016-08-17 15:58           ` Mark Nelson
  0 siblings, 0 replies; 24+ messages in thread
From: Mark Nelson @ 2016-08-17 15:58 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil, Varada Kari, Haomai Wang; +Cc: ceph-devel

On 08/17/2016 10:55 AM, Somnath Roy wrote:
> Will parallel transaction submit improve the rocksdb performance ?
> If not, it is unlikely we will see any benefit because of that. May be different db like ZS could benefit out of that though.
> What I saw after short circuiting the db path entirely earlier that db performance is still the bottleneck.

Did you try the memdbstore at all?  Last time I tried it bluestore was 
segfaulting, but it's possible it works now.

Mark

>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, August 17, 2016 8:43 AM
> To: Varada Kari; Haomai Wang
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: bluestore blobs
>
> On Wed, 17 Aug 2016, Haomai Wang wrote:
>> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
>>> On Wed, 17 Aug 2016, Haomai Wang wrote:
>>>> another latency perf problem:
>>>>
>>>> rocksdb log is on bluefs and mainly uses append and fsync interface
>>>> to complete WAL.
>>>>
>>>> I found the latency between kv transaction submitting isn't
>>>> negligible and limit the transaction throughput.
>>>>
>>>> So what if we implement a async transaction submit in rocksdb side
>>>> using callback way? It will decrease kv in queue latency. It would
>>>> help rocksdb WAL performance close to FileJournal. And async
>>>> interface will help control each kv transaction size and make
>>>> transaction complete smoothly instead of tps spike with us precious.
>>>
>>> Can we get the same benefit by calling BlueFS::_flush on the log
>>> whenever we have X bytes accumulated (I think there is an option in
>>> rocksdb that drives this already, actually)?  Changing the
>>> interfaces around will change the threading model (= work) but
>>> doesn't actually change who needs to wait and when.
>>
>> why we need to wait after interface change?
>>
>> 1. kv thread submit transaction with callback.
>> 2. rocksdb append and call bluefs aio_submit with callback 3. bluefs
>> submit aio write with callback 4. KernelDevice will poll linux aio
>> event and execute callback inline or queue finish 5. callback will
>> notify we complete the kv transaction
>>
>> the main task is implement logics in rocksdb log*.cc and bluefs aio
>> submit interface....
>>
>> Is anything I'm missing?
>
> That can all be done with callbacks, but even if we do the kv thread will still need to wait on the callback before doing anything else.
>
> Oh, you're suggesting we have multiple batches of transactions in flight.
> Got it.
>
> I think we will get some of the benefit by enabling the parallel transaction submits (so we don't funnel everything through _kv_sync_thread).  I think we should get that merged first and see how it behaves before taking the next step.  I forgot to ask Varada is standup this morning what the current status of that is.  Varada?
>
> sage
>
>>
>>>
>>> sage
>>>
>>>
>>>
>>>>
>>>>
>>>> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> I think we need to look at other changes in addition to the
>>>>> encoding performance improvements.  Even if they end up being
>>>>> good enough, these changes are somewhat orthogonal and at least
>>>>> one of them should give us something that is even faster.
>>>>>
>>>>> 1. I mentioned this before, but we should keep the encoding
>>>>> bluestore_blob_t around when we load the blob map.  If it's not
>>>>> changed, don't reencode it.  There are no blockers for implementing this currently.
>>>>> It may be difficult to ensure the blobs are properly marked
>>>>> dirty... I'll see if we can use proper accessors for the blob to
>>>>> enforce this at compile time.  We should do that anyway.
>>>>>
>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages:
>>>>> one to assemble the bufferlist (lots of bufferptrs to each
>>>>> untouched blob) into a single rocksdb::Slice, and another memcpy
>>>>> somewhere inside rocksdb to copy this into the write buffer.  We
>>>>> could extend the rocksdb interface to take an iovec so that the
>>>>> first memcpy isn't needed (and rocksdb will instead iterate over
>>>>> our buffers and copy them directly into its write buffer).  This
>>>>> is probably a pretty small piece of the overall time... should
>>>>> verify with a profiler before investing too much effort here.
>>>>>
>>>>> 3. Even if we do the above, we're still setting a big (~4k or
>>>>> more?) key into rocksdb every time we touch an object, even when
>>>>> a tiny amount of metadata is getting changed.  This is a
>>>>> consequence of embedding all of the blobs into the onode (or
>>>>> bnode).  That seemed like a good idea early on when they were
>>>>> tiny (i.e., just an extent), but now I'm not so sure.  I see a couple of different options:
>>>>>
>>>>> a) Store each blob as ($onode_key+$blobid).  When we load the
>>>>> onode, load the blobs too.  They will hopefully be sequential in
>>>>> rocksdb (or definitely sequential in zs).  Probably go back to using an iterator.
>>>>>
>>>>> b) Go all in on the "bnode" like concept.  Assign blob ids so
>>>>> that they are unique for any given hash value.  Then store the
>>>>> blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is
>>>>> now).  Then when clone happens there is no onode->bnode migration
>>>>> magic happening--we've already committed to storing blobs in
>>>>> separate keys.  When we load the onode, keep the conditional
>>>>> bnode loading we already have.. but when the bnode is loaded load
>>>>> up all the blobs for the hash key.  (Okay, we could fault in
>>>>> blobs individually, but that code will be more complicated.)
>>>>>
>>>>> In both these cases, a write will dirty the onode (which is back
>>>>> to being pretty small.. just xattrs and the lextent map) and 1-3
>>>>> blobs (also now small keys).  Updates will generate much lower
>>>>> metadata write traffic, which'll reduce media wear and compaction
>>>>> overhead.  The cost is that operations (e.g., reads) that have to
>>>>> fault in an onode are now fetching several nearby keys instead of a single key.
>>>>>
>>>>>
>>>>> #1 and #2 are completely orthogonal to any encoding efficiency
>>>>> improvements we make.  And #1 is simple... I plan to implement
>>>>> this shortly.
>>>>>
>>>>> #3 is balancing (re)encoding efficiency against the cost of
>>>>> separate keys, and that tradeoff will change as encoding
>>>>> efficiency changes, so it'll be difficult to properly evaluate
>>>>> without knowing where we'll land with the (re)encode times.  I
>>>>> think it's a design decision made early on that is worth revisiting, though!
>>>>>
>>>>> sage
>>>>
>>>>
>>
>>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: bluestore blobs
  2016-08-17 15:43       ` Sage Weil
@ 2016-08-17 15:55         ` Somnath Roy
  2016-08-17 15:58           ` Mark Nelson
  2016-08-17 16:00         ` Haomai Wang
  2016-08-17 16:03         ` Varada Kari
  2 siblings, 1 reply; 24+ messages in thread
From: Somnath Roy @ 2016-08-17 15:55 UTC (permalink / raw)
  To: Sage Weil, Varada Kari, Haomai Wang; +Cc: ceph-devel

Will parallel transaction submit improve rocksdb performance?
If not, it is unlikely we will see any benefit from it. Maybe a different db like ZS could benefit from it, though.
What I saw earlier, after short-circuiting the db path entirely, is that db performance is still the bottleneck.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, August 17, 2016 8:43 AM
To: Varada Kari; Haomai Wang
Cc: ceph-devel@vger.kernel.org
Subject: Re: bluestore blobs

On Wed, 17 Aug 2016, Haomai Wang wrote:
> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> another latency perf problem:
> >>
> >> rocksdb log is on bluefs and mainly uses append and fsync interface
> >> to complete WAL.
> >>
> >> I found the latency between kv transaction submitting isn't
> >> negligible and limit the transaction throughput.
> >>
> >> So what if we implement a async transaction submit in rocksdb side
> >> using callback way? It will decrease kv in queue latency. It would
> >> help rocksdb WAL performance close to FileJournal. And async
> >> interface will help control each kv transaction size and make
> >> transaction complete smoothly instead of tps spike with us precious.
> >
> > Can we get the same benefit by calling BlueFS::_flush on the log
> > whenever we have X bytes accumulated (I think there is an option in
> > rocksdb that drives this already, actually)?  Changing the
> > interfaces around will change the threading model (= work) but
> > doesn't actually change who needs to wait and when.
>
> why we need to wait after interface change?
>
> 1. kv thread submit transaction with callback.
> 2. rocksdb append and call bluefs aio_submit with callback 3. bluefs
> submit aio write with callback 4. KernelDevice will poll linux aio
> event and execute callback inline or queue finish 5. callback will
> notify we complete the kv transaction
>
> the main task is implement logics in rocksdb log*.cc and bluefs aio
> submit interface....
>
> Is anything I'm missing?

That can all be done with callbacks, but even if we do the kv thread will still need to wait on the callback before doing anything else.

Oh, you're suggesting we have multiple batches of transactions in flight.
Got it.

I think we will get some of the benefit by enabling the parallel transaction submits (so we don't funnel everything through _kv_sync_thread).  I think we should get that merged first and see how it behaves before taking the next step.  I forgot to ask Varada is standup this morning what the current status of that is.  Varada?

sage

>
> >
> > sage
> >
> >
> >
> >>
> >>
> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> >> > I think we need to look at other changes in addition to the
> >> > encoding performance improvements.  Even if they end up being
> >> > good enough, these changes are somewhat orthogonal and at least
> >> > one of them should give us something that is even faster.
> >> >
> >> > 1. I mentioned this before, but we should keep the encoding
> >> > bluestore_blob_t around when we load the blob map.  If it's not
> >> > changed, don't reencode it.  There are no blockers for implementing this currently.
> >> > It may be difficult to ensure the blobs are properly marked
> >> > dirty... I'll see if we can use proper accessors for the blob to
> >> > enforce this at compile time.  We should do that anyway.
> >> >
> >> > 2. This turns the blob Put into rocksdb into two memcpy stages:
> >> > one to assemble the bufferlist (lots of bufferptrs to each
> >> > untouched blob) into a single rocksdb::Slice, and another memcpy
> >> > somewhere inside rocksdb to copy this into the write buffer.  We
> >> > could extend the rocksdb interface to take an iovec so that the
> >> > first memcpy isn't needed (and rocksdb will instead iterate over
> >> > our buffers and copy them directly into its write buffer).  This
> >> > is probably a pretty small piece of the overall time... should
> >> > verify with a profiler before investing too much effort here.
> >> >
> >> > 3. Even if we do the above, we're still setting a big (~4k or
> >> > more?) key into rocksdb every time we touch an object, even when
> >> > a tiny amount of metadata is getting changed.  This is a
> >> > consequence of embedding all of the blobs into the onode (or
> >> > bnode).  That seemed like a good idea early on when they were
> >> > tiny (i.e., just an extent), but now I'm not so sure.  I see a couple of different options:
> >> >
> >> > a) Store each blob as ($onode_key+$blobid).  When we load the
> >> > onode, load the blobs too.  They will hopefully be sequential in
> >> > rocksdb (or definitely sequential in zs).  Probably go back to using an iterator.
> >> >
> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so
> >> > that they are unique for any given hash value.  Then store the
> >> > blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is
> >> > now).  Then when clone happens there is no onode->bnode migration
> >> > magic happening--we've already committed to storing blobs in
> >> > separate keys.  When we load the onode, keep the conditional
> >> > bnode loading we already have.. but when the bnode is loaded load
> >> > up all the blobs for the hash key.  (Okay, we could fault in
> >> > blobs individually, but that code will be more complicated.)
> >> >
> >> > In both these cases, a write will dirty the onode (which is back
> >> > to being pretty small.. just xattrs and the lextent map) and 1-3
> >> > blobs (also now small keys).  Updates will generate much lower
> >> > metadata write traffic, which'll reduce media wear and compaction
> >> > overhead.  The cost is that operations (e.g., reads) that have to
> >> > fault in an onode are now fetching several nearby keys instead of a single key.
> >> >
> >> >
> >> > #1 and #2 are completely orthogonal to any encoding efficiency
> >> > improvements we make.  And #1 is simple... I plan to implement
> >> > this shortly.
> >> >
> >> > #3 is balancing (re)encoding efficiency against the cost of
> >> > separate keys, and that tradeoff will change as encoding
> >> > efficiency changes, so it'll be difficult to properly evaluate
> >> > without knowing where we'll land with the (re)encode times.  I
> >> > think it's a design decision made early on that is worth revisiting, though!
> >> >
> >> > sage
> >>
> >>
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 15:31     ` Haomai Wang
@ 2016-08-17 15:43       ` Sage Weil
  2016-08-17 15:55         ` Somnath Roy
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Sage Weil @ 2016-08-17 15:43 UTC (permalink / raw)
  To: varada.kari, Haomai Wang; +Cc: ceph-devel

On Wed, 17 Aug 2016, Haomai Wang wrote:
> On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 17 Aug 2016, Haomai Wang wrote:
> >> another latency perf problem:
> >>
> >> rocksdb log is on bluefs and mainly uses append and fsync interface to
> >> complete WAL.
> >>
> >> I found the latency between kv transaction submitting isn't negligible
> >> and limit the transaction throughput.
> >>
> >> So what if we implement a async transaction submit in rocksdb side
> >> using callback way? It will decrease kv in queue latency. It would
> >> help rocksdb WAL performance close to FileJournal. And async interface
> >> will help control each kv transaction size and make transaction
> >> complete smoothly instead of tps spike with us precious.
> >
> > Can we get the same benefit by calling BlueFS::_flush on the log whenever
> > we have X bytes accumulated (I think there is an option in rocksdb that
> > drives this already, actually)?  Changing the interfaces around will
> > change the threading model (= work) but doesn't actually change who needs
> > to wait and when.
> 
> why we need to wait after interface change?
> 
> 1. kv thread submit transaction with callback.
> 2. rocksdb append and call bluefs aio_submit with callback
> 3. bluefs submit aio write with callback
> 4. KernelDevice will poll linux aio event and execute callback inline
> or queue finish
> 5. callback will notify we complete the kv transaction
> 
> the main task is implement logics in rocksdb log*.cc and bluefs aio
> submit interface....
> 
> Is anything I'm missing?

That can all be done with callbacks, but even if we do that, the kv thread
will still need to wait on the callback before doing anything else.

Oh, you're suggesting we have multiple batches of transactions in flight.
Got it.

I think we will get some of the benefit by enabling the parallel
transaction submits (so we don't funnel everything through
_kv_sync_thread).  I think we should get that merged first and see how it
behaves before taking the next step.  I forgot to ask Varada at standup
this morning what the current status of that is.  Varada?
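
For what "parallel submits" could look like structurally, here is a toy
model: hash each transaction to one of N submitter shards instead of
funnelling everything through a single sync thread. The class name,
shard count and hashing are all invented for illustration:

  #include <condition_variable>
  #include <cstdint>
  #include <deque>
  #include <functional>
  #include <mutex>
  #include <thread>
  #include <vector>

  class ShardedSubmitter {
    struct Shard {
      std::mutex lock;
      std::condition_variable cond;
      std::deque<std::function<void()>> q;
      std::thread worker;
      bool stop = false;
    };
    std::vector<Shard> shards;

  public:
    explicit ShardedSubmitter(unsigned n) : shards(n) {
      for (auto& s : shards)
        s.worker = std::thread([&s] {
          std::unique_lock<std::mutex> l(s.lock);
          while (!s.stop || !s.q.empty()) {
            if (s.q.empty()) { s.cond.wait(l); continue; }
            auto txn = std::move(s.q.front());
            s.q.pop_front();
            l.unlock();
            txn();                  // the kv submit would happen here
            l.lock();
          }
        });
    }
    ~ShardedSubmitter() {
      for (auto& s : shards) {
        { std::lock_guard<std::mutex> l(s.lock); s.stop = true; }
        s.cond.notify_all();
        s.worker.join();
      }
    }
    void queue(uint64_t key, std::function<void()> txn) {
      Shard& s = shards[key % shards.size()];
      std::lock_guard<std::mutex> l(s.lock);
      s.q.push_back(std::move(txn));
      s.cond.notify_one();
    }
  };

  int main() {
    ShardedSubmitter submitters(4);
    for (uint64_t txc = 0; txc < 16; ++txc)
      submitters.queue(txc, [] { /* rocksdb submit_transaction(...) */ });
  }

Transactions for the same key stay ordered within a shard, while
different shards can submit in parallel.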

sage

> 
> >
> > sage
> >
> >
> >
> >>
> >>
> >> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> >> > I think we need to look at other changes in addition to the encoding
> >> > performance improvements.  Even if they end up being good enough, these
> >> > changes are somewhat orthogonal and at least one of them should give us
> >> > something that is even faster.
> >> >
> >> > 1. I mentioned this before, but we should keep the encoding
> >> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> >> > don't reencode it.  There are no blockers for implementing this currently.
> >> > It may be difficult to ensure the blobs are properly marked dirty... I'll
> >> > see if we can use proper accessors for the blob to enforce this at compile
> >> > time.  We should do that anyway.
> >> >
> >> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> >> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
> >> > into a single rocksdb::Slice, and another memcpy somewhere inside
> >> > rocksdb to copy this into the write buffer.  We could extend the
> >> > rocksdb interface to take an iovec so that the first memcpy isn't needed
> >> > (and rocksdb will instead iterate over our buffers and copy them directly
> >> > into its write buffer).  This is probably a pretty small piece of the
> >> > overall time... should verify with a profiler before investing too much
> >> > effort here.
> >> >
> >> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
> >> > into rocksdb every time we touch an object, even when a tiny amount of
> >> > metadata is getting changed.  This is a consequence of embedding all of
> >> > the blobs into the onode (or bnode).  That seemed like a good idea early
> >> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
> >> > see a couple of different options:
> >> >
> >> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> >> > the blobs too.  They will hopefully be sequential in rocksdb (or
> >> > definitely sequential in zs).  Probably go back to using an iterator.
> >> >
> >> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
> >> > are unique for any given hash value.  Then store the blobs as
> >> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> >> > clone happens there is no onode->bnode migration magic happening--we've
> >> > already committed to storing blobs in separate keys.  When we load the
> >> > onode, keep the conditional bnode loading we already have.. but when the
> >> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
> >> > fault in blobs individually, but that code will be more complicated.)
> >> >
> >> > In both these cases, a write will dirty the onode (which is back to being
> >> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
> >> > small keys).  Updates will generate much lower metadata write traffic,
> >> > which'll reduce media wear and compaction overhead.  The cost is that
> >> > operations (e.g., reads) that have to fault in an onode are now fetching
> >> > several nearby keys instead of a single key.
> >> >
> >> >
> >> > #1 and #2 are completely orthogonal to any encoding efficiency
> >> > improvements we make.  And #1 is simple... I plan to implement this
> >> > shortly.
> >> >
> >> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
> >> > and that tradeoff will change as encoding efficiency changes, so it'll be
> >> > difficult to properly evaluate without knowing where we'll land with the
> >> > (re)encode times.  I think it's a design decision made early on that is
> >> > worth revisiting, though!
> >> >
> >> > sage
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 15:25   ` Sage Weil
@ 2016-08-17 15:31     ` Haomai Wang
  2016-08-17 15:43       ` Sage Weil
  0 siblings, 1 reply; 24+ messages in thread
From: Haomai Wang @ 2016-08-17 15:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wed, Aug 17, 2016 at 11:25 PM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 17 Aug 2016, Haomai Wang wrote:
>> Another latency perf problem:
>>
>> The rocksdb log is on bluefs and mainly uses the append and fsync
>> interfaces to complete the WAL.
>>
>> I found the latency between kv transaction submissions isn't negligible
>> and limits transaction throughput.
>>
>> So what if we implement an async transaction submit on the rocksdb side
>> using callbacks? It would decrease kv in-queue latency and bring rocksdb
>> WAL performance close to FileJournal. An async interface would also help
>> control each kv transaction's size and make transactions complete
>> smoothly instead of in tps spikes, since every microsecond is precious.
>
> Can we get the same benefit by calling BlueFS::_flush on the log whenever
> we have X bytes accumulated (I think there is an option in rocksdb that
> drives this already, actually)?  Changing the interfaces around will
> change the threading model (= work) but doesn't actually change who needs
> to wait and when.

Why would we need to wait after the interface change?

1. The kv thread submits the transaction with a callback.
2. rocksdb appends to its log and calls bluefs aio_submit with a callback.
3. bluefs submits the aio write with a callback.
4. KernelDevice polls the linux aio events and executes the callback
inline or queues the completion.
5. The callback notifies us that the kv transaction is complete.

The main task is implementing the logic in rocksdb's log*.cc and the
bluefs aio submit interface (rough sketch below).

Is there anything I'm missing?
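
To make the shape concrete, a rough, untested C++ sketch of the submit
path I have in mind -- the names (AsyncWalSim, submit_async) are
invented, and a plain worker thread stands in for the real rocksdb log
append + bluefs aio write + KernelDevice poll path:

  #include <condition_variable>
  #include <functional>
  #include <iostream>
  #include <mutex>
  #include <queue>
  #include <string>
  #include <thread>
  #include <utility>

  // Completion callback: fires once the WAL write for this txn is durable.
  using txn_cb = std::function<void(int)>;

  struct TxnItem {
    std::string wal_bytes;   // encoded transaction destined for the WAL
    txn_cb on_commit;        // step 5: notify the submitter
  };

  class AsyncWalSim {
  public:
    AsyncWalSim() : worker([this] { run(); }) {}
    ~AsyncWalSim() {
      { std::lock_guard<std::mutex> l(lock); stop = true; }
      cond.notify_one();
      worker.join();
    }

    // step 1: the kv thread hands over the transaction plus a callback
    // and returns immediately instead of blocking on append+fsync
    void submit_async(std::string wal_bytes, txn_cb cb) {
      { std::lock_guard<std::mutex> l(lock);
        q.push(TxnItem{std::move(wal_bytes), std::move(cb)}); }
      cond.notify_one();
    }

  private:
    // steps 2-4 stand-in: the real path would be rocksdb appending to its
    // log, bluefs issuing an aio write, and the aio completion (polled by
    // KernelDevice) firing the callback
    void run() {
      std::unique_lock<std::mutex> l(lock);
      while (true) {
        cond.wait(l, [this] { return stop || !q.empty(); });
        if (q.empty() && stop)
          return;
        TxnItem t = std::move(q.front());
        q.pop();
        l.unlock();
        t.on_commit(0);        // step 5: transaction is durable
        l.lock();
      }
    }

    std::mutex lock;
    std::condition_variable cond;
    std::queue<TxnItem> q;
    bool stop = false;
    std::thread worker;
  };

  int main() {
    AsyncWalSim wal;
    wal.submit_async("encoded txn", [](int r) {
      std::cout << "kv transaction committed, r=" << r << std::endl;
    });
  }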

>
> sage
>
>
>
>>
>>
>> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
>> > I think we need to look at other changes in addition to the encoding
>> > performance improvements.  Even if they end up being good enough, these
>> > changes are somewhat orthogonal and at least one of them should give us
>> > something that is even faster.
>> >
>> > 1. I mentioned this before, but we should keep the encoding
>> > bluestore_blob_t around when we load the blob map.  If it's not changed,
>> > don't reencode it.  There are no blockers for implementing this currently.
>> > It may be difficult to ensure the blobs are properly marked dirty... I'll
>> > see if we can use proper accessors for the blob to enforce this at compile
>> > time.  We should do that anyway.
>> >
>> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
>> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
>> > into a single rocksdb::Slice, and another memcpy somewhere inside
>> > rocksdb to copy this into the write buffer.  We could extend the
>> > rocksdb interface to take an iovec so that the first memcpy isn't needed
>> > (and rocksdb will instead iterate over our buffers and copy them directly
>> > into its write buffer).  This is probably a pretty small piece of the
>> > overall time... should verify with a profiler before investing too much
>> > effort here.
>> >
>> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
>> > into rocksdb every time we touch an object, even when a tiny amount of
>> > metadata is getting changed.  This is a consequence of embedding all of
>> > the blobs into the onode (or bnode).  That seemed like a good idea early
>> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
>> > see a couple of different options:
>> >
>> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
>> > the blobs too.  They will hopefully be sequential in rocksdb (or
>> > definitely sequential in zs).  Probably go back to using an iterator.
>> >
>> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
>> > are unique for any given hash value.  Then store the blobs as
>> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
>> > clone happens there is no onode->bnode migration magic happening--we've
>> > already committed to storing blobs in separate keys.  When we load the
>> > onode, keep the conditional bnode loading we already have.. but when the
>> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
>> > fault in blobs individually, but that code will be more complicated.)
>> >
>> > In both these cases, a write will dirty the onode (which is back to being
>> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
>> > small keys).  Updates will generate much lower metadata write traffic,
>> > which'll reduce media wear and compaction overhead.  The cost is that
>> > operations (e.g., reads) that have to fault in an onode are now fetching
>> > several nearby keys instead of a single key.
>> >
>> >
>> > #1 and #2 are completely orthogonal to any encoding efficiency
>> > improvements we make.  And #1 is simple... I plan to implement this
>> > shortly.
>> >
>> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
>> > and that tradeoff will change as encoding efficiency changes, so it'll be
>> > difficult to properly evaluate without knowing where we'll land with the
>> > (re)encode times.  I think it's a design decision made early on that is
>> > worth revisiting, though!
>> >
>> > sage
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 15:00 ` Haomai Wang
@ 2016-08-17 15:25   ` Sage Weil
  2016-08-17 15:31     ` Haomai Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Sage Weil @ 2016-08-17 15:25 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

On Wed, 17 Aug 2016, Haomai Wang wrote:
> Another latency perf problem:
> 
> The rocksdb log is on bluefs and mainly uses the append and fsync
> interfaces to complete the WAL.
> 
> I found the latency between kv transaction submissions isn't negligible
> and limits transaction throughput.
> 
> So what if we implement an async transaction submit on the rocksdb side
> using callbacks? It would decrease kv in-queue latency and bring rocksdb
> WAL performance close to FileJournal. An async interface would also help
> control each kv transaction's size and make transactions complete
> smoothly instead of in tps spikes, since every microsecond is precious.

Can we get the same benefit by calling BlueFS::_flush on the log whenever 
we have X bytes accumulated (I think there is an option in rocksdb that 
drives this already, actually)?  Changing the interfaces around will 
change the threading model (= work) but doesn't actually change who needs 
to wait and when.
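
Roughly what I mean, as a sketch only -- the writer type and the
threshold knob here are invented, not the real BlueFS FileWriter or the
actual rocksdb option:

  #include <cstddef>
  #include <string>
  #include <vector>

  // Stand-in for a bluefs log writer; append()/flush()/fsync() are
  // illustrative names, not the real interface.
  struct LogWriterSketch {
    std::vector<char> pending;            // appended but not yet flushed
    size_t flush_threshold = 64 * 1024;   // the "X bytes"; could be driven
                                          // by a rocksdb/bluefs option

    void append(const std::string& data) {
      pending.insert(pending.end(), data.begin(), data.end());
      if (pending.size() >= flush_threshold)
        flush();                          // i.e. flush the log early,
                                          // before the fsync forces it
    }

    void flush() {
      // ... hand `pending` to the block device (aio write) ...
      pending.clear();
    }

    void fsync() {
      flush();
      // ... wait for the outstanding writes to be stable ...
    }
  };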

sage



> 
> 
> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> > I think we need to look at other changes in addition to the encoding
> > performance improvements.  Even if they end up being good enough, these
> > changes are somewhat orthogonal and at least one of them should give us
> > something that is even faster.
> >
> > 1. I mentioned this before, but we should keep the encoding
> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> > don't reencode it.  There are no blockers for implementing this currently.
> > It may be difficult to ensure the blobs are properly marked dirty... I'll
> > see if we can use proper accessors for the blob to enforce this at compile
> > time.  We should do that anyway.
> >
> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
> > into a single rocksdb::Slice, and another memcpy somewhere inside
> > rocksdb to copy this into the write buffer.  We could extend the
> > rocksdb interface to take an iovec so that the first memcpy isn't needed
> > (and rocksdb will instead iterate over our buffers and copy them directly
> > into its write buffer).  This is probably a pretty small piece of the
> > overall time... should verify with a profiler before investing too much
> > effort here.
> >
> > 3. Even if we do the above, we're still setting a big (~4k or more?) key
> > into rocksdb every time we touch an object, even when a tiny amount of
> > metadata is getting changed.  This is a consequence of embedding all of
> > the blobs into the onode (or bnode).  That seemed like a good idea early
> > on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
> > see a couple of different options:
> >
> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> > the blobs too.  They will hopefully be sequential in rocksdb (or
> > definitely sequential in zs).  Probably go back to using an iterator.
> >
> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they
> > are unique for any given hash value.  Then store the blobs as
> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> > clone happens there is no onode->bnode migration magic happening--we've
> > already committed to storing blobs in separate keys.  When we load the
> > onode, keep the conditional bnode loading we already have.. but when the
> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could
> > fault in blobs individually, but that code will be more complicated.)
> >
> > In both these cases, a write will dirty the onode (which is back to being
> > pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
> > small keys).  Updates will generate much lower metadata write traffic,
> > which'll reduce media wear and compaction overhead.  The cost is that
> > operations (e.g., reads) that have to fault in an onode are now fetching
> > several nearby keys instead of a single key.
> >
> >
> > #1 and #2 are completely orthogonal to any encoding efficiency
> > improvements we make.  And #1 is simple... I plan to implement this
> > shortly.
> >
> > #3 is balancing (re)encoding efficiency against the cost of separate keys,
> > and that tradeoff will change as encoding efficiency changes, so it'll be
> > difficult to properly evaluate without knowing where we'll land with the
> > (re)encode times.  I think it's a design decision made early on that is
> > worth revisiting, though!
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: bluestore blobs
  2016-08-17 14:26 Sage Weil
@ 2016-08-17 15:00 ` Haomai Wang
  2016-08-17 15:25   ` Sage Weil
  2016-08-18  0:05 ` Allen Samuels
  1 sibling, 1 reply; 24+ messages in thread
From: Haomai Wang @ 2016-08-17 15:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Another latency perf problem:

The rocksdb log is on bluefs and mainly uses the append and fsync
interfaces to complete the WAL.

I found the latency between kv transaction submissions isn't negligible
and limits transaction throughput.

So what if we implement an async transaction submit on the rocksdb side
using callbacks? It would decrease kv in-queue latency and bring rocksdb
WAL performance close to FileJournal. An async interface would also help
control each kv transaction's size and make transactions complete
smoothly instead of in tps spikes, since every microsecond is precious.


On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@redhat.com> wrote:
> I think we need to look at other changes in addition to the encoding
> performance improvements.  Even if they end up being good enough, these
> changes are somewhat orthogonal and at least one of them should give us
> something that is even faster.
>
> 1. I mentioned this before, but we should keep the encoding
> bluestore_blob_t around when we load the blob map.  If it's not changed,
> don't reencode it.  There are no blockers for implementing this currently.
> It may be difficult to ensure the blobs are properly marked dirty... I'll
> see if we can use proper accessors for the blob to enforce this at compile
> time.  We should do that anyway.
>
> 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> assemble the bufferlist (lots of bufferptrs to each untouched blob)
> into a single rocksdb::Slice, and another memcpy somewhere inside
> rocksdb to copy this into the write buffer.  We could extend the
> rocksdb interface to take an iovec so that the first memcpy isn't needed
> (and rocksdb will instead iterate over our buffers and copy them directly
> into its write buffer).  This is probably a pretty small piece of the
> overall time... should verify with a profiler before investing too much
> effort here.
>
> 3. Even if we do the above, we're still setting a big (~4k or more?) key
> into rocksdb every time we touch an object, even when a tiny amount of
> metadata is getting changed.  This is a consequence of embedding all of
> the blobs into the onode (or bnode).  That seemed like a good idea early
> on when they were tiny (i.e., just an extent), but now I'm not so sure.  I
> see a couple of different options:
>
> a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> the blobs too.  They will hopefully be sequential in rocksdb (or
> definitely sequential in zs).  Probably go back to using an iterator.
>
> b) Go all in on the "bnode" like concept.  Assign blob ids so that they
> are unique for any given hash value.  Then store the blobs as
> $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> clone happens there is no onode->bnode migration magic happening--we've
> already committed to storing blobs in separate keys.  When we load the
> onode, keep the conditional bnode loading we already have.. but when the
> bnode is loaded load up all the blobs for the hash key.  (Okay, we could
> fault in blobs individually, but that code will be more complicated.)
>
> In both these cases, a write will dirty the onode (which is back to being
> pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now
> small keys).  Updates will generate much lower metadata write traffic,
> which'll reduce media wear and compaction overhead.  The cost is that
> operations (e.g., reads) that have to fault in an onode are now fetching
> several nearby keys instead of a single key.
>
>
> #1 and #2 are completely orthogonal to any encoding efficiency
> improvements we make.  And #1 is simple... I plan to implement this
> shortly.
>
> #3 is balancing (re)encoding efficiency against the cost of separate keys,
> and that tradeoff will change as encoding efficiency changes, so it'll be
> difficult to properly evaluate without knowing where we'll land with the
> (re)encode times.  I think it's a design decision made early on that is
> worth revisiting, though!
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* bluestore blobs
@ 2016-08-17 14:26 Sage Weil
  2016-08-17 15:00 ` Haomai Wang
  2016-08-18  0:05 ` Allen Samuels
  0 siblings, 2 replies; 24+ messages in thread
From: Sage Weil @ 2016-08-17 14:26 UTC (permalink / raw)
  To: ceph-devel

I think we need to look at other changes in addition to the encoding 
performance improvements.  Even if they end up being good enough, these 
changes are somewhat orthogonal and at least one of them should give us 
something that is even faster.

1. I mentioned this before, but we should keep the encoded 
bluestore_blob_t around when we load the blob map.  If it hasn't changed, 
don't reencode it.  There are no blockers for implementing this currently.  
It may be difficult to ensure the blobs are properly marked dirty... I'll 
see if we can use proper accessors for the blob to enforce this at compile 
time.  We should do that anyway.
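
As a rough illustration of the accessor idea, with stand-in types (a
dummy struct instead of bluestore_blob_t, std::string instead of
bufferlist) -- the only mutable path clears the cached encoding, so a
caller can't forget to mark the blob dirty:

  #include <string>

  struct blob_meta_t {                // stand-in for bluestore_blob_t
    // ... extents, csum data, flags ...
    std::string encode() const { return std::string("encoded bytes"); }
  };

  class Blob {
  public:
    // read-only access never touches the cached encoding
    const blob_meta_t& get() const { return meta; }

    // the only way to get a mutable reference: callers are forced through
    // an accessor that invalidates the cached encoding, so "forgot to
    // mark it dirty" can't compile
    blob_meta_t& dirty() {
      cached_valid = false;
      return meta;
    }

    const std::string& encoded() const {
      if (!cached_valid) {
        cached = meta.encode();       // re-encode only if actually changed
        cached_valid = true;
      }
      return cached;
    }

  private:
    blob_meta_t meta;
    mutable std::string cached;       // last encoding, reused while clean
    mutable bool cached_valid = false;
  };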

2. With #1 in place, the blob Put into rocksdb still turns into two 
memcpy stages: one to 
assemble the bufferlist (lots of bufferptrs to each untouched blob) 
into a single rocksdb::Slice, and another memcpy somewhere inside 
rocksdb to copy this into the write buffer.  We could extend the 
rocksdb interface to take an iovec so that the first memcpy isn't needed 
(and rocksdb will instead iterate over our buffers and copy them directly 
into its write buffer).  This is probably a pretty small piece of the 
overall time... should verify with a profiler before investing too much 
effort here.
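
FWIW rocksdb already has a SliceParts overload on WriteBatch::Put that
takes an array of Slices, which looks like the iovec-style entry point
described above (worth double-checking against the version we build, and
whether it really saves the copy end to end is exactly what the profiler
should tell us).  A sketch, error handling omitted:

  #include <rocksdb/slice.h>
  #include <rocksdb/write_batch.h>
  #include <vector>

  // One Slice per untouched buffer; nothing is flattened on our side.
  void put_blob_pieces(rocksdb::WriteBatch* batch,
                       const rocksdb::Slice& key,
                       const std::vector<rocksdb::Slice>& pieces) {
    rocksdb::SliceParts key_parts(&key, 1);
    rocksdb::SliceParts val_parts(pieces.data(),
                                  static_cast<int>(pieces.size()));
    batch->Put(key_parts, val_parts);   // rocksdb walks the parts and
                                        // copies them into its own buffer
  }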

3. Even if we do the above, we're still setting a big (~4k or more?) key 
into rocksdb every time we touch an object, even when a tiny amount of 
metadata is getting changed.  This is a consequence of embedding all of 
the blobs into the onode (or bnode).  That seemed like a good idea early 
on when they were tiny (i.e., just an extent), but now I'm not so sure.  I 
see a couple of different options:

a) Store each blob as ($onode_key+$blobid).  When we load the onode, load 
the blobs too.  They will hopefully be sequential in rocksdb (or 
definitely sequential in zs).  Probably go back to using an iterator.

b) Go all in on the "bnode" like concept.  Assign blob ids so that they 
are unique for any given hash value.  Then store the blobs as 
$shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when 
clone happens there is no onode->bnode migration magic happening--we've 
already committed to storing blobs in separate keys.  When we load the 
onode, keep the conditional bnode loading we already have.. but when the 
bnode is loaded load up all the blobs for the hash key.  (Okay, we could 
fault in blobs individually, but that code will be more complicated.)
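
To make the two layouts concrete, a purely illustrative sketch of the key
construction -- the real bluestore key encoding (escaping, shard/pool
encoding, ordering rules) is more involved, and the helper names here are
made up:

  #include <cstdint>
  #include <string>

  // big-endian append so lexicographic key order matches numeric order
  static void append_u32_be(std::string* out, uint32_t v) {
    for (int i = 3; i >= 0; --i)
      out->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }

  static void append_u64_be(std::string* out, uint64_t v) {
    for (int i = 7; i >= 0; --i)
      out->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }

  // option (a): blob keys sort immediately after their onode key
  std::string blob_key_a(const std::string& onode_key, uint32_t blobid) {
    std::string k = onode_key;
    k.push_back('b');               // type byte, purely hypothetical
    append_u32_be(&k, blobid);
    return k;
  }

  // option (b): blobs keyed like the bnode ($shard.$poolid.$hash.$blobid),
  // so clone never has to migrate them
  std::string blob_key_b(uint32_t shard, uint64_t poolid,
                         uint32_t hash, uint32_t blobid) {
    std::string k;
    append_u32_be(&k, shard);
    append_u64_be(&k, poolid);
    append_u32_be(&k, hash);
    append_u32_be(&k, blobid);
    return k;
  }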

In both these cases, a write will dirty the onode (which is back to being 
pretty small.. just xattrs and the lextent map) and 1-3 blobs (also now 
small keys).  Updates will generate much lower metadata write traffic, 
which'll reduce media wear and compaction overhead.  The cost is that 
operations (e.g., reads) that have to fault in an onode are now fetching 
several nearby keys instead of a single key.
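
On the read side, option (a) essentially turns a single Get into a short
iterator scan over adjacent keys.  A sketch, reusing the hypothetical key
layout above (and assuming no unrelated key is a prefix of the onode
key), error handling omitted:

  #include <rocksdb/db.h>
  #include <memory>
  #include <string>
  #include <vector>

  void load_onode_and_blobs(rocksdb::DB* db,
                            const std::string& onode_key,
                            std::string* onode_val,
                            std::vector<std::string>* blob_vals) {
    std::unique_ptr<rocksdb::Iterator> it(
        db->NewIterator(rocksdb::ReadOptions()));
    for (it->Seek(onode_key);
         it->Valid() && it->key().starts_with(onode_key);
         it->Next()) {
      if (it->key().size() == onode_key.size())
        *onode_val = it->value().ToString();          // the onode itself
      else
        blob_vals->push_back(it->value().ToString()); // an adjacent blob key
    }
  }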


#1 and #2 are completely orthogonal to any encoding efficiency 
improvements we make.  And #1 is simple... I plan to implement this 
shortly.

#3 is balancing (re)encoding efficiency against the cost of separate keys, 
and that tradeoff will change as encoding efficiency changes, so it'll be 
difficult to properly evaluate without knowing where we'll land with the 
(re)encode times.  I think it's a design decision made early on that is 
worth revisiting, though!

sage

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-08-26 18:16 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-26 17:51 bluestore blobs Sage Weil
2016-08-26 18:16 ` Allen Samuels
  -- strict thread matches above, loose matches on Subject: below --
2016-08-17 14:26 Sage Weil
2016-08-17 15:00 ` Haomai Wang
2016-08-17 15:25   ` Sage Weil
2016-08-17 15:31     ` Haomai Wang
2016-08-17 15:43       ` Sage Weil
2016-08-17 15:55         ` Somnath Roy
2016-08-17 15:58           ` Mark Nelson
2016-08-17 16:00         ` Haomai Wang
2016-08-17 16:10           ` Sage Weil
2016-08-17 16:32             ` Haomai Wang
2016-08-17 16:42               ` Sage Weil
2016-08-18 15:49                 ` Haomai Wang
2016-08-18 15:53                   ` Sage Weil
2016-08-18 16:53                     ` Haomai Wang
2016-08-18 17:09                       ` Haomai Wang
2016-08-17 16:03         ` Varada Kari
2016-08-18  0:05 ` Allen Samuels
2016-08-18 15:10   ` Sage Weil
2016-08-19  3:11     ` Allen Samuels
2016-08-19 13:53       ` Sage Weil
2016-08-19 14:16         ` Allen Samuels
2016-08-19 11:38     ` Mark Nelson
