* bluestore blobs
From: Sage Weil @ 2016-08-17 14:26 UTC
  To: ceph-devel

I think we need to look at other changes in addition to the encoding 
performance improvements.  Even if they end up being good enough, these 
changes are somewhat orthogonal and at least one of them should give us 
something that is even faster.

1. I mentioned this before, but we should keep the encoded 
bluestore_blob_t around when we load the blob map.  If it's not changed, 
don't reencode it.  There are no blockers for implementing this currently.  
It may be difficult to ensure the blobs are properly marked dirty... I'll 
see if we can use proper accessors for the blob to enforce this at compile 
time.  We should do that anyway.
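
Roughly the shape I have in mind (just a sketch; the names and encode 
helpers are illustrative, not the actual Blob code):

  // Sketch: cache the encoded form next to the blob and invalidate it
  // through a mutating accessor, so a missed dirty-marking becomes a
  // compile-time error (the raw blob member stays private).
  struct Blob {
    const bluestore_blob_t& get_blob() const { return blob; }
    bluestore_blob_t& dirty_blob() {
      encoded.clear();            // any mutation invalidates the cache
      return blob;
    }
    void encode(bufferlist& bl) {
      if (encoded.length() == 0)
        ::encode(blob, encoded);  // reencode only if dirty
      bl.append(encoded);         // otherwise reuse the cached bytes
    }
  private:
    bluestore_blob_t blob;
    bufferlist encoded;           // last encoding; empty means dirty
  };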

2. With #1 in place, the blob Put into rocksdb becomes two memcpy stages: 
one to assemble the bufferlist (lots of bufferptrs pointing at each 
untouched blob) into a single rocksdb::Slice, and another memcpy somewhere 
inside rocksdb to copy this into the write buffer.  We could extend the 
rocksdb interface to take an iovec so that the first memcpy isn't needed 
(rocksdb would instead iterate over our buffers and copy them directly 
into its write buffer).  This is probably a pretty small piece of the 
overall time, though... we should verify with a profiler before investing 
too much effort here.
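
FWIW rocksdb may already have most of this: WriteBatch::Put() has an 
overload taking SliceParts (an array of Slices), which is essentially the 
iovec interface.  Whether it avoids the extra copy on the path we care 
about needs checking, but the caller side would look roughly like this 
(a hedged sketch, error handling omitted):

  #include <rocksdb/write_batch.h>
  #include <vector>

  // Point one Slice at each bufferptr instead of flattening the
  // bufferlist into a single contiguous Slice first.
  void put_gather(rocksdb::WriteBatch& batch,
                  const rocksdb::Slice& key,
                  const bufferlist& bl) {
    std::vector<rocksdb::Slice> parts;
    for (const auto& p : bl.buffers())
      parts.push_back(rocksdb::Slice(p.c_str(), p.length()));
    rocksdb::SliceParts kp(&key, 1);
    rocksdb::SliceParts vp(parts.data(), (int)parts.size());
    batch.Put(kp, vp);  // rocksdb copies the parts into its rep
  }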

3. Even if we do the above, we're still setting a big (~4k or more?) key 
into rocksdb every time we touch an object, even when a tiny amount of 
metadata is getting changed.  This is a consequence of embedding all of 
the blobs into the onode (or bnode).  That seemed like a good idea early 
on when they were tiny (i.e., just an extent), but now I'm not so sure.  I 
see a couple of different options (rough key sketches below):

a) Store each blob as ($onode_key+$blobid).  When we load the onode, load 
the blobs too.  They will hopefully be sequential in rocksdb (or 
definitely sequential in zs).  Probably go back to using an iterator.

b) Go all in on the "bnode" like concept.  Assign blob ids so that they 
are unique for any given hash value.  Then store the blobs as 
$shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when 
clone happens there is no onode->bnode migration magic happening--we've 
already committed to storing blobs in separate keys.  When we load the 
onode, keep the conditional bnode loading we already have... but when the 
bnode is loaded, load up all the blobs for the hash key.  (Okay, we could 
fault in blobs individually, but that code would be more complicated.)

In both these cases, a write will dirty the onode (which is back to being 
pretty small... just xattrs and the lextent map) and 1-3 blobs (also now 
small keys).  Updates will generate much lower metadata write traffic, 
which'll reduce media wear and compaction overhead.  The cost is that 
operations (e.g., reads) that have to fault in an onode now fetch 
several nearby keys instead of a single key.
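
To make the two key layouts concrete, roughly (the append_* helpers here 
stand in for whatever escaping/big-endian encoding the kv key code 
already uses; all names hypothetical):

  // (a) blob keys sort immediately after their onode key
  std::string blob_key_a(const std::string& onode_key, uint64_t blobid) {
    std::string k = onode_key;
    k.push_back('.');
    append_u64_be(k, blobid);     // big-endian so ids sort numerically
    return k;
  }

  // (b) blob keys live where the bnode key is now, shared by all
  // onodes with the same hash value
  std::string blob_key_b(uint8_t shard, uint64_t poolid,
                         uint32_t hash, uint64_t blobid) {
    std::string k;
    append_u8(k, shard);
    append_u64_be(k, poolid);
    append_u32_be(k, hash);
    append_u64_be(k, blobid);
    return k;
  }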


#1 and #2 are completely orthogonal to any encoding efficiency 
improvements we make.  And #1 is simple... I plan to implement this 
shortly.

#3 is balancing (re)encoding efficiency against the cost of separate keys, 
and that tradeoff will change as encoding efficiency changes, so it'll be 
difficult to properly evaluate without knowing where we'll land with the 
(re)encode times.  I think it's a design decision made early on that is 
worth revisiting, though!

sage

* bluestore blobs
From: Sage Weil @ 2016-08-26 17:51 UTC
  To: allen.samuels; +Cc: ceph-devel

Hi Allen,

The "blobs must be confined to a extent map shard" rule is still somewhat 
unsatisfying to me.  There's another easy possibility, though: we 
allow blobs to span extent map shards, and when they do, we stuff them 
directly in the onode.  The number of such blobs will always be small 
(no more than the number of extent map shards), so I don't think size is a 
concern.  And we'll always already have them loaded up when we bring any 
particular shard in, so we don't need to worry about any additional 
complexity around paging them in.  And we avoid the slightly annoying cut 
points on compressed extents when they cross such boundaries.
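
In code terms the rule is simple... something like this at encode time 
(all names hypothetical):

  // A blob contained in one extent map shard is encoded with that
  // shard; one that crosses a shard boundary lives in the onode and
  // is therefore always resident when any shard is loaded.
  void place_blob(Blob* b, uint32_t shard_begin, uint32_t shard_end,
                  Onode* o, ExtentShard* s) {
    if (shard_begin <= b->logical_begin() &&
        b->logical_end() <= shard_end)
      s->blobs.push_back(b);           // confined to this shard
    else
      o->spanning_blobs.push_back(b);  // spans shards: keep in onode
  }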

This also avoids some of the tuning practicalities that were annoying me 
(does a global config option control where the enforced cut points are?  
what happens if that changes on an existing store?).

sage

