* bluestore cache
@ 2016-05-26 19:27 Sage Weil
  2016-05-27  3:38 ` Jianjian Huo
  2016-05-27 17:09 ` Gregory Farnum
  0 siblings, 2 replies; 5+ messages in thread
From: Sage Weil @ 2016-05-26 19:27 UTC (permalink / raw)
  To: ceph-devel

Previously we were relying on the block device cache when appropriate 
(when a rados hint indicated we should, usually), but that is unreliable 
for annoying O_DIRECT reasons.  We also need an explicit cache to cope 
with some of the read/modify/write cases with compression and checksum 
block/chunk sizes.

A basic cache is in place that is attached to each Onode, and maps the 
logical object bytes to buffers.  We also have a per-collection onode 
cache.  Currently the only trimming of data happens when onodes are 
trimmed, and we control that using a coarse per-collection num_onodes 
knob.
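
Roughly, the current model looks something like the following sketch (the 
type and member names here are illustrative only, not the actual BlueStore 
code):

  // Each Onode maps logical object offsets to cached buffers; each
  // Collection keeps an LRU of onodes trimmed by a coarse num_onodes knob.
  // Trimming onodes is currently the only thing that drops cached buffers.
  #include <cstdint>
  #include <list>
  #include <map>
  #include <memory>
  #include <string>

  struct Buffer {
    uint64_t offset;     // logical offset within the object
    std::string data;    // cached bytes
  };

  struct Onode {
    std::map<uint64_t, Buffer> buffer_map;  // logical offset -> buffer
  };

  struct Collection {
    std::map<std::string, std::shared_ptr<Onode>> onode_map;  // oid -> onode
    std::list<std::string> onode_lru;   // most recently used at the front
    size_t max_onodes = 1024;           // coarse per-collection knob

    void trim() {
      while (onode_lru.size() > max_onodes) {
        onode_map.erase(onode_lru.back());  // dropping an onode drops its buffers
        onode_lru.pop_back();
      }
    }
  };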

There are two basic questions:

1. First, should we stick with a logical extent -> buffer mapping, or move 
to a blob -> blob extent mapping.  The former is simpler and attaches to 
the onode cache, which we also have to fix trimming for anyway.  On the 
other hand, when we clone an object (usually a head object to a snapshot), 
the cache doesn't follow.  Moving to a blob-based scheme would work better 
in that regard, but probably means that we have another object 
(bluestore_blob_t) whose lifecycle we need to manage.

I'm inclined to stick with the current scheme since reads from just-cloned 
snaps probably aren't too frequent, at least until we have a better 
idea how to do the lifecycle with the simpler (current) model.
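
For contrast, the blob-based alternative would hang the cached data off the 
shared blob instead of the onode, so a clone that references the same blob 
sees the same cached extents (again just a sketch with made-up names; 
bluestore_blob_t is the only real type being stood in for):

  #include <cstdint>
  #include <map>
  #include <memory>
  #include <string>

  struct BlobExtentCache {
    std::map<uint64_t, std::string> extents;  // blob offset -> cached bytes
  };

  struct Blob {                 // stand-in for bluestore_blob_t plus its cache
    BlobExtentCache cache;      // shared by every onode that maps this blob
  };

  struct LExtent {              // one logical extent in an onode's map
    uint64_t blob_offset;
    uint64_t length;
    std::shared_ptr<Blob> blob; // a clone copies this pointer, so the cache
                                // "follows" the cloned data, but now the
                                // blob's lifecycle has to be managed
  };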

2. Second, we need to trim the buffers.  The per-collection onode cache is 
nice because the LRU is local to the collection and already protected by 
existing locks, which avoids complicated locking in the trim path that 
we'd get from a global LRU.  On the other hand, it's clearly suboptimal: 
some pools will get more IO than others and we want to apportion our cache 
resources more fairly.

My hope is that we can do both using a scheme that has collection-local 
LRU (or ARC or some other cache policy) for onodes and buffers, and then 
have a global view of what proportion of the cache a collection is 
entitled to and drive our trimming against that.  This won't be super 
precise, but even if we fudge the "perfect" per-collection cache size by 
say 30%, I wouldn't expect to see a huge change in cache hit rate over 
that range (unless your cache is already pretty undersized).

Anyway, a simple way to estimate a collection's portion of the total might 
be to have a periodic global epoch counter that increments every, say, 10 
seconds.  Then we could count ops globally and per collection.  When the epoch
rolls over, we look at our previous epoch count vs the global count and 
use that ratio to size our per-collection cache.  Since this doesn't have 
to be super-precise we can get clever with atomic and per-cpu variables if 
we need to on the global count.
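
Something like this is the sort of bookkeeping I have in mind (a sketch 
only; the actual counters, units, and where the epoch roll happens are all 
up for grabs):

  #include <atomic>
  #include <cstdint>

  // Global op counter; could become per-cpu counters that get summed if
  // this ever shows up as contention.
  std::atomic<uint64_t> global_ops{0};

  struct CollectionCache {
    uint64_t my_ops = 0;        // ops this epoch, updated under the collection lock
    uint64_t cache_target = 0;  // bytes (or onodes) this collection may keep

    void note_op() {
      ++my_ops;
      global_ops.fetch_add(1, std::memory_order_relaxed);
    }

    // Called every ~10 seconds when the global epoch rolls over.
    void on_epoch(uint64_t total_ops, uint64_t global_cache_bytes) {
      if (total_ops)
        cache_target = global_cache_bytes * my_ops / total_ops;
      my_ops = 0;
      // ...then trim down toward cache_target under the collection lock.
    }
  };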

Within a collection, we're under the same collection lock, so a simple LRU 
of onodes and buffers ought to suffice.  I think we want something better
than just a straight-LRU, though: some data is hinted WILLNEED, and 
buffers we hit in cache twice should get bumped up higher than stuff we 
just read off of disk.  The MDS uses a simple 2-level LRU list; I suspect 
something like MQ might be a better choice for us, but this is probably a 
secondary issue.. we can optimize this independently once we have the 
overall approach sorted out.
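
If we went the 2-level route, the core of it is small; something like this 
sketch (hypothetical, not the MDS code), where a second hit or a WILLNEED 
hint promotes an entry to the warm list and eviction drains the cold list 
first:

  #include <list>
  #include <unordered_map>

  template <typename K>
  struct TwoLevelLRU {
    std::list<K> warm, cold;              // front = most recently used
    std::unordered_map<K, bool> in_warm;  // which list each key is on

    void touch(const K& k, bool willneed = false) {
      auto it = in_warm.find(k);
      if (it == in_warm.end()) {
        (willneed ? warm : cold).push_front(k);  // first reference
        in_warm[k] = willneed;
      } else if (!it->second) {
        cold.remove(k);                          // second reference: promote
        warm.push_front(k);
        it->second = true;
      } else {
        warm.remove(k);                          // already warm: refresh recency
        warm.push_front(k);
      }
    }

    bool evict(K* out) {
      auto& l = !cold.empty() ? cold : warm;     // cold entries go first
      if (l.empty())
        return false;
      *out = l.back();
      l.pop_back();
      in_warm.erase(*out);
      return true;
    }
  };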

Anyway, I guess what I'm looking for is feedback on (1) above, and whether 
per-collection caches with periodic size calibration (based on workload) 
sounds reasonable.

Thanks!
sage



* RE: bluestore cache
  2016-05-26 19:27 bluestore cache Sage Weil
@ 2016-05-27  3:38 ` Jianjian Huo
  2016-05-27 17:09 ` Gregory Farnum
  1 sibling, 0 replies; 5+ messages in thread
From: Jianjian Huo @ 2016-05-27  3:38 UTC (permalink / raw)
  To: Sage Weil, ceph-devel



On Thu, May 26, 2016 at 12:27 PM, Sage Weil <sweil@redhat.com> wrote:
> Previously we were relying on the block device cache when appropriate
> (when a rados hint indicated we should, usually), but that is unreliable
> for annoying O_DIRECT reasons.  We also need an explicit cache to cope
> with some of the read/modify/write cases with compression and checksum
> block/chunk sizes.
>
> A basic cache is in place that is attached to each Onode, and maps the
> logical object bytes to buffers.  We also have a per-collection onode
> cache.  Currently the only trimming of data happens when onodes are
> trimmed, and we control that using a coarse per-collection num_onodes
> knob.
>
> There are two basic questions:
>
> 1. First, should we stick with a logical extent -> buffer mapping, or move
> to a blob -> blob extent mapping.  The former is simpler and attaches to
> the onode cache, which we also have to fix trimming for anyway.  On the
> other hand, when we clone an object (usually a head object to a snapshot),
> the cache doesn't follow.  Moving to a blob-based scheme would work better
> in that regard, but probably means that we have another object
> (bluestore_blob_t) whose lifecycle we need to manage.
>
> I'm inclined to stick with the current scheme since reads from just-cloned
> snaps probably aren't too frequent, at least until we have a better
> idea how to do the lifecycle with the simpler (current) model.
>
> 2. Second, we need to trim the buffers.  The per-collection onode cache is
> nice because the LRU is local to the collection and already protected by
> existing locks, which avoids complicated locking in the trim path that
> we'd get from a global LRU.  On the other hand, it's clearly suboptimal:
> some pools will get more IO than others and we want to apportion our cache
> resources more fairly.
>
> My hope is that we can do both using a scheme that has collection-local
> LRU (or ARC or some other cache policy) for onodes and buffers, and then
> have a global view of what proportion of the cache a collection is
> entitled to and drive our trimming against that.  This won't be super
> precise, but even if we fudge the "perfect" per-collection cache size by
> say 30%, I wouldn't expect to see a huge change in cache hit rate over
> that range (unless your cache is already pretty undersized).
>
> Anyway, a simple way to estimate a collection's portion of the total might
> be to have a periodic global epoch counter that increments every, say, 10
> seconds.  Then we could count ops globally and per collection.  When the epoch
> rolls over, we look at our previous epoch count vs the global count and
> use that ratio to size our per-collection cache.  Since this doesn't have
> to be super-precise we can get clever with atomic and per-cpu variables if
> we need to on the global count.
>
> Within a collection, we're under the same collection lock, so a simple LRU
> of onodes and buffers ought to suffice.  I think we want something better
> than just a straight-LRU, though: some data is hinted WILLNEED, and
> buffers we hit in cache twice should get bumped up higher than stuff we
> just read off of disk.  The MDS uses a simple 2-level LRU list; I suspect
> something like MQ might be a better choice for us, but this is probably a
> secondary issue.. we can optimize this independently once we have the
> overall approach sorted out.
>
> Anyway, I guess what I'm looking for is feedback on (1) above, and whether
> per-collection caches with periodic size calibration (based on workload)
> sounds reasonable.

Very good design: sharded caches without additional locks, adapting to different workloads. One benefit blob caching has that this logical extent caching doesn't: if the data set's compression rate is high, will blob caching use less RAM? Also, users tend to use more PGs on SSD deployments; will per-collection caches use more CPU cycles with that many PGs?

Jianjian
>
> Thanks!
> sage
>


* Re: bluestore cache
  2016-05-26 19:27 bluestore cache Sage Weil
  2016-05-27  3:38 ` Jianjian Huo
@ 2016-05-27 17:09 ` Gregory Farnum
  2016-05-28  2:35   ` Sage Weil
  1 sibling, 1 reply; 5+ messages in thread
From: Gregory Farnum @ 2016-05-27 17:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Thu, May 26, 2016 at 12:27 PM, Sage Weil <sweil@redhat.com> wrote:
> Previously we were relying on the block device cache when appropriate
> (when a rados hint indicated we should, usually), but that is unreliable
> for annoying O_DIRECT reasons.  We also need an explicit cache to cope
> with some of the read/modify/write cases with compression and checksum
> block/chunk sizes.
>
> A basic cache is in place that is attached to each Onode, and maps the
> logical object bytes to buffers.  We also have a per-collection onode
> cache.  Currently the only trimming of data happens when onodes are
> trimmed, and we control that using a coarse per-collection num_onodes
> knob.
>
> There are two basic questions:
>
> 1. First, should we stick with a logical extent -> buffer mapping, or move
> to a blob -> blob extent mapping.  The former is simpler and attaches to
> the onode cache, which we also have to fix trimming for anyway.  On the
> other hand, when we clone an object (usually a head object to a snapshot),
> the cache doesn't follow.  Moving to a blob-based scheme would work better
> in that regard, but probably means that we have another object
> (bluestore_blob_t) whose lifecycle we need to manage.
>
> I'm inclined to stick with the current scheme since reads from just-cloned
> snaps probably aren't too frequent, at least until we have a better
> idea how to do the lifecycle with the simpler (current) model.
>
> 2. Second, we need to trim the buffers.  The per-collection onode cache is
> nice because the LRU is local to the collection and already protected by
> existing locks, which avoids complicated locking in the trim path that
> we'd get from a global LRU.  On the other hand, it's clearly suboptimal:
> some pools will get more IO than others and we want to apportion our cache
> resources more fairly.
>
> My hope is that we can do both using a scheme that has collection-local
> LRU (or ARC or some other cache policy) for onodes and buffers, and then
> have a global view of what proportion of the cache a collection is
> entitled to and drive our trimming against that.  This won't be super
> precise, but even if we fudge the "perfect" per-collection cache size by
> say 30%, I wouldn't expect to see a huge change in cache hit rate over
> that range (unless your cache is already pretty undersized).
>
> Anyway, a simple way to estimate a collection's portion of the total might
> be to have a periodic global epoch counter that increments every, say, 10
> seconds.  Then we could count ops globally and per collection.  When the epoch
> rolls over, we look at our previous epoch count vs the global count and
> use that ratio to size our per-collection cache.  Since this doesn't have
> to be super-precise we can get clever with atomic and per-cpu variables if
> we need to on the global count.


I'm a little concerned about this due to the nature of some of our IO
patterns. Maybe it's not a problem within the levels you're talking
about, but consider:
1) RGW cluster, in which the index pool gets a hugely disproportionate
number of ops in comparison to its actual size (at least for writes)
2) RBD cluster, in which you can expect a golden master pool to get a
lot of reads but have much less total data compared to the user block
device pools.

A system that naively allocates cache space based on proportion of ops
is going to perform pretty badly.
-Greg

>
> Within a collection, we're under the same collection lock, so a simple LRU
> of onodes and buffers ought to suffice.  I think we want something better
> than just a straight-LRU, though: some data is hinted WILLNEED, and
> buffers we hit in cache twice should get bumped up higher than stuff we
> just read off of disk.  The MDS uses a simple 2-level LRU list; I suspect
> something like MQ might be a better choice for us, but this is probably a
> secondary issue.. we can optimize this independently once we have the
> overall approach sorted out.
>
> Anyway, I guess what I'm looking for is feedback on (1) above, and whether
> per-collection caches with periodic size calibration (based on workload)
> sounds reasonable.
>
> Thanks!
> sage
>


* Re: bluestore cache
  2016-05-27 17:09 ` Gregory Farnum
@ 2016-05-28  2:35   ` Sage Weil
  2016-05-28  4:32     ` Ramesh Chander
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2016-05-28  2:35 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On Fri, 27 May 2016, Gregory Farnum wrote:
> On Thu, May 26, 2016 at 12:27 PM, Sage Weil <sweil@redhat.com> wrote:
> > Previously we were relying on the block device cache when appropriate
> > (when a rados hint indicated we should, usually), but that is unreliable
> > for annoying O_DIRECT reasons.  We also need an explicit cache to cope
> > with some of the read/modify/write cases with compression and checksum
> > block/chunk sizes.
> >
> > A basic cache is in place that is attached to each Onode, and maps the
> > logical object bytes to buffers.  We also have a per-collection onode
> > cache.  Currently the only trimming of data happens when onodes are
> > trimmed, and we control that using a coarse per-collection num_onodes
> > knob.
> >
> > There are two basic questions:
> >
> > 1. First, should we stick with a logical extent -> buffer mapping, or move
> > to a blob -> blob extent mapping.  The former is simpler and attaches to
> > the onode cache, which we also have to fix trimming for anyway.  On the
> > other hand, when we clone an object (usually a head object to a snapshot),
> > the cache doesn't follow.  Moving to a blob-based scheme would work better
> > in that regard, but probably means that we have another object
> > (bluestore_blob_t) whose lifecycle we need to manage.
> >
> > I'm inclined to stick with the current scheme since reads from just-cloned
> > snaps probably aren't too frequent, at least until we have a better
> > idea how to do the lifecycle with the simpler (current) model.
> >
> > 2. Second, we need to trim the buffers.  The per-collection onode cache is
> > nice because the LRU is local to the collection and already protected by
> > existing locks, which avoids complicated locking in the trim path that
> > we'd get from a global LRU.  On the other hand, it's clearly suboptimal:
> > some pools will get more IO than others and we want to apportion our cache
> > resources more fairly.
> >
> > My hope is that we can do both using a scheme that has collection-local
> > LRU (or ARC or some other cache policy) for onodes and buffers, and then
> > have a global view of what proportion of the cache a collection is
> > entitled to and drive our trimming against that.  This won't be super
> > precise, but even if we fudge the "perfect" per-collection cache size by
> > say 30%, I wouldn't expect to see a huge change in cache hit rate over
> > that range (unless your cache is already pretty undersized).
> >
> > Anyway, a simple way to estimate a collection's portion of the total might
> > be to have a periodic global epoch counter that increments every, say, 10
> > seconds.  Then we could count ops globally and per collection.  When the epoch
> > rolls over, we look at our previous epoch count vs the global count and
> > use that ratio to size our per-collection cache.  Since this doesn't have
> > to be super-precise we can get clever with atomic and per-cpu variables if
> > we need to on the global count.
> 
> 
> I'm a little concerned about this due to the nature of some of our IO
> patterns. Maybe it's not a problem within the levels you're talking
> about, but consider:
> 1) RGW cluster, in which the index pool gets a hugely disproportionate
> number of ops in comparison to its actual size (at least for writes)
> 2) RBD cluster, in which you can expect a golden master pool to get a
> lot of reads but have much less total data compared to the user block
> device pools.
> 
> A system that naively allocates cache space based on proportion of ops
> is going to perform pretty badly.

Yeah.  We could count both (1) ops and (2) bytes, and use some function of 
the two.  There are actually 2 caches to size: the onode cache and the 
buffer cache.
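
One possible "function of the two" (purely a sketch; how we blend ops and 
bytes is exactly the open question) would be to size the onode cache by 
each collection's share of ops and the buffer cache by its share of bytes, 
so an op-heavy index pool and a byte-heavy data pool each get something 
sensible:

  #include <cstdint>

  struct EpochStats {
    uint64_t ops = 0;
    uint64_t bytes = 0;
  };

  struct CacheTargets {
    uint64_t onodes = 0;  // onode-cache entries this collection may keep
    uint64_t bytes = 0;   // buffer-cache bytes this collection may keep
  };

  CacheTargets size_collection(const EpochStats& coll, const EpochStats& global,
                               uint64_t total_onodes, uint64_t total_bytes) {
    CacheTargets t;
    if (global.ops)
      t.onodes = total_onodes * coll.ops / global.ops;    // op-driven
    if (global.bytes)
      t.bytes = total_bytes * coll.bytes / global.bytes;  // byte-driven
    return t;
  }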

What we don't really have good control over is the omap portion, though.  
Since that goes through rocksdb and bluefs it'd have to be sized globally 
for the OSD.  So maybe we'd also count (3) omap keys or something.

What do you think?

I think the main alternative is a global LRU (or whatever), but trimming 
in that situation sucks, because for each victim you have to go take the 
collection lock to update the onode map or buffer cache maps...

sage


* RE: bluestore cache
  2016-05-28  2:35   ` Sage Weil
@ 2016-05-28  4:32     ` Ramesh Chander
  0 siblings, 0 replies; 5+ messages in thread
From: Ramesh Chander @ 2016-05-28  4:32 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: ceph-devel

I am not completely well versed in BlueStore terms yet, but overall this problem looks more generic.

IMHO, a blob cache with all extents pointing directly to cache entries (which they either reference or take a read/write lock on for specific operations) is easier to control in terms
of total DRAM usage and shared content.

To make cache usage fair, we could carve the cache into fixed-size units, keep a global pool of free units as free lists, and (as you mentioned) keep a counter tracking per-collection usage.
The free pool can skew towards any collection until it is completely exhausted. When we need to evict, we first target the oversized collection's cache and move those entries wherever required;
eviction can start from a collection's own cache if it is oversized, and may need to take locks on two collections at a time.
Trimming an underutilized collection's cache might need some other strategy; a rough sketch of the unit pool follows.
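
(Illustrative only; the locking granularity and the definition of "fair 
share" would need real thought.)

  #include <cstdint>
  #include <mutex>
  #include <vector>

  struct UnitPool {
    std::mutex lock;
    uint64_t total_units;
    uint64_t free_units;
    std::vector<uint64_t> used;  // per-collection unit usage

    UnitPool(uint64_t total, size_t ncolls)
      : total_units(total), free_units(total), used(ncolls, 0) {}

    // A collection may keep taking units until the global pool is exhausted.
    bool take(size_t coll) {
      std::lock_guard<std::mutex> l(lock);
      if (!free_units)
        return false;
      --free_units;
      ++used[coll];
      return true;
    }

    // When we must evict, target the collection that is most oversized.
    size_t pick_victim() {
      std::lock_guard<std::mutex> l(lock);
      size_t victim = 0;
      for (size_t i = 1; i < used.size(); ++i)
        if (used[i] > used[victim])
          victim = i;
      return victim;
    }
  };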

BTW: I did not fully understand the fair-usage goal across collections. Isn't it fair that whoever needs more gets more, unless we have some QoS?
If we divide the cache strictly among the collections, underutilization is possible, so we may need the per-collection size to flex dynamically in both directions.

-Regards,
Ramesh Chander


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Saturday, May 28, 2016 8:05 AM
To: Gregory Farnum
Cc: ceph-devel
Subject: Re: bluestore cache

On Fri, 27 May 2016, Gregory Farnum wrote:
> On Thu, May 26, 2016 at 12:27 PM, Sage Weil <sweil@redhat.com> wrote:
> > Previously we were relying on the block device cache when
> > appropriate (when a rados hint indicated we should, usually), but
> > that is unreliable for annoying O_DIRECT reasons.  We also need an
> > explicit cache to cope with some of the read/modify/write cases with
> > compression and checksum block/chunk sizes.
> >
> > A basic cache is in place that is attached to each Onode, and maps
> > the logical object bytes to buffers.  We also have a per-collection
> > onode cache.  Currently the only trimming of data happens when
> > onodes are trimmed, and we control that using a coarse
> > per-collection num_onodes knob.
> >
> > There are two basic questions:
> >
> > 1. First, should we stick with a logical extent -> buffer mapping,
> > or move to a blob -> blob extent mapping.  The former is simpler and
> > attaches to the onode cache, which we also have to fix trimming for
> > anyway.  On the other hand, when we clone an object (usually a head
> > object to a snapshot), the cache doesn't follow.  Moving to a
> > blob-based scheme would work better in that regard, but probably
> > means that we have another object
> > (bluestore_blob_t) whose lifecycle we need to manage.
> >
> > I'm inclined to stick with the current scheme since reads from
> > just-cloned snaps probably aren't too frequent, at least until we
> > have a better idea how to do the lifecycle with the simpler (current) model.
> >
> > 2. Second, we need to trim the buffers.  The per-collection onode
> > cache is nice because the LRU is local to the collection and already
> > protected by existing locks, which avoids complicated locking in the
> > trim path that we'd get from a global LRU.  On the other hand, it's clearly suboptimal:
> > some pools will get more IO than others and we want to apportion our
> > cache resources more fairly.
> >
> > My hope is that we can do both using a scheme that has
> > collection-local LRU (or ARC or some other cache policy) for onodes
> > and buffers, and then have a global view of what proportion of the
> > cache a collection is entitled to and drive our trimming against
> > that.  This won't be super precise, but even if we fudge the "perfect"
> > per-collection cache size by say 30%, I wouldn't expect to see a huge
> > change in cache hit rate over that range (unless your cache is
> > already pretty undersized).
> >
> > Anyway, a simple way to estimate a collection's portion of the total
> > might be to have a periodic global epoch counter that increments
> > every, say, 10 seconds.  Then we could count ops globally and per
> > collection.  When the epoch rolls over, we look at our previous
> > epoch count vs the global count and use that ratio to size our
> > per-collection cache.  Since this doesn't have to be super-precise
> > we can get clever with atomic and per-cpu variables if we need to on the global count.
>
>
> I'm a little concerned about this due to the nature of some of our IO
> patterns. Maybe it's not a problem within the levels you're talking
> about, but consider:
> 1) RGW cluster, in which the index pool gets a hugely disproportionate
> number of ops in comparison to its actual size (at least for writes)
> 2) RBD cluster, in which you can expect a golden master pool to get a
> lot of reads but have much less total data compared to the user block
> device pools.
>
> A system that naively allocates cache space based on proportion of ops
> is going to perform pretty badly.

Yeah.  We could count both (1) ops and (2) bytes, and use some function of the two.  There are actually 2 caches to size: the onode cache and the buffer cache.

What we don't really have good control over is the omap portion, though.
Since that goes through rocksdb and bluefs it'd have to be sized globally for the OSD.  So maybe we'd also count (3) omap keys or something.

What do you think?

I think the main alternative is a global LRU (or whatever), but trimming in that situation sucks, because for each victim you have to go take the collection lock to update the onode map or buffer cache maps...

sage

