* luminous OSD memory usage
@ 2017-08-30 10:00 Aleksei Gutikov
  2017-08-30 15:17 ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Aleksei Gutikov @ 2017-08-30 10:00 UTC (permalink / raw)
  To: Ceph Development

Hi.

I'm trying to align osd daemon memory limits with the bluestore cache 
settings.
For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
For ssds we have a 6G limit and they are being oom-killed periodically.

While
osd_op_num_threads_per_shard_hdd=1
osd_op_num_threads_per_shard_ssd=2
and
bluestore_cache_size_hdd=1G
bluestore_cache_size_ssd=3G
and
osd_op_num_shards_hdd=5
osd_op_num_shards_ssd=8

Does it mean that ssd osds will use 4G*2*3*8/5, or 3G*2*8/5, or other?
Does anybody have an idea about the equation for upper bound of memory 
consumption?
Can bluestore_cache_size be decreased safely for example to 2G, or to 1G?

I want to calculate the maximum expected size of bluestore metadata 
(that must be definitely fit into cache) using size of raw space, 
average size of objects, rocksdb space amplification.
I thought it should be something simple like 
raw_space/avg_obj_size*obj_overhead*rocksdb_space_amp.
For example if obj_overhead=1k, hdd size=1T, rocksdb space amplification 
is 2 and avg obj size=4M then 1T/4M*1k*2=500M so I need at least 512M 
for cache.
But wise guys said that I have to take into account number of extents also.
But bluestore_extent_map_shard_max_size=1200, I hope this number is not 
a multiplicator...
What would be correct approach for calculation of this minimum cache size?
What can be expected size of key-values stored in rocksdb per rados object?
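
To make the estimate above concrete, here it is as a small program (all 
the per-object numbers are my own guesses, not measured values):

  // rough per-OSD metadata estimate -- every input here is an assumption
  #include <cstdio>

  int main() {
      const double raw_space         = 1e12; // 1T hdd
      const double avg_obj_size      = 4e6;  // 4M average rados object
      const double obj_overhead      = 1e3;  // ~1k of onode/extent keys per object (guess)
      const double rocksdb_space_amp = 2.0;  // assumed rocksdb space amplification

      const double num_objects = raw_space / avg_obj_size;  // 250,000 objects
      const double min_cache   = num_objects * obj_overhead * rocksdb_space_amp;  // ~500 MB
      std::printf("objects=%.0f  min_cache=%.0f MB\n", num_objects, min_cache / 1e6);
      return 0;
  }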

Default bluestore_cache_kv_ratio*bluestore_cache_size_ssd=0.99*3G
while default bluestore_cache_kv_max=512M
Looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6 
in default case. Is 512M enough for bluestore metadata?


Thanks!
Aleksei

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: luminous OSD memory usage
  2017-08-30 10:00 luminous OSD memory usage Aleksei Gutikov
@ 2017-08-30 15:17 ` Sage Weil
  2017-09-01  2:00   ` xiaoyan li
  2017-10-20  8:52   ` Aleksei Gutikov
  0 siblings, 2 replies; 7+ messages in thread
From: Sage Weil @ 2017-08-30 15:17 UTC (permalink / raw)
  To: Aleksei Gutikov; +Cc: Ceph Development

Hi Aleksei,

On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
> Hi.
> 
> I'm trying to align osd daemon memory limits with the bluestore cache
> settings.
> For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
> For ssds we have a 6G limit and they are being oom-killed periodically.

So,
 
> While
> osd_op_num_threads_per_shard_hdd=1
> osd_op_num_threads_per_shard_ssd=2
> and
> osd_op_num_shards_hdd=5
> osd_op_num_shards_ssd=8

aren't relevant to memory usage.  The _per_shard is about how many bytes 
are stored in each rocksdb key, and the num_shards is about how many 
threads we use.

This is the one that matters:

> bluestore_cache_size_hdd=1G
> bluestore_cache_size_ssd=3G

It governs how much memory bluestore limits itself to.  The bad news is 
that bluestore counts what it allocates, not how much memory the allocator 
uses, so there is some overhead.  From what I've anecdotally seen it's 
something like 1.5x, which kind of sucks; there is more to be done here.

On top of that is usage by the OSD outside of bluestore, 
which is somewhere in the 500m to 1.5g range.

We're very interested in hearing the RSS that users observe relative to 
the configured bluestore size and pg count, along with a dump of the 
mempool metadata (ceph daemon osd.NNN dump_mempools).

> Does anybody have an idea about the equation for upper bound of memory
> consumption?

Very roughly, something like: osd_overhead + bluestore_cache_size * 1.5 ?
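
For example, plugging in the numbers above (and treating ~1G of OSD 
overhead as an assumption): 1G + 3G * 1.5 = 5.5G of expected RSS per ssd 
OSD, which is already uncomfortably close to the 6G limit mentioned above.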

> Can bluestore_cache_size be decreased safely for example to 2G, or to 1G?

Yes, you can (and should) change this to whatever you like, big or small.
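
For example, a minimal override might look like this (just a sketch -- it 
assumes the usual [osd] section of ceph.conf, with values in bytes):

  [osd]
  # shrink the bluestore cache on ssd OSDs from the 3G default
  bluestore_cache_size_ssd = 2147483648
  # hdd OSDs can stay at (or go below) the 1G default
  bluestore_cache_size_hdd = 1073741824

I believe the OSDs need a restart to pick the new values up.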

> I want to calculate the maximum expected size of bluestore metadata (that must
> be definitely fit into cache) using size of raw space, average size of
> objects, rocksdb space amplification.
> I thought it should be something simple like
> raw_space/avg_obj_size*obj_overhead*rocksdb_space_amp.
> For example if obj_overhead=1k, hdd size=1T, rocksdb space amplification is 2
> and avg obj size=4M then 1T/4M*1k*2=500M so I need at least 512M for cache.
> But wise guys said that I have to take into account number of extents also.
> But bluestore_extent_map_shard_max_size=1200, I hope this number is not a
> multiplicator...

Nope, just a shard size...

> What would be correct approach for calculation of this minimum cache size?
> What can be expected size of key-values stored in rocksdb per rados object?

This depends, unfortunately, on what the write pattern is for the objects.  
If they're written by RGW in big chunks, the overhead will be smaller.  
If it comes out of a 4k random write pattern it will be bigger.  Again, 
very interested in hearing user reports of what is observed in real-world 
situations.  I would trust that more than a calculation from first 
principles like the above.

> Default bluestore_cache_kv_ratio*bluestore_cache_size_ssd=0.99*3G
> while default bluestore_cache_kv_max=512M
> Looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6 in
> default case. Is 512M enough for bluestore metadata?

In Mark's testing he found that we got more performance benefit when 
small caches were devoted to rocksdb and large caches were devoted mostly 
to the bluestore metadata cache (parsed onodes vs caching the encoded 
on-disk content).  You can always adjust the 512m value upwards (and that 
may make sense for large cache sizes).  Again, very interested in hearing 
whether that works better or worse for your workload!
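
If you do want to give rocksdb more than that, the cap is just another 
option you can raise (same ceph.conf sketch as above, value in bytes):

  [osd]
  # lift the 512M cap on the rocksdb portion of the bluestore cache
  bluestore_cache_kv_max = 1073741824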

Thanks-
sage


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: luminous OSD memory usage
  2017-08-30 15:17 ` Sage Weil
@ 2017-09-01  2:00   ` xiaoyan li
  2017-09-01 12:51     ` Mark Nelson
  2017-09-01 13:10     ` Sage Weil
  2017-10-20  8:52   ` Aleksei Gutikov
  1 sibling, 2 replies; 7+ messages in thread
From: xiaoyan li @ 2017-09-01  2:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: Aleksei Gutikov, Ceph Development

On Wed, Aug 30, 2017 at 11:17 PM, Sage Weil <sage@newdream.net> wrote:
> Hi Aleksei,
>
> On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
>> Hi.
>>
>> I'm trying to align osd daemon memory limits with the bluestore cache
>> settings.
>> For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
>> For ssds we have a 6G limit and they are being oom-killed periodically.
>
> So,
>
>> While
>> osd_op_num_threads_per_shard_hdd=1
>> osd_op_num_threads_per_shard_ssd=2
>> and
>> osd_op_num_shards_hdd=5
>> osd_op_num_shards_ssd=8
>
> aren't relevant to memory usage.  The _per_shard is about how many bytes
> are stored in each rocksdb key, and the num_shards is about how many
> threads we use.

I don't quite understand the point about _per_shard. I notice that
osd_op_num_threads_per_shard is used to set cache shards in BlueStore.
>
> This is the one that matters:
>
>> bluestore_cache_size_hdd=1G
>> bluestore_cache_size_ssd=3G
>
> It governs how much memory bluestore limits itself to.  The bad news is
> that bluestore counts what it allocates, not how much memory the allocator
> uses, so there is some overhead.  From what I've anecdotally seen it's
> something like 1.5x, which kind of sucks; there is more to be done here.
>
> On top of that is usage by the OSD outside of bluestore,
> which is somewhere in the 500m to 1.5g range.
>
> We're very interested in hearing the RSS that users observe relative to
> the configured bluestore size and pg count, along with a dump of the
> mempool metadata (ceph daemon osd.NNN dump_mempools).
>
>> Does anybody have an idea about the equation for upper bound of memory
>> consumption?
>
> Very roughly, something like: osd_overhead + bluestore_cache_size * 1.5 ?
>
>> Can bluestore_cache_size be decreased safely for example to 2G, or to 1G?
>
> Yes, you can/should change this to whatever you like (big or small).
>
>> I want to calculate the maximum expected size of bluestore metadata (that must
>> be definitely fit into cache) using size of raw space, average size of
>> objects, rocksdb space amplification.
>> I thought it should be something simple like
>> raw_space/avg_obj_size*obj_overhead*rocksdb_space_amp.
>> For example if obj_overhead=1k, hdd size=1T, rocksdb space amplification is 2
>> and avg obj size=4M then 1T/4M*1k*2=500M so I need at least 512M for cache.
>> But wise guys said that I have to take into account number of extents also.
>> But bluestore_extent_map_shard_max_size=1200, I hope this number is not a
>> multiplicator...
>
> Nope, just a shard size...
>
>> What would be correct approach for calculation of this minimum cache size?
>> What can be expected size of key-values stored in rocksdb per rados object?
>
> This depends, unfortunately, on what the write pattern is for the objects.
> If they're written by RGW in big chunks, the overhead will be smaller.
> If it comes out of a 4k random write pattern it will be bigger.  Again,
> very interested in hearing user reports of what is observed in real-world
> situations.  I would trust that more than a calculation from first
> principles like the above.
>
>> Default bluestore_cache_kv_ratio*bluestore_cache_size_ssd=0.99*3G
>> while default bluestore_cache_kv_max=512M
>> Looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6 in
>> default case. Is 512M enough for bluestore metadata?
>
> In Mark's testing he found that we got more performance benefit when
> small caches were devoted to rocksdb and large caches were devoted mostly
> to the bluestore metadata cache (parsed onodes vs caching the encoded
> on-disk content).  You can always adjust the 512m value upwards (and that
> may make sense for large cache sizes).  Again, very interested in hearing
> whether that works better or worse for your workload!
>
I am wondering whether Bluestore should use the rocksdb block cache at
all. Rocksdb uses the block cache to cache data blocks in memory for
reads, but that data is already cached by the bluestore metadata cache.
And the bluestore metadata cache has two advantages: 1. it doesn't cache
pg_log entries or deferred io logs; 2. rocksdb caches whole data blocks,
so it doesn't have fine granularity.

Thanks-
> sage
>



-- 
Best wishes
Lisa

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: luminous OSD memory usage
  2017-09-01  2:00   ` xiaoyan li
@ 2017-09-01 12:51     ` Mark Nelson
  2017-09-01 13:10     ` Sage Weil
  1 sibling, 0 replies; 7+ messages in thread
From: Mark Nelson @ 2017-09-01 12:51 UTC (permalink / raw)
  To: xiaoyan li, Sage Weil; +Cc: Aleksei Gutikov, Ceph Development



On 08/31/2017 09:00 PM, xiaoyan li wrote:
> On Wed, Aug 30, 2017 at 11:17 PM, Sage Weil <sage@newdream.net> wrote:
>> Hi Aleksei,
>>
>> On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
>>> Hi.
>>>
>>> I'm trying to align osd daemon memory limits with the bluestore cache
>>> settings.
>>> For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
>>> For ssds we have a 6G limit and they are being oom-killed periodically.
>>
>> So,
>>
>>> While
>>> osd_op_num_threads_per_shard_hdd=1
>>> osd_op_num_threads_per_shard_ssd=2
>>> and
>>> osd_op_num_shards_hdd=5
>>> osd_op_num_shards_ssd=8
>>
>> aren't relevant to memory usage.  The _per_shard is about how many bytes
>> are stored in each rocksdb key, and the num_shards is about how many
>> threads we use.
>
> I can't understand the point about _per_shard. I notice that
> osd_op_num_threads_per_shard is used to set cache shards in BlueStore.
>>
>> This is the one that matters:
>>
>>> bluestore_cache_size_hdd=1G
>>> bluestore_cache_size_ssd=3G
>>
>> It governs how much memory bluestore limits itself to.  The bad news is
>> that bluestore counts what it allocates, not how much memory the allocator
>> uses, so there is some overhead.  From what I've anecdotally seen it's
>> something like 1.5x, which kind of sucks; there is more to be done here.
>>
>> On top of that is usage by the OSD outside of bluestore,
>> which is somewhere in the 500m to 1.5g range.
>>
>> We're very interested in hearing the RSS that users observe relative to
>> the configured bluestore size and pg count, along with a dump of the
>> mempool metadata (ceph daemon osd.NNN dump_mempools).
>>
>>> Does anybody have an idea about the equation for upper bound of memory
>>> consumption?
>>
>> Very roughly, something like: osd_overhead + bluestore_cache_size * 1.5 ?
>>
>>> Can bluestore_cache_size be decreased safely for example to 2G, or to 1G?
>>
>> Yes, you can/should change this to whatever you like (big or small).
>>
>>> I want to calculate the maximum expected size of bluestore metadata (that must
>>> be definitely fit into cache) using size of raw space, average size of
>>> objects, rocksdb space amplification.
>>> I thought it should be something simple like
>>> raw_space/avg_obj_size*obj_overhead*rocksdb_space_amp.
>>> For example if obj_overhead=1k, hdd size=1T, rocksdb space amplification is 2
>>> and avg obj size=4M then 1T/4M*1k*2=500M so I need at least 512M for cache.
>>> But wise guys said that I have to take into account number of extents also.
>>> But bluestore_extent_map_shard_max_size=1200, I hope this number is not a
>>> multiplicator...
>>
>> Nope, just a shard size...
>>
>>> What would be correct approach for calculation of this minimum cache size?
>>> What can be expected size of key-values stored in rocksdb per rados object?
>>
>> This depends, unfortunately, on what the write pattern is for the objects.
>> If they're written by RGW in big chunks, the overhead will be smaller.
>> If it comes out of a 4k random write pattern it will be bigger.  Again,
>> very interested in hearing user reports of what is observed in real-world
>> situations.  I would trust that more than a calculation from first
>> principles like the above.
>>
>>> Default bluestore_cache_kv_ratio*bluestore_cache_size_ssd=0.99*3G
>>> while default bluestore_cache_kv_max=512M
>>> Looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6 in
>>> default case. Is 512M enough for bluestore metadata?
>>
>> In Mark's testing he found that we got more performance benefit when
>> small caches were devoted to rocksdb and large caches were devoted mostly
>> to the bluestore metadata cache (parsed onodes vs caching the encoded
>> on-disk content).  You can always adjust the 512m value upwards (and that
>> may make sense for large cache sizes).  Again, very interested in hearing
>> whether that works better or worse for your workload!
>>
> I am wondering whether Bluestore should use the rocksdb block cache at
> all. Rocksdb uses the block cache to cache data blocks in memory for
> reads, but that data is already cached by the bluestore metadata cache.
> And the bluestore metadata cache has two advantages: 1. it doesn't cache
> pg_log entries or deferred io logs; 2. rocksdb caches whole data blocks,
> so it doesn't have fine granularity.

We use the block cache in rocksdb to store the bloom filters as well so 
we'll want to at least give it enough to cover those.  We also found 
that performance tanked at small cache values when favoring onode cache 
vs rocksdb block cache.  There could be a couple of reasons for that. 
rocksdb is storing the data already encoded, and since we use varint, I 
suspect that we can store more in the rocksdb cache.  It might be that 
at low cache sizes it's worth taking the CPU hit to fit more data in 
cache.  We can do that in the bluestore cache as well, but it might be 
that the two-level approach is better, i.e. we want the cache closest to 
bluestore to not be encoded while cache at the rocksdb level is encoded 
(and perhaps we even want to use compression at some point).

Mark

>
> Thanks-
>> sage
>>
>
>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: luminous OSD memory usage
  2017-09-01  2:00   ` xiaoyan li
  2017-09-01 12:51     ` Mark Nelson
@ 2017-09-01 13:10     ` Sage Weil
  1 sibling, 0 replies; 7+ messages in thread
From: Sage Weil @ 2017-09-01 13:10 UTC (permalink / raw)
  To: xiaoyan li; +Cc: Aleksei Gutikov, Ceph Development

On Fri, 1 Sep 2017, xiaoyan li wrote:
> On Wed, Aug 30, 2017 at 11:17 PM, Sage Weil <sage@newdream.net> wrote:
> > Hi Aleksei,
> >
> > On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
> >> Hi.
> >>
> >> I'm trying to align osd daemon memory limits with the bluestore cache
> >> settings.
> >> For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
> >> For ssds we have a 6G limit and they are being oom-killed periodically.
> >
> > So,
> >
> >> While
> >> osd_op_num_threads_per_shard_hdd=1
> >> osd_op_num_threads_per_shard_ssd=2
> >> and
> >> osd_op_num_shards_hdd=5
> >> osd_op_num_shards_ssd=8
> >
> > aren't relevant to memory usage.  The _per_shard is about how many bytes
> > are stored in each rocksdb key, and the num_shards is about how many
> > threads we use.
> 
> I don't quite understand the point about _per_shard. I notice that
> osd_op_num_threads_per_shard is used to set cache shards in BlueStore.

  store->set_cache_shards(get_num_op_shards());

and the osd op queue thread count is then shards * threads_per_shard.
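
With the ssd defaults quoted above (osd_op_num_shards_ssd=8, 
osd_op_num_threads_per_shard_ssd=2), for example, that works out to 8 
bluestore cache shards and 8 * 2 = 16 op queue threads.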

sage

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: luminous OSD memory usage
  2017-08-30 15:17 ` Sage Weil
  2017-09-01  2:00   ` xiaoyan li
@ 2017-10-20  8:52   ` Aleksei Gutikov
  2017-10-20 12:12     ` Sage Weil
  1 sibling, 1 reply; 7+ messages in thread
From: Aleksei Gutikov @ 2017-10-20  8:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 4411 bytes --]

Hi,

Here are some stats about OSD memory usage from our lab cluster
(110hdd*1T + 36ssd*200G)

https://drive.google.com/drive/folders/0B1s9jTJ0z59JcmtEX283WUd5bzg?usp=sharing

osd_mem_stats.txt contains stats for hdd and ssd OSDs:
- stats for process memory usage from /proc/pid/status,
- mempools stats
- and heap stats

We use luminous 12.2.1.
We set a 9G memory limit in 
/lib/systemd/system/ceph-osd@.service.d/override.conf,
but that is not enough - OSDs are still being killed because they use 
more (with bluestore_cache_size_ssd=3G).
OSDs on hdd (with bluestore_cache_size_hdd=1G) use up to 4.6G.

And, btw, it seems that systemd's oom killer looks at VmData, not at VmRSS.

Thanks.

On 08/30/2017 06:17 PM, Sage Weil wrote:
> Hi Aleksei,
> 
> On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
>> Hi.
>>
>> I'm trying to align osd daemon memory limits with the bluestore cache
>> settings.
>> For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
>> For ssds we have a 6G limit and they are being oom-killed periodically.
> 
> So,
>   
>> While
>> osd_op_num_threads_per_shard_hdd=1
>> osd_op_num_threads_per_shard_ssd=2
>> and
>> osd_op_num_shards_hdd=5
>> osd_op_num_shards_ssd=8
> 
> aren't relevant to memory usage.  The _per_shard is about how many bytes
> are stored in each rocksdb key, and the num_shards is about how many
> threads we use.
> 
> This is the one that matters:
> 
>> bluestore_cache_size_hdd=1G
>> bluestore_cache_size_ssd=3G
> 
> It governs how much memory bluestore limits itself to.  The bad news is
> that bluestore counts what it allocates, not how much memory the allocator
> uses, so there is some overhead.  From what I've anecdotally seen it's
> something like 1.5x, which kind of sucks; there is more to be done here.
> 
> On top of that is usage by the OSD outside of bluestore,
> which is somewhere in the 500m to 1.5g range.
> 
> We're very interested in hearing the RSS that users observe relative to
> the configured bluestore size and pg count, along with a dump of the
> mempool metadata (ceph daemon osd.NNN dump_mempools).
> 
>> Does anybody have an idea about the equation for upper bound of memory
>> consumption?
> 
> Very roughly, something like: osd_overhead + bluestore_cache_size * 1.5 ?
> 
>> Can bluestore_cache_size be decreased safely for example to 2G, or to 1G?
> 
> Yes, you can/should change this to whatever you like (big or small).
> 
>> I want to calculate the maximum expected size of bluestore metadata (that must
>> be definitely fit into cache) using size of raw space, average size of
>> objects, rocksdb space amplification.
>> I thought it should be something simple like
>> raw_space/avg_obj_size*obj_overhead*rocksdb_space_amp.
>> For example if obj_overhead=1k, hdd size=1T, rocksdb space amplification is 2
>> and avg obj size=4M then 1T/4M*1k*2=500M so I need at least 512M for cache.
>> But wise guys said that I have to take into account number of extents also.
>> But bluestore_extent_map_shard_max_size=1200, I hope this number is not a
>> multiplicator...
> 
> Nope, just a shard size...
> 
>> What would be correct approach for calculation of this minimum cache size?
>> What can be expected size of key-values stored in rocksdb per rados object?
> 
> This depends, unfortunately, on what the write pattern is for the objects.
> If they're written by RGW in big chunks, the overhead will be smaller.
> If it comes out of a 4k random write pattern it will be bigger.  Again,
> very interested in hearing user reports of what is observed in real-world
> situations.  I would trust that more than a calculation from first
> principles like the above.
> 
>> Default bluestore_cache_kv_ratio*bluestore_cache_size_ssd=0.99*3G
>> while default bluestore_cache_kv_max=512M
>> Looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6 in
>> default case. Is 512M enough for bluestore metadata?
> 
> In Mark's testing he found that we got more performance benefit when
> small caches were devoted to rocksdb and large caches were devoted mostly
> to the bluestore metadata cache (parsed onodes vs caching the encoded
> on-disk content).  You can always adjust the 512m value upwards (and that
> may make sense for large cache sizes).  Again, very interested in hearing
> whether that works better or worse for your workload!
> 
> Thanks-
> sage
> 

-- 

Best regards,
Aleksei Gutikov
Software Engineer | synesis.ru | Minsk. BY

[-- Attachment #2: osd_mem_stats.txt --]
[-- Type: text/plain, Size: 17077 bytes --]


OSD device class: ssd
=====================


heap_dump
---------

    +-----------------------------------+----------+-----------+-----------+
    |                            metric |      min |      mean |       max |
    +-----------------------------------+----------+-----------+-----------+
    |  Actual_memory_used_physical_swap |  1.67 Gi |   6.58 Gi |   8.70 Gi |
    |   Bytes_in_central_cache_freelist |  4.95 Mi | 118.15 Mi | 294.81 Mi |
    |          Bytes_in_malloc_metadata |  5.01 Mi |  26.56 Mi |  39.82 Mi |
    |       Bytes_in_page_heap_freelist |   0.0    |    0.0    |    0.0    |
    |   Bytes_in_thread_cache_freelists | 24.26 Mi |  94.57 Mi | 109.40 Mi |
    |  Bytes_in_transfer_cache_freelist |  7.04 Mi |  10.93 Mi |  18.05 Mi |
    |       Bytes_in_use_by_application |  1.63 Gi |   6.34 Gi |   8.40 Gi |
    | Bytes_released_to_OS_aka_unmapped |  4.42 Mi | 417.02 Mi | 933.85 Mi |
    |                      Spans_in_use | 44.98 Ki | 383.28 Ki | 611.09 Ki |
    |                Tcmalloc_page_size |  8.00 Ki |   8.00 Ki |   8.00 Ki |
    |               Thread_heaps_in_use |  41.0    |   45.8    |   47.0    |
    |        Virtual_address_space_used |  1.67 Gi |   6.99 Gi |   9.61 Gi |
    +-----------------------------------+----------+-----------+-----------+

mempool_bytes
-------------

    +----------------------------+----------+-----------+-----------+
    |                     metric |      min |      mean |       max |
    +----------------------------+----------+-----------+-----------+
    |               bloom_filter |   0.0    |    0.0    |    0.0    |
    |                     bluefs | 21.97 Ki |  26.27 Ki |  33.38 Ki |
    |            bluestore_alloc |  1.07 Mi |   1.38 Mi |   1.72 Mi |
    |       bluestore_cache_data | 53.20 Mi | 965.17 Mi |   1.73 Gi |
    |      bluestore_cache_onode |  1.44 Mi | 682.79 Mi |   1.25 Gi |
    |      bluestore_cache_other | 11.58 Mi | 718.71 Mi |   1.16 Gi |
    |             bluestore_fsck |   0.0    |    0.0    |    0.0    |
    |              bluestore_txc |  1.41 Ki |   6.84 Ki |  12.66 Ki |
    |          bluestore_writing | 87.17 Ki | 693.05 Ki | 905.40 Ki |
    | bluestore_writing_deferred | 44.00 Ki | 457.67 Ki |   4.09 Mi |
    |                buffer_anon | 10.09 Mi | 995.24 Mi |   1.55 Gi |
    |                buffer_meta | 98.14 Ki |   1.51 Mi |   3.07 Mi |
    |                     mds_co |   0.0    |    0.0    |    0.0    |
    |                        osd |  1.27 Mi |   1.49 Mi |   1.77 Mi |
    |                  osd_mapbl |   0.0    | 338.79 Ki |   1.80 Mi |
    |                  osd_pglog | 68.36 Mi |  89.42 Mi | 117.34 Mi |
    |                     osdmap |  1.11 Mi |   1.26 Mi |   1.38 Mi |
    |             osdmap_mapping |   0.0    |    0.0    |    0.0    |
    |                      pgmap |   0.0    |    0.0    |    0.0    |
    |                      total |  1.31 Gi |   3.38 Gi |   4.09 Gi |
    |                 unittest_1 |   0.0    |    0.0    |    0.0    |
    |                 unittest_2 |   0.0    |    0.0    |    0.0    |
    +----------------------------+----------+-----------+-----------+

mempool_item_size
-----------------

    +----------------------------+----------+----------+-----------+
    |                     metric |      min |     mean |       max |
    +----------------------------+----------+----------+-----------+
    |               bloom_filter |   0.0    |   0.0    |    0.0    |
    |                     bluefs |  34.4    |  45.1    |   63.3    |
    |            bluestore_alloc |   1.0    |   1.0    |    1.0    |
    |       bluestore_cache_data | 34.73 Ki | 55.22 Ki |  58.67 Ki |
    |      bluestore_cache_onode | 680.0    | 680.0    |  680.0    |
    |      bluestore_cache_other |   2.9    |   4.0    |   15.5    |
    |             bluestore_fsck |   0.0    |   0.0    |    0.0    |
    |              bluestore_txc | 720.0    | 720.0    |  720.0    |
    |          bluestore_writing | 43.59 Ki | 65.47 Ki |  69.65 Ki |
    | bluestore_writing_deferred |  4.40 Ki | 25.40 Ki | 209.60 Ki |
    |                buffer_anon |  92.4    | 112.3    |  276.7    |
    |                buffer_meta |  88.0    |  88.0    |   88.0    |
    |                     mds_co |   0.0    |   0.0    |    0.0    |
    |                        osd | 11.78 Ki | 11.78 Ki |  11.78 Ki |
    |                  osd_mapbl |   0.0    | 14.44 Ki |  47.27 Ki |
    |                  osd_pglog | 217.2    | 261.3    |  279.0    |
    |                     osdmap |  29.2    |  32.4    |   33.1    |
    |             osdmap_mapping |   0.0    |   0.0    |    0.0    |
    |                      pgmap |   0.0    |   0.0    |    0.0    |
    |                      total |  10.1    |  60.4    |  599.6    |
    |                 unittest_1 |   0.0    |   0.0    |    0.0    |
    |                 unittest_2 |   0.0    |   0.0    |    0.0    |
    +----------------------------+----------+----------+-----------+

mempool_items
-------------

    +----------------------------+-----------+-----------+-----------+
    |                     metric |       min |      mean |       max |
    +----------------------------+-----------+-----------+-----------+
    |               bloom_filter |    0.0    |    0.0    |    0.0    |
    |                     bluefs |  387.0    |  612.1    |  917.0    |
    |            bluestore_alloc |   1.07 Mi |   1.38 Mi |   1.72 Mi |
    |       bluestore_cache_data |  987.0    |  17.46 Ki |  35.55 Ki |
    |      bluestore_cache_onode |   2.17 Ki |   1.00 Mi |   1.88 Mi |
    |      bluestore_cache_other | 769.63 Ki | 240.06 Mi | 392.61 Mi |
    |             bluestore_fsck |    0.0    |    0.0    |    0.0    |
    |              bluestore_txc |    2.0    |    9.7    |   18.0    |
    |          bluestore_writing |    2.0    |   10.4    |   13.0    |
    | bluestore_writing_deferred |   10.0    |   17.6    |   27.0    |
    |                buffer_anon |  48.34 Ki |   9.69 Mi |  15.35 Mi |
    |                buffer_meta |   1.12 Ki |  17.59 Ki |  35.68 Ki |
    |                     mds_co |    0.0    |    0.0    |    0.0    |
    |                        osd |  110.0    |  129.8    |  154.0    |
    |                  osd_mapbl |    0.0    |    7.2    |   39.0    |
    |                  osd_pglog | 267.34 Ki | 353.21 Ki | 501.38 Ki |
    |                     osdmap |  38.36 Ki |  39.74 Ki |  47.61 Ki |
    |             osdmap_mapping |    0.0    |    0.0    |    0.0    |
    |                      pgmap |    0.0    |    0.0    |    0.0    |
    |                      total |   2.50 Mi | 252.55 Mi | 411.59 Mi |
    |                 unittest_1 |    0.0    |    0.0    |    0.0    |
    |                 unittest_2 |    0.0    |    0.0    |    0.0    |
    +----------------------------+-----------+-----------+-----------+

proc_status
-----------

    +----------------------------+-----------+-----------+-----------+
    |                     metric |       min |      mean |       max |
    +----------------------------+-----------+-----------+-----------+
    |                    RssAnon |   1.65 Gi |   6.56 Gi |   8.66 Gi |
    |                    RssFile |  27.91 Mi |  28.70 Mi |  29.05 Mi |
    |                   RssShmem |    0.0    |    0.0    |    0.0    |
    |                    Threads |   65.0    |   66.9    |   67.0    |
    |                     VmData |   2.30 Gi |   7.63 Gi |  10.26 Gi |
    |                      VmExe |  18.79 Mi |  18.79 Mi |  18.79 Mi |
    |                      VmHWM |   1.68 Gi |   6.77 Gi |   8.92 Gi |
    |                      VmLck |    0.0    |    0.0    |    0.0    |
    |                      VmLib |  15.21 Mi |  15.21 Mi |  15.21 Mi |
    |                      VmPMD |  20.00 Ki |  43.44 Ki |  56.00 Ki |
    |                      VmPTE |   3.87 Mi |  14.52 Mi |  19.77 Mi |
    |                     VmPeak |   2.43 Gi |   8.08 Gi |  10.77 Gi |
    |                      VmPin |    0.0    |    0.0    |    0.0    |
    |                      VmRSS |   1.67 Gi |   6.58 Gi |   8.69 Gi |
    |                     VmSize |   2.43 Gi |   7.77 Gi |  10.40 Gi |
    |                      VmStk | 804.00 Ki | 808.11 Ki | 812.00 Ki |
    |                     VmSwap |    0.0    |    0.0    |    0.0    |
    | nonvoluntary_ctxt_switches |   53.0    |  299.6    |  978.0    |
    |    voluntary_ctxt_switches |  19.93 Ki |  26.59 Ki |  39.16 Ki |
    +----------------------------+-----------+-----------+-----------+

OSD device class: hdd
=====================


heap_dump
---------

    +-----------------------------------+-----------+-----------+-----------+
    |                            metric |       min |      mean |       max |
    +-----------------------------------+-----------+-----------+-----------+
    |  Actual_memory_used_physical_swap |   2.93 Gi |   3.19 Gi |   3.42 Gi |
    |   Bytes_in_central_cache_freelist | 173.28 Mi | 231.17 Mi | 285.54 Mi |
    |          Bytes_in_malloc_metadata |  12.61 Mi |  13.46 Mi |  14.00 Mi |
    |       Bytes_in_page_heap_freelist |    0.0    |    0.0    |    0.0    |
    |   Bytes_in_thread_cache_freelists |  74.56 Mi |  93.96 Mi | 105.09 Mi |
    |  Bytes_in_transfer_cache_freelist |  10.89 Mi |  15.77 Mi |  19.35 Mi |
    |       Bytes_in_use_by_application |   2.62 Gi |   2.85 Gi |   3.05 Gi |
    | Bytes_released_to_OS_aka_unmapped |  95.78 Mi | 285.71 Mi | 480.20 Mi |
    |                      Spans_in_use | 171.07 Ki | 185.28 Ki | 195.04 Ki |
    |                Tcmalloc_page_size |   8.00 Ki |   8.00 Ki |   8.00 Ki |
    |               Thread_heaps_in_use |   33.0    |   33.2    |   36.0    |
    |        Virtual_address_space_used |   3.14 Gi |   3.47 Gi |   3.69 Gi |
    +-----------------------------------+-----------+-----------+-----------+

mempool_bytes
-------------

    +----------------------------+-----------+-----------+-----------+
    |                     metric |       min |      mean |       max |
    +----------------------------+-----------+-----------+-----------+
    |               bloom_filter |    0.0    |    0.0    |    0.0    |
    |                     bluefs |   9.63 Ki |  11.94 Ki |  13.67 Ki |
    |            bluestore_alloc |  26.64 Ki | 379.62 Ki |   1.10 Mi |
    |       bluestore_cache_data |    0.0    | 552.91 Ki |  19.61 Mi |
    |      bluestore_cache_onode | 187.35 Mi | 192.72 Mi | 197.83 Mi |
    |      bluestore_cache_other | 304.98 Mi | 318.62 Mi | 322.48 Mi |
    |             bluestore_fsck |    0.0    |    0.0    |    0.0    |
    |              bluestore_txc |   1.41 Ki |  53.80 Ki | 113.91 Ki |
    |          bluestore_writing | 808.52 Ki |   1.25 Mi |   7.77 Mi |
    | bluestore_writing_deferred |   6.92 Ki | 730.02 Ki |   4.60 Mi |
    |                buffer_anon | 835.34 Mi | 967.42 Mi |   1.08 Gi |
    |                buffer_meta |   7.65 Ki |  10.18 Ki |  20.54 Ki |
    |                     mds_co |    0.0    |    0.0    |    0.0    |
    |                        osd |   1.39 Mi |   1.71 Mi |   1.97 Mi |
    |                  osd_mapbl |    0.0    |    0.0    |    0.0    |
    |                  osd_pglog | 103.92 Mi | 129.63 Mi | 149.63 Mi |
    |                     osdmap |   1.24 Mi |   1.25 Mi |   1.26 Mi |
    |             osdmap_mapping |    0.0    |    0.0    |    0.0    |
    |                      pgmap |    0.0    |    0.0    |    0.0    |
    |                      total |   1.42 Gi |   1.58 Gi |   1.73 Gi |
    |                 unittest_1 |    0.0    |    0.0    |    0.0    |
    |                 unittest_2 |    0.0    |    0.0    |    0.0    |
    +----------------------------+-----------+-----------+-----------+

mempool_item_size
-----------------

    +----------------------------+----------+----------+-----------+
    |                     metric |      min |     mean |       max |
    +----------------------------+----------+----------+-----------+
    |               bloom_filter |   0.0    |   0.0    |    0.0    |
    |                     bluefs |  40.0    |  43.1    |   61.8    |
    |            bluestore_alloc |   1.0    |   1.0    |    1.0    |
    |       bluestore_cache_data |   0.0    | 15.18 Ki | 248.00 Ki |
    |      bluestore_cache_onode | 680.0    | 680.0    |  680.0    |
    |      bluestore_cache_other |   7.1    |   7.3    |    7.4    |
    |             bluestore_fsck |   0.0    |   0.0    |    0.0    |
    |              bluestore_txc | 720.0    | 720.0    |  720.0    |
    |          bluestore_writing |  9.11 Ki | 37.39 Ki | 357.00 Ki |
    | bluestore_writing_deferred |  5.16 Ki | 26.33 Ki | 503.09 Ki |
    |                buffer_anon | 381.1    | 421.3    |  475.5    |
    |                buffer_meta |  88.0    |  88.0    |   88.0    |
    |                     mds_co |   0.0    |   0.0    |    0.0    |
    |                        osd | 11.78 Ki | 11.78 Ki |  11.78 Ki |
    |                  osd_mapbl |   0.0    |   0.0    |    0.0    |
    |                  osd_pglog | 264.0    | 272.6    |  281.4    |
    |                     osdmap |  32.6    |  32.7    |   33.1    |
    |             osdmap_mapping |   0.0    |   0.0    |    0.0    |
    |                      pgmap |   0.0    |   0.0    |    0.0    |
    |                      total |  30.9    |  34.3    |   37.6    |
    |                 unittest_1 |   0.0    |   0.0    |    0.0    |
    |                 unittest_2 |   0.0    |   0.0    |    0.0    |
    +----------------------------+----------+----------+-----------+

mempool_items
-------------

    +----------------------------+-----------+-----------+-----------+
    |                     metric |       min |      mean |       max |
    +----------------------------+-----------+-----------+-----------+
    |               bloom_filter |    0.0    |    0.0    |    0.0    |
    |                     bluefs |  172.0    |  286.0    |  315.0    |
    |            bluestore_alloc |  26.64 Ki | 379.62 Ki |   1.10 Mi |
    |       bluestore_cache_data |    0.0    |    3.7    |  147.0    |
    |      bluestore_cache_onode | 282.12 Ki | 290.21 Ki | 297.91 Ki |
    |      bluestore_cache_other |  42.30 Mi |  43.65 Mi |  44.03 Mi |
    |             bluestore_fsck |    0.0    |    0.0    |    0.0    |
    |              bluestore_txc |    2.0    |   76.5    |  162.0    |
    |          bluestore_writing |   12.0    |   85.5    |  133.0    |
    | bluestore_writing_deferred |    1.0    |   42.3    |   80.0    |
    |                buffer_anon |   2.18 Mi |   2.30 Mi |   2.35 Mi |
    |                buffer_meta |   89.0    |  118.4    |  239.0    |
    |                     mds_co |    0.0    |    0.0    |    0.0    |
    |                        osd |  121.0    |  148.9    |  171.0    |
    |                  osd_mapbl |    0.0    |    0.0    |    0.0    |
    |                  osd_pglog | 378.11 Ki | 487.10 Ki | 569.86 Ki |
    |                     osdmap |  38.36 Ki |  39.18 Ki |  39.65 Ki |
    |             osdmap_mapping |    0.0    |    0.0    |    0.0    |
    |                      pgmap |    0.0    |    0.0    |    0.0    |
    |                      total |  45.39 Mi |  47.11 Mi |  48.00 Mi |
    |                 unittest_1 |    0.0    |    0.0    |    0.0    |
    |                 unittest_2 |    0.0    |    0.0    |    0.0    |
    +----------------------------+-----------+-----------+-----------+

proc_status
-----------

    +----------------------------+-----------+-----------+-----------+
    |                     metric |       min |      mean |       max |
    +----------------------------+-----------+-----------+-----------+
    |                    RssAnon |   2.93 Gi |   3.17 Gi |   3.41 Gi |
    |                    RssFile |  28.10 Mi |  28.65 Mi |  29.01 Mi |
    |                   RssShmem |    0.0    |    0.0    |    0.0    |
    |                    Threads |   56.0    |   56.0    |   56.0    |
    |                     VmData |   3.71 Gi |   4.03 Gi |   4.25 Gi |
    |                      VmExe |  18.79 Mi |  18.79 Mi |  18.79 Mi |
    |                      VmHWM |   3.06 Gi |   3.34 Gi |   3.54 Gi |
    |                      VmLck |    0.0    |    0.0    |    0.0    |
    |                      VmLib |  15.21 Mi |  15.21 Mi |  15.21 Mi |
    |                      VmPMD |  24.00 Ki |  28.62 Ki |  32.00 Ki |
    |                      VmPTE |   6.79 Mi |   7.45 Mi |   7.89 Mi |
    |                     VmPeak |   3.99 Gi |   4.35 Gi |   4.60 Gi |
    |                      VmPin |    0.0    |    0.0    |    0.0    |
    |                      VmRSS |   2.96 Gi |   3.20 Gi |   3.44 Gi |
    |                     VmSize |   3.84 Gi |   4.16 Gi |   4.39 Gi |
    |                      VmStk | 804.00 Ki | 808.40 Ki | 812.00 Ki |
    |                     VmSwap |    0.0    |    0.0    |    0.0    |
    | nonvoluntary_ctxt_switches |  462.0    |  24.04 Ki |  40.61 Ki |
    |    voluntary_ctxt_switches |  43.45 Ki |  61.96 Ki |  89.04 Ki |
    +----------------------------+-----------+-----------+-----------+


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: luminous OSD memory usage
  2017-10-20  8:52   ` Aleksei Gutikov
@ 2017-10-20 12:12     ` Sage Weil
  0 siblings, 0 replies; 7+ messages in thread
From: Sage Weil @ 2017-10-20 12:12 UTC (permalink / raw)
  To: Aleksei Gutikov; +Cc: Ceph Development

On Fri, 20 Oct 2017, Aleksei Gutikov wrote:
> Hi,
> 
> Here are some stats about OSD memory usage from our lab cluster
> (110hdd*1T + 36ssd*200G)
> 
> https://drive.google.com/drive/folders/0B1s9jTJ0z59JcmtEX283WUd5bzg?usp=sharing
> 
> osd_mem_stats.txt contains stats for hdd and ssd OSDs:
> - stats for process memory usage from /proc/pid/status,
> - mempools stats
> - and heap stats
> 
> We use luminous 12.2.1.
> We set a 9G memory limit in
> /lib/systemd/system/ceph-osd@.service.d/override.conf,
> but that is not enough - OSDs are still being killed because they use more
> (with bluestore_cache_size_ssd=3G).
> OSDs on hdd (with bluestore_cache_size_hdd=1G) use up to 4.6G.
> 
> And, btw, it seems that systemd's oom killer looks at VmData, not at VmRSS.

There was a fix for the bluestore memory usage calculation that didn't 
make it into 12.1.1 (e.g., f60a942023088cbba53a816e6ef846994921cab3).  
You can repeat your test with the latest luminous branch, or wait a week or so 
for 12.2.2.

Thanks!
sage


> 
> Thanks.
> 
> On 08/30/2017 06:17 PM, Sage Weil wrote:
> > Hi Aleksei,
> > 
> > On Wed, 30 Aug 2017, Aleksei Gutikov wrote:
> > > Hi.
> > > 
> > > I'm trying to align osd daemon memory limits with the bluestore cache
> > > settings.
> > > For 12.1.4 we see hdd osd memory usage of about 4G with default settings.
> > > For ssds we have a 6G limit and they are being oom-killed periodically.
> > 
> > So,
> >   
> > > While
> > > osd_op_num_threads_per_shard_hdd=1
> > > osd_op_num_threads_per_shard_ssd=2
> > > and
> > > osd_op_num_shards_hdd=5
> > > osd_op_num_shards_ssd=8
> > 
> > aren't relevant to memory usage.  The _per_shard is about how many bytes
> > are stored in each rocksdb key, and the num_shards is about how many
> > threads we use.
> > 
> > This is the one that matters:
> > 
> > > bluestore_cache_size_hdd=1G
> > > bluestore_cache_size_ssd=3G
> > 
> > It governs how much memory bluestore limits itself to.  The bad news is
> > that bluestore counts what it allocates, not how much memory the allocator
> > uses, so there is some overhead.  From what I've anecdotally seen it's
> > something like 1.5x, which kind of sucks; there is more to be done here.
> > 
> > On top of that is usage by the OSD outside of bluestore,
> > which is somewhere in the 500m to 1.5g range.
> > 
> > We're very interested in hearing the RSS that users observe relative to
> > the configured bluestore size and pg count, along with a dump of the
> > mempool metadata (ceph daemon osd.NNN dump_mempools).
> > 
> > > Does anybody have an idea about the equation for upper bound of memory
> > > consumption?
> > 
> > Very roughly, something like: osd_overhead + bluestore_cache_size * 1.5 ?
> > 
> > > Can bluestore_cache_size be decreased safely for example to 2G, or to 1G?
> > 
> > Yes, you can/should change this to whatever you like (big or small).
> > 
> > > I want to calculate the maximum expected size of bluestore metadata (that
> > > must
> > > be definitely fit into cache) using size of raw space, average size of
> > > objects, rocksdb space amplification.
> > > I thought it should be something simple like
> > > raw_space/avg_obj_size*obj_overhead*rocksdb_space_amp.
> > > For example if obj_overhead=1k, hdd size=1T, rocksdb space amplification
> > > is 2
> > > and avg obj size=4M then 1T/4M*1k*2=500M so I need at least 512M for
> > > cache.
> > > But wise guys said that I have to take into account number of extents
> > > also.
> > > But bluestore_extent_map_shard_max_size=1200, I hope this number is not a
> > > multiplicator...
> > 
> > Nope, just a shard size...
> > 
> > > What would be correct approach for calculation of this minimum cache size?
> > > What can be expected size of key-values stored in rocksdb per rados
> > > object?
> > 
> > This depends, unfortunately, on what the write pattern is for the objects.
> > If they're written by RGW in big chunks, the overhead will be smaller.
> > If it comes out of a 4k random write pattern it will be bigger.  Again,
> > very interested in hearing user reports of what is observed in real-world
> > situations.  I would trust that more than a calculation from first
> > principles like the above.
> > 
> > > Default bluestore_cache_kv_ratio*bluestore_cache_size_ssd=0.99*3G
> > > while default bluestore_cache_kv_max=512M
> > > Looks like BlueStore::_set_cache_sizes() will set cache_kv_ratio to 1/6 in
> > > default case. Is 512M enough for bluestore metadata?
> > 
> > In Mark's testing he found that we got more performance benefit when
> > small caches were devoted to rocksdb and large caches were devoted mostly
> > to the bluestore metadata cache (parsed onodes vs caching the encoded
> > on-disk content).  You can always adjust the 512m value upwards (and that
> > may make sense for large cache sizes).  Again, very interested in hearing
> > whether that works better or worse for your workload!
> > 
> > Thanks-
> > sage
> > 
> 
> -- 
> 
> Best regards,
> Aleksei Gutikov
> Software Engineer | synesis.ru | Minsk. BY
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-10-20 12:12 UTC | newest]

Thread overview: 7+ messages
2017-08-30 10:00 luminous OSD memory usage Aleksei Gutikov
2017-08-30 15:17 ` Sage Weil
2017-09-01  2:00   ` xiaoyan li
2017-09-01 12:51     ` Mark Nelson
2017-09-01 13:10     ` Sage Weil
2017-10-20  8:52   ` Aleksei Gutikov
2017-10-20 12:12     ` Sage Weil
