* Bluestore memory usage on our test cluster
@ 2017-08-30 19:48 Mark Nelson
       [not found] ` <CABbLZY3HEowq1WvSty0dS_Wr=+TAGcAgf9NM_b3C=v5oje2fgA@mail.gmail.com>
  2017-09-01  1:04 ` xiaoyan li
  0 siblings, 2 replies; 4+ messages in thread
From: Mark Nelson @ 2017-08-30 19:48 UTC (permalink / raw)
  To: Sage Weil, Josh Durgin, ceph-devel

Based on the recent conversation about bluestore memory usage, I did a 
survey of all of the bluestore OSDs in one of our internal test 
clusters.  The one with the highest RSS usage at the time was osd.82:

  6017 ceph      20   0 4488440 2.648g   5004 S   3.0 16.9   5598:01 
ceph-osd

In the grand scheme of bluestore memory usage, I've seen higher RSS 
usage, but usually with bluestore_cache cranked up higher.  On these 
nodes, I believe Sage said the bluestore_cache size is being set to 
512MB to keep memory usage down.

To dig into this more, mempool data from the osd can be dumped via:

sudo ceph daemon osd.82 dump_mempools
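
The admin socket returns raw byte counts in JSON; to get per-pool MB 
like the list below, something along these lines should work (assuming 
a flat pool-name -> {items, bytes} layout, which may differ by 
release):

sudo ceph daemon osd.82 dump_mempools | \
    jq -r 'to_entries[] | "\(.key): \(.value.bytes/1048576)MB"'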

A slightly compressed version of that data follows.  Note that the 
allocated space for bluestore_cache_* isn't terribly high; buffer_anon 
and osd_pglog together are taking up more space:

bloom_filters: 0MB
bluestore_alloc: 13.5MB
bluestore_cache_data: 0MB
bluestore_cache_onode: 234.7MB
bluestore_cache_other: 277.3MB
bluestore_fsck: 0MB
bluestore_txc: 0MB
bluestore_writing_deferred: 5.4MB
bluestore_writing: 11.1MB
bluefs: 0.1MB
buffer_anon: 386.1MB
buffer_meta: 0MB
osd: 4.4MB
osd_mapbl: 0MB
osd_pglog: 181.4MB
osdmap: 0.7MB
osdmap_mapping: 0MB
pgmap: 0MB
unittest_1: 0MB
unittest_2: 0MB

total: 1114.8MB

A heap dump from tcmalloc shows a fair amount of data yet to be returned 
to the OS:

sudo ceph tell osd.82 heap start_profiler
sudo ceph tell osd.82 heap dump

osd.82 dumping heap profile now.
------------------------------------------------
MALLOC:     2364583720 ( 2255.0 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    360267096 (  343.6 MiB) Bytes in central cache freelist
MALLOC: +     10953808 (   10.4 MiB) Bytes in transfer cache freelist
MALLOC: +    114290480 (  109.0 MiB) Bytes in thread cache freelists
MALLOC: +     13562016 (   12.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2863657120 ( 2731.0 MiB) Actual memory used (physical + swap)
MALLOC: +    997007360 (  950.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   3860664480 ( 3681.8 MiB) Virtual address space used
MALLOC:
MALLOC:         156783              Spans in use
MALLOC:             35              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
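
(As an aside, tcmalloc can usually be told to hand its freelists back 
to the OS with "sudo ceph tell osd.82 heap release", but that wouldn't 
explain the unaccounted-for memory here.)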


The heap profile shows about the same as top, once bytes released to 
the OS are excluded.  Another ~500MB is being used by tcmalloc for 
various caches and metadata, and ~1.1GB we can account for in the 
mempools.
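
To put numbers on it: the 2255.0 MiB in use by the application minus 
the 1114.8 MiB visible in the mempools leaves ~1140 MiB unaccounted 
for, and the freelists plus malloc metadata (343.6 + 10.4 + 109.0 + 
12.9 = 476 MiB) are the "~500MB" above; together those add up to the 
2731 MiB of actual memory used.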

The question is where that other ~1GB goes.  Is it allocations that 
are not made via the mempools?  Heap fragmentation?  A combination of 
several things?  I don't know how to get heap-fragmentation statistics 
out of tcmalloc, but jemalloc would potentially let us compute them 
via:

malloc_stats_print()

External fragmentation: 1.0 - (allocated/active)
Virtual fragmentation: 1.0 - (active/mapped)
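
For example, a minimal sketch against jemalloc's mallctl() interface 
(untested; the stats.* names are from the jemalloc man page; build 
with -ljemalloc):

#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
    uint64_t epoch = 1;
    size_t allocated, active, mapped, sz = sizeof(size_t);

    /* bump the epoch to refresh jemalloc's cached statistics */
    mallctl("epoch", NULL, NULL, &epoch, sizeof(epoch));

    mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    mallctl("stats.active",    &active,    &sz, NULL, 0);
    mallctl("stats.mapped",    &mapped,    &sz, NULL, 0);

    printf("external fragmentation: %.3f\n",
           1.0 - (double)allocated / (double)active);
    printf("virtual fragmentation:  %.3f\n",
           1.0 - (double)active / (double)mapped);
    return 0;
}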

Mark


* Re: Bluestore memory usage on our test cluster
       [not found] ` <CABbLZY3HEowq1WvSty0dS_Wr=+TAGcAgf9NM_b3C=v5oje2fgA@mail.gmail.com>
@ 2017-08-31  2:49   ` Mark Nelson
  2017-08-31  4:07     ` Alexandre DERUMIER
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Nelson @ 2017-08-31  2:49 UTC (permalink / raw)
  To: Varada Kari; +Cc: Sage Weil, Josh Durgin, ceph-devel

Yep.  FWIW, the last time I looked at jemalloc it was both faster and 
used more memory than tcmalloc.  That may simply have been due to more 
thread cache being used, but I didn't have any way to verify at the 
time.

I think we still need to audit and make sure there isn't a bunch of 
memory allocated outside of the mempools.

Mark

On 08/30/2017 09:25 PM, Varada Kari wrote:
> Hi Mark,
>
> One thing pending on the wish-list is building profiler hooks for
> jemalloc like the ones we have for tcmalloc now.  That would let us
> do a fair comparison with tcmalloc and check whether this is due to
> fragmentation in the allocators.
>
> Varada
>> On 31-Aug-2017, at 1:18 AM, Mark Nelson <mnelson@redhat.com> wrote:
>>
>> [...]
>


* Re: Bluestore memory usage on our test cluster
  2017-08-31  2:49   ` Mark Nelson
@ 2017-08-31  4:07     ` Alexandre DERUMIER
  0 siblings, 0 replies; 4+ messages in thread
From: Alexandre DERUMIER @ 2017-08-31  4:07 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Varada Kari, Sage Weil, Josh Durgin, ceph-devel

>>Yep. FWIW, the last time I looked at jemalloc it was both faster and 
>>resulted in higher memory use vs tcmalloc.

It would be great to test jemalloc 4.x and 5.x built with 
--with-malloc-conf=purge:decay to compare.
(The plan was to enable it by default in jemalloc 5.x; I don't know if 
that has happened yet.)
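
(On 4.x the same behaviour can, I believe, also be selected at run 
time via the environment, e.g. MALLOC_CONF=purge:decay,decay_time:10; 
those option names are from the jemalloc 4.x documentation, and 5.x 
renamed them.)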


Also, glibc 2.26 now supports a native per-thread cache; some small 
benchmarks show it to be faster than jemalloc:
https://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.26-Redis-Test


I'll try to test them in the coming months.

Alexandre



* Re: Bluestore memory usage on our test cluster
  2017-08-30 19:48 Bluestore memory usage on our test cluster Mark Nelson
       [not found] ` <CABbLZY3HEowq1WvSty0dS_Wr=+TAGcAgf9NM_b3C=v5oje2fgA@mail.gmail.com>
@ 2017-09-01  1:04 ` xiaoyan li
  1 sibling, 0 replies; 4+ messages in thread
From: xiaoyan li @ 2017-09-01  1:04 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, Josh Durgin, ceph-devel

On Thu, Aug 31, 2017 at 3:48 AM, Mark Nelson <mnelson@redhat.com> wrote:
> [...]
> The question is where that other ~1GB goes.  Is it allocations not
> made via the mempools?  Heap fragmentation?  A combination of several
> things?  I don't know how to get heap-fragmentation statistics out of
> tcmalloc, but jemalloc would potentially let us compute them via:
>
> malloc_stats_print()

It seems the other 1GB should be the RocksDB memtables; those are not
included in the kv cache size.
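
If the OSD exposes its RocksDB perf counters, "sudo ceph daemon osd.82
perf dump" may help confirm that, though I am not sure whether memtable
sizes show up there in this release.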




-- 
Best wishes
Lisa

