* Memory Pooling and Containers
@ 2016-09-27 20:30 Allen Samuels
  2016-09-27 23:00 ` Gregory Farnum
  2016-09-28 13:27 ` Sage Weil
  0 siblings, 2 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-27 20:30 UTC (permalink / raw)
  To: Sage Weil, Ceph Development

As we discussed in the Bluestore standup this morning, this is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...

Here is my list of requirements:

(1) Should be able to create an arbitrary number of "pools" of memory.
(2) Developers should be able to declare that a particular container (i.e., STL or boost-like container) is wholly contained within a pool.
(3) Beyond declarations (and possibly constructor initialization), no explicit code is required to be written by developers to support (2). All container manipulation primitives properly update the accounting.
(4) Beyond construction/destruction costs, no container operation is burdened by additional code -- only implicit malloc/free operations are burdened with accounting.
(5) The system tracks the aggregate amount of memory consumed in each pool and it's relatively cheap to interrogate the current total consumption. 
(6) The system tracks the aggregate amount of memory consumed by each container in each pool -- but this is expensive to interrogate and is intended to be used primarily for debugging purposes.
(7) Generic object new/delete is possible, but is not freed from the accounting requirements -- especially #6.
(8) No backpressure is built into the scheme, i.e., nobody has to worry about suddenly being "out" of memory or being delayed -- just because some particular pool is filling up. That's a higher level problem to solve. No memory is "reserved" either -- If you overcommit, that's also not solved at this layer. IMO, this is a crappy place to be doing ingest and flow control.
(9) Implementation must be multi-thread and multi-socket aware. It should expect high levels of thread concurrency and avoid unnecessary global data manipulation (expect internal sharding of data structures -- something like an arena-based malloc scheme).
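
Requirements 5 and 9 together pretty much dictate the shape of the core accounting: sharded, cache-line-padded per-pool counters that are only summed when somebody asks. A minimal sketch of that idea follows -- names like mpool_stats and NUM_SHARDS are hypothetical placeholders, not existing code:

// Minimal sketch (not Ceph code): per-pool byte accounting kept in sharded,
// cache-line-aligned atomics so concurrent threads rarely touch the same line.
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

struct alignas(64) mpool_shard {
  std::atomic<int64_t> bytes{0};
};

class mpool_stats {
  static const size_t NUM_SHARDS = 32;          // hypothetical shard count
  mpool_shard shards[NUM_SHARDS];

  static size_t my_shard() {
    // any stable, cheap hash of the calling thread works here
    return std::hash<std::thread::id>()(std::this_thread::get_id()) % NUM_SHARDS;
  }

public:
  void add(size_t n) { shards[my_shard()].bytes.fetch_add(n, std::memory_order_relaxed); }
  void sub(size_t n) { shards[my_shard()].bytes.fetch_sub(n, std::memory_order_relaxed); }

  int64_t total() const {                       // requirement 5: cheap to interrogate
    int64_t t = 0;
    for (const auto& s : shards)
      t += s.bytes.load(std::memory_order_relaxed);
    return t;
  }
};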

Requirement 5 allows a "trimming" system to be developed. I think there are really two styles for this:

(a) Time-based, i.e., periodically some thread wakes up and checks memory usage within a pool. If it doesn't like it, then it's responsible for "fixing" it, i.e., trimming as needed.
(b) Event-based. No reason we couldn't set up an event or condition variable per pool and have the malloc/free code signal it. It adds one or two compare/branches to each malloc/free operation (which is pretty cheap), but doesn't have the latency costs of (a). The downside is that this implicitly assumes a single global thread is responsible for cleaning each pool, which works well when there are a relatively small number of pools.
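
For concreteness, the event-based variant (b) could reduce to something like the sketch below. These are hypothetical names (mpool_trim_hook, trim_target), and no claim is made about how the actual trim callback gets wired up:

// Minimal sketch of option (b): the allocation path pays only an atomic add
// plus one compare, and a single per-pool trimmer thread does the real work.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class mpool_trim_hook {
  std::atomic<int64_t> bytes{0};
  const int64_t trim_target;               // hypothetical per-pool threshold
  std::mutex mtx;
  std::condition_variable cv;

public:
  explicit mpool_trim_hook(int64_t target) : trim_target(target) {}

  void charge(int64_t n) {                 // called on every pool malloc
    int64_t now = bytes.fetch_add(n, std::memory_order_relaxed) + n;
    if (now > trim_target)                 // the "one or two compare/branches"
      cv.notify_one();
  }
  void uncharge(int64_t n) { bytes.fetch_sub(n, std::memory_order_relaxed); }

  void trimmer_loop() {                    // the single global thread per pool
    std::unique_lock<std::mutex> l(mtx);
    for (;;) {
      cv.wait(l, [this] { return bytes.load(std::memory_order_relaxed) > trim_target; });
      // ... ask the pool's owner (e.g. a cache) to evict until below trim_target ...
    }
  }
};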

Here is my list of anti-requirements:

(1) No hierarchical relationship between the pools. [IMO, this is kewl, but unnecessary and tends to screw up your cache, i.e., destroys #9.]
(2) No physical colocation of the allocated pool memory. The pool is "logical", i.e., an accounting mirage only.
(3) No reason to dynamically create/destroy memory pools. They can be statically declared (this dramatically simplifies the code that uses this system).

Let the discussion begin!!
/////////////////////////

Here is my proposed external interface to the design:

First, look at the slab_xxxx containers that I just submitted. You can find them at https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h

I would propose to extend those containers as the basis for the memory pooling.

First, there's a global enum that defines the memory pools -- yes, they're static and small in number:

enum mpool_index {
   MPOOL_ONE,
   MPOOL_TWO,
...
   MPOOL_LAST
};

And a global object for each pool:

class mpool; // TBD ... see below.

extern mpool g_mpool[MPOOL_LAST]; // one global object per pool (the actual definition of each pool lives in a single .cc file)

Each slab_xxx container template is augmented to receive an additional "enum mpool_index" template parameter.

That's ALL THAT'S required for the developer. In other words, if each definition of an STL container uses a typedef with the right mpool index, then you're done. The machinery takes care of everything else :)
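
To illustrate that convention, a pool-scoped declaration might end up looking like the sketch below. This is purely illustrative: the pool names, the slab size of 4, and the exact slab_vector signature (the trailing mpool_index parameter is the proposed extension) are all assumptions, not what slab_containers.h contains today.

// Illustration only -- assumed shape:
//   template<typename T, size_t SlabSize, mpool_index Pool> class slab_vector;
enum mpool_index { MPOOL_BLUESTORE_ONODE, MPOOL_BLUESTORE_CACHE, MPOOL_LAST };

typedef slab_vector<bluestore_pextent_t, 4, MPOOL_BLUESTORE_ONODE> pextent_vector_t;

struct Onode {
  pextent_vector_t pextents;  // everything this vector allocates is charged to MPOOL_BLUESTORE_ONODE
};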

Standalone objects, i.e., naked new/delete, are easily done by making the equivalent of a slab_intrusive_list and maybe a macro or two. There's some tricky initialization for this one (see below).
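
One possible shape for that -- a sketch only, where the helper class and the charge()/uncharge() calls on the still-TBD mpool class are assumptions -- is a small base class whose class-scoped operator new/delete bills the chosen pool:

// Sketch: pool accounting for naked new/delete of standalone objects.
#include <cstddef>
#include <new>

template <mpool_index Pool>
struct pool_allocated {
  static void* operator new(std::size_t sz) {
    void* p = ::operator new(sz);
    g_mpool[Pool].charge(sz);        // assumed mpool interface
    return p;
  }
  static void operator delete(void* p, std::size_t sz) {
    g_mpool[Pool].uncharge(sz);      // assumed mpool interface
    ::operator delete(p);
  }
};

// Usage: a standalone type that should be accounted simply inherits from the helper.
struct SomeStandaloneType : pool_allocated<MPOOL_ONE> {
  // ...
};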

-------------------------------------------

Implementation

-------------------------------------------

Requirement 6 is what drives this implementation.

I would extend each slab_xxxx container to also virtually inherit from a Pool_Member interface; this interface allows the memory pool global machinery to implement #6.

I propose that the ctor/dtor for Pool_Member (one for each container) put the container on (and remove it from) a list within the respective memory pool. This MUST be a synchronized operation, but we can shard the list to reduce collisions (use the low 4-5 bits of the creating thread pointer to index the shard -- this minimizes ctor expense but increases the dtor expense, which is often paid in "trim"). This assumes that the rate of container creation/destruction within a memory pool is not super high -- we could make this a compile-time option if it becomes too expensive.
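
Roughly, that registration might reduce to something like the following sketch (every name here is a placeholder -- Pool_Member's hooks, the shard count, and g_debug_shards are all assumed, and a hash of the thread id stands in for the "low bits of the thread pointer"):

// Sketch: each pooled container registers on one of N sharded per-pool lists.
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <thread>

class Pool_Member;

struct mpool_debug_shard {
  std::mutex lock;
  std::list<Pool_Member*> members;           // a real version would use an intrusive list
};

static const int SHARD_BITS = 5;             // "low 4-5 bits" -> 32 shards
extern mpool_debug_shard g_debug_shards[MPOOL_LAST][1 << SHARD_BITS];  // assumed layout

inline mpool_debug_shard& shard_for_this_thread(mpool_index pool) {
  std::size_t h = std::hash<std::thread::id>()(std::this_thread::get_id());
  return g_debug_shards[pool][h & ((1 << SHARD_BITS) - 1)];
}

class Pool_Member {
  mpool_debug_shard* home_shard;             // remembered so the dtor can find it again
  std::list<Pool_Member*>::iterator self;

protected:
  explicit Pool_Member(mpool_index pool) {
    home_shard = &shard_for_this_thread(pool);
    std::lock_guard<std::mutex> l(home_shard->lock);
    self = home_shard->members.insert(home_shard->members.end(), this);
  }
  virtual ~Pool_Member() {                   // may run on a different thread (e.g. trim)
    std::lock_guard<std::mutex> l(home_shard->lock);
    home_shard->members.erase(self);
  }

public:
  // the hooks that requirement #6 is built on:
  virtual std::size_t member_count() const = 0;    // "how many elements do you have?"
  virtual std::size_t element_bytes() const = 0;   // "how big is each element?"
  virtual std::string type_signature() const = 0;  // printable type name for grouping
};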

The per-pool sharded lists allow the debug routines to visit each container and ask things like "How many elements do you have?", "How big is each element?", and "Give me a printable string of the type signature for this container". Once you have this list you can generate lots of interesting debug reports, because you can sort by individual containers as well as group containers by their type signatures (i.e., combine the consumption of all "map<a,b>" containers as a group). You can report consumption both by bytes and by element count.
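
Continuing with the hypothetical names from the previous sketch, the "expensive but debug-only" report could be a small traversal along these lines:

// Sketch: group per-container consumption by type signature for a debug report.
#include <map>
#include <mutex>
#include <string>
#include <utility>

std::map<std::string, std::pair<std::size_t, std::size_t>>   // signature -> (elements, bytes)
pool_report(mpool_index pool) {
  std::map<std::string, std::pair<std::size_t, std::size_t>> by_type;
  for (auto& shard : g_debug_shards[pool]) {
    std::lock_guard<std::mutex> l(shard.lock);
    for (Pool_Member* m : shard.members) {
      auto& entry = by_type[m->type_signature()];
      entry.first  += m->member_count();                      // element count
      entry.second += m->member_count() * m->element_bytes(); // approximate bytes
    }
  }
  return by_type;
}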

This kind of information usually allows you to quickly figure out where the memory is being consumed. A bit of script wizardry would recognize that some containers contain other containers. For example, there's no reason a simple Python script couldn't recognize that each oNode might have a bunch of vector<pextents> within it and tell you things like the average number of pextents per oNode, or the average DRAM consumption per oNode (which is a pretty complicated combination of pextents, lextents, buffer::ptr, etc.).

Comments: ?????


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


* Re: Memory Pooling and Containers
  2016-09-27 20:30 Memory Pooling and Containers Allen Samuels
@ 2016-09-27 23:00 ` Gregory Farnum
  2016-09-28  9:13   ` Allen Samuels
  2016-09-28 13:16   ` Sage Weil
  2016-09-28 13:27 ` Sage Weil
  1 sibling, 2 replies; 15+ messages in thread
From: Gregory Farnum @ 2016-09-27 23:00 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Ceph Development

On Tue, Sep 27, 2016 at 1:30 PM, Allen Samuels
<Allen.Samuels@sandisk.com> wrote:
> As we discussed in the Bluestore standup this morning. This is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...

I'm afraid I don't have any useful thoughts on this, but I'm curious
if there's any chance of using these memory pools effectively outside
of BlueStore. Many of the features you discuss would be nice in the
MDS, which currently uses boost::pool for some of its metadata types,
but could really use something with a little more control (at least more
than we've set up so far).
-Greg


* RE: Memory Pooling and Containers
  2016-09-27 23:00 ` Gregory Farnum
@ 2016-09-28  9:13   ` Allen Samuels
  2016-09-28 13:16   ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28  9:13 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Ceph Development

Should work in any C++ context -- nothing that's BlueStore dependent.


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, September 28, 2016 1:01 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Sage Weil <sage@newdream.net>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Tue, Sep 27, 2016 at 1:30 PM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> > As we discussed in the Bluestore standup this morning. This is intended to
> start a discussion about creating some internal memory pooling technology
> to try to get a better handle on the internal usage of memory by Ceph. Let's
> start by discussing the requirements...
> 
> I'm afraid I don't have any useful thoughts on this, but I'm curious if there's
> any chance of using these memory pools effectively outside of BlueStore.
> Many of the features you discuss would be nice in the MDS, which currently
> uses boost::pool for some of its metadata types, but could really use
> something with a little more control (at least than we've set up so far).
> -Greg


* Re: Memory Pooling and Containers
  2016-09-27 23:00 ` Gregory Farnum
  2016-09-28  9:13   ` Allen Samuels
@ 2016-09-28 13:16   ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Sage Weil @ 2016-09-28 13:16 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Allen Samuels, Ceph Development

On Tue, 27 Sep 2016, Gregory Farnum wrote:
> On Tue, Sep 27, 2016 at 1:30 PM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> > As we discussed in the Bluestore standup this morning. This is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...
> 
> I'm afraid I don't have any useful thoughts on this, but I'm curious
> if there's any chance of using these memory pools effectively outside
> of BlueStore. Many of the features you discuss would be nice in the
> MDS, which currently uses boost::pool for some of its metadata types,
> but could really use something with a little more control (at least
> than we've set up so far).

Yeah, that is definitely a goal.  This would be useful for all services!

sage


* Re: Memory Pooling and Containers
  2016-09-27 20:30 Memory Pooling and Containers Allen Samuels
  2016-09-27 23:00 ` Gregory Farnum
@ 2016-09-28 13:27 ` Sage Weil
  2016-09-28 13:34   ` Haomai Wang
                     ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Sage Weil @ 2016-09-28 13:27 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ceph Development

On Tue, 27 Sep 2016, Allen Samuels wrote:
> As we discussed in the Bluestore standup this morning. This is intended 
> to start a discussion about creating some internal memory pooling 
> technology to try to get a better handle on the internal usage of memory 
> by Ceph. Let's start by discussing the requirements...
> 
> Here is my list of requirements:
> 
> (1) Should be able to create an arbitrary number of "pools" of memory.
>
> (2) Developers should be able to declare that a particular container 
> (i.e., STL or boost-like container) is wholly contained within a pool.
>
> (3) Beyond declarations (and possibly constructor initialization), no 
> explicit code is required to be written by developers to support (2). 
> All container manipulation primitives properly update the accounting.
>
> (4) Beyond construction/destruction costs, no container operation is 
> burdened by additional code -- only implicit malloc/free operations are 
> burdened with accounting.
>
> (5) The system tracks the aggregate amount of memory consumed in each 
> pool and it's relatively cheap to interrogate the current total 
> consumption.

Yes

> (6) The system tracks the aggregate amount of memory consumed by each 
> container in each pool -- but this is expensive to interrogate and is 
> intended to be used primarily for debugging purposes.

This one sounds like a nice-to-have to me.  If there is a performance cost 
I would skip it.

> (7) generic object new/delete is possible, but not freed of the 
> accounting requirements -- especially #6, i.e..
>
> (8) No backpressure is built into the scheme, i.e., nobody has to worry 
> about suddenly being "out" of memory or being delayed -- just because 
> some particular pool is filling up. That's a higher level problem to 
> solve. No memory is "reserved" either -- If you overcommit, that's also 
> not solved at this layer. IMO, this is a crappy place to be doing ingest 
> and flow control.
>
> (9) Implementation must be multi-thread and multi-socket aware. It 
> should expect high levels of thread concurrency and avoid unnecessary 
> global data manipulation (expect internal sharding of data structures -- 
> something like an arena-based malloc scheme).

Yes
 
> Requirement 5 allows a "trimming" system to be developed. I think there 
> are really two styles for this:
> 
> (a) Time-based, i.e., periodically some thread wakes up and checks 
> memory usage within a pool. If it doesn't like it, then it's responsible 
> for "fixing" it, i.e., trimming as needed.
>
> (b) event-based. No reason that we couldn't setup an event or condition 
> variable per-pool and have the malloc/free code trigger that 
> condition/variable. It adds one or two compare/branches to each malloc / 
> free operation (which is pretty cheap), but doesn't have the latency 
> costs of (a). The downside is that this implicitly assumes a single 
> global-thread is responsible for cleaning each pool which works well 
> when there are a relatively small number of pools.
> 
> Here is my list of anti-requirements:
> 
> (1) No hierarchical relationship between the pools. [IMO, this is kewl, 
> but unnecessary and tends to screw up your cache, i.e., destroys #9.
>
> (2) No physical colocation of the allocated pool memory. The pool is 
> "logical", i.e., an accounting mirage only.
>
> (3) No reason to dynamically create/destroy memory pools. They can be 
> statically declared (this dramatically simplifies the code that uses 
> this system).

Yes.  Great summary!

> Let the discussion begin!!
> /////////////////////////
> 
> Here is my proposed external interface to the design:
> 
> First, look at the slab_xxxx containers that I just submitted. You can 
> find them at 
> https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h
> 
> I would propose to extend those containers as the basis for the memory 
> pooling.
> 
> First, there's a global enum that defines the memory pools -- yes, 
> they're static and a small number
> 
> enum mpool_index {
>    MPOOL_ONE,
>    MPOOL_TWO,
> ...
>    MPOOL_LAST
> };
> 
> And a global object for each pool:
> 
> class mpool; // TBD ... see below.
> 
> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> 
> Each slab_xxx container template is augmented to expect receive an 
> additional "enum mpool_index" parameter.
> 
> That's ALL THAT'S required for the developer. In other words, if each 
> definition of an STL container uses a typedef with the right mpool 
> index, then you're done. The machinery takes care of everything else :)

FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of 
g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes 
the global instance; the latter means you could manage the memory pool 
however you like (e.g., as part of the CephContext for librados).  That's 
a small detail, though.
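
Something like this, purely for contrast (both allocator shapes below are
invented; neither exists in the tree):

// (a) pool picked by the enum (here a template parameter) -- always goes
//     through the global g_mpool table:
template <typename T, mpool_index Pool>
struct mpool_allocator_static {
  T* allocate(std::size_t n);        // would charge g_mpool[Pool] internally
};

// (b) pool handed in at construction -- the pool object can live anywhere,
//     e.g. inside a CephContext:
template <typename T>
struct mpool_allocator_dynamic {
  mpool* pool;
  explicit mpool_allocator_dynamic(mpool* p) : pool(p) {}
  T* allocate(std::size_t n);        // would charge *pool
};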

> Standalone objects, i.e., naked new/delete are easily done by making the 
> equivalent of a slab_intrusive_list and maybe a macro or two. There's 
> some tricky initialization for this one (see below).
> 
> -------------------------------------------
> 
> Implementation
> 
> -------------------------------------------
> 
> Requirement 6 is what drives this implementation.
> 
> I would extend each slab_xxxx container to also virtually inherit from a 
> Pool_Member interface, this interface allows the memory pool global 
> machinery to implement #6.
> 
> I propose that the ctor/dtor for Pool_Member (one for each container) 
> put itself on a list within the respective memory pool. This MUST be a 
> synchronized operation but we can shard the list to reduce the 
> collisions (use the low 4-5 bits of the creating thread pointer to index 
> the shard -- minimizes ctor expense but increases the dtor expense -- 
> which is often done in "trim"). This assumes that the rate of container 
> creation/destruction within a memory pool is not super high -- we could 
> make this be a compile-time option if it becomes too expensive.
> 
> The per-pool sharded lists allow the debug routines to visit each 
> container and do things like ask "how many elements do you have?" -- 
> "How big is each element" -- "Give me a printable string of the 
> type-signature for this container". Once you have this list you can 
> generate lots of interesting debug reports. Because you can sort by 
> individual containers as well as group containers by their type 
> signatures (i.e., combined the consumption of all "map<a,b>" containers 
> as a group). You can report out both by Byte as well as by Element count 
> consumption.

Yeah, this sounds pretty nice.  But I think it's important to be able to 
compile it out.  I think we will have a lot of creations/destructions.  
For example, in BlueStore, Onodes have maps of Extents, those map to 
Blobs, and those have a BufferSpace with a map of Buffers for cached 
data.  I expect that blobs and even onodes will be coming in and out of 
cache a lot.
 
> This kind of information usually allows you to quickly figure out where 
> the memory is being consumed. A bit of script wizardry would recognize 
> that some containers contain other containers. For example, no reason a 
> simple Python script couldn't recognize that each oNode might have a 
> bunch of vector<pextents> within it and tell you things like the average 
> number of pextents / oNodes. Average DRAM consumption per oNode (which 
> is a pretty complicated combination of pextents, lextents, buffer::ptr, 
> etc.)
> 
> Comments: ?????

It would be nice to build this on top of existing allocator libraries if 
we can.  For example, something in boost.  I took a quick peek the other 
day and didn't find something that allowed simple interrogation about 
utilization, though, which was surprising.  It would be nice to have 
something useful (perhaps without #6) that could be done relatively 
quickly and address all of the other requirements.

sage


* Re: Memory Pooling and Containers
  2016-09-28 13:27 ` Sage Weil
@ 2016-09-28 13:34   ` Haomai Wang
  2016-09-28 15:56     ` Allen Samuels
  2016-09-28 19:39     ` Allen Samuels
  2016-09-28 15:56   ` Allen Samuels
  2016-09-28 21:16   ` Jesse Williamson
  2 siblings, 2 replies; 15+ messages in thread
From: Haomai Wang @ 2016-09-28 13:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Ceph Development

On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 27 Sep 2016, Allen Samuels wrote:
>> As we discussed in the Bluestore standup this morning. This is intended
>> to start a discussion about creating some internal memory pooling
>> technology to try to get a better handle on the internal usage of memory
>> by Ceph. Let's start by discussing the requirements...
>>
>> Here is my list of requirements:
>>
>> (1) Should be able to create an arbitrary number of "pools" of memory.
>>
>> (2) Developers should be able to declare that a particular container
>> (i.e., STL or boost-like container) is wholly contained within a pool.
>>
>> (3) Beyond declarations (and possibly constructor initialization), no
>> explicit code is required to be written by developers to support (2).
>> All container manipulation primitives properly update the accounting.
>>
>> (4) Beyond construction/destruction costs, no container operation is
>> burdened by additional code -- only implicit malloc/free operations are
>> burdened with accounting.
>>
>> (5) The system tracks the aggregate amount of memory consumed in each
>> pool and it's relatively cheap to interrogate the current total
>> consumption.
>
> Yes
>
>> (6) The system tracks the aggregate amount of memory consumed by each
>> container in each pool -- but this is expensive to interrogate and is
>> intended to be used primarily for debugging purposes.
>
> This one sounds like a nice-to-have to me.  If there is a performance cost
> I would skip it.
>
>> (7) generic object new/delete is possible, but not freed of the
>> accounting requirements -- especially #6, i.e..
>>
>> (8) No backpressure is built into the scheme, i.e., nobody has to worry
>> about suddenly being "out" of memory or being delayed -- just because
>> some particular pool is filling up. That's a higher level problem to
>> solve. No memory is "reserved" either -- If you overcommit, that's also
>> not solved at this layer. IMO, this is a crappy place to be doing ingest
>> and flow control.
>>
>> (9) Implementation must be multi-thread and multi-socket aware. It
>> should expect high levels of thread concurrency and avoid unnecessary
>> global data manipulation (expect internal sharding of data structures --
>> something like an arena-based malloc scheme).
>
> Yes
>
>> Requirement 5 allows a "trimming" system to be developed. I think there
>> are really two styles for this:
>>
>> (a) Time-based, i.e., periodically some thread wakes up and checks
>> memory usage within a pool. If it doesn't like it, then it's responsible
>> for "fixing" it, i.e., trimming as needed.
>>
>> (b) event-based. No reason that we couldn't setup an event or condition
>> variable per-pool and have the malloc/free code trigger that
>> condition/variable. It adds one or two compare/branches to each malloc /
>> free operation (which is pretty cheap), but doesn't have the latency
>> costs of (a). The downside is that this implicitly assumes a single
>> global-thread is responsible for cleaning each pool which works well
>> when there are a relatively small number of pools.
>>
>> Here is my list of anti-requirements:
>>
>> (1) No hierarchical relationship between the pools. [IMO, this is kewl,
>> but unnecessary and tends to screw up your cache, i.e., destroys #9.
>>
>> (2) No physical colocation of the allocated pool memory. The pool is
>> "logical", i.e., an accounting mirage only.
>>
>> (3) No reason to dynamically create/destroy memory pools. They can be
>> statically declared (this dramatically simplifies the code that uses
>> this system).
>
> Yes.  Great summary!
>
>> Let the discussion begin!!
>> /////////////////////////
>>
>> Here is my proposed external interface to the design:
>>
>> First, look at the slab_xxxx containers that I just submitted. You can
>> find them at
>> https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h
>>
>> I would propose to extend those containers as the basis for the memory
>> pooling.
>>
>> First, there's a global enum that defines the memory pools -- yes,
>> they're static and a small number
>>
>> enum mpool_index {
>>    MPOOL_ONE,
>>    MPOOL_TWO,
>> ...
>>    MPOOL_LAST
>> };
>>
>> And a global object for each pool:
>>
>> class mpool; // TBD ... see below.
>>
>> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
>>
>> Each slab_xxx container template is augmented to expect receive an
>> additional "enum mpool_index" parameter.
>>
>> That's ALL THAT'S required for the developer. In other words, if each
>> definition of an STL container uses a typedef with the right mpool
>> index, then you're done. The machinery takes care of everything else :)
>
> FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> the global instance; the latter means you could manage the memory pool
> however you like (e.g., as part of the CephContext for librados).  That's
> a small detail, though.
>
>> Standalone objects, i.e., naked new/delete are easily done by making the
>> equivalent of a slab_intrusive_list and maybe a macro or two. There's
>> some tricky initialization for this one (see below).
>>
>> -------------------------------------------
>>
>> Implementation
>>
>> -------------------------------------------
>>
>> Requirement 6 is what drives this implementation.
>>
>> I would extend each slab_xxxx container to also virtually inherit from a
>> Pool_Member interface, this interface allows the memory pool global
>> machinery to implement #6.
>>
>> I propose that the ctor/dtor for Pool_Member (one for each container)
>> put itself on a list within the respective memory pool. This MUST be a
>> synchronized operation but we can shard the list to reduce the
>> collisions (use the low 4-5 bits of the creating thread pointer to index
>> the shard -- minimizes ctor expense but increases the dtor expense --
>> which is often done in "trim"). This assumes that the rate of container
>> creation/destruction within a memory pool is not super high -- we could
>> make this be a compile-time option if it becomes too expensive.
>>
>> The per-pool sharded lists allow the debug routines to visit each
>> container and do things like ask "how many elements do you have?" --
>> "How big is each element" -- "Give me a printable string of the
>> type-signature for this container". Once you have this list you can
>> generate lots of interesting debug reports. Because you can sort by
>> individual containers as well as group containers by their type
>> signatures (i.e., combined the consumption of all "map<a,b>" containers
>> as a group). You can report out both by Byte as well as by Element count
>> consumption.
>
> Yeah, this sounds pretty nice.  But I think it's important to be able to
> compile it out.  I think we will have a lot of creations/destructions.
> For example, in BlueStore, Onodes have maps of Extents, those map to
> Blobs, and those have a BufferSpace with a map of Buffers for cached
> data.  I expect that blobs and even onodes will be coming in and out of
> cache a lot.
>
>> This kind of information usually allows you to quickly figure out where
>> the memory is being consumed. A bit of script wizardry would recognize
>> that some containers contain other containers. For example, no reason a
>> simple Python script couldn't recognize that each oNode might have a
>> bunch of vector<pextents> within it and tell you things like the average
>> number of pextents / oNodes. Average DRAM consumption per oNode (which
>> is a pretty complicated combination of pextents, lextents, buffer::ptr,
>> etc.)
>>
>> Comments: ?????
>
> It would be nice to build this on top of existing allocator libraries if
> we can.  For example, something in boost.  I took a quick peek the other
> day and didn't find something that allowed simple interrogation about
> utilization, though, which was surprising.  It would be nice to have
> something useful (perhaps without #6) that could be done relatively
> quickly and address all of the other requirements.

If we want to do this in a lightweight way, I recommend referring to the Seastar
implementation (https://github.com/scylladb/seastar/blob/master/core/memory.hh
https://github.com/scylladb/seastar/blob/master/core/slab.hh). It can
give a lot of insight.

>
> sage


* RE: Memory Pooling and Containers
  2016-09-28 13:27 ` Sage Weil
  2016-09-28 13:34   ` Haomai Wang
@ 2016-09-28 15:56   ` Allen Samuels
  2016-09-28 21:16   ` Jesse Williamson
  2 siblings, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 15:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, September 28, 2016 3:28 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Ceph Development <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Tue, 27 Sep 2016, Allen Samuels wrote:
> > As we discussed in the Bluestore standup this morning. This is
> > intended to start a discussion about creating some internal memory
> > pooling technology to try to get a better handle on the internal usage
> > of memory by Ceph. Let's start by discussing the requirements...
> >
> > Here is my list of requirements:
> >
> > (1) Should be able to create an arbitrary number of "pools" of memory.
> >
> > (2) Developers should be able to declare that a particular container
> > (i.e., STL or boost-like container) is wholly contained within a pool.
> >
> > (3) Beyond declarations (and possibly constructor initialization), no
> > explicit code is required to be written by developers to support (2).
> > All container manipulation primitives properly update the accounting.
> >
> > (4) Beyond construction/destruction costs, no container operation is
> > burdened by additional code -- only implicit malloc/free operations
> > are burdened with accounting.
> >
> > (5) The system tracks the aggregate amount of memory consumed in each
> > pool and it's relatively cheap to interrogate the current total
> > consumption.
> 
> Yes
> 
> > (6) The system tracks the aggregate amount of memory consumed by each
> > container in each pool -- but this is expensive to interrogate and is
> > intended to be used primarily for debugging purposes.
> 
> This one sounds like a nice-to-have to me.  If there is a performance cost I
> would skip it.
> 
> > (7) generic object new/delete is possible, but not freed of the
> > accounting requirements -- especially #6, i.e..
> >
> > (8) No backpressure is built into the scheme, i.e., nobody has to
> > worry about suddenly being "out" of memory or being delayed -- just
> > because some particular pool is filling up. That's a higher level
> > problem to solve. No memory is "reserved" either -- If you overcommit,
> > that's also not solved at this layer. IMO, this is a crappy place to
> > be doing ingest and flow control.
> >
> > (9) Implementation must be multi-thread and multi-socket aware. It
> > should expect high levels of thread concurrency and avoid unnecessary
> > global data manipulation (expect internal sharding of data structures
> > -- something like an arena-based malloc scheme).
> 
> Yes
> 
> > Requirement 5 allows a "trimming" system to be developed. I think
> > there are really two styles for this:
> >
> > (a) Time-based, i.e., periodically some thread wakes up and checks
> > memory usage within a pool. If it doesn't like it, then it's
> > responsible for "fixing" it, i.e., trimming as needed.
> >
> > (b) event-based. No reason that we couldn't setup an event or
> > condition variable per-pool and have the malloc/free code trigger that
> > condition/variable. It adds one or two compare/branches to each malloc
> > / free operation (which is pretty cheap), but doesn't have the latency
> > costs of (a). The downside is that this implicitly assumes a single
> > global-thread is responsible for cleaning each pool which works well
> > when there are a relatively small number of pools.
> >
> > Here is my list of anti-requirements:
> >
> > (1) No hierarchical relationship between the pools. [IMO, this is
> > kewl, but unnecessary and tends to screw up your cache, i.e., destroys #9.
> >
> > (2) No physical colocation of the allocated pool memory. The pool is
> > "logical", i.e., an accounting mirage only.
> >
> > (3) No reason to dynamically create/destroy memory pools. They can be
> > statically declared (this dramatically simplifies the code that uses
> > this system).
> 
> Yes.  Great summary!
> 
> > Let the discussion begin!!
> > /////////////////////////
> >
> > Here is my proposed external interface to the design:
> >
> > First, look at the slab_xxxx containers that I just submitted. You can
> > find them at
> > https://github.com/allensamuels/ceph/blob/master/src/include/slab_cont
> > ainers.h
> >
> > I would propose to extend those containers as the basis for the memory
> > pooling.
> >
> > First, there's a global enum that defines the memory pools -- yes,
> > they're static and a small number
> >
> > enum mpool_index {
> >    MPOOL_ONE,
> >    MPOOL_TWO,
> > ...
> >    MPOOL_LAST
> > };
> >
> > And a global object for each pool:
> >
> > class mpool; // TBD ... see below.
> >
> > Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> >
> > Each slab_xxx container template is augmented to expect receive an
> > additional "enum mpool_index" parameter.
> >
> > That's ALL THAT'S required for the developer. In other words, if each
> > definition of an STL container uses a typedef with the right mpool
> > index, then you're done. The machinery takes care of everything else
> > :)
> 
> FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes the
> global instance; the latter means you could manage the memory pool
> however you like (e.g., as part of the CephContext for librados).  That's a
> small detail, though.
> 

I went with the enum because my plan was to pass the pool into the object declaration, not the construction (i.e., it's a template parameter).

I agree that what you describe is more flexible -- but unless you have dynamic memory pool ctor/dtor, I believe it's flexibility without value.

By going with the static declaration, the biggest win is that the existing code doesn't need a lot of ctor calls crafted to add the extra construction parameter -- a lot of tedious editing and value-less churn of the codebase.

By decorating the declaration with the pool (i.e., using the enum), you'll centralize both the pool and the slab size in one place. I believe that best practice will be to create a typedef for each of these containers and to use that typedef everywhere. By using this best practice, we'll be able to change slab sizes and move containers in and out of pools (as well as provide clear debug evidence that some particular bug is or is not due to pool-ization).
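
For example, something along these lines (a sketch reusing the assumed pool-aware slab container shape from earlier in the thread; Extent and MPOOL_BLUESTORE are stand-in names):

// The typedef is the single place where both the slab size and the pool are chosen.
typedef slab_vector<Extent, /*SlabSize=*/8, MPOOL_BLUESTORE> extent_vec_t;

// Re-tuning the slab size, or moving these containers to a different pool (or out of
// pooling entirely while chasing a bug), is then a one-line change to the typedef
// rather than an edit at every site that declares one.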
 
> > Standalone objects, i.e., naked new/delete are easily done by making
> > the equivalent of a slab_intrusive_list and maybe a macro or two.
> > There's some tricky initialization for this one (see below).
> >
> > -------------------------------------------
> >
> > Implementation
> >
> > -------------------------------------------
> >
> > Requirement 6 is what drives this implementation.
> >
> > I would extend each slab_xxxx container to also virtually inherit from
> > a Pool_Member interface, this interface allows the memory pool global
> > machinery to implement #6.
> >
> > I propose that the ctor/dtor for Pool_Member (one for each container)
> > put itself on a list within the respective memory pool. This MUST be a
> > synchronized operation but we can shard the list to reduce the
> > collisions (use the low 4-5 bits of the creating thread pointer to
> > index the shard -- minimizes ctor expense but increases the dtor
> > expense -- which is often done in "trim"). This assumes that the rate
> > of container creation/destruction within a memory pool is not super
> > high -- we could make this be a compile-time option if it becomes too
> expensive.
> >
> > The per-pool sharded lists allow the debug routines to visit each
> > container and do things like ask "how many elements do you have?" --
> > "How big is each element" -- "Give me a printable string of the
> > type-signature for this container". Once you have this list you can
> > generate lots of interesting debug reports. Because you can sort by
> > individual containers as well as group containers by their type
> > signatures (i.e., combined the consumption of all "map<a,b>"
> > containers as a group). You can report out both by Byte as well as by
> > Element count consumption.
> 
> Yeah, this sounds pretty nice.  But I think it's important to be able to compile
> it out.  I think we will have a lot of creations/destructions.
> For example, in BlueStore, Onodes have maps of Extents, those map to
> Blobs, and those have a BufferSpace with a map of Buffers for cached data.  I
> expect that blobs and even onodes will be coming in and out of cache a lot.
> 
> > This kind of information usually allows you to quickly figure out
> > where the memory is being consumed. A bit of script wizardry would
> > recognize that some containers contain other containers. For example,
> > no reason a simple Python script couldn't recognize that each oNode
> > might have a bunch of vector<pextents> within it and tell you things
> > like the average number of pextents / oNodes. Average DRAM
> consumption
> > per oNode (which is a pretty complicated combination of pextents,
> > lextents, buffer::ptr,
> > etc.)
> >
> > Comments: ?????
> 
> It would be nice to build this on top of existing allocator libraries if we can.
> For example, something in boost.  I took a quick peek the other day and
> didn't find something that allowed simple interrogation about utilization,
> though, which was surprising.  It would be nice to have something useful
> (perhaps without #6) that could be done relatively quickly and address all of
> the other requirements.
> 
> sage


* RE: Memory Pooling and Containers
  2016-09-28 13:34   ` Haomai Wang
@ 2016-09-28 15:56     ` Allen Samuels
  2016-09-28 19:39     ` Allen Samuels
  1 sibling, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 15:56 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: Ceph Development

I'll look at those references. Thanks!


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xsky.com]
> Sent: Wednesday, September 28, 2016 3:34 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 27 Sep 2016, Allen Samuels wrote:
> >> As we discussed in the Bluestore standup this morning. This is
> >> intended to start a discussion about creating some internal memory
> >> pooling technology to try to get a better handle on the internal
> >> usage of memory by Ceph. Let's start by discussing the requirements...
> >>
> >> Here is my list of requirements:
> >>
> >> (1) Should be able to create an arbitrary number of "pools" of memory.
> >>
> >> (2) Developers should be able to declare that a particular container
> >> (i.e., STL or boost-like container) is wholly contained within a pool.
> >>
> >> (3) Beyond declarations (and possibly constructor initialization), no
> >> explicit code is required to be written by developers to support (2).
> >> All container manipulation primitives properly update the accounting.
> >>
> >> (4) Beyond construction/destruction costs, no container operation is
> >> burdened by additional code -- only implicit malloc/free operations
> >> are burdened with accounting.
> >>
> >> (5) The system tracks the aggregate amount of memory consumed in
> each
> >> pool and it's relatively cheap to interrogate the current total
> >> consumption.
> >
> > Yes
> >
> >> (6) The system tracks the aggregate amount of memory consumed by
> each
> >> container in each pool -- but this is expensive to interrogate and is
> >> intended to be used primarily for debugging purposes.
> >
> > This one sounds like a nice-to-have to me.  If there is a performance
> > cost I would skip it.
> >
> >> (7) generic object new/delete is possible, but not freed of the
> >> accounting requirements -- especially #6, i.e..
> >>
> >> (8) No backpressure is built into the scheme, i.e., nobody has to
> >> worry about suddenly being "out" of memory or being delayed -- just
> >> because some particular pool is filling up. That's a higher level
> >> problem to solve. No memory is "reserved" either -- If you
> >> overcommit, that's also not solved at this layer. IMO, this is a
> >> crappy place to be doing ingest and flow control.
> >>
> >> (9) Implementation must be multi-thread and multi-socket aware. It
> >> should expect high levels of thread concurrency and avoid unnecessary
> >> global data manipulation (expect internal sharding of data structures
> >> -- something like an arena-based malloc scheme).
> >
> > Yes
> >
> >> Requirement 5 allows a "trimming" system to be developed. I think
> >> there are really two styles for this:
> >>
> >> (a) Time-based, i.e., periodically some thread wakes up and checks
> >> memory usage within a pool. If it doesn't like it, then it's
> >> responsible for "fixing" it, i.e., trimming as needed.
> >>
> >> (b) event-based. No reason that we couldn't setup an event or
> >> condition variable per-pool and have the malloc/free code trigger
> >> that condition/variable. It adds one or two compare/branches to each
> >> malloc / free operation (which is pretty cheap), but doesn't have the
> >> latency costs of (a). The downside is that this implicitly assumes a
> >> single global-thread is responsible for cleaning each pool which
> >> works well when there are a relatively small number of pools.
> >>
> >> Here is my list of anti-requirements:
> >>
> >> (1) No hierarchical relationship between the pools. [IMO, this is
> >> kewl, but unnecessary and tends to screw up your cache, i.e., destroys
> #9.
> >>
> >> (2) No physical colocation of the allocated pool memory. The pool is
> >> "logical", i.e., an accounting mirage only.
> >>
> >> (3) No reason to dynamically create/destroy memory pools. They can be
> >> statically declared (this dramatically simplifies the code that uses
> >> this system).
> >
> > Yes.  Great summary!
> >
> >> Let the discussion begin!!
> >> /////////////////////////
> >>
> >> Here is my proposed external interface to the design:
> >>
> >> First, look at the slab_xxxx containers that I just submitted. You
> >> can find them at
> >>
> https://github.com/allensamuels/ceph/blob/master/src/include/slab_con
> >> tainers.h
> >>
> >> I would propose to extend those containers as the basis for the
> >> memory pooling.
> >>
> >> First, there's a global enum that defines the memory pools -- yes,
> >> they're static and a small number
> >>
> >> enum mpool_index {
> >>    MPOOL_ONE,
> >>    MPOOL_TWO,
> >> ...
> >>    MPOOL_LAST
> >> };
> >>
> >> And a global object for each pool:
> >>
> >> class mpool; // TBD ... see below.
> >>
> >> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> >>
> >> Each slab_xxx container template is augmented to expect receive an
> >> additional "enum mpool_index" parameter.
> >>
> >> That's ALL THAT'S required for the developer. In other words, if each
> >> definition of an STL container uses a typedef with the right mpool
> >> index, then you're done. The machinery takes care of everything else
> >> :)
> >
> > FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> > g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> > the global instance; the latter means you could manage the memory pool
> > however you like (e.g., as part of the CephContext for librados).
> > That's a small detail, though.
> >
> >> Standalone objects, i.e., naked new/delete are easily done by making
> >> the equivalent of a slab_intrusive_list and maybe a macro or two.
> >> There's some tricky initialization for this one (see below).
> >>
> >> -------------------------------------------
> >>
> >> Implementation
> >>
> >> -------------------------------------------
> >>
> >> Requirement 6 is what drives this implementation.
> >>
> >> I would extend each slab_xxxx container to also virtually inherit
> >> from a Pool_Member interface, this interface allows the memory pool
> >> global machinery to implement #6.
> >>
> >> I propose that the ctor/dtor for Pool_Member (one for each container)
> >> put itself on a list within the respective memory pool. This MUST be
> >> a synchronized operation but we can shard the list to reduce the
> >> collisions (use the low 4-5 bits of the creating thread pointer to
> >> index the shard -- minimizes ctor expense but increases the dtor
> >> expense -- which is often done in "trim"). This assumes that the rate
> >> of container creation/destruction within a memory pool is not super
> >> high -- we could make this be a compile-time option if it becomes too
> expensive.
> >>
> >> The per-pool sharded lists allow the debug routines to visit each
> >> container and do things like ask "how many elements do you have?" --
> >> "How big is each element" -- "Give me a printable string of the
> >> type-signature for this container". Once you have this list you can
> >> generate lots of interesting debug reports. Because you can sort by
> >> individual containers as well as group containers by their type
> >> signatures (i.e., combined the consumption of all "map<a,b>"
> >> containers as a group). You can report out both by Byte as well as by
> >> Element count consumption.
> >
> > Yeah, this sounds pretty nice.  But I think it's important to be able
> > to compile it out.  I think we will have a lot of creations/destructions.
> > For example, in BlueStore, Onodes have maps of Extents, those map to
> > Blobs, and those have a BufferSpace with a map of Buffers for cached
> > data.  I expect that blobs and even onodes will be coming in and out
> > of cache a lot.
> >
> >> This kind of information usually allows you to quickly figure out
> >> where the memory is being consumed. A bit of script wizardry would
> >> recognize that some containers contain other containers. For example,
> >> no reason a simple Python script couldn't recognize that each oNode
> >> might have a bunch of vector<pextents> within it and tell you things
> >> like the average number of pextents / oNodes. Average DRAM
> >> consumption per oNode (which is a pretty complicated combination of
> >> pextents, lextents, buffer::ptr,
> >> etc.)
> >>
> >> Comments: ?????
> >
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.  It would be nice to
> > have something useful (perhaps without #6) that could be done
> > relatively quickly and address all of the other requirements.
> 
> If we want to do in a light way, I recommend to refer to seastar
> impl(https://github.com/scylladb/seastar/blob/master/core/memory.hh
> https://github.com/scylladb/seastar/blob/master/core/slab.hh). This can
> give a lot of insights.
> 
> >
> > sage


* RE: Memory Pooling and Containers
  2016-09-28 13:34   ` Haomai Wang
  2016-09-28 15:56     ` Allen Samuels
@ 2016-09-28 19:39     ` Allen Samuels
  1 sibling, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 19:39 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: Ceph Development

The seastar stuff is interesting; it looks like a replacement for the underlying memory allocator, with some heavy optimization and assumptions about the mapping of cores onto malloc/free operations.

I was intending to do something considerably smaller and easier, solely in the accounting area, and not to attempt to replace the existing malloc/free work that's been done (tcmalloc/jemalloc, etc.).


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xsky.com]
> Sent: Wednesday, September 28, 2016 3:34 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 27 Sep 2016, Allen Samuels wrote:
> >> As we discussed in the Bluestore standup this morning. This is
> >> intended to start a discussion about creating some internal memory
> >> pooling technology to try to get a better handle on the internal
> >> usage of memory by Ceph. Let's start by discussing the requirements...
> >>
> >> Here is my list of requirements:
> >>
> >> (1) Should be able to create an arbitrary number of "pools" of memory.
> >>
> >> (2) Developers should be able to declare that a particular container
> >> (i.e., STL or boost-like container) is wholly contained within a pool.
> >>
> >> (3) Beyond declarations (and possibly constructor initialization), no
> >> explicit code is required to be written by developers to support (2).
> >> All container manipulation primitives properly update the accounting.
> >>
> >> (4) Beyond construction/destruction costs, no container operation is
> >> burdened by additional code -- only implicit malloc/free operations
> >> are burdened with accounting.
> >>
> >> (5) The system tracks the aggregate amount of memory consumed in
> each
> >> pool and it's relatively cheap to interrogate the current total
> >> consumption.
> >
> > Yes
> >
> >> (6) The system tracks the aggregate amount of memory consumed by
> each
> >> container in each pool -- but this is expensive to interrogate and is
> >> intended to be used primarily for debugging purposes.
> >
> > This one sounds like a nice-to-have to me.  If there is a performance
> > cost I would skip it.
> >
> >> (7) generic object new/delete is possible, but not freed of the
> >> accounting requirements -- especially #6, i.e..
> >>
> >> (8) No backpressure is built into the scheme, i.e., nobody has to
> >> worry about suddenly being "out" of memory or being delayed -- just
> >> because some particular pool is filling up. That's a higher level
> >> problem to solve. No memory is "reserved" either -- If you
> >> overcommit, that's also not solved at this layer. IMO, this is a
> >> crappy place to be doing ingest and flow control.
> >>
> >> (9) Implementation must be multi-thread and multi-socket aware. It
> >> should expect high levels of thread concurrency and avoid unnecessary
> >> global data manipulation (expect internal sharding of data structures
> >> -- something like an arena-based malloc scheme).
> >
> > Yes
> >
> >> Requirement 5 allows a "trimming" system to be developed. I think
> >> there are really two styles for this:
> >>
> >> (a) Time-based, i.e., periodically some thread wakes up and checks
> >> memory usage within a pool. If it doesn't like it, then it's
> >> responsible for "fixing" it, i.e., trimming as needed.
> >>
> >> (b) event-based. No reason that we couldn't setup an event or
> >> condition variable per-pool and have the malloc/free code trigger
> >> that condition/variable. It adds one or two compare/branches to each
> >> malloc / free operation (which is pretty cheap), but doesn't have the
> >> latency costs of (a). The downside is that this implicitly assumes a
> >> single global-thread is responsible for cleaning each pool which
> >> works well when there are a relatively small number of pools.
> >>
> >> Here is my list of anti-requirements:
> >>
> >> (1) No hierarchical relationship between the pools. [IMO, this is
> >> kewl, but unnecessary and tends to screw up your cache, i.e., destroys
> #9.
> >>
> >> (2) No physical colocation of the allocated pool memory. The pool is
> >> "logical", i.e., an accounting mirage only.
> >>
> >> (3) No reason to dynamically create/destroy memory pools. They can be
> >> statically declared (this dramatically simplifies the code that uses
> >> this system).
> >
> > Yes.  Great summary!
> >
> >> Let the discussion begin!!
> >> /////////////////////////
> >>
> >> Here is my proposed external interface to the design:
> >>
> >> First, look at the slab_xxxx containers that I just submitted. You
> >> can find them at
> >>
> https://github.com/allensamuels/ceph/blob/master/src/include/slab_con
> >> tainers.h
> >>
> >> I would propose to extend those containers as the basis for the
> >> memory pooling.
> >>
> >> First, there's a global enum that defines the memory pools -- yes,
> >> they're static and a small number
> >>
> >> enum mpool_index {
> >>    MPOOL_ONE,
> >>    MPOOL_TWO,
> >> ...
> >>    MPOOL_LAST
> >> };
> >>
> >> And a global object for each pool:
> >>
> >> class mpool; // TBD ... see below.
> >>
> >> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> >>
> >> Each slab_xxx container template is augmented to expect receive an
> >> additional "enum mpool_index" parameter.
> >>
> >> That's ALL THAT'S required for the developer. In other words, if each
> >> definition of an STL container uses a typedef with the right mpool
> >> index, then you're done. The machinery takes care of everything else
> >> :)
> >
> > FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> > g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> > the global instance; the latter means you could manage the memory pool
> > however you like (e.g., as part of the CephContext for librados).
> > That's a small detail, though.
> >
> >> Standalone objects, i.e., naked new/delete are easily done by making
> >> the equivalent of a slab_intrusive_list and maybe a macro or two.
> >> There's some tricky initialization for this one (see below).
> >>
> >> -------------------------------------------
> >>
> >> Implementation
> >>
> >> -------------------------------------------
> >>
> >> Requirement 6 is what drives this implementation.
> >>
> >> I would extend each slab_xxxx container to also virtually inherit
> >> from a Pool_Member interface, this interface allows the memory pool
> >> global machinery to implement #6.
> >>
> >> I propose that the ctor/dtor for Pool_Member (one for each container)
> >> put itself on a list within the respective memory pool. This MUST be
> >> a synchronized operation but we can shard the list to reduce the
> >> collisions (use the low 4-5 bits of the creating thread pointer to
> >> index the shard -- minimizes ctor expense but increases the dtor
> >> expense -- which is often done in "trim"). This assumes that the rate
> >> of container creation/destruction within a memory pool is not super
> >> high -- we could make this be a compile-time option if it becomes too
> expensive.
> >>
> >> The per-pool sharded lists allow the debug routines to visit each
> >> container and do things like ask "how many elements do you have?" --
> >> "How big is each element" -- "Give me a printable string of the
> >> type-signature for this container". Once you have this list you can
> >> generate lots of interesting debug reports. Because you can sort by
> >> individual containers as well as group containers by their type
> >> signatures (i.e., combined the consumption of all "map<a,b>"
> >> containers as a group). You can report out both by Byte as well as by
> >> Element count consumption.
> >
> > Yeah, this sounds pretty nice.  But I think it's important to be able
> > to compile it out.  I think we will have a lot of creations/destructions.
> > For example, in BlueStore, Onodes have maps of Extents, those map to
> > Blobs, and those have a BufferSpace with a map of Buffers for cached
> > data.  I expect that blobs and even onodes will be coming in and out
> > of cache a lot.
> >
> >> This kind of information usually allows you to quickly figure out
> >> where the memory is being consumed. A bit of script wizardry would
> >> recognize that some containers contain other containers. For example,
> >> no reason a simple Python script couldn't recognize that each oNode
> >> might have a bunch of vector<pextents> within it and tell you things
> >> like the average number of pextents per oNode, or the average DRAM
> >> consumption per oNode (which is a pretty complicated combination of
> >> pextents, lextents, buffer::ptr,
> >> etc.)
> >>
> >> Comments: ?????
> >
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.  It would be nice to
> > have something useful (perhaps without #6) that could be done
> > relatively quickly and address all of the other requirements.
> 
> If we want to do this in a lightweight way, I recommend looking at the seastar
> impl (https://github.com/scylladb/seastar/blob/master/core/memory.hh
> https://github.com/scylladb/seastar/blob/master/core/slab.hh). It can
> give us a lot of insight.
> 
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-28 13:27 ` Sage Weil
  2016-09-28 13:34   ` Haomai Wang
  2016-09-28 15:56   ` Allen Samuels
@ 2016-09-28 21:16   ` Jesse Williamson
  2016-09-28 21:31     ` Allen Samuels
  2 siblings, 1 reply; 15+ messages in thread
From: Jesse Williamson @ 2016-09-28 21:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Ceph Development

On Wed, 28 Sep 2016, Sage Weil wrote:

> It would be nice to build this on top of existing allocator libraries if
> we can.  For example, something in boost.  I took a quick peek the other
> day and didn't find something that allowed simple interrogation about
> utilization, though, which was surprising.

Boost::pool looks to me like it should fulfill most of the requirements, but
I haven't used it in production:
http://www.boost.org/doc/libs/1_62_0/libs/pool/doc/html/index.html

By "iterrogation about utilization", do you mean statistics about the 
allocator activity?

-Jesse

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Memory Pooling and Containers
  2016-09-28 21:16   ` Jesse Williamson
@ 2016-09-28 21:31     ` Allen Samuels
  2016-09-30 14:22       ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 21:31 UTC (permalink / raw)
  To: Jesse Williamson, Sage Weil; +Cc: Ceph Development

Boost::pool works very well when you're allocating "same"-sized objects. That's not our situation: we're allocating lots of different sized objects -- some small, some large. The only way that Boost::pool supports that situation is to use the "ordered_free" operation to keep the freelist sorted (if you don't use it, you'll get fragmentation that'll prevent allocation of large objects -- even though there's plenty of free memory). The implementation of the sorted freelist is O(N), which should work well for small pools, but that's the exact opposite of the desired use for Ceph -- we're targeting large pools (think 1GB).

I didn't word it very well, but my proposal doesn't actually change the underlying malloc/free algorithm, rather it's intended to put some statistics around memory usage so that we can self-trim our memory pools.
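
For reference, this is the boost::pool interface I'm talking about, in a toy
usage sketch:

  #include <boost/pool/pool.hpp>

  int main() {
    boost::pool<> p(64);               // every chunk in this pool is 64 bytes

    void* a = p.malloc();              // one chunk: O(1)
    void* b = p.ordered_malloc(16);    // 16 contiguous chunks, i.e. a "large" object

    p.free(a);                         // O(1), but leaves the free list unsorted
    p.ordered_free(b, 16);             // keeps the free list sorted -- O(free-list length)
    return 0;
  }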

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Jesse Williamson [mailto:jwilliamson@suse.de]
> Sent: Wednesday, September 28, 2016 11:17 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, 28 Sep 2016, Sage Weil wrote:
> 
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.
> 
> Boost::pool looks to me like it should fulfill most of the requirements, but I
> haven't used it in production:
> http://www.boost.org/doc/libs/1_62_0/libs/pool/doc/html/index.html
> 
> By "iterrogation about utilization", do you mean statistics about the allocator
> activity?
> 
> -Jesse

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Memory Pooling and Containers
  2016-09-28 21:31     ` Allen Samuels
@ 2016-09-30 14:22       ` Sage Weil
  2016-09-30 14:30         ` Allen Samuels
  2016-09-30 18:54         ` Gregory Farnum
  0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2016-09-30 14:22 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Jesse Williamson, Ceph Development

On Wed, 28 Sep 2016, Allen Samuels wrote:
> Boost::pool works very well when you're allocating "same"-sized objects. 
> That's not our situation, we're allocating lots of different sized 
> objects -- some small, some large. The only way that Boost::pool 
> supports that situation is to use the "ordered_free" operation to keep 
> the freelist sorted (if you don't use it then you'll get fragmentation 
> that'll prevent allocation of large objects -- even though there's 
> plenty of free memory). The implementation of the sorted freelist is 
> O(N), which should work well for small pools, but that's the exact 
> opposite of the desired use for Ceph, we're targeting large pools (think 
> 1GB).
> 
> I didn't word it very well, but my proposal doesn't actually change the 
> underlying malloc/free algorithm, rather it's intended to put some 
> statistics around memory usage so that we can self-trim our memory 
> pools.

We were doing some heap profiling yesterday and one interesting thing is 
that the utilized heap reported by tcmalloc is about 1/2 the RSS.  We 
probably want to consider creating separate pools for the handful of 
objects that are consuming the bulk of the heap.
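
(For anyone who wants to reproduce this: the tcmalloc-side numbers can be
pulled at runtime from the gperftools MallocExtension properties, roughly like
so, and compared against RSS from /proc/self/status.)

  #include <gperftools/malloc_extension.h>
  #include <cstddef>
  #include <cstdio>

  void dump_tcmalloc_stats() {
    size_t allocated = 0, heap = 0, pageheap_free = 0, unmapped = 0;
    MallocExtension* m = MallocExtension::instance();
    m->GetNumericProperty("generic.current_allocated_bytes", &allocated);
    m->GetNumericProperty("generic.heap_size", &heap);
    m->GetNumericProperty("tcmalloc.pageheap_free_bytes", &pageheap_free);
    m->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
    printf("allocated %zu  heap %zu  pageheap_free %zu  unmapped %zu\n",
           allocated, heap, pageheap_free, unmapped);
  }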

We did this a few years back in the MDS and IIRC it helped significantly 
with memory utilization there.

sage

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-30 14:22       ` Sage Weil
@ 2016-09-30 14:30         ` Allen Samuels
  2016-09-30 21:47           ` Jesse Williamson
  2016-09-30 18:54         ` Gregory Farnum
  1 sibling, 1 reply; 15+ messages in thread
From: Allen Samuels @ 2016-09-30 14:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: Jesse Williamson, Ceph Development

I'm expecting the "in-object" capabilities of the slab_xxxxx containers to dramatically reduce malloc/free calls. Having some visibility into where the memory is actually going will also spur a bit more development that will likely have significant positive effects. 
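
For anyone who hasn't looked at them: the idea is that the first few elements
live inside the container object itself, so small containers never touch the
heap at all. A toy sketch of that idea (not the actual slab_xxxxx code):

  #include <cstddef>
  #include <new>
  #include <utility>

  template <typename T, std::size_t SLAB_SLOTS>
  class slab_vector {
    alignas(T) unsigned char slab_[SLAB_SLOTS * sizeof(T)];  // in-object storage
    T* data_ = reinterpret_cast<T*>(slab_);
    std::size_t size_ = 0;
    std::size_t capacity_ = SLAB_SLOTS;

   public:
    slab_vector() = default;
    slab_vector(const slab_vector&) = delete;              // keep the sketch simple
    slab_vector& operator=(const slab_vector&) = delete;
    ~slab_vector() {
      for (std::size_t i = 0; i < size_; ++i) data_[i].~T();
      if (data_ != reinterpret_cast<T*>(slab_)) ::operator delete(data_);
    }
    void push_back(const T& v) {
      if (size_ == capacity_) {                            // spill to the heap only when full
        T* heap = static_cast<T*>(::operator new(2 * capacity_ * sizeof(T)));
        for (std::size_t i = 0; i < size_; ++i) {
          new (heap + i) T(std::move(data_[i]));
          data_[i].~T();
        }
        if (data_ != reinterpret_cast<T*>(slab_)) ::operator delete(data_);
        data_ = heap;
        capacity_ *= 2;
      }
      new (data_ + size_) T(v);
      ++size_;
    }
    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return data_[i]; }
  };

  // e.g. slab_vector<pextent, 4> holds up to 4 pextents with zero heap traffic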

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Sep 30, 2016, at 4:22 PM, Sage Weil <sage@newdream.net> wrote:
> 
>> On Wed, 28 Sep 2016, Allen Samuels wrote:
>> Boost::pool works very well when you're allocating "same"-sized objects. 
>> That's not our situation, we're allocating lots of different sized 
>> objects -- some small, some large. The only way that Boost::pool 
>> supports that situation is to use the "ordered_free" operation to keep 
>> the freelist sorted (if you don't use it then you'll get fragmentation 
>> that'll prevent allocation of large objects -- even though there's 
>> plenty of free memory). The implementation of the sorted freelist is 
>> O(N), which should work well for small pools, but that's the exact 
>> opposite of the desired use for Ceph, we're targeting large pools (think 
>> 1GB).
>> 
>> I didn't word it very well, but my proposal doesn't actually change the 
>> underlying malloc/free algorithm, rather it's intended to put some 
>> statistics around memory usage so that we can self-trim our memory 
>> pools.
> 
> We were doing some heap profiling yesterday and one interesting thing is 
> that the utilized heap reported by tcmalloc is about 1/2 the RSS.  We 
> probably want to consider creating separate pools for the handful of 
> objects that are consuming the bulk of the heap.
> 
> We did this a few years back in the MDS and IIRC it helped significantly 
> with memory utilization there.
> 
> sage

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-30 14:22       ` Sage Weil
  2016-09-30 14:30         ` Allen Samuels
@ 2016-09-30 18:54         ` Gregory Farnum
  1 sibling, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2016-09-30 18:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Jesse Williamson, Ceph Development

On Fri, Sep 30, 2016 at 7:22 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 28 Sep 2016, Allen Samuels wrote:
>> Boost::pool works very well when you're allocating "same"-sized objects.
>> That's not our situation, we're allocating lots of different sized
>> objects -- some small, some large. The only way that Boost::pool
>> supports that situation is to use the "ordered_free" operation to keep
>> the freelist sorted (if you don't use it then you'll get fragmentation
>> that'll prevent allocation of large objects -- even though there's
>> plenty of free memory). The implementation of the sorted freelist is
>> O(N), which should work well for small pools, but that's the exact
>> opposite of the desired use for Ceph, we're targeting large pools (think
>> 1GB).
>>
>> I didn't word it very well, but my proposal doesn't actually change the
>> underlying malloc/free algorithm, rather it's intended to put some
>> statistics around memory usage so that we can self-trim our memory
>> pools.
>
> We were doing some heap profiling yesterday and one interesting thing is
> that the utilized heap reported by tcmalloc is about 1/2 the RSS.  We
> probably want to consider creating separate pools for the handful of
> objects that are consuming the bulk of the heap.

Off-hand I'd say this is almost certainly the result of bumping the
thread cache sizes way up; I don't think it used to be the case. What
settings were being used?
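
(For reference, the knob in question is tcmalloc's aggregate thread-cache
budget. It can be set via the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
environment variable, or at runtime, e.g.:)

  #include <gperftools/malloc_extension.h>
  #include <cstddef>

  // e.g. give tcmalloc a 128MB aggregate thread-cache budget at startup
  void set_tcmalloc_thread_cache(std::size_t bytes = 128 * 1024 * 1024) {
    MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", bytes);
  }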

> We did this a few years back in the MDS and IIRC it helped significantly
> with memory utilization there.

More than a "few" years, but yes.
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-30 14:30         ` Allen Samuels
@ 2016-09-30 21:47           ` Jesse Williamson
  0 siblings, 0 replies; 15+ messages in thread
From: Jesse Williamson @ 2016-09-30 21:47 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Ceph Development

On Fri, 30 Sep 2016, Allen Samuels wrote:

> I'm expecting the "in-object" capabilities of the slab_xxxxx containers 
> to dramatically reduce malloc/free calls. Having some visibility 
> into where the memory is actually going will also spur a bit more 
> development that will likely have more significant positive effects.

Definitely sounds useful! I see that some work has been done with massif, 
too.

Also, I took a quick peek into boost::pool. It's built exactly as you 
said -- specifically for the many-small-objects situation. I didn't see an 
"obvious" way to hook in a new freelist scheme.

Have a great weekend,

-Jesse


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-09-30 21:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-27 20:30 Memory Pooling and Containers Allen Samuels
2016-09-27 23:00 ` Gregory Farnum
2016-09-28  9:13   ` Allen Samuels
2016-09-28 13:16   ` Sage Weil
2016-09-28 13:27 ` Sage Weil
2016-09-28 13:34   ` Haomai Wang
2016-09-28 15:56     ` Allen Samuels
2016-09-28 19:39     ` Allen Samuels
2016-09-28 15:56   ` Allen Samuels
2016-09-28 21:16   ` Jesse Williamson
2016-09-28 21:31     ` Allen Samuels
2016-09-30 14:22       ` Sage Weil
2016-09-30 14:30         ` Allen Samuels
2016-09-30 21:47           ` Jesse Williamson
2016-09-30 18:54         ` Gregory Farnum
