* Memory Pooling and Containers
@ 2016-09-27 20:30 Allen Samuels
  2016-09-27 23:00 ` Gregory Farnum
  2016-09-28 13:27 ` Sage Weil
  0 siblings, 2 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-27 20:30 UTC (permalink / raw)
  To: Sage Weil, Ceph Development

As we discussed in the Bluestore standup this morning, this is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...

Here is my list of requirements:

(1) Should be able to create an arbitrary number of "pools" of memory.
(2) Developers should be able to declare that a particular container (i.e., STL or boost-like container) is wholly contained within a pool.
(3) Beyond declarations (and possibly constructor initialization), no explicit code is required to be written by developers to support (2). All container manipulation primitives properly update the accounting.
(4) Beyond construction/destruction costs, no container operation is burdened by additional code -- only implicit malloc/free operations are burdened with accounting.
(5) The system tracks the aggregate amount of memory consumed in each pool and it's relatively cheap to interrogate the current total consumption. 
(6) The system tracks the aggregate amount of memory consumed by each container in each pool -- but this is expensive to interrogate and is intended to be used primarily for debugging purposes.
(7) Generic object new/delete is possible, but is not freed from the accounting requirements -- especially #6.
(8) No backpressure is built into the scheme, i.e., nobody has to worry about suddenly being "out" of memory or being delayed -- just because some particular pool is filling up. That's a higher level problem to solve. No memory is "reserved" either -- If you overcommit, that's also not solved at this layer. IMO, this is a crappy place to be doing ingest and flow control.
(9) Implementation must be multi-thread and multi-socket aware. It should expect high levels of thread concurrency and avoid unnecessary global data manipulation (expect internal sharding of data structures -- something like an arena-based malloc scheme).
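
Requirements 5 and 9 together pretty much dictate the shape of the core accounting: sharded, cache-line-padded per-pool counters that are only summed when somebody asks. A minimal sketch of that idea follows -- names like mpool_stats and NUM_SHARDS are hypothetical placeholders, not existing code:

// Minimal sketch (not Ceph code): per-pool byte accounting kept in sharded,
// cache-line-aligned atomics so concurrent threads rarely touch the same line.
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

struct alignas(64) mpool_shard {
  std::atomic<int64_t> bytes{0};
};

class mpool_stats {
  static const size_t NUM_SHARDS = 32;          // hypothetical shard count
  mpool_shard shards[NUM_SHARDS];

  static size_t my_shard() {
    // any stable, cheap hash of the calling thread works here
    return std::hash<std::thread::id>()(std::this_thread::get_id()) % NUM_SHARDS;
  }

public:
  void add(size_t n) { shards[my_shard()].bytes.fetch_add(n, std::memory_order_relaxed); }
  void sub(size_t n) { shards[my_shard()].bytes.fetch_sub(n, std::memory_order_relaxed); }

  int64_t total() const {                       // requirement 5: cheap to interrogate
    int64_t t = 0;
    for (const auto& s : shards)
      t += s.bytes.load(std::memory_order_relaxed);
    return t;
  }
};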

Requirement 5 allows a "trimming" system to be developed. I think there are really two styles for this:

(a) Time-based, i.e., periodically some thread wakes up and checks memory usage within a pool. If it doesn't like it, then it's responsible for "fixing" it, i.e., trimming as needed.
(b) Event-based. No reason we couldn't set up an event or condition variable per pool and have the malloc/free code signal it. It adds one or two compare/branches to each malloc/free operation (which is pretty cheap), but doesn't have the latency costs of (a). The downside is that this implicitly assumes a single global thread is responsible for cleaning each pool, which works well when there are a relatively small number of pools.
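
For concreteness, the event-based variant (b) could reduce to something like the sketch below. These are hypothetical names (mpool_trim_hook, trim_target), and no claim is made about how the actual trim callback gets wired up:

// Minimal sketch of option (b): the allocation path pays only an atomic add
// plus one compare, and a single per-pool trimmer thread does the real work.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class mpool_trim_hook {
  std::atomic<int64_t> bytes{0};
  const int64_t trim_target;               // hypothetical per-pool threshold
  std::mutex mtx;
  std::condition_variable cv;

public:
  explicit mpool_trim_hook(int64_t target) : trim_target(target) {}

  void charge(int64_t n) {                 // called on every pool malloc
    int64_t now = bytes.fetch_add(n, std::memory_order_relaxed) + n;
    if (now > trim_target)                 // the "one or two compare/branches"
      cv.notify_one();
  }
  void uncharge(int64_t n) { bytes.fetch_sub(n, std::memory_order_relaxed); }

  void trimmer_loop() {                    // the single global thread per pool
    std::unique_lock<std::mutex> l(mtx);
    for (;;) {
      cv.wait(l, [this] { return bytes.load(std::memory_order_relaxed) > trim_target; });
      // ... ask the pool's owner (e.g. a cache) to evict until below trim_target ...
    }
  }
};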

Here is my list of anti-requirements:

(1) No hierarchical relationship between the pools. [IMO, this is kewl, but unnecessary and tends to screw up your cache, i.e., destroys #9.]
(2) No physical colocation of the allocated pool memory. The pool is "logical", i.e., an accounting mirage only.
(3) No reason to dynamically create/destroy memory pools. They can be statically declared (this dramatically simplifies the code that uses this system).

Let the discussion begin!!
/////////////////////////

Here is my proposed external interface to the design:

First, look at the slab_xxxx containers that I just submitted. You can find them at https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h

I would propose to extend those containers as the basis for the memory pooling.

First, there's a global enum that defines the memory pools -- yes, they're static and small in number:

enum mpool_index {
   MPOOL_ONE,
   MPOOL_TWO,
...
   MPOOL_LAST
};

And a global object for each pool:

class mpool; // TBD ... see below.

extern mpool g_mpool[MPOOL_LAST]; // one global object per pool (the actual definition of each pool lives in a single .cc file)

Each slab_xxx container template is augmented to receive an additional "enum mpool_index" template parameter.

That's ALL THAT'S required for the developer. In other words, if each definition of an STL container uses a typedef with the right mpool index, then you're done. The machinery takes care of everything else :)
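
To illustrate that convention, a pool-scoped declaration might end up looking like the sketch below. This is purely illustrative: the pool names, the slab size of 4, and the exact slab_vector signature (the trailing mpool_index parameter is the proposed extension) are all assumptions, not what slab_containers.h contains today.

// Illustration only -- assumed shape:
//   template<typename T, size_t SlabSize, mpool_index Pool> class slab_vector;
enum mpool_index { MPOOL_BLUESTORE_ONODE, MPOOL_BLUESTORE_CACHE, MPOOL_LAST };

typedef slab_vector<bluestore_pextent_t, 4, MPOOL_BLUESTORE_ONODE> pextent_vector_t;

struct Onode {
  pextent_vector_t pextents;  // everything this vector allocates is charged to MPOOL_BLUESTORE_ONODE
};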

Standalone objects, i.e., naked new/delete, are easily done by making the equivalent of a slab_intrusive_list and maybe a macro or two. There's some tricky initialization for this one (see below).
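
One possible shape for that -- a sketch only, where the helper class and the charge()/uncharge() calls on the still-TBD mpool class are assumptions -- is a small base class whose class-scoped operator new/delete bills the chosen pool:

// Sketch: pool accounting for naked new/delete of standalone objects.
#include <cstddef>
#include <new>

template <mpool_index Pool>
struct pool_allocated {
  static void* operator new(std::size_t sz) {
    void* p = ::operator new(sz);
    g_mpool[Pool].charge(sz);        // assumed mpool interface
    return p;
  }
  static void operator delete(void* p, std::size_t sz) {
    g_mpool[Pool].uncharge(sz);      // assumed mpool interface
    ::operator delete(p);
  }
};

// Usage: a standalone type that should be accounted simply inherits from the helper.
struct SomeStandaloneType : pool_allocated<MPOOL_ONE> {
  // ...
};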

-------------------------------------------

Implementation

-------------------------------------------

Requirement 6 is what drives this implementation.

I would extend each slab_xxxx container to also virtually inherit from a Pool_Member interface; this interface allows the memory pool global machinery to implement #6.

I propose that the ctor/dtor for Pool_Member (one for each container) put the container on (and remove it from) a list within the respective memory pool. This MUST be a synchronized operation, but we can shard the list to reduce collisions (use the low 4-5 bits of the creating thread pointer to index the shard -- this minimizes ctor expense but increases the dtor expense, which is often paid in "trim"). This assumes that the rate of container creation/destruction within a memory pool is not super high -- we could make this a compile-time option if it becomes too expensive.
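
Roughly, that registration might reduce to something like the following sketch (every name here is a placeholder -- Pool_Member's hooks, the shard count, and g_debug_shards are all assumed, and a hash of the thread id stands in for the "low bits of the thread pointer"):

// Sketch: each pooled container registers on one of N sharded per-pool lists.
#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <thread>

class Pool_Member;

struct mpool_debug_shard {
  std::mutex lock;
  std::list<Pool_Member*> members;           // a real version would use an intrusive list
};

static const int SHARD_BITS = 5;             // "low 4-5 bits" -> 32 shards
extern mpool_debug_shard g_debug_shards[MPOOL_LAST][1 << SHARD_BITS];  // assumed layout

inline mpool_debug_shard& shard_for_this_thread(mpool_index pool) {
  std::size_t h = std::hash<std::thread::id>()(std::this_thread::get_id());
  return g_debug_shards[pool][h & ((1 << SHARD_BITS) - 1)];
}

class Pool_Member {
  mpool_debug_shard* home_shard;             // remembered so the dtor can find it again
  std::list<Pool_Member*>::iterator self;

protected:
  explicit Pool_Member(mpool_index pool) {
    home_shard = &shard_for_this_thread(pool);
    std::lock_guard<std::mutex> l(home_shard->lock);
    self = home_shard->members.insert(home_shard->members.end(), this);
  }
  virtual ~Pool_Member() {                   // may run on a different thread (e.g. trim)
    std::lock_guard<std::mutex> l(home_shard->lock);
    home_shard->members.erase(self);
  }

public:
  // the hooks that requirement #6 is built on:
  virtual std::size_t member_count() const = 0;    // "how many elements do you have?"
  virtual std::size_t element_bytes() const = 0;   // "how big is each element?"
  virtual std::string type_signature() const = 0;  // printable type name for grouping
};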

The per-pool sharded lists allow the debug routines to visit each container and ask things like "How many elements do you have?", "How big is each element?", and "Give me a printable string of the type signature for this container". Once you have this list you can generate lots of interesting debug reports, because you can sort by individual containers as well as group containers by their type signatures (i.e., combine the consumption of all "map<a,b>" containers as a group). You can report consumption both by bytes and by element count.
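
Continuing with the hypothetical names from the previous sketch, the "expensive but debug-only" report could be a small traversal along these lines:

// Sketch: group per-container consumption by type signature for a debug report.
#include <map>
#include <mutex>
#include <string>
#include <utility>

std::map<std::string, std::pair<std::size_t, std::size_t>>   // signature -> (elements, bytes)
pool_report(mpool_index pool) {
  std::map<std::string, std::pair<std::size_t, std::size_t>> by_type;
  for (auto& shard : g_debug_shards[pool]) {
    std::lock_guard<std::mutex> l(shard.lock);
    for (Pool_Member* m : shard.members) {
      auto& entry = by_type[m->type_signature()];
      entry.first  += m->member_count();                      // element count
      entry.second += m->member_count() * m->element_bytes(); // approximate bytes
    }
  }
  return by_type;
}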

This kind of information usually allows you to quickly figure out where the memory is being consumed. A bit of script wizardry would recognize that some containers contain other containers. For example, there's no reason a simple Python script couldn't recognize that each oNode might have a bunch of vector<pextents> within it and tell you things like the average number of pextents per oNode, or the average DRAM consumption per oNode (which is a pretty complicated combination of pextents, lextents, buffer::ptr, etc.).

Comments: ?????


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


* Re: Memory Pooling and Containers
  2016-09-27 20:30 Memory Pooling and Containers Allen Samuels
@ 2016-09-27 23:00 ` Gregory Farnum
  2016-09-28  9:13   ` Allen Samuels
  2016-09-28 13:16   ` Sage Weil
  2016-09-28 13:27 ` Sage Weil
  1 sibling, 2 replies; 15+ messages in thread
From: Gregory Farnum @ 2016-09-27 23:00 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Ceph Development

On Tue, Sep 27, 2016 at 1:30 PM, Allen Samuels
<Allen.Samuels@sandisk.com> wrote:
> As we discussed in the Bluestore standup this morning. This is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...

I'm afraid I don't have any useful thoughts on this, but I'm curious
if there's any chance of using these memory pools effectively outside
of BlueStore. Many of the features you discuss would be nice in the
MDS, which currently uses boost::pool for some of its metadata types,
but could really use something with a little more control (at least more
than we've set up so far).
-Greg


* RE: Memory Pooling and Containers
  2016-09-27 23:00 ` Gregory Farnum
@ 2016-09-28  9:13   ` Allen Samuels
  2016-09-28 13:16   ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28  9:13 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Ceph Development

Should work in any C++ context -- nothing that's BlueStore dependent.


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, September 28, 2016 1:01 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Sage Weil <sage@newdream.net>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Tue, Sep 27, 2016 at 1:30 PM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> > As we discussed in the Bluestore standup this morning. This is intended to
> start a discussion about creating some internal memory pooling technology
> to try to get a better handle on the internal usage of memory by Ceph. Let's
> start by discussing the requirements...
> 
> I'm afraid I don't have any useful thoughts on this, but I'm curious if there's
> any chance of using these memory pools effectively outside of BlueStore.
> Many of the features you discuss would be nice in the MDS, which currently
> uses boost::pool for some of its metadata types, but could really use
> something with a little more control (at least than we've set up so far).
> -Greg


* Re: Memory Pooling and Containers
  2016-09-27 23:00 ` Gregory Farnum
  2016-09-28  9:13   ` Allen Samuels
@ 2016-09-28 13:16   ` Sage Weil
  1 sibling, 0 replies; 15+ messages in thread
From: Sage Weil @ 2016-09-28 13:16 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Allen Samuels, Ceph Development

On Tue, 27 Sep 2016, Gregory Farnum wrote:
> On Tue, Sep 27, 2016 at 1:30 PM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> > As we discussed in the Bluestore standup this morning. This is intended to start a discussion about creating some internal memory pooling technology to try to get a better handle on the internal usage of memory by Ceph. Let's start by discussing the requirements...
> 
> I'm afraid I don't have any useful thoughts on this, but I'm curious
> if there's any chance of using these memory pools effectively outside
> of BlueStore. Many of the features you discuss would be nice in the
> MDS, which currently uses boost::pool for some of its metadata types,
> but could really use something with a little more control (at least
> than we've set up so far).

Yeah, that is definitely a goal.  This would be useful for all services!

sage


* Re: Memory Pooling and Containers
  2016-09-27 20:30 Memory Pooling and Containers Allen Samuels
  2016-09-27 23:00 ` Gregory Farnum
@ 2016-09-28 13:27 ` Sage Weil
  2016-09-28 13:34   ` Haomai Wang
                     ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Sage Weil @ 2016-09-28 13:27 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Ceph Development

On Tue, 27 Sep 2016, Allen Samuels wrote:
> As we discussed in the Bluestore standup this morning. This is intended 
> to start a discussion about creating some internal memory pooling 
> technology to try to get a better handle on the internal usage of memory 
> by Ceph. Let's start by discussing the requirements...
> 
> Here is my list of requirements:
> 
> (1) Should be able to create an arbitrary number of "pools" of memory.
>
> (2) Developers should be able to declare that a particular container 
> (i.e., STL or boost-like container) is wholly contained within a pool.
>
> (3) Beyond declarations (and possibly constructor initialization), no 
> explicit code is required to be written by developers to support (2). 
> All container manipulation primitives properly update the accounting.
>
> (4) Beyond construction/destruction costs, no container operation is 
> burdened by additional code -- only implicit malloc/free operations are 
> burdened with accounting.
>
> (5) The system tracks the aggregate amount of memory consumed in each 
> pool and it's relatively cheap to interrogate the current total 
> consumption.

Yes

> (6) The system tracks the aggregate amount of memory consumed by each 
> container in each pool -- but this is expensive to interrogate and is 
> intended to be used primarily for debugging purposes.

This one sounds like a nice-to-have to me.  If there is a performance cost 
I would skip it.

> (7) generic object new/delete is possible, but not freed of the 
> accounting requirements -- especially #6, i.e..
>
> (8) No backpressure is built into the scheme, i.e., nobody has to worry 
> about suddenly being "out" of memory or being delayed -- just because 
> some particular pool is filling up. That's a higher level problem to 
> solve. No memory is "reserved" either -- If you overcommit, that's also 
> not solved at this layer. IMO, this is a crappy place to be doing ingest 
> and flow control.
>
> (9) Implementation must be multi-thread and multi-socket aware. It 
> should expect high levels of thread concurrency and avoid unnecessary 
> global data manipulation (expect internal sharding of data structures -- 
> something like an arena-based malloc scheme).

Yes
 
> Requirement 5 allows a "trimming" system to be developed. I think there 
> are really two styles for this:
> 
> (a) Time-based, i.e., periodically some thread wakes up and checks 
> memory usage within a pool. If it doesn't like it, then it's responsible 
> for "fixing" it, i.e., trimming as needed.
>
> (b) event-based. No reason that we couldn't setup an event or condition 
> variable per-pool and have the malloc/free code trigger that 
> condition/variable. It adds one or two compare/branches to each malloc / 
> free operation (which is pretty cheap), but doesn't have the latency 
> costs of (a). The downside is that this implicitly assumes a single 
> global-thread is responsible for cleaning each pool which works well 
> when there are a relatively small number of pools.
> 
> Here is my list of anti-requirements:
> 
> (1) No hierarchical relationship between the pools. [IMO, this is kewl, 
> but unnecessary and tends to screw up your cache, i.e., destroys #9.
>
> (2) No physical colocation of the allocated pool memory. The pool is 
> "logical", i.e., an accounting mirage only.
>
> (3) No reason to dynamically create/destroy memory pools. They can be 
> statically declared (this dramatically simplifies the code that uses 
> this system).

Yes.  Great summary!

> Let the discussion begin!!
> /////////////////////////
> 
> Here is my proposed external interface to the design:
> 
> First, look at the slab_xxxx containers that I just submitted. You can 
> find them at 
> https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h
> 
> I would propose to extend those containers as the basis for the memory 
> pooling.
> 
> First, there's a global enum that defines the memory pools -- yes, 
> they're static and a small number
> 
> enum mpool_index {
>    MPOOL_ONE,
>    MPOOL_TWO,
> ...
>    MPOOL_LAST
> };
> 
> And a global object for each pool:
> 
> class mpool; // TBD ... see below.
> 
> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> 
> Each slab_xxx container template is augmented to expect receive an 
> additional "enum mpool_index" parameter.
> 
> That's ALL THAT'S required for the developer. In other words, if each 
> definition of an STL container uses a typedef with the right mpool 
> index, then you're done. The machinery takes care of everything else :)

FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of 
g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes 
the global instance; the latter means you could manage the memory pool 
however you like (e.g., as part of the CephContext for librados).  That's 
a small detail, though.
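
Something like this, purely for contrast (both allocator shapes below are
invented; neither exists in the tree):

// (a) pool picked by the enum (here a template parameter) -- always goes
//     through the global g_mpool table:
template <typename T, mpool_index Pool>
struct mpool_allocator_static {
  T* allocate(std::size_t n);        // would charge g_mpool[Pool] internally
};

// (b) pool handed in at construction -- the pool object can live anywhere,
//     e.g. inside a CephContext:
template <typename T>
struct mpool_allocator_dynamic {
  mpool* pool;
  explicit mpool_allocator_dynamic(mpool* p) : pool(p) {}
  T* allocate(std::size_t n);        // would charge *pool
};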

> Standalone objects, i.e., naked new/delete are easily done by making the 
> equivalent of a slab_intrusive_list and maybe a macro or two. There's 
> some tricky initialization for this one (see below).
> 
> -------------------------------------------
> 
> Implementation
> 
> -------------------------------------------
> 
> Requirement 6 is what drives this implementation.
> 
> I would extend each slab_xxxx container to also virtually inherit from a 
> Pool_Member interface, this interface allows the memory pool global 
> machinery to implement #6.
> 
> I propose that the ctor/dtor for Pool_Member (one for each container) 
> put itself on a list within the respective memory pool. This MUST be a 
> synchronized operation but we can shard the list to reduce the 
> collisions (use the low 4-5 bits of the creating thread pointer to index 
> the shard -- minimizes ctor expense but increases the dtor expense -- 
> which is often done in "trim"). This assumes that the rate of container 
> creation/destruction within a memory pool is not super high -- we could 
> make this be a compile-time option if it becomes too expensive.
> 
> The per-pool sharded lists allow the debug routines to visit each 
> container and do things like ask "how many elements do you have?" -- 
> "How big is each element" -- "Give me a printable string of the 
> type-signature for this container". Once you have this list you can 
> generate lots of interesting debug reports. Because you can sort by 
> individual containers as well as group containers by their type 
> signatures (i.e., combined the consumption of all "map<a,b>" containers 
> as a group). You can report out both by Byte as well as by Element count 
> consumption.

Yeah, this sounds pretty nice.  But I think it's important to be able to 
compile it out.  I think we will have a lot of creations/destructions.  
For example, in BlueStore, Onodes have maps of Extents, those map to 
Blobs, and those have a BufferSpace with a map of Buffers for cached 
data.  I expect that blobs and even onodes will be coming in and out of 
cache a lot.
 
> This kind of information usually allows you to quickly figure out where 
> the memory is being consumed. A bit of script wizardry would recognize 
> that some containers contain other containers. For example, no reason a 
> simple Python script couldn't recognize that each oNode might have a 
> bunch of vector<pextents> within it and tell you things like the average 
> number of pextents / oNodes. Average DRAM consumption per oNode (which 
> is a pretty complicated combination of pextents, lextents, buffer::ptr, 
> etc.)
> 
> Comments: ?????

It would be nice to build this on top of existing allocator libraries if 
we can.  For example, something in boost.  I took a quick peek the other 
day and didn't find something that allowed simple interrogation about 
utilization, though, which was surprising.  It would be nice to have 
something useful (perhaps without #6) that could be done relatively 
quickly and address all of the other requirements.

sage


* Re: Memory Pooling and Containers
  2016-09-28 13:27 ` Sage Weil
@ 2016-09-28 13:34   ` Haomai Wang
  2016-09-28 15:56     ` Allen Samuels
  2016-09-28 19:39     ` Allen Samuels
  2016-09-28 15:56   ` Allen Samuels
  2016-09-28 21:16   ` Jesse Williamson
  2 siblings, 2 replies; 15+ messages in thread
From: Haomai Wang @ 2016-09-28 13:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Ceph Development

On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 27 Sep 2016, Allen Samuels wrote:
>> As we discussed in the Bluestore standup this morning. This is intended
>> to start a discussion about creating some internal memory pooling
>> technology to try to get a better handle on the internal usage of memory
>> by Ceph. Let's start by discussing the requirements...
>>
>> Here is my list of requirements:
>>
>> (1) Should be able to create an arbitrary number of "pools" of memory.
>>
>> (2) Developers should be able to declare that a particular container
>> (i.e., STL or boost-like container) is wholly contained within a pool.
>>
>> (3) Beyond declarations (and possibly constructor initialization), no
>> explicit code is required to be written by developers to support (2).
>> All container manipulation primitives properly update the accounting.
>>
>> (4) Beyond construction/destruction costs, no container operation is
>> burdened by additional code -- only implicit malloc/free operations are
>> burdened with accounting.
>>
>> (5) The system tracks the aggregate amount of memory consumed in each
>> pool and it's relatively cheap to interrogate the current total
>> consumption.
>
> Yes
>
>> (6) The system tracks the aggregate amount of memory consumed by each
>> container in each pool -- but this is expensive to interrogate and is
>> intended to be used primarily for debugging purposes.
>
> This one sounds like a nice-to-have to me.  If there is a performance cost
> I would skip it.
>
>> (7) generic object new/delete is possible, but not freed of the
>> accounting requirements -- especially #6, i.e..
>>
>> (8) No backpressure is built into the scheme, i.e., nobody has to worry
>> about suddenly being "out" of memory or being delayed -- just because
>> some particular pool is filling up. That's a higher level problem to
>> solve. No memory is "reserved" either -- If you overcommit, that's also
>> not solved at this layer. IMO, this is a crappy place to be doing ingest
>> and flow control.
>>
>> (9) Implementation must be multi-thread and multi-socket aware. It
>> should expect high levels of thread concurrency and avoid unnecessary
>> global data manipulation (expect internal sharding of data structures --
>> something like an arena-based malloc scheme).
>
> Yes
>
>> Requirement 5 allows a "trimming" system to be developed. I think there
>> are really two styles for this:
>>
>> (a) Time-based, i.e., periodically some thread wakes up and checks
>> memory usage within a pool. If it doesn't like it, then it's responsible
>> for "fixing" it, i.e., trimming as needed.
>>
>> (b) event-based. No reason that we couldn't setup an event or condition
>> variable per-pool and have the malloc/free code trigger that
>> condition/variable. It adds one or two compare/branches to each malloc /
>> free operation (which is pretty cheap), but doesn't have the latency
>> costs of (a). The downside is that this implicitly assumes a single
>> global-thread is responsible for cleaning each pool which works well
>> when there are a relatively small number of pools.
>>
>> Here is my list of anti-requirements:
>>
>> (1) No hierarchical relationship between the pools. [IMO, this is kewl,
>> but unnecessary and tends to screw up your cache, i.e., destroys #9.
>>
>> (2) No physical colocation of the allocated pool memory. The pool is
>> "logical", i.e., an accounting mirage only.
>>
>> (3) No reason to dynamically create/destroy memory pools. They can be
>> statically declared (this dramatically simplifies the code that uses
>> this system).
>
> Yes.  Great summary!
>
>> Let the discussion begin!!
>> /////////////////////////
>>
>> Here is my proposed external interface to the design:
>>
>> First, look at the slab_xxxx containers that I just submitted. You can
>> find them at
>> https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h
>>
>> I would propose to extend those containers as the basis for the memory
>> pooling.
>>
>> First, there's a global enum that defines the memory pools -- yes,
>> they're static and a small number
>>
>> enum mpool_index {
>>    MPOOL_ONE,
>>    MPOOL_TWO,
>> ...
>>    MPOOL_LAST
>> };
>>
>> And a global object for each pool:
>>
>> class mpool; // TBD ... see below.
>>
>> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
>>
>> Each slab_xxx container template is augmented to expect receive an
>> additional "enum mpool_index" parameter.
>>
>> That's ALL THAT'S required for the developer. In other words, if each
>> definition of an STL container uses a typedef with the right mpool
>> index, then you're done. The machinery takes care of everything else :)
>
> FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> the global instance; the latter means you could manage the memory pool
> however you like (e.g., as part of the CephContext for librados).  That's
> a small detail, though.
>
>> Standalone objects, i.e., naked new/delete are easily done by making the
>> equivalent of a slab_intrusive_list and maybe a macro or two. There's
>> some tricky initialization for this one (see below).
>>
>> -------------------------------------------
>>
>> Implementation
>>
>> -------------------------------------------
>>
>> Requirement 6 is what drives this implementation.
>>
>> I would extend each slab_xxxx container to also virtually inherit from a
>> Pool_Member interface, this interface allows the memory pool global
>> machinery to implement #6.
>>
>> I propose that the ctor/dtor for Pool_Member (one for each container)
>> put itself on a list within the respective memory pool. This MUST be a
>> synchronized operation but we can shard the list to reduce the
>> collisions (use the low 4-5 bits of the creating thread pointer to index
>> the shard -- minimizes ctor expense but increases the dtor expense --
>> which is often done in "trim"). This assumes that the rate of container
>> creation/destruction within a memory pool is not super high -- we could
>> make this be a compile-time option if it becomes too expensive.
>>
>> The per-pool sharded lists allow the debug routines to visit each
>> container and do things like ask "how many elements do you have?" --
>> "How big is each element" -- "Give me a printable string of the
>> type-signature for this container". Once you have this list you can
>> generate lots of interesting debug reports. Because you can sort by
>> individual containers as well as group containers by their type
>> signatures (i.e., combined the consumption of all "map<a,b>" containers
>> as a group). You can report out both by Byte as well as by Element count
>> consumption.
>
> Yeah, this sounds pretty nice.  But I think it's important to be able to
> compile it out.  I think we will have a lot of creations/destructions.
> For example, in BlueStore, Onodes have maps of Extents, those map to
> Blobs, and those have a BufferSpace with a map of Buffers for cached
> data.  I expect that blobs and even onodes will be coming in and out of
> cache a lot.
>
>> This kind of information usually allows you to quickly figure out where
>> the memory is being consumed. A bit of script wizardry would recognize
>> that some containers contain other containers. For example, no reason a
>> simple Python script couldn't recognize that each oNode might have a
>> bunch of vector<pextents> within it and tell you things like the average
>> number of pextents / oNodes. Average DRAM consumption per oNode (which
>> is a pretty complicated combination of pextents, lextents, buffer::ptr,
>> etc.)
>>
>> Comments: ?????
>
> It would be nice to build this on top of existing allocator libraries if
> we can.  For example, something in boost.  I took a quick peek the other
> day and didn't find something that allowed simple interrogation about
> utilization, though, which was surprising.  It would be nice to have
> something useful (perhaps without #6) that could be done relatively
> quickly and address all of the other requirements.

If we want to do this in a lightweight way, I recommend referring to the Seastar
implementation (https://github.com/scylladb/seastar/blob/master/core/memory.hh
https://github.com/scylladb/seastar/blob/master/core/slab.hh). It can
give a lot of insight.

>
> sage


* RE: Memory Pooling and Containers
  2016-09-28 13:27 ` Sage Weil
  2016-09-28 13:34   ` Haomai Wang
@ 2016-09-28 15:56   ` Allen Samuels
  2016-09-28 21:16   ` Jesse Williamson
  2 siblings, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 15:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, September 28, 2016 3:28 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Ceph Development <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Tue, 27 Sep 2016, Allen Samuels wrote:
> > As we discussed in the Bluestore standup this morning. This is
> > intended to start a discussion about creating some internal memory
> > pooling technology to try to get a better handle on the internal usage
> > of memory by Ceph. Let's start by discussing the requirements...
> >
> > Here is my list of requirements:
> >
> > (1) Should be able to create an arbitrary number of "pools" of memory.
> >
> > (2) Developers should be able to declare that a particular container
> > (i.e., STL or boost-like container) is wholly contained within a pool.
> >
> > (3) Beyond declarations (and possibly constructor initialization), no
> > explicit code is required to be written by developers to support (2).
> > All container manipulation primitives properly update the accounting.
> >
> > (4) Beyond construction/destruction costs, no container operation is
> > burdened by additional code -- only implicit malloc/free operations
> > are burdened with accounting.
> >
> > (5) The system tracks the aggregate amount of memory consumed in each
> > pool and it's relatively cheap to interrogate the current total
> > consumption.
> 
> Yes
> 
> > (6) The system tracks the aggregate amount of memory consumed by each
> > container in each pool -- but this is expensive to interrogate and is
> > intended to be used primarily for debugging purposes.
> 
> This one sounds like a nice-to-have to me.  If there is a performance cost I
> would skip it.
> 
> > (7) generic object new/delete is possible, but not freed of the
> > accounting requirements -- especially #6, i.e..
> >
> > (8) No backpressure is built into the scheme, i.e., nobody has to
> > worry about suddenly being "out" of memory or being delayed -- just
> > because some particular pool is filling up. That's a higher level
> > problem to solve. No memory is "reserved" either -- If you overcommit,
> > that's also not solved at this layer. IMO, this is a crappy place to
> > be doing ingest and flow control.
> >
> > (9) Implementation must be multi-thread and multi-socket aware. It
> > should expect high levels of thread concurrency and avoid unnecessary
> > global data manipulation (expect internal sharding of data structures
> > -- something like an arena-based malloc scheme).
> 
> Yes
> 
> > Requirement 5 allows a "trimming" system to be developed. I think
> > there are really two styles for this:
> >
> > (a) Time-based, i.e., periodically some thread wakes up and checks
> > memory usage within a pool. If it doesn't like it, then it's
> > responsible for "fixing" it, i.e., trimming as needed.
> >
> > (b) event-based. No reason that we couldn't setup an event or
> > condition variable per-pool and have the malloc/free code trigger that
> > condition/variable. It adds one or two compare/branches to each malloc
> > / free operation (which is pretty cheap), but doesn't have the latency
> > costs of (a). The downside is that this implicitly assumes a single
> > global-thread is responsible for cleaning each pool which works well
> > when there are a relatively small number of pools.
> >
> > Here is my list of anti-requirements:
> >
> > (1) No hierarchical relationship between the pools. [IMO, this is
> > kewl, but unnecessary and tends to screw up your cache, i.e., destroys #9.
> >
> > (2) No physical colocation of the allocated pool memory. The pool is
> > "logical", i.e., an accounting mirage only.
> >
> > (3) No reason to dynamically create/destroy memory pools. They can be
> > statically declared (this dramatically simplifies the code that uses
> > this system).
> 
> Yes.  Great summary!
> 
> > Let the discussion begin!!
> > /////////////////////////
> >
> > Here is my proposed external interface to the design:
> >
> > First, look at the slab_xxxx containers that I just submitted. You can
> > find them at
> > https://github.com/allensamuels/ceph/blob/master/src/include/slab_cont
> > ainers.h
> >
> > I would propose to extend those containers as the basis for the memory
> > pooling.
> >
> > First, there's a global enum that defines the memory pools -- yes,
> > they're static and a small number
> >
> > enum mpool_index {
> >    MPOOL_ONE,
> >    MPOOL_TWO,
> > ...
> >    MPOOL_LAST
> > };
> >
> > And a global object for each pool:
> >
> > class mpool; // TBD ... see below.
> >
> > Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> >
> > Each slab_xxx container template is augmented to expect receive an
> > additional "enum mpool_index" parameter.
> >
> > That's ALL THAT'S required for the developer. In other words, if each
> > definition of an STL container uses a typedef with the right mpool
> > index, then you're done. The machinery takes care of everything else
> > :)
> 
> FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes the
> global instance; the latter means you could manage the memory pool
> however you like (e.g., as part of the CephContext for librados).  That's a
> small detail, though.
> 

I went with the enum because my plan was to pass the pool into the object declaration, not the construction (i.e., it's a template parameter).

I agree that what you describe is more flexible -- but unless you have dynamic memory pool ctor/dtor, I believe it's flexibility without value.

By going with the static declaration, the biggest win is that the existing code doesn't need a lot of ctor calls crafted to add the extra construction parameter -- a lot of tedious editing and value-less churn of the codebase.

By decorating the declaration with the pool (i.e., using the enum), you'll centralize both the pool and the slab size in one place. I believe that best practice will be to create a typedef for each of these containers and to use that typedef everywhere. By using this best practice, we'll be able to change slab sizes and move containers in and out of pools (as well as provide clear debug evidence that some particular bug is or is not due to pool-ization).
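
For example, something along these lines (a sketch reusing the assumed pool-aware slab container shape from earlier in the thread; Extent and MPOOL_BLUESTORE are stand-in names):

// The typedef is the single place where both the slab size and the pool are chosen.
typedef slab_vector<Extent, /*SlabSize=*/8, MPOOL_BLUESTORE> extent_vec_t;

// Re-tuning the slab size, or moving these containers to a different pool (or out of
// pooling entirely while chasing a bug), is then a one-line change to the typedef
// rather than an edit at every site that declares one.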
 
> > Standalone objects, i.e., naked new/delete are easily done by making
> > the equivalent of a slab_intrusive_list and maybe a macro or two.
> > There's some tricky initialization for this one (see below).
> >
> > -------------------------------------------
> >
> > Implementation
> >
> > -------------------------------------------
> >
> > Requirement 6 is what drives this implementation.
> >
> > I would extend each slab_xxxx container to also virtually inherit from
> > a Pool_Member interface, this interface allows the memory pool global
> > machinery to implement #6.
> >
> > I propose that the ctor/dtor for Pool_Member (one for each container)
> > put itself on a list within the respective memory pool. This MUST be a
> > synchronized operation but we can shard the list to reduce the
> > collisions (use the low 4-5 bits of the creating thread pointer to
> > index the shard -- minimizes ctor expense but increases the dtor
> > expense -- which is often done in "trim"). This assumes that the rate
> > of container creation/destruction within a memory pool is not super
> > high -- we could make this be a compile-time option if it becomes too
> expensive.
> >
> > The per-pool sharded lists allow the debug routines to visit each
> > container and do things like ask "how many elements do you have?" --
> > "How big is each element" -- "Give me a printable string of the
> > type-signature for this container". Once you have this list you can
> > generate lots of interesting debug reports. Because you can sort by
> > individual containers as well as group containers by their type
> > signatures (i.e., combined the consumption of all "map<a,b>"
> > containers as a group). You can report out both by Byte as well as by
> > Element count consumption.
> 
> Yeah, this sounds pretty nice.  But I think it's important to be able to compile
> it out.  I think we will have a lot of creations/destructions.
> For example, in BlueStore, Onodes have maps of Extents, those map to
> Blobs, and those have a BufferSpace with a map of Buffers for cached data.  I
> expect that blobs and even onodes will be coming in and out of cache a lot.
> 
> > This kind of information usually allows you to quickly figure out
> > where the memory is being consumed. A bit of script wizardry would
> > recognize that some containers contain other containers. For example,
> > no reason a simple Python script couldn't recognize that each oNode
> > might have a bunch of vector<pextents> within it and tell you things
> > like the average number of pextents / oNodes. Average DRAM
> consumption
> > per oNode (which is a pretty complicated combination of pextents,
> > lextents, buffer::ptr,
> > etc.)
> >
> > Comments: ?????
> 
> It would be nice to build this on top of existing allocator libraries if we can.
> For example, something in boost.  I took a quick peek the other day and
> didn't find something that allowed simple interrogation about utilization,
> though, which was surprising.  It would be nice to have something useful
> (perhaps without #6) that could be done relatively quickly and address all of
> the other requirements.
> 
> sage


* RE: Memory Pooling and Containers
  2016-09-28 13:34   ` Haomai Wang
@ 2016-09-28 15:56     ` Allen Samuels
  2016-09-28 19:39     ` Allen Samuels
  1 sibling, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 15:56 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: Ceph Development

I'll look at those references. Thanks!


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xsky.com]
> Sent: Wednesday, September 28, 2016 3:34 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 27 Sep 2016, Allen Samuels wrote:
> >> As we discussed in the Bluestore standup this morning. This is
> >> intended to start a discussion about creating some internal memory
> >> pooling technology to try to get a better handle on the internal
> >> usage of memory by Ceph. Let's start by discussing the requirements...
> >>
> >> Here is my list of requirements:
> >>
> >> (1) Should be able to create an arbitrary number of "pools" of memory.
> >>
> >> (2) Developers should be able to declare that a particular container
> >> (i.e., STL or boost-like container) is wholly contained within a pool.
> >>
> >> (3) Beyond declarations (and possibly constructor initialization), no
> >> explicit code is required to be written by developers to support (2).
> >> All container manipulation primitives properly update the accounting.
> >>
> >> (4) Beyond construction/destruction costs, no container operation is
> >> burdened by additional code -- only implicit malloc/free operations
> >> are burdened with accounting.
> >>
> >> (5) The system tracks the aggregate amount of memory consumed in
> each
> >> pool and it's relatively cheap to interrogate the current total
> >> consumption.
> >
> > Yes
> >
> >> (6) The system tracks the aggregate amount of memory consumed by
> each
> >> container in each pool -- but this is expensive to interrogate and is
> >> intended to be used primarily for debugging purposes.
> >
> > This one sounds like a nice-to-have to me.  If there is a performance
> > cost I would skip it.
> >
> >> (7) generic object new/delete is possible, but not freed of the
> >> accounting requirements -- especially #6, i.e..
> >>
> >> (8) No backpressure is built into the scheme, i.e., nobody has to
> >> worry about suddenly being "out" of memory or being delayed -- just
> >> because some particular pool is filling up. That's a higher level
> >> problem to solve. No memory is "reserved" either -- If you
> >> overcommit, that's also not solved at this layer. IMO, this is a
> >> crappy place to be doing ingest and flow control.
> >>
> >> (9) Implementation must be multi-thread and multi-socket aware. It
> >> should expect high levels of thread concurrency and avoid unnecessary
> >> global data manipulation (expect internal sharding of data structures
> >> -- something like an arena-based malloc scheme).
> >
> > Yes
> >
> >> Requirement 5 allows a "trimming" system to be developed. I think
> >> there are really two styles for this:
> >>
> >> (a) Time-based, i.e., periodically some thread wakes up and checks
> >> memory usage within a pool. If it doesn't like it, then it's
> >> responsible for "fixing" it, i.e., trimming as needed.
> >>
> >> (b) event-based. No reason that we couldn't setup an event or
> >> condition variable per-pool and have the malloc/free code trigger
> >> that condition/variable. It adds one or two compare/branches to each
> >> malloc / free operation (which is pretty cheap), but doesn't have the
> >> latency costs of (a). The downside is that this implicitly assumes a
> >> single global-thread is responsible for cleaning each pool which
> >> works well when there are a relatively small number of pools.
> >>
> >> Here is my list of anti-requirements:
> >>
> >> (1) No hierarchical relationship between the pools. [IMO, this is
> >> kewl, but unnecessary and tends to screw up your cache, i.e., destroys
> #9.
> >>
> >> (2) No physical colocation of the allocated pool memory. The pool is
> >> "logical", i.e., an accounting mirage only.
> >>
> >> (3) No reason to dynamically create/destroy memory pools. They can be
> >> statically declared (this dramatically simplifies the code that uses
> >> this system).
> >
> > Yes.  Great summary!
> >
> >> Let the discussion begin!!
> >> /////////////////////////
> >>
> >> Here is my proposed external interface to the design:
> >>
> >> First, look at the slab_xxxx containers that I just submitted. You
> >> can find them at
> >>
> https://github.com/allensamuels/ceph/blob/master/src/include/slab_con
> >> tainers.h
> >>
> >> I would propose to extend those containers as the basis for the
> >> memory pooling.
> >>
> >> First, there's a global enum that defines the memory pools -- yes,
> >> they're static and a small number
> >>
> >> enum mpool_index {
> >>    MPOOL_ONE,
> >>    MPOOL_TWO,
> >> ...
> >>    MPOOL_LAST
> >> };
> >>
> >> And a global object for each pool:
> >>
> >> class mpool; // TBD ... see below.
> >>
> >> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> >>
> >> Each slab_xxx container template is augmented to expect receive an
> >> additional "enum mpool_index" parameter.
> >>
> >> That's ALL THAT'S required for the developer. In other words, if each
> >> definition of an STL container uses a typedef with the right mpool
> >> index, then you're done. The machinery takes care of everything else
> >> :)
> >
> > FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> > g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> > the global instance; the latter means you could manage the memory pool
> > however you like (e.g., as part of the CephContext for librados).
> > That's a small detail, though.
> >
> >> Standalone objects, i.e., naked new/delete are easily done by making
> >> the equivalent of a slab_intrusive_list and maybe a macro or two.
> >> There's some tricky initialization for this one (see below).
> >>
> >> -------------------------------------------
> >>
> >> Implementation
> >>
> >> -------------------------------------------
> >>
> >> Requirement 6 is what drives this implementation.
> >>
> >> I would extend each slab_xxxx container to also virtually inherit
> >> from a Pool_Member interface, this interface allows the memory pool
> >> global machinery to implement #6.
> >>
> >> I propose that the ctor/dtor for Pool_Member (one for each container)
> >> put itself on a list within the respective memory pool. This MUST be
> >> a synchronized operation but we can shard the list to reduce the
> >> collisions (use the low 4-5 bits of the creating thread pointer to
> >> index the shard -- minimizes ctor expense but increases the dtor
> >> expense -- which is often done in "trim"). This assumes that the rate
> >> of container creation/destruction within a memory pool is not super
> >> high -- we could make this be a compile-time option if it becomes too
> expensive.
> >>
> >> The per-pool sharded lists allow the debug routines to visit each
> >> container and do things like ask "how many elements do you have?" --
> >> "How big is each element" -- "Give me a printable string of the
> >> type-signature for this container". Once you have this list you can
> >> generate lots of interesting debug reports. Because you can sort by
> >> individual containers as well as group containers by their type
> >> signatures (i.e., combined the consumption of all "map<a,b>"
> >> containers as a group). You can report out both by Byte as well as by
> >> Element count consumption.
> >
> > Yeah, this sounds pretty nice.  But I think it's important to be able
> > to compile it out.  I think we will have a lot of creations/destructions.
> > For example, in BlueStore, Onodes have maps of Extents, those map to
> > Blobs, and those have a BufferSpace with a map of Buffers for cached
> > data.  I expect that blobs and even onodes will be coming in and out
> > of cache a lot.
> >
> >> This kind of information usually allows you to quickly figure out
> >> where the memory is being consumed. A bit of script wizardry would
> >> recognize that some containers contain other containers. For example,
> >> no reason a simple Python script couldn't recognize that each oNode
> >> might have a bunch of vector<pextents> within it and tell you things
> >> like the average number of pextents / oNodes. Average DRAM
> >> consumption per oNode (which is a pretty complicated combination of
> >> pextents, lextents, buffer::ptr,
> >> etc.)
> >>
> >> Comments: ?????
> >
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.  It would be nice to
> > have something useful (perhaps without #6) that could be done
> > relatively quickly and address all of the other requirements.
> 
> If we want to do in a light way, I recommend to refer to seastar
> impl(https://github.com/scylladb/seastar/blob/master/core/memory.hh
> https://github.com/scylladb/seastar/blob/master/core/slab.hh). This can
> give a lot of insights.
> 
> >
> > sage


* RE: Memory Pooling and Containers
  2016-09-28 13:34   ` Haomai Wang
  2016-09-28 15:56     ` Allen Samuels
@ 2016-09-28 19:39     ` Allen Samuels
  1 sibling, 0 replies; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 19:39 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: Ceph Development

The seastar stuff is interesting; it looks like a replacement for the underlying memory allocator, with some heavy optimization and assumptions about the mapping of cores onto malloc/free operations.

I was intending to do something considerably smaller and easier, solely in the accounting area, and not to attempt to replace the existing malloc/free work that's been done (tcmalloc/jemalloc, etc.).


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xsky.com]
> Sent: Wednesday, September 28, 2016 3:34 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 27 Sep 2016, Allen Samuels wrote:
> >> As we discussed in the Bluestore standup this morning. This is
> >> intended to start a discussion about creating some internal memory
> >> pooling technology to try to get a better handle on the internal
> >> usage of memory by Ceph. Let's start by discussing the requirements...
> >>
> >> Here is my list of requirements:
> >>
> >> (1) Should be able to create an arbitrary number of "pools" of memory.
> >>
> >> (2) Developers should be able to declare that a particular container
> >> (i.e., STL or boost-like container) is wholly contained within a pool.
> >>
> >> (3) Beyond declarations (and possibly constructor initialization), no
> >> explicit code is required to be written by developers to support (2).
> >> All container manipulation primitives properly update the accounting.
> >>
> >> (4) Beyond construction/destruction costs, no container operation is
> >> burdened by additional code -- only implicit malloc/free operations
> >> are burdened with accounting.
> >>
> >> (5) The system tracks the aggregate amount of memory consumed in
> each
> >> pool and it's relatively cheap to interrogate the current total
> >> consumption.
> >
> > Yes
> >
> >> (6) The system tracks the aggregate amount of memory consumed by
> each
> >> container in each pool -- but this is expensive to interrogate and is
> >> intended to be used primarily for debugging purposes.
> >
> > This one sounds like a nice-to-have to me.  If there is a performance
> > cost I would skip it.
> >
> >> (7) generic object new/delete is possible, but not freed of the
> >> accounting requirements -- especially #6, i.e..
> >>
> >> (8) No backpressure is built into the scheme, i.e., nobody has to
> >> worry about suddenly being "out" of memory or being delayed -- just
> >> because some particular pool is filling up. That's a higher level
> >> problem to solve. No memory is "reserved" either -- If you
> >> overcommit, that's also not solved at this layer. IMO, this is a
> >> crappy place to be doing ingest and flow control.
> >>
> >> (9) Implementation must be multi-thread and multi-socket aware. It
> >> should expect high levels of thread concurrency and avoid unnecessary
> >> global data manipulation (expect internal sharding of data structures
> >> -- something like an arena-based malloc scheme).
> >
> > Yes
> >
> >> Requirement 5 allows a "trimming" system to be developed. I think
> >> there are really two styles for this:
> >>
> >> (a) Time-based, i.e., periodically some thread wakes up and checks
> >> memory usage within a pool. If it doesn't like it, then it's
> >> responsible for "fixing" it, i.e., trimming as needed.
> >>
> >> (b) event-based. No reason that we couldn't setup an event or
> >> condition variable per-pool and have the malloc/free code trigger
> >> that condition/variable. It adds one or two compare/branches to each
> >> malloc / free operation (which is pretty cheap), but doesn't have the
> >> latency costs of (a). The downside is that this implicitly assumes a
> >> single global-thread is responsible for cleaning each pool which
> >> works well when there are a relatively small number of pools.
> >>
> >> Here is my list of anti-requirements:
> >>
> >> (1) No hierarchical relationship between the pools. [IMO, this is
> >> kewl, but unnecessary and tends to screw up your cache, i.e., destroys
> #9.
> >>
> >> (2) No physical colocation of the allocated pool memory. The pool is
> >> "logical", i.e., an accounting mirage only.
> >>
> >> (3) No reason to dynamically create/destroy memory pools. They can be
> >> statically declared (this dramatically simplifies the code that uses
> >> this system).
> >
> > Yes.  Great summary!
> >
> >> Let the discussion begin!!
> >> /////////////////////////
> >>
> >> Here is my proposed external interface to the design:
> >>
> >> First, look at the slab_xxxx containers that I just submitted. You
> >> can find them at
> >>
> https://github.com/allensamuels/ceph/blob/master/src/include/slab_con
> >> tainers.h
> >>
> >> I would propose to extend those containers as the basis for the
> >> memory pooling.
> >>
> >> First, there's a global enum that defines the memory pools -- yes,
> >> they're static and a small number
> >>
> >> enum mpool_index {
> >>    MPOOL_ONE,
> >>    MPOOL_TWO,
> >> ...
> >>    MPOOL_LAST
> >> };
> >>
> >> And a global object for each pool:
> >>
> >> class mpool; // TBD ... see below.
> >>
> >> Extern mpool[MPOOL_LAST]; // Actual definition of each pool
> >>
> >> Each slab_xxx container template is augmented to expect receive an
> >> additional "enum mpool_index" parameter.
> >>
> >> That's ALL THAT'S required for the developer. In other words, if each
> >> definition of an STL container uses a typedef with the right mpool
> >> index, then you're done. The machinery takes care of everything else
> >> :)
> >
> > FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> > g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> > the global instance; the latter means you could manage the memory pool
> > however you like (e.g., as part of the CephContext for librados).
> > That's a small detail, though.
> >
> >> Standalone objects, i.e., naked new/delete are easily done by making
> >> the equivalent of a slab_intrusive_list and maybe a macro or two.
> >> There's some tricky initialization for this one (see below).
> >>
> >> -------------------------------------------
> >>
> >> Implementation
> >>
> >> -------------------------------------------
> >>
> >> Requirement 6 is what drives this implementation.
> >>
> >> I would extend each slab_xxxx container to also virtually inherit
> >> from a Pool_Member interface, this interface allows the memory pool
> >> global machinery to implement #6.
> >>
> >> I propose that the ctor/dtor for Pool_Member (one for each container)
> >> put itself on a list within the respective memory pool. This MUST be
> >> a synchronized operation but we can shard the list to reduce the
> >> collisions (use the low 4-5 bits of the creating thread pointer to
> >> index the shard -- minimizes ctor expense but increases the dtor
> >> expense -- which is often done in "trim"). This assumes that the rate
> >> of container creation/destruction within a memory pool is not super
> >> high -- we could make this be a compile-time option if it becomes too
> expensive.
> >>
> >> The per-pool sharded lists allow the debug routines to visit each
> >> container and do things like ask "how many elements do you have?" --
> >> "How big is each element" -- "Give me a printable string of the
> >> type-signature for this container". Once you have this list you can
> >> generate lots of interesting debug reports. Because you can sort by
> >> individual containers as well as group containers by their type
> >> signatures (i.e., combined the consumption of all "map<a,b>"
> >> containers as a group). You can report out both by Byte as well as by
> >> Element count consumption.
> >
> > Yeah, this sounds pretty nice.  But I think it's important to be able
> > to compile it out.  I think we will have a lot of creations/destructions.
> > For example, in BlueStore, Onodes have maps of Extents, those map to
> > Blobs, and those have a BufferSpace with a map of Buffers for cached
> > data.  I expect that blobs and even onodes will be coming in and out
> > of cache a lot.
> >
> >> This kind of information usually allows you to quickly figure out
> >> where the memory is being consumed. A bit of script wizardry would
> >> recognize that some containers contain other containers. For example,
> >> no reason a simple Python script couldn't recognize that each oNode
> >> might have a bunch of vector<pextents> within it and tell you things
> >> like the average number of pextents per oNode, or the average DRAM
> >> consumption per oNode (which is a pretty complicated combination of
> >> pextents, lextents, buffer::ptr,
> >> etc.)
> >>
> >> Comments: ?????
> >
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.  It would be nice to
> > have something useful (perhaps without #6) that could be done
> > relatively quickly and address all of the other requirements.
> 
> If we want to do this in a lightweight way, I recommend looking at the seastar
> impl (https://github.com/scylladb/seastar/blob/master/core/memory.hh
> https://github.com/scylladb/seastar/blob/master/core/slab.hh). It can
> give us a lot of insight.
> 
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-28 13:27 ` Sage Weil
  2016-09-28 13:34   ` Haomai Wang
  2016-09-28 15:56   ` Allen Samuels
@ 2016-09-28 21:16   ` Jesse Williamson
  2016-09-28 21:31     ` Allen Samuels
  2 siblings, 1 reply; 15+ messages in thread
From: Jesse Williamson @ 2016-09-28 21:16 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Ceph Development

On Wed, 28 Sep 2016, Sage Weil wrote:

> It would be nice to build this on top of existing allocator libraries if
> we can.  For example, something in boost.  I took a quick peek the other
> day and didn't find something that allowed simple interrogation about
> utilization, though, which was surprising.

Boost::pool looks to me like it should fulfill most of the requirements, but
I haven't used it in production:
http://www.boost.org/doc/libs/1_62_0/libs/pool/doc/html/index.html

By "iterrogation about utilization", do you mean statistics about the 
allocator activity?

-Jesse

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Memory Pooling and Containers
  2016-09-28 21:16   ` Jesse Williamson
@ 2016-09-28 21:31     ` Allen Samuels
  2016-09-30 14:22       ` Sage Weil
  0 siblings, 1 reply; 15+ messages in thread
From: Allen Samuels @ 2016-09-28 21:31 UTC (permalink / raw)
  To: Jesse Williamson, Sage Weil; +Cc: Ceph Development

Boost::pool works very well when you're allocating "same"-sized objects. That's not our situation: we're allocating lots of different sized objects -- some small, some large. The only way that Boost::pool supports that situation is to use the "ordered_free" operation to keep the freelist sorted (if you don't use it, you'll get fragmentation that'll prevent allocation of large objects -- even though there's plenty of free memory). The implementation of the sorted freelist is O(N), which should work well for small pools, but that's the exact opposite of the desired use for Ceph -- we're targeting large pools (think 1GB).

I didn't word it very well, but my proposal doesn't actually change the underlying malloc/free algorithm, rather it's intended to put some statistics around memory usage so that we can self-trim our memory pools.
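
For reference, this is the boost::pool interface I'm talking about, in a toy
usage sketch:

  #include <boost/pool/pool.hpp>

  int main() {
    boost::pool<> p(64);               // every chunk in this pool is 64 bytes

    void* a = p.malloc();              // one chunk: O(1)
    void* b = p.ordered_malloc(16);    // 16 contiguous chunks, i.e. a "large" object

    p.free(a);                         // O(1), but leaves the free list unsorted
    p.ordered_free(b, 16);             // keeps the free list sorted -- O(free-list length)
    return 0;
  }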

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Jesse Williamson [mailto:jwilliamson@suse.de]
> Sent: Wednesday, September 28, 2016 11:17 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, 28 Sep 2016, Sage Weil wrote:
> 
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.
> 
> Boost::pool looks to me like it should fulfill most of the requirements, but I
> haven't used it in production:
> http://www.boost.org/doc/libs/1_62_0/libs/pool/doc/html/index.html
> 
> By "iterrogation about utilization", do you mean statistics about the allocator
> activity?
> 
> -Jesse

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: Memory Pooling and Containers
  2016-09-28 21:31     ` Allen Samuels
@ 2016-09-30 14:22       ` Sage Weil
  2016-09-30 14:30         ` Allen Samuels
  2016-09-30 18:54         ` Gregory Farnum
  0 siblings, 2 replies; 15+ messages in thread
From: Sage Weil @ 2016-09-30 14:22 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Jesse Williamson, Ceph Development

On Wed, 28 Sep 2016, Allen Samuels wrote:
> Boost::pool works very well when you're allocating "same"-sized objects. 
> That's not our situation, we're allocating lots of different sized 
> objects -- some small, some large. The only way that Boost::pool 
> supports that situation is to use the "ordered_free" operation to keep 
> the freelist sorted (if you don't use it then you'll get fragmentation 
> that'll prevent allocation of large objects -- even though there's 
> plenty of free memory). The implementation of the sorted freelist is 
> O(N), which should work well for small pools, but that's the exact 
> opposite of the desired use for Ceph, we're targeting large pools (think 
> 1GB).
> 
> I didn't word it very well, but my proposal doesn't actually change the 
> underlying malloc/free algorithm, rather it's intended to put some 
> statistics around memory usage so that we can self-trim our memory 
> pools.

We were doing some heap profiling yesterday and one interesting thing is 
that the utilized heap reported by tcmalloc is about 1/2 the RSS.  We 
probably want to consider creating separate pools for the handful of 
objects that are consuming the bulk of the heap.
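
(For anyone who wants to reproduce this: the tcmalloc-side numbers can be
pulled at runtime from the gperftools MallocExtension properties, roughly like
so, and compared against RSS from /proc/self/status.)

  #include <gperftools/malloc_extension.h>
  #include <cstddef>
  #include <cstdio>

  void dump_tcmalloc_stats() {
    size_t allocated = 0, heap = 0, pageheap_free = 0, unmapped = 0;
    MallocExtension* m = MallocExtension::instance();
    m->GetNumericProperty("generic.current_allocated_bytes", &allocated);
    m->GetNumericProperty("generic.heap_size", &heap);
    m->GetNumericProperty("tcmalloc.pageheap_free_bytes", &pageheap_free);
    m->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);
    printf("allocated %zu  heap %zu  pageheap_free %zu  unmapped %zu\n",
           allocated, heap, pageheap_free, unmapped);
  }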

We did this a few years back in the MDS and IIRC it helped significantly 
with memory utilization there.

sage

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-30 14:22       ` Sage Weil
@ 2016-09-30 14:30         ` Allen Samuels
  2016-09-30 21:47           ` Jesse Williamson
  2016-09-30 18:54         ` Gregory Farnum
  1 sibling, 1 reply; 15+ messages in thread
From: Allen Samuels @ 2016-09-30 14:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: Jesse Williamson, Ceph Development

I'm expecting the "in-object" capabilities of the slab_xxxxx containers to dramatically reduce malloc/free calls. Having some visibility into where the memory is actually going will also spur a bit more development that will likely have significant positive effects. 
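
For anyone who hasn't looked at them: the idea is that the first few elements
live inside the container object itself, so small containers never touch the
heap at all. A toy sketch of that idea (not the actual slab_xxxxx code):

  #include <cstddef>
  #include <new>
  #include <utility>

  template <typename T, std::size_t SLAB_SLOTS>
  class slab_vector {
    alignas(T) unsigned char slab_[SLAB_SLOTS * sizeof(T)];  // in-object storage
    T* data_ = reinterpret_cast<T*>(slab_);
    std::size_t size_ = 0;
    std::size_t capacity_ = SLAB_SLOTS;

   public:
    slab_vector() = default;
    slab_vector(const slab_vector&) = delete;              // keep the sketch simple
    slab_vector& operator=(const slab_vector&) = delete;
    ~slab_vector() {
      for (std::size_t i = 0; i < size_; ++i) data_[i].~T();
      if (data_ != reinterpret_cast<T*>(slab_)) ::operator delete(data_);
    }
    void push_back(const T& v) {
      if (size_ == capacity_) {                            // spill to the heap only when full
        T* heap = static_cast<T*>(::operator new(2 * capacity_ * sizeof(T)));
        for (std::size_t i = 0; i < size_; ++i) {
          new (heap + i) T(std::move(data_[i]));
          data_[i].~T();
        }
        if (data_ != reinterpret_cast<T*>(slab_)) ::operator delete(data_);
        data_ = heap;
        capacity_ *= 2;
      }
      new (data_ + size_) T(v);
      ++size_;
    }
    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return data_[i]; }
  };

  // e.g. slab_vector<pextent, 4> holds up to 4 pextents with zero heap traffic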

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Sep 30, 2016, at 4:22 PM, Sage Weil <sage@newdream.net> wrote:
> 
>> On Wed, 28 Sep 2016, Allen Samuels wrote:
>> Boost::pool works very well when you're allocating "same"-sized objects. 
>> That's not our situation, we're allocating lots of different sized 
>> objects -- some small, some large. The only way that Boost::pool 
>> supports that situation is to use the "ordered_free" operation to keep 
>> the freelist sorted (if you don't use it then you'll get fragmentation 
>> that'll prevent allocation of large objects -- even though there's 
>> plenty of free memory). The implementation of the sorted freelist is 
>> O(N), which should work well for small pools, but that's the exact 
>> opposite of the desired use for Ceph, we're targeting large pools (think 
>> 1GB).
>> 
>> I didn't word it very well, but my proposal doesn't actually change the 
>> underlying malloc/free algorithm, rather it's intended to put some 
>> statistics around memory usage so that we can self-trim our memory 
>> pools.
> 
> We were doing some heap profiling yesterday and one interesting thing is 
> that the utilized heap reported by tcmalloc is about 1/2 the RSS.  We 
> probably want to consider creating separate pools for the handful of 
> objects that are consuming the bulk of the heap.
> 
> We did this a few years back in the MDS and IIRC it helped significantly 
> with memory utilization there.
> 
> sage

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-30 14:22       ` Sage Weil
  2016-09-30 14:30         ` Allen Samuels
@ 2016-09-30 18:54         ` Gregory Farnum
  1 sibling, 0 replies; 15+ messages in thread
From: Gregory Farnum @ 2016-09-30 18:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Jesse Williamson, Ceph Development

On Fri, Sep 30, 2016 at 7:22 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 28 Sep 2016, Allen Samuels wrote:
>> Boost::pool works very well when you're allocating "same"-sized objects.
>> That's not our situation, we're allocating lots of different sized
>> objects -- some small, some large. The only way that Boost::pool
>> supports that situation is to use the "ordered_free" operation to keep
>> the freelist sorted (if you don't use it then you'll get fragmentation
>> that'll prevent allocation of large objects -- even though there's
>> plenty of free memory). The implementation of the sorted freelist is
>> O(N), which should work well for small pools, but that's the exact
>> opposite of the desired use for Ceph, we're targeting large pools (think
>> 1GB).
>>
>> I didn't word it very well, but my proposal doesn't actually change the
>> underlying malloc/free algorithm, rather it's intended to put some
>> statistics around memory usage so that we can self-trim our memory
>> pools.
>
> We were doing some heap profiling yesterday and one interesting thing is
> that the utilized heap reported by tcmalloc is about 1/2 the RSS.  We
> probably want to consider creating separate pools for the handful of
> objects that are consuming the bulk of the heap.

Off-hand I'd say this is almost certainly the result of bumping the
thread cache sizes way up; I don't think it used to be the case. What
settings were being used?
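
(For reference, the knob in question is tcmalloc's aggregate thread-cache
budget. It can be set via the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
environment variable, or at runtime, e.g.:)

  #include <gperftools/malloc_extension.h>
  #include <cstddef>

  // e.g. give tcmalloc a 128MB aggregate thread-cache budget at startup
  void set_tcmalloc_thread_cache(std::size_t bytes = 128 * 1024 * 1024) {
    MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.max_total_thread_cache_bytes", bytes);
  }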

> We did this a few years back in the MDS and IIRC it helped significantly
> with memory utilization there.

More than a "few" years, but yes.
-Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Memory Pooling and Containers
  2016-09-30 14:30         ` Allen Samuels
@ 2016-09-30 21:47           ` Jesse Williamson
  0 siblings, 0 replies; 15+ messages in thread
From: Jesse Williamson @ 2016-09-30 21:47 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Ceph Development

On Fri, 30 Sep 2016, Allen Samuels wrote:

> I'm expecting the "in-object" capabilities of the slab_xxxxx containers 
> to dramatically reduce malloc/free calls. Having some visibility 
> into where the memory is actually going will also spur a bit more 
> development that will likely have more significant positive effects.

Definitely sounds useful! I see that some work has been done with massif, 
too.

Also, I took a quick peek into boost::pool. It's built exactly as you 
said -- specifically for the many-small-objects situation. I didn't see an 
"obvious" way to hook in a new freelist scheme.

Have a great weekend,

-Jesse


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-09-30 21:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-27 20:30 Memory Pooling and Containers Allen Samuels
2016-09-27 23:00 ` Gregory Farnum
2016-09-28  9:13   ` Allen Samuels
2016-09-28 13:16   ` Sage Weil
2016-09-28 13:27 ` Sage Weil
2016-09-28 13:34   ` Haomai Wang
2016-09-28 15:56     ` Allen Samuels
2016-09-28 19:39     ` Allen Samuels
2016-09-28 15:56   ` Allen Samuels
2016-09-28 21:16   ` Jesse Williamson
2016-09-28 21:31     ` Allen Samuels
2016-09-30 14:22       ` Sage Weil
2016-09-30 14:30         ` Allen Samuels
2016-09-30 21:47           ` Jesse Williamson
2016-09-30 18:54         ` Gregory Farnum
