From: Allen Samuels <Allen.Samuels@sandisk.com>
To: Haomai Wang <haomai@xsky.com>, Sage Weil <sage@newdream.net>
Cc: Ceph Development <ceph-devel@vger.kernel.org>
Subject: RE: Memory Pooling and Containers
Date: Wed, 28 Sep 2016 15:56:42 +0000
Message-ID: <BLUPR0201MB15248C633E182B073E7F94AEE8CF0@BLUPR0201MB1524.namprd02.prod.outlook.com>
In-Reply-To: <CACJqLyauuKpGv9u7WJx18hLEBa0dp+dGZ_ueuMdNeaHi3YRK_w@mail.gmail.com>

I'll look at those references. Thanks!


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xsky.com]
> Sent: Wednesday, September 28, 2016 3:34 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: Memory Pooling and Containers
> 
> On Wed, Sep 28, 2016 at 9:27 PM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 27 Sep 2016, Allen Samuels wrote:
> >> As we discussed in the Bluestore standup this morning. This is
> >> intended to start a discussion about creating some internal memory
> >> pooling technology to try to get a better handle on the internal
> >> usage of memory by Ceph. Let's start by discussing the requirements...
> >>
> >> Here is my list of requirements:
> >>
> >> (1) Should be able to create an arbitrary number of "pools" of memory.
> >>
> >> (2) Developers should be able to declare that a particular container
> >> (i.e., STL or boost-like container) is wholly contained within a pool.
> >>
> >> (3) Beyond declarations (and possibly constructor initialization), no
> >> explicit code is required to be written by developers to support (2).
> >> All container manipulation primitives properly update the accounting.
> >>
> >> (4) Beyond construction/destruction costs, no container operation is
> >> burdened by additional code -- only implicit malloc/free operations
> >> are burdened with accounting.
> >>
> >> (5) The system tracks the aggregate amount of memory consumed in each
> >> pool and it's relatively cheap to interrogate the current total
> >> consumption.
> >
> > Yes
> >
> >> (6) The system tracks the aggregate amount of memory consumed by each
> >> container in each pool -- but this is expensive to interrogate and is
> >> intended to be used primarily for debugging purposes.
> >
> > This one sounds like a nice-to-have to me.  If there is a performance
> > cost I would skip it.
> >
> >> (7) Generic object new/delete is possible, but is not freed from the
> >> accounting requirements -- especially #6.
> >>
> >> (8) No backpressure is built into the scheme, i.e., nobody has to
> >> worry about suddenly being "out" of memory or being delayed -- just
> >> because some particular pool is filling up. That's a higher level
> >> problem to solve. No memory is "reserved" either -- If you
> >> overcommit, that's also not solved at this layer. IMO, this is a
> >> crappy place to be doing ingest and flow control.
> >>
> >> (9) Implementation must be multi-thread and multi-socket aware. It
> >> should expect high levels of thread concurrency and avoid unnecessary
> >> global data manipulation (expect internal sharding of data structures
> >> -- something like an arena-based malloc scheme).
> >
> > Yes
> >
> >> Requirement 5 allows a "trimming" system to be developed. I think
> >> there are really two styles for this:
> >>
> >> (a) Time-based, i.e., periodically some thread wakes up and checks
> >> memory usage within a pool. If it doesn't like it, then it's
> >> responsible for "fixing" it, i.e., trimming as needed.
> >>
> >> (b) event-based. No reason that we couldn't setup an event or
> >> condition variable per-pool and have the malloc/free code trigger
> >> that condition/variable. It adds one or two compare/branches to each
> >> malloc / free operation (which is pretty cheap), but doesn't have the
> >> latency costs of (a). The downside is that this implicitly assumes a
> >> single global-thread is responsible for cleaning each pool which
> >> works well when there are a relatively small number of pools.
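
For concreteness, here is a minimal sketch of how requirements (5) and (9) plus
the event-based trimming of style (b) could fit together. The class name mpool
matches the proposal further down in this mail; NUM_SHARDS, adjust(), total(),
wait_for_trim(), and trim_target are hypothetical names invented for this
sketch, not existing Ceph or slab_containers code.

#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>

class mpool {
  static constexpr std::size_t NUM_SHARDS = 32;
  struct alignas(64) shard {                     // one cache line per shard (req 9)
    std::atomic<int64_t> bytes{0};
  };
  shard shards[NUM_SHARDS];

  std::mutex trim_lock;
  std::condition_variable trim_cond;
  std::atomic<int64_t> trim_target{-1};          // -1: no event-based trimming

  static std::size_t my_shard() {
    static thread_local char anchor;             // its address differs per thread
    return (reinterpret_cast<std::uintptr_t>(&anchor) >> 6) % NUM_SHARDS;
  }

public:
  // called by the pool-aware allocators on every malloc/free: one relaxed
  // atomic add plus, for style (b), one compare/branch to wake a trimmer
  void adjust(int64_t delta) {
    auto& s = shards[my_shard()];
    int64_t now = s.bytes.fetch_add(delta, std::memory_order_relaxed) + delta;
    int64_t target = trim_target.load(std::memory_order_relaxed);
    if (target >= 0 && now > target / int64_t(NUM_SHARDS))
      trim_cond.notify_one();
  }

  // requirement (5): cheap aggregate -- just sums NUM_SHARDS counters
  int64_t total() const {
    int64_t t = 0;
    for (auto& s : shards)
      t += s.bytes.load(std::memory_order_relaxed);
    return t;
  }

  // a style-(b) trimmer thread parks here; a style-(a) trimmer just polls total()
  void wait_for_trim() {
    std::unique_lock<std::mutex> l(trim_lock);
    trim_cond.wait(l);
  }

  void set_trim_target(int64_t bytes) { trim_target = bytes; }
};

Comparing the per-shard counter against a per-shard slice of the target keeps
the hot path to a single relaxed add and one compare, at the price of a
slightly fuzzy trigger point.
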
> >>
> >> Here is my list of anti-requirements:
> >>
> >> (1) No hierarchical relationship between the pools. [IMO, this is
> >> kewl, but unnecessary and tends to screw up your cache, i.e., destroys #9.]
> >>
> >> (2) No physical colocation of the allocated pool memory. The pool is
> >> "logical", i.e., an accounting mirage only.
> >>
> >> (3) No reason to dynamically create/destroy memory pools. They can be
> >> statically declared (this dramatically simplifies the code that uses
> >> this system).
> >
> > Yes.  Great summary!
> >
> >> Let the discussion begin!!
> >> /////////////////////////
> >>
> >> Here is my proposed external interface to the design:
> >>
> >> First, look at the slab_xxxx containers that I just submitted. You
> >> can find them at
> >>
> >> https://github.com/allensamuels/ceph/blob/master/src/include/slab_containers.h
> >>
> >> I would propose to extend those containers as the basis for the
> >> memory pooling.
> >>
> >> First, there's a global enum that defines the memory pools -- yes,
> >> they're static and a small number
> >>
> >> enum mpool_index {
> >>    MPOOL_ONE,
> >>    MPOOL_TWO,
> >> ...
> >>    MPOOL_LAST
> >> };
> >>
> >> And a global object for each pool:
> >>
> >> class mpool; // TBD ... see below.
> >>
> >> extern mpool g_mpool[MPOOL_LAST]; // Actual definition of each pool
> >>
> >> Each slab_xxx container template is augmented to receive an
> >> additional "enum mpool_index" parameter.
> >>
> >> That's ALL THAT'S required for the developer. In other words, if each
> >> definition of an STL container uses a typedef with the right mpool
> >> index, then you're done. The machinery takes care of everything else
> >> :)
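
To make the "a typedef is all you write" point concrete, here is a sketch of
what the developer-facing side could look like: an allocator template that
carries the pool enum, building on the hypothetical mpool::adjust() sketched
earlier and the g_mpool array declared above. pool_allocator and mpool_one_map
are invented names; the real slab_containers interface may well differ.

#include <cstddef>
#include <cstdint>
#include <map>

template <typename T, mpool_index POOL>
struct pool_allocator {
  using value_type = T;
  // the pool index is a non-type template parameter, so allocator_traits'
  // default rebind doesn't apply and we spell it out
  template <typename U> struct rebind { using other = pool_allocator<U, POOL>; };

  pool_allocator() = default;
  template <typename U> pool_allocator(const pool_allocator<U, POOL>&) {}

  T* allocate(std::size_t n) {
    g_mpool[POOL].adjust(std::int64_t(n * sizeof(T)));   // the only added cost
    return static_cast<T*>(::operator new(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) {
    g_mpool[POOL].adjust(-std::int64_t(n * sizeof(T)));
    ::operator delete(p);
  }
};

template <typename T, typename U, mpool_index P>
bool operator==(const pool_allocator<T, P>&, const pool_allocator<U, P>&) { return true; }
template <typename T, typename U, mpool_index P>
bool operator!=(const pool_allocator<T, P>&, const pool_allocator<U, P>&) { return false; }

// all a developer writes: a typedef (alias) that names the pool
template <typename K, typename V>
using mpool_one_map =
    std::map<K, V, std::less<K>, pool_allocator<std::pair<const K, V>, MPOOL_ONE>>;

mpool_one_map<std::uint64_t, std::uint64_t> extent_map;  // every node charged to MPOOL_ONE
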
> >
> > FWIW I'm not sure if there's much reason to pass MPOOL_FOO instead of
> > g_mpool[MPOOL_FOO] to the allocator instance.  The former hard-codes
> > the global instance; the latter means you could manage the memory pool
> > however you like (e.g., as part of the CephContext for librados).
> > That's a small detail, though.
> >
> >> Standalone objects, i.e., naked new/delete are easily done by making
> >> the equivalent of a slab_intrusive_list and maybe a macro or two.
> >> There's some tricky initialization for this one (see below).
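
One possible shape for the standalone-object case, sketched here as a base
class whose class-scope operator new/delete charge the pool; the mail's actual
suggestion is a slab_intrusive_list equivalent plus a macro or two, so treat
pool_object and the Onode usage line as hypothetical illustrations only.

#include <cstddef>
#include <cstdint>
#include <new>

template <mpool_index POOL>
struct pool_object {
  static void* operator new(std::size_t sz) {
    g_mpool[POOL].adjust(std::int64_t(sz));      // charge the pool
    return ::operator new(sz);
  }
  static void operator delete(void* p, std::size_t sz) {
    g_mpool[POOL].adjust(-std::int64_t(sz));     // and credit it back
    ::operator delete(p);
  }
};

// usage: plain new/delete, accounted against MPOOL_TWO with no other changes
struct Onode : pool_object<MPOOL_TWO> { /* ... */ };
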
> >>
> >> -------------------------------------------
> >>
> >> Implementation
> >>
> >> -------------------------------------------
> >>
> >> Requirement 6 is what drives this implementation.
> >>
> >> I would extend each slab_xxxx container to also virtually inherit
> >> from a Pool_Member interface, this interface allows the memory pool
> >> global machinery to implement #6.
> >>
> >> I propose that the ctor/dtor for Pool_Member (one for each container)
> >> put itself on a list within the respective memory pool. This MUST be
> >> a synchronized operation but we can shard the list to reduce the
> >> collisions (use the low 4-5 bits of the creating thread pointer to
> >> index the shard -- minimizes ctor expense but increases the dtor
> >> expense -- which is often done in "trim"). This assumes that the rate
> >> of container creation/destruction within a memory pool is not super
> >> high -- we could make this be a compile-time option if it becomes too
> >> expensive.
> >>
> >> The per-pool sharded lists allow the debug routines to visit each
> >> container and do things like ask "how many elements do you have?" --
> >> "How big is each element" -- "Give me a printable string of the
> >> type-signature for this container". Once you have this list you can
> >> generate lots of interesting debug reports, because you can sort by
> >> individual containers as well as group containers by their type
> >> signatures (i.e., combine the consumption of all "map<a,b>"
> >> containers into one line). You can report consumption both by byte
> >> count and by element count.
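
A sketch of that registration scheme, with invented names (mpool_debug_shard,
visit_pool_containers, and Pool_Member's virtuals) standing in for whatever the
real interface becomes; the shard index uses a few low bits of a per-thread
address, in the spirit of the "low 4-5 bits of the creating thread pointer"
idea above.

#include <cstddef>
#include <cstdint>
#include <list>
#include <mutex>
#include <string>

class Pool_Member;

static constexpr std::size_t NUM_DEBUG_SHARDS = 32;   // 5 bits of "thread pointer"

struct mpool_debug_shard {
  std::mutex lock;
  std::list<Pool_Member*> members;
};

// one set of shards per pool; a file-scope array here purely for brevity
static mpool_debug_shard g_debug_shards[MPOOL_LAST][NUM_DEBUG_SHARDS];

// every slab_xxx container would (virtually) inherit from this
class Pool_Member {
  mpool_debug_shard* shard;
  std::list<Pool_Member*>::iterator self;
protected:
  explicit Pool_Member(mpool_index pool) {
    static thread_local char anchor;             // its address identifies the creating thread
    std::size_t idx =
        (reinterpret_cast<std::uintptr_t>(&anchor) >> 6) & (NUM_DEBUG_SHARDS - 1);
    shard = &g_debug_shards[pool][idx];
    std::lock_guard<std::mutex> l(shard->lock);  // ctor cost: one sharded lock
    self = shard->members.insert(shard->members.end(), this);
  }
  virtual ~Pool_Member() {
    std::lock_guard<std::mutex> l(shard->lock);  // dtor cost, often paid in "trim"
    shard->members.erase(self);
  }
public:
  // what the debug reports ask each container (requirement 6)
  virtual std::size_t item_count() const = 0;
  virtual std::size_t item_bytes() const = 0;
  virtual std::string type_signature() const = 0;
};

// expensive, debug-only: visit every live container in a pool
template <typename Fn>
void visit_pool_containers(mpool_index pool, Fn fn) {
  for (auto& s : g_debug_shards[pool]) {
    std::lock_guard<std::mutex> l(s.lock);
    for (Pool_Member* m : s.members)
      fn(*m);
  }
}

If the per-ctor/dtor cost turns out to matter, the whole class can compile down
to an empty base under a build flag, which fits the point about being able to
compile it out in the reply just below.
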
> >
> > Yeah, this sounds pretty nice.  But I think it's important to be able
> > to compile it out.  I think we will have a lot of creations/destructions.
> > For example, in BlueStore, Onodes have maps of Extents, those map to
> > Blobs, and those have a BufferSpace with a map of Buffers for cached
> > data.  I expect that blobs and even onodes will be coming in and out
> > of cache a lot.
> >
> >> This kind of information usually allows you to quickly figure out
> >> where the memory is being consumed. A bit of script wizardry would
> >> recognize that some containers contain other containers. For example,
> >> no reason a simple Python script couldn't recognize that each oNode
> >> might have a bunch of vector<pextents> within it and tell you things
> >> like the average number of pextents per oNode, or the average DRAM
> >> consumption per oNode (which is a pretty complicated combination of
> >> pextents, lextents, buffer::ptr, etc.).
> >>
> >> Comments: ?????
> >
> > It would be nice to build this on top of existing allocator libraries
> > if we can.  For example, something in boost.  I took a quick peek the
> > other day and didn't find something that allowed simple interrogation
> > about utilization, though, which was surprising.  It would be nice to
> > have something useful (perhaps without #6) that could be done
> > relatively quickly and address all of the other requirements.
> 
> If we want to do this in a lightweight way, I recommend referring to the
> seastar implementation (https://github.com/scylladb/seastar/blob/master/core/memory.hh,
> https://github.com/scylladb/seastar/blob/master/core/slab.hh). It can
> give a lot of insight.
> 
> >
> > sage