LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Roman Gushchin <guro@fb.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>, Linux MM <linux-mm@kvack.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Kernel Team <kernel-team@fb.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API
Date: Fri, 19 Jun 2020 14:38:10 -0700
Message-ID: <20200619213810.GA237539@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <CALvZod4hq5moKcb6f5L6VAS+v5+jKf8Hyd0gLotD7bMK7FsKgg@mail.gmail.com>

On Fri, Jun 19, 2020 at 08:42:34AM -0700, Shakeel Butt wrote:
> On Mon, Jun 8, 2020 at 4:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Obj_cgroup API provides an ability to account sub-page sized kernel
> > objects, which potentially outlive the original memory cgroup.
> >
> > The top-level API consists of the following functions:
> >   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
> >   void obj_cgroup_get(struct obj_cgroup *objcg);
> >   void obj_cgroup_put(struct obj_cgroup *objcg);
> >
> >   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
> >   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> >
> >   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
> >   struct obj_cgroup *get_obj_cgroup_from_current(void);
> >
> > Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> > reference counter. It substitutes a memory cgroup in places where
> > it's necessary to charge a custom amount of bytes instead of pages.
> >
> > All charged memory rounded down to pages is charged to the
> > corresponding memory cgroup using __memcg_kmem_charge().
> >
> > It implements reparenting: on memcg offlining it's getting reattached
> > to the parent memory cgroup. Each online memory cgroup has an
> > associated active object cgroup to handle new allocations and the list
> > of all attached object cgroups. On offlining of a cgroup this list is
> > reparented and for each object cgroup in the list the memcg pointer is
> > swapped to the parent memory cgroup. It prevents long-living objects
> > from pinning the original memory cgroup in the memory.
> >
> > The implementation is based on byte-sized per-cpu stocks. A sub-page
> > sized leftover is stored in an atomic field, which is a part of
> > obj_cgroup object. So on cgroup offlining the leftover is automatically
> > reparented.
> >
> > memcg->objcg is rcu protected.
> > objcg->memcg is a raw pointer, which is always pointing at a memory
> > cgroup, but can be atomically swapped to the parent memory cgroup. So
> > the caller
> 
> What type of caller? The allocator?

Basically whoever uses the pointer. Is it better to s/caller/user?

> 
> > must ensure the lifetime of the cgroup, e.g. grab
> > rcu_read_lock or css_set_lock.
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  include/linux/memcontrol.h |  51 +++++++
> >  mm/memcontrol.c            | 288 ++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 338 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 93dbc7f9d8b8..c69e66fe4f12 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -23,6 +23,7 @@
> >  #include <linux/page-flags.h>
> >
> >  struct mem_cgroup;
> > +struct obj_cgroup;
> >  struct page;
> >  struct mm_struct;
> >  struct kmem_cache;
> > @@ -192,6 +193,22 @@ struct memcg_cgwb_frn {
> >         struct wb_completion done;      /* tracks in-flight foreign writebacks */
> >  };
> >
> > +/*
> > + * Bucket for arbitrarily byte-sized objects charged to a memory
> > + * cgroup. The bucket can be reparented in one piece when the cgroup
> > + * is destroyed, without having to round up the individual references
> > + * of all live memory objects in the wild.
> > + */
> > +struct obj_cgroup {
> > +       struct percpu_ref refcnt;
> > +       struct mem_cgroup *memcg;
> > +       atomic_t nr_charged_bytes;
> 
> So, we still charge the mem page counter in pages but keep the
> remaining sub-page slack charge in nr_charge_bytes, right?

Kind of. The remainder is usually kept in a per-cpu stock,
but if the stock has to be flushed, it's getting flushed to nr_charge_bytes.

> 
> > +       union {
> > +               struct list_head list;
> > +               struct rcu_head rcu;
> > +       };
> > +};
> > +
> >  /*
> >   * The memory controller data structure. The memory controller controls both
> >   * page cache and RSS per cgroup. We would eventually like to provide
> > @@ -301,6 +318,8 @@ struct mem_cgroup {
> >         int kmemcg_id;
> >         enum memcg_kmem_state kmem_state;
> >         struct list_head kmem_caches;
> > +       struct obj_cgroup __rcu *objcg;
> > +       struct list_head objcg_list;
> >  #endif
> >
> [snip]
> > +
> > +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > +                                 struct mem_cgroup *parent)
> > +{
> > +       struct obj_cgroup *objcg, *iter;
> > +
> > +       objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> > +
> > +       spin_lock_irq(&css_set_lock);
> > +
> > +       /* Move active objcg to the parent's list */
> > +       xchg(&objcg->memcg, parent);
> > +       css_get(&parent->css);
> > +       list_add(&objcg->list, &parent->objcg_list);
> 
> So, memcg->objcs_list will always only contain the offlined
> descendants objcgs. I would recommend to rename objcg_list to clearly
> show that. Maybe offlined_objcg_list or descendants_objcg_list or
> something else.

Right. Let me add a comment for now and think of a better name.

> 
> > +
> > +       /* Move already reparented objcgs to the parent's list */
> > +       list_for_each_entry(iter, &memcg->objcg_list, list) {
> > +               css_get(&parent->css);
> > +               xchg(&iter->memcg, parent);
> > +               css_put(&memcg->css);
> > +       }
> > +       list_splice(&memcg->objcg_list, &parent->objcg_list);
> > +
> > +       spin_unlock_irq(&css_set_lock);
> > +
> > +       percpu_ref_kill(&objcg->refcnt);
> > +}
> > +
> >  /*
> [snip]
> >
> > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > +{
> > +       struct obj_cgroup *objcg = NULL;
> > +       struct mem_cgroup *memcg;
> > +
> > +       if (unlikely(!current->mm))
> > +               return NULL;
> 
> I have not seen the users of this function yet but shouldn't the above
> check be (!current->mm && !current->active_memcg)?

Yes, good catch, it might save a couple of cycles if
current->mm == current->active_memcg == NULL. Adding.

> 
> Do we need a mem_cgroup_disabled() check as well?

As now both call sides are guarded by memcg_kmem_enabled(),
so we don't need it.

But maybe it's a good target for some refactorings,
e.g. moving !current->mm and !current->active_memcg checks out
of memcg_kmem_bypass(). And _maybe_ it's better to move memcg_kmem_enabled()
here, but I'm not sure.

> 
> > +
> > +       rcu_read_lock();
> > +       if (unlikely(current->active_memcg))
> > +               memcg = rcu_dereference(current->active_memcg);
> > +       else
> > +               memcg = mem_cgroup_from_task(current);
> > +
> > +       for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +               objcg = rcu_dereference(memcg->objcg);
> > +               if (objcg && obj_cgroup_tryget(objcg))
> > +                       break;
> > +       }
> > +       rcu_read_unlock();
> > +
> > +       return objcg;
> > +}
> > +
> [...]
> > +
> > +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> > +{
> > +       struct memcg_stock_pcp *stock;
> > +       unsigned long flags;
> > +
> > +       local_irq_save(flags);
> > +
> > +       stock = this_cpu_ptr(&memcg_stock);
> > +       if (stock->cached_objcg != objcg) { /* reset if necessary */
> > +               drain_obj_stock(stock);
> > +               obj_cgroup_get(objcg);
> > +               stock->cached_objcg = objcg;
> > +               stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> > +       }
> > +       stock->nr_bytes += nr_bytes;
> > +
> > +       if (stock->nr_bytes > PAGE_SIZE)
> > +               drain_obj_stock(stock);
> 
> The normal stock can go to 32*nr_cpus*PAGE_SIZE. I am wondering if
> just PAGE_SIZE is too less for obj stock.

It works on top of the current stock of 32 pages, so it can grab these
32 pages without any atomic operations. And it should be easy to increase
this limit if we'll see any benefits.

Thank you for looking into the patchset!

Andrew, can you, please, squash the following fix based on Shakeel's suggestions?
Thanks!

--

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ed3af71a6fb..2499f78cf32d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -326,7 +326,7 @@ struct mem_cgroup {
        int kmemcg_id;
        enum memcg_kmem_state kmem_state;
        struct obj_cgroup __rcu *objcg;
-       struct list_head objcg_list;
+       struct list_head objcg_list; /* list of inherited objcgs */
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 70cd44b28db1..9f14b91700d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2843,7 +2843,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
        struct obj_cgroup *objcg = NULL;
        struct mem_cgroup *memcg;
 
-       if (unlikely(!current->mm))
+       if (unlikely(!current->mm && !current->active_memcg))
                return NULL;
 
        rcu_read_lock();

  reply index

Thread overview: 92+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-08 23:06 [PATCH v6 00/19] The new cgroup slab memory controller Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 01/19] mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state() Roman Gushchin
2020-06-17  1:52   ` Shakeel Butt
2020-06-17  2:50     ` Roman Gushchin
2020-06-17  2:59       ` Shakeel Butt
2020-06-17  3:19         ` Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 02/19] mm: memcg: prepare for byte-sized vmstat items Roman Gushchin
2020-06-17  2:57   ` Shakeel Butt
2020-06-17  3:19     ` Roman Gushchin
2020-06-17 15:55   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 03/19] mm: memcg: convert vmstat slab counters to bytes Roman Gushchin
2020-06-17  3:03   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 04/19] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2020-06-17  3:08   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 05/19] mm: memcontrol: decouple reference counting from page accounting Roman Gushchin
2020-06-18  0:47   ` Shakeel Butt
2020-06-18 14:55   ` Shakeel Butt
2020-06-18 19:51     ` Roman Gushchin
2020-06-19  1:08     ` Roman Gushchin
2020-06-19  1:18       ` Shakeel Butt
2020-06-19  1:31   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 06/19] mm: memcg/slab: obj_cgroup API Roman Gushchin
2020-06-19 15:42   ` Shakeel Butt
2020-06-19 21:38     ` Roman Gushchin [this message]
2020-06-19 22:16       ` Shakeel Butt
2020-06-19 22:52         ` Roman Gushchin
2020-06-20 22:50       ` Andrew Morton
2020-06-08 23:06 ` [PATCH v6 07/19] mm: memcg/slab: allocate obj_cgroups for non-root slab pages Roman Gushchin
2020-06-19 16:36   ` Shakeel Butt
2020-06-20  0:25     ` Roman Gushchin
2020-06-20  0:31       ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 08/19] mm: memcg/slab: save obj_cgroup for non-root slab objects Roman Gushchin
2020-06-20  0:16   ` Shakeel Butt
2020-06-20  1:19     ` Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 09/19] mm: memcg/slab: charge individual slab objects instead of pages Roman Gushchin
2020-06-20  0:54   ` Shakeel Butt
2020-06-20  1:29     ` Roman Gushchin
2020-06-08 23:06 ` [PATCH v6 10/19] mm: memcg/slab: deprecate memory.kmem.slabinfo Roman Gushchin
2020-06-22 17:12   ` Shakeel Butt
2020-06-22 18:01     ` Roman Gushchin
2020-06-22 18:09       ` Shakeel Butt
2020-06-22 18:25         ` Roman Gushchin
2020-06-22 18:38           ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 11/19] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2020-06-20  1:19   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 12/19] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations Roman Gushchin
2020-06-22 16:56   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 13/19] mm: memcg/slab: simplify memcg cache creation Roman Gushchin
2020-06-22 17:29   ` Shakeel Butt
2020-06-22 17:40     ` Roman Gushchin
2020-06-22 18:03       ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 14/19] mm: memcg/slab: remove memcg_kmem_get_cache() Roman Gushchin
2020-06-22 18:42   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 15/19] mm: memcg/slab: deprecate slab_root_caches Roman Gushchin
2020-06-22 17:36   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 16/19] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo() Roman Gushchin
2020-06-22 17:32   ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 17/19] mm: memcg/slab: use a single set of kmem_caches for all allocations Roman Gushchin
2020-06-17 23:35   ` Andrew Morton
2020-06-18  0:35     ` Roman Gushchin
2020-06-18  7:33       ` Vlastimil Babka
2020-06-18 19:54         ` Roman Gushchin
2020-06-22 19:21   ` Shakeel Butt
2020-06-22 20:37     ` Roman Gushchin
2020-06-22 21:04       ` Shakeel Butt
2020-06-22 21:13         ` Roman Gushchin
2020-06-22 21:28           ` Shakeel Butt
2020-06-22 21:58             ` Roman Gushchin
2020-06-22 22:05               ` Shakeel Butt
2020-06-08 23:06 ` [PATCH v6 18/19] kselftests: cgroup: add kernel memory accounting tests Roman Gushchin
2020-06-17  1:46 ` [PATCH v6 00/19] The new cgroup slab memory controller Shakeel Butt
2020-06-17  2:41   ` Roman Gushchin
2020-06-17  3:05     ` Shakeel Butt
2020-06-17  3:32       ` Roman Gushchin
2020-06-17 11:24         ` Vlastimil Babka
2020-06-17 14:31           ` Mel Gorman
2020-06-20  0:57             ` Roman Gushchin
2020-06-18  1:29           ` Roman Gushchin
2020-06-18  8:43             ` Jesper Dangaard Brouer
2020-06-18  9:31               ` Jesper Dangaard Brouer
2020-06-19  1:30                 ` Roman Gushchin
2020-06-19  8:32                   ` Jesper Dangaard Brouer
2020-06-19  1:27               ` Roman Gushchin
2020-06-19  9:39                 ` Jesper Dangaard Brouer
2020-06-19 18:47                   ` Roman Gushchin
2020-06-18  1:18   ` Roman Gushchin
2020-06-18  9:27 ` Mike Rapoport
2020-06-18 20:43   ` Roman Gushchin
2020-06-21 22:57 ` Qian Cai
2020-06-21 23:34   ` Roman Gushchin
2020-06-21 23:53     ` Qian Cai
2020-06-22  3:07       ` Roman Gushchin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200619213810.GA237539@carbon.dhcp.thefacebook.com \
    --to=guro@fb.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=shakeelb@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git