From: Roman Gushchin <guro@fb.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>, <linux-mm@kvack.org>,
	<kernel-team@fb.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API
Date: Thu, 7 May 2020 15:26:31 -0700	[thread overview]
Message-ID: <20200507222631.GA81857@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <20200507210314.GD161043@cmpxchg.org>

On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> >  }
> >  
> >  #ifdef CONFIG_MEMCG_KMEM
> > +extern spinlock_t css_set_lock;
> > +
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> > +{
> > +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > +	struct mem_cgroup *memcg;
> > +	unsigned int nr_bytes;
> > +	unsigned int nr_pages;
> > +	unsigned long flags;
> > +
> > +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > +	nr_pages = nr_bytes >> PAGE_SHIFT;
> 
> What guarantees that we don't have a partial page in there at this
> point? I guess any outstanding allocations would pin the objcg, so
> when it's released all objects have been freed.

Right, this is exactly the reason why there can't be a partial page
at this point.

> 
> But if that's true, how can we have full pages remaining in there now?

Imagine the following sequence:
1) CPU0: objcg == stock->cached_objcg
2) CPU1: we do a small allocation (e.g. 92 bytes), the page is charged
3) CPU1: a process from another memcg allocates something, the stock is flushed,
   objcg->nr_charged_bytes = PAGE_SIZE - 92
4) CPU0: we release this object, 92 bytes are added to stock->nr_bytes
5) CPU0: the stock is flushed, 92 bytes are added to objcg->nr_charged_bytes

As a result, nr_charged_bytes == PAGE_SIZE. This page will be uncharged
in obj_cgroup_release().

I've double-checked this; it's actually pretty easy to trigger in real life.

> 
> > @@ -2723,6 +2820,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >  	return page->mem_cgroup;
> >  }
> >  
> > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > +{
> > +	struct obj_cgroup *objcg = NULL;
> > +	struct mem_cgroup *memcg;
> > +
> > +	if (unlikely(!current->mm))
> > +		return NULL;
> > +
> > +	rcu_read_lock();
> > +	if (unlikely(current->active_memcg))
> > +		memcg = rcu_dereference(current->active_memcg);
> > +	else
> > +		memcg = mem_cgroup_from_task(current);
> > +
> > +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +		objcg = rcu_dereference(memcg->objcg);
> > +		if (objcg && obj_cgroup_tryget(objcg))
> > +			break;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return objcg;
> > +}
> 
> Thanks for moving this here from one of the later patches, it helps
> understanding the life cycle of obj_cgroup better.
> 
> > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > +{
> > +	struct mem_cgroup *memcg;
> > +	unsigned int nr_pages, nr_bytes;
> > +	int ret;
> > +
> > +	if (consume_obj_stock(objcg, size))
> > +		return 0;
> > +
> > +	rcu_read_lock();
> > +	memcg = obj_cgroup_memcg(objcg);
> > +	css_get(&memcg->css);
> > +	rcu_read_unlock();
> > +
> > +	nr_pages = size >> PAGE_SHIFT;
> > +	nr_bytes = size & (PAGE_SIZE - 1);
> > +
> > +	if (nr_bytes)
> > +		nr_pages += 1;
> > +
> > +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> 
> If consume_obj_stock() fails because some other memcg is cached,
> should this try to consume the partial page in objcg->nr_charged_bytes
> before getting more pages?

We can definitely do it, but I'm not sure it's good for performance.

Dealing with nr_charged_bytes will require up to two atomic writes,
so calling __memcg_kmem_charge() can be faster if the memcg is cached
on the percpu stock.

> 
> > +	if (!ret && nr_bytes)
> > +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> 
> This will put the cgroup into the cache if the allocation resulted in
> a partially free page.
> 
> But if this was a page allocation, we may have objcg->nr_cache_bytes
> from a previous subpage allocation that we should probably put back
> into the stock.

Yeah, we can do this, but I don't know if there will be any benefit.

Actually, we don't want to touch objcg->nr_cache_bytes too often, as
it can become a contention point if there are many threads allocating
in the memory cgroup.

So maybe we want to do the opposite: relax it a bit and stop flushing
it on every stock refill, flushing only when it exceeds a certain value.

> 
> It's not a common case, I'm just trying to understand what
> objcg->nr_cache_bytes holds and when it does so.

So it's actually the centralized leftover from rounding the actual
charge up to the page size.

> 
> The rest of this patch looks good to me!

Great!

Thank you very much for the review!

