Re: [PATCH v1 1/4] mm: memcontrol: use helpers to access page's memcg data

From: Roman Gushchin <guro@fb.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeelb@google.com>,
	Michal Hocko <mhocko@kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>, <kernel-team@fb.com>
Subject: Re: [PATCH v1 1/4] mm: memcontrol: use helpers to access page's memcg data
Date: Thu, 24 Sep 2020 13:27:00 -0700
Message-ID: <20200924202700.GB1899519@carbon.dhcp.thefacebook.com>
In-Reply-To: <20200924194508.GA329853@cmpxchg.org>

On Thu, Sep 24, 2020 at 03:45:08PM -0400, Johannes Weiner wrote:
> On Tue, Sep 22, 2020 at 01:36:57PM -0700, Roman Gushchin wrote:
> > Currently there are many open-coded reads and writes of the
> > page->mem_cgroup pointer, as well as a couple of read helpers,
> > which are barely used.
> > 
> > This creates an obstacle to reusing some bits of the pointer
> > for storing additional bits of information. In fact, we already do
> > this for slab pages, where the last bit indicates that a pointer has
> > an attached vector of objcg pointers instead of a regular memcg
> > pointer.
> > 
> > This commit introduces 4 new helper functions and converts all
> > raw accesses to page->mem_cgroup to calls of these helpers:
> >   struct mem_cgroup *page_mem_cgroup(struct page *page);
> >   struct mem_cgroup *page_mem_cgroup_check(struct page *page);
> >   void set_page_mem_cgroup(struct page *page, struct mem_cgroup *memcg);
> >   void clear_page_mem_cgroup(struct page *page);
> 
> Sounds reasonable to me!
> 
> > page_mem_cgroup_check() is intended to be used in cases when the page
> > can be a slab page and have a memcg pointer pointing at objcg vector.
> > It does check the lowest bit, and if set, returns NULL.
> > page_mem_cgroup() contains a VM_BUG_ON_PAGE() check for the page not
> > being a slab page. So do set_page_mem_cgroup() and clear_page_mem_cgroup().
> > 
> > To make sure nobody uses a direct access, struct page's
> > mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
> > Only new helpers and a couple of slab-accounting related functions
> > access this field directly.
> > 
> > page_memcg() and page_memcg_rcu() helpers defined in mm.h are removed.
> > New page_mem_cgroup() is a direct analog of page_memcg(), while
> > page_memcg_rcu() has a single call site in a small rcu-read-lock
> > section, so it's just not worth it to have a separate helper. So
> > it's replaced with page_mem_cgroup() too.
> 
> page_memcg_rcu() does READ_ONCE(). We need to keep that for lockless
> accesses.

Ok, how about page_memcg() and page_objcgs() which always do READ_ONCE()?
Because page_memcg_rcu() has only a single call site, I would prefer to
have one helper instead of two.
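
A rough userspace sketch of that idea, not the actual patch. The
READ_ONCE_UL() macro is a stand-in for the kernel's READ_ONCE(), the
MEMCG_DATA_OBJCGS name for the low tag bit is hypothetical, and the
struct definitions are minimal models. The point is that each helper
loads memcg_data exactly once and then works only on the local copy,
so a tagged objcg pointer can never leak out as a memcg pointer:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's READ_ONCE(): force a single load. */
#define READ_ONCE_UL(x) (*(const volatile uintptr_t *)&(x))

/* Hypothetical name for the low bit that tags an objcg vector. */
#define MEMCG_DATA_OBJCGS ((uintptr_t)0x1)

/* Minimal models of the kernel structures, for illustration only. */
struct mem_cgroup { int id; };
struct obj_cgroup { int id; };
struct page { uintptr_t memcg_data; };

/* Returns the page's memcg, or NULL if the field is empty or tagged. */
static struct mem_cgroup *page_memcg(struct page *page)
{
	uintptr_t memcg_data = READ_ONCE_UL(page->memcg_data);

	if (memcg_data & MEMCG_DATA_OBJCGS)
		return NULL;

	return (struct mem_cgroup *)memcg_data;
}

/* Returns the page's objcg vector, or NULL if the field is not tagged. */
static struct obj_cgroup **page_objcgs(struct page *page)
{
	uintptr_t memcg_data = READ_ONCE_UL(page->memcg_data);

	if (!(memcg_data & MEMCG_DATA_OBJCGS))
		return NULL;

	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_OBJCGS);
}
```

Whether page_memcg() should return NULL for a tagged value or keep a
VM_BUG_ON_PAGE() instead is an open choice; the sketch assumes the
former.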

> 
> > @@ -343,6 +343,72 @@ struct mem_cgroup {
> >  
> >  extern struct mem_cgroup *root_mem_cgroup;
> >  
> > +/*
> > + * page_mem_cgroup - get the memory cgroup associated with a page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the memory cgroup associated with the page,
> > + * or NULL. This function assumes that the page is known to have a
> > + * proper memory cgroup pointer. It's not safe to call this function
> > + * on some types of pages, e.g. slab pages or ex-slab pages.
> > + */
> > +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	return (struct mem_cgroup *)page->memcg_data;
> > +}
> 
> This would also be a good place to mention what's required for the
> function to be called safely, or in a way that produces a stable
> result - i.e. the list of conditions in commit_charge().

Makes sense.

> 
> > + * page_mem_cgroup_check - get the memory cgroup associated with a page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the memory cgroup associated with the page,
> > + * or NULL. This function, unlike page_mem_cgroup(), can take any page
> > + * as an argument. It has to be used in cases when it's not known if a page
> > + * has an associated memory cgroup pointer or an object cgroups vector.
> > + */
> > +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> > +{
> > +	unsigned long memcg_data = page->memcg_data;
> > +
> > +	/*
> > +	 * The lowest bit set means that memcg isn't a valid
> > +	 * memcg pointer, but an obj_cgroups pointer.
> > +	 * In this case the page is shared and doesn't belong
> > +	 * to any specific memory cgroup.
> > +	 */
> > +	if (memcg_data & 0x1UL)
> > +		return NULL;
> > +
> > +	return (struct mem_cgroup *)memcg_data;
> > +}
> 
> Here as well.
> 
> > +
> > +/*
> > + * set_page_mem_cgroup - associate a page with a memory cgroup
> > + * @page: a pointer to the page struct
> > + * @memcg: a pointer to the memory cgroup
> > + *
> > + * Associates a page with a memory cgroup.
> > + */
> > +static inline void set_page_mem_cgroup(struct page *page,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	page->memcg_data = (unsigned long)memcg;
> > +}
> > +
> > +/*
> > + * clear_page_mem_cgroup - clear an association of a page with a memory cgroup
> > + * @page: a pointer to the page struct
> > + *
> > + * Clears an association of a page with a memory cgroup.
> > + */
> > +static inline void clear_page_mem_cgroup(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	page->memcg_data = 0;
> > +}
> > +
> >  static __always_inline bool memcg_stat_item_in_bytes(int idx)
> >  {
> >  	if (idx == MEMCG_PERCPU_B)
> > @@ -743,15 +809,15 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
> >  static inline void __mod_memcg_page_state(struct page *page,
> >  					  int idx, int val)
> >  {
> > -	if (page->mem_cgroup)
> > -		__mod_memcg_state(page->mem_cgroup, idx, val);
> > +	if (page_mem_cgroup(page))
> > +		__mod_memcg_state(page_mem_cgroup(page), idx, val);
> >  }
> >  
> >  static inline void mod_memcg_page_state(struct page *page,
> >  					int idx, int val)
> >  {
> > -	if (page->mem_cgroup)
> > -		mod_memcg_state(page->mem_cgroup, idx, val);
> > +	if (page_mem_cgroup(page))
> > +		mod_memcg_state(page_mem_cgroup(page), idx, val);
> >  }
> >  
> >  static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
> > @@ -838,12 +904,12 @@ static inline void __mod_lruvec_page_state(struct page *page,
> >  	struct lruvec *lruvec;
> >  
> >  	/* Untracked pages have no memcg, no lruvec. Update only the node */
> > -	if (!head->mem_cgroup) {
> > +	if (!page_mem_cgroup(head)) {
> >  		__mod_node_page_state(pgdat, idx, val);
> >  		return;
> >  	}
> >  
> > -	lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
> > +	lruvec = mem_cgroup_lruvec(page_mem_cgroup(head), pgdat);
> >  	__mod_lruvec_state(lruvec, idx, val);
> 
> The repetition of the function call is a bit jarring, especially in
> configs with VM_BUG_ON() enabled (some distros use it for their beta
> release kernels, so it's not just kernel developer test machines that
> pay this cost). Can you please use a local variable when the function
> needs the memcg more than once?

Sure.
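
For reference, the pattern would look roughly like this. It is a
userspace sketch with minimal stand-in types; the lookups counter is
hypothetical and only models the per-call cost of the VM_BUG_ON_PAGE()
check, and struct page here holds a plain pointer for simplicity:

```c
#include <stddef.h>

/* Minimal models of the kernel structures, for illustration only. */
struct mem_cgroup { long stat[4]; };
struct page { struct mem_cgroup *memcg; };

/* Hypothetical counter standing in for the cost of each helper call. */
static int lookups;

static struct mem_cgroup *page_mem_cgroup(struct page *page)
{
	lookups++;	/* where VM_BUG_ON_PAGE() would run */
	return page->memcg;
}

static void mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
	memcg->stat[idx] += val;
}

/* Look the memcg up once, then test and use the local copy. */
static void mod_memcg_page_state(struct page *page, int idx, int val)
{
	struct mem_cgroup *memcg = page_mem_cgroup(page);

	if (memcg)
		mod_memcg_state(memcg, idx, val);
}
```

With the local variable the helper runs once per call for both tracked
and untracked pages, instead of twice for tracked ones.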

> 
> > @@ -878,8 +944,8 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
> >  static inline void count_memcg_page_event(struct page *page,
> >  					  enum vm_event_item idx)
> >  {
> > -	if (page->mem_cgroup)
> > -		count_memcg_events(page->mem_cgroup, idx, 1);
> > +	if (page_mem_cgroup(page))
> > +		count_memcg_events(page_mem_cgroup(page), idx, 1);
> >  }
> >  
> >  static inline void count_memcg_event_mm(struct mm_struct *mm,
> > @@ -941,6 +1007,25 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> >  
> >  struct mem_cgroup;
> >  
> > +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline void set_page_mem_cgroup(struct page *page,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +}
> > +
> > +static inline void clear_page_mem_cgroup(struct page *page)
> > +{
> > +}
> > +
> >  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> >  {
> >  	return true;
> > @@ -1430,7 +1515,7 @@ static inline void mem_cgroup_track_foreign_dirty(struct page *page,
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  
> > -	if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
> > +	if (unlikely(&page_mem_cgroup(page)->css != wb->memcg_css))
> >  		mem_cgroup_track_foreign_dirty_slowpath(page, wb);
> >  }
> >  
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 17e712207d74..5e24ff2ffec9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1476,28 +1476,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> >  #endif
> >  }
> >  
> > -#ifdef CONFIG_MEMCG
> > -static inline struct mem_cgroup *page_memcg(struct page *page)
> > -{
> > -	return page->mem_cgroup;
> > -}
> > -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > -{
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > -	return READ_ONCE(page->mem_cgroup);
> > -}
> > -#else
> > -static inline struct mem_cgroup *page_memcg(struct page *page)
> > -{
> > -	return NULL;
> > -}
> > -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > -{
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> > -	return NULL;
> > -}
> > -#endif
> 
> You essentially renamed these existing helpers, but I don't think
> that's justified. Especially with the proliferation of callsites, the
> original names are nicer. I'd prefer we keep them.
> 
> > @@ -560,16 +560,7 @@ ino_t page_cgroup_ino(struct page *page)
> >  	unsigned long ino = 0;
> >  
> >  	rcu_read_lock();
> > -	memcg = page->mem_cgroup;
> > -
> > -	/*
> > -	 * The lowest bit set means that memcg isn't a valid
> > -	 * memcg pointer, but a obj_cgroups pointer.
> > -	 * In this case the page is shared and doesn't belong
> > -	 * to any specific memory cgroup.
> > -	 */
> > -	if ((unsigned long) memcg & 0x1UL)
> > -		memcg = NULL;
> > +	memcg = page_mem_cgroup_check(page);
> 
> This should actually have been using READ_ONCE() all along. Otherwise
> the compiler can issue multiple loads to page->mem_cgroup here and you
> can end up with a pointer with the lowest bit set leaking out.
> 
> > @@ -2928,17 +2918,6 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >  
> >  	page = virt_to_head_page(p);
> >  
> > -	/*
> > -	 * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
> > -	 * or a pointer to obj_cgroup vector. In the latter case the lowest
> > -	 * bit of the pointer is set.
> > -	 * The page->mem_cgroup pointer can be asynchronously changed
> > -	 * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
> > -	 * from a valid memcg pointer to objcg vector or back.
> > -	 */
> > -	if (!page->mem_cgroup)
> > -		return NULL;
> > -
> >  	/*
> >  	 * Slab objects are accounted individually, not per-page.
> >  	 * Memcg membership data for each individual object is saved in
> > @@ -2956,8 +2935,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >  		return NULL;
> >  	}
> >  
> > -	/* All other pages use page->mem_cgroup */
> > -	return page->mem_cgroup;
> > +	/*
> > +	 * page_mem_cgroup_check() is used here, because page_has_obj_cgroups()
> > +	 * check above could fail because the object cgroups vector wasn't set
> > +	 * at that moment, but it can be set concurrently.
> > +	 * page_mem_cgroup_check(page) will guarantee that a proper memory
> > +	 * cgroup pointer or NULL will be returned.
> > +	 */
> > +	return page_mem_cgroup_check(page);
> 
> The code right now doesn't look quite safe. As per above, without the
> READ_ONCE the compiler might issue multiple loads and we may get a
> pointer with the low bit set.
> 
> Maybe slightly off-topic, but what are "all other pages" in general?
> I don't see any callsites that ask for ownership on objects whose
> backing pages may belong to a single memcg. That wouldn't seem to make
> too much sense. Unless I'm missing something, this function should
> probably tighten up its scope a bit and only work on stuff that is
> actually following the obj_cgroup protocol.

Kernel stacks can be backed by slab objects or by generic pages/vmalloc.
Large kmallocs also go through the page allocator, so they don't follow
the objcg protocol.
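
To illustrate the two cases, here is a simplified userspace model, not
the real mem_cgroup_from_obj(). The OBJCGS_BIT name, the slot
parameter, and the struct layouts are assumptions made for the sketch.
Slab-backed objects resolve through the per-object objcg vector, while
page-backed objects (e.g. a large kmalloc) fall back to the page-level
memcg pointer:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the low bit that tags an objcg vector. */
#define OBJCGS_BIT ((uintptr_t)0x1)

/* Minimal models of the kernel structures, for illustration only. */
struct mem_cgroup { int id; };
struct obj_cgroup { struct mem_cgroup *memcg; };
struct page { uintptr_t memcg_data; };

/*
 * slot is the object's index within its slab page; it is ignored for
 * pages that carry a plain memcg pointer.
 */
static struct mem_cgroup *memcg_from_obj(struct page *page, size_t slot)
{
	/* Single load, standing in for READ_ONCE(). */
	uintptr_t data = *(const volatile uintptr_t *)&page->memcg_data;

	if (data & OBJCGS_BIT) {
		/* Slab page: ownership is tracked per object. */
		struct obj_cgroup **vec =
			(struct obj_cgroup **)(data & ~OBJCGS_BIT);
		struct obj_cgroup *objcg = vec[slot];

		return objcg ? objcg->memcg : NULL;
	}

	/* Page-backed object: the whole page belongs to one memcg (or none). */
	return (struct mem_cgroup *)data;
}
```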

Thanks!