* [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages
@ 2021-03-03  5:59 Muchun Song
  2021-03-03  5:59 ` [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page Muchun Song
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Muchun Song @ 2021-03-03  5:59 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, Muchun Song

Since Roman's series "The new cgroup slab memory controller" was
applied, all slab objects have been charged via the new obj_cgroup
APIs. The new APIs introduce a struct obj_cgroup to charge slab
objects; it prevents long-lived objects from pinning the original
memory cgroup in memory. But there are still some corner-case objects
(e.g. allocations larger than an order-1 page on SLUB) which are not
charged via the new APIs. Those objects (including pages allocated
directly from the buddy allocator) are charged as kmem pages, which
still hold a reference to the memory cgroup.
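
For reference, the objcg indirection that avoids this pinning looks
roughly like the following (a sketch; the struct is as it appears in
include/linux/memcontrol.h at the time of this series):

	/*
	 * A slab object points to an obj_cgroup rather than directly
	 * to a mem_cgroup. When a memcg is offlined, objcg->memcg is
	 * atomically switched to the parent memcg, so the dead memcg
	 * can be freed even while the objects live on. A kmem page,
	 * by contrast, still points directly at the mem_cgroup and
	 * therefore pins it.
	 */
	struct obj_cgroup {
		struct percpu_ref refcnt;
		struct mem_cgroup *memcg;	/* reparented on offline */
		atomic_t nr_charged_bytes;
		union {
			struct list_head list;
			struct rcu_head rcu;
		};
	};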

E.g. we know that a kernel stack is charged as kmem pages, because
the size of a kernel stack can be greater than 2 pages (e.g. 16KB on
x86_64 or arm64). Suppose we create a thread (whose stack is charged
to memory cgroup A) and then move it from memory cgroup A to memory
cgroup B. Because the kernel stack of the thread holds a reference to
memory cgroup A, the thread can pin memory cgroup A in memory even
after we remove cgroup A. The following script reproduces this
scenario: after running it, the system has accumulated 500 dying
cgroups. (This is not a real-world issue, just a script to show that
large kmallocs are charged as kmem pages, which can pin the memory
cgroup in memory.)

	#!/bin/bash

	cat /proc/cgroups | grep memory

	cd /sys/fs/cgroup/memory
	echo 1 > memory.move_charge_at_immigrate

	for i in {1..500}
	do
		mkdir kmem_test
		echo $$ > kmem_test/cgroup.procs
		sleep 3600 &
		echo $$ > cgroup.procs
		echo `cat kmem_test/cgroup.procs` > cgroup.procs
		rmdir kmem_test
	done

	cat /proc/cgroups | grep memory

This patchset aims to make those kmem pages drop their reference to
the memory cgroup by using the obj_cgroup APIs. With the series
applied, the number of dying cgroups no longer increases when we run
the above test script.

Patches 1-3 use the obj_cgroup APIs to charge kmem pages.
Patch 4 introduces the remote objcg charging APIs.
Patch 5 uses the remote objcg charging APIs to charge kernel memory.

Changelog in v2:
  1. Fix some typos in the commit log (Thanks Roman).
  2. Do not introduce a page_memcg_kmem helper (Thanks to Johannes and Shakeel).
  3. Reduce the CC list to mm/memcg folks (Thanks to Johannes).
  4. Introduce remote objcg charging APIs instead of converting the "remote
     memcg charging APIs" to "remote objcg charging APIs".

Muchun Song (5):
  mm: memcontrol: introduce obj_cgroup_{un}charge_page
  mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem
    page
  mm: memcontrol: charge kmem pages by using obj_cgroup APIs
  mm: memcontrol: introduce remote objcg charging API
  mm: memcontrol: use remote objcg charging APIs to charge kernel memory

 fs/buffer.c                          |  10 +-
 fs/notify/fanotify/fanotify.c        |   6 +-
 fs/notify/fanotify/fanotify_user.c   |   2 +-
 fs/notify/group.c                    |   3 +-
 fs/notify/inotify/inotify_fsnotify.c |   8 +-
 fs/notify/inotify/inotify_user.c     |   2 +-
 include/linux/bpf.h                  |   2 +-
 include/linux/fsnotify_backend.h     |   2 +-
 include/linux/memcontrol.h           | 114 +++++++++++++---
 include/linux/sched.h                |   4 +
 include/linux/sched/mm.h             |  38 ++++++
 kernel/bpf/syscall.c                 |  35 ++---
 kernel/fork.c                        |   3 +
 mm/memcontrol.c                      | 257 ++++++++++++++++++++++++++---------
 mm/page_alloc.c                      |   4 +-
 15 files changed, 372 insertions(+), 118 deletions(-)

-- 
2.11.0



* [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page
  2021-03-03  5:59 [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages Muchun Song
@ 2021-03-03  5:59 ` Muchun Song
  2021-03-05 18:56   ` Roman Gushchin
  2021-03-03  5:59 ` [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page Muchun Song
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Muchun Song @ 2021-03-03  5:59 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, Muchun Song

We know that the unit of slab object charging is bytes, while the unit
of kmem page charging is PAGE_SIZE. If we want to reuse the obj_cgroup
APIs to charge kmem pages, we have to pass PAGE_SIZE (as the third
parameter) to obj_cgroup_charge(). Because such a size is already a
multiple of PAGE_SIZE, we can skip touching the objcg stock. So
obj_cgroup_{un}charge_page() are introduced to charge in units of
pages.

In a later patch, we can also reuse those two helpers to charge or
uncharge a number of kernel pages to an object cgroup. This patch is
just code movement without any functional changes.
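
A sketch of the intended page-level call pattern (hypothetical
caller, not part of this patch):

	unsigned int nr_pages = 1 << order;

	/* Charge whole pages to @objcg, bypassing the byte stock. */
	if (obj_cgroup_charge_page(objcg, gfp, nr_pages))
		return -ENOMEM;

	/* ... and the matching uncharge when the pages are freed. */
	obj_cgroup_uncharge_page(objcg, nr_pages);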

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 46 +++++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 845eec01ef9d..faae16def127 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3056,6 +3056,34 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
+static inline void obj_cgroup_uncharge_page(struct obj_cgroup *objcg,
+					    unsigned int nr_pages)
+{
+	rcu_read_lock();
+	__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+	rcu_read_unlock();
+}
+
+static int obj_cgroup_charge_page(struct obj_cgroup *objcg, gfp_t gfp,
+				  unsigned int nr_pages)
+{
+	struct mem_cgroup *memcg;
+	int ret;
+
+	rcu_read_lock();
+retry:
+	memcg = obj_cgroup_memcg(objcg);
+	if (unlikely(!css_tryget(&memcg->css)))
+		goto retry;
+	rcu_read_unlock();
+
+	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+
+	css_put(&memcg->css);
+
+	return ret;
+}
+
 /**
  * __memcg_kmem_charge: charge a number of kernel pages to a memcg
  * @memcg: memory cgroup to charge
@@ -3180,11 +3208,8 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
 
-		if (nr_pages) {
-			rcu_read_lock();
-			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
-			rcu_read_unlock();
-		}
+		if (nr_pages)
+			obj_cgroup_uncharge_page(old, nr_pages);
 
 		/*
 		 * The leftover is flushed to the centralized per-memcg value.
@@ -3242,7 +3267,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 {
-	struct mem_cgroup *memcg;
 	unsigned int nr_pages, nr_bytes;
 	int ret;
 
@@ -3259,24 +3283,16 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
 	 * refill_obj_stock(), called from this function or
 	 * independently later.
 	 */
-	rcu_read_lock();
-retry:
-	memcg = obj_cgroup_memcg(objcg);
-	if (unlikely(!css_tryget(&memcg->css)))
-		goto retry;
-	rcu_read_unlock();
-
 	nr_pages = size >> PAGE_SHIFT;
 	nr_bytes = size & (PAGE_SIZE - 1);
 
 	if (nr_bytes)
 		nr_pages += 1;
 
-	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+	ret = obj_cgroup_charge_page(objcg, gfp, nr_pages);
 	if (!ret && nr_bytes)
 		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
 
-	css_put(&memcg->css);
 	return ret;
 }
 
-- 
2.11.0



* [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page
  2021-03-03  5:59 [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages Muchun Song
  2021-03-03  5:59 ` [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page Muchun Song
@ 2021-03-03  5:59 ` Muchun Song
  2021-03-05 19:00   ` Roman Gushchin
  2021-03-03  5:59 ` [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs Muchun Song
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Muchun Song @ 2021-03-03  5:59 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, Muchun Song

We want to reuse the obj_cgroup APIs to charge the kmem pages. If we
do that, we have to store an object cgroup pointer in
page->memcg_data for the kmem pages.

Finally, page->memcg_data can have 3 different meanings.

  1) For the slab pages, page->memcg_data points to an object cgroups
     vector.

  2) For the kmem pages (exclude the slab pages), page->memcg_data
     points to an object cgroup.

  3) For the user pages (e.g. the LRU pages), page->memcg_data points
     to a memory cgroup.
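
A decoding sketch of the three cases (flag names from the existing
enum page_memcg_data_flags; illustrative only, not part of the
patch):

	unsigned long memcg_data = page->memcg_data;

	if (memcg_data & MEMCG_DATA_OBJCGS)	/* case 1: slab page */
		objcgs = (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
	else if (memcg_data & MEMCG_DATA_KMEM)	/* case 2: kmem page */
		objcg = (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
	else					/* case 3: user (e.g. LRU) page */
		memcg = (struct mem_cgroup *)memcg_data;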

Currently we always get the memory cgroup associated with a page via
page_memcg() or page_memcg_rcu(). page_memcg_check() is special: it
has to be used in cases when it's not known whether a page has an
associated memory cgroup pointer or an object cgroups vector. Because
page->memcg_data of a kmem page will no longer point to a memory
cgroup after a later patch, page_memcg() and page_memcg_rcu() cannot
be applied to kmem pages. So in this patch, make page_memcg() and
page_memcg_rcu() no longer apply to the kmem pages. The behavior of
page_memcg_check() is unchanged; it remains applicable to the kmem
pages.

In the end, there are 3 helpers to get the memcg associated with a page.
Usage is as follows.

  1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
     pages).

     - page_memcg()
     - page_memcg_rcu()

  2) Get the memory cgroup associated with a page. It has to be used in
     cases when it's not known if a page has an associated memory cgroup
     pointer or an object cgroups vector. Returns NULL for slab pages or
     uncharged pages. Otherwise, returns memory cgroup for charged pages
     (e.g. the kmem pages, the LRU pages).

     - page_memcg_check()

In some places, we use page_memcg() only to check whether the page is
charged. Introduce a page_memcg_charged() helper to do that.

This is a preparation for reparenting the kmem pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++--------
 mm/memcontrol.c            | 23 +++++++++++++----------
 mm/page_alloc.c            |  4 ++--
 3 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..049b80246cbf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -358,14 +358,26 @@ enum page_memcg_data_flags {
 
 #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
 
+/* Return true for charged page, otherwise false. */
+static inline bool page_memcg_charged(struct page *page)
+{
+	unsigned long memcg_data = page->memcg_data;
+
+	VM_BUG_ON_PAGE(PageSlab(page), page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
+
+	return !!memcg_data;
+}
+
 /*
- * page_memcg - get the memory cgroup associated with a page
+ * page_memcg - get the memory cgroup associated with a non-kmem page
  * @page: a pointer to the page struct
  *
  * Returns a pointer to the memory cgroup associated with the page,
  * or NULL. This function assumes that the page is known to have a
  * proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of pages, e.g. slab pages, kmem pages or ex-slab
+ * pages.
  *
  * Any of the following ensures page and memcg binding stability:
  * - the page lock
@@ -378,27 +390,30 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
 	unsigned long memcg_data = page->memcg_data;
 
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
 
-	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+	return (struct mem_cgroup *)memcg_data;
 }
 
 /*
- * page_memcg_rcu - locklessly get the memory cgroup associated with a page
+ * page_memcg_rcu - locklessly get the memory cgroup associated with a non-kmem page
  * @page: a pointer to the page struct
  *
  * Returns a pointer to the memory cgroup associated with the page,
  * or NULL. This function assumes that the page is known to have a
  * proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of pages, e.g. slab pages, kmem pages or ex-slab
+ * pages.
  */
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
+	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+
 	VM_BUG_ON_PAGE(PageSlab(page), page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
-	return (struct mem_cgroup *)(READ_ONCE(page->memcg_data) &
-				     ~MEMCG_DATA_FLAGS_MASK);
+	return (struct mem_cgroup *)memcg_data;
 }
 
 /*
@@ -1072,6 +1087,11 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 
 struct mem_cgroup;
 
+static inline bool page_memcg_charged(struct page *page)
+{
+	return false;
+}
+
 static inline struct mem_cgroup *page_memcg(struct page *page)
 {
 	return NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index faae16def127..86a8db937ec6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
 			     int val)
 {
 	struct page *head = compound_head(page); /* rmap on tail pages */
-	struct mem_cgroup *memcg = page_memcg(head);
+	struct mem_cgroup *memcg;
 	pg_data_t *pgdat = page_pgdat(page);
 	struct lruvec *lruvec;
 
+	memcg = page_memcg_check(head);
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg) {
 		__mod_node_page_state(pgdat, idx, val);
@@ -3166,12 +3167,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
  */
 void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
+	struct mem_cgroup *memcg;
 	unsigned int nr_pages = 1 << order;
 
-	if (!memcg)
+	if (!page_memcg_charged(page))
 		return;
 
+	memcg = page_memcg_check(page);
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->memcg_data = 0;
@@ -6827,24 +6829,25 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 {
 	unsigned long nr_pages;
+	struct mem_cgroup *memcg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (!page_memcg(page))
+	if (!page_memcg_charged(page))
 		return;
 
 	/*
 	 * Nobody should be changing or seriously looking at
-	 * page_memcg(page) at this point, we have fully
-	 * exclusive access to the page.
+	 * page memcg at this point, we have fully exclusive
+	 * access to the page.
 	 */
-
-	if (ug->memcg != page_memcg(page)) {
+	memcg = page_memcg_check(page);
+	if (ug->memcg != memcg) {
 		if (ug->memcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
 		}
-		ug->memcg = page_memcg(page);
+		ug->memcg = memcg;
 
 		/* pairs with css_put in uncharge_batch */
 		css_get(&ug->memcg->css);
@@ -6877,7 +6880,7 @@ void mem_cgroup_uncharge(struct page *page)
 		return;
 
 	/* Don't touch page->lru of any random page, pre-check: */
-	if (!page_memcg(page))
+	if (!page_memcg_charged(page))
 		return;
 
 	uncharge_gather_clear(&ug);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f10966e3b4a5..bcb58ae15e24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1124,7 +1124,7 @@ static inline bool page_expected_state(struct page *page,
 	if (unlikely((unsigned long)page->mapping |
 			page_ref_count(page) |
 #ifdef CONFIG_MEMCG
-			(unsigned long)page_memcg(page) |
+			page_memcg_charged(page) |
 #endif
 			(page->flags & check_flags)))
 		return false;
@@ -1149,7 +1149,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
 			bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
 	}
 #ifdef CONFIG_MEMCG
-	if (unlikely(page_memcg(page)))
+	if (unlikely(page_memcg_charged(page)))
 		bad_reason = "page still charged to cgroup";
 #endif
 	return bad_reason;
-- 
2.11.0



* [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs
  2021-03-03  5:59 [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages Muchun Song
  2021-03-03  5:59 ` [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page Muchun Song
  2021-03-03  5:59 ` [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page Muchun Song
@ 2021-03-03  5:59 ` Muchun Song
  2021-03-05 19:40   ` Roman Gushchin
  2021-03-03  5:59 ` [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API Muchun Song
  2021-03-03  5:59 ` [PATCH v2 5/5] mm: memcontrol: use remote objcg charging APIs to charge kernel memory Muchun Song
  4 siblings, 1 reply; 13+ messages in thread
From: Muchun Song @ 2021-03-03  5:59 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, Muchun Song

Since Roman's series "The new cgroup slab memory controller" was
applied, all slab objects have been charged via the new obj_cgroup
APIs. The new APIs introduce a struct obj_cgroup to charge slab
objects; it prevents long-lived objects from pinning the original
memory cgroup in memory. But there are still some corner-case objects
(e.g. allocations larger than an order-1 page on SLUB) which are not
charged via the new APIs. Those objects (including pages allocated
directly from the buddy allocator) are charged as kmem pages, which
still hold a reference to the memory cgroup.

This patch aims to charge the kmem pages by using the new obj_cgroup
APIs. In the end, page->memcg_data of a kmem page points to an object
cgroup. We can use page_objcg() to get the object cgroup associated
with a kmem page, or page_memcg_check() to get the memory cgroup
associated with a kmem page, but then the caller must ensure that the
returned memcg won't be released (e.g. by acquiring rcu_read_lock or
css_set_lock).
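
A usage sketch of the two lookup paths for a charged kmem page
(illustrative only):

	/*
	 * page_objcg(): the objcg is stable as long as the page holds
	 * its reference on it.
	 */
	struct obj_cgroup *objcg = page_objcg(page);

	/*
	 * page_memcg_check(): the returned memcg is only stable while
	 * the RCU read lock is held.
	 */
	rcu_read_lock();
	memcg = page_memcg_check(page);
	/* ... use memcg here ... */
	rcu_read_unlock();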

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memcontrol.h |  63 +++++++++++++++++------
 mm/memcontrol.c            | 123 +++++++++++++++++++++++++++++++--------------
 2 files changed, 133 insertions(+), 53 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 049b80246cbf..5911b9d107b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -370,6 +370,18 @@ static inline bool page_memcg_charged(struct page *page)
 }
 
 /*
+ * After the initialization objcg->memcg is always pointing at
+ * a valid memcg, but can be atomically swapped to the parent memcg.
+ *
+ * The caller must ensure that the returned memcg won't be released:
+ * e.g. acquire the rcu_read_lock or css_set_lock.
+ */
+static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
+{
+	return READ_ONCE(objcg->memcg);
+}
+
+/*
  * page_memcg - get the memory cgroup associated with a non-kmem page
  * @page: a pointer to the page struct
  *
@@ -421,9 +433,10 @@ static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
  * @page: a pointer to the page struct
  *
  * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function unlike page_memcg() can take any  page
+ * or NULL. This function unlike page_memcg() can take any non-kmem page
  * as an argument. It has to be used in cases when it's not known if a page
- * has an associated memory cgroup pointer or an object cgroups vector.
+ * has an associated memory cgroup pointer or an object cgroups vector or
+ * an object cgroup.
  *
  * Any of the following ensures page and memcg binding stability:
  * - the page lock
@@ -442,6 +455,17 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
 	if (memcg_data & MEMCG_DATA_OBJCGS)
 		return NULL;
 
+	if (memcg_data & MEMCG_DATA_KMEM) {
+		struct obj_cgroup *objcg;
+
+		/*
+		 * The caller must ensure that the returned memcg won't be
+		 * released: e.g. acquire the rcu_read_lock or css_set_lock.
+		 */
+		objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+		return obj_cgroup_memcg(objcg);
+	}
+
 	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
@@ -500,6 +524,24 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
 	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
+/*
+ * page_objcg - get the object cgroup associated with a kmem page
+ * @page: a pointer to the page struct
+ *
+ * Returns a pointer to the object cgroup associated with the kmem page,
+ * or NULL. This function assumes that the page is known to have an
+ * associated object cgroup. It's only safe to call this function
+ * against kmem pages (PageMemcgKmem() returns true).
+ */
+static inline struct obj_cgroup *page_objcg(struct page *page)
+{
+	unsigned long memcg_data = page->memcg_data;
+
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
+	VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
+
+	return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+}
 #else
 static inline struct obj_cgroup **page_objcgs(struct page *page)
 {
@@ -510,6 +552,11 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
 {
 	return NULL;
 }
+
+static inline struct obj_cgroup *page_objcg(struct page *page)
+{
+	return NULL;
+}
 #endif
 
 static __always_inline bool memcg_stat_item_in_bytes(int idx)
@@ -728,18 +775,6 @@ static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 	percpu_ref_put(&objcg->refcnt);
 }
 
-/*
- * After the initialization objcg->memcg is always pointing at
- * a valid memcg, but can be atomically swapped to the parent memcg.
- *
- * The caller must ensure that the returned memcg won't be released:
- * e.g. acquire the rcu_read_lock or css_set_lock.
- */
-static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
-{
-	return READ_ONCE(objcg->memcg);
-}
-
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86a8db937ec6..0cf342d22547 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -856,10 +856,16 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
 {
 	struct page *head = compound_head(page); /* rmap on tail pages */
 	struct mem_cgroup *memcg;
-	pg_data_t *pgdat = page_pgdat(page);
+	pg_data_t *pgdat;
 	struct lruvec *lruvec;
 
-	memcg = page_memcg_check(head);
+	if (PageMemcgKmem(head)) {
+		__mod_lruvec_kmem_state(page_to_virt(head), idx, val);
+		return;
+	}
+
+	pgdat = page_pgdat(head);
+	memcg = page_memcg(head);
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg) {
 		__mod_node_page_state(pgdat, idx, val);
@@ -3144,18 +3150,18 @@ static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_page
  */
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 {
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	int ret = 0;
 
-	memcg = get_mem_cgroup_from_current();
-	if (memcg && !mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+	objcg = get_obj_cgroup_from_current();
+	if (objcg) {
+		ret = obj_cgroup_charge_page(objcg, gfp, 1 << order);
 		if (!ret) {
-			page->memcg_data = (unsigned long)memcg |
+			page->memcg_data = (unsigned long)objcg |
 				MEMCG_DATA_KMEM;
 			return 0;
 		}
-		css_put(&memcg->css);
+		obj_cgroup_put(objcg);
 	}
 	return ret;
 }
@@ -3167,17 +3173,18 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
  */
 void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 	unsigned int nr_pages = 1 << order;
 
 	if (!page_memcg_charged(page))
 		return;
 
-	memcg = page_memcg_check(page);
-	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	__memcg_kmem_uncharge(memcg, nr_pages);
+	VM_BUG_ON_PAGE(!PageMemcgKmem(page), page);
+
+	objcg = page_objcg(page);
+	obj_cgroup_uncharge_page(objcg, nr_pages);
 	page->memcg_data = 0;
-	css_put(&memcg->css);
+	obj_cgroup_put(objcg);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
@@ -6794,8 +6801,12 @@ struct uncharge_gather {
 	struct mem_cgroup *memcg;
 	unsigned long nr_pages;
 	unsigned long pgpgout;
-	unsigned long nr_kmem;
 	struct page *dummy_page;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *objcg;
+	unsigned long nr_kmem;
+#endif
 };
 
 static inline void uncharge_gather_clear(struct uncharge_gather *ug)
@@ -6807,12 +6818,21 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 {
 	unsigned long flags;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (ug->objcg) {
+		obj_cgroup_uncharge_page(ug->objcg, ug->nr_kmem);
+		/* drop reference from uncharge_kmem_page */
+		obj_cgroup_put(ug->objcg);
+	}
+#endif
+
+	if (!ug->memcg)
+		return;
+
 	if (!mem_cgroup_is_root(ug->memcg)) {
 		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
 		if (do_memsw_account())
 			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
-			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
 		memcg_oom_recover(ug->memcg);
 	}
 
@@ -6822,26 +6842,40 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
 
-	/* drop reference from uncharge_page */
+	/* drop reference from uncharge_user_page */
 	css_put(&ug->memcg->css);
 }
 
-static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+#ifdef CONFIG_MEMCG_KMEM
+static void uncharge_kmem_page(struct page *page, struct uncharge_gather *ug)
 {
-	unsigned long nr_pages;
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg = page_objcg(page);
 
-	VM_BUG_ON_PAGE(PageLRU(page), page);
+	if (ug->objcg != objcg) {
+		if (ug->objcg) {
+			uncharge_batch(ug);
+			uncharge_gather_clear(ug);
+		}
+		ug->objcg = objcg;
 
-	if (!page_memcg_charged(page))
-		return;
+		/* pairs with obj_cgroup_put in uncharge_batch */
+		obj_cgroup_get(ug->objcg);
+	}
+
+	ug->nr_kmem += compound_nr(page);
+	page->memcg_data = 0;
+	obj_cgroup_put(ug->objcg);
+}
+#else
+static void uncharge_kmem_page(struct page *page, struct uncharge_gather *ug)
+{
+}
+#endif
+
+static void uncharge_user_page(struct page *page, struct uncharge_gather *ug)
+{
+	struct mem_cgroup *memcg = page_memcg(page);
 
-	/*
-	 * Nobody should be changing or seriously looking at
-	 * page memcg at this point, we have fully exclusive
-	 * access to the page.
-	 */
-	memcg = page_memcg_check(page);
 	if (ug->memcg != memcg) {
 		if (ug->memcg) {
 			uncharge_batch(ug);
@@ -6852,18 +6886,30 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 		/* pairs with css_put in uncharge_batch */
 		css_get(&ug->memcg->css);
 	}
+	ug->pgpgout++;
+	ug->dummy_page = page;
+
+	ug->nr_pages += compound_nr(page);
+	page->memcg_data = 0;
+	css_put(&ug->memcg->css);
+}
 
-	nr_pages = compound_nr(page);
-	ug->nr_pages += nr_pages;
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
 
+	if (!page_memcg_charged(page))
+		return;
+
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * page memcg at this point, we have fully exclusive
+	 * access to the page.
+	 */
 	if (PageMemcgKmem(page))
-		ug->nr_kmem += nr_pages;
+		uncharge_kmem_page(page, ug);
 	else
-		ug->pgpgout++;
-
-	ug->dummy_page = page;
-	page->memcg_data = 0;
-	css_put(&ug->memcg->css);
+		uncharge_user_page(page, ug);
 }
 
 /**
@@ -6906,8 +6952,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
 	uncharge_gather_clear(&ug);
 	list_for_each_entry(page, page_list, lru)
 		uncharge_page(page, &ug);
-	if (ug.memcg)
-		uncharge_batch(&ug);
+	uncharge_batch(&ug);
 }
 
 /**
-- 
2.11.0



* [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API
  2021-03-03  5:59 [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages Muchun Song
                   ` (2 preceding siblings ...)
  2021-03-03  5:59 ` [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs Muchun Song
@ 2021-03-03  5:59 ` Muchun Song
  2021-03-05 19:48   ` Roman Gushchin
  2021-03-03  5:59 ` [PATCH v2 5/5] mm: memcontrol: use remote objcg charging APIs to charge kernel memory Muchun Song
  4 siblings, 1 reply; 13+ messages in thread
From: Muchun Song @ 2021-03-03  5:59 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, Muchun Song

The remote memcg charging APIs are a mechanism to charge pages to a
given memcg. Since all kernel memory is now charged via the
obj_cgroup APIs, what we actually want is to charge kernel memory to
a remote object cgroup instead of a memory cgroup. So introduce
remote objcg charging APIs to charge the kmem pages by using the
obj_cgroup APIs. If a remote memcg and a remote objcg are both set,
the objcg takes precedence over the memcg for charging the kmem
pages.

In a later patch, we will use those APIs to charge kernel memory to
the remote objcg.
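
A sketch of the intended charging scope (hypothetical caller; the
pattern mirrors set_active_memcg()):

	struct obj_cgroup *old_objcg;

	old_objcg = set_active_objcg(objcg);
	/* __GFP_ACCOUNT allocations here are charged to @objcg. */
	ptr = kmalloc(size, GFP_KERNEL_ACCOUNT);
	set_active_objcg(old_objcg);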

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/sched.h    |  4 ++++
 include/linux/sched/mm.h | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c            |  3 +++
 mm/memcontrol.c          | 44 ++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee46f5cab95b..8edcc71a0a1d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1318,6 +1318,10 @@ struct task_struct {
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup		*active_memcg;
 #endif
+#ifdef CONFIG_MEMCG_KMEM
+	/* Used by memcontrol for targeted objcg charge: */
+	struct obj_cgroup		*active_objcg;
+#endif
 
 #ifdef CONFIG_BLK_CGROUP
 	struct request_queue		*throttle_queue;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 1ae08b8462a4..be1189598b09 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -330,6 +330,44 @@ set_active_memcg(struct mem_cgroup *memcg)
 }
 #endif
 
+#ifdef CONFIG_MEMCG_KMEM
+DECLARE_PER_CPU(struct obj_cgroup *, int_active_objcg);
+
+/**
+ * set_active_objcg - Starts the remote objcg kmem pages charging scope.
+ * @objcg: objcg to charge.
+ *
+ * This function marks the beginning of the remote objcg charging scope. All the
+ * __GFP_ACCOUNT kmem page allocations till the end of the scope will be charged
+ * to the given objcg.
+ *
+ * NOTE: This function can nest. Users must save the return value and
+ * reset the previous value after their own charging scope is over.
+ *
+ * If remote memcg and objcg are both set, objcg takes precedence over memcg
+ * to charge the kmem pages.
+ */
+static inline struct obj_cgroup *set_active_objcg(struct obj_cgroup *objcg)
+{
+	struct obj_cgroup *old;
+
+	if (in_interrupt()) {
+		old = this_cpu_read(int_active_objcg);
+		this_cpu_write(int_active_objcg, objcg);
+	} else {
+		old = current->active_objcg;
+		current->active_objcg = objcg;
+	}
+
+	return old;
+}
+#else
+static inline struct obj_cgroup *set_active_objcg(struct obj_cgroup *objcg)
+{
+	return NULL;
+}
+#endif
+
 #ifdef CONFIG_MEMBARRIER
 enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY		= (1U << 0),
diff --git a/kernel/fork.c b/kernel/fork.c
index d66cd1014211..b4b9dd5d122f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -945,6 +945,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+#ifdef CONFIG_MEMCG_KMEM
+	tsk->active_objcg = NULL;
+#endif
 	return tsk;
 
 free_stack:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0cf342d22547..e48d4ab0af76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -79,6 +79,11 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
 /* Active memory cgroup to use from an interrupt context */
 DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 
+#ifdef CONFIG_MEMCG_KMEM
+/* Active object cgroup to use from an interrupt context */
+DEFINE_PER_CPU(struct obj_cgroup *, int_active_objcg);
+#endif
+
 /* Socket memory accounting disabled? */
 static bool cgroup_memory_nosocket;
 
@@ -1076,7 +1081,7 @@ static __always_inline struct mem_cgroup *get_active_memcg(void)
 	return memcg;
 }
 
-static __always_inline bool memcg_kmem_bypass(void)
+static __always_inline bool memcg_charge_bypass(void)
 {
 	/* Allow remote memcg charging from any context. */
 	if (unlikely(active_memcg()))
@@ -1094,7 +1099,7 @@ static __always_inline bool memcg_kmem_bypass(void)
  */
 static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
 {
-	if (memcg_kmem_bypass())
+	if (memcg_charge_bypass())
 		return NULL;
 
 	if (unlikely(active_memcg()))
@@ -1103,6 +1108,29 @@ static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
 	return get_mem_cgroup_from_mm(current->mm);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static __always_inline struct obj_cgroup *active_objcg(void)
+{
+	if (in_interrupt())
+		return this_cpu_read(int_active_objcg);
+	else
+		return current->active_objcg;
+}
+
+static __always_inline bool kmem_charge_bypass(void)
+{
+	/* Allow remote charging from any context. */
+	if (unlikely(active_objcg() || active_memcg()))
+		return false;
+
+	/* Memcg to charge can't be determined. */
+	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
+		return true;
+
+	return false;
+}
+#endif
+
 /**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
@@ -2997,13 +3025,20 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 
 __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 {
-	struct obj_cgroup *objcg = NULL;
+	struct obj_cgroup *objcg;
 	struct mem_cgroup *memcg;
 
-	if (memcg_kmem_bypass())
+	if (kmem_charge_bypass())
 		return NULL;
 
 	rcu_read_lock();
+	objcg = active_objcg();
+	if (unlikely(objcg)) {
+		/* remote object cgroup must hold a reference. */
+		obj_cgroup_get(objcg);
+		goto out;
+	}
+
 	if (unlikely(active_memcg()))
 		memcg = active_memcg();
 	else
@@ -3015,6 +3050,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 			break;
 		objcg = NULL;
 	}
+out:
 	rcu_read_unlock();
 
 	return objcg;
-- 
2.11.0



* [PATCH v2 5/5] mm: memcontrol: use remote objcg charging APIs to charge kernel memory
  2021-03-03  5:59 [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages Muchun Song
                   ` (3 preceding siblings ...)
  2021-03-03  5:59 ` [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API Muchun Song
@ 2021-03-03  5:59 ` Muchun Song
  4 siblings, 0 replies; 13+ messages in thread
From: Muchun Song @ 2021-03-03  5:59 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, Muchun Song

Some users of remote memory cgroup charging (e.g. bpf and fsnotify)
can switch to the remote objcg charging APIs, because they just want
to charge kernel memory to a remote memcg. After this, all kernel
memory is charged via the obj_cgroup APIs and does not hold a
refcount on the memory cgroup, which prevents long-lived objects from
pinning the original memory cgroup in memory.
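
The conversion follows one pattern throughout; a sketch of the
reference lifecycle, condensed from the fsnotify hunks below:

	/* Group creation: take a reference on the current objcg. */
	group->objcg = get_obj_cgroup_from_current();

	/* Event allocation: charge within a remote-objcg scope. */
	old_objcg = set_active_objcg(group->objcg);
	event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT);
	set_active_objcg(old_objcg);

	/* Group destruction: drop the reference. */
	if (group->objcg)
		obj_cgroup_put(group->objcg);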

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/buffer.c                          | 10 ++++++---
 fs/notify/fanotify/fanotify.c        |  6 +++---
 fs/notify/fanotify/fanotify_user.c   |  2 +-
 fs/notify/group.c                    |  3 ++-
 fs/notify/inotify/inotify_fsnotify.c |  8 ++++----
 fs/notify/inotify/inotify_user.c     |  2 +-
 include/linux/bpf.h                  |  2 +-
 include/linux/fsnotify_backend.h     |  2 +-
 include/linux/memcontrol.h           | 15 ++++++++++++++
 kernel/bpf/syscall.c                 | 35 ++++++++++++++++----------------
 mm/memcontrol.c                      | 39 +++++++++++++++++++++++++++++++++---
 11 files changed, 89 insertions(+), 35 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 591547779dbd..6915195b1a70 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -842,14 +842,16 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 	struct buffer_head *bh, *head;
 	gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
 	long offset;
-	struct mem_cgroup *memcg, *old_memcg;
+	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg, *old_objcg;
 
 	if (retry)
 		gfp |= __GFP_NOFAIL;
 
 	/* The page lock pins the memcg */
 	memcg = page_memcg(page);
-	old_memcg = set_active_memcg(memcg);
+	objcg = get_obj_cgroup_from_mem_cgroup(memcg);
+	old_objcg = set_active_objcg(objcg);
 
 	head = NULL;
 	offset = PAGE_SIZE;
@@ -868,7 +870,9 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 		set_bh_page(bh, page, offset);
 	}
 out:
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
+	if (objcg)
+		obj_cgroup_put(objcg);
 	return head;
 /*
  * In case anything failed, we just free everything we got.
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 1192c9953620..d4ca37b85c56 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -530,7 +530,7 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 	struct inode *dirid = fanotify_dfid_inode(mask, data, data_type, dir);
 	const struct path *path = fsnotify_data_path(data, data_type);
 	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
-	struct mem_cgroup *old_memcg;
+	struct obj_cgroup *old_objcg;
 	struct inode *child = NULL;
 	bool name_event = false;
 
@@ -580,7 +580,7 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 		gfp |= __GFP_RETRY_MAYFAIL;
 
 	/* Whoever is interested in the event, pays for the allocation. */
-	old_memcg = set_active_memcg(group->memcg);
+	old_objcg = set_active_objcg(group->objcg);
 
 	if (fanotify_is_perm_event(mask)) {
 		event = fanotify_alloc_perm_event(path, gfp);
@@ -608,7 +608,7 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 		event->pid = get_pid(task_tgid(current));
 
 out:
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
 	return event;
 }
 
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9e0c1afac8bd..055ca36d4e0e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -985,7 +985,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 	group->fanotify_data.user = user;
 	group->fanotify_data.flags = flags;
 	atomic_inc(&user->fanotify_listeners);
-	group->memcg = get_mem_cgroup_from_mm(current->mm);
+	group->objcg = get_obj_cgroup_from_current();
 
 	group->overflow_event = fanotify_alloc_overflow_event();
 	if (unlikely(!group->overflow_event)) {
diff --git a/fs/notify/group.c b/fs/notify/group.c
index ffd723ffe46d..fac46b92c16f 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -24,7 +24,8 @@ static void fsnotify_final_destroy_group(struct fsnotify_group *group)
 	if (group->ops->free_group_priv)
 		group->ops->free_group_priv(group);
 
-	mem_cgroup_put(group->memcg);
+	if (group->objcg)
+		obj_cgroup_put(group->objcg);
 	mutex_destroy(&group->mark_mutex);
 
 	kfree(group);
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 1901d799909b..ece6dd561659 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -66,7 +66,7 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
 	int ret;
 	int len = 0;
 	int alloc_len = sizeof(struct inotify_event_info);
-	struct mem_cgroup *old_memcg;
+	struct obj_cgroup *old_objcg;
 
 	if (name) {
 		len = name->len;
@@ -81,12 +81,12 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
 
 	/*
 	 * Whoever is interested in the event, pays for the allocation. Do not
-	 * trigger OOM killer in the target monitoring memcg as it may have
+	 * trigger OOM killer in the target monitoring objcg as it may have
 	 * security repercussion.
 	 */
-	old_memcg = set_active_memcg(group->memcg);
+	old_objcg = set_active_objcg(group->objcg);
 	event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
 
 	if (unlikely(!event)) {
 		/*
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index c71be4fb7dc5..5b4de477fcac 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -649,7 +649,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
 	oevent->name_len = 0;
 
 	group->max_events = max_events;
-	group->memcg = get_mem_cgroup_from_mm(current->mm);
+	group->objcg = get_obj_cgroup_from_current();
 
 	spin_lock_init(&group->inotify_data.idr_lock);
 	idr_init(&group->inotify_data.idr);
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cccaef1088ea..b6894e3cd095 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -158,7 +158,7 @@ struct bpf_map {
 	u32 btf_value_type_id;
 	struct btf *btf;
 #ifdef CONFIG_MEMCG_KMEM
-	struct mem_cgroup *memcg;
+	struct obj_cgroup *objcg;
 #endif
 	char name[BPF_OBJ_NAME_LEN];
 	u32 btf_vmlinux_value_type_id;
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index e5409b83e731..d0303f634da6 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -220,7 +220,7 @@ struct fsnotify_group {
 						 * notification list is too
 						 * full */
 
-	struct mem_cgroup *memcg;	/* memcg to charge allocations */
+	struct obj_cgroup *objcg;	/* objcg to charge allocations */
 
 	/* groups can define private fields here or use the void *private */
 	union {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5911b9d107b0..d5a286e3394f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1643,6 +1643,7 @@ static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_mem_cgroup(struct mem_cgroup *memcg);
 struct obj_cgroup *get_obj_cgroup_from_current(void);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
@@ -1693,6 +1694,20 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 struct mem_cgroup *mem_cgroup_from_obj(void *p);
 
 #else
+static inline
+struct obj_cgroup *get_obj_cgroup_from_mem_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+
+static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	return NULL;
+}
+
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+}
 
 static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
 					 int order)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c859bc46d06c..0703dc94cb30 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -390,37 +390,38 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
+static void bpf_map_save_objcg(struct bpf_map *map)
 {
-	map->memcg = get_mem_cgroup_from_mm(current->mm);
+	map->objcg = get_obj_cgroup_from_current();
 }
 
-static void bpf_map_release_memcg(struct bpf_map *map)
+static void bpf_map_release_objcg(struct bpf_map *map)
 {
-	mem_cgroup_put(map->memcg);
+	if (map->objcg)
+		obj_cgroup_put(map->objcg);
 }
 
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
-	struct mem_cgroup *old_memcg;
+	struct obj_cgroup *old_objcg;
 	void *ptr;
 
-	old_memcg = set_active_memcg(map->memcg);
+	old_objcg = set_active_objcg(map->objcg);
 	ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
 
 	return ptr;
 }
 
 void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 {
-	struct mem_cgroup *old_memcg;
+	struct obj_cgroup *old_objcg;
 	void *ptr;
 
-	old_memcg = set_active_memcg(map->memcg);
+	old_objcg = set_active_objcg(map->objcg);
 	ptr = kzalloc(size, flags | __GFP_ACCOUNT);
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
 
 	return ptr;
 }
@@ -428,22 +429,22 @@ void *bpf_map_kzalloc(const struct bpf_map *map, size_t size, gfp_t flags)
 void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 				    size_t align, gfp_t flags)
 {
-	struct mem_cgroup *old_memcg;
+	struct obj_cgroup *old_objcg;
 	void __percpu *ptr;
 
-	old_memcg = set_active_memcg(map->memcg);
+	old_objcg = set_active_objcg(map->objcg);
 	ptr = __alloc_percpu_gfp(size, align, flags | __GFP_ACCOUNT);
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
 
 	return ptr;
 }
 
 #else
-static void bpf_map_save_memcg(struct bpf_map *map)
+static void bpf_map_save_objcg(struct bpf_map *map)
 {
 }
 
-static void bpf_map_release_memcg(struct bpf_map *map)
+static void bpf_map_release_objcg(struct bpf_map *map)
 {
 }
 #endif
@@ -454,7 +455,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
 	struct bpf_map *map = container_of(work, struct bpf_map, work);
 
 	security_bpf_map_free(map);
-	bpf_map_release_memcg(map);
+	bpf_map_release_objcg(map);
 	/* implementation dependent freeing */
 	map->ops->map_free(map);
 }
@@ -877,7 +878,7 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		goto free_map_sec;
 
-	bpf_map_save_memcg(map);
+	bpf_map_save_objcg(map);
 
 	err = bpf_map_new_fd(map, f_flags);
 	if (err < 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e48d4ab0af76..367d1a485292 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3023,6 +3023,22 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	return page_memcg_check(page);
 }
 
+struct obj_cgroup *get_obj_cgroup_from_mem_cgroup(struct mem_cgroup *memcg)
+{
+	struct obj_cgroup *objcg = NULL;
+
+	rcu_read_lock();
+	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && obj_cgroup_tryget(objcg))
+			break;
+		objcg = NULL;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 {
 	struct obj_cgroup *objcg;
@@ -5356,16 +5372,33 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	return ERR_PTR(error);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline struct obj_cgroup *memcg_obj_cgroup(struct mem_cgroup *memcg)
+{
+	return memcg ? memcg->objcg : NULL;
+}
+#else
+static inline struct obj_cgroup *memcg_obj_cgroup(struct mem_cgroup *memcg)
+{
+	return NULL;
+}
+#endif
+
 static struct cgroup_subsys_state * __ref
 mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct mem_cgroup *parent = mem_cgroup_from_css(parent_css);
-	struct mem_cgroup *memcg, *old_memcg;
+	struct mem_cgroup *memcg;
+	struct obj_cgroup *old_objcg;
 	long error = -ENOMEM;
 
-	old_memcg = set_active_memcg(parent);
+	/*
+	 * The @parent cannot be offlined, so @parent->objcg cannot be freed
+	 * under us.
+	 */
+	old_objcg = set_active_objcg(memcg_obj_cgroup(parent));
 	memcg = mem_cgroup_alloc();
-	set_active_memcg(old_memcg);
+	set_active_objcg(old_objcg);
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
 
-- 
2.11.0



* Re: [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page
  2021-03-03  5:59 ` [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page Muchun Song
@ 2021-03-05 18:56   ` Roman Gushchin
  2021-03-06  5:26     ` [External] " Muchun Song
  0 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2021-03-05 18:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun

On Wed, Mar 03, 2021 at 01:59:13PM +0800, Muchun Song wrote:
> We know that the unit of slab object charging is bytes, while the unit
> of kmem page charging is PAGE_SIZE. If we want to reuse the obj_cgroup
> APIs to charge kmem pages, we have to pass PAGE_SIZE (as the third
> parameter) to obj_cgroup_charge(). Because such a size is already a
> multiple of PAGE_SIZE, we can skip touching the objcg stock. So
> obj_cgroup_{un}charge_page() are introduced to charge in units of
> pages.
> 
> In a later patch, we can also reuse those two helpers to charge or
> uncharge a number of kernel pages to an object cgroup. This patch is
> just code movement without any functional changes.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

This patch looks good to me, even as a standalone refactoring.
Please, rename obj_cgroup_charge_page() to obj_cgroup_charge_pages()
and do the same with uncharge. It's because the _page suffix usually
means we're dealing with a physical page (e.g. taking a struct page *
as an argument), and here it's not the case.

Please, add my Acked-by: Roman Gushchin <guro@fb.com>
after the renaming.

Thank you!

> ---
>  mm/memcontrol.c | 46 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec01ef9d..faae16def127 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3056,6 +3056,34 @@ static void memcg_free_cache_id(int id)
>  	ida_simple_remove(&memcg_cache_ida, id);
>  }
>  
> +static inline void obj_cgroup_uncharge_page(struct obj_cgroup *objcg,
> +					    unsigned int nr_pages)
> +{
> +	rcu_read_lock();
> +	__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> +	rcu_read_unlock();
> +}
> +
> +static int obj_cgroup_charge_page(struct obj_cgroup *objcg, gfp_t gfp,
> +				  unsigned int nr_pages)
> +{
> +	struct mem_cgroup *memcg;
> +	int ret;
> +
> +	rcu_read_lock();
> +retry:
> +	memcg = obj_cgroup_memcg(objcg);
> +	if (unlikely(!css_tryget(&memcg->css)))
> +		goto retry;
> +	rcu_read_unlock();
> +
> +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> +
> +	css_put(&memcg->css);
> +
> +	return ret;
> +}
> +
>  /**
>   * __memcg_kmem_charge: charge a number of kernel pages to a memcg
>   * @memcg: memory cgroup to charge
> @@ -3180,11 +3208,8 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
>  		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>  		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
>  
> -		if (nr_pages) {
> -			rcu_read_lock();
> -			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> -			rcu_read_unlock();
> -		}
> +		if (nr_pages)
> +			obj_cgroup_uncharge_page(old, nr_pages);
>  
>  		/*
>  		 * The leftover is flushed to the centralized per-memcg value.
> @@ -3242,7 +3267,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>  
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>  {
> -	struct mem_cgroup *memcg;
>  	unsigned int nr_pages, nr_bytes;
>  	int ret;
>  
> @@ -3259,24 +3283,16 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>  	 * refill_obj_stock(), called from this function or
>  	 * independently later.
>  	 */
> -	rcu_read_lock();
> -retry:
> -	memcg = obj_cgroup_memcg(objcg);
> -	if (unlikely(!css_tryget(&memcg->css)))
> -		goto retry;
> -	rcu_read_unlock();
> -
>  	nr_pages = size >> PAGE_SHIFT;
>  	nr_bytes = size & (PAGE_SIZE - 1);
>  
>  	if (nr_bytes)
>  		nr_pages += 1;
>  
> -	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> +	ret = obj_cgroup_charge_page(objcg, gfp, nr_pages);
>  	if (!ret && nr_bytes)
>  		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
>  
> -	css_put(&memcg->css);
>  	return ret;
>  }
>  
> -- 
> 2.11.0
> 


* Re: [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page
  2021-03-03  5:59 ` [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page Muchun Song
@ 2021-03-05 19:00   ` Roman Gushchin
  2021-03-06  6:02     ` [External] " Muchun Song
  0 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2021-03-05 19:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun

On Wed, Mar 03, 2021 at 01:59:14PM +0800, Muchun Song wrote:
> We want to reuse the obj_cgroup APIs to charge the kmem pages. If we
> do that, we have to store an object cgroup pointer in
> page->memcg_data for the kmem pages.
> 
> Finally, page->memcg_data can have 3 different meanings.
> 
>   1) For the slab pages, page->memcg_data points to an object cgroups
>      vector.
> 
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>      points to an object cgroup.
> 
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>      to a memory cgroup.
> 
> Currently we always get the memory cgroup associated with a page via
> page_memcg() or page_memcg_rcu(). page_memcg_check() is special: it
> has to be used in cases when it's not known whether a page has an
> associated memory cgroup pointer or an object cgroups vector. Because
> page->memcg_data of a kmem page will no longer point to a memory
> cgroup after a later patch, page_memcg() and page_memcg_rcu() cannot
> be applied to kmem pages. So in this patch, make page_memcg() and
> page_memcg_rcu() no longer apply to the kmem pages. The behavior of
> page_memcg_check() is unchanged; it remains applicable to the kmem
> pages.
> 
> In the end, there are 3 helpers to get the memcg associated with a page.
> Usage is as follows.
> 
>   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
>      pages).
> 
>      - page_memcg()
>      - page_memcg_rcu()
> 
>   2) Get the memory cgroup associated with a page. It has to be used in
>      cases when it's not known if a page has an associated memory cgroup
>      pointer or an object cgroups vector. Returns NULL for slab pages or
>      uncharged pages. Otherwise, returns memory cgroup for charged pages
>      (e.g. the kmem pages, the LRU pages).
> 
>      - page_memcg_check()
> 
> In some places, we use page_memcg() only to check whether the page is
> charged. Introduce a page_memcg_charged() helper to do that.
> 
> This is a preparation for reparenting the kmem pages.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

This patch also looks good to me, but, please, make it safe for
adding new memcg_data flags. E.g. if someone adds a new flag with a
completely new meaning, it shouldn't break the code.
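
For example (illustrating the concern), the patch now returns the raw
value,

	return (struct mem_cgroup *)memcg_data;

guarded only by a VM_BUG_ON() on the flag bits, while keeping the
mask, as the old code did,

	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);

would strip an unknown future flag instead of folding it into the
pointer.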

I'll ack it after another look at the final version, but overall it
looks good.

Thanks!

> ---
>  include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++--------
>  mm/memcontrol.c            | 23 +++++++++++++----------
>  mm/page_alloc.c            |  4 ++--
>  3 files changed, 43 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..049b80246cbf 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -358,14 +358,26 @@ enum page_memcg_data_flags {
>  
>  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
>  
> +/* Return true for charged page, otherwise false. */
> +static inline bool page_memcg_charged(struct page *page)
> +{
> +	unsigned long memcg_data = page->memcg_data;
> +
> +	VM_BUG_ON_PAGE(PageSlab(page), page);
> +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> +	return !!memcg_data;
> +}
> +
>  /*
> - * page_memcg - get the memory cgroup associated with a page
> + * page_memcg - get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
>   * or NULL. This function assumes that the page is known to have a
>   * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> + * pages.
>   *
>   * Any of the following ensures page and memcg binding stability:
>   * - the page lock
> @@ -378,27 +390,30 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
>  	unsigned long memcg_data = page->memcg_data;
>  
>  	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
>  
> -	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> +	return (struct mem_cgroup *)memcg_data;
>  }
>  
>  /*
> - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> + * page_memcg_rcu - locklessly get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
>   * or NULL. This function assumes that the page is known to have a
>   * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> + * pages.
>   */
>  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>  {
> +	unsigned long memcg_data = READ_ONCE(page->memcg_data);
> +
>  	VM_BUG_ON_PAGE(PageSlab(page), page);
> +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
>  	WARN_ON_ONCE(!rcu_read_lock_held());
>  
> -	return (struct mem_cgroup *)(READ_ONCE(page->memcg_data) &
> -				     ~MEMCG_DATA_FLAGS_MASK);
> +	return (struct mem_cgroup *)memcg_data;
>  }
>  
>  /*
> @@ -1072,6 +1087,11 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  
>  struct mem_cgroup;
>  
> +static inline bool page_memcg_charged(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline struct mem_cgroup *page_memcg(struct page *page)
>  {
>  	return NULL;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index faae16def127..86a8db937ec6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
>  			     int val)
>  {
>  	struct page *head = compound_head(page); /* rmap on tail pages */
> -	struct mem_cgroup *memcg = page_memcg(head);
> +	struct mem_cgroup *memcg;
>  	pg_data_t *pgdat = page_pgdat(page);
>  	struct lruvec *lruvec;
>  
> +	memcg = page_memcg_check(head);
>  	/* Untracked pages have no memcg, no lruvec. Update only the node */
>  	if (!memcg) {
>  		__mod_node_page_state(pgdat, idx, val);
> @@ -3166,12 +3167,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>   */
>  void __memcg_kmem_uncharge_page(struct page *page, int order)
>  {
> -	struct mem_cgroup *memcg = page_memcg(page);
> +	struct mem_cgroup *memcg;
>  	unsigned int nr_pages = 1 << order;
>  
> -	if (!memcg)
> +	if (!page_memcg_charged(page))
>  		return;
>  
> +	memcg = page_memcg_check(page);
>  	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
>  	__memcg_kmem_uncharge(memcg, nr_pages);
>  	page->memcg_data = 0;
> @@ -6827,24 +6829,25 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>  {
>  	unsigned long nr_pages;
> +	struct mem_cgroup *memcg;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> -	if (!page_memcg(page))
> +	if (!page_memcg_charged(page))
>  		return;
>  
>  	/*
>  	 * Nobody should be changing or seriously looking at
> -	 * page_memcg(page) at this point, we have fully
> -	 * exclusive access to the page.
> +	 * page memcg at this point, we have fully exclusive
> +	 * access to the page.
>  	 */
> -
> -	if (ug->memcg != page_memcg(page)) {
> +	memcg = page_memcg_check(page);
> +	if (ug->memcg != memcg) {
>  		if (ug->memcg) {
>  			uncharge_batch(ug);
>  			uncharge_gather_clear(ug);
>  		}
> -		ug->memcg = page_memcg(page);
> +		ug->memcg = memcg;
>  
>  		/* pairs with css_put in uncharge_batch */
>  		css_get(&ug->memcg->css);
> @@ -6877,7 +6880,7 @@ void mem_cgroup_uncharge(struct page *page)
>  		return;
>  
>  	/* Don't touch page->lru of any random page, pre-check: */
> -	if (!page_memcg(page))
> +	if (!page_memcg_charged(page))
>  		return;
>  
>  	uncharge_gather_clear(&ug);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f10966e3b4a5..bcb58ae15e24 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1124,7 +1124,7 @@ static inline bool page_expected_state(struct page *page,
>  	if (unlikely((unsigned long)page->mapping |
>  			page_ref_count(page) |
>  #ifdef CONFIG_MEMCG
> -			(unsigned long)page_memcg(page) |
> +			page_memcg_charged(page) |
>  #endif
>  			(page->flags & check_flags)))
>  		return false;
> @@ -1149,7 +1149,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
>  			bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
>  	}
>  #ifdef CONFIG_MEMCG
> -	if (unlikely(page_memcg(page)))
> +	if (unlikely(page_memcg_charged(page)))
>  		bad_reason = "page still charged to cgroup";
>  #endif
>  	return bad_reason;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs
  2021-03-03  5:59 ` [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs Muchun Song
@ 2021-03-05 19:40   ` Roman Gushchin
  2021-03-06  8:03     ` [External] " Muchun Song
  0 siblings, 1 reply; 13+ messages in thread
From: Roman Gushchin @ 2021-03-05 19:40 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun

On Wed, Mar 03, 2021 at 01:59:15PM +0800, Muchun Song wrote:
> Since Roman's series "The new cgroup slab memory controller" was
> applied, all slab objects are charged via the new obj_cgroup APIs.
> The new APIs introduce a struct obj_cgroup to charge slab objects. It
> prevents long-living objects from pinning the original memory cgroup
> in memory. But there are still some corner objects (e.g. allocations
> larger than order-1 page on SLUB) which are not charged via the new
> APIs. Those objects (including the pages allocated directly from the
> buddy allocator) are charged as kmem pages which still hold a
> reference to the memory cgroup.
> 
> This patch aims to charge the kmem pages by using the new obj_cgroup
> APIs. With this change, the page->memcg_data of a kmem page points to
> an object cgroup. We can use page_objcg() to get the object cgroup
> associated with a kmem page, or page_memcg_check() to get the memory
> cgroup associated with it, but the caller must ensure that the
> returned memcg won't be released (e.g. by acquiring the rcu_read_lock
> or the css_set_lock).

I believe it's a good direction, but there are still things which
need to be figured out first.

> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  include/linux/memcontrol.h |  63 +++++++++++++++++------
>  mm/memcontrol.c            | 123 +++++++++++++++++++++++++++++++--------------
>  2 files changed, 133 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 049b80246cbf..5911b9d107b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -370,6 +370,18 @@ static inline bool page_memcg_charged(struct page *page)
>  }
>  
>  /*
> + * After the initialization objcg->memcg is always pointing at
> + * a valid memcg, but can be atomically swapped to the parent memcg.
> + *
> + * The caller must ensure that the returned memcg won't be released:
> + * e.g. acquire the rcu_read_lock or css_set_lock.
> + */
> +static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> +{
> +	return READ_ONCE(objcg->memcg);
> +}
> +
> +/*
>   * page_memcg - get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
> @@ -421,9 +433,10 @@ static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
> - * or NULL. This function unlike page_memcg() can take any  page
> + * or NULL. This function unlike page_memcg() can take any non-kmem page
>   * as an argument. It has to be used in cases when it's not known if a page
> - * has an associated memory cgroup pointer or an object cgroups vector.
> + * has an associated memory cgroup pointer or an object cgroups vector or
> + * an object cgroup.
>   *
>   * Any of the following ensures page and memcg binding stability:
>   * - the page lock
> @@ -442,6 +455,17 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
>  	if (memcg_data & MEMCG_DATA_OBJCGS)
>  		return NULL;
>  
> +	if (memcg_data & MEMCG_DATA_KMEM) {

This is confusing: the comment above says it can't take kmem pages?

> +		struct obj_cgroup *objcg;
> +
> +		/*
> +		 * The caller must ensure that the returned memcg won't be
> +		 * released: e.g. acquire the rcu_read_lock or css_set_lock.
> +		 */
> +		objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> +		return obj_cgroup_memcg(objcg);
> +	}
> +
>  	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);

Also, the comment about page<->memcg binding stability is not correct anymore.
Taking the page lock, for example, won't protect the memcg from being released
if this is a kmem page.

_Maybe_ it's ok to just say that page_memcg_check() requires an rcu lock,
but I'm not yet quite sure. The calling convention is already complicated,
we should avoid making it even more complicated, if we can.
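
For reference, a caller following that convention would look roughly
like this; a sketch, assuming the kmem branch of page_memcg_check()
lands as posted:

	rcu_read_lock();
	memcg = page_memcg_check(page);	/* may come from objcg->memcg */
	if (memcg && css_tryget(&memcg->css)) {
		/* memcg is pinned and usable outside the rcu section */
	}
	rcu_read_unlock();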

>  }
>  
> @@ -500,6 +524,24 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
>  	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  
> +/*
> + * page_objcg - get the object cgroup associated with a kmem page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the object cgroup associated with the kmem page,
> + * or NULL. This function assumes that the page is known to have an
> + * associated object cgroup. It's only safe to call this function
> + * against kmem pages (PageMemcgKmem() returns true).
> + */
> +static inline struct obj_cgroup *page_objcg(struct page *page)
> +{
> +	unsigned long memcg_data = page->memcg_data;
> +
> +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +	VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
> +
> +	return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> +}
>  #else
>  static inline struct obj_cgroup **page_objcgs(struct page *page)
>  {
> @@ -510,6 +552,11 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
>  {
>  	return NULL;
>  }
> +
> +static inline struct obj_cgroup *page_objcg(struct page *page)
> +{
> +	return NULL;
> +}
>  #endif
>  
>  static __always_inline bool memcg_stat_item_in_bytes(int idx)
> @@ -728,18 +775,6 @@ static inline void obj_cgroup_put(struct obj_cgroup *objcg)
>  	percpu_ref_put(&objcg->refcnt);
>  }
>  
> -/*
> - * After the initialization objcg->memcg is always pointing at
> - * a valid memcg, but can be atomically swapped to the parent memcg.
> - *
> - * The caller must ensure that the returned memcg won't be released:
> - * e.g. acquire the rcu_read_lock or css_set_lock.
> - */
> -static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> -{
> -	return READ_ONCE(objcg->memcg);
> -}
> -
>  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  	if (memcg)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 86a8db937ec6..0cf342d22547 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -856,10 +856,16 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
>  {
>  	struct page *head = compound_head(page); /* rmap on tail pages */
>  	struct mem_cgroup *memcg;
> -	pg_data_t *pgdat = page_pgdat(page);
> +	pg_data_t *pgdat;
>  	struct lruvec *lruvec;
>  
> -	memcg = page_memcg_check(head);
> +	if (PageMemcgKmem(head)) {
> +		__mod_lruvec_kmem_state(page_to_virt(head), idx, val);
> +		return;
> +	}

This is a very confusing part: we're converting the page to the virtual address
to run mem_cgroup_from_obj() inside __mod_lruvec_kmem_state() to get back the page.
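
A more direct shape would resolve the lruvec without bouncing through
the virtual address at all. A sketch, assuming page_objcg() and
obj_cgroup_memcg() from this series:

	if (PageMemcgKmem(head)) {
		struct obj_cgroup *objcg = page_objcg(head);

		rcu_read_lock();
		memcg = obj_cgroup_memcg(objcg);
		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(head));
		__mod_lruvec_state(lruvec, idx, val);
		rcu_read_unlock();
		return;
	}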

> +
> +	pgdat = page_pgdat(head);
> +	memcg = page_memcg(head);
>  	/* Untracked pages have no memcg, no lruvec. Update only the node */
>  	if (!memcg) {
>  		__mod_node_page_state(pgdat, idx, val);
> @@ -3144,18 +3150,18 @@ static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_page
>   */
>  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  {
> -	struct mem_cgroup *memcg;
> +	struct obj_cgroup *objcg;
>  	int ret = 0;
>  
> -	memcg = get_mem_cgroup_from_current();
> -	if (memcg && !mem_cgroup_is_root(memcg)) {
> -		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> +	objcg = get_obj_cgroup_from_current();
> +	if (objcg) {
> +		ret = obj_cgroup_charge_page(objcg, gfp, 1 << order);
>  		if (!ret) {
> -			page->memcg_data = (unsigned long)memcg |
> +			page->memcg_data = (unsigned long)objcg |
>  				MEMCG_DATA_KMEM;
>  			return 0;
>  		}
> -		css_put(&memcg->css);
> +		obj_cgroup_put(objcg);
>  	}
>  	return ret;
>  }
> @@ -3167,17 +3173,18 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>   */
>  void __memcg_kmem_uncharge_page(struct page *page, int order)
>  {
> -	struct mem_cgroup *memcg;
> +	struct obj_cgroup *objcg;
>  	unsigned int nr_pages = 1 << order;
>  
>  	if (!page_memcg_charged(page))
>  		return;
>  
> -	memcg = page_memcg_check(page);
> -	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> -	__memcg_kmem_uncharge(memcg, nr_pages);
> +	VM_BUG_ON_PAGE(!PageMemcgKmem(page), page);
> +
> +	objcg = page_objcg(page);
> +	obj_cgroup_uncharge_page(objcg, nr_pages);
>  	page->memcg_data = 0;
> -	css_put(&memcg->css);
> +	obj_cgroup_put(objcg);
>  }
>  
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> @@ -6794,8 +6801,12 @@ struct uncharge_gather {
>  	struct mem_cgroup *memcg;
>  	unsigned long nr_pages;
>  	unsigned long pgpgout;
> -	unsigned long nr_kmem;
>  	struct page *dummy_page;
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct obj_cgroup *objcg;
> +	unsigned long nr_kmem;
> +#endif
>  };
>  
>  static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> @@ -6807,12 +6818,21 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>  {
>  	unsigned long flags;
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (ug->objcg) {
> +		obj_cgroup_uncharge_page(ug->objcg, ug->nr_kmem);
> +		/* drop reference from uncharge_kmem_page */
> +		obj_cgroup_put(ug->objcg);
> +	}
> +#endif

Hm, an obvious question here is why do we need to double the ug infrastructure
if we can just get kmem page's memcg and use the infra for user pages?
Because ug is holding a reference to memcg, it will not go away.
Maybe I'm missing something, but it seems that there is a simpler implementation.

Thanks!
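
For instance, uncharge_page() could resolve a stable memcg reference
from the objcg up front and then fall through to the existing memcg
batching. A rough sketch, where the retry loop guards against
concurrent reparenting:

	if (PageMemcgKmem(page)) {
		struct obj_cgroup *objcg = page_objcg(page);

		rcu_read_lock();
		do {
			memcg = obj_cgroup_memcg(objcg);
		} while (!css_tryget(&memcg->css));	/* objcg may be reparented */
		rcu_read_unlock();
	} else {
		memcg = page_memcg(page);
	}
	/* ...then reuse the existing ug->memcg batching unchanged... */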

> +
> +	if (!ug->memcg)
> +		return;
> +
>  	if (!mem_cgroup_is_root(ug->memcg)) {
>  		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
>  		if (do_memsw_account())
>  			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
> -		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
> -			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
>  		memcg_oom_recover(ug->memcg);
>  	}
>  
> @@ -6822,26 +6842,40 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>  	memcg_check_events(ug->memcg, ug->dummy_page);
>  	local_irq_restore(flags);
>  
> -	/* drop reference from uncharge_page */
> +	/* drop reference from uncharge_user_page */
>  	css_put(&ug->memcg->css);
>  }
>  
> -static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> +#ifdef CONFIG_MEMCG_KMEM
> +static void uncharge_kmem_page(struct page *page, struct uncharge_gather *ug)
>  {
> -	unsigned long nr_pages;
> -	struct mem_cgroup *memcg;
> +	struct obj_cgroup *objcg = page_objcg(page);
>  
> -	VM_BUG_ON_PAGE(PageLRU(page), page);
> +	if (ug->objcg != objcg) {
> +		if (ug->objcg) {
> +			uncharge_batch(ug);
> +			uncharge_gather_clear(ug);
> +		}
> +		ug->objcg = objcg;
>  
> -	if (!page_memcg_charged(page))
> -		return;
> +		/* pairs with obj_cgroup_put in uncharge_batch */
> +		obj_cgroup_get(ug->objcg);
> +	}
> +
> +	ug->nr_kmem += compound_nr(page);
> +	page->memcg_data = 0;
> +	obj_cgroup_put(ug->objcg);
> +}
> +#else
> +static void uncharge_kmem_page(struct page *page, struct uncharge_gather *ug)
> +{
> +}
> +#endif
> +
> +static void uncharge_user_page(struct page *page, struct uncharge_gather *ug)
> +{
> +	struct mem_cgroup *memcg = page_memcg(page);
>  
> -	/*
> -	 * Nobody should be changing or seriously looking at
> -	 * page memcg at this point, we have fully exclusive
> -	 * access to the page.
> -	 */
> -	memcg = page_memcg_check(page);
>  	if (ug->memcg != memcg) {
>  		if (ug->memcg) {
>  			uncharge_batch(ug);
> @@ -6852,18 +6886,30 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>  		/* pairs with css_put in uncharge_batch */
>  		css_get(&ug->memcg->css);
>  	}
> +	ug->pgpgout++;
> +	ug->dummy_page = page;
> +
> +	ug->nr_pages += compound_nr(page);
> +	page->memcg_data = 0;
> +	css_put(&ug->memcg->css);
> +}
>  
> -	nr_pages = compound_nr(page);
> -	ug->nr_pages += nr_pages;
> +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> +{
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> +	if (!page_memcg_charged(page))
> +		return;
> +
> +	/*
> +	 * Nobody should be changing or seriously looking at
> +	 * page memcg at this point, we have fully exclusive
> +	 * access to the page.
> +	 */
>  	if (PageMemcgKmem(page))
> -		ug->nr_kmem += nr_pages;
> +		uncharge_kmem_page(page, ug);
>  	else
> -		ug->pgpgout++;
> -
> -	ug->dummy_page = page;
> -	page->memcg_data = 0;
> -	css_put(&ug->memcg->css);
> +		uncharge_user_page(page, ug);
>  }
>  
>  /**
> @@ -6906,8 +6952,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
>  	uncharge_gather_clear(&ug);
>  	list_for_each_entry(page, page_list, lru)
>  		uncharge_page(page, &ug);
> -	if (ug.memcg)
> -		uncharge_batch(&ug);
> +	uncharge_batch(&ug);
>  }
>  
>  /**
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API
  2021-03-03  5:59 ` [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API Muchun Song
@ 2021-03-05 19:48   ` Roman Gushchin
  0 siblings, 0 replies; 13+ messages in thread
From: Roman Gushchin @ 2021-03-05 19:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun

On Wed, Mar 03, 2021 at 01:59:16PM +0800, Muchun Song wrote:
> The remote memcg charging APIs are a mechanism to charge pages to a
> given memcg. Since all kernel memory is now charged via the
> obj_cgroup APIs, we actually want to charge kernel memory to a remote
> object cgroup instead of a memory cgroup. So introduce remote objcg
> charging APIs to charge the kmem pages via the obj_cgroup APIs. If a
> remote memcg and a remote objcg are both set, the objcg takes
> precedence over the memcg when charging the kmem pages.
> 
> In a later patch, we will use these APIs to charge kernel memory to
> the remote objcg.

I'd abandon/postpone the rest of the patchset (patches 4 and 5) for now.
They add a lot of new code to solve a theoretical problem (please
correct me if I'm wrong) which is not a panic or data corruption, but
sub-optimal garbage collection behavior. I think we need a better
motivation and/or an implementation which makes the code simpler
and smaller.
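
For context, the usage pattern the series is aiming at is a scoped
override around an allocation. A hypothetical caller, not taken from
these patches:

	struct obj_cgroup *old;

	old = set_active_objcg(objcg);
	buf = kmalloc(size, GFP_KERNEL_ACCOUNT);	/* charged to objcg */
	set_active_objcg(old);				/* restore the outer scope */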

> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  include/linux/sched.h    |  4 ++++
>  include/linux/sched/mm.h | 38 ++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c            |  3 +++
>  mm/memcontrol.c          | 44 ++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 85 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee46f5cab95b..8edcc71a0a1d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1318,6 +1318,10 @@ struct task_struct {
>  	/* Used by memcontrol for targeted memcg charge: */
>  	struct mem_cgroup		*active_memcg;
>  #endif
> +#ifdef CONFIG_MEMCG_KMEM
> +	/* Used by memcontrol for targeted objcg charge: */
> +	struct obj_cgroup		*active_objcg;
> +#endif
>  
>  #ifdef CONFIG_BLK_CGROUP
>  	struct request_queue		*throttle_queue;
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 1ae08b8462a4..be1189598b09 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -330,6 +330,44 @@ set_active_memcg(struct mem_cgroup *memcg)
>  }
>  #endif
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +DECLARE_PER_CPU(struct obj_cgroup *, int_active_objcg);
> +
> +/**
> + * set_active_objcg - Starts the remote objcg kmem pages charging scope.
> + * @objcg: objcg to charge.
> + *
> + * This function marks the beginning of the remote objcg charging scope. All the
> + * __GFP_ACCOUNT kmem page allocations till the end of the scope will be charged
> + * to the given objcg.
> + *
> + * NOTE: This function can nest. Users must save the return value and
> + * reset the previous value after their own charging scope is over.
> + *
> + * If remote memcg and objcg are both set, objcg takes precedence over memcg
> + * to charge the kmem pages.
> + */
> +static inline struct obj_cgroup *set_active_objcg(struct obj_cgroup *objcg)
> +{
> +	struct obj_cgroup *old;
> +
> +	if (in_interrupt()) {
> +		old = this_cpu_read(int_active_objcg);
> +		this_cpu_write(int_active_objcg, objcg);
> +	} else {
> +		old = current->active_objcg;
> +		current->active_objcg = objcg;
> +	}
> +
> +	return old;
> +}
> +#else
> +static inline struct obj_cgroup *set_active_objcg(struct obj_cgroup *objcg)
> +{
> +	return NULL;
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMBARRIER
>  enum {
>  	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY		= (1U << 0),
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..b4b9dd5d122f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -945,6 +945,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>  #ifdef CONFIG_MEMCG
>  	tsk->active_memcg = NULL;
>  #endif
> +#ifdef CONFIG_MEMCG_KMEM
> +	tsk->active_objcg = NULL;
> +#endif
>  	return tsk;
>  
>  free_stack:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0cf342d22547..e48d4ab0af76 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -79,6 +79,11 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
>  /* Active memory cgroup to use from an interrupt context */
>  DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +/* Active object cgroup to use from an interrupt context */
> +DEFINE_PER_CPU(struct obj_cgroup *, int_active_objcg);
> +#endif
> +
>  /* Socket memory accounting disabled? */
>  static bool cgroup_memory_nosocket;
>  
> @@ -1076,7 +1081,7 @@ static __always_inline struct mem_cgroup *get_active_memcg(void)
>  	return memcg;
>  }
>  
> -static __always_inline bool memcg_kmem_bypass(void)
> +static __always_inline bool memcg_charge_bypass(void)
>  {
>  	/* Allow remote memcg charging from any context. */
>  	if (unlikely(active_memcg()))
> @@ -1094,7 +1099,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>   */
>  static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
>  {
> -	if (memcg_kmem_bypass())
> +	if (memcg_charge_bypass())
>  		return NULL;
>  
>  	if (unlikely(active_memcg()))
> @@ -1103,6 +1108,29 @@ static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
>  	return get_mem_cgroup_from_mm(current->mm);
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +static __always_inline struct obj_cgroup *active_objcg(void)
> +{
> +	if (in_interrupt())
> +		return this_cpu_read(int_active_objcg);
> +	else
> +		return current->active_objcg;
> +}
> +
> +static __always_inline bool kmem_charge_bypass(void)
> +{
> +	/* Allow remote charging from any context. */
> +	if (unlikely(active_objcg() || active_memcg()))
> +		return false;
> +
> +	/* Memcg to charge can't be determined. */
> +	if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +		return true;
> +
> +	return false;
> +}
> +#endif
> +
>  /**
>   * mem_cgroup_iter - iterate over memory cgroup hierarchy
>   * @root: hierarchy root
> @@ -2997,13 +3025,20 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
>  
>  __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
>  {
> -	struct obj_cgroup *objcg = NULL;
> +	struct obj_cgroup *objcg;
>  	struct mem_cgroup *memcg;
>  
> -	if (memcg_kmem_bypass())
> +	if (kmem_charge_bypass())
>  		return NULL;
>  
>  	rcu_read_lock();
> +	objcg = active_objcg();
> +	if (unlikely(objcg)) {
> +		/* remote object cgroup must hold a reference. */
> +		obj_cgroup_get(objcg);
> +		goto out;
> +	}
> +
>  	if (unlikely(active_memcg()))
>  		memcg = active_memcg();
>  	else
> @@ -3015,6 +3050,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
>  			break;
>  		objcg = NULL;
>  	}
> +out:
>  	rcu_read_unlock();
>  
>  	return objcg;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [External] Re: [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page
  2021-03-05 18:56   ` Roman Gushchin
@ 2021-03-06  5:26     ` Muchun Song
  0 siblings, 0 replies; 13+ messages in thread
From: Muchun Song @ 2021-03-06  5:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Andrew Morton, Shakeel Butt,
	Vladimir Davydov, LKML, Linux Memory Management List,
	Xiongchun duan

On Sat, Mar 6, 2021 at 2:56 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Mar 03, 2021 at 01:59:13PM +0800, Muchun Song wrote:
> > We know that the unit of slab object charging is bytes, while the
> > unit of kmem page charging is PAGE_SIZE. If we want to reuse the
> > obj_cgroup APIs to charge the kmem pages, we should pass PAGE_SIZE
> > (as the third parameter) to obj_cgroup_charge(). Because the size is
> > already in units of PAGE_SIZE, we can skip touching the objcg stock.
> > So obj_cgroup_{un}charge_page() are introduced to charge in units of
> > pages.
> >
> > In the later patch, we can also reuse those two helpers to charge or
> > uncharge a number of kernel pages to an object cgroup. This is just
> > code movement without any functional changes.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> This patch looks good to me, even as a standalone refactoring.
> Please, rename obj_cgroup_charge_page() to obj_cgroup_charge_pages()
> and do the same with uncharge. It's because the _page suffix usually
> means we're dealing with a physical page (e.g. a struct page * as an
> argument), which is not the case here.

Makes sense.

>
> Please, add my Acked-by: Roman Gushchin <guro@fb.com>
> after the renaming.

Will do. Thanks for your review.
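
With the rename applied, the pair would presumably read (a sketch of
the signatures only):

	static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
					   unsigned int nr_pages);
	static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
					      unsigned int nr_pages);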

>
> Thank you!
>
> > ---
> >  mm/memcontrol.c | 46 +++++++++++++++++++++++++++++++---------------
> >  1 file changed, 31 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 845eec01ef9d..faae16def127 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3056,6 +3056,34 @@ static void memcg_free_cache_id(int id)
> >       ida_simple_remove(&memcg_cache_ida, id);
> >  }
> >
> > +static inline void obj_cgroup_uncharge_page(struct obj_cgroup *objcg,
> > +                                         unsigned int nr_pages)
> > +{
> > +     rcu_read_lock();
> > +     __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> > +     rcu_read_unlock();
> > +}
> > +
> > +static int obj_cgroup_charge_page(struct obj_cgroup *objcg, gfp_t gfp,
> > +                               unsigned int nr_pages)
> > +{
> > +     struct mem_cgroup *memcg;
> > +     int ret;
> > +
> > +     rcu_read_lock();
> > +retry:
> > +     memcg = obj_cgroup_memcg(objcg);
> > +     if (unlikely(!css_tryget(&memcg->css)))
> > +             goto retry;
> > +     rcu_read_unlock();
> > +
> > +     ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> > +
> > +     css_put(&memcg->css);
> > +
> > +     return ret;
> > +}
> > +
> >  /**
> >   * __memcg_kmem_charge: charge a number of kernel pages to a memcg
> >   * @memcg: memory cgroup to charge
> > @@ -3180,11 +3208,8 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> >               unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> >               unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
> >
> > -             if (nr_pages) {
> > -                     rcu_read_lock();
> > -                     __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> > -                     rcu_read_unlock();
> > -             }
> > +             if (nr_pages)
> > +                     obj_cgroup_uncharge_page(old, nr_pages);
> >
> >               /*
> >                * The leftover is flushed to the centralized per-memcg value.
> > @@ -3242,7 +3267,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> >
> >  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> >  {
> > -     struct mem_cgroup *memcg;
> >       unsigned int nr_pages, nr_bytes;
> >       int ret;
> >
> > @@ -3259,24 +3283,16 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> >        * refill_obj_stock(), called from this function or
> >        * independently later.
> >        */
> > -     rcu_read_lock();
> > -retry:
> > -     memcg = obj_cgroup_memcg(objcg);
> > -     if (unlikely(!css_tryget(&memcg->css)))
> > -             goto retry;
> > -     rcu_read_unlock();
> > -
> >       nr_pages = size >> PAGE_SHIFT;
> >       nr_bytes = size & (PAGE_SIZE - 1);
> >
> >       if (nr_bytes)
> >               nr_pages += 1;
> >
> > -     ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> > +     ret = obj_cgroup_charge_page(objcg, gfp, nr_pages);
> >       if (!ret && nr_bytes)
> >               refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> >
> > -     css_put(&memcg->css);
> >       return ret;
> >  }
> >
> > --
> > 2.11.0
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [External] Re: [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page
  2021-03-05 19:00   ` Roman Gushchin
@ 2021-03-06  6:02     ` Muchun Song
  0 siblings, 0 replies; 13+ messages in thread
From: Muchun Song @ 2021-03-06  6:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Andrew Morton, Shakeel Butt,
	Vladimir Davydov, LKML, Linux Memory Management List,
	Xiongchun duan

On Sat, Mar 6, 2021 at 3:00 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Mar 03, 2021 at 01:59:14PM +0800, Muchun Song wrote:
> > We want to reuse the obj_cgroup APIs to charge the kmem pages. If we
> > do that, we should store an object cgroup pointer in
> > page->memcg_data for the kmem pages.
> >
> > Finally, page->memcg_data can have 3 different meanings.
> >
> >   1) For the slab pages, page->memcg_data points to an object cgroups
> >      vector.
> >
> >   2) For the kmem pages (exclude the slab pages), page->memcg_data
> >      points to an object cgroup.
> >
> >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> >      to a memory cgroup.
> >
> > Currently we always get the memory cgroup associated with a page via
> > page_memcg() or page_memcg_rcu(). page_memcg_check() is special: it
> > has to be used in cases when it's not known if a page has an
> > associated memory cgroup pointer or an object cgroups vector. Because
> > a later patch makes the page->memcg_data of a kmem page point to an
> > object cgroup rather than a memory cgroup, page_memcg() and
> > page_memcg_rcu() can no longer be applied to kmem pages. In this
> > patch, make page_memcg() and page_memcg_rcu() no longer apply to the
> > kmem pages. We do not change the behavior of page_memcg_check(); it
> > remains applicable to the kmem pages.
> >
> > In the end, there are 3 helpers to get the memcg associated with a page.
> > Usage is as follows.
> >
> >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> >      pages).
> >
> >      - page_memcg()
> >      - page_memcg_rcu()
> >
> >   2) Get the memory cgroup associated with a page. It has to be used in
> >      cases when it's not known if a page has an associated memory cgroup
> >      pointer or an object cgroups vector. Returns NULL for slab pages or
> >      uncharged pages. Otherwise, returns memory cgroup for charged pages
> >      (e.g. the kmem pages, the LRU pages).
> >
> >      - page_memcg_check()
> >
> > In some places, we use page_memcg() to check whether a page is charged.
> > Now introduce the page_memcg_charged() helper to do that.
> >
> > This is a preparation for reparenting the kmem pages.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> This patch also looks good to me, but, please, make it safe for adding
> new memcg_data flags. E.g. if someone adds a new flag with a completely
> new meaning, it shouldn't break the code.

I wasn't thoughtful enough about that. Will do it in the next version.
Thanks for your suggestions.

>
> I'll ack it after another look at the final version, but overall it
> looks good.
>
> Thanks!
>
> > ---
> >  include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++--------
> >  mm/memcontrol.c            | 23 +++++++++++++----------
> >  mm/page_alloc.c            |  4 ++--
> >  3 files changed, 43 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e6dc793d587d..049b80246cbf 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -358,14 +358,26 @@ enum page_memcg_data_flags {
> >
> >  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
> >
> > +/* Return true for charged page, otherwise false. */
> > +static inline bool page_memcg_charged(struct page *page)
> > +{
> > +     unsigned long memcg_data = page->memcg_data;
> > +
> > +     VM_BUG_ON_PAGE(PageSlab(page), page);
> > +     VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> > +
> > +     return !!memcg_data;
> > +}
> > +
> >  /*
> > - * page_memcg - get the memory cgroup associated with a page
> > + * page_memcg - get the memory cgroup associated with a non-kmem page
> >   * @page: a pointer to the page struct
> >   *
> >   * Returns a pointer to the memory cgroup associated with the page,
> >   * or NULL. This function assumes that the page is known to have a
> >   * proper memory cgroup pointer. It's not safe to call this function
> > - * against some type of pages, e.g. slab pages or ex-slab pages.
> > + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> > + * pages.
> >   *
> >   * Any of the following ensures page and memcg binding stability:
> >   * - the page lock
> > @@ -378,27 +390,30 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
> >       unsigned long memcg_data = page->memcg_data;
> >
> >       VM_BUG_ON_PAGE(PageSlab(page), page);
> > -     VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> > +     VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
> >
> > -     return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > +     return (struct mem_cgroup *)memcg_data;
> >  }
> >
> >  /*
> > - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> > + * page_memcg_rcu - locklessly get the memory cgroup associated with a non-kmem page
> >   * @page: a pointer to the page struct
> >   *
> >   * Returns a pointer to the memory cgroup associated with the page,
> >   * or NULL. This function assumes that the page is known to have a
> >   * proper memory cgroup pointer. It's not safe to call this function
> > - * against some type of pages, e.g. slab pages or ex-slab pages.
> > + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> > + * pages.
> >   */
> >  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> >  {
> > +     unsigned long memcg_data = READ_ONCE(page->memcg_data);
> > +
> >       VM_BUG_ON_PAGE(PageSlab(page), page);
> > +     VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
> >       WARN_ON_ONCE(!rcu_read_lock_held());
> >
> > -     return (struct mem_cgroup *)(READ_ONCE(page->memcg_data) &
> > -                                  ~MEMCG_DATA_FLAGS_MASK);
> > +     return (struct mem_cgroup *)memcg_data;
> >  }
> >
> >  /*
> > @@ -1072,6 +1087,11 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> >
> >  struct mem_cgroup;
> >
> > +static inline bool page_memcg_charged(struct page *page)
> > +{
> > +     return false;
> > +}
> > +
> >  static inline struct mem_cgroup *page_memcg(struct page *page)
> >  {
> >       return NULL;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index faae16def127..86a8db937ec6 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> >                            int val)
> >  {
> >       struct page *head = compound_head(page); /* rmap on tail pages */
> > -     struct mem_cgroup *memcg = page_memcg(head);
> > +     struct mem_cgroup *memcg;
> >       pg_data_t *pgdat = page_pgdat(page);
> >       struct lruvec *lruvec;
> >
> > +     memcg = page_memcg_check(head);
> >       /* Untracked pages have no memcg, no lruvec. Update only the node */
> >       if (!memcg) {
> >               __mod_node_page_state(pgdat, idx, val);
> > @@ -3166,12 +3167,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> >   */
> >  void __memcg_kmem_uncharge_page(struct page *page, int order)
> >  {
> > -     struct mem_cgroup *memcg = page_memcg(page);
> > +     struct mem_cgroup *memcg;
> >       unsigned int nr_pages = 1 << order;
> >
> > -     if (!memcg)
> > +     if (!page_memcg_charged(page))
> >               return;
> >
> > +     memcg = page_memcg_check(page);
> >       VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> >       __memcg_kmem_uncharge(memcg, nr_pages);
> >       page->memcg_data = 0;
> > @@ -6827,24 +6829,25 @@ static void uncharge_batch(const struct uncharge_gather *ug)
> >  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> >  {
> >       unsigned long nr_pages;
> > +     struct mem_cgroup *memcg;
> >
> >       VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > -     if (!page_memcg(page))
> > +     if (!page_memcg_charged(page))
> >               return;
> >
> >       /*
> >        * Nobody should be changing or seriously looking at
> > -      * page_memcg(page) at this point, we have fully
> > -      * exclusive access to the page.
> > +      * page memcg at this point, we have fully exclusive
> > +      * access to the page.
> >        */
> > -
> > -     if (ug->memcg != page_memcg(page)) {
> > +     memcg = page_memcg_check(page);
> > +     if (ug->memcg != memcg) {
> >               if (ug->memcg) {
> >                       uncharge_batch(ug);
> >                       uncharge_gather_clear(ug);
> >               }
> > -             ug->memcg = page_memcg(page);
> > +             ug->memcg = memcg;
> >
> >               /* pairs with css_put in uncharge_batch */
> >               css_get(&ug->memcg->css);
> > @@ -6877,7 +6880,7 @@ void mem_cgroup_uncharge(struct page *page)
> >               return;
> >
> >       /* Don't touch page->lru of any random page, pre-check: */
> > -     if (!page_memcg(page))
> > +     if (!page_memcg_charged(page))
> >               return;
> >
> >       uncharge_gather_clear(&ug);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f10966e3b4a5..bcb58ae15e24 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1124,7 +1124,7 @@ static inline bool page_expected_state(struct page *page,
> >       if (unlikely((unsigned long)page->mapping |
> >                       page_ref_count(page) |
> >  #ifdef CONFIG_MEMCG
> > -                     (unsigned long)page_memcg(page) |
> > +                     page_memcg_charged(page) |
> >  #endif
> >                       (page->flags & check_flags)))
> >               return false;
> > @@ -1149,7 +1149,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
> >                       bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
> >       }
> >  #ifdef CONFIG_MEMCG
> > -     if (unlikely(page_memcg(page)))
> > +     if (unlikely(page_memcg_charged(page)))
> >               bad_reason = "page still charged to cgroup";
> >  #endif
> >       return bad_reason;
> > --
> > 2.11.0
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [External] Re: [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs
  2021-03-05 19:40   ` Roman Gushchin
@ 2021-03-06  8:03     ` Muchun Song
  0 siblings, 0 replies; 13+ messages in thread
From: Muchun Song @ 2021-03-06  8:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Andrew Morton, Shakeel Butt,
	Vladimir Davydov, LKML, Linux Memory Management List,
	Xiongchun duan

On Sat, Mar 6, 2021 at 3:41 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Mar 03, 2021 at 01:59:15PM +0800, Muchun Song wrote:
> > Since Roman's series "The new cgroup slab memory controller" was
> > applied, all slab objects are charged via the new obj_cgroup APIs.
> > The new APIs introduce a struct obj_cgroup to charge slab objects. It
> > prevents long-living objects from pinning the original memory cgroup
> > in memory. But there are still some corner objects (e.g. allocations
> > larger than order-1 page on SLUB) which are not charged via the new
> > APIs. Those objects (including the pages allocated directly from the
> > buddy allocator) are charged as kmem pages which still hold a
> > reference to the memory cgroup.
> >
> > This patch aims to charge the kmem pages by using the new obj_cgroup
> > APIs. With this change, the page->memcg_data of a kmem page points to
> > an object cgroup. We can use page_objcg() to get the object cgroup
> > associated with a kmem page, or page_memcg_check() to get the memory
> > cgroup associated with it, but the caller must ensure that the
> > returned memcg won't be released (e.g. by acquiring the rcu_read_lock
> > or the css_set_lock).
>
> I believe it's a good direction, but there are still things which
> need to be figured out first.
>
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  include/linux/memcontrol.h |  63 +++++++++++++++++------
> >  mm/memcontrol.c            | 123 +++++++++++++++++++++++++++++++--------------
> >  2 files changed, 133 insertions(+), 53 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 049b80246cbf..5911b9d107b0 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -370,6 +370,18 @@ static inline bool page_memcg_charged(struct page *page)
> >  }
> >
> >  /*
> > + * After the initialization objcg->memcg is always pointing at
> > + * a valid memcg, but can be atomically swapped to the parent memcg.
> > + *
> > + * The caller must ensure that the returned memcg won't be released:
> > + * e.g. acquire the rcu_read_lock or css_set_lock.
> > + */
> > +static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> > +{
> > +     return READ_ONCE(objcg->memcg);
> > +}
> > +
> > +/*
> >   * page_memcg - get the memory cgroup associated with a non-kmem page
> >   * @page: a pointer to the page struct
> >   *
> > @@ -421,9 +433,10 @@ static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> >   * @page: a pointer to the page struct
> >   *
> >   * Returns a pointer to the memory cgroup associated with the page,
> > - * or NULL. This function unlike page_memcg() can take any  page
> > + * or NULL. This function unlike page_memcg() can take any non-kmem page
> >   * as an argument. It has to be used in cases when it's not known if a page
> > - * has an associated memory cgroup pointer or an object cgroups vector.
> > + * has an associated memory cgroup pointer or an object cgroups vector or
> > + * an object cgroup.
> >   *
> >   * Any of the following ensures page and memcg binding stability:
> >   * - the page lock
> > @@ -442,6 +455,17 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
> >       if (memcg_data & MEMCG_DATA_OBJCGS)
> >               return NULL;
> >
> > +     if (memcg_data & MEMCG_DATA_KMEM) {
>
> This is confusing: the comment above says it can't take kmem pages?

It is my fault. I will fix the comment.

>
> > +             struct obj_cgroup *objcg;
> > +
> > +             /*
> > +              * The caller must ensure that the returned memcg won't be
> > +              * released: e.g. acquire the rcu_read_lock or css_set_lock.
> > +              */
> > +             objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > +             return obj_cgroup_memcg(objcg);
> > +     }
> > +
> >       return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>
> Also, the comment about page<->memcg binding stability is not correct anymore.
> Taking the page lock, for example, won't protect the memcg from being released
> if this is a kmem page.

I also have a question: will the page lock of a kmem page be taken
anywhere?

>
> _Maybe_ it's ok to just say that page_memcg_check() requires an rcu lock,
> but I'm not yet quite sure. The calling convention is already complicated,
> we should avoid making it even more complicated, if we can.

In most scenarios, we can tell what kind of page it is, so there are
only 2 users of page_memcg_check(). I think it may be acceptable to
just complicate page_memcg_check() and document that it requires the
rcu lock.
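
The documentation could be as small as an extra note in the helper's
comment block. A sketch of possible wording:

	/*
	 * Note: for a kmem page the returned memcg is only stable under
	 * rcu_read_lock() (or css_set_lock); callers must take their own
	 * reference before dropping the lock.
	 */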

>
> >  }
> >
> > @@ -500,6 +524,24 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
> >       return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> >  }
> >
> > +/*
> > + * page_objcg - get the object cgroup associated with a kmem page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the object cgroup associated with the kmem page,
> > + * or NULL. This function assumes that the page is known to have an
> > + * associated object cgroup. It's only safe to call this function
> > + * against kmem pages (PageMemcgKmem() returns true).
> > + */
> > +static inline struct obj_cgroup *page_objcg(struct page *page)
> > +{
> > +     unsigned long memcg_data = page->memcg_data;
> > +
> > +     VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> > +     VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
> > +
> > +     return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> > +}
> >  #else
> >  static inline struct obj_cgroup **page_objcgs(struct page *page)
> >  {
> > @@ -510,6 +552,11 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page)
> >  {
> >       return NULL;
> >  }
> > +
> > +static inline struct obj_cgroup *page_objcg(struct page *page)
> > +{
> > +     return NULL;
> > +}
> >  #endif
> >
> >  static __always_inline bool memcg_stat_item_in_bytes(int idx)
> > @@ -728,18 +775,6 @@ static inline void obj_cgroup_put(struct obj_cgroup *objcg)
> >       percpu_ref_put(&objcg->refcnt);
> >  }
> >
> > -/*
> > - * After the initialization objcg->memcg is always pointing at
> > - * a valid memcg, but can be atomically swapped to the parent memcg.
> > - *
> > - * The caller must ensure that the returned memcg won't be released:
> > - * e.g. acquire the rcu_read_lock or css_set_lock.
> > - */
> > -static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> > -{
> > -     return READ_ONCE(objcg->memcg);
> > -}
> > -
> >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> >  {
> >       if (memcg)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 86a8db937ec6..0cf342d22547 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -856,10 +856,16 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> >  {
> >       struct page *head = compound_head(page); /* rmap on tail pages */
> >       struct mem_cgroup *memcg;
> > -     pg_data_t *pgdat = page_pgdat(page);
> > +     pg_data_t *pgdat;
> >       struct lruvec *lruvec;
> >
> > -     memcg = page_memcg_check(head);
> > +     if (PageMemcgKmem(head)) {
> > +             __mod_lruvec_kmem_state(page_to_virt(head), idx, val);
> > +             return;
> > +     }
>
> This is a very confusing part: we're converting the page to the virtual address
> to run mem_cgroup_from_obj() inside __mod_lruvec_kmem_state() to get back the page.

OK. I will rework the code here.


>
> > +
> > +     pgdat = page_pgdat(head);
> > +     memcg = page_memcg(head);
> >       /* Untracked pages have no memcg, no lruvec. Update only the node */
> >       if (!memcg) {
> >               __mod_node_page_state(pgdat, idx, val);
> > @@ -3144,18 +3150,18 @@ static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_page
> >   */
> >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> >  {
> > -     struct mem_cgroup *memcg;
> > +     struct obj_cgroup *objcg;
> >       int ret = 0;
> >
> > -     memcg = get_mem_cgroup_from_current();
> > -     if (memcg && !mem_cgroup_is_root(memcg)) {
> > -             ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> > +     objcg = get_obj_cgroup_from_current();
> > +     if (objcg) {
> > +             ret = obj_cgroup_charge_page(objcg, gfp, 1 << order);
> >               if (!ret) {
> > -                     page->memcg_data = (unsigned long)memcg |
> > +                     page->memcg_data = (unsigned long)objcg |
> >                               MEMCG_DATA_KMEM;
> >                       return 0;
> >               }
> > -             css_put(&memcg->css);
> > +             obj_cgroup_put(objcg);
> >       }
> >       return ret;
> >  }
> > @@ -3167,17 +3173,18 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> >   */
> >  void __memcg_kmem_uncharge_page(struct page *page, int order)
> >  {
> > -     struct mem_cgroup *memcg;
> > +     struct obj_cgroup *objcg;
> >       unsigned int nr_pages = 1 << order;
> >
> >       if (!page_memcg_charged(page))
> >               return;
> >
> > -     memcg = page_memcg_check(page);
> > -     VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> > -     __memcg_kmem_uncharge(memcg, nr_pages);
> > +     VM_BUG_ON_PAGE(!PageMemcgKmem(page), page);
> > +
> > +     objcg = page_objcg(page);
> > +     obj_cgroup_uncharge_page(objcg, nr_pages);
> >       page->memcg_data = 0;
> > -     css_put(&memcg->css);
> > +     obj_cgroup_put(objcg);
> >  }
> >
> >  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> > @@ -6794,8 +6801,12 @@ struct uncharge_gather {
> >       struct mem_cgroup *memcg;
> >       unsigned long nr_pages;
> >       unsigned long pgpgout;
> > -     unsigned long nr_kmem;
> >       struct page *dummy_page;
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > +     struct obj_cgroup *objcg;
> > +     unsigned long nr_kmem;
> > +#endif
> >  };
> >
> >  static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> > @@ -6807,12 +6818,21 @@ static void uncharge_batch(const struct uncharge_gather *ug)
> >  {
> >       unsigned long flags;
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +     if (ug->objcg) {
> > +             obj_cgroup_uncharge_page(ug->objcg, ug->nr_kmem);
> > +             /* drop reference from uncharge_kmem_page */
> > +             obj_cgroup_put(ug->objcg);
> > +     }
> > +#endif
>
> Hm, an obvious question here is why do we need to double the ug infrastructure
> if we can just get kmem page's memcg and use the infra for user pages?
> Because ug is holding a reference to memcg, it will not go away.
> Maybe I'm missing something, but it seems that there is a simpler implementation.

Yeah, there is a simpler implementation. I will give it a try. Thanks
for your suggestions.

>
> Thanks!
>
> > +
> > +     if (!ug->memcg)
> > +             return;
> > +
> >       if (!mem_cgroup_is_root(ug->memcg)) {
> >               page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
> >               if (do_memsw_account())
> >                       page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
> > -             if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
> > -                     page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
> >               memcg_oom_recover(ug->memcg);
> >       }
> >
> > @@ -6822,26 +6842,40 @@ static void uncharge_batch(const struct uncharge_gather *ug)
> >       memcg_check_events(ug->memcg, ug->dummy_page);
> >       local_irq_restore(flags);
> >
> > -     /* drop reference from uncharge_page */
> > +     /* drop reference from uncharge_user_page */
> >       css_put(&ug->memcg->css);
> >  }
> >
> > -static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> > +#ifdef CONFIG_MEMCG_KMEM
> > +static void uncharge_kmem_page(struct page *page, struct uncharge_gather *ug)
> >  {
> > -     unsigned long nr_pages;
> > -     struct mem_cgroup *memcg;
> > +     struct obj_cgroup *objcg = page_objcg(page);
> >
> > -     VM_BUG_ON_PAGE(PageLRU(page), page);
> > +     if (ug->objcg != objcg) {
> > +             if (ug->objcg) {
> > +                     uncharge_batch(ug);
> > +                     uncharge_gather_clear(ug);
> > +             }
> > +             ug->objcg = objcg;
> >
> > -     if (!page_memcg_charged(page))
> > -             return;
> > +             /* pairs with obj_cgroup_put in uncharge_batch */
> > +             obj_cgroup_get(ug->objcg);
> > +     }
> > +
> > +     ug->nr_kmem += compound_nr(page);
> > +     page->memcg_data = 0;
> > +     obj_cgroup_put(ug->objcg);
> > +}
> > +#else
> > +static void uncharge_kmem_page(struct page *page, struct uncharge_gather *ug)
> > +{
> > +}
> > +#endif
> > +
> > +static void uncharge_user_page(struct page *page, struct uncharge_gather *ug)
> > +{
> > +     struct mem_cgroup *memcg = page_memcg(page);
> >
> > -     /*
> > -      * Nobody should be changing or seriously looking at
> > -      * page memcg at this point, we have fully exclusive
> > -      * access to the page.
> > -      */
> > -     memcg = page_memcg_check(page);
> >       if (ug->memcg != memcg) {
> >               if (ug->memcg) {
> >                       uncharge_batch(ug);
> > @@ -6852,18 +6886,30 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> >               /* pairs with css_put in uncharge_batch */
> >               css_get(&ug->memcg->css);
> >       }
> > +     ug->pgpgout++;
> > +     ug->dummy_page = page;
> > +
> > +     ug->nr_pages += compound_nr(page);
> > +     page->memcg_data = 0;
> > +     css_put(&ug->memcg->css);
> > +}
> >
> > -     nr_pages = compound_nr(page);
> > -     ug->nr_pages += nr_pages;
> > +static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> > +{
> > +     VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > +     if (!page_memcg_charged(page))
> > +             return;
> > +
> > +     /*
> > +      * Nobody should be changing or seriously looking at
> > +      * page memcg at this point, we have fully exclusive
> > +      * access to the page.
> > +      */
> >       if (PageMemcgKmem(page))
> > -             ug->nr_kmem += nr_pages;
> > +             uncharge_kmem_page(page, ug);
> >       else
> > -             ug->pgpgout++;
> > -
> > -     ug->dummy_page = page;
> > -     page->memcg_data = 0;
> > -     css_put(&ug->memcg->css);
> > +             uncharge_user_page(page, ug);
> >  }
> >
> >  /**
> > @@ -6906,8 +6952,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
> >       uncharge_gather_clear(&ug);
> >       list_for_each_entry(page, page_list, lru)
> >               uncharge_page(page, &ug);
> > -     if (ug.memcg)
> > -             uncharge_batch(&ug);
> > +     uncharge_batch(&ug);
> >  }
> >
> >  /**
> > --
> > 2.11.0
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-03-06  8:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-03  5:59 [PATCH v2 0/5] Use obj_cgroup APIs to charge kmem pages Muchun Song
2021-03-03  5:59 ` [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page Muchun Song
2021-03-05 18:56   ` Roman Gushchin
2021-03-06  5:26     ` [External] " Muchun Song
2021-03-03  5:59 ` [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page Muchun Song
2021-03-05 19:00   ` Roman Gushchin
2021-03-06  6:02     ` [External] " Muchun Song
2021-03-03  5:59 ` [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs Muchun Song
2021-03-05 19:40   ` Roman Gushchin
2021-03-06  8:03     ` [External] " Muchun Song
2021-03-03  5:59 ` [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API Muchun Song
2021-03-05 19:48   ` Roman Gushchin
2021-03-03  5:59 ` [PATCH v2 5/5] mm: memcontrol: use remote objcg charging APIs to charge kernel memory Muchun Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).