linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages
@ 2021-04-09 12:29 Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement Muchun Song
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

Since the following patchsets applied. All the kernel memory are charged
with the new APIs of obj_cgroup.

	[v17,00/19] The new cgroup slab memory controller
	[v5,0/7] Use obj_cgroup APIs to charge kmem pages

But user memory allocations (LRU pages) pinning memcgs for a long time -
it exists at a larger scale and is causing recurring problems in the real
world: page cache doesn't get reclaimed for a long time, or is used by the
second, third, fourth, ... instance of the same job that was restarted into
a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.

We can convert LRU pages and most other raw memcg pins to the objcg direction
to fix this problem, and then the LRU pages will not pin the memcgs.

This patchset aims to make the LRU pages to drop the reference to memory
cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
of the dying cgroups will not increase if we run the following test script.

```bash
#!/bin/bash

cat /proc/cgroups | grep memory

cd /sys/fs/cgroup/memory

for i in range{1..500}
do
	mkdir test
	echo $$ > test/cgroup.procs
	sleep 60 &
	echo $$ > cgroup.procs
	echo `cat test/cgroup.procs` > cgroup.procs
	rmdir test
done

cat /proc/cgroups | grep memory
```

Patch 1 aims to fix page charging in page replacement.
Patch 2-5 are code cleanup and simplification.
Patch 6-18 convert LRU pages pin to the objcg direction.

Any comments are welcome. Thanks.

Changlogs in RFC v2:
  1. Collect Acked-by tags by Johannes. Thanks.
  2. Rework lruvec_holds_page_lru_lock() suggested by Johannes. Thanks.
  3. Fix move_pages_to_lru().

Muchun Song (18):
  mm: memcontrol: fix page charging in page replacement
  mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
  mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
  mm: memcontrol: simplify lruvec_holds_page_lru_lock
  mm: memcontrol: simplify the logic of objcg pinning memcg
  mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
  mm: memcontrol: introduce compact_lock_page_lruvec_irqsave
  mm: memcontrol: make lruvec lock safe when the LRU pages reparented
  mm: vmscan: remove noinline_for_stack
  mm: vmscan: rework move_pages_to_lru()
  mm: thp: introduce lock/unlock_split_queue{_irqsave}()
  mm: thp: make deferred split queue lock safe when the LRU pages
    reparented
  mm: memcontrol: make all the callers of page_memcg() safe
  mm: memcontrol: introduce memcg_reparent_ops
  mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
  mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
  mm: lru: add VM_BUG_ON_PAGE to lru maintenance function
  mm: lru: use lruvec lock to serialize memcg changes

 Documentation/admin-guide/cgroup-v1/memory.rst |   2 +-
 fs/buffer.c                                    |  13 +-
 fs/fs-writeback.c                              |  23 +-
 fs/iomap/buffered-io.c                         |   4 +-
 include/linux/memcontrol.h                     | 216 +++++----
 include/linux/mm_inline.h                      |   6 +
 mm/compaction.c                                |  40 +-
 mm/filemap.c                                   |   2 +-
 mm/huge_memory.c                               | 171 ++++++-
 mm/memcontrol.c                                | 622 ++++++++++++++++++-------
 mm/migrate.c                                   |   4 +
 mm/page-writeback.c                            |  24 +-
 mm/page_io.c                                   |   5 +-
 mm/rmap.c                                      |  14 +-
 mm/swap.c                                      |  48 +-
 mm/vmscan.c                                    |  58 ++-
 mm/workingset.c                                |   2 +-
 17 files changed, 841 insertions(+), 413 deletions(-)

-- 
2.11.0



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-10 17:44   ` Shakeel Butt
  2021-04-09 12:29 ` [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm Muchun Song
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

The pages aren't accounted at the root level, so do not charge the page
to the root memcg in page replacement. Although we do not display the
value (mem_cgroup_usage) so there shouldn't be any actual problem, but
there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
will trigger? So it is better to fix it.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..f229de925aa5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6806,9 +6806,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	/* Force-charge the new page. The old one will be freed soon */
 	nr_pages = thp_nr_pages(newpage);
 
-	page_counter_charge(&memcg->memory, nr_pages);
-	if (do_memsw_account())
-		page_counter_charge(&memcg->memsw, nr_pages);
+	if (!mem_cgroup_is_root(memcg)) {
+		page_counter_charge(&memcg->memory, nr_pages);
+		if (do_memsw_account())
+			page_counter_charge(&memcg->memsw, nr_pages);
+	}
 
 	css_get(&memcg->css);
 	commit_charge(newpage, memcg);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-10 17:47   ` Shakeel Butt
  2021-04-09 12:29 ` [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg Muchun Song
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

When mm is NULL, we do not need to hold rcu lock and call css_tryget for
the root memcg. And we also do not need to check !mm in every loop of
while. So bail out early when !mm.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f229de925aa5..9cbfff59b171 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -901,20 +901,19 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 	if (mem_cgroup_disabled())
 		return NULL;
 
+	/*
+	 * Page cache insertions can happen without an
+	 * actual mm context, e.g. during disk probing
+	 * on boot, loopback IO, acct() writes etc.
+	 */
+	if (unlikely(!mm))
+		return root_mem_cgroup;
+
 	rcu_read_lock();
 	do {
-		/*
-		 * Page cache insertions can happen without an
-		 * actual mm context, e.g. during disk probing
-		 * on boot, loopback IO, acct() writes etc.
-		 */
-		if (unlikely(!mm))
+		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!memcg))
 			memcg = root_mem_cgroup;
-		else {
-			memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
-			if (unlikely(!memcg))
-				memcg = root_mem_cgroup;
-		}
 	} while (!css_tryget(&memcg->css));
 	rcu_read_unlock();
 	return memcg;
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-09 16:27   ` Johannes Weiner
  2021-04-09 12:29 ` [RFC PATCH v2 07/18] mm: memcontrol: introduce compact_lock_page_lruvec_irqsave Muchun Song
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
the css_set_lock. We do not need to care about objcg->memcg being
released in the process of obj_cgroup_release(). So there is no need
to pin memcg before releasing objcg. Remove those pinning logic to
simplfy the code.

There are only two places that modifies the objcg->memcg. One is the
initialization to objcg->memcg in the memcg_online_kmem(), another
is objcgs reparenting in the memcg_reparent_objcgs(). It is also
impossible for the two to run in parallel. So xchg() is unnecessary
and it is enough to use WRITE_ONCE().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f807448233e..90c1ac58c64c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,7 +261,6 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
 	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
-	struct mem_cgroup *memcg;
 	unsigned int nr_bytes;
 	unsigned int nr_pages;
 	unsigned long flags;
@@ -291,11 +290,9 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	nr_pages = nr_bytes >> PAGE_SHIFT;
 
 	spin_lock_irqsave(&css_set_lock, flags);
-	memcg = obj_cgroup_memcg(objcg);
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
 	list_del(&objcg->list);
-	mem_cgroup_put(memcg);
 	spin_unlock_irqrestore(&css_set_lock, flags);
 
 	percpu_ref_exit(ref);
@@ -330,17 +327,14 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
 
 	spin_lock_irq(&css_set_lock);
 
-	/* Move active objcg to the parent's list */
-	xchg(&objcg->memcg, parent);
-	css_get(&parent->css);
-	list_add(&objcg->list, &parent->objcg_list);
+	/* 1) Ready to reparent active objcg. */
+	list_add(&objcg->list, &memcg->objcg_list);
 
-	/* Move already reparented objcgs to the parent's list */
-	list_for_each_entry(iter, &memcg->objcg_list, list) {
-		css_get(&parent->css);
-		xchg(&iter->memcg, parent);
-		css_put(&memcg->css);
-	}
+	/* 2) Reparent active objcg and already reparented objcgs to parent. */
+	list_for_each_entry(iter, &memcg->objcg_list, list)
+		WRITE_ONCE(iter->memcg, parent);
+
+	/* 3) Move already reparented objcgs to the parent's list */
 	list_splice(&memcg->objcg_list, &parent->objcg_list);
 
 	spin_unlock_irq(&css_set_lock);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 07/18] mm: memcontrol: introduce compact_lock_page_lruvec_irqsave
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
                   ` (2 preceding siblings ...)
  2021-04-09 12:29 ` [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack Muchun Song
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

If we reuse the objcg APIs to charge LRU pages, the page_memcg()
can be changed when the LRU pages reparented. In this case, we need
to acquire the new lruvec lock.

    lruvec = mem_cgroup_page_lruvec(page);

    // The page is reparented.

    compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);

    // Acquired the wrong lruvec lock and need to retry.

But compact_lock_irqsave() only take lruvec lock as the parameter,
we cannot aware this change. If it can take the page as parameter
to acquire the lruvec lock. When the page memcg is changed, we can
use the page_memcg() detect whether we need to reacquire the new
lruvec lock. So compact_lock_irqsave() is not suitable for us.
Similar to lock_page_lruvec_irqsave(), introduce
compact_lock_page_lruvec_irqsave() to acquire the lruvec lock in
the compaction routine.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/compaction.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e7da342003dd..c9efe3542b0a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -511,6 +511,29 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
 	return true;
 }
 
+static struct lruvec *
+compact_lock_page_lruvec_irqsave(struct page *page, unsigned long *flags,
+				 struct compact_control *cc)
+{
+	struct lruvec *lruvec;
+
+	lruvec = mem_cgroup_page_lruvec(page);
+
+	/* Track if the lock is contended in async mode */
+	if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
+		if (spin_trylock_irqsave(&lruvec->lru_lock, *flags))
+			goto out;
+
+		cc->contended = true;
+	}
+
+	spin_lock_irqsave(&lruvec->lru_lock, *flags);
+out:
+	lruvec_memcg_debug(lruvec, page);
+
+	return lruvec;
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. The lock should be periodically unlocked to avoid
@@ -1040,10 +1063,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			if (locked)
 				unlock_page_lruvec_irqrestore(locked, flags);
 
-			compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
-			locked = lruvec;
-
-			lruvec_memcg_debug(lruvec, page);
+			locked = compact_lock_page_lruvec_irqsave(page, &flags, cc);
+			lruvec = locked;
 
 			/* Try get exclusive access under lock */
 			if (!skip_updated) {
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
                   ` (3 preceding siblings ...)
  2021-04-09 12:29 ` [RFC PATCH v2 07/18] mm: memcontrol: introduce compact_lock_page_lruvec_irqsave Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-09 18:31   ` Johannes Weiner
  2021-04-10  0:49   ` Roman Gushchin
  2021-04-09 12:29 ` [RFC PATCH v2 10/18] mm: vmscan: rework move_pages_to_lru() Muchun Song
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
set up pagevec as late as possible in shrink_inactive_list()"), its
purpose is to delay the allocation of pagevec as late as possible to
save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
reclaim stack") replace pagevecs by lists of pages_to_free. So we do
not need noinline_for_stack, just remove it (let the compiler decide
whether to inline).

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64bf07cc20f2..e40b21298d77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2015,8 +2015,8 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  *
  * Returns the number of pages moved to the given lruvec.
  */
-static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
-						     struct list_head *list)
+static unsigned int move_pages_to_lru(struct lruvec *lruvec,
+				      struct list_head *list)
 {
 	int nr_pages, nr_moved = 0;
 	LIST_HEAD(pages_to_free);
@@ -2096,7 +2096,7 @@ static int current_may_throttle(void)
  * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
-static noinline_for_stack unsigned long
+static unsigned long
 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		     struct scan_control *sc, enum lru_list lru)
 {
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 10/18] mm: vmscan: rework move_pages_to_lru()
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
                   ` (4 preceding siblings ...)
  2021-04-09 12:29 ` [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 12/18] mm: thp: make deferred split queue lock safe when the LRU pages reparented Muchun Song
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

In the later patch, we will reparent the LRU pages. The pages which will
move to appropriate LRU list can be reparented during the process of the
move_pages_to_lru(). So holding a lruvec lock by the caller is wrong, we
should use the more general interface of relock_page_lruvec_irq() to
acquire the correct lruvec lock.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e40b21298d77..4431007825ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2013,23 +2013,27 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
  * move_pages_to_lru() moves pages from private @list to appropriate LRU list.
  * On return, @list is reused as a list of pages to be freed by the caller.
  *
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate LRU list.
+ *
+ * Note: The caller must not hold any lruvec lock.
  */
-static unsigned int move_pages_to_lru(struct lruvec *lruvec,
-				      struct list_head *list)
+static unsigned int move_pages_to_lru(struct list_head *list)
 {
-	int nr_pages, nr_moved = 0;
+	int nr_moved = 0;
+	struct lruvec *lruvec = NULL;
 	LIST_HEAD(pages_to_free);
-	struct page *page;
 
 	while (!list_empty(list)) {
-		page = lru_to_page(list);
+		int nr_pages;
+		struct page *page = lru_to_page(list);
+
+		lruvec = relock_page_lruvec_irq(page, lruvec);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
 		if (unlikely(!page_evictable(page))) {
-			spin_unlock_irq(&lruvec->lru_lock);
+			unlock_page_lruvec_irq(lruvec);
 			putback_lru_page(page);
-			spin_lock_irq(&lruvec->lru_lock);
+			lruvec = NULL;
 			continue;
 		}
 
@@ -2050,19 +2054,15 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
 			__clear_page_lru_flags(page);
 
 			if (unlikely(PageCompound(page))) {
-				spin_unlock_irq(&lruvec->lru_lock);
+				unlock_page_lruvec_irq(lruvec);
 				destroy_compound_page(page);
-				spin_lock_irq(&lruvec->lru_lock);
+				lruvec = NULL;
 			} else
 				list_add(&page->lru, &pages_to_free);
 
 			continue;
 		}
 
-		/*
-		 * All pages were isolated from the same lruvec (and isolation
-		 * inhibits memcg migration).
-		 */
 		VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
 		add_page_to_lru_list(page, lruvec);
 		nr_pages = thp_nr_pages(page);
@@ -2071,6 +2071,8 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
 			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
+	if (lruvec)
+		unlock_page_lruvec_irq(lruvec);
 	/*
 	 * To save our caller's stack, now use input list for pages to free.
 	 */
@@ -2144,16 +2146,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false);
 
-	spin_lock_irq(&lruvec->lru_lock);
-	move_pages_to_lru(lruvec, &page_list);
+	move_pages_to_lru(&page_list);
 
+	local_irq_disable();
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
-	spin_unlock_irq(&lruvec->lru_lock);
+	local_irq_enable();
 
 	lru_note_cost(lruvec, file, stat.nr_pageout);
 	mem_cgroup_uncharge_list(&page_list);
@@ -2280,18 +2282,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	/*
 	 * Move pages back to the lru list.
 	 */
-	spin_lock_irq(&lruvec->lru_lock);
-
-	nr_activate = move_pages_to_lru(lruvec, &l_active);
-	nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
+	nr_activate = move_pages_to_lru(&l_active);
+	nr_deactivate = move_pages_to_lru(&l_inactive);
 	/* Keep all free pages in l_active list */
 	list_splice(&l_inactive, &l_active);
 
+	local_irq_disable();
 	__count_vm_events(PGDEACTIVATE, nr_deactivate);
 	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
-
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	spin_unlock_irq(&lruvec->lru_lock);
+	local_irq_enable();
 
 	mem_cgroup_uncharge_list(&l_active);
 	free_unref_page_list(&l_active);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 12/18] mm: thp: make deferred split queue lock safe when the LRU pages reparented
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
                   ` (5 preceding siblings ...)
  2021-04-09 12:29 ` [RFC PATCH v2 10/18] mm: vmscan: rework move_pages_to_lru() Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
  2021-04-09 12:29 ` [RFC PATCH v2 14/18] mm: memcontrol: introduce memcg_reparent_ops Muchun Song
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

Similar to lruvec lock, we use the same approach to make the lock safe
when the LRU pages reparented.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/huge_memory.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 275dbfc8b2ae..aa5d7b72d5fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -497,6 +497,8 @@ static struct deferred_split *lock_split_queue(struct page *page)
 	struct deferred_split *queue;
 	struct mem_cgroup *memcg;
 
+	rcu_read_lock();
+retry:
 	memcg = page_memcg(compound_head(page));
 	if (memcg)
 		queue = &memcg->deferred_split_queue;
@@ -504,6 +506,17 @@ static struct deferred_split *lock_split_queue(struct page *page)
 		queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
 	spin_lock(&queue->split_queue_lock);
 
+	if (unlikely(memcg != page_memcg(page))) {
+		spin_unlock(&queue->split_queue_lock);
+		goto retry;
+	}
+
+	/*
+	 * Preemption is disabled in the internal of spin_lock, which can serve
+	 * as RCU read-side critical sections.
+	 */
+	rcu_read_unlock();
+
 	return queue;
 }
 
@@ -513,6 +526,8 @@ static struct deferred_split *lock_split_queue_irqsave(struct page *page,
 	struct deferred_split *queue;
 	struct mem_cgroup *memcg;
 
+	rcu_read_lock();
+retry:
 	memcg = page_memcg(compound_head(page));
 	if (memcg)
 		queue = &memcg->deferred_split_queue;
@@ -520,6 +535,14 @@ static struct deferred_split *lock_split_queue_irqsave(struct page *page,
 		queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
 
+	if (unlikely(memcg != page_memcg(page))) {
+		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
+		goto retry;
+	}
+
+	/* See the comments in lock_split_queue(). */
+	rcu_read_unlock();
+
 	return queue;
 }
 #else
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH v2 14/18] mm: memcontrol: introduce memcg_reparent_ops
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
                   ` (6 preceding siblings ...)
  2021-04-09 12:29 ` [RFC PATCH v2 12/18] mm: thp: make deferred split queue lock safe when the LRU pages reparented Muchun Song
@ 2021-04-09 12:29 ` Muchun Song
       [not found] ` <20210409122959.82264-5-songmuchun@bytedance.com>
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-09 12:29 UTC (permalink / raw)
  To: guro, hannes, mhocko, akpm, shakeelb, vdavydov.dev
  Cc: linux-kernel, linux-mm, duanxiongchun, fam.zheng, bsingharora,
	shy828301, alex.shi, Muchun Song

In the previous patch, we know how to make the lruvec lock safe when the
LRU pages reparented. We should do something like following.

    memcg_reparent_objcgs(memcg)
        1) lock
        // lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
        spin_lock(&lruvec->lru_lock);
        spin_lock(&lruvec_parent->lru_lock);

        2) do reparent
        // Move all the pages from the lruvec list to the parent lruvec list.

        3) unlock
        spin_unlock(&lruvec_parent->lru_lock);
        spin_unlock(&lruvec->lru_lock);

Apart from the page lruvec lock, the deferred split queue lock (THP only)
also needs to do something similar. So we extracted the necessary 3 steps
in the memcg_reparent_objcgs().

    memcg_reparent_objcgs(memcg)
        1) lock
        memcg_reparent_ops->lock(memcg, parent);

        2) reparent
        memcg_reparent_ops->reparent(memcg, reparent);

        3) unlock
        memcg_reparent_ops->unlock(memcg, reparent);

Now there are two different locks (e.g. lruvec lock and deferred split
queue lock) need to use this infrastructure. In the next patch, we will
use those APIs to make those locks safe when the LRU pages reparented.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memcontrol.h | 11 +++++++++++
 mm/memcontrol.c            | 49 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7e15be2bd47a..fdc385ecff55 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -355,6 +355,17 @@ struct mem_cgroup {
 	/* WARNING: nodeinfo must be the last member here */
 };
 
+struct memcg_reparent_ops {
+	struct list_head list;
+
+	/* Irq is disabled before calling those functions. */
+	void (*lock)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+	void (*unlock)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+	void (*reparent)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+};
+
+void __init register_memcg_repatent(struct memcg_reparent_ops *ops);
+
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f4fcb182883..536be7bdc98f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -338,6 +338,41 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
 	return objcg;
 }
 
+static LIST_HEAD(reparent_ops_head);
+
+static void memcg_reparent_lock(struct mem_cgroup *memcg,
+				struct mem_cgroup *parent)
+{
+	struct memcg_reparent_ops *ops;
+
+	list_for_each_entry(ops, &reparent_ops_head, list)
+		ops->lock(memcg, parent);
+}
+
+static void memcg_reparent_unlock(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	struct memcg_reparent_ops *ops;
+
+	list_for_each_entry(ops, &reparent_ops_head, list)
+		ops->unlock(memcg, parent);
+}
+
+static void memcg_do_reparent(struct mem_cgroup *memcg,
+			      struct mem_cgroup *parent)
+{
+	struct memcg_reparent_ops *ops;
+
+	list_for_each_entry(ops, &reparent_ops_head, list)
+		ops->reparent(memcg, parent);
+}
+
+void __init register_memcg_repatent(struct memcg_reparent_ops *ops)
+{
+	BUG_ON(!ops->lock || !ops->unlock || !ops->reparent);
+	list_add(&ops->list, &reparent_ops_head);
+}
+
 static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
 {
 	struct obj_cgroup *objcg, *iter;
@@ -347,9 +382,13 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
 	if (!parent)
 		parent = root_mem_cgroup;
 
+	local_irq_disable();
+
+	memcg_reparent_lock(memcg, parent);
+
 	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
 
-	spin_lock_irq(&css_set_lock);
+	spin_lock(&css_set_lock);
 
 	/* 1) Ready to reparent active objcg. */
 	list_add(&objcg->list, &memcg->objcg_list);
@@ -361,7 +400,13 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
 	/* 3) Move already reparented objcgs to the parent's list */
 	list_splice(&memcg->objcg_list, &parent->objcg_list);
 
-	spin_unlock_irq(&css_set_lock);
+	spin_unlock(&css_set_lock);
+
+	memcg_do_reparent(memcg, parent);
+
+	memcg_reparent_unlock(memcg, parent);
+
+	local_irq_enable();
 
 	percpu_ref_kill(&objcg->refcnt);
 }
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 04/18] mm: memcontrol: simplify lruvec_holds_page_lru_lock
       [not found] ` <20210409122959.82264-5-songmuchun@bytedance.com>
@ 2021-04-09 16:00   ` Johannes Weiner
  2021-04-11  5:37     ` [External] " Muchun Song
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2021-04-09 16:00 UTC (permalink / raw)
  To: Muchun Song
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

If you haven't done so yet, please make sure to explicitly test with
all three config combinations, just because the dummy abstractions for
memcg disabled or compiled out tend to be paper thin and don't always
behave the way you might expect when you do more complicated things.

Something like

boot
echo sparsefile >/dev/null (> ram size to fill memory and reclaim)
echo 1 >/proc/sys/vm/compact_memory

should exercise this new function in a couple of important scenarios.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg
  2021-04-09 12:29 ` [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg Muchun Song
@ 2021-04-09 16:27   ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2021-04-09 16:27 UTC (permalink / raw)
  To: Muchun Song
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 08:29:46PM +0800, Muchun Song wrote:
> The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
> the css_set_lock. We do not need to care about objcg->memcg being
> released in the process of obj_cgroup_release(). So there is no need
> to pin memcg before releasing objcg. Remove those pinning logic to
> simplfy the code.

Hm yeah, it's not clear to me why inherited objcgs pinned the memcg in
the first place, since they are reparented during memcg deletion and
therefor have no actual impact on the memcg's lifetime.

> There are only two places that modifies the objcg->memcg. One is the
> initialization to objcg->memcg in the memcg_online_kmem(), another
> is objcgs reparenting in the memcg_reparent_objcgs(). It is also
> impossible for the two to run in parallel. So xchg() is unnecessary
> and it is enough to use WRITE_ONCE().

Good catch.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Looks like a nice cleanup / simplification:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 06/18] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
       [not found] ` <20210409122959.82264-7-songmuchun@bytedance.com>
@ 2021-04-09 16:56   ` Johannes Weiner
  2021-04-11  6:24     ` [External] " Muchun Song
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2021-04-09 16:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer.
> 
> Therefore, the infrastructure of objcg no longer only serves
> CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> can reuse it to charge pages.

Just an observation on this:

We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
point. It used to be an optional feature, but nowadays it's not
configurable anymore, and always on unless slob is configured.

We've also added more than just slab accounting to it, like kernel
stack pages, and it all gets disabled on slob configs just because it
doesn't support slab object tracking.

We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.

But that's beyond the scope of your patch series, so I'm also okay
with this patch here.

> We know that the LRU pages are not accounted at the root level. But the
> page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> dose not have an object cgroup. If we use obj_cgroup APIs to charge the
> LRU pages, we should set the page->memcg_data to a root object cgroup. So
> we also allocate an object cgroup for the root_mem_cgroup and introduce
> root_obj_cgroup to cache its value just like root_mem_cgroup.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Overall, the patch makes sense to me. A few comments below:

> @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  	return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
>  }
>  
> -#ifdef CONFIG_MEMCG_KMEM
>  extern spinlock_t css_set_lock;
>  
> +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> +{
  > +	return objcg == root_obj_cgroup;
> +}

This function, and by extension root_obj_cgroup, aren't used by this
patch. Please move them to the patch that adds users for them.

> @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>  	percpu_ref_exit(ref);
>  	kfree_rcu(objcg, rcu);
>  }
> +#else
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&css_set_lock, flags);
> +	list_del(&objcg->list);
> +	spin_unlock_irqrestore(&css_set_lock, flags);
> +
> +	percpu_ref_exit(ref);
> +	kfree_rcu(objcg, rcu);
> +}
> +#endif

Having two separate functions for if and else is good when the else
branch is a completely empty dummy function. In this case you end up
duplicating code, so it's better to have just one function and put the
ifdef around the nr_charged_bytes handling in it.

> @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
>  	return objcg;
>  }
>  
> -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> -				  struct mem_cgroup *parent)
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
>  {
>  	struct obj_cgroup *objcg, *iter;
> +	struct mem_cgroup *parent;
> +
> +	parent = parent_mem_cgroup(memcg);
> +	if (!parent)
> +		parent = root_mem_cgroup;
>  
>  	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
>  
> @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
>  	percpu_ref_kill(&objcg->refcnt);
>  }
>  
> +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	objcg = obj_cgroup_alloc();
> +	if (!objcg)
> +		return -ENOMEM;
> +
> +	objcg->memcg = memcg;
> +	rcu_assign_pointer(memcg->objcg, objcg);
> +
> +	return 0;
> +}
> +
> +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> +{
> +	if (unlikely(memcg->objcg))
> +		memcg_reparent_objcgs(memcg);
> +}

It's confusing to have a 'free' function not free the object it's
called on.

But rather than search for a fitting name, I think it might be better
to just fold both of these short functions into their only callsites.

Also, since memcg->objcg is reparented, and the pointer cleared, on
offlining, when could this ever be non-NULL? This deserves a comment.

> @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> -	struct obj_cgroup *objcg;
>  	int memcg_id;
>  
>  	if (cgroup_memory_nokmem)
> @@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
>  	if (memcg_id < 0)
>  		return memcg_id;
>  
> -	objcg = obj_cgroup_alloc();
> -	if (!objcg) {
> -		memcg_free_cache_id(memcg_id);
> -		return -ENOMEM;
> -	}
> -	objcg->memcg = memcg;
> -	rcu_assign_pointer(memcg->objcg, objcg);
> -
>  	static_branch_enable(&memcg_kmem_enabled_key);
>  
>  	memcg->kmemcg_id = memcg_id;
> @@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
>  	if (!parent)
>  		parent = root_mem_cgroup;
>  
> -	memcg_reparent_objcgs(memcg, parent);
> +	memcg_reparent_objcgs(memcg);

Since the objcg is no longer tied to kmem, this should move to
mem_cgroup_css_offline() instead.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack
  2021-04-09 12:29 ` [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack Muchun Song
@ 2021-04-09 18:31   ` Johannes Weiner
  2021-04-10  4:33     ` [External] " Muchun Song
  2021-04-10  0:49   ` Roman Gushchin
  1 sibling, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2021-04-09 18:31 UTC (permalink / raw)
  To: Muchun Song
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Good catch.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Since this patch is somewhat independent of the rest of the series,
you may want to put it in the very beginning, or even submit it
separately, to keep the main series as compact as possible. Reviewers
can be more hesitant to get involved with larger series ;)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack
  2021-04-09 12:29 ` [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack Muchun Song
  2021-04-09 18:31   ` Johannes Weiner
@ 2021-04-10  0:49   ` Roman Gushchin
  1 sibling, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2021-04-10  0:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Acked-by: Roman Gushchin <guro@fb.com>

> ---
>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64bf07cc20f2..e40b21298d77 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2015,8 +2015,8 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>   *
>   * Returns the number of pages moved to the given lruvec.
>   */
> -static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> -						     struct list_head *list)
> +static unsigned int move_pages_to_lru(struct lruvec *lruvec,
> +				      struct list_head *list)
>  {
>  	int nr_pages, nr_moved = 0;
>  	LIST_HEAD(pages_to_free);
> @@ -2096,7 +2096,7 @@ static int current_may_throttle(void)
>   * shrink_inactive_list() is a helper for shrink_node().  It returns the number
>   * of reclaimed pages
>   */
> -static noinline_for_stack unsigned long
> +static unsigned long
>  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  		     struct scan_control *sc, enum lru_list lru)
>  {
> -- 
> 2.11.0
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages
  2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
                   ` (9 preceding siblings ...)
       [not found] ` <20210409122959.82264-7-songmuchun@bytedance.com>
@ 2021-04-10  1:29 ` Roman Gushchin
  2021-04-12 17:14   ` Johannes Weiner
  10 siblings, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2021-04-10  1:29 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> Since the following patchsets applied. All the kernel memory are charged
> with the new APIs of obj_cgroup.
> 
> 	[v17,00/19] The new cgroup slab memory controller
> 	[v5,0/7] Use obj_cgroup APIs to charge kmem pages
> 
> But user memory allocations (LRU pages) pinning memcgs for a long time -
> it exists at a larger scale and is causing recurring problems in the real
> world: page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted into
> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg direction
> to fix this problem, and then the LRU pages will not pin the memcgs.
> 
> This patchset aims to make the LRU pages to drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the following test script.
> 
> ```bash
> #!/bin/bash
> 
> cat /proc/cgroups | grep memory
> 
> cd /sys/fs/cgroup/memory
> 
> for i in range{1..500}
> do
> 	mkdir test
> 	echo $$ > test/cgroup.procs
> 	sleep 60 &
> 	echo $$ > cgroup.procs
> 	echo `cat test/cgroup.procs` > cgroup.procs
> 	rmdir test
> done
> 
> cat /proc/cgroups | grep memory
> ```
> 
> Patch 1 aims to fix page charging in page replacement.
> Patch 2-5 are code cleanup and simplification.
> Patch 6-18 convert LRU pages pin to the objcg direction.
> 
> Any comments are welcome. Thanks.

Indeed the problem exists for a long time and it would be nice to fix it.
However I'm against merging the patchset in the current form (there are some
nice fixes/clean-ups, which can/must be applied independently). Let me explain
my concerns:

Back to the new slab controller discussion obj_cgroup was suggested by Johannes
as a union of two concepts:
1) reparenting (basically an auto-pointer to a memcg in c++ terms)
2) byte-sized accounting

I was initially against this union because I anticipated that the reparenting
part will be useful separately. And the time told it was true.

I still think obj_cgroup API must be significantly reworked before being
applied outside of the kmem area: reparenting part must be separated
and moved to the cgroup core level to be used not only in the memcg
context but also for other controllers, which are facing similar problems.
Spilling obj_cgroup API in the current form over all memcg code will
make it more complicated and will delay it, given the amount of changes
and the number of potential code conflicts.

I'm working on the generalization of obj_cgroup API (as described above)
and expect to have some patches next week.

Thanks!


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack
  2021-04-09 18:31   ` Johannes Weiner
@ 2021-04-10  4:33     ` Muchun Song
  2021-04-10 17:48       ` Shakeel Butt
  0 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2021-04-10  4:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Sat, Apr 10, 2021 at 2:31 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> > The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> > set up pagevec as late as possible in shrink_inactive_list()"), its
> > purpose is to delay the allocation of pagevec as late as possible to
> > save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> > reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> > not need noinline_for_stack, just remove it (let the compiler decide
> > whether to inline).
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Good catch.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Since this patch is somewhat independent of the rest of the series,
> you may want to put it in the very beginning, or even submit it
> separately, to keep the main series as compact as possible. Reviewers
> can be more hesitant to get involved with larger series ;)

OK. I will gather all the cleanup patches into a separate series.
Thanks for your suggestion.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement
  2021-04-09 12:29 ` [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement Muchun Song
@ 2021-04-10 17:44   ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2021-04-10 17:44 UTC (permalink / raw)
  To: Muchun Song
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Andrew Morton,
	Vladimir Davydov, LKML, Linux MM, Xiongchun duan, fam.zheng,
	Balbir Singh, Yang Shi, Alex Shi

On Fri, Apr 9, 2021 at 5:32 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> The pages aren't accounted at the root level, so do not charge the page
> to the root memcg in page replacement. Although we do not display the
> value (mem_cgroup_usage) so there shouldn't be any actual problem, but
> there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
> will trigger? So it is better to fix it.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
  2021-04-09 12:29 ` [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm Muchun Song
@ 2021-04-10 17:47   ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2021-04-10 17:47 UTC (permalink / raw)
  To: Muchun Song
  Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Andrew Morton,
	Vladimir Davydov, LKML, Linux MM, Xiongchun duan, fam.zheng,
	Balbir Singh, Yang Shi, Alex Shi, Dan Schatzberg

On Fri, Apr 9, 2021 at 5:32 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> the root memcg. And we also do not need to check !mm in every loop of
> while. So bail out early when !mm.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack
  2021-04-10  4:33     ` [External] " Muchun Song
@ 2021-04-10 17:48       ` Shakeel Butt
  0 siblings, 0 replies; 23+ messages in thread
From: Shakeel Butt @ 2021-04-10 17:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: Johannes Weiner, Roman Gushchin, Michal Hocko, Andrew Morton,
	Vladimir Davydov, LKML, Linux MM, Xiongchun duan, fam.zheng,
	Balbir Singh, Yang Shi, Alex Shi

On Fri, Apr 9, 2021 at 9:34 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Sat, Apr 10, 2021 at 2:31 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> > > The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> > > set up pagevec as late as possible in shrink_inactive_list()"), its
> > > purpose is to delay the allocation of pagevec as late as possible to
> > > save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> > > reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> > > not need noinline_for_stack, just remove it (let the compiler decide
> > > whether to inline).
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >
> > Good catch.
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > Since this patch is somewhat independent of the rest of the series,
> > you may want to put it in the very beginning, or even submit it
> > separately, to keep the main series as compact as possible. Reviewers
> > can be more hesitant to get involved with larger series ;)
>
> OK. I will gather all the cleanup patches into a separate series.
> Thanks for your suggestion.

That would be best.

For this patch:

Reviewed-by: Shakeel Butt <shakeelb@google.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] Re: [RFC PATCH v2 04/18] mm: memcontrol: simplify lruvec_holds_page_lru_lock
  2021-04-09 16:00   ` [RFC PATCH v2 04/18] mm: memcontrol: simplify lruvec_holds_page_lru_lock Johannes Weiner
@ 2021-04-11  5:37     ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-11  5:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Sat, Apr 10, 2021 at 12:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> > We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> > do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> > lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> > page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> > of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> > to simplify the code. We can have a single definition for this function
> > that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> > CONFIG_MEMCG.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Looks good to me.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for your review.

>
> If you haven't done so yet, please make sure to explicitly test with
> all three config combinations, just because the dummy abstractions for
> memcg disabled or compiled out tend to be paper thin and don't always
> behave the way you might expect when you do more complicated things.

I have tested. There is no problem. Thanks :-)


>
> Something like
>
> boot
> echo sparsefile >/dev/null (> ram size to fill memory and reclaim)
> echo 1 >/proc/sys/vm/compact_memory
>
> should exercise this new function in a couple of important scenarios.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] Re: [RFC PATCH v2 06/18] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
  2021-04-09 16:56   ` [RFC PATCH v2 06/18] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM Johannes Weiner
@ 2021-04-11  6:24     ` Muchun Song
  0 siblings, 0 replies; 23+ messages in thread
From: Muchun Song @ 2021-04-11  6:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: guro, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Sat, Apr 10, 2021 at 12:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> > Because memory allocations pinning memcgs for a long time - it exists
> > at a larger scale and is causing recurring problems in the real world:
> > page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted
> > into a new cgroup every time. Unreclaimable dying cgroups pile up,
> > waste memory, and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg
> > direction to fix this problem, and then the page->memcg will always
> > point to an object cgroup pointer.
> >
> > Therefore, the infrastructure of objcg no longer only serves
> > CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> > objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> > can reuse it to charge pages.
>
> Just an observation on this:
>
> We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
> point. It used to be an optional feature, but nowadays it's not
> configurable anymore, and always on unless slob is configured.
>
> We've also added more than just slab accounting to it, like kernel
> stack pages, and it all gets disabled on slob configs just because it
> doesn't support slab object tracking.
>
> We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
> places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.
>
> But that's beyond the scope of your patch series, so I'm also okay
> with this patch here.
>
> > We know that the LRU pages are not accounted at the root level. But the
> > page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> > of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> > dose not have an object cgroup. If we use obj_cgroup APIs to charge the
> > LRU pages, we should set the page->memcg_data to a root object cgroup. So
> > we also allocate an object cgroup for the root_mem_cgroup and introduce
> > root_obj_cgroup to cache its value just like root_mem_cgroup.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Overall, the patch makes sense to me. A few comments below:
>
> > @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> >       return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
> >  }
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> >  extern spinlock_t css_set_lock;
> >
> > +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> > +{
>   > +   return objcg == root_obj_cgroup;
> > +}
>
> This function, and by extension root_obj_cgroup, aren't used by this
> patch. Please move them to the patch that adds users for them.

OK. Will do.

>
> > @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> >       percpu_ref_exit(ref);
> >       kfree_rcu(objcg, rcu);
> >  }
> > +#else
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> > +{
> > +     struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > +     unsigned long flags;
> > +
> > +     spin_lock_irqsave(&css_set_lock, flags);
> > +     list_del(&objcg->list);
> > +     spin_unlock_irqrestore(&css_set_lock, flags);
> > +
> > +     percpu_ref_exit(ref);
> > +     kfree_rcu(objcg, rcu);
> > +}
> > +#endif
>
> Having two separate functions for if and else is good when the else
> branch is a completely empty dummy function. In this case you end up
> duplicating code, so it's better to have just one function and put the
> ifdef around the nr_charged_bytes handling in it.

Make sense. I will rework the code here.

>
> > @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> >       return objcg;
> >  }
> >
> > -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > -                               struct mem_cgroup *parent)
> > +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> >  {
> >       struct obj_cgroup *objcg, *iter;
> > +     struct mem_cgroup *parent;
> > +
> > +     parent = parent_mem_cgroup(memcg);
> > +     if (!parent)
> > +             parent = root_mem_cgroup;
> >
> >       objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> >
> > @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> >       percpu_ref_kill(&objcg->refcnt);
> >  }
> >
> > +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> > +{
> > +     struct obj_cgroup *objcg;
> > +
> > +     objcg = obj_cgroup_alloc();
> > +     if (!objcg)
> > +             return -ENOMEM;
> > +
> > +     objcg->memcg = memcg;
> > +     rcu_assign_pointer(memcg->objcg, objcg);
> > +
> > +     return 0;
> > +}
> > +
> > +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> > +{
> > +     if (unlikely(memcg->objcg))
> > +             memcg_reparent_objcgs(memcg);
> > +}
>
> It's confusing to have a 'free' function not free the object it's
> called on.
>
> But rather than search for a fitting name, I think it might be better
> to just fold both of these short functions into their only callsites.

OK. Will do.

>
> Also, since memcg->objcg is reparented, and the pointer cleared, on
> offlining, when could this ever be non-NULL? This deserves a comment.

css_alloc() failed, offlining didn't happen. In this case, memcg->objcg
could be non-NULL (Just like memcg_free_kmem() dose). I will move
memcg_obj_cgroup_alloc() to the mem_cgroup_css_online() so that
we do not need memcg_obj_cgroup_free.

>
> > @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> >  #ifdef CONFIG_MEMCG_KMEM
> >  static int memcg_online_kmem(struct mem_cgroup *memcg)
> >  {
> > -     struct obj_cgroup *objcg;
> >       int memcg_id;
> >
> >       if (cgroup_memory_nokmem)
> > @@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
> >       if (memcg_id < 0)
> >               return memcg_id;
> >
> > -     objcg = obj_cgroup_alloc();
> > -     if (!objcg) {
> > -             memcg_free_cache_id(memcg_id);
> > -             return -ENOMEM;
> > -     }
> > -     objcg->memcg = memcg;
> > -     rcu_assign_pointer(memcg->objcg, objcg);
> > -
> >       static_branch_enable(&memcg_kmem_enabled_key);
> >
> >       memcg->kmemcg_id = memcg_id;
> > @@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
> >       if (!parent)
> >               parent = root_mem_cgroup;
> >
> > -     memcg_reparent_objcgs(memcg, parent);
> > +     memcg_reparent_objcgs(memcg);
>
> Since the objcg is no longer tied to kmem, this should move to
> mem_cgroup_css_offline() instead.

LGTM, will do.

Thanks for your all suggestions.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages
  2021-04-10  1:29 ` [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Roman Gushchin
@ 2021-04-12 17:14   ` Johannes Weiner
  2021-04-12 17:45     ` Roman Gushchin
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2021-04-12 17:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Muchun Song, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> > 
> > 	[v17,00/19] The new cgroup slab memory controller
> > 	[v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > 
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> > 
> > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> > 
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> > 
> > ```bash
> > #!/bin/bash
> > 
> > cat /proc/cgroups | grep memory
> > 
> > cd /sys/fs/cgroup/memory
> > 
> > for i in range{1..500}
> > do
> > 	mkdir test
> > 	echo $$ > test/cgroup.procs
> > 	sleep 60 &
> > 	echo $$ > cgroup.procs
> > 	echo `cat test/cgroup.procs` > cgroup.procs
> > 	rmdir test
> > done
> > 
> > cat /proc/cgroups | grep memory
> > ```
> > 
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-18 convert LRU pages pin to the objcg direction.
> > 
> > Any comments are welcome. Thanks.
> 
> Indeed the problem exists for a long time and it would be nice to fix it.
> However I'm against merging the patchset in the current form (there are some
> nice fixes/clean-ups, which can/must be applied independently). Let me explain
> my concerns:
> 
> Back to the new slab controller discussion obj_cgroup was suggested by Johannes
> as a union of two concepts:
> 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> 2) byte-sized accounting
> 
> I was initially against this union because I anticipated that the reparenting
> part will be useful separately. And the time told it was true.

"The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
level is really good."

https://lore.kernel.org/lkml/20191025200020.GA8325@castle.DHCP.thefacebook.com/

If you recall, the main concern was how the byte charging interface
was added to the existing page charging interface, instead of being
layered on top of it. I suggested to do that and, since there was no
other user for the indirection pointer, just include it in the API.

It made sense at the time, and you seemed to agree. But I also agree
it makes sense to factor it out now that more users are materializing.

> I still think obj_cgroup API must be significantly reworked before being
> applied outside of the kmem area: reparenting part must be separated
> and moved to the cgroup core level to be used not only in the memcg
> context but also for other controllers, which are facing similar problems.
> Spilling obj_cgroup API in the current form over all memcg code will
> make it more complicated and will delay it, given the amount of changes
> and the number of potential code conflicts.
> 
> I'm working on the generalization of obj_cgroup API (as described above)
> and expect to have some patches next week.

Yeah, splitting the byte charging API from the reference API and
making the latter cgroup-generic makes sense. I'm looking forward to
your patches.

And yes, the conflicts between that work and Muchun's patches would be
quite large. However, most of them would come down to renames, since
the access rules and refcounting sites will remain the same, so it
shouldn't be too bad to rebase Muchun's patches on yours. And we can
continue reviewing his patches for correctness for now.

Thanks


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages
  2021-04-12 17:14   ` Johannes Weiner
@ 2021-04-12 17:45     ` Roman Gushchin
  0 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2021-04-12 17:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Muchun Song, mhocko, akpm, shakeelb, vdavydov.dev, linux-kernel,
	linux-mm, duanxiongchun, fam.zheng, bsingharora, shy828301,
	alex.shi

On Mon, Apr 12, 2021 at 01:14:57PM -0400, Johannes Weiner wrote:
> On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > > 
> > > 	[v17,00/19] The new cgroup slab memory controller
> > > 	[v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > 
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > and make page reclaim very inefficient.
> > > 
> > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > 
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > of the dying cgroups will not increase if we run the following test script.
> > > 
> > > ```bash
> > > #!/bin/bash
> > > 
> > > cat /proc/cgroups | grep memory
> > > 
> > > cd /sys/fs/cgroup/memory
> > > 
> > > for i in range{1..500}
> > > do
> > > 	mkdir test
> > > 	echo $$ > test/cgroup.procs
> > > 	sleep 60 &
> > > 	echo $$ > cgroup.procs
> > > 	echo `cat test/cgroup.procs` > cgroup.procs
> > > 	rmdir test
> > > done
> > > 
> > > cat /proc/cgroups | grep memory
> > > ```
> > > 
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-18 convert LRU pages pin to the objcg direction.
> > > 
> > > Any comments are welcome. Thanks.
> > 
> > Indeed the problem exists for a long time and it would be nice to fix it.
> > However I'm against merging the patchset in the current form (there are some
> > nice fixes/clean-ups, which can/must be applied independently). Let me explain
> > my concerns:
> > 
> > Back to the new slab controller discussion obj_cgroup was suggested by Johannes
> > as a union of two concepts:
> > 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> > 2) byte-sized accounting
> > 
> > I was initially against this union because I anticipated that the reparenting
> > part will be useful separately. And the time told it was true.
> 
> "The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
> level is really good."
> 
> https://lore.kernel.org/lkml/20191025200020.GA8325@castle.DHCP.thefacebook.com/
> 
> If you recall, the main concern was how the byte charging interface
> was added to the existing page charging interface, instead of being
> layered on top of it. I suggested to do that and, since there was no
> other user for the indirection pointer, just include it in the API.
> 
> It made sense at the time, and you seemed to agree. But I also agree
> it makes sense to factor it out now that more users are materializing.

Agreed.

> 
> > I still think obj_cgroup API must be significantly reworked before being
> > applied outside of the kmem area: reparenting part must be separated
> > and moved to the cgroup core level to be used not only in the memcg
> > context but also for other controllers, which are facing similar problems.
> > Spilling obj_cgroup API in the current form over all memcg code will
> > make it more complicated and will delay it, given the amount of changes
> > and the number of potential code conflicts.
> > 
> > I'm working on the generalization of obj_cgroup API (as described above)
> > and expect to have some patches next week.
> 
> Yeah, splitting the byte charging API from the reference API and
> making the latter cgroup-generic makes sense. I'm looking forward to
> your patches.
> 
> And yes, the conflicts between that work and Muchun's patches would be
> quite large. However, most of them would come down to renames, since
> the access rules and refcounting sites will remain the same, so it
> shouldn't be too bad to rebase Muchun's patches on yours. And we can
> continue reviewing his patches for correctness for now.

Sounds good to me!

Thanks


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-04-12 17:45 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-09 12:29 [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Muchun Song
2021-04-09 12:29 ` [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement Muchun Song
2021-04-10 17:44   ` Shakeel Butt
2021-04-09 12:29 ` [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm Muchun Song
2021-04-10 17:47   ` Shakeel Butt
2021-04-09 12:29 ` [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg Muchun Song
2021-04-09 16:27   ` Johannes Weiner
2021-04-09 12:29 ` [RFC PATCH v2 07/18] mm: memcontrol: introduce compact_lock_page_lruvec_irqsave Muchun Song
2021-04-09 12:29 ` [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack Muchun Song
2021-04-09 18:31   ` Johannes Weiner
2021-04-10  4:33     ` [External] " Muchun Song
2021-04-10 17:48       ` Shakeel Butt
2021-04-10  0:49   ` Roman Gushchin
2021-04-09 12:29 ` [RFC PATCH v2 10/18] mm: vmscan: rework move_pages_to_lru() Muchun Song
2021-04-09 12:29 ` [RFC PATCH v2 12/18] mm: thp: make deferred split queue lock safe when the LRU pages reparented Muchun Song
2021-04-09 12:29 ` [RFC PATCH v2 14/18] mm: memcontrol: introduce memcg_reparent_ops Muchun Song
     [not found] ` <20210409122959.82264-5-songmuchun@bytedance.com>
2021-04-09 16:00   ` [RFC PATCH v2 04/18] mm: memcontrol: simplify lruvec_holds_page_lru_lock Johannes Weiner
2021-04-11  5:37     ` [External] " Muchun Song
     [not found] ` <20210409122959.82264-7-songmuchun@bytedance.com>
2021-04-09 16:56   ` [RFC PATCH v2 06/18] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM Johannes Weiner
2021-04-11  6:24     ` [External] " Muchun Song
2021-04-10  1:29 ` [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages Roman Gushchin
2021-04-12 17:14   ` Johannes Weiner
2021-04-12 17:45     ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).