* [PATCH 0/8]: mm: vmscan: cgroup-related cleanups
@ 2019-10-22 14:47 Johannes Weiner
  2019-10-22 14:47 ` [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size() Johannes Weiner
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

Here are 8 patches that clean up the reclaim code's interaction with
cgroups a bit. They're not supposed to change any behavior, just make
the implementation easier to understand and work with.

My apologies in advance for 5/8, which changes the indentation of
shrink_node() and so results in a terrible diff. The rest of the
series should be straightforward.

 include/linux/memcontrol.h |  32 ++--
 include/linux/mmzone.h     |  26 +--
 mm/memcontrol.c            |  12 +-
 mm/page_alloc.c            |   2 +-
 mm/slab.h                  |   4 +-
 mm/vmscan.c                | 386 +++++++++++++++++++------------------------
 mm/workingset.c            |   8 +-
 7 files changed, 216 insertions(+), 254 deletions(-)



* [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size()
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
@ 2019-10-22 14:47 ` Johannes Weiner
  2019-10-22 19:18   ` Roman Gushchin
  2019-10-23 13:48   ` Michal Hocko
  2019-10-22 14:47 ` [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure Johannes Weiner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

This function currently takes the node or lruvec size and subtracts
the zones that are excluded by the classzone index of the
allocation. It uses four different types of counters to do this.

Just add up the eligible zones.

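For readability, this is the resulting function with the diff markers
stripped (the same code as the hunk below, nothing new):

unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
{
        unsigned long size = 0;
        int zid;

        /* Only count zones at or below the allocation's classzone index */
        for (zid = 0; zid <= zone_idx; zid++) {
                struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];

                if (!managed_zone(zone))
                        continue;

                if (!mem_cgroup_disabled())
                        size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
                else
                        size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
        }
        return size;
}
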
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1154b3a2b637..57f533b808f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -351,32 +351,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
  */
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
 {
-	unsigned long lru_size = 0;
+	unsigned long size = 0;
 	int zid;
 
-	if (!mem_cgroup_disabled()) {
-		for (zid = 0; zid < MAX_NR_ZONES; zid++)
-			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
-	} else
-		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
-
-	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
+	for (zid = 0; zid <= zone_idx; zid++) {
 		struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];
-		unsigned long size;
 
 		if (!managed_zone(zone))
 			continue;
 
 		if (!mem_cgroup_disabled())
-			size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+			size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
 		else
-			size = zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
-				       NR_ZONE_LRU_BASE + lru);
-		lru_size -= min(size, lru_size);
+			size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
 	}
-
-	return lru_size;
-
+	return size;
 }
 
 /*
-- 
2.23.0


* [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
  2019-10-22 14:47 ` [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size() Johannes Weiner
@ 2019-10-22 14:47 ` Johannes Weiner
  2019-10-22 19:25   ` Roman Gushchin
  2019-10-23 14:00   ` Michal Hocko
  2019-10-22 14:47 ` [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller Johannes Weiner
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
used is somewhat confusing right now, and it's easy to make mistakes -
especially when it comes to global reclaim.

How it works: when memory cgroups are enabled, we always use the
root_mem_cgroup's per-node lruvecs. When memory cgroups are not
compiled in or disabled at runtime, we use pgdat->lruvec.

Document that in a comment.

Due to the way the reclaim code is generalized, all lookups use the
mem_cgroup_lruvec() helper function, and nobody should have to find
the right lruvec manually right now. But to avoid future mistakes,
rename the pgdat->lruvec member to pgdat->__lruvec and delete the
convenience wrapper that suggests it's a commonly accessed member.

While in this area, swap the mem_cgroup_lruvec() argument order. The
name suggests a memcg operation, yet it takes a pgdat first and a
memcg second. I have to do a double take every time I call this. Fix that.

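To illustrate the new convention, a small usage sketch distilled from
the diff below (no new interfaces, just the reordered arguments):

        /* per-memcg, per-node LRU lists: memcg first, node second */
        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

With !CONFIG_MEMCG or a runtime-disabled controller, the helper simply
returns &pgdat->__lruvec; nothing outside of it should dereference that
member directly.
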
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 26 +++++++++++++-------------
 include/linux/mmzone.h     | 15 ++++++++-------
 mm/memcontrol.c            | 12 ++++++------
 mm/page_alloc.c            |  2 +-
 mm/slab.h                  |  4 ++--
 mm/vmscan.c                |  6 +++---
 mm/workingset.c            |  8 ++++----
 7 files changed, 37 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2b34925fc19d..498cea07cbb1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -393,22 +393,22 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
 }
 
 /**
- * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
- * @node: node of the wanted lruvec
+ * mem_cgroup_lruvec - get the lru list vector for a memcg & node
  * @memcg: memcg of the wanted lruvec
+ * @node: node of the wanted lruvec
  *
- * Returns the lru list vector holding pages for a given @node or a given
- * @memcg and @zone. This can be the node lruvec, if the memory controller
- * is disabled.
+ * Returns the lru list vector holding pages for a given @memcg &
+ * @node combination. This can be the node lruvec, if the memory
+ * controller is disabled.
  */
-static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
+					       struct pglist_data *pgdat)
 {
 	struct mem_cgroup_per_node *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = node_lruvec(pgdat);
+		lruvec = &pgdat->__lruvec;
 		goto out;
 	}
 
@@ -727,7 +727,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
 		return;
 	}
 
-	lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup);
+	lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
 	__mod_lruvec_state(lruvec, idx, val);
 }
 
@@ -898,16 +898,16 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-				struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
+					       struct pglist_data *pgdat)
 {
-	return node_lruvec(pgdat);
+	return &pgdat->__lruvec;
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 						    struct pglist_data *pgdat)
 {
-	return &pgdat->lruvec;
+	return &pgdat->__lruvec;
 }
 
 static inline bool mm_match_cgroup(struct mm_struct *mm,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4ca03b93373..449a44171026 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -777,7 +777,13 @@ typedef struct pglist_data {
 #endif
 
 	/* Fields commonly accessed by the page reclaim scanner */
-	struct lruvec		lruvec;
+
+	/*
+	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
+	 *
+	 * Use mem_cgroup_lruvec() to look up lruvecs.
+	 */
+	struct lruvec		__lruvec;
 
 	unsigned long		flags;
 
@@ -800,11 +806,6 @@ typedef struct pglist_data {
 #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
 
-static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
-{
-	return &pgdat->lruvec;
-}
-
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
 {
 	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
@@ -842,7 +843,7 @@ static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
 #ifdef CONFIG_MEMCG
 	return lruvec->pgdat;
 #else
-	return container_of(lruvec, struct pglist_data, lruvec);
+	return container_of(lruvec, struct pglist_data, __lruvec);
 #endif
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 98c2fd902533..055975b0b3a3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -770,7 +770,7 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 	if (!memcg || memcg == root_mem_cgroup) {
 		__mod_node_page_state(pgdat, idx, val);
 	} else {
-		lruvec = mem_cgroup_lruvec(pgdat, memcg);
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		__mod_lruvec_state(lruvec, idx, val);
 	}
 	rcu_read_unlock();
@@ -1226,7 +1226,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = &pgdat->lruvec;
+		lruvec = &pgdat->__lruvec;
 		goto out;
 	}
 
@@ -1605,7 +1605,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
 		int nid, bool noswap)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
 
 	if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
 	    lruvec_page_state(lruvec, NR_ACTIVE_FILE))
@@ -3735,7 +3735,7 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 					   int nid, unsigned int lru_mask)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
 	unsigned long nr = 0;
 	enum lru_list lru;
 
@@ -5433,8 +5433,8 @@ static int mem_cgroup_move_account(struct page *page,
 	anon = PageAnon(page);
 
 	pgdat = page_pgdat(page);
-	from_vec = mem_cgroup_lruvec(pgdat, from);
-	to_vec = mem_cgroup_lruvec(pgdat, to);
+	from_vec = mem_cgroup_lruvec(from, pgdat);
+	to_vec = mem_cgroup_lruvec(to, pgdat);
 
 	spin_lock_irqsave(&from->move_lock, flags);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe76be55c9d5..791c018314b3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6708,7 +6708,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_page_ext_init(pgdat);
 	spin_lock_init(&pgdat->lru_lock);
-	lruvec_init(node_lruvec(pgdat));
+	lruvec_init(&pgdat->__lruvec);
 }
 
 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/slab.h b/mm/slab.h
index 3eb29ae75743..2bbecf28688d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -369,7 +369,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	if (ret)
 		goto out;
 
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
 
 	/* transer try_charge() page references to kmem_cache */
@@ -393,7 +393,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	rcu_read_lock();
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
-		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
 		memcg_kmem_uncharge_memcg(page, order, memcg);
 	} else {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57f533b808f2..be3c22c274c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2545,7 +2545,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
 			      struct scan_control *sc)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -3023,7 +3023,7 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
 		unsigned long refaults;
 		struct lruvec *lruvec;
 
-		lruvec = mem_cgroup_lruvec(pgdat, memcg);
+		lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
 		lruvec->refaults = refaults;
 	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
@@ -3391,7 +3391,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
-		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 		if (inactive_list_is_low(lruvec, false, sc, true))
 			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
diff --git a/mm/workingset.c b/mm/workingset.c
index c963831d354f..e8212123c1c3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -233,7 +233,7 @@ void *workingset_eviction(struct page *page)
 	VM_BUG_ON_PAGE(page_count(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	lruvec = mem_cgroup_lruvec(pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	eviction = atomic_long_inc_return(&lruvec->inactive_age);
 	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
 }
@@ -280,7 +280,7 @@ void workingset_refault(struct page *page, void *shadow)
 	memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(pgdat, memcg);
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	refault = atomic_long_read(&lruvec->inactive_age);
 	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
 
@@ -345,7 +345,7 @@ void workingset_activation(struct page *page)
 	memcg = page_memcg_rcu(page);
 	if (!mem_cgroup_disabled() && !memcg)
 		goto out;
-	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 	atomic_long_inc(&lruvec->inactive_age);
 out:
 	rcu_read_unlock();
@@ -426,7 +426,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
+		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
 							 NR_LRU_BASE + i);
-- 
2.23.0


* [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
  2019-10-22 14:47 ` [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size() Johannes Weiner
  2019-10-22 14:47 ` [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure Johannes Weiner
@ 2019-10-22 14:47 ` Johannes Weiner
  2019-10-22 19:28   ` Roman Gushchin
  2019-10-23 14:06   ` Michal Hocko
  2019-10-22 14:47 ` [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim() Johannes Weiner
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

inactive_list_is_low() should be about one thing: checking the ratio
between the inactive and active lists. Kitchen-sink checks like the one
for swap space make the function hard to use and its callsites hard to
modify. Luckily, most callers already have an understanding of the swap
situation, so this is easy to clean up.

get_scan_count() has its own, memcg-aware swap check, and doesn't even
get to the inactive_list_is_low() check on the anon list when there is
no swap space available.

shrink_list() is called on the results of get_scan_count(), so that
check is redundant too.

age_active_anon() has its own totalswap_pages check right before it
checks the list proportions.

The shrink_node_memcg() site is the only one that doesn't do its own
swap check. Add it there.

Then delete the swap check from inactive_list_is_low().

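The shrink_node_memcg() callsite then ends up looking roughly like this
(see the second hunk below):

        /*
         * Even if we did not try to evict anon pages at all, we want to
         * rebalance the anon lru active/inactive ratio - but only if
         * there is swap to move anon pages to in the first place.
         */
        if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
                shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
                                   sc, LRU_ACTIVE_ANON);
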
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index be3c22c274c1..622b77488144 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2226,13 +2226,6 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 	unsigned long refaults;
 	unsigned long gb;
 
-	/*
-	 * If we don't have swap space, anonymous page deactivation
-	 * is pointless.
-	 */
-	if (!file && !total_swap_pages)
-		return false;
-
 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
 	active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
 
@@ -2653,7 +2646,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (inactive_list_is_low(lruvec, false, sc, true))
+	if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 }
-- 
2.23.0


* [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
                   ` (2 preceding siblings ...)
  2019-10-22 14:47 ` [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller Johannes Weiner
@ 2019-10-22 14:47 ` Johannes Weiner
  2019-10-22 19:40   ` Roman Gushchin
  2019-10-23 14:14   ` Michal Hocko
  2019-10-22 14:48 ` [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump Johannes Weiner
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

Seven years after introducing the global_reclaim() function, I still
have to do a double take when reading a callsite. I don't know how
others do it; this is a terrible name.

Invert the meaning and rename it to cgroup_reclaim().

[ After all, "global reclaim" is just regular reclaim invoked from the
  page allocator. It's reclaim on behalf of a cgroup limit that is a
  special case of reclaim, and should be explicit - not the reverse. ]

sane_reclaim() isn't very descriptive either: it tests whether we can
use the regular writeback throttling - available during regular page
reclaim or cgroup2 limit reclaim - or need to use the broken
wait_on_page_writeback() method. Use "writeback_throttling_sane()".

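For clarity, the resulting helpers as defined in the diff below (the
!CONFIG_MEMCG stubs return false and true, respectively):

static bool cgroup_reclaim(struct scan_control *sc)
{
        /* reclaim on behalf of a cgroup limit, not the page allocator */
        return sc->target_mem_cgroup;
}

static bool writeback_throttling_sane(struct scan_control *sc)
{
        /* global reclaim can rely on balance_dirty_pages() */
        if (!cgroup_reclaim(sc))
                return true;
#ifdef CONFIG_CGROUP_WRITEBACK
        /* and so can cgroup2 limit reclaim */
        if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
                return true;
#endif
        /* legacy memcg reclaim has to stall on wait_on_page_writeback() */
        return false;
}
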
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 622b77488144..302dad112f75 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 	up_write(&shrinker_rwsem);
 }
 
-static bool global_reclaim(struct scan_control *sc)
+static bool cgroup_reclaim(struct scan_control *sc)
 {
-	return !sc->target_mem_cgroup;
+	return sc->target_mem_cgroup;
 }
 
 /**
- * sane_reclaim - is the usual dirty throttling mechanism operational?
+ * writeback_throttling_sane - is the usual dirty throttling mechanism available?
  * @sc: scan_control in question
  *
  * The normal page dirty throttling mechanism in balance_dirty_pages() is
@@ -257,11 +257,9 @@ static bool global_reclaim(struct scan_control *sc)
  * This function tests whether the vmscan currently in progress can assume
  * that the normal dirty throttling mechanism is operational.
  */
-static bool sane_reclaim(struct scan_control *sc)
+static bool writeback_throttling_sane(struct scan_control *sc)
 {
-	struct mem_cgroup *memcg = sc->target_mem_cgroup;
-
-	if (!memcg)
+	if (!cgroup_reclaim(sc))
 		return true;
 #ifdef CONFIG_CGROUP_WRITEBACK
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
@@ -302,12 +300,12 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
 {
 }
 
-static bool global_reclaim(struct scan_control *sc)
+static bool cgroup_reclaim(struct scan_control *sc)
 {
-	return true;
+	return false;
 }
 
-static bool sane_reclaim(struct scan_control *sc)
+static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
@@ -1227,7 +1225,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto activate_locked;
 
 			/* Case 2 above */
-			} else if (sane_reclaim(sc) ||
+			} else if (writeback_throttling_sane(sc) ||
 			    !PageReclaim(page) || !may_enter_fs) {
 				/*
 				 * This is slightly racy - end_page_writeback()
@@ -1821,7 +1819,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if (current_is_kswapd())
 		return 0;
 
-	if (!sane_reclaim(sc))
+	if (!writeback_throttling_sane(sc))
 		return 0;
 
 	if (file) {
@@ -1971,7 +1969,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
-	if (global_reclaim(sc))
+	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	spin_unlock_irq(&pgdat->lru_lock);
@@ -1985,7 +1983,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	spin_lock_irq(&pgdat->lru_lock);
 
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
-	if (global_reclaim(sc))
+	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
@@ -2309,7 +2307,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * using the memory controller's swap limit feature would be
 	 * too expensive.
 	 */
-	if (!global_reclaim(sc) && !swappiness) {
+	if (cgroup_reclaim(sc) && !swappiness) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2333,7 +2331,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	 * thrashing file LRU becomes infinitely more attractive than
 	 * anon pages.  Try to detect this based on file LRU size.
 	 */
-	if (global_reclaim(sc)) {
+	if (!cgroup_reclaim(sc)) {
 		unsigned long pgdatfile;
 		unsigned long pgdatfree;
 		int z;
@@ -2564,7 +2562,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	 * abort proportional reclaim if either the file or anon lru has already
 	 * dropped to zero at the first pass.
 	 */
-	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
+	scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
 			 sc->priority == DEF_PRIORITY);
 
 	blk_start_plug(&plug);
@@ -2853,7 +2851,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		 * Legacy memcg will stall in page writeback so avoid forcibly
 		 * stalling in wait_iff_congested().
 		 */
-		if (!global_reclaim(sc) && sane_reclaim(sc) &&
+		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
 		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
 			set_memcg_congestion(pgdat, root, true);
 
@@ -2948,7 +2946,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
-		if (global_reclaim(sc)) {
+		if (!cgroup_reclaim(sc)) {
 			if (!cpuset_zone_allowed(zone,
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
@@ -3048,7 +3046,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 retry:
 	delayacct_freepages_start();
 
-	if (global_reclaim(sc))
+	if (!cgroup_reclaim(sc))
 		__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
 
 	do {
-- 
2.23.0


* [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
                   ` (3 preceding siblings ...)
  2019-10-22 14:47 ` [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim() Johannes Weiner
@ 2019-10-22 14:48 ` Johannes Weiner
  2019-10-22 19:56   ` Roman Gushchin
  2019-10-23 14:18   ` Michal Hocko
  2019-10-22 14:48 ` [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec() Johannes Weiner
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

Most of the function body is inside a loop, which imposes an
additional indentation and scoping level that makes the code a bit
hard to follow and modify.

The looping only happens in case of reclaim-compaction, which isn't
the common case. So rather than adding yet another function level to
the reclaim path and having every reclaim invocation go through a level
that only exists for one specific corner case, use a retry goto.

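The resulting control flow, stripped down to a skeleton (everything
elided here is unchanged and visible in the diff below):

static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
        unsigned long nr_reclaimed, nr_scanned;
        /* ... */
again:
        memset(&sc->nr, 0, sizeof(sc->nr));

        nr_reclaimed = sc->nr_reclaimed;
        nr_scanned = sc->nr_scanned;

        /* one reclaim pass: cgroup tree walk, vmpressure, congestion handling */

        if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc))
                goto again;

        /* ... */
}
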
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 231 ++++++++++++++++++++++++++--------------------------
 1 file changed, 115 insertions(+), 116 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 302dad112f75..235d1fc72311 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2729,144 +2729,143 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct mem_cgroup *root = sc->target_mem_cgroup;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	struct mem_cgroup *memcg;
+again:
+	memset(&sc->nr, 0, sizeof(sc->nr));
 
-	do {
-		struct mem_cgroup *root = sc->target_mem_cgroup;
-		struct mem_cgroup *memcg;
-
-		memset(&sc->nr, 0, sizeof(sc->nr));
-
-		nr_reclaimed = sc->nr_reclaimed;
-		nr_scanned = sc->nr_scanned;
+	nr_reclaimed = sc->nr_reclaimed;
+	nr_scanned = sc->nr_scanned;
 
-		memcg = mem_cgroup_iter(root, NULL, NULL);
-		do {
-			unsigned long reclaimed;
-			unsigned long scanned;
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		unsigned long reclaimed;
+		unsigned long scanned;
 
-			switch (mem_cgroup_protected(root, memcg)) {
-			case MEMCG_PROT_MIN:
-				/*
-				 * Hard protection.
-				 * If there is no reclaimable memory, OOM.
-				 */
+		switch (mem_cgroup_protected(root, memcg)) {
+		case MEMCG_PROT_MIN:
+			/*
+			 * Hard protection.
+			 * If there is no reclaimable memory, OOM.
+			 */
+			continue;
+		case MEMCG_PROT_LOW:
+			/*
+			 * Soft protection.
+			 * Respect the protection only as long as
+			 * there is an unprotected supply
+			 * of reclaimable memory from other cgroups.
+			 */
+			if (!sc->memcg_low_reclaim) {
+				sc->memcg_low_skipped = 1;
 				continue;
-			case MEMCG_PROT_LOW:
-				/*
-				 * Soft protection.
-				 * Respect the protection only as long as
-				 * there is an unprotected supply
-				 * of reclaimable memory from other cgroups.
-				 */
-				if (!sc->memcg_low_reclaim) {
-					sc->memcg_low_skipped = 1;
-					continue;
-				}
-				memcg_memory_event(memcg, MEMCG_LOW);
-				break;
-			case MEMCG_PROT_NONE:
-				/*
-				 * All protection thresholds breached. We may
-				 * still choose to vary the scan pressure
-				 * applied based on by how much the cgroup in
-				 * question has exceeded its protection
-				 * thresholds (see get_scan_count).
-				 */
-				break;
 			}
+			memcg_memory_event(memcg, MEMCG_LOW);
+			break;
+		case MEMCG_PROT_NONE:
+			/*
+			 * All protection thresholds breached. We may
+			 * still choose to vary the scan pressure
+			 * applied based on by how much the cgroup in
+			 * question has exceeded its protection
+			 * thresholds (see get_scan_count).
+			 */
+			break;
+		}
 
-			reclaimed = sc->nr_reclaimed;
-			scanned = sc->nr_scanned;
-			shrink_node_memcg(pgdat, memcg, sc);
-
-			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
-					sc->priority);
-
-			/* Record the group's reclaim efficiency */
-			vmpressure(sc->gfp_mask, memcg, false,
-				   sc->nr_scanned - scanned,
-				   sc->nr_reclaimed - reclaimed);
-
-		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+		reclaimed = sc->nr_reclaimed;
+		scanned = sc->nr_scanned;
+		shrink_node_memcg(pgdat, memcg, sc);
 
-		if (reclaim_state) {
-			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-			reclaim_state->reclaimed_slab = 0;
-		}
+		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
+			    sc->priority);
 
-		/* Record the subtree's reclaim efficiency */
-		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
-			   sc->nr_scanned - nr_scanned,
-			   sc->nr_reclaimed - nr_reclaimed);
+		/* Record the group's reclaim efficiency */
+		vmpressure(sc->gfp_mask, memcg, false,
+			   sc->nr_scanned - scanned,
+			   sc->nr_reclaimed - reclaimed);
 
-		if (sc->nr_reclaimed - nr_reclaimed)
-			reclaimable = true;
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 
-		if (current_is_kswapd()) {
-			/*
-			 * If reclaim is isolating dirty pages under writeback,
-			 * it implies that the long-lived page allocation rate
-			 * is exceeding the page laundering rate. Either the
-			 * global limits are not being effective at throttling
-			 * processes due to the page distribution throughout
-			 * zones or there is heavy usage of a slow backing
-			 * device. The only option is to throttle from reclaim
-			 * context which is not ideal as there is no guarantee
-			 * the dirtying process is throttled in the same way
-			 * balance_dirty_pages() manages.
-			 *
-			 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
-			 * count the number of pages under pages flagged for
-			 * immediate reclaim and stall if any are encountered
-			 * in the nr_immediate check below.
-			 */
-			if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
-				set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+	if (reclaim_state) {
+		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+		reclaim_state->reclaimed_slab = 0;
+	}
 
-			/*
-			 * Tag a node as congested if all the dirty pages
-			 * scanned were backed by a congested BDI and
-			 * wait_iff_congested will stall.
-			 */
-			if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-				set_bit(PGDAT_CONGESTED, &pgdat->flags);
+	/* Record the subtree's reclaim efficiency */
+	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+		   sc->nr_scanned - nr_scanned,
+		   sc->nr_reclaimed - nr_reclaimed);
 
-			/* Allow kswapd to start writing pages during reclaim.*/
-			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
-				set_bit(PGDAT_DIRTY, &pgdat->flags);
+	if (sc->nr_reclaimed - nr_reclaimed)
+		reclaimable = true;
 
-			/*
-			 * If kswapd scans pages marked marked for immediate
-			 * reclaim and under writeback (nr_immediate), it
-			 * implies that pages are cycling through the LRU
-			 * faster than they are written so also forcibly stall.
-			 */
-			if (sc->nr.immediate)
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
-		}
+	if (current_is_kswapd()) {
+		/*
+		 * If reclaim is isolating dirty pages under writeback,
+		 * it implies that the long-lived page allocation rate
+		 * is exceeding the page laundering rate. Either the
+		 * global limits are not being effective at throttling
+		 * processes due to the page distribution throughout
+		 * zones or there is heavy usage of a slow backing
+		 * device. The only option is to throttle from reclaim
+		 * context which is not ideal as there is no guarantee
+		 * the dirtying process is throttled in the same way
+		 * balance_dirty_pages() manages.
+		 *
+		 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
+		 * count the number of pages under pages flagged for
+		 * immediate reclaim and stall if any are encountered
+		 * in the nr_immediate check below.
+		 */
+		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
+			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
 		/*
-		 * Legacy memcg will stall in page writeback so avoid forcibly
-		 * stalling in wait_iff_congested().
+		 * Tag a node as congested if all the dirty pages
+		 * scanned were backed by a congested BDI and
+		 * wait_iff_congested will stall.
 		 */
-		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
-		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-			set_memcg_congestion(pgdat, root, true);
+		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+			set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
+		/* Allow kswapd to start writing pages during reclaim.*/
+		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
+			set_bit(PGDAT_DIRTY, &pgdat->flags);
 
 		/*
-		 * Stall direct reclaim for IO completions if underlying BDIs
-		 * and node is congested. Allow kswapd to continue until it
-		 * starts encountering unqueued dirty pages or cycling through
-		 * the LRU too quickly.
+		 * If kswapd scans pages marked marked for immediate
+		 * reclaim and under writeback (nr_immediate), it
+		 * implies that pages are cycling through the LRU
+		 * faster than they are written so also forcibly stall.
 		 */
-		if (!sc->hibernation_mode && !current_is_kswapd() &&
-		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
-			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		if (sc->nr.immediate)
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
+
+	/*
+	 * Legacy memcg will stall in page writeback so avoid forcibly
+	 * stalling in wait_iff_congested().
+	 */
+	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
+	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
+		set_memcg_congestion(pgdat, root, true);
+
+	/*
+	 * Stall direct reclaim for IO completions if underlying BDIs
+	 * and node is congested. Allow kswapd to continue until it
+	 * starts encountering unqueued dirty pages or cycling through
+	 * the LRU too quickly.
+	 */
+	if (!sc->hibernation_mode && !current_is_kswapd() &&
+	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
-	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-					 sc));
+	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
+				    sc))
+		goto again;
 
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
-- 
2.23.0


* [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
                   ` (4 preceding siblings ...)
  2019-10-22 14:48 ` [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump Johannes Weiner
@ 2019-10-22 14:48 ` Johannes Weiner
  2019-10-22 20:04   ` Roman Gushchin
  2019-10-23 14:21   ` Michal Hocko
  2019-10-22 14:48 ` [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part Johannes Weiner
  2019-10-22 14:48 ` [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs Johannes Weiner
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

A lruvec holds LRU pages owned by a certain NUMA node and cgroup.
Instead of awkwardly passing around a combination of a pgdat and a
memcg pointer, pass down the lruvec as soon as we can look it up.

Nested callers that need to access node or cgroup properties can look
them up if necessary, but there are only a few cases.

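In other words, callers now do something like this (a sketch based on
the diff below):

        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

        shrink_lruvec(lruvec, sc);

Within shrink_lruvec() and get_scan_count(), the few places that still
need the cgroup or the node can use lruvec_memcg(lruvec) and
lruvec_pgdat(lruvec).
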
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 235d1fc72311..db073b40c432 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2280,9 +2280,10 @@ enum scan_balance {
  * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
  * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
  */
-static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
-			   struct scan_control *sc, unsigned long *nr)
+static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
+			   unsigned long *nr)
 {
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	int swappiness = mem_cgroup_swappiness(memcg);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
@@ -2530,13 +2531,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	}
 }
 
-/*
- * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
- */
-static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
-			      struct scan_control *sc)
+static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long targets[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -2546,7 +2542,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	struct blk_plug plug;
 	bool scan_adjusted;
 
-	get_scan_count(lruvec, memcg, sc, nr);
+	get_scan_count(lruvec, sc, nr);
 
 	/* Record the original scan target for proportional adjustments later */
 	memcpy(targets, nr, sizeof(nr));
@@ -2741,6 +2737,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	memcg = mem_cgroup_iter(root, NULL, NULL);
 	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		unsigned long reclaimed;
 		unsigned long scanned;
 
@@ -2777,7 +2774,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 		reclaimed = sc->nr_reclaimed;
 		scanned = sc->nr_scanned;
-		shrink_node_memcg(pgdat, memcg, sc);
+
+		shrink_lruvec(lruvec, sc);
 
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
 			    sc->priority);
@@ -3281,6 +3279,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 						pg_data_t *pgdat,
 						unsigned long *nr_scanned)
 {
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.target_mem_cgroup = memcg,
@@ -3307,7 +3306,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_node_memcg(pgdat, memcg, &sc);
+	shrink_lruvec(lruvec, &sc);
 
 	trace_mm_vmscan_memcg_softlimit_reclaim_end(
 					cgroup_ino(memcg->css.cgroup),
-- 
2.23.0


* [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
                   ` (5 preceding siblings ...)
  2019-10-22 14:48 ` [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec() Johannes Weiner
@ 2019-10-22 14:48 ` Johannes Weiner
  2019-10-22 20:08   ` Roman Gushchin
  2019-10-23 14:24   ` Michal Hocko
  2019-10-22 14:48 ` [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs Johannes Weiner
  7 siblings, 2 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

This function is getting long and unwieldy. Split out the memcg bits.

The updated shrink_node() handles the generic (node) reclaim aspects:
  - global vmpressure notifications
  - writeback and congestion throttling
  - reclaim/compaction management
  - kswapd giving up on unreclaimable nodes

It then calls a new shrink_node_memcgs() which handles cgroup specifics (see the sketch after this list):
  - the cgroup tree traversal
  - memory.low considerations
  - per-cgroup slab shrinking callbacks
  - per-cgroup vmpressure notifications

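A rough sketch of the new function (the loop body is unchanged and
shown in full in the diff below):

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(root, NULL, NULL);
        do {
                /* memory.low handling, shrink_lruvec(), shrink_slab(),
                 * per-cgroup vmpressure() */
        } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
}

shrink_node() keeps the surrounding node-level work and simply calls
shrink_node_memcgs(pgdat, sc) between snapshotting its baseline
counters and the writeback throttling / should_continue_reclaim()
retry logic.
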
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index db073b40c432..65baa89740dd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2722,18 +2722,10 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
 		(memcg && memcg_congested(pgdat, memcg));
 }
 
-static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 {
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct mem_cgroup *root = sc->target_mem_cgroup;
-	unsigned long nr_reclaimed, nr_scanned;
-	bool reclaimable = false;
 	struct mem_cgroup *memcg;
-again:
-	memset(&sc->nr, 0, sizeof(sc->nr));
-
-	nr_reclaimed = sc->nr_reclaimed;
-	nr_scanned = sc->nr_scanned;
 
 	memcg = mem_cgroup_iter(root, NULL, NULL);
 	do {
@@ -2786,6 +2778,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			   sc->nr_reclaimed - reclaimed);
 
 	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+}
+
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+{
+	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct mem_cgroup *root = sc->target_mem_cgroup;
+	unsigned long nr_reclaimed, nr_scanned;
+	bool reclaimable = false;
+
+again:
+	memset(&sc->nr, 0, sizeof(sc->nr));
+
+	nr_reclaimed = sc->nr_reclaimed;
+	nr_scanned = sc->nr_scanned;
+
+	shrink_node_memcgs(pgdat, sc);
 
 	if (reclaim_state) {
 		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2793,7 +2801,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/* Record the subtree's reclaim efficiency */
-	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+	vmpressure(sc->gfp_mask, root, true,
 		   sc->nr_scanned - nr_scanned,
 		   sc->nr_reclaimed - nr_reclaimed);
 
-- 
2.23.0


* [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs
  2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
                   ` (6 preceding siblings ...)
  2019-10-22 14:48 ` [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part Johannes Weiner
@ 2019-10-22 14:48 ` Johannes Weiner
  2019-10-22 21:03   ` Roman Gushchin
  7 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 14:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, cgroups, linux-kernel, kernel-team

The current writeback congestion tracking has separate flags for
kswapd reclaim (node level) and cgroup limit reclaim (memcg-node
level). This is unnecessarily complicated: the lruvec is an existing
abstraction layer for that node-memcg intersection.

Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
reclaim root level, which is either the NUMA node for global reclaim,
or the cgroup-node intersection for cgroup reclaim.

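The flag's life cycle, condensed from the diff below:

        struct lruvec *target_lruvec;

        target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);

        /* all dirty pages scanned were backed by a congested BDI: */
        set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);

        /* direct reclaim stalls on the congested reclaim root: */
        if (test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
                wait_iff_congested(BLK_RW_ASYNC, HZ/10);

        /* cleared when reclaim finishes or kswapd finds the node balanced: */
        clear_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
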
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  6 +--
 include/linux/mmzone.h     | 11 ++++--
 mm/vmscan.c                | 80 ++++++++++++--------------------------
 3 files changed, 36 insertions(+), 61 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 498cea07cbb1..d8ffcf60440c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -133,9 +133,6 @@ struct mem_cgroup_per_node {
 	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
 	bool			on_tree;
-	bool			congested;	/* memcg has many dirty pages */
-						/* backed by a congested BDI */
-
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
@@ -412,6 +409,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 		goto out;
 	}
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
 	lruvec = &mz->lruvec;
 out:
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 449a44171026..c04b4c1f01fa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -296,6 +296,12 @@ struct zone_reclaim_stat {
 	unsigned long		recent_scanned[2];
 };
 
+enum lruvec_flags {
+	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
+					 * backed by a congested BDI
+					 */
+};
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat	reclaim_stat;
@@ -303,6 +309,8 @@ struct lruvec {
 	atomic_long_t			inactive_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
+	/* Various lruvec state flags (enum lruvec_flags) */
+	unsigned long			flags;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
@@ -572,9 +580,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
-	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
-					 * a congested BDI
-					 */
 	PGDAT_DIRTY,			/* reclaim scanning has recently found
 					 * many dirty file pages at the tail
 					 * of the LRU.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65baa89740dd..3e21166d5198 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -267,29 +267,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
-
-static void set_memcg_congestion(pg_data_t *pgdat,
-				struct mem_cgroup *memcg,
-				bool congested)
-{
-	struct mem_cgroup_per_node *mn;
-
-	if (!memcg)
-		return;
-
-	mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
-	WRITE_ONCE(mn->congested, congested);
-}
-
-static bool memcg_congested(pg_data_t *pgdat,
-			struct mem_cgroup *memcg)
-{
-	struct mem_cgroup_per_node *mn;
-
-	mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
-	return READ_ONCE(mn->congested);
-
-}
 #else
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
@@ -309,18 +286,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
-
-static inline void set_memcg_congestion(struct pglist_data *pgdat,
-				struct mem_cgroup *memcg, bool congested)
-{
-}
-
-static inline bool memcg_congested(struct pglist_data *pgdat,
-			struct mem_cgroup *memcg)
-{
-	return false;
-
-}
 #endif
 
 /*
@@ -2716,12 +2681,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	return inactive_lru_pages > pages_for_compaction;
 }
 
-static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
-{
-	return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
-		(memcg && memcg_congested(pgdat, memcg));
-}
-
 static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2785,8 +2744,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct mem_cgroup *root = sc->target_mem_cgroup;
 	unsigned long nr_reclaimed, nr_scanned;
+	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
 again:
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
@@ -2829,14 +2791,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
 			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
-		/*
-		 * Tag a node as congested if all the dirty pages
-		 * scanned were backed by a congested BDI and
-		 * wait_iff_congested will stall.
-		 */
-		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-			set_bit(PGDAT_CONGESTED, &pgdat->flags);
-
 		/* Allow kswapd to start writing pages during reclaim.*/
 		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
 			set_bit(PGDAT_DIRTY, &pgdat->flags);
@@ -2852,12 +2806,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/*
+	 * Tag a node/memcg as congested if all the dirty pages
+	 * scanned were backed by a congested BDI and
+	 * wait_iff_congested will stall.
+	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
 	 * stalling in wait_iff_congested().
 	 */
-	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
+	if ((current_is_kswapd() ||
+	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
 	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-		set_memcg_congestion(pgdat, root, true);
+		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
 
 	/*
 	 * Stall direct reclaim for IO completions if underlying BDIs
@@ -2865,8 +2824,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * starts encountering unqueued dirty pages or cycling through
 	 * the LRU too quickly.
 	 */
-	if (!sc->hibernation_mode && !current_is_kswapd() &&
-	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+	if (!current_is_kswapd() && current_may_throttle() &&
+	    !sc->hibernation_mode &&
+	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
@@ -3080,8 +3040,16 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
+
 		snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
-		set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
+
+		if (cgroup_reclaim(sc)) {
+			struct lruvec *lruvec;
+
+			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
+						   zone->zone_pgdat);
+			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+		}
 	}
 
 	delayacct_freepages_end();
@@ -3461,7 +3429,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 /* Clear pgdat state for congested, dirty or under writeback. */
 static void clear_pgdat_congested(pg_data_t *pgdat)
 {
-	clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+	struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
+
+	clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
 	clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
-- 
2.23.0


* Re: [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size()
  2019-10-22 14:47 ` [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size() Johannes Weiner
@ 2019-10-22 19:18   ` Roman Gushchin
  2019-10-23 13:48   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 19:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:47:56AM -0400, Johannes Weiner wrote:
> This function currently takes the node or lruvec size and subtracts
> the zones that are excluded by the classzone index of the
> allocation. It uses four different types of counters to do this.
> 
> Just add up the eligible zones.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 21 +++++----------------
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1154b3a2b637..57f533b808f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -351,32 +351,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size = 0;
> +	unsigned long size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled()) {
> -		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> -			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> -	} else
> -		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
> -
> -	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> +	for (zid = 0; zid <= zone_idx; zid++) {
>  		struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];
> -		unsigned long size;
>  
>  		if (!managed_zone(zone))
>  			continue;
>  
>  		if (!mem_cgroup_disabled())
> -			size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +			size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
>  		else
> -			size = zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
> -				       NR_ZONE_LRU_BASE + lru);
> -		lru_size -= min(size, lru_size);
> +			size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
>  	}
> -
> -	return lru_size;
> -
> +	return size;

Neat!

Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!

* Re: [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure
  2019-10-22 14:47 ` [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure Johannes Weiner
@ 2019-10-22 19:25   ` Roman Gushchin
  2019-10-22 21:31     ` Johannes Weiner
  2019-10-23 14:00   ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 19:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:47:57AM -0400, Johannes Weiner wrote:
> There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
> used is somewhat confusing right now, and it's easy to make mistakes -
> especially when it comes to global reclaim.
> 
> How it works: when memory cgroups are enabled, we always use the
> root_mem_cgroup's per-node lruvecs. When memory cgroups are not
> compiled in or disabled at runtime, we use pgdat->lruvec.
> 
> Document that in a comment.
> 
> Due to the way the reclaim code is generalized, all lookups use the
> mem_cgroup_lruvec() helper function, and nobody should have to find
> the right lruvec manually right now. But to avoid future mistakes,
> rename the pgdat->lruvec member to pgdat->__lruvec and delete the
> convenience wrapper that suggests it's a commonly accessed member.

This part looks great!

> 
> While in this area, swap the mem_cgroup_lruvec() argument order. The
> name suggests a memcg operation, yet it takes a pgdat first and a
> memcg second. I have to do a double take every time I call this. Fix that.

Idk, I agree that the new order makes more sense (slightly), but
such changes make any backports / git blame searches more complex.
So, I'm not entirely convinced that it's worth it. The compiler will
prevent passing bad arguments by mistake.

Thanks!

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h | 26 +++++++++++++-------------
>  include/linux/mmzone.h     | 15 ++++++++-------
>  mm/memcontrol.c            | 12 ++++++------
>  mm/page_alloc.c            |  2 +-
>  mm/slab.h                  |  4 ++--
>  mm/vmscan.c                |  6 +++---
>  mm/workingset.c            |  8 ++++----
>  7 files changed, 37 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2b34925fc19d..498cea07cbb1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -393,22 +393,22 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
>  }
>  
>  /**
> - * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
> - * @node: node of the wanted lruvec
> + * mem_cgroup_lruvec - get the lru list vector for a memcg & node
>   * @memcg: memcg of the wanted lruvec
> + * @node: node of the wanted lruvec
>   *
> - * Returns the lru list vector holding pages for a given @node or a given
> - * @memcg and @zone. This can be the node lruvec, if the memory controller
> - * is disabled.
> + * Returns the lru list vector holding pages for a given @memcg &
> + * @node combination. This can be the node lruvec, if the memory
> + * controller is disabled.
>   */
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> -				struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> +					       struct pglist_data *pgdat)
>  {
>  	struct mem_cgroup_per_node *mz;
>  	struct lruvec *lruvec;
>  
>  	if (mem_cgroup_disabled()) {
> -		lruvec = node_lruvec(pgdat);
> +		lruvec = &pgdat->__lruvec;
>  		goto out;
>  	}
>  
> @@ -727,7 +727,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
>  		return;
>  	}
>  
> -	lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup);
> +	lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
>  	__mod_lruvec_state(lruvec, idx, val);
>  }
>  
> @@ -898,16 +898,16 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  {
>  }
>  
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> -				struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> +					       struct pglist_data *pgdat)
>  {
> -	return node_lruvec(pgdat);
> +	return &pgdat->__lruvec;
>  }
>  
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>  						    struct pglist_data *pgdat)
>  {
> -	return &pgdat->lruvec;
> +	return &pgdat->__lruvec;
>  }
>  
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d4ca03b93373..449a44171026 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -777,7 +777,13 @@ typedef struct pglist_data {
>  #endif
>  
>  	/* Fields commonly accessed by the page reclaim scanner */
> -	struct lruvec		lruvec;
> +
> +	/*
> +	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
> +	 *
> +	 * Use mem_cgroup_lruvec() to look up lruvecs.
> +	 */
> +	struct lruvec		__lruvec;
>  
>  	unsigned long		flags;
>  
> @@ -800,11 +806,6 @@ typedef struct pglist_data {
>  #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
>  #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
>  
> -static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
> -{
> -	return &pgdat->lruvec;
> -}
> -
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>  	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -842,7 +843,7 @@ static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  #ifdef CONFIG_MEMCG
>  	return lruvec->pgdat;
>  #else
> -	return container_of(lruvec, struct pglist_data, lruvec);
> +	return container_of(lruvec, struct pglist_data, __lruvec);
>  #endif
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 98c2fd902533..055975b0b3a3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -770,7 +770,7 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
>  	if (!memcg || memcg == root_mem_cgroup) {
>  		__mod_node_page_state(pgdat, idx, val);
>  	} else {
> -		lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		__mod_lruvec_state(lruvec, idx, val);
>  	}
>  	rcu_read_unlock();
> @@ -1226,7 +1226,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	struct lruvec *lruvec;
>  
>  	if (mem_cgroup_disabled()) {
> -		lruvec = &pgdat->lruvec;
> +		lruvec = &pgdat->__lruvec;
>  		goto out;
>  	}
>  
> @@ -1605,7 +1605,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
>  		int nid, bool noswap)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>  
>  	if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
>  	    lruvec_page_state(lruvec, NR_ACTIVE_FILE))
> @@ -3735,7 +3735,7 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>  					   int nid, unsigned int lru_mask)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>  	unsigned long nr = 0;
>  	enum lru_list lru;
>  
> @@ -5433,8 +5433,8 @@ static int mem_cgroup_move_account(struct page *page,
>  	anon = PageAnon(page);
>  
>  	pgdat = page_pgdat(page);
> -	from_vec = mem_cgroup_lruvec(pgdat, from);
> -	to_vec = mem_cgroup_lruvec(pgdat, to);
> +	from_vec = mem_cgroup_lruvec(from, pgdat);
> +	to_vec = mem_cgroup_lruvec(to, pgdat);
>  
>  	spin_lock_irqsave(&from->move_lock, flags);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fe76be55c9d5..791c018314b3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6708,7 +6708,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  
>  	pgdat_page_ext_init(pgdat);
>  	spin_lock_init(&pgdat->lru_lock);
> -	lruvec_init(node_lruvec(pgdat));
> +	lruvec_init(&pgdat->__lruvec);
>  }
>  
>  static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> diff --git a/mm/slab.h b/mm/slab.h
> index 3eb29ae75743..2bbecf28688d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -369,7 +369,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
>  	if (ret)
>  		goto out;
>  
> -	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>  	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
>  
>  	/* transer try_charge() page references to kmem_cache */
> @@ -393,7 +393,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>  	rcu_read_lock();
>  	memcg = READ_ONCE(s->memcg_params.memcg);
>  	if (likely(!mem_cgroup_is_root(memcg))) {
> -		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>  		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>  		memcg_kmem_uncharge_memcg(page, order, memcg);
>  	} else {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 57f533b808f2..be3c22c274c1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2545,7 +2545,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
>  			      struct scan_control *sc)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long targets[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> @@ -3023,7 +3023,7 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
>  		unsigned long refaults;
>  		struct lruvec *lruvec;
>  
> -		lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
>  		lruvec->refaults = refaults;
>  	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> @@ -3391,7 +3391,7 @@ static void age_active_anon(struct pglist_data *pgdat,
>  
>  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  	do {
> -		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  
>  		if (inactive_list_is_low(lruvec, false, sc, true))
>  			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> diff --git a/mm/workingset.c b/mm/workingset.c
> index c963831d354f..e8212123c1c3 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -233,7 +233,7 @@ void *workingset_eviction(struct page *page)
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
> -	lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	eviction = atomic_long_inc_return(&lruvec->inactive_age);
>  	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
>  }
> @@ -280,7 +280,7 @@ void workingset_refault(struct page *page, void *shadow)
>  	memcg = mem_cgroup_from_id(memcgid);
>  	if (!mem_cgroup_disabled() && !memcg)
>  		goto out;
> -	lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	refault = atomic_long_read(&lruvec->inactive_age);
>  	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
>  
> @@ -345,7 +345,7 @@ void workingset_activation(struct page *page)
>  	memcg = page_memcg_rcu(page);
>  	if (!mem_cgroup_disabled() && !memcg)
>  		goto out;
> -	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>  	atomic_long_inc(&lruvec->inactive_age);
>  out:
>  	rcu_read_unlock();
> @@ -426,7 +426,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  		struct lruvec *lruvec;
>  		int i;
>  
> -		lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
> +		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  			pages += lruvec_page_state_local(lruvec,
>  							 NR_LRU_BASE + i);
> -- 
> 2.23.0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller
  2019-10-22 14:47 ` [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller Johannes Weiner
@ 2019-10-22 19:28   ` Roman Gushchin
  2019-10-23 14:06   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 19:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:47:58AM -0400, Johannes Weiner wrote:
> inactive_list_is_low() should be about one thing: checking the ratio
> between inactive and active lists. Kitchen-sink checks like the one for
> swap space make the function hard to use and modify its
> callsites. Luckily, most callers already have an understanding of the
> swap situation, so it's easy to clean up.
> 
> get_scan_count() has its own, memcg-aware swap check, and doesn't even
> get to the inactive_list_is_low() check on the anon list when there is
> no swap space available.
> 
> shrink_list() is called on the results of get_scan_count(), so that
> check is redundant too.
> 
> age_active_anon() has its own totalswap_pages check right before it
> checks the list proportions.
> 
> The shrink_node_memcg() site is the only one that doesn't do its own
> swap check. Add it there.
> 
> Then delete the swap check from inactive_list_is_low().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index be3c22c274c1..622b77488144 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2226,13 +2226,6 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>  	unsigned long refaults;
>  	unsigned long gb;
>  
> -	/*
> -	 * If we don't have swap space, anonymous page deactivation
> -	 * is pointless.
> -	 */
> -	if (!file && !total_swap_pages)
> -		return false;
> -
>  	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
>  	active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
>  
> @@ -2653,7 +2646,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_list_is_low(lruvec, false, sc, true))
> +	if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>  }
> -- 
> 2.23.0
> 
> 

Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
  2019-10-22 14:47 ` [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim() Johannes Weiner
@ 2019-10-22 19:40   ` Roman Gushchin
  2019-10-23 16:02     ` Johannes Weiner
  2019-10-23 14:14   ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 19:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:47:59AM -0400, Johannes Weiner wrote:
> Seven years after introducing the global_reclaim() function, I still
> have to do a double take when reading a callsite. I don't know how
> others do it; this is a terrible name.
> 
> Invert the meaning and rename it to cgroup_reclaim().
> 
> [ After all, "global reclaim" is just regular reclaim invoked from the
>   page allocator. It's reclaim on behalf of a cgroup limit that is a
>   special case of reclaim, and should be explicit - not the reverse. ]
> 
> sane_reclaim() isn't very descriptive either: it tests whether we can
> use the regular writeback throttling - available during regular page
> reclaim or cgroup2 limit reclaim - or need to use the broken
> wait_on_page_writeback() method. Use "writeback_throttling_sane()".
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 38 ++++++++++++++++++--------------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 622b77488144..302dad112f75 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  	up_write(&shrinker_rwsem);
>  }
>  
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
>  {
> -	return !sc->target_mem_cgroup;
> +	return sc->target_mem_cgroup;
>  }

Isn't targeted_reclaim() better?

cgroup_reclaim() is also ok with me, but it sounds a bit like we reclaim
from this specific cgroup. Also, targeted/global is IMO a better contrast
than cgroup/global (the latter reminds me of the days when there were
global and cgroup LRUs).

The rest of the patch looks good!

Reviewed-by: Roman Gushchin <guro@fb.com>

>  
>  /**
> - * sane_reclaim - is the usual dirty throttling mechanism operational?
> + * writeback_throttling_sane - is the usual dirty throttling mechanism available?
>   * @sc: scan_control in question
>   *
>   * The normal page dirty throttling mechanism in balance_dirty_pages() is
> @@ -257,11 +257,9 @@ static bool global_reclaim(struct scan_control *sc)
>   * This function tests whether the vmscan currently in progress can assume
>   * that the normal dirty throttling mechanism is operational.
>   */
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_throttling_sane(struct scan_control *sc)
>  {
> -	struct mem_cgroup *memcg = sc->target_mem_cgroup;
> -
> -	if (!memcg)
> +	if (!cgroup_reclaim(sc))
>  		return true;
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> @@ -302,12 +300,12 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  {
>  }
>  
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
>  {
> -	return true;
> +	return false;
>  }
>  
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_throttling_sane(struct scan_control *sc)
>  {
>  	return true;
>  }
> @@ -1227,7 +1225,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto activate_locked;
>  
>  			/* Case 2 above */
> -			} else if (sane_reclaim(sc) ||
> +			} else if (writeback_throttling_sane(sc) ||
>  			    !PageReclaim(page) || !may_enter_fs) {
>  				/*
>  				 * This is slightly racy - end_page_writeback()
> @@ -1821,7 +1819,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  	if (current_is_kswapd())
>  		return 0;
>  
> -	if (!sane_reclaim(sc))
> +	if (!writeback_throttling_sane(sc))
>  		return 0;
>  
>  	if (file) {
> @@ -1971,7 +1969,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
>  	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> -	if (global_reclaim(sc))
> +	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_scanned);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>  	spin_unlock_irq(&pgdat->lru_lock);
> @@ -1985,7 +1983,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	spin_lock_irq(&pgdat->lru_lock);
>  
>  	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> -	if (global_reclaim(sc))
> +	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>  	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
> @@ -2309,7 +2307,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * using the memory controller's swap limit feature would be
>  	 * too expensive.
>  	 */
> -	if (!global_reclaim(sc) && !swappiness) {
> +	if (cgroup_reclaim(sc) && !swappiness) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
>  	}
> @@ -2333,7 +2331,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * thrashing file LRU becomes infinitely more attractive than
>  	 * anon pages.  Try to detect this based on file LRU size.
>  	 */
> -	if (global_reclaim(sc)) {
> +	if (!cgroup_reclaim(sc)) {
>  		unsigned long pgdatfile;
>  		unsigned long pgdatfree;
>  		int z;
> @@ -2564,7 +2562,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * abort proportional reclaim if either the file or anon lru has already
>  	 * dropped to zero at the first pass.
>  	 */
> -	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
> +	scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
>  			 sc->priority == DEF_PRIORITY);
>  
>  	blk_start_plug(&plug);
> @@ -2853,7 +2851,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		 * Legacy memcg will stall in page writeback so avoid forcibly
>  		 * stalling in wait_iff_congested().
>  		 */
> -		if (!global_reclaim(sc) && sane_reclaim(sc) &&
> +		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
>  		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
>  			set_memcg_congestion(pgdat, root, true);
>  
> @@ -2948,7 +2946,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		 * Take care memory controller reclaiming has small influence
>  		 * to global LRU.
>  		 */
> -		if (global_reclaim(sc)) {
> +		if (!cgroup_reclaim(sc)) {
>  			if (!cpuset_zone_allowed(zone,
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
> @@ -3048,7 +3046,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  retry:
>  	delayacct_freepages_start();
>  
> -	if (global_reclaim(sc))
> +	if (!cgroup_reclaim(sc))
>  		__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
>  
>  	do {
> -- 
> 2.23.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump
  2019-10-22 14:48 ` [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump Johannes Weiner
@ 2019-10-22 19:56   ` Roman Gushchin
  2019-10-22 21:42     ` Johannes Weiner
  2019-10-23 14:18   ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 19:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:48:00AM -0400, Johannes Weiner wrote:
> Most of the function body is inside a loop, which imposes an
> additional indentation and scoping level that makes the code a bit
> hard to follow and modify.
> 
> The looping only happens in the case of reclaim-compaction, which isn't
> the common case. So rather than adding yet another function level to
> the reclaim path and having every reclaim invocation go through a level
> that only exists for one specific corner case, use a retry goto.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 231 ++++++++++++++++++++++++++--------------------------
>  1 file changed, 115 insertions(+), 116 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 302dad112f75..235d1fc72311 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2729,144 +2729,143 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
>  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> +	struct mem_cgroup *root = sc->target_mem_cgroup;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	bool reclaimable = false;
> +	struct mem_cgroup *memcg;
> +again:
> +	memset(&sc->nr, 0, sizeof(sc->nr));
>  
> -	do {
> -		struct mem_cgroup *root = sc->target_mem_cgroup;
> -		struct mem_cgroup *memcg;
> -
> -		memset(&sc->nr, 0, sizeof(sc->nr));
> -
> -		nr_reclaimed = sc->nr_reclaimed;
> -		nr_scanned = sc->nr_scanned;
> +	nr_reclaimed = sc->nr_reclaimed;
> +	nr_scanned = sc->nr_scanned;
>  
> -		memcg = mem_cgroup_iter(root, NULL, NULL);
> -		do {
> -			unsigned long reclaimed;
> -			unsigned long scanned;
> +	memcg = mem_cgroup_iter(root, NULL, NULL);
> +	do {
> +		unsigned long reclaimed;
> +		unsigned long scanned;
>  
> -			switch (mem_cgroup_protected(root, memcg)) {
> -			case MEMCG_PROT_MIN:
> -				/*
> -				 * Hard protection.
> -				 * If there is no reclaimable memory, OOM.
> -				 */
> +		switch (mem_cgroup_protected(root, memcg)) {
> +		case MEMCG_PROT_MIN:
> +			/*
> +			 * Hard protection.
> +			 * If there is no reclaimable memory, OOM.
> +			 */
> +			continue;
> +		case MEMCG_PROT_LOW:
> +			/*
> +			 * Soft protection.
> +			 * Respect the protection only as long as
> +			 * there is an unprotected supply
> +			 * of reclaimable memory from other cgroups.
> +			 */
> +			if (!sc->memcg_low_reclaim) {
> +				sc->memcg_low_skipped = 1;
>  				continue;
> -			case MEMCG_PROT_LOW:
> -				/*
> -				 * Soft protection.
> -				 * Respect the protection only as long as
> -				 * there is an unprotected supply
> -				 * of reclaimable memory from other cgroups.
> -				 */
> -				if (!sc->memcg_low_reclaim) {
> -					sc->memcg_low_skipped = 1;
> -					continue;
> -				}
> -				memcg_memory_event(memcg, MEMCG_LOW);
> -				break;
> -			case MEMCG_PROT_NONE:
> -				/*
> -				 * All protection thresholds breached. We may
> -				 * still choose to vary the scan pressure
> -				 * applied based on by how much the cgroup in
> -				 * question has exceeded its protection
> -				 * thresholds (see get_scan_count).
> -				 */
> -				break;
>  			}
> +			memcg_memory_event(memcg, MEMCG_LOW);
> +			break;
> +		case MEMCG_PROT_NONE:
> +			/*
> +			 * All protection thresholds breached. We may
> +			 * still choose to vary the scan pressure
> +			 * applied based on by how much the cgroup in
> +			 * question has exceeded its protection
> +			 * thresholds (see get_scan_count).
> +			 */
> +			break;
> +		}
>  
> -			reclaimed = sc->nr_reclaimed;
> -			scanned = sc->nr_scanned;
> -			shrink_node_memcg(pgdat, memcg, sc);
> -
> -			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> -					sc->priority);
> -
> -			/* Record the group's reclaim efficiency */
> -			vmpressure(sc->gfp_mask, memcg, false,
> -				   sc->nr_scanned - scanned,
> -				   sc->nr_reclaimed - reclaimed);
> -
> -		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> +		reclaimed = sc->nr_reclaimed;
> +		scanned = sc->nr_scanned;
> +		shrink_node_memcg(pgdat, memcg, sc);
>  
> -		if (reclaim_state) {
> -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -			reclaim_state->reclaimed_slab = 0;
> -		}
> +		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> +			    sc->priority);
>  
> -		/* Record the subtree's reclaim efficiency */
> -		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> -			   sc->nr_scanned - nr_scanned,
> -			   sc->nr_reclaimed - nr_reclaimed);
> +		/* Record the group's reclaim efficiency */
> +		vmpressure(sc->gfp_mask, memcg, false,
> +			   sc->nr_scanned - scanned,
> +			   sc->nr_reclaimed - reclaimed);

It doesn't look like a trivial change. I'd add some comments to the commit message
explaining why it's safe to do.

Thanks!

>  
> -		if (sc->nr_reclaimed - nr_reclaimed)
> -			reclaimable = true;
> +	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
>  
> -		if (current_is_kswapd()) {
> -			/*
> -			 * If reclaim is isolating dirty pages under writeback,
> -			 * it implies that the long-lived page allocation rate
> -			 * is exceeding the page laundering rate. Either the
> -			 * global limits are not being effective at throttling
> -			 * processes due to the page distribution throughout
> -			 * zones or there is heavy usage of a slow backing
> -			 * device. The only option is to throttle from reclaim
> -			 * context which is not ideal as there is no guarantee
> -			 * the dirtying process is throttled in the same way
> -			 * balance_dirty_pages() manages.
> -			 *
> -			 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> -			 * count the number of pages under pages flagged for
> -			 * immediate reclaim and stall if any are encountered
> -			 * in the nr_immediate check below.
> -			 */
> -			if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> -				set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +	if (reclaim_state) {
> +		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +		reclaim_state->reclaimed_slab = 0;
> +	}
>  
> -			/*
> -			 * Tag a node as congested if all the dirty pages
> -			 * scanned were backed by a congested BDI and
> -			 * wait_iff_congested will stall.
> -			 */
> -			if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -				set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +	/* Record the subtree's reclaim efficiency */
> +	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> +		   sc->nr_scanned - nr_scanned,
> +		   sc->nr_reclaimed - nr_reclaimed);
>  
> -			/* Allow kswapd to start writing pages during reclaim.*/
> -			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> -				set_bit(PGDAT_DIRTY, &pgdat->flags);
> +	if (sc->nr_reclaimed - nr_reclaimed)
> +		reclaimable = true;
>  
> -			/*
> -			 * If kswapd scans pages marked marked for immediate
> -			 * reclaim and under writeback (nr_immediate), it
> -			 * implies that pages are cycling through the LRU
> -			 * faster than they are written so also forcibly stall.
> -			 */
> -			if (sc->nr.immediate)
> -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> -		}

Don't you want to move the block below into a separate function?
It can probably make the big loop shorter and easier to follow.
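E.g. something like this (completely untested, helper name made up), called
from shrink_node() under current_is_kswapd():

static void kswapd_update_pgdat_flags(pg_data_t *pgdat, struct scan_control *sc)
{
	/* Writeback-heavy scans flag the node for throttling in kswapd. */
	if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
		set_bit(PGDAT_WRITEBACK, &pgdat->flags);

	/* All scanned dirty pages were congested: flag for wait_iff_congested(). */
	if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
		set_bit(PGDAT_CONGESTED, &pgdat->flags);

	/* Allow kswapd to start writing pages during reclaim. */
	if (sc->nr.unqueued_dirty == sc->nr.file_taken)
		set_bit(PGDAT_DIRTY, &pgdat->flags);

	/* Pages marked for immediate reclaim are cycling through too fast: stall. */
	if (sc->nr.immediate)
		congestion_wait(BLK_RW_ASYNC, HZ/10);
}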

Thanks!

> +	if (current_is_kswapd()) {
> +		/*
> +		 * If reclaim is isolating dirty pages under writeback,
> +		 * it implies that the long-lived page allocation rate
> +		 * is exceeding the page laundering rate. Either the
> +		 * global limits are not being effective at throttling
> +		 * processes due to the page distribution throughout
> +		 * zones or there is heavy usage of a slow backing
> +		 * device. The only option is to throttle from reclaim
> +		 * context which is not ideal as there is no guarantee
> +		 * the dirtying process is throttled in the same way
> +		 * balance_dirty_pages() manages.
> +		 *
> +		 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> +		 * count the number of pages under pages flagged for
> +		 * immediate reclaim and stall if any are encountered
> +		 * in the nr_immediate check below.
> +		 */
> +		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> +			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  
>  		/*
> -		 * Legacy memcg will stall in page writeback so avoid forcibly
> -		 * stalling in wait_iff_congested().
> +		 * Tag a node as congested if all the dirty pages
> +		 * scanned were backed by a congested BDI and
> +		 * wait_iff_congested will stall.
>  		 */
> -		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> -		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -			set_memcg_congestion(pgdat, root, true);
> +		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> +			set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> +		/* Allow kswapd to start writing pages during reclaim.*/
> +		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> +			set_bit(PGDAT_DIRTY, &pgdat->flags);
>  
>  		/*
> -		 * Stall direct reclaim for IO completions if underlying BDIs
> -		 * and node is congested. Allow kswapd to continue until it
> -		 * starts encountering unqueued dirty pages or cycling through
> -		 * the LRU too quickly.
> +		 * If kswapd scans pages marked marked for immediate
> +		 * reclaim and under writeback (nr_immediate), it
> +		 * implies that pages are cycling through the LRU
> +		 * faster than they are written so also forcibly stall.
>  		 */
> -		if (!sc->hibernation_mode && !current_is_kswapd() &&
> -		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> -			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> +		if (sc->nr.immediate)
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	}
> +
> +	/*
> +	 * Legacy memcg will stall in page writeback so avoid forcibly
> +	 * stalling in wait_iff_congested().
> +	 */
> +	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> +	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> +		set_memcg_congestion(pgdat, root, true);
> +
> +	/*
> +	 * Stall direct reclaim for IO completions if underlying BDIs
> +	 * and node is congested. Allow kswapd to continue until it
> +	 * starts encountering unqueued dirty pages or cycling through
> +	 * the LRU too quickly.
> +	 */
> +	if (!sc->hibernation_mode && !current_is_kswapd() &&
> +	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> +		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>  
> -	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> -					 sc));
> +	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> +				    sc))
> +		goto again;
>  
>  	/*
>  	 * Kswapd gives up on balancing particular nodes after too
> -- 
> 2.23.0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()
  2019-10-22 14:48 ` [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec() Johannes Weiner
@ 2019-10-22 20:04   ` Roman Gushchin
  2019-10-23 14:21   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 20:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:48:01AM -0400, Johannes Weiner wrote:
> A lruvec holds LRU pages owned by a certain NUMA node and cgroup.
> Instead of awkwardly passing around a combination of a pgdat and a
> memcg pointer, pass down the lruvec as soon as we can look it up.
> 
> Nested callers that need to access node or cgroup properties can look
> them up if necessary, but there are only a few cases.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 235d1fc72311..db073b40c432 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2280,9 +2280,10 @@ enum scan_balance {
>   * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
>   * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>   */
> -static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> -			   struct scan_control *sc, unsigned long *nr)
> +static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> +			   unsigned long *nr)
>  {
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  	int swappiness = mem_cgroup_swappiness(memcg);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  	u64 fraction[2];
> @@ -2530,13 +2531,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	}
>  }
>  
> -/*
> - * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
> - */
> -static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
> -			      struct scan_control *sc)
> +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long targets[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> @@ -2546,7 +2542,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	struct blk_plug plug;
>  	bool scan_adjusted;
>  
> -	get_scan_count(lruvec, memcg, sc, nr);
> +	get_scan_count(lruvec, sc, nr);
>  
>  	/* Record the original scan target for proportional adjustments later */
>  	memcpy(targets, nr, sizeof(nr));
> @@ -2741,6 +2737,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  	memcg = mem_cgroup_iter(root, NULL, NULL);
>  	do {
> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>  
> @@ -2777,7 +2774,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		reclaimed = sc->nr_reclaimed;
>  		scanned = sc->nr_scanned;
> -		shrink_node_memcg(pgdat, memcg, sc);
> +
> +		shrink_lruvec(lruvec, sc);
>  
>  		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
>  			    sc->priority);
> @@ -3281,6 +3279,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  						pg_data_t *pgdat,
>  						unsigned long *nr_scanned)
>  {
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	struct scan_control sc = {
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>  		.target_mem_cgroup = memcg,
> @@ -3307,7 +3306,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  	 * will pick up pages from other mem cgroup's as well. We hack
>  	 * the priority and make it zero.
>  	 */
> -	shrink_node_memcg(pgdat, memcg, &sc);
> +	shrink_lruvec(lruvec, &sc);
>  
>  	trace_mm_vmscan_memcg_softlimit_reclaim_end(
>  					cgroup_ino(memcg->css.cgroup),
> -- 
> 2.23.0
> 
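
(For reference, a nested callee that is only handed the lruvec can recover
both the node and the cgroup with the existing helpers; get_scan_count() in
this patch already does the memcg side exactly like this:)

	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
	int swappiness = mem_cgroup_swappiness(memcg);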

Reviewed-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part
  2019-10-22 14:48 ` [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part Johannes Weiner
@ 2019-10-22 20:08   ` Roman Gushchin
  2019-10-25 14:36     ` Johannes Weiner
  2019-10-23 14:24   ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 20:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:48:02AM -0400, Johannes Weiner wrote:
> This function is getting long and unwieldy; split out the memcg bits.
> 
> The updated shrink_node() handles the generic (node) reclaim aspects:
>   - global vmpressure notifications
>   - writeback and congestion throttling
>   - reclaim/compaction management
>   - kswapd giving up on unreclaimable nodes
> 
> It then calls a new shrink_node_memcgs() which handles cgroup specifics:
>   - the cgroup tree traversal
>   - memory.low considerations
>   - per-cgroup slab shrinking callbacks
>   - per-cgroup vmpressure notifications
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 28 ++++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index db073b40c432..65baa89740dd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2722,18 +2722,10 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
>  		(memcg && memcg_congested(pgdat, memcg));
>  }
>  
> -static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
> -	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	struct mem_cgroup *root = sc->target_mem_cgroup;
> -	unsigned long nr_reclaimed, nr_scanned;
> -	bool reclaimable = false;
>  	struct mem_cgroup *memcg;
> -again:
> -	memset(&sc->nr, 0, sizeof(sc->nr));
> -
> -	nr_reclaimed = sc->nr_reclaimed;
> -	nr_scanned = sc->nr_scanned;
>  
>  	memcg = mem_cgroup_iter(root, NULL, NULL);
>  	do {
> @@ -2786,6 +2778,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			   sc->nr_reclaimed - reclaimed);
>  
>  	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> +}
> +
> +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +{
> +	struct reclaim_state *reclaim_state = current->reclaim_state;
> +	struct mem_cgroup *root = sc->target_mem_cgroup;
> +	unsigned long nr_reclaimed, nr_scanned;
> +	bool reclaimable = false;
> +
> +again:
> +	memset(&sc->nr, 0, sizeof(sc->nr));
> +
> +	nr_reclaimed = sc->nr_reclaimed;
> +	nr_scanned = sc->nr_scanned;
> +
> +	shrink_node_memcgs(pgdat, sc);
>  
>  	if (reclaim_state) {
>  		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> @@ -2793,7 +2801,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	}
>  
>  	/* Record the subtree's reclaim efficiency */
> -	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> +	vmpressure(sc->gfp_mask, root, true,

Maybe target? Or target_memcg? The word root is associated with the root cgroup.

Other than the root naming, the patch looks good to me:

Reviewed-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs
  2019-10-22 14:48 ` [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs Johannes Weiner
@ 2019-10-22 21:03   ` Roman Gushchin
  2019-10-25 14:41     ` Johannes Weiner
  0 siblings, 1 reply; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 21:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 10:48:03AM -0400, Johannes Weiner wrote:
> The current writeback congestion tracking has separate flags for
> kswapd reclaim (node level) and cgroup limit reclaim (memcg-node
> level). This is unnecessarily complicated: the lruvec is an existing
> abstraction layer for that node-memcg intersection.
> 
> Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
> reclaim root level, which is either the NUMA node for global reclaim,
> or the cgroup-node intersection for cgroup reclaim.

Good idea!
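
Just to spell out what the reclaim root's lruvec resolves to here (as I read
it, together with the !memcg fallback this patch adds to mem_cgroup_lruvec()):

	struct lruvec *target_lruvec;

	/*
	 * Global reclaim: sc->target_mem_cgroup is NULL, so this resolves to
	 * root_mem_cgroup's lruvec on this node (or pgdat->__lruvec when the
	 * memory controller is disabled). Cgroup reclaim: the lruvec of the
	 * limit-hitting memcg on this node.
	 */
	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);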

Reviewed-by: Roman Gushchin <guro@fb.com>

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  6 +--
>  include/linux/mmzone.h     | 11 ++++--
>  mm/vmscan.c                | 80 ++++++++++++--------------------------
>  3 files changed, 36 insertions(+), 61 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 498cea07cbb1..d8ffcf60440c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -133,9 +133,6 @@ struct mem_cgroup_per_node {
>  	unsigned long		usage_in_excess;/* Set to the value by which */
>  						/* the soft limit is exceeded*/
>  	bool			on_tree;
> -	bool			congested;	/* memcg has many dirty pages */
> -						/* backed by a congested BDI */
> -
>  	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
>  						/* use container_of	   */
>  };
> @@ -412,6 +409,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
>  		goto out;
>  	}
>  
> +	if (!memcg)
> +		memcg = root_mem_cgroup;
> +
>  	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
>  	lruvec = &mz->lruvec;
>  out:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 449a44171026..c04b4c1f01fa 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -296,6 +296,12 @@ struct zone_reclaim_stat {
>  	unsigned long		recent_scanned[2];
>  };
>  
> +enum lruvec_flags {
> +	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
> +					 * backed by a congested BDI
> +					 */
> +};
> +
>  struct lruvec {
>  	struct list_head		lists[NR_LRU_LISTS];
>  	struct zone_reclaim_stat	reclaim_stat;
> @@ -303,6 +309,8 @@ struct lruvec {
>  	atomic_long_t			inactive_age;
>  	/* Refaults at the time of last reclaim cycle */
>  	unsigned long			refaults;
> +	/* Various lruvec state flags (enum lruvec_flags) */
> +	unsigned long			flags;
>  #ifdef CONFIG_MEMCG
>  	struct pglist_data *pgdat;
>  #endif
> @@ -572,9 +580,6 @@ struct zone {
>  } ____cacheline_internodealigned_in_smp;
>  
>  enum pgdat_flags {
> -	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
> -					 * a congested BDI
> -					 */
>  	PGDAT_DIRTY,			/* reclaim scanning has recently found
>  					 * many dirty file pages at the tail
>  					 * of the LRU.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 65baa89740dd..3e21166d5198 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -267,29 +267,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #endif
>  	return false;
>  }
> -
> -static void set_memcg_congestion(pg_data_t *pgdat,
> -				struct mem_cgroup *memcg,
> -				bool congested)
> -{
> -	struct mem_cgroup_per_node *mn;
> -
> -	if (!memcg)
> -		return;
> -
> -	mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> -	WRITE_ONCE(mn->congested, congested);
> -}
> -
> -static bool memcg_congested(pg_data_t *pgdat,
> -			struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup_per_node *mn;
> -
> -	mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
> -	return READ_ONCE(mn->congested);
> -
> -}
>  #else
>  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  {
> @@ -309,18 +286,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  {
>  	return true;
>  }
> -
> -static inline void set_memcg_congestion(struct pglist_data *pgdat,
> -				struct mem_cgroup *memcg, bool congested)
> -{
> -}
> -
> -static inline bool memcg_congested(struct pglist_data *pgdat,
> -			struct mem_cgroup *memcg)
> -{
> -	return false;
> -
> -}
>  #endif
>  
>  /*
> @@ -2716,12 +2681,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  	return inactive_lru_pages > pages_for_compaction;
>  }
>  
> -static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> -{
> -	return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
> -		(memcg && memcg_congested(pgdat, memcg));
> -}
> -
>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2785,8 +2744,11 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	struct mem_cgroup *root = sc->target_mem_cgroup;
>  	unsigned long nr_reclaimed, nr_scanned;
> +	struct lruvec *target_lruvec;
>  	bool reclaimable = false;
>  
> +	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
> +
>  again:
>  	memset(&sc->nr, 0, sizeof(sc->nr));
>  
> @@ -2829,14 +2791,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
>  			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  
> -		/*
> -		 * Tag a node as congested if all the dirty pages
> -		 * scanned were backed by a congested BDI and
> -		 * wait_iff_congested will stall.
> -		 */
> -		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -			set_bit(PGDAT_CONGESTED, &pgdat->flags);
> -
>  		/* Allow kswapd to start writing pages during reclaim.*/
>  		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
>  			set_bit(PGDAT_DIRTY, &pgdat->flags);
> @@ -2852,12 +2806,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	}
>  
>  	/*
> +	 * Tag a node/memcg as congested if all the dirty pages
> +	 * scanned were backed by a congested BDI and
> +	 * wait_iff_congested will stall.
> +	 *
>  	 * Legacy memcg will stall in page writeback so avoid forcibly
>  	 * stalling in wait_iff_congested().
>  	 */
> -	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> +	if ((current_is_kswapd() ||
> +	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
>  	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -		set_memcg_congestion(pgdat, root, true);
> +		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
>  
>  	/*
>  	 * Stall direct reclaim for IO completions if underlying BDIs
> @@ -2865,8 +2824,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	 * starts encountering unqueued dirty pages or cycling through
>  	 * the LRU too quickly.
>  	 */
> -	if (!sc->hibernation_mode && !current_is_kswapd() &&
> -	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> +	if (!current_is_kswapd() && current_may_throttle() &&
> +	    !sc->hibernation_mode &&
> +	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
>  		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>  
>  	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> @@ -3080,8 +3040,16 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		if (zone->zone_pgdat == last_pgdat)
>  			continue;
>  		last_pgdat = zone->zone_pgdat;
> +
>  		snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
> -		set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
> +
> +		if (cgroup_reclaim(sc)) {
> +			struct lruvec *lruvec;
> +
> +			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
> +						   zone->zone_pgdat);
> +			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
> +		}
>  	}
>  
>  	delayacct_freepages_end();
> @@ -3461,7 +3429,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
>  /* Clear pgdat state for congested, dirty or under writeback. */
>  static void clear_pgdat_congested(pg_data_t *pgdat)
>  {
> -	clear_bit(PGDAT_CONGESTED, &pgdat->flags);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
> +
> +	clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
>  	clear_bit(PGDAT_DIRTY, &pgdat->flags);
>  	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  }
> -- 
> 2.23.0
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure
  2019-10-22 19:25   ` Roman Gushchin
@ 2019-10-22 21:31     ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 21:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 07:25:01PM +0000, Roman Gushchin wrote:
> On Tue, Oct 22, 2019 at 10:47:57AM -0400, Johannes Weiner wrote:
> > There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
> > used is somewhat confusing right now, and it's easy to make mistakes -
> > especially when it comes to global reclaim.
> > 
> > How it works: when memory cgroups are enabled, we always use the
> > root_mem_cgroup's per-node lruvecs. When memory cgroups are not
> > compiled in or disabled at runtime, we use pgdat->lruvec.
> > 
> > Document that in a comment.
> > 
> > Due to the way the reclaim code is generalized, all lookups use the
> > mem_cgroup_lruvec() helper function, and nobody should have to find
> > the right lruvec manually right now. But to avoid future mistakes,
> > rename the pgdat->lruvec member to pgdat->__lruvec and delete the
> > convenience wrapper that suggests it's a commonly accessed member.
> 
> This part looks great!

Thanks!

> > While in this area, swap the mem_cgroup_lruvec() argument order. The
> > name suggests a memcg operation, yet it takes a pgdat first and a
> > memcg second. I have to double take every time I call this. Fix that.
> 
> Idk, I agree that the new order makes more sense (slightly), but
> such changes make any backports / git blame searches more complex.
> So, I'm not entirely convinced that it's worth it. The compiler will
> prevent passing bad arguments by mistake.

Lol, this has cost me a lot of time since we've had it. It takes you
out of the flow while writing code and makes you think about something
stupid and trivial.

The backport period is limited, but the mental overhead of a stupid
interface is forever!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump
  2019-10-22 19:56   ` Roman Gushchin
@ 2019-10-22 21:42     ` Johannes Weiner
  2019-10-22 22:46       ` Roman Gushchin
  0 siblings, 1 reply; 32+ messages in thread
From: Johannes Weiner @ 2019-10-22 21:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 07:56:33PM +0000, Roman Gushchin wrote:
> On Tue, Oct 22, 2019 at 10:48:00AM -0400, Johannes Weiner wrote:
> > -			/* Record the group's reclaim efficiency */
> > -			vmpressure(sc->gfp_mask, memcg, false,
> > -				   sc->nr_scanned - scanned,
> > -				   sc->nr_reclaimed - reclaimed);
> > -
> > -		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> > +		reclaimed = sc->nr_reclaimed;
> > +		scanned = sc->nr_scanned;
> > +		shrink_node_memcg(pgdat, memcg, sc);
> >  
> > -		if (reclaim_state) {
> > -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > -			reclaim_state->reclaimed_slab = 0;
> > -		}
> > +		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > +			    sc->priority);
> >  
> > -		/* Record the subtree's reclaim efficiency */
> > -		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> > -			   sc->nr_scanned - nr_scanned,
> > -			   sc->nr_reclaimed - nr_reclaimed);
> > +		/* Record the group's reclaim efficiency */
> > +		vmpressure(sc->gfp_mask, memcg, false,
> > +			   sc->nr_scanned - scanned,
> > +			   sc->nr_reclaimed - reclaimed);
> 
> It doesn't look like a trivial change. I'd add some comments to the commit message
> explaining why it's safe to do.

It's an equivalent change - it's just really misleading because the
+++ lines are not the counter-part of the --- lines here!

There are two vmpressure calls in this function: one against the
individual cgroups, and one against the tree. The diff puts them
adjacent here, but the counter-part for the --- lines is here:

> > +	/* Record the subtree's reclaim efficiency */
> > +	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> > +		   sc->nr_scanned - nr_scanned,
> > +		   sc->nr_reclaimed - nr_reclaimed);

And the counter-part to the +++ lines is further up (beginning of the
quoted diff).
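
Trimmed down, the new shrink_node() body reads (protection handling and the
kswapd flag updates left out):

	memcg = mem_cgroup_iter(root, NULL, NULL);
	do {
		reclaimed = sc->nr_reclaimed;
		scanned = sc->nr_scanned;

		shrink_node_memcg(pgdat, memcg, sc);
		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);

		/* Record the group's reclaim efficiency */
		vmpressure(sc->gfp_mask, memcg, false,
			   sc->nr_scanned - scanned,
			   sc->nr_reclaimed - reclaimed);
	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));

	if (reclaim_state) {
		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}

	/* Record the subtree's reclaim efficiency */
	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
		   sc->nr_scanned - nr_scanned,
		   sc->nr_reclaimed - nr_reclaimed);

So the per-group call still happens once per cgroup inside the loop, and the
subtree call still happens once after the loop, same as before.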

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump
  2019-10-22 21:42     ` Johannes Weiner
@ 2019-10-22 22:46       ` Roman Gushchin
  0 siblings, 0 replies; 32+ messages in thread
From: Roman Gushchin @ 2019-10-22 22:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 05:42:49PM -0400, Johannes Weiner wrote:
> On Tue, Oct 22, 2019 at 07:56:33PM +0000, Roman Gushchin wrote:
> > On Tue, Oct 22, 2019 at 10:48:00AM -0400, Johannes Weiner wrote:
> > > -			/* Record the group's reclaim efficiency */
> > > -			vmpressure(sc->gfp_mask, memcg, false,
> > > -				   sc->nr_scanned - scanned,
> > > -				   sc->nr_reclaimed - reclaimed);
> > > -
> > > -		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> > > +		reclaimed = sc->nr_reclaimed;
> > > +		scanned = sc->nr_scanned;
> > > +		shrink_node_memcg(pgdat, memcg, sc);
> > >  
> > > -		if (reclaim_state) {
> > > -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > > -			reclaim_state->reclaimed_slab = 0;
> > > -		}
> > > +		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> > > +			    sc->priority);
> > >  
> > > -		/* Record the subtree's reclaim efficiency */
> > > -		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> > > -			   sc->nr_scanned - nr_scanned,
> > > -			   sc->nr_reclaimed - nr_reclaimed);
> > > +		/* Record the group's reclaim efficiency */
> > > +		vmpressure(sc->gfp_mask, memcg, false,
> > > +			   sc->nr_scanned - scanned,
> > > +			   sc->nr_reclaimed - reclaimed);
> > 
> > It doesn't look like a trivial change. I'd add some comments to the commit message
> > explaining why it's safe to do.
> 
> It's an equivalent change - it's just really misleading because the
> +++ lines are not the counter-part of the --- lines here!
> 
> There are two vmpressure calls in this function: one against the
> individual cgroups, and one against the tree. The diff puts them
> adjacent here, but the counter-part for the --- lines is here:
> 
> > > +	/* Record the subtree's reclaim efficiency */
> > > +	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> > > +		   sc->nr_scanned - nr_scanned,
> > > +		   sc->nr_reclaimed - nr_reclaimed);
> 
> And the counter-part to the +++ lines is further up (beginning of the
> quoted diff).
> 

Ah, ok, got it. You were right in the foreword: indentation-change
diffs are hard to read.

Thanks for the explanation!

Reviewed-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size()
  2019-10-22 14:47 ` [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size() Johannes Weiner
  2019-10-22 19:18   ` Roman Gushchin
@ 2019-10-23 13:48   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 13:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:47:56, Johannes Weiner wrote:
> This function currently takes the node or lruvec size and subtracts
> the zones that are excluded by the classzone index of the
> allocation. It uses four different types of counters to do this.
> 
> Just add up the eligible zones.

The original intention was to optimize this for GFP_KERNEL-like
allocations by reducing the number of zones to walk. But considering
this is not called from hot paths, I do agree that simpler code is
preferable.
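
Either way the result is the same; a quick made-up example with three
populated zones and a GFP_KERNEL-like classzone of ZONE_NORMAL:

	/*
	 * LRU pages per zone (made up):  DMA32 = 100, NORMAL = 900, MOVABLE = 50
	 *
	 * old: node/lruvec total (1050), then subtract the zones above the
	 *      classzone (MOVABLE: 50)                          -> 1000
	 * new: add up the eligible zones directly (100 + 900)   -> 1000
	 */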
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 21 +++++----------------
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1154b3a2b637..57f533b808f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -351,32 +351,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size = 0;
> +	unsigned long size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled()) {
> -		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> -			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> -	} else
> -		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
> -
> -	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> +	for (zid = 0; zid <= zone_idx; zid++) {
>  		struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];
> -		unsigned long size;
>  
>  		if (!managed_zone(zone))
>  			continue;
>  
>  		if (!mem_cgroup_disabled())
> -			size = mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +			size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
>  		else
> -			size = zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
> -				       NR_ZONE_LRU_BASE + lru);
> -		lru_size -= min(size, lru_size);
> +			size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
>  	}
> -
> -	return lru_size;
> -
> +	return size;
>  }
>  
>  /*
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure
  2019-10-22 14:47 ` [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure Johannes Weiner
  2019-10-22 19:25   ` Roman Gushchin
@ 2019-10-23 14:00   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 14:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:47:57, Johannes Weiner wrote:
> There is a per-memcg lruvec and a NUMA node lruvec. Which one is being
> used is somewhat confusing right now, and it's easy to make mistakes -
> especially when it comes to global reclaim.
> 
> How it works: when memory cgroups are enabled, we always use the
> root_mem_cgroup's per-node lruvecs. When memory cgroups are not
> compiled in or disabled at runtime, we use pgdat->lruvec.
> 
> Document that in a comment.
> 
> Due to the way the reclaim code is generalized, all lookups use the
> mem_cgroup_lruvec() helper function, and nobody should have to find
> the right lruvec manually right now. But to avoid future mistakes,
> rename the pgdat->lruvec member to pgdat->__lruvec and delete the
> convenience wrapper that suggests it's a commonly accessed member.
> 
> While in this area, swap the mem_cgroup_lruvec() argument order. The
> name suggests a memcg operation, yet it takes a pgdat first and a
> memcg second. I have to double take every time I call this. Fix that.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I do agree that node_lruvec() adds confusion and it is better to get rid
of it.
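
As a side note, purely to illustrate the argument-order point, a typical
call site goes from

	lruvec = mem_cgroup_lruvec(pgdat, memcg);

to

	lruvec = mem_cgroup_lruvec(memcg, pgdat);

which reads more naturally for a mem_cgroup_* helper; the call sites in the
hunks below all follow this pattern.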

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 26 +++++++++++++-------------
>  include/linux/mmzone.h     | 15 ++++++++-------
>  mm/memcontrol.c            | 12 ++++++------
>  mm/page_alloc.c            |  2 +-
>  mm/slab.h                  |  4 ++--
>  mm/vmscan.c                |  6 +++---
>  mm/workingset.c            |  8 ++++----
>  7 files changed, 37 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2b34925fc19d..498cea07cbb1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -393,22 +393,22 @@ mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
>  }
>  
>  /**
> - * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
> - * @node: node of the wanted lruvec
> + * mem_cgroup_lruvec - get the lru list vector for a memcg & node
>   * @memcg: memcg of the wanted lruvec
> + * @node: node of the wanted lruvec
>   *
> - * Returns the lru list vector holding pages for a given @node or a given
> - * @memcg and @zone. This can be the node lruvec, if the memory controller
> - * is disabled.
> + * Returns the lru list vector holding pages for a given @memcg &
> + * @node combination. This can be the node lruvec, if the memory
> + * controller is disabled.
>   */
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> -				struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> +					       struct pglist_data *pgdat)
>  {
>  	struct mem_cgroup_per_node *mz;
>  	struct lruvec *lruvec;
>  
>  	if (mem_cgroup_disabled()) {
> -		lruvec = node_lruvec(pgdat);
> +		lruvec = &pgdat->__lruvec;
>  		goto out;
>  	}
>  
> @@ -727,7 +727,7 @@ static inline void __mod_lruvec_page_state(struct page *page,
>  		return;
>  	}
>  
> -	lruvec = mem_cgroup_lruvec(pgdat, page->mem_cgroup);
> +	lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
>  	__mod_lruvec_state(lruvec, idx, val);
>  }
>  
> @@ -898,16 +898,16 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
>  {
>  }
>  
> -static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> -				struct mem_cgroup *memcg)
> +static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
> +					       struct pglist_data *pgdat)
>  {
> -	return node_lruvec(pgdat);
> +	return &pgdat->__lruvec;
>  }
>  
>  static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
>  						    struct pglist_data *pgdat)
>  {
> -	return &pgdat->lruvec;
> +	return &pgdat->__lruvec;
>  }
>  
>  static inline bool mm_match_cgroup(struct mm_struct *mm,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d4ca03b93373..449a44171026 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -777,7 +777,13 @@ typedef struct pglist_data {
>  #endif
>  
>  	/* Fields commonly accessed by the page reclaim scanner */
> -	struct lruvec		lruvec;
> +
> +	/*
> +	 * NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
> +	 *
> +	 * Use mem_cgroup_lruvec() to look up lruvecs.
> +	 */
> +	struct lruvec		__lruvec;
>  
>  	unsigned long		flags;
>  
> @@ -800,11 +806,6 @@ typedef struct pglist_data {
>  #define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
>  #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
>  
> -static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
> -{
> -	return &pgdat->lruvec;
> -}
> -
>  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
>  {
>  	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
> @@ -842,7 +843,7 @@ static inline struct pglist_data *lruvec_pgdat(struct lruvec *lruvec)
>  #ifdef CONFIG_MEMCG
>  	return lruvec->pgdat;
>  #else
> -	return container_of(lruvec, struct pglist_data, lruvec);
> +	return container_of(lruvec, struct pglist_data, __lruvec);
>  #endif
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 98c2fd902533..055975b0b3a3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -770,7 +770,7 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
>  	if (!memcg || memcg == root_mem_cgroup) {
>  		__mod_node_page_state(pgdat, idx, val);
>  	} else {
> -		lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		__mod_lruvec_state(lruvec, idx, val);
>  	}
>  	rcu_read_unlock();
> @@ -1226,7 +1226,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
>  	struct lruvec *lruvec;
>  
>  	if (mem_cgroup_disabled()) {
> -		lruvec = &pgdat->lruvec;
> +		lruvec = &pgdat->__lruvec;
>  		goto out;
>  	}
>  
> @@ -1605,7 +1605,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
>  		int nid, bool noswap)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>  
>  	if (lruvec_page_state(lruvec, NR_INACTIVE_FILE) ||
>  	    lruvec_page_state(lruvec, NR_ACTIVE_FILE))
> @@ -3735,7 +3735,7 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>  					   int nid, unsigned int lru_mask)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(NODE_DATA(nid), memcg);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>  	unsigned long nr = 0;
>  	enum lru_list lru;
>  
> @@ -5433,8 +5433,8 @@ static int mem_cgroup_move_account(struct page *page,
>  	anon = PageAnon(page);
>  
>  	pgdat = page_pgdat(page);
> -	from_vec = mem_cgroup_lruvec(pgdat, from);
> -	to_vec = mem_cgroup_lruvec(pgdat, to);
> +	from_vec = mem_cgroup_lruvec(from, pgdat);
> +	to_vec = mem_cgroup_lruvec(to, pgdat);
>  
>  	spin_lock_irqsave(&from->move_lock, flags);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fe76be55c9d5..791c018314b3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6708,7 +6708,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  
>  	pgdat_page_ext_init(pgdat);
>  	spin_lock_init(&pgdat->lru_lock);
> -	lruvec_init(node_lruvec(pgdat));
> +	lruvec_init(&pgdat->__lruvec);
>  }
>  
>  static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
> diff --git a/mm/slab.h b/mm/slab.h
> index 3eb29ae75743..2bbecf28688d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -369,7 +369,7 @@ static __always_inline int memcg_charge_slab(struct page *page,
>  	if (ret)
>  		goto out;
>  
> -	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>  	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
>  
>  	/* transer try_charge() page references to kmem_cache */
> @@ -393,7 +393,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>  	rcu_read_lock();
>  	memcg = READ_ONCE(s->memcg_params.memcg);
>  	if (likely(!mem_cgroup_is_root(memcg))) {
> -		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>  		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>  		memcg_kmem_uncharge_memcg(page, order, memcg);
>  	} else {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 57f533b808f2..be3c22c274c1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2545,7 +2545,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
>  			      struct scan_control *sc)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long targets[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> @@ -3023,7 +3023,7 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, pg_data_t *pgdat)
>  		unsigned long refaults;
>  		struct lruvec *lruvec;
>  
> -		lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		refaults = lruvec_page_state_local(lruvec, WORKINGSET_ACTIVATE);
>  		lruvec->refaults = refaults;
>  	} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
> @@ -3391,7 +3391,7 @@ static void age_active_anon(struct pglist_data *pgdat,
>  
>  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
>  	do {
> -		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  
>  		if (inactive_list_is_low(lruvec, false, sc, true))
>  			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> diff --git a/mm/workingset.c b/mm/workingset.c
> index c963831d354f..e8212123c1c3 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -233,7 +233,7 @@ void *workingset_eviction(struct page *page)
>  	VM_BUG_ON_PAGE(page_count(page), page);
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
> -	lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	eviction = atomic_long_inc_return(&lruvec->inactive_age);
>  	return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
>  }
> @@ -280,7 +280,7 @@ void workingset_refault(struct page *page, void *shadow)
>  	memcg = mem_cgroup_from_id(memcgid);
>  	if (!mem_cgroup_disabled() && !memcg)
>  		goto out;
> -	lruvec = mem_cgroup_lruvec(pgdat, memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	refault = atomic_long_read(&lruvec->inactive_age);
>  	active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
>  
> @@ -345,7 +345,7 @@ void workingset_activation(struct page *page)
>  	memcg = page_memcg_rcu(page);
>  	if (!mem_cgroup_disabled() && !memcg)
>  		goto out;
> -	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
>  	atomic_long_inc(&lruvec->inactive_age);
>  out:
>  	rcu_read_unlock();
> @@ -426,7 +426,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  		struct lruvec *lruvec;
>  		int i;
>  
> -		lruvec = mem_cgroup_lruvec(NODE_DATA(sc->nid), sc->memcg);
> +		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  			pages += lruvec_page_state_local(lruvec,
>  							 NR_LRU_BASE + i);
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller
  2019-10-22 14:47 ` [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller Johannes Weiner
  2019-10-22 19:28   ` Roman Gushchin
@ 2019-10-23 14:06   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 14:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:47:58, Johannes Weiner wrote:
> inactive_list_is_low() should be about one thing: checking the ratio
> between inactive and active list. Kitchensink checks like the one for
> swap space makes the function hard to use and modify its
> callsites. Luckly, most callers already have an understanding of the
> swap situation, so it's easy to clean up.
> 
> get_scan_count() has its own, memcg-aware swap check, and doesn't even
> get to the inactive_list_is_low() check on the anon list when there is
> no swap space available.
> 
> shrink_list() is called on the results of get_scan_count(), so that
> check is redundant too.
> 
> age_active_anon() has its own totalswap_pages check right before it
> checks the list proportions.
> 
> The shrink_node_memcg() site is the only one that doesn't do its own
> swap check. Add it there.
> 
> Then delete the swap check from inactive_list_is_low().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

OK, makes sense to me.
Acked-by: Michal Hocko <mhocko@suse.com>
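
For quick reference, the one caller that gains its own check ends up looking
like this (adapted from the hunk below):

	/*
	 * Even if we did not try to evict anon pages at all, we want to
	 * rebalance the anon lru active/inactive ratio - but without swap,
	 * anon deactivation is pointless, so check that here now.
	 */
	if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
				   sc, LRU_ACTIVE_ANON);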

> ---
>  mm/vmscan.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index be3c22c274c1..622b77488144 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2226,13 +2226,6 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
>  	unsigned long refaults;
>  	unsigned long gb;
>  
> -	/*
> -	 * If we don't have swap space, anonymous page deactivation
> -	 * is pointless.
> -	 */
> -	if (!file && !total_swap_pages)
> -		return false;
> -
>  	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
>  	active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
>  
> @@ -2653,7 +2646,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * Even if we did not try to evict anon pages at all, we want to
>  	 * rebalance the anon lru active/inactive ratio.
>  	 */
> -	if (inactive_list_is_low(lruvec, false, sc, true))
> +	if (total_swap_pages && inactive_list_is_low(lruvec, false, sc, true))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>  }
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
  2019-10-22 14:47 ` [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim() Johannes Weiner
  2019-10-22 19:40   ` Roman Gushchin
@ 2019-10-23 14:14   ` Michal Hocko
  2019-10-23 15:56     ` Johannes Weiner
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 14:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:47:59, Johannes Weiner wrote:
> Seven years after introducing the global_reclaim() function, I still
> have to double take when reading a callsite. I don't know how others
> do it, this is a terrible name.

I somehow never had a problem with that but ...
> 
> Invert the meaning and rename it to cgroup_reclaim().
> 
> [ After all, "global reclaim" is just regular reclaim invoked from the
>   page allocator. It's reclaim on behalf of a cgroup limit that is a
>   special case of reclaim, and should be explicit - not the reverse. ]

... this is a valid point.

> sane_reclaim() isn't very descriptive either: it tests whether we can
> use the regular writeback throttling - available during regular page
> reclaim or cgroup2 limit reclaim - or need to use the broken
> wait_on_page_writeback() method. Use "writeback_throttling_sane()".

I do have a stronger opinion on this one. sane_reclaim is really a
terrible name. As you say the only thing this should really tell is
whether writeback throttling is available so I would rather go with
has_writeback_throttling() or writeback_throttling_{enabled,available}.
If you insist on having sane in the name then I won't object but it just
raises a question whether we have some levels of throttling with a
different level of sanity.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 38 ++++++++++++++++++--------------------
>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 622b77488144..302dad112f75 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  	up_write(&shrinker_rwsem);
>  }
>  
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
>  {
> -	return !sc->target_mem_cgroup;
> +	return sc->target_mem_cgroup;
>  }
>  
>  /**
> - * sane_reclaim - is the usual dirty throttling mechanism operational?
> + * writeback_throttling_sane - is the usual dirty throttling mechanism available?
>   * @sc: scan_control in question
>   *
>   * The normal page dirty throttling mechanism in balance_dirty_pages() is
> @@ -257,11 +257,9 @@ static bool global_reclaim(struct scan_control *sc)
>   * This function tests whether the vmscan currently in progress can assume
>   * that the normal dirty throttling mechanism is operational.
>   */
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_throttling_sane(struct scan_control *sc)
>  {
> -	struct mem_cgroup *memcg = sc->target_mem_cgroup;
> -
> -	if (!memcg)
> +	if (!cgroup_reclaim(sc))
>  		return true;
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> @@ -302,12 +300,12 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
>  {
>  }
>  
> -static bool global_reclaim(struct scan_control *sc)
> +static bool cgroup_reclaim(struct scan_control *sc)
>  {
> -	return true;
> +	return false;
>  }
>  
> -static bool sane_reclaim(struct scan_control *sc)
> +static bool writeback_throttling_sane(struct scan_control *sc)
>  {
>  	return true;
>  }
> @@ -1227,7 +1225,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto activate_locked;
>  
>  			/* Case 2 above */
> -			} else if (sane_reclaim(sc) ||
> +			} else if (writeback_throttling_sane(sc) ||
>  			    !PageReclaim(page) || !may_enter_fs) {
>  				/*
>  				 * This is slightly racy - end_page_writeback()
> @@ -1821,7 +1819,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  	if (current_is_kswapd())
>  		return 0;
>  
> -	if (!sane_reclaim(sc))
> +	if (!writeback_throttling_sane(sc))
>  		return 0;
>  
>  	if (file) {
> @@ -1971,7 +1969,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
>  	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> -	if (global_reclaim(sc))
> +	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_scanned);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
>  	spin_unlock_irq(&pgdat->lru_lock);
> @@ -1985,7 +1983,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	spin_lock_irq(&pgdat->lru_lock);
>  
>  	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> -	if (global_reclaim(sc))
> +	if (!cgroup_reclaim(sc))
>  		__count_vm_events(item, nr_reclaimed);
>  	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>  	reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
> @@ -2309,7 +2307,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * using the memory controller's swap limit feature would be
>  	 * too expensive.
>  	 */
> -	if (!global_reclaim(sc) && !swappiness) {
> +	if (cgroup_reclaim(sc) && !swappiness) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
>  	}
> @@ -2333,7 +2331,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * thrashing file LRU becomes infinitely more attractive than
>  	 * anon pages.  Try to detect this based on file LRU size.
>  	 */
> -	if (global_reclaim(sc)) {
> +	if (!cgroup_reclaim(sc)) {
>  		unsigned long pgdatfile;
>  		unsigned long pgdatfree;
>  		int z;
> @@ -2564,7 +2562,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	 * abort proportional reclaim if either the file or anon lru has already
>  	 * dropped to zero at the first pass.
>  	 */
> -	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
> +	scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
>  			 sc->priority == DEF_PRIORITY);
>  
>  	blk_start_plug(&plug);
> @@ -2853,7 +2851,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		 * Legacy memcg will stall in page writeback so avoid forcibly
>  		 * stalling in wait_iff_congested().
>  		 */
> -		if (!global_reclaim(sc) && sane_reclaim(sc) &&
> +		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
>  		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
>  			set_memcg_congestion(pgdat, root, true);
>  
> @@ -2948,7 +2946,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  		 * Take care memory controller reclaiming has small influence
>  		 * to global LRU.
>  		 */
> -		if (global_reclaim(sc)) {
> +		if (!cgroup_reclaim(sc)) {
>  			if (!cpuset_zone_allowed(zone,
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
> @@ -3048,7 +3046,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  retry:
>  	delayacct_freepages_start();
>  
> -	if (global_reclaim(sc))
> +	if (!cgroup_reclaim(sc))
>  		__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
>  
>  	do {
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump
  2019-10-22 14:48 ` [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump Johannes Weiner
  2019-10-22 19:56   ` Roman Gushchin
@ 2019-10-23 14:18   ` Michal Hocko
  2019-10-25 13:44     ` Johannes Weiner
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 14:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:48:00, Johannes Weiner wrote:
> Most of the function body is inside a loop, which imposes an
> additional indentation and scoping level that makes the code a bit
> hard to follow and modify.

I do agree!

> The looping only happens in case of reclaim-compaction, which isn't
> the common case. So rather than adding yet another function level to
> the reclaim path and having every reclaim invocation go through a level
> that only exists for one specific corner case, use a retry goto.

I would just keep the core logic in its own function and do the loop
around it rather than a goto retry. This is certainly a matter of taste
but I like a loop with an explicit condition much more than an if with
goto.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 231 ++++++++++++++++++++++++++--------------------------
>  1 file changed, 115 insertions(+), 116 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 302dad112f75..235d1fc72311 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2729,144 +2729,143 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
>  static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> +	struct mem_cgroup *root = sc->target_mem_cgroup;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	bool reclaimable = false;
> +	struct mem_cgroup *memcg;
> +again:
> +	memset(&sc->nr, 0, sizeof(sc->nr));
>  
> -	do {
> -		struct mem_cgroup *root = sc->target_mem_cgroup;
> -		struct mem_cgroup *memcg;
> -
> -		memset(&sc->nr, 0, sizeof(sc->nr));
> -
> -		nr_reclaimed = sc->nr_reclaimed;
> -		nr_scanned = sc->nr_scanned;
> +	nr_reclaimed = sc->nr_reclaimed;
> +	nr_scanned = sc->nr_scanned;
>  
> -		memcg = mem_cgroup_iter(root, NULL, NULL);
> -		do {
> -			unsigned long reclaimed;
> -			unsigned long scanned;
> +	memcg = mem_cgroup_iter(root, NULL, NULL);
> +	do {
> +		unsigned long reclaimed;
> +		unsigned long scanned;
>  
> -			switch (mem_cgroup_protected(root, memcg)) {
> -			case MEMCG_PROT_MIN:
> -				/*
> -				 * Hard protection.
> -				 * If there is no reclaimable memory, OOM.
> -				 */
> +		switch (mem_cgroup_protected(root, memcg)) {
> +		case MEMCG_PROT_MIN:
> +			/*
> +			 * Hard protection.
> +			 * If there is no reclaimable memory, OOM.
> +			 */
> +			continue;
> +		case MEMCG_PROT_LOW:
> +			/*
> +			 * Soft protection.
> +			 * Respect the protection only as long as
> +			 * there is an unprotected supply
> +			 * of reclaimable memory from other cgroups.
> +			 */
> +			if (!sc->memcg_low_reclaim) {
> +				sc->memcg_low_skipped = 1;
>  				continue;
> -			case MEMCG_PROT_LOW:
> -				/*
> -				 * Soft protection.
> -				 * Respect the protection only as long as
> -				 * there is an unprotected supply
> -				 * of reclaimable memory from other cgroups.
> -				 */
> -				if (!sc->memcg_low_reclaim) {
> -					sc->memcg_low_skipped = 1;
> -					continue;
> -				}
> -				memcg_memory_event(memcg, MEMCG_LOW);
> -				break;
> -			case MEMCG_PROT_NONE:
> -				/*
> -				 * All protection thresholds breached. We may
> -				 * still choose to vary the scan pressure
> -				 * applied based on by how much the cgroup in
> -				 * question has exceeded its protection
> -				 * thresholds (see get_scan_count).
> -				 */
> -				break;
>  			}
> +			memcg_memory_event(memcg, MEMCG_LOW);
> +			break;
> +		case MEMCG_PROT_NONE:
> +			/*
> +			 * All protection thresholds breached. We may
> +			 * still choose to vary the scan pressure
> +			 * applied based on by how much the cgroup in
> +			 * question has exceeded its protection
> +			 * thresholds (see get_scan_count).
> +			 */
> +			break;
> +		}
>  
> -			reclaimed = sc->nr_reclaimed;
> -			scanned = sc->nr_scanned;
> -			shrink_node_memcg(pgdat, memcg, sc);
> -
> -			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> -					sc->priority);
> -
> -			/* Record the group's reclaim efficiency */
> -			vmpressure(sc->gfp_mask, memcg, false,
> -				   sc->nr_scanned - scanned,
> -				   sc->nr_reclaimed - reclaimed);
> -
> -		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> +		reclaimed = sc->nr_reclaimed;
> +		scanned = sc->nr_scanned;
> +		shrink_node_memcg(pgdat, memcg, sc);
>  
> -		if (reclaim_state) {
> -			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -			reclaim_state->reclaimed_slab = 0;
> -		}
> +		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> +			    sc->priority);
>  
> -		/* Record the subtree's reclaim efficiency */
> -		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> -			   sc->nr_scanned - nr_scanned,
> -			   sc->nr_reclaimed - nr_reclaimed);
> +		/* Record the group's reclaim efficiency */
> +		vmpressure(sc->gfp_mask, memcg, false,
> +			   sc->nr_scanned - scanned,
> +			   sc->nr_reclaimed - reclaimed);
>  
> -		if (sc->nr_reclaimed - nr_reclaimed)
> -			reclaimable = true;
> +	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
>  
> -		if (current_is_kswapd()) {
> -			/*
> -			 * If reclaim is isolating dirty pages under writeback,
> -			 * it implies that the long-lived page allocation rate
> -			 * is exceeding the page laundering rate. Either the
> -			 * global limits are not being effective at throttling
> -			 * processes due to the page distribution throughout
> -			 * zones or there is heavy usage of a slow backing
> -			 * device. The only option is to throttle from reclaim
> -			 * context which is not ideal as there is no guarantee
> -			 * the dirtying process is throttled in the same way
> -			 * balance_dirty_pages() manages.
> -			 *
> -			 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> -			 * count the number of pages under pages flagged for
> -			 * immediate reclaim and stall if any are encountered
> -			 * in the nr_immediate check below.
> -			 */
> -			if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> -				set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +	if (reclaim_state) {
> +		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +		reclaim_state->reclaimed_slab = 0;
> +	}
>  
> -			/*
> -			 * Tag a node as congested if all the dirty pages
> -			 * scanned were backed by a congested BDI and
> -			 * wait_iff_congested will stall.
> -			 */
> -			if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -				set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +	/* Record the subtree's reclaim efficiency */
> +	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> +		   sc->nr_scanned - nr_scanned,
> +		   sc->nr_reclaimed - nr_reclaimed);
>  
> -			/* Allow kswapd to start writing pages during reclaim.*/
> -			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> -				set_bit(PGDAT_DIRTY, &pgdat->flags);
> +	if (sc->nr_reclaimed - nr_reclaimed)
> +		reclaimable = true;
>  
> -			/*
> -			 * If kswapd scans pages marked marked for immediate
> -			 * reclaim and under writeback (nr_immediate), it
> -			 * implies that pages are cycling through the LRU
> -			 * faster than they are written so also forcibly stall.
> -			 */
> -			if (sc->nr.immediate)
> -				congestion_wait(BLK_RW_ASYNC, HZ/10);
> -		}
> +	if (current_is_kswapd()) {
> +		/*
> +		 * If reclaim is isolating dirty pages under writeback,
> +		 * it implies that the long-lived page allocation rate
> +		 * is exceeding the page laundering rate. Either the
> +		 * global limits are not being effective at throttling
> +		 * processes due to the page distribution throughout
> +		 * zones or there is heavy usage of a slow backing
> +		 * device. The only option is to throttle from reclaim
> +		 * context which is not ideal as there is no guarantee
> +		 * the dirtying process is throttled in the same way
> +		 * balance_dirty_pages() manages.
> +		 *
> +		 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
> +		 * count the number of pages under pages flagged for
> +		 * immediate reclaim and stall if any are encountered
> +		 * in the nr_immediate check below.
> +		 */
> +		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
> +			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  
>  		/*
> -		 * Legacy memcg will stall in page writeback so avoid forcibly
> -		 * stalling in wait_iff_congested().
> +		 * Tag a node as congested if all the dirty pages
> +		 * scanned were backed by a congested BDI and
> +		 * wait_iff_congested will stall.
>  		 */
> -		if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> -		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> -			set_memcg_congestion(pgdat, root, true);
> +		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> +			set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> +		/* Allow kswapd to start writing pages during reclaim.*/
> +		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> +			set_bit(PGDAT_DIRTY, &pgdat->flags);
>  
>  		/*
> -		 * Stall direct reclaim for IO completions if underlying BDIs
> -		 * and node is congested. Allow kswapd to continue until it
> -		 * starts encountering unqueued dirty pages or cycling through
> -		 * the LRU too quickly.
> +		 * If kswapd scans pages marked marked for immediate
> +		 * reclaim and under writeback (nr_immediate), it
> +		 * implies that pages are cycling through the LRU
> +		 * faster than they are written so also forcibly stall.
>  		 */
> -		if (!sc->hibernation_mode && !current_is_kswapd() &&
> -		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> -			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> +		if (sc->nr.immediate)
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	}
> +
> +	/*
> +	 * Legacy memcg will stall in page writeback so avoid forcibly
> +	 * stalling in wait_iff_congested().
> +	 */
> +	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
> +	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> +		set_memcg_congestion(pgdat, root, true);
> +
> +	/*
> +	 * Stall direct reclaim for IO completions if underlying BDIs
> +	 * and node is congested. Allow kswapd to continue until it
> +	 * starts encountering unqueued dirty pages or cycling through
> +	 * the LRU too quickly.
> +	 */
> +	if (!sc->hibernation_mode && !current_is_kswapd() &&
> +	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> +		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>  
> -	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> -					 sc));
> +	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> +				    sc))
> +		goto again;
>  
>  	/*
>  	 * Kswapd gives up on balancing particular nodes after too
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()
  2019-10-22 14:48 ` [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec() Johannes Weiner
  2019-10-22 20:04   ` Roman Gushchin
@ 2019-10-23 14:21   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 14:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:48:01, Johannes Weiner wrote:
> A lruvec holds LRU pages owned by a certain NUMA node and cgroup.
> Instead of awkwardly passing around a combination of a pgdat and a
> memcg pointer, pass down the lruvec as soon as we can look it up.
> 
> Nested callers that need to access node or cgroup properties can look
> them up if necessary, but there are only a few cases.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>
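
For reference, the net effect on the entry point (signatures only, paraphrased
from the hunks below):

	/* before: caller passes the node and memcg separately */
	static void shrink_node_memcg(struct pglist_data *pgdat,
				      struct mem_cgroup *memcg,
				      struct scan_control *sc);

	/* after: the lruvec already encodes the node + memcg combination */
	static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc);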

> ---
>  mm/vmscan.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 235d1fc72311..db073b40c432 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2280,9 +2280,10 @@ enum scan_balance {
>   * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
>   * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>   */
> -static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> -			   struct scan_control *sc, unsigned long *nr)
> +static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> +			   unsigned long *nr)
>  {
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  	int swappiness = mem_cgroup_swappiness(memcg);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>  	u64 fraction[2];
> @@ -2530,13 +2531,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	}
>  }
>  
> -/*
> - * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
> - */
> -static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memcg,
> -			      struct scan_control *sc)
> +static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	unsigned long nr[NR_LRU_LISTS];
>  	unsigned long targets[NR_LRU_LISTS];
>  	unsigned long nr_to_scan;
> @@ -2546,7 +2542,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	struct blk_plug plug;
>  	bool scan_adjusted;
>  
> -	get_scan_count(lruvec, memcg, sc, nr);
> +	get_scan_count(lruvec, sc, nr);
>  
>  	/* Record the original scan target for proportional adjustments later */
>  	memcpy(targets, nr, sizeof(nr));
> @@ -2741,6 +2737,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  	memcg = mem_cgroup_iter(root, NULL, NULL);
>  	do {
> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>  
> @@ -2777,7 +2774,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  
>  		reclaimed = sc->nr_reclaimed;
>  		scanned = sc->nr_scanned;
> -		shrink_node_memcg(pgdat, memcg, sc);
> +
> +		shrink_lruvec(lruvec, sc);
>  
>  		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
>  			    sc->priority);
> @@ -3281,6 +3279,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  						pg_data_t *pgdat,
>  						unsigned long *nr_scanned)
>  {
> +	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  	struct scan_control sc = {
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
>  		.target_mem_cgroup = memcg,
> @@ -3307,7 +3306,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
>  	 * will pick up pages from other mem cgroup's as well. We hack
>  	 * the priority and make it zero.
>  	 */
> -	shrink_node_memcg(pgdat, memcg, &sc);
> +	shrink_lruvec(lruvec, &sc);
>  
>  	trace_mm_vmscan_memcg_softlimit_reclaim_end(
>  					cgroup_ino(memcg->css.cgroup),
> -- 
> 2.23.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part
  2019-10-22 14:48 ` [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part Johannes Weiner
  2019-10-22 20:08   ` Roman Gushchin
@ 2019-10-23 14:24   ` Michal Hocko
  1 sibling, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-10-23 14:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Tue 22-10-19 10:48:02, Johannes Weiner wrote:
> This function is getting long and unwieldy, split out the memcg bits.
> 
> The updated shrink_node() handles the generic (node) reclaim aspects:
>   - global vmpressure notifications
>   - writeback and congestion throttling
>   - reclaim/compaction management
>   - kswapd giving up on unreclaimable nodes
> 
> It then calls a new shrink_node_memcgs() which handles cgroup specifics:
>   - the cgroup tree traversal
>   - memory.low considerations
>   - per-cgroup slab shrinking callbacks
>   - per-cgroup vmpressure notifications
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 28 ++++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index db073b40c432..65baa89740dd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2722,18 +2722,10 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
>  		(memcg && memcg_congested(pgdat, memcg));
>  }
>  
> -static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
> -	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	struct mem_cgroup *root = sc->target_mem_cgroup;
> -	unsigned long nr_reclaimed, nr_scanned;
> -	bool reclaimable = false;
>  	struct mem_cgroup *memcg;
> -again:
> -	memset(&sc->nr, 0, sizeof(sc->nr));
> -
> -	nr_reclaimed = sc->nr_reclaimed;
> -	nr_scanned = sc->nr_scanned;
>  
>  	memcg = mem_cgroup_iter(root, NULL, NULL);
>  	do {
> @@ -2786,6 +2778,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			   sc->nr_reclaimed - reclaimed);
>  
>  	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> +}
> +
> +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> +{
> +	struct reclaim_state *reclaim_state = current->reclaim_state;
> +	struct mem_cgroup *root = sc->target_mem_cgroup;
> +	unsigned long nr_reclaimed, nr_scanned;
> +	bool reclaimable = false;
> +
> +again:
> +	memset(&sc->nr, 0, sizeof(sc->nr));
> +
> +	nr_reclaimed = sc->nr_reclaimed;
> +	nr_scanned = sc->nr_scanned;
> +
> +	shrink_node_memcgs(pgdat, sc);
>  
>  	if (reclaim_state) {
>  		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> @@ -2793,7 +2801,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	}
>  
>  	/* Record the subtree's reclaim efficiency */
> -	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> +	vmpressure(sc->gfp_mask, root, true,
>  		   sc->nr_scanned - nr_scanned,
>  		   sc->nr_reclaimed - nr_reclaimed);
>  
> -- 
> 2.23.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
  2019-10-23 14:14   ` Michal Hocko
@ 2019-10-23 15:56     ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-23 15:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Wed, Oct 23, 2019 at 04:14:36PM +0200, Michal Hocko wrote:
> On Tue 22-10-19 10:47:59, Johannes Weiner wrote:
> > Seven years after introducing the global_reclaim() function, I still
> > have to double take when reading a callsite. I don't know how others
> > do it, this is a terrible name.
> 
> I somehow never had problem with that but ...
> > 
> > Invert the meaning and rename it to cgroup_reclaim().
> > 
> > [ After all, "global reclaim" is just regular reclaim invoked from the
> >   page allocator. It's reclaim on behalf of a cgroup limit that is a
> >   special case of reclaim, and should be explicit - not the reverse. ]
> 
> ... this is a valid point.
> 
> > sane_reclaim() isn't very descriptive either: it tests whether we can
> > use the regular writeback throttling - available during regular page
> > reclaim or cgroup2 limit reclaim - or need to use the broken
> > wait_on_page_writeback() method. Use "writeback_throttling_sane()".
> 
> I do have a stronger opinion on this one. sane_reclaim is really a
> terrible name. As you say the only thing this should really tell is
> whether writeback throttling is available so I would rather go with
> has_writeback_throttling() or writeback_throttling_{enabled,available}.
> If you insist on having sane in the name then I won't object but it just
> raises a question whether we have some levels of throttling with a
> different level of sanity.

I mean, cgroup1 *does* have a method to not OOM due to pages under
writeback: wait_on_page_writeback() on each wb page on the LRU.

It's terrible, but it's a form of writeback throttling. That's what
the sane vs insane distinction is about, I guess: we do in fact have
throttling implementations with different levels of sanity.
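
For readers without the code in front of them, the distinction boils down to
something like this (heavily condensed sketch of the existing
shrink_page_list() behavior, using the new name from this patch; it is not
something this series changes, and the real code has a third case for kswapd
plus additional PageReclaim/may_enter_fs checks before stalling):

	if (PageWriteback(page)) {
		if (writeback_throttling_sane(sc)) {
			/* cgroup2 / global: balance_dirty_pages() already
			 * throttles the dirtying task, so just note the
			 * page and move on */
			SetPageReclaim(page);
			stat->nr_writeback++;
		} else {
			/* cgroup1 limit reclaim: no dirty throttling at
			 * the limit, so stall on each writeback page to
			 * avoid declaring OOM prematurely */
			unlock_page(page);
			wait_on_page_writeback(page);
		}
	}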

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()
  2019-10-22 19:40   ` Roman Gushchin
@ 2019-10-23 16:02     ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-23 16:02 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 07:40:52PM +0000, Roman Gushchin wrote:
> On Tue, Oct 22, 2019 at 10:47:59AM -0400, Johannes Weiner wrote:
> > Seven years after introducing the global_reclaim() function, I still
> > have to double take when reading a callsite. I don't know how others
> > do it, this is a terrible name.
> > 
> > Invert the meaning and rename it to cgroup_reclaim().
> > 
> > [ After all, "global reclaim" is just regular reclaim invoked from the
> >   page allocator. It's reclaim on behalf of a cgroup limit that is a
> >   special case of reclaim, and should be explicit - not the reverse. ]
> > 
> > sane_reclaim() isn't very descriptive either: it tests whether we can
> > use the regular writeback throttling - available during regular page
> > reclaim or cgroup2 limit reclaim - or need to use the broken
> > wait_on_page_writeback() method. Use "writeback_throttling_sane()".
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/vmscan.c | 38 ++++++++++++++++++--------------------
> >  1 file changed, 18 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 622b77488144..302dad112f75 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -239,13 +239,13 @@ static void unregister_memcg_shrinker(struct shrinker *shrinker)
> >  	up_write(&shrinker_rwsem);
> >  }
> >  
> > -static bool global_reclaim(struct scan_control *sc)
> > +static bool cgroup_reclaim(struct scan_control *sc)
> >  {
> > -	return !sc->target_mem_cgroup;
> > +	return sc->target_mem_cgroup;
> >  }
> 
> Isn't targeted_reclaim() better?
> 
> cgroup_reclaim() is also OK with me, but it sounds a bit like we reclaim
> from this specific cgroup. Also, targeted/global is IMO a better opposition
> than cgroup/global (the latter reminds me of the days when there were global
> and cgroup LRUs).

I think "targeted" is quite a bit less descriptive when you come at
the page replacement algorithm without cgroups in mind.

> The rest of the patch looks good!
> 
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump
  2019-10-23 14:18   ` Michal Hocko
@ 2019-10-25 13:44     ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-25 13:44 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, cgroups, linux-kernel, kernel-team

On Wed, Oct 23, 2019 at 04:18:57PM +0200, Michal Hocko wrote:
> On Tue 22-10-19 10:48:00, Johannes Weiner wrote:
> > Most of the function body is inside a loop, which imposes an
> > additional indentation and scoping level that makes the code a bit
> > hard to follow and modify.
> 
> I do agree!
> 
> > The looping only happens in case of reclaim-compaction, which isn't
> > the common case. So rather than adding yet another function level to
> > the reclaim path and having every reclaim invocation go through a level
> > that only exists for one specific corner case, use a retry goto.
> 
> I would just keep the core logic in its own function and do the loop
> around it rather than a goto retry. This is certainly a matter of taste
> but I like a loop with an explicit condition much more than an if with
> goto.

Yeah, as the changelog says, I'm intentionally putting the looping
construct into the "cold path" of the code flow: we only loop in a
very specific corner case, and having the whole body in a loop, or
creating another function nesting level for it, suggests otherwise.

A goto seems like the perfect tool to have a retry for one particular
caller without muddying the code flow for the common call stack.

Matter of taste, I guess.
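
Just to spell out the two shapes being debated (sketch only;
shrink_node_once() is a hypothetical helper name, not something in the
series):

	/* shape in this patch: retry label inside shrink_node() itself */
	again:
		memset(&sc->nr, 0, sizeof(sc->nr));
		...
		if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, sc))
			goto again;

	/* alternative shape: explicit loop around a split-out helper */
	do {
		nr_reclaimed = shrink_node_once(pgdat, sc);	/* hypothetical */
	} while (should_continue_reclaim(pgdat, nr_reclaimed, sc));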

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part
  2019-10-22 20:08   ` Roman Gushchin
@ 2019-10-25 14:36     ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-25 14:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 08:08:23PM +0000, Roman Gushchin wrote:
> On Tue, Oct 22, 2019 at 10:48:02AM -0400, Johannes Weiner wrote:
> > This function is getting long and unwieldy, split out the memcg bits.
> > 
> > The updated shrink_node() handles the generic (node) reclaim aspects:
> >   - global vmpressure notifications
> >   - writeback and congestion throttling
> >   - reclaim/compaction management
> >   - kswapd giving up on unreclaimable nodes
> > 
> > It then calls a new shrink_node_memcgs() which handles cgroup specifics:
> >   - the cgroup tree traversal
> >   - memory.low considerations
> >   - per-cgroup slab shrinking callbacks
> >   - per-cgroup vmpressure notifications
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/vmscan.c | 28 ++++++++++++++++++----------
> >  1 file changed, 18 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index db073b40c432..65baa89740dd 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2722,18 +2722,10 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> >  		(memcg && memcg_congested(pgdat, memcg));
> >  }
> >  
> > -static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > +static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  {
> > -	struct reclaim_state *reclaim_state = current->reclaim_state;
> >  	struct mem_cgroup *root = sc->target_mem_cgroup;
> > -	unsigned long nr_reclaimed, nr_scanned;
> > -	bool reclaimable = false;
> >  	struct mem_cgroup *memcg;
> > -again:
> > -	memset(&sc->nr, 0, sizeof(sc->nr));
> > -
> > -	nr_reclaimed = sc->nr_reclaimed;
> > -	nr_scanned = sc->nr_scanned;
> >  
> >  	memcg = mem_cgroup_iter(root, NULL, NULL);
> >  	do {
> > @@ -2786,6 +2778,22 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			   sc->nr_reclaimed - reclaimed);
> >  
> >  	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
> > +}
> > +
> > +static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > +{
> > +	struct reclaim_state *reclaim_state = current->reclaim_state;
> > +	struct mem_cgroup *root = sc->target_mem_cgroup;
> > +	unsigned long nr_reclaimed, nr_scanned;
> > +	bool reclaimable = false;
> > +
> > +again:
> > +	memset(&sc->nr, 0, sizeof(sc->nr));
> > +
> > +	nr_reclaimed = sc->nr_reclaimed;
> > +	nr_scanned = sc->nr_scanned;
> > +
> > +	shrink_node_memcgs(pgdat, sc);
> >  
> >  	if (reclaim_state) {
> >  		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> > @@ -2793,7 +2801,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	}
> >  
> >  	/* Record the subtree's reclaim efficiency */
> > -	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> > +	vmpressure(sc->gfp_mask, root, true,
> 
> Maybe target? Or target_memcg? The word root is associated with the root cgroup.
>
> Other than root the patch looks good to me:
> 
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!

I agree, target_memcg is better than root. The next patch also
replaces some of these with target_lruvec.

This on top?

From f981c99d3a9da05513c5137873315974782e97ec Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 25 Oct 2019 10:28:42 -0400
Subject: [PATCH] mm: vmscan: split shrink_node() into node part and
 memcgs part fix

As per Roman's suggestion, rename "root" to "target_memcg" to avoid
confusion with the global cgroup root, root_mem_cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65baa89740dd..6199692af434 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2724,16 +2724,16 @@ static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
 
 static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 {
-	struct mem_cgroup *root = sc->target_mem_cgroup;
+	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
 	struct mem_cgroup *memcg;
 
-	memcg = mem_cgroup_iter(root, NULL, NULL);
+	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 		unsigned long reclaimed;
 		unsigned long scanned;
 
-		switch (mem_cgroup_protected(root, memcg)) {
+		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
 			/*
 			 * Hard protection.
@@ -2777,13 +2777,13 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			   sc->nr_scanned - scanned,
 			   sc->nr_reclaimed - reclaimed);
 
-	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
 }
 
 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	struct mem_cgroup *root = sc->target_mem_cgroup;
+	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
 
@@ -2801,7 +2801,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/* Record the subtree's reclaim efficiency */
-	vmpressure(sc->gfp_mask, root, true,
+	vmpressure(sc->gfp_mask, target_memcg, true,
 		   sc->nr_scanned - nr_scanned,
 		   sc->nr_reclaimed - nr_reclaimed);
 
@@ -2857,7 +2857,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 */
 	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
 	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-		set_memcg_congestion(pgdat, root, true);
+		set_memcg_congestion(pgdat, target_memcg, true);
 
 	/*
 	 * Stall direct reclaim for IO completions if underlying BDIs
@@ -2866,7 +2866,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * the LRU too quickly.
 	 */
 	if (!sc->hibernation_mode && !current_is_kswapd() &&
-	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
+	    current_may_throttle() &&
+	    pgdat_memcg_congested(pgdat, target_memcg))
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs
  2019-10-22 21:03   ` Roman Gushchin
@ 2019-10-25 14:41     ` Johannes Weiner
  0 siblings, 0 replies; 32+ messages in thread
From: Johannes Weiner @ 2019-10-25 14:41 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Michal Hocko, linux-mm, cgroups, linux-kernel,
	Kernel Team

On Tue, Oct 22, 2019 at 09:03:57PM +0000, Roman Gushchin wrote:
> On Tue, Oct 22, 2019 at 10:48:03AM -0400, Johannes Weiner wrote:
> > The current writeback congestion tracking has separate flags for
> > kswapd reclaim (node level) and cgroup limit reclaim (memcg-node
> > level). This is unnecessarily complicated: the lruvec is an existing
> > abstraction layer for that node-memcg intersection.
> > 
> > Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
> > reclaim root level, which is either the NUMA node for global reclaim,
> > or the cgroup-node intersection for cgroup reclaim.
> 
> Good idea!
> 
> Reviewed-by: Roman Gushchin <guro@fb.com>

Thank you.

The suggested cleanup fixlet for patch 7 causes conflicts in this
one. Here is the rebased version:

---

From 253729a7639780508b166e547d13fa7894987a87 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 11 Mar 2019 15:43:44 -0400
Subject: [PATCH] mm: vmscan: harmonize writeback congestion tracking for
 nodes & memcgs

The current writeback congestion tracking has separate flags for
kswapd reclaim (node level) and cgroup limit reclaim (memcg-node
level). This is unnecessarily complicated: the lruvec is an existing
abstraction layer for that node-memcg intersection.

Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the
reclaim root level, which is either the NUMA node for global reclaim,
or the cgroup-node intersection for cgroup reclaim.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  6 +--
 include/linux/mmzone.h     | 11 +++--
 mm/vmscan.c                | 84 ++++++++++++--------------------------
 3 files changed, 37 insertions(+), 64 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 498cea07cbb1..d8ffcf60440c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -133,9 +133,6 @@ struct mem_cgroup_per_node {
 	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
 	bool			on_tree;
-	bool			congested;	/* memcg has many dirty pages */
-						/* backed by a congested BDI */
-
 	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
 						/* use container_of	   */
 };
@@ -412,6 +409,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
 		goto out;
 	}
 
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
 	mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
 	lruvec = &mz->lruvec;
 out:
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 449a44171026..c04b4c1f01fa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -296,6 +296,12 @@ struct zone_reclaim_stat {
 	unsigned long		recent_scanned[2];
 };
 
+enum lruvec_flags {
+	LRUVEC_CONGESTED,		/* lruvec has many dirty pages
+					 * backed by a congested BDI
+					 */
+};
+
 struct lruvec {
 	struct list_head		lists[NR_LRU_LISTS];
 	struct zone_reclaim_stat	reclaim_stat;
@@ -303,6 +309,8 @@ struct lruvec {
 	atomic_long_t			inactive_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults;
+	/* Various lruvec state flags (enum lruvec_flags) */
+	unsigned long			flags;
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
@@ -572,9 +580,6 @@ struct zone {
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
-	PGDAT_CONGESTED,		/* pgdat has many dirty pages backed by
-					 * a congested BDI
-					 */
 	PGDAT_DIRTY,			/* reclaim scanning has recently found
 					 * many dirty file pages at the tail
 					 * of the LRU.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6199692af434..47b5c5548f3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -267,29 +267,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
-
-static void set_memcg_congestion(pg_data_t *pgdat,
-				struct mem_cgroup *memcg,
-				bool congested)
-{
-	struct mem_cgroup_per_node *mn;
-
-	if (!memcg)
-		return;
-
-	mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
-	WRITE_ONCE(mn->congested, congested);
-}
-
-static bool memcg_congested(pg_data_t *pgdat,
-			struct mem_cgroup *memcg)
-{
-	struct mem_cgroup_per_node *mn;
-
-	mn = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
-	return READ_ONCE(mn->congested);
-
-}
 #else
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
@@ -309,18 +286,6 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
-
-static inline void set_memcg_congestion(struct pglist_data *pgdat,
-				struct mem_cgroup *memcg, bool congested)
-{
-}
-
-static inline bool memcg_congested(struct pglist_data *pgdat,
-			struct mem_cgroup *memcg)
-{
-	return false;
-
-}
 #endif
 
 /*
@@ -2716,12 +2681,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	return inactive_lru_pages > pages_for_compaction;
 }
 
-static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
-{
-	return test_bit(PGDAT_CONGESTED, &pgdat->flags) ||
-		(memcg && memcg_congested(pgdat, memcg));
-}
-
 static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
@@ -2783,10 +2742,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
-	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
 	unsigned long nr_reclaimed, nr_scanned;
+	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 
+	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+
 again:
 	memset(&sc->nr, 0, sizeof(sc->nr));
 
@@ -2801,7 +2762,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/* Record the subtree's reclaim efficiency */
-	vmpressure(sc->gfp_mask, target_memcg, true,
+	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
 		   sc->nr_scanned - nr_scanned,
 		   sc->nr_reclaimed - nr_reclaimed);
 
@@ -2829,14 +2790,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
 			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
 
-		/*
-		 * Tag a node as congested if all the dirty pages
-		 * scanned were backed by a congested BDI and
-		 * wait_iff_congested will stall.
-		 */
-		if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-			set_bit(PGDAT_CONGESTED, &pgdat->flags);
-
 		/* Allow kswapd to start writing pages during reclaim.*/
 		if (sc->nr.unqueued_dirty == sc->nr.file_taken)
 			set_bit(PGDAT_DIRTY, &pgdat->flags);
@@ -2852,12 +2805,17 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/*
+	 * Tag a node/memcg as congested if all the dirty pages
+	 * scanned were backed by a congested BDI and
+	 * wait_iff_congested will stall.
+	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
 	 * stalling in wait_iff_congested().
 	 */
-	if (cgroup_reclaim(sc) && writeback_throttling_sane(sc) &&
+	if ((current_is_kswapd() ||
+	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
 	    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
-		set_memcg_congestion(pgdat, target_memcg, true);
+		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
 
 	/*
 	 * Stall direct reclaim for IO completions if underlying BDIs
@@ -2865,9 +2823,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	 * starts encountering unqueued dirty pages or cycling through
 	 * the LRU too quickly.
 	 */
-	if (!sc->hibernation_mode && !current_is_kswapd() &&
-	    current_may_throttle() &&
-	    pgdat_memcg_congested(pgdat, target_memcg))
+	if (!current_is_kswapd() && current_may_throttle() &&
+	    !sc->hibernation_mode &&
+	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
@@ -3081,8 +3039,16 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
+
 		snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
-		set_memcg_congestion(last_pgdat, sc->target_mem_cgroup, false);
+
+		if (cgroup_reclaim(sc)) {
+			struct lruvec *lruvec;
+
+			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
+						   zone->zone_pgdat);
+			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
+		}
 	}
 
 	delayacct_freepages_end();
@@ -3462,7 +3428,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 /* Clear pgdat state for congested, dirty or under writeback. */
 static void clear_pgdat_congested(pg_data_t *pgdat)
 {
-	clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+	struct lruvec *lruvec = mem_cgroup_lruvec(NULL, pgdat);
+
+	clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
 	clear_bit(PGDAT_DIRTY, &pgdat->flags);
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
-- 
2.23.0
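
Not part of the patch, just an illustration for anyone skimming the diff:
below is a small, self-contained userspace sketch of the flag handling the
patch switches to. The struct pgdat / struct memcg / lookup_lruvec()
definitions are simplified stand-ins, not the kernel's types or
mem_cgroup_lruvec(), and plain bit operations stand in for
set_bit()/test_bit()/clear_bit(); only the set/test/clear flow on the
reclaim-root lruvec is meant to carry over.

/*
 * Illustrative userspace model only -- not the kernel code.  One flags
 * word per lruvec, with LRUVEC_CONGESTED set on the reclaim root: the
 * node's lruvec for global reclaim, the memcg-node lruvec for cgroup
 * reclaim.
 */
#include <stdbool.h>
#include <stdio.h>

enum lruvec_flags {
	LRUVEC_CONGESTED,	/* bit index, as in the patch */
};

struct lruvec {
	unsigned long flags;	/* stands in for lruvec->flags */
};

struct pgdat {
	struct lruvec root_lruvec;	/* node-level lruvec (memcg == NULL) */
};

struct memcg {
	struct lruvec node_lruvec;	/* this memcg's lruvec on one node */
};

/* Rough analogue of mem_cgroup_lruvec(): NULL memcg means the node root. */
static struct lruvec *lookup_lruvec(struct memcg *memcg, struct pgdat *pgdat)
{
	return memcg ? &memcg->node_lruvec : &pgdat->root_lruvec;
}

/* Plain bit ops in place of set_bit()/test_bit()/clear_bit(). */
static void set_congested(struct lruvec *lruvec)
{
	lruvec->flags |= 1UL << LRUVEC_CONGESTED;
}

static bool is_congested(struct lruvec *lruvec)
{
	return lruvec->flags & (1UL << LRUVEC_CONGESTED);
}

static void clear_congested(struct lruvec *lruvec)
{
	lruvec->flags &= ~(1UL << LRUVEC_CONGESTED);
}

int main(void)
{
	struct pgdat node = { { 0 } };
	struct memcg cg = { { 0 } };

	/* Global (kswapd) reclaim tags the node-level lruvec ... */
	set_congested(lookup_lruvec(NULL, &node));
	/* ... cgroup limit reclaim tags its own memcg-node lruvec. */
	set_congested(lookup_lruvec(&cg, &node));

	printf("node congested:  %d\n", is_congested(&node.root_lruvec));
	printf("memcg congested: %d\n", is_congested(&cg.node_lruvec));

	/* Recovery clears the flag on the same reclaim-root lruvec. */
	clear_congested(lookup_lruvec(NULL, &node));
	clear_congested(lookup_lruvec(&cg, &node));

	return 0;
}

Keeping the flag on the lruvec means the same tag, test and clear sites
serve both global and cgroup reclaim, which is what lets the patch drop
set_memcg_congestion()/memcg_congested() and the PGDAT_CONGESTED bit.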



^ permalink raw reply related	[flat|nested] 32+ messages in thread


Thread overview: 32+ messages
2019-10-22 14:47 [PATCH 0/8]: mm: vmscan: cgroup-related cleanups Johannes Weiner
2019-10-22 14:47 ` [PATCH 1/8] mm: vmscan: simplify lruvec_lru_size() Johannes Weiner
2019-10-22 19:18   ` Roman Gushchin
2019-10-23 13:48   ` Michal Hocko
2019-10-22 14:47 ` [PATCH 2/8] mm: clean up and clarify lruvec lookup procedure Johannes Weiner
2019-10-22 19:25   ` Roman Gushchin
2019-10-22 21:31     ` Johannes Weiner
2019-10-23 14:00   ` Michal Hocko
2019-10-22 14:47 ` [PATCH 3/8] mm: vmscan: move inactive_list_is_low() swap check to the caller Johannes Weiner
2019-10-22 19:28   ` Roman Gushchin
2019-10-23 14:06   ` Michal Hocko
2019-10-22 14:47 ` [PATCH 4/8] mm: vmscan: naming fixes: global_reclaim() and sane_reclaim() Johannes Weiner
2019-10-22 19:40   ` Roman Gushchin
2019-10-23 16:02     ` Johannes Weiner
2019-10-23 14:14   ` Michal Hocko
2019-10-23 15:56     ` Johannes Weiner
2019-10-22 14:48 ` [PATCH 5/8] mm: vmscan: replace shrink_node() loop with a retry jump Johannes Weiner
2019-10-22 19:56   ` Roman Gushchin
2019-10-22 21:42     ` Johannes Weiner
2019-10-22 22:46       ` Roman Gushchin
2019-10-23 14:18   ` Michal Hocko
2019-10-25 13:44     ` Johannes Weiner
2019-10-22 14:48 ` [PATCH 6/8] mm: vmscan: turn shrink_node_memcg() into shrink_lruvec() Johannes Weiner
2019-10-22 20:04   ` Roman Gushchin
2019-10-23 14:21   ` Michal Hocko
2019-10-22 14:48 ` [PATCH 7/8] mm: vmscan: split shrink_node() into node part and memcgs part Johannes Weiner
2019-10-22 20:08   ` Roman Gushchin
2019-10-25 14:36     ` Johannes Weiner
2019-10-23 14:24   ` Michal Hocko
2019-10-22 14:48 ` [PATCH 8/8] mm: vmscan: harmonize writeback congestion tracking for nodes & memcgs Johannes Weiner
2019-10-22 21:03   ` Roman Gushchin
2019-10-25 14:41     ` Johannes Weiner
