* [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-02 15:56 UTC
  To: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo
  Cc: linux-mm, cgroups, linux-kernel, Christian Brauner

Christian reports the following situation in a cgroup that doesn't
have memory.swap.max configured:

  $ cat memory.swap.events
  high 0
  max 0
  fail 6218

Upon closer examination, this is an ARM64 machine that doesn't support
swapping out THPs. In that case, the first get_swap_page() fails, and
the kernel falls back to splitting the THP and swapping the 4k
constituents one by one. /proc/vmstat confirms this with a high rate
of thp_swpout_fallback events.
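
For reference, a condensed sketch of that fallback path, modeled on
the reclaim code in mm/vmscan.c (the helper name and the trimmed
locking/error handling are mine, not the kernel's):

static bool try_swap_out(struct folio *folio, struct list_head *folio_list)
{
        /* Try to allocate swap for the folio (a 2M cluster for a THP). */
        if (add_to_swap(folio))
                return true;
        if (!folio_test_large(folio))
                return false;   /* genuine failure for a 4k page */
        /* Fall back: split the THP, queueing its tail pages for reclaim. */
        if (split_folio_to_list(folio, folio_list))
                return false;   /* split failed, keep the folio */
        count_vm_event(THP_SWPOUT_FALLBACK);    /* what /proc/vmstat shows */
        return add_to_swap(folio);      /* retry with the now-small folio */
}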

While the behavior can ultimately be explained, it's unexpected and
confusing. I see three choices for how to address this:

a) Specifically exclude THP fallbacks from being counted, as the
   failure is transient and the memory is ultimately swapped.

   Arguably, though, the user would like to know if their cgroup's
   swap limit is causing high rates of THP splitting during swapout.

b) Only count cgroup swap events when they are actually due to a
   cgroup's own limit. Exclude failures that are due to physical swap
   shortage or other system-level conditions (like !THP_SWAP). Also
   count them at the level where the limit is configured, which may be
   above the local cgroup that holds the page-to-be-swapped.

   This is in line with how memory.swap.high, memory.high and
   memory.max events are counted.

   However, it's a change in documented behavior.

c) Leave it as is. The documentation says system-level events are
   counted, so stick to that.

   This is the conservative option, but isn't very user friendly.
   Cgroup events are usually due to a local control choice made by the
   user. Mixing in events that are beyond the user's control makes it
   difficult to identify root causes and configure the system properly.

Implement option b).
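
For context, the mechanism this relies on, condensed from
mm/page_counter.c with the watermark and protection bookkeeping
trimmed: on failure, page_counter_try_charge() reports the ancestor
counter whose limit was hit through @fail, and mem_cgroup_from_counter()
(a container_of() wrapper in mm/memcontrol.c) maps that counter back
to its memcg, so the event can be counted at the level where the limit
is actually configured:

bool page_counter_try_charge(struct page_counter *counter,
                             unsigned long nr_pages,
                             struct page_counter **fail)
{
        struct page_counter *c;

        for (c = counter; c; c = c->parent) {
                long new = atomic_long_add_return(nr_pages, &c->usage);

                if (new > c->max) {
                        atomic_long_sub(nr_pages, &c->usage);
                        *fail = c;      /* the counter whose limit was hit */
                        goto failed;
                }
        }
        return true;

failed:
        /* Unwind the partial charge below the level that failed. */
        for (c = counter; c != *fail; c = c->parent)
                page_counter_cancel(c, nr_pages);
        return false;
}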

Reported-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  6 +++---
 mm/memcontrol.c                         | 12 +++++-------
 mm/swap_slots.c                         |  2 +-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..a8ffb89a4169 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1605,9 +1605,9 @@ PAGE_SIZE multiple when read back.
 		failed.
 
 	  fail
-		The number of times swap allocation failed either
-		because of running out of swap system-wide or max
-		limit.
+
+		The number of times swap allocation failed because of
+		the max limit.
 
 	When reduced under the current usage, the existing swap
 	entries are reclaimed gradually and the swap usage may stay
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..c2a6206ce84b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7470,17 +7470,15 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	if (!memcg)
 		return 0;
 
-	if (!entry.val) {
-		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
-		return 0;
-	}
-
 	memcg = mem_cgroup_id_get_online(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
-		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
-		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+		struct mem_cgroup *swap_over_limit;
+
+		swap_over_limit = mem_cgroup_from_counter(counter, swap);
+		memcg_memory_event(swap_over_limit, MEMCG_SWAP_MAX);
+		memcg_memory_event(swap_over_limit, MEMCG_SWAP_FAIL);
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..66076bd60e2b 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -342,7 +342,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 
 	get_swap_pages(1, &entry, 1);
 out:
-	if (mem_cgroup_try_charge_swap(folio, entry)) {
+	if (entry.val && mem_cgroup_try_charge_swap(folio, entry) < 0) {
 		put_swap_folio(folio, entry);
 		entry.val = 0;
 	}
-- 
2.39.1


* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Shakeel Butt @ 2023-02-02 18:27 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Christian Brauner

On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Christian reports the following situation in a cgroup that doesn't
> have memory.swap.max configured:
>
>   $ cat memory.swap.events
>   high 0
>   max 0
>   fail 6218
>
> Upon closer examination, this is an ARM64 machine that doesn't support
> swapping out THPs. In that case, the first get_swap_page() fails, and
> the kernel falls back to splitting the THP and swapping the 4k
> constituents one by one. /proc/vmstat confirms this with a high rate
> of thp_swpout_fallback events.
>
> While the behavior can ultimately be explained, it's unexpected and
> confusing. I see three choices for how to address this:
>
> a) Specifically exclude THP fallbacks from being counted, as the
>    failure is transient and the memory is ultimately swapped.
>
>    Arguably, though, the user would like to know if their cgroup's
>    swap limit is causing high rates of THP splitting during swapout.
>
> b) Only count cgroup swap events when they are actually due to a
>    cgroup's own limit. Exclude failures that are due to physical swap
>    shortage or other system-level conditions (like !THP_SWAP). Also
>    count them at the level where the limit is configured, which may be
>    above the local cgroup that holds the page-to-be-swapped.
>
>    This is in line with how memory.swap.high, memory.high and
>    memory.max events are counted.
>
>    However, it's a change in documented behavior.
>
> c) Leave it as is. The documentation says system-level events are
>    counted, so stick to that.
>
>    This is the conservative option, but isn't very user friendly.
>    Cgroup events are usually due to a local control choice made by the
>    user. Mixing in events that are beyond the user's control makes it
>    difficult to identify root causes and configure the system properly.
>
> Implement option b).

I prefer option b too.

>
> Reported-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

I think we should CC stable as well for early exposure.

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Yosry Ahmed @ 2023-02-02 18:30 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Christian reports the following situation in a cgroup that doesn't
> have memory.swap.max configured:
>
>   $ cat memory.swap.events
>   high 0
>   max 0
>   fail 6218
>
> Upon closer examination, this is an ARM64 machine that doesn't support
> swapping out THPs. In that case, the first get_swap_page() fails, and
> the kernel falls back to splitting the THP and swapping the 4k
> constituents one by one. /proc/vmstat confirms this with a high rate
> of thp_swpout_fallback events.
>
> While the behavior can ultimately be explained, it's unexpected and
> confusing. I see three choices for how to address this:
>
> a) Specifically exclude THP fallbacks from being counted, as the
>    failure is transient and the memory is ultimately swapped.
>
>    Arguably, though, the user would like to know if their cgroup's
>    swap limit is causing high rates of THP splitting during swapout.

We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
completeness) to memcg events for this if/when a use case arises,
right?

>
> b) Only count cgroup swap events when they are actually due to a
>    cgroup's own limit. Exclude failures that are due to physical swap
>    shortage or other system-level conditions (like !THP_SWAP). Also
>    count them at the level where the limit is configured, which may be
>    above the local cgroup that holds the page-to-be-swapped.
>
>    This is in line with how memory.swap.high, memory.high and
>    memory.max events are counted.
>
>    However, it's a change in documented behavior.

This option makes sense to me, but I can't speak to the change of
documented behavior. However, looking at the code, it seems like if we do this
the "max" & "fail" counters become effectively the same. "fail" would
not provide much value then.

I wonder if it makes sense to have both, and clarify that "fail" -
"max" would be non-limit based failures (e.g. ran out of swap space),
or would this cause confusion as to whether those non-limit failures
were transient (THP fallback) or eventual?


>
> c) Leave it as is. The documentation says system-level events are
>    counted, so stick to that.
>
>    This is the conservative option, but isn't very user friendly.
>    Cgroup events are usually due to a local control choice made by the
>    user. Mixing in events that are beyond the user's control makes it
>    difficult to identify root causes and configure the system properly.
>
> Implement option b).
>
> Reported-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  6 +++---
>  mm/memcontrol.c                         | 12 +++++-------
>  mm/swap_slots.c                         |  2 +-
>  3 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index c8ae7c897f14..a8ffb89a4169 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1605,9 +1605,9 @@ PAGE_SIZE multiple when read back.
>                 failed.
>
>           fail
> -               The number of times swap allocation failed either
> -               because of running out of swap system-wide or max
> -               limit.
> +
> +               The number of times swap allocation failed because of
> +               the max limit.
>
>         When reduced under the current usage, the existing swap
>         entries are reclaimed gradually and the swap usage may stay
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..c2a6206ce84b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7470,17 +7470,15 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
>         if (!memcg)
>                 return 0;
>
> -       if (!entry.val) {
> -               memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
> -               return 0;
> -       }
> -
>         memcg = mem_cgroup_id_get_online(memcg);
>
>         if (!mem_cgroup_is_root(memcg) &&
>             !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
> -               memcg_memory_event(memcg, MEMCG_SWAP_MAX);
> -               memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
> +               struct mem_cgroup *swap_over_limit;
> +
> +               swap_over_limit = mem_cgroup_from_counter(counter, swap);
> +               memcg_memory_event(swap_over_limit, MEMCG_SWAP_MAX);
> +               memcg_memory_event(swap_over_limit, MEMCG_SWAP_FAIL);
>                 mem_cgroup_id_put(memcg);
>                 return -ENOMEM;
>         }
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..66076bd60e2b 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -342,7 +342,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>
>         get_swap_pages(1, &entry, 1);
>  out:
> -       if (mem_cgroup_try_charge_swap(folio, entry)) {
> +       if (entry.val && mem_cgroup_try_charge_swap(folio, entry) < 0) {
>                 put_swap_folio(folio, entry);
>                 entry.val = 0;
>         }
> --
> 2.39.1
>
>

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Roman Gushchin @ 2023-02-03 19:00 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Christian Brauner

On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> Christian reports the following situation in a cgroup that doesn't
> have memory.swap.max configured:
> 
>   $ cat memory.swap.events
>   high 0
>   max 0
>   fail 6218
> 
> Upon closer examination, this is an ARM64 machine that doesn't support
> swapping out THPs.

Do we expect it to be added any time soon, or is it caused by some
system limitation?

> In that case, the first get_swap_page() fails, and
> the kernel falls back to splitting the THP and swapping the 4k
> constituents one by one. /proc/vmstat confirms this with a high rate
> of thp_swpout_fallback events.
> 
> While the behavior can ultimately be explained, it's unexpected and
> confusing. I see three choices for how to address this:
>
> a) Specifically exclude THP fallbacks from being counted, as the
>    failure is transient and the memory is ultimately swapped.
> 
>    Arguably, though, the user would like to know if their cgroup's
>    swap limit is causing high rates of THP splitting during swapout.

I agree, but it's probably better to reflect it in the form of a per-memcg
THP split failure counter (e.g. in memory.stat), not as swapout failures.
Overall, option a) looks preferable to me, especially if the arm64
limitation will be fixed in the long run.

> 
> b) Only count cgroup swap events when they are actually due to a
>    cgroup's own limit. Exclude failures that are due to physical swap
>    shortage or other system-level conditions (like !THP_SWAP). Also
>    count them at the level where the limit is configured, which may be
>    above the local cgroup that holds the page-to-be-swapped.
> 
>    This is in line with how memory.swap.high, memory.high and
>    memory.max events are counted.
> 
>    However, it's a change in documented behavior.

I'm not sure about this option: I can easily imagine a setup with a
memcg-specific swap space, which would require setting an artificial
memory.swap.max to get the fail counter working. On the other hand,
it's not a deal breaker.

Thanks!

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Yang Shi @ 2023-02-03 19:07 UTC
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Fri, Feb 3, 2023 at 11:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> > Christian reports the following situation in a cgroup that doesn't
> > have memory.swap.max configured:
> >
> >   $ cat memory.swap.events
> >   high 0
> >   max 0
> >   fail 6218
> >
> > Upon closer examination, this is an ARM64 machine that doesn't support
> > swapping out THPs.
>
> Do we expect it to be added any time soon, or is it caused by some
> system limitation?

AFAIK, it has been supported since 6.0. See commit d0637c505f8a1

>
> > In that case, the first get_swap_page() fails, and
> > the kernel falls back to splitting the THP and swapping the 4k
> > constituents one by one. /proc/vmstat confirms this with a high rate
> > of thp_swpout_fallback events.
> >
> > While the behavior can ultimately be explained, it's unexpected and
> > confusing. I see three choices for how to address this:
> >
> > a) Specifically exclude THP fallbacks from being counted, as the
> >    failure is transient and the memory is ultimately swapped.
> >
> >    Arguably, though, the user would like to know if their cgroup's
> >    swap limit is causing high rates of THP splitting during swapout.
>
> I agree, but it's probably better to reflect it in the form of a per-memcg
> THP split failure counter (e.g. in memory.stat), not as swapout failures.
> Overall, option a) looks preferable to me, especially if the arm64
> limitation will be fixed in the long run.
>
> >
> > b) Only count cgroup swap events when they are actually due to a
> >    cgroup's own limit. Exclude failures that are due to physical swap
> >    shortage or other system-level conditions (like !THP_SWAP). Also
> >    count them at the level where the limit is configured, which may be
> >    above the local cgroup that holds the page-to-be-swapped.
> >
> >    This is in line with how memory.swap.high, memory.high and
> >    memory.max events are counted.
> >
> >    However, it's a change in documented behavior.
>
> I'm not sure about this option: I can easily imagine a setup with a
> memcg-specific swap space, which would require setting an artificial
> memory.swap.max to get the fail counter working. On the other side not a deal
> breaker.
>
> Thanks!
>

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Roman Gushchin @ 2023-02-03 19:19 UTC
  To: Yang Shi
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Fri, Feb 03, 2023 at 11:07:30AM -0800, Yang Shi wrote:
> On Fri, Feb 3, 2023 at 11:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> > > Christian reports the following situation in a cgroup that doesn't
> > > have memory.swap.max configured:
> > >
> > >   $ cat memory.swap.events
> > >   high 0
> > >   max 0
> > >   fail 6218
> > >
> > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > swapping out THPs.
> >
> > Do we expect it to be added any time soon, or is it caused by some
> > system limitation?
> 
> AFAIK, it has been supported since 6.0. See commit d0637c505f8a1

Great, thank you for the link!
Then it looks like we have even fewer reasons to change the interface.

Thanks!

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Michal Koutný @ 2023-02-06 16:18 UTC
  To: Yosry Ahmed
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Roman Gushchin,
	Tejun Heo, linux-mm, cgroups, linux-kernel, Christian Brauner

On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed <yosryahmed@google.com> wrote:
> > b) Only count cgroup swap events when they are actually due to a
> >    cgroup's own limit. Exclude failures that are due to physical swap
> >    shortage or other system-level conditions (like !THP_SWAP). Also
> >    count them at the level where the limit is configured, which may be
> >    above the local cgroup that holds the page-to-be-swapped.
> >
> >    This is in line with how memory.swap.high, memory.high and
> >    memory.max events are counted.
> >
> >    However, it's a change in documented behavior.
> 
> This option makes sense to me, but I can't speak to the change of
> documented behavior. However, looking at the code, it seems like if we do this
> the "max" & "fail" counters become effectively the same. "fail" would
> not provide much value then.
> 
> I wonder if it makes sense to have both, and clarify that "fail" -
> "max" would be non-limit based failures (e.g. ran out of swap space),
> or would this cause confusion as to whether those non-limit failures
> were transient (THP fallback) or eventual?

I somewhat second this.

Perhaps the patch (and arguments) could be split in two:
1) count .max events at the respective limit's level (for consistency with the other limits),
2) redefine (remove?) memory.swap.fail events?

Michal

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-07 16:52 UTC
  To: Roman Gushchin
  Cc: Yang Shi, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Fri, Feb 03, 2023 at 11:19:32AM -0800, Roman Gushchin wrote:
> On Fri, Feb 03, 2023 at 11:07:30AM -0800, Yang Shi wrote:
> > On Fri, Feb 3, 2023 at 11:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> > > > Christian reports the following situation in a cgroup that doesn't
> > > > have memory.swap.max configured:
> > > >
> > > >   $ cat memory.swap.events
> > > >   high 0
> > > >   max 0
> > > >   fail 6218
> > > >
> > > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > > swapping out THPs.
> > >
> > > Do we expect it to be added any time soon, or is it caused by some
> > > system limitation?
> > 
> > AFAIK, it has been supported since 6.0. See commit d0637c505f8a1
> 
> Great, thank you for the link!
> Then it looks like we have even fewer reasons to change the interface.

Yes, ARM supports it now. But the point wasn't necessarily to fix this
because of ARM. THP swap can fall back due to plenty of other reasons,
for example fragmentation. It always falls back on swapfiles since
they don't have the cluster allocator that bdevs have.
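
For illustration, roughly how that looks in get_swap_pages() in
mm/swapfile.c (condensed, with device iteration and locking trimmed;
the _sketch suffix marks this as a paraphrase, not the verbatim code):

static int get_swap_pages_sketch(struct swap_info_struct *si, int n_goal,
                                 swp_entry_t swp_entries[], int entry_size)
{
        int n_ret = 0;

        if (entry_size == SWAPFILE_CLUSTER) {   /* THP-sized request */
                /* Only block-device swap has the cluster allocator. */
                if (si->flags & SWP_BLKDEV)
                        n_ret = swap_alloc_cluster(si, swp_entries);
                /* Swapfile: n_ret stays 0 and the caller must split. */
        } else {
                n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
                                            n_goal, swp_entries);
        }
        return n_ret;
}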

The broader point was that we show failures in the cgroup event
counter that have nothing to do with the cgroup's configuration.

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-07 16:54 UTC
  To: Michal Koutný
  Cc: Yosry Ahmed, Michal Hocko, Shakeel Butt, Roman Gushchin,
	Tejun Heo, linux-mm, cgroups, linux-kernel, Christian Brauner

On Mon, Feb 06, 2023 at 05:18:43PM +0100, Michal Koutný wrote:
> On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed <yosryahmed@google.com> wrote:
> > > b) Only count cgroup swap events when they are actually due to a
> > >    cgroup's own limit. Exclude failures that are due to physical swap
> > >    shortage or other system-level conditions (like !THP_SWAP). Also
> > >    count them at the level where the limit is configured, which may be
> > >    above the local cgroup that holds the page-to-be-swapped.
> > >
> > >    This is in line with how memory.swap.high, memory.high and
> > >    memory.max events are counted.
> > >
> > >    However, it's a change in documented behavior.
> > 
> > This option makes sense to me, but I can't speak to the change of
> > documented behavior. However, looking at the code, it seems like if we do this
> > the "max" & "fail" counters become effectively the same. "fail" would
> > not provide much value then.
> > 
> > I wonder if it makes sense to have both, and clarify that "fail" -
> > "max" would be non-limit based failures (e.g. ran out of swap space),
> > or would this cause confusion as to whether those non-limit failures
> > were transient (THP fallback) or eventual?
> 
> I somewhat second this.
> 
> Perhaps the patch (and arguments) could be split in two:
> 1) count .max events at the respective limit's level (for consistency with the other limits),

Okay, I'll split this one out. It's good to have regardless of what we
do with the fail counter.

Thanks

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-07 19:09 UTC
  To: Yosry Ahmed
  Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Christian reports the following situation in a cgroup that doesn't
> > have memory.swap.max configured:
> >
> >   $ cat memory.swap.events
> >   high 0
> >   max 0
> >   fail 6218
> >
> > Upon closer examination, this is an ARM64 machine that doesn't support
> > swapping out THPs. In that case, the first get_swap_page() fails, and
> > the kernel falls back to splitting the THP and swapping the 4k
> > constituents one by one. /proc/vmstat confirms this with a high rate
> > of thp_swpout_fallback events.
> >
> > While the behavior can ultimately be explained, it's unexpected and
> > confusing. I see three choices for how to address this:
> >
> > a) Specifically exclude THP fallbacks from being counted, as the
> >    failure is transient and the memory is ultimately swapped.
> >
> >    Arguably, though, the user would like to know if their cgroup's
> >    swap limit is causing high rates of THP splitting during swapout.
> 
> We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> completeness) to memcg events for this if/when a use case arises,
> right?

Yes, we can add that to memory.stat.

> > b) Only count cgroup swap events when they are actually due to a
> >    cgroup's own limit. Exclude failures that are due to physical swap
> >    shortage or other system-level conditions (like !THP_SWAP). Also
> >    count them at the level where the limit is configured, which may be
> >    above the local cgroup that holds the page-to-be-swapped.
> >
> >    This is in line with how memory.swap.high, memory.high and
> >    memory.max events are counted.
> >
> >    However, it's a change in documented behavior.
> 
> This option makes sense to me, but I can't speak to the change of
> documented behavior. However, looking at the code, it seems like if we do this
> the "max" & "fail" counters become effectively the same. "fail" would
> not provide much value then.

Right.

> I wonder if it makes sense to have both, and clarify that "fail" -
> "max" would be non-limit based failures (e.g. ran out of swap space),
> or would this cause confusion as to whether those non-limit failures
> were transient (THP fallback) or eventual?

If we add the fallback events, the user can calculate it. I wouldn't
split the fail counter itself. There are too many reasons why swap can
fail, half of them implementation-defined (as in the ARM example).

So I think I'll send patches either way to:

1. Fix the hierarchical accounting of the events to make it consistent
   with other counters.

2. Add THP swap/fallback counts to memory.stat
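
A rough sketch of what 2. could look like, assuming the events would
be exposed through the memcg_vm_event_stat[] table that memory.stat is
generated from in mm/memcontrol.c (not a posted patch, entries elided):

static const unsigned int memcg_vm_event_stat[] = {
        PGFAULT,
        PGMAJFAULT,
        /* ... existing entries ... */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        THP_SWPOUT,             /* THPs swapped out in one piece */
        THP_SWPOUT_FALLBACK,    /* THPs split before swapout */
#endif
};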

We could consider excluding THP fallbacks from the fail count. But it
seems really wrong for the cgroup controller to start classifying
individual types of failures in the swap layer and make decisions on
how to report them to the user. The cgroup controller really shouldn't
be in the business of making up its own MM events; it should provide
per-cgroup accounting of native MM events. And nobody has felt the need
to add native swap failure counts yet.

So I'd argue we should either remove the swap fail count altogether
for all the reasons mentioned, or just leave it as is: a documented
interface that is unfortunately already out the door.

Thoughts?

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
@ 2023-02-07 19:21       ` Yosry Ahmed
  0 siblings, 0 replies; 24+ messages in thread
From: Yosry Ahmed @ 2023-02-07 19:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Tue, Feb 7, 2023 at 11:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> > On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Christian reports the following situation in a cgroup that doesn't
> > > have memory.swap.max configured:
> > >
> > >   $ cat memory.swap.events
> > >   high 0
> > >   max 0
> > >   fail 6218
> > >
> > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > swapping out THPs. In that case, the first get_swap_page() fails, and
> > > the kernel falls back to splitting the THP and swapping the 4k
> > > constituents one by one. /proc/vmstat confirms this with a high rate
> > > of thp_swpout_fallback events.
> > >
> > > While the behavior can ultimately be explained, it's unexpected and
> > > confusing. I see three choices how to address this:
> > >
> > > a) Specifically exclude THP fallbacks from being counted, as the
> > >    failure is transient and the memory is ultimately swapped.
> > >
> > >    Arguably, though, the user would like to know if their cgroup's
> > >    swap limit is causing high rates of THP splitting during swapout.
> >
> > We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> > completeness) to memcg events for this if/when a use case arises,
> > right?
>
> Yes, we can add that to memory.stat.
>
> > > b) Only count cgroup swap events when they are actually due to a
> > >    cgroup's own limit. Exclude failures that are due to physical swap
> > >    shortage or other system-level conditions (like !THP_SWAP). Also
> > >    count them at the level where the limit is configured, which may be
> > >    above the local cgroup that holds the page-to-be-swapped.
> > >
> > >    This is in line with how memory.swap.high, memory.high and
> > >    memory.max events are counted.
> > >
> > >    However, it's a change in documented behavior.
> >
> > This option makes sense to me, but I can't speak to the change of
> > documented behavior. However, looking at the code, it seems like if we do this
> > the "max" & "fail" counters become effectively the same. "fail" would
> > not provide much value then.
>
> Right.
>
> > I wonder if it makes sense to have both, and clarify that "fail" -
> > "max" would be non-limit based failures (e.g. ran out of swap space),
> > or would this cause confusion as to whether those non-limit failures
> > were transient (THP fallback) or eventual?
>
> If we add the fallback events, the user can calculate it. I wouldn't
> split the fail counter itself. There are too many reasons why swap can
> fail, half of them implementation-defined (as in the ARM example).
>
> So I think I'll send patches either way to:
>
> 1. Fix the hierarchical accounting of the events to make it consistent
>    with other counters.
>
> 2. Add THP swap/fallback counts to memory.stat

Sounds good, thanks!

>
> We could consider excluding THP fallbacks from the fail count. But it
> seems really wrong for the cgroup controller to start classifying
> individual types of failures in the swap layer and make decisions on
> how to report them to the user. The cgroup controller really
> shouldn't be in the business of making up its own MM events; it should
> provide per-cgroup accounting of native MM events. And nobody has felt
> the need to add a native swap failure count yet.
> 
> So I'd argue we should either remove the swap fail count altogether,
> for all the reasons mentioned, or just leave it as is, as a documented
> interface that is unfortunately already out the door.
>
> Thoughts?

Agreed. We should either report all failures, or only the failures
specific to cgroup limits (which the max count already tracks).

Between leaving the fail count as-is and removing it completely, I
don't feel strongly either way. If we add THP swap fallbacks to
memory.stat, the fail counter becomes more understandable, since you
can break it down into (max, THP swap fallback, others), with others
*probably* meaning swap is full. Arguably, the interesting (or
cgroup-relevant) components are already there regardless. Still, the
counter might be useful from a monitoring perspective: if a memcg
OOMs, for example, a high fail count shows us that it tried to swap
multiple times but failed; we can then look at the max count and the
THP fallbacks to understand where the failure is coming from.
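
As a rough illustration of that breakdown, a monitoring tool could do
something like the program below. This is a hypothetical sketch: it
assumes the proposed thp_swpout_fallback entry has actually landed in
memory.stat, and /sys/fs/cgroup/test is just an example path:

  /*
   * Hypothetical helper: split a cgroup's swap "fail" count into limit
   * hits, THP swapout fallbacks, and everything else (probably swap
   * being full). The components are sampled separately, so they may
   * not add up exactly on a busy system.
   */
  #include <stdio.h>
  #include <string.h>

  static unsigned long long read_counter(const char *path, const char *key)
  {
  	unsigned long long val;
  	char name[64];
  	FILE *f = fopen(path, "r");

  	if (!f)
  		return 0;
  	/* Both files use flat "name value" lines. */
  	while (fscanf(f, "%63s %llu", name, &val) == 2) {
  		if (!strcmp(name, key)) {
  			fclose(f);
  			return val;
  		}
  	}
  	fclose(f);
  	return 0;
  }

  int main(void)
  {
  	const char *cg = "/sys/fs/cgroup/test";	/* example cgroup */
  	unsigned long long fail, max, fallback;
  	char path[256];

  	snprintf(path, sizeof(path), "%s/memory.swap.events", cg);
  	fail = read_counter(path, "fail");
  	max = read_counter(path, "max");

  	snprintf(path, sizeof(path), "%s/memory.stat", cg);
  	fallback = read_counter(path, "thp_swpout_fallback");

  	printf("max: %llu thp fallback: %llu other: %llu\n",
  	       max, fallback, fail - max - fallback);
  	return 0;
  }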

I would say we can leave it as-is, but whatever you prefer.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
@ 2023-02-07 22:14         ` Roman Gushchin
  0 siblings, 0 replies; 24+ messages in thread
From: Roman Gushchin @ 2023-02-07 22:14 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Tue, Feb 07, 2023 at 11:21:15AM -0800, Yosry Ahmed wrote:
> On Tue, Feb 7, 2023 at 11:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> > > On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > Christian reports the following situation in a cgroup that doesn't
> > > > have memory.swap.max configured:
> > > >
> > > >   $ cat memory.swap.events
> > > >   high 0
> > > >   max 0
> > > >   fail 6218
> > > >
> > > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > > swapping out THPs. In that case, the first get_swap_page() fails, and
> > > > the kernel falls back to splitting the THP and swapping the 4k
> > > > constituents one by one. /proc/vmstat confirms this with a high rate
> > > > of thp_swpout_fallback events.
> > > >
> > > > While the behavior can ultimately be explained, it's unexpected and
> > > > confusing. I see three choices how to address this:
> > > >
> > > > a) Specifically exclude THP fallbacks from being counted, as the
> > > >    failure is transient and the memory is ultimately swapped.
> > > >
> > > >    Arguably, though, the user would like to know if their cgroup's
> > > >    swap limit is causing high rates of THP splitting during swapout.
> > >
> > > We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> > > completeness) to memcg events for this if/when a use case arises,
> > > right?
> >
> > Yes, we can add that to memory.stat.
> >
> > > > b) Only count cgroup swap events when they are actually due to a
> > > >    cgroup's own limit. Exclude failures that are due to physical swap
> > > >    shortage or other system-level conditions (like !THP_SWAP). Also
> > > >    count them at the level where the limit is configured, which may be
> > > >    above the local cgroup that holds the page-to-be-swapped.
> > > >
> > > >    This is in line with how memory.swap.high, memory.high and
> > > >    memory.max events are counted.
> > > >
> > > >    However, it's a change in documented behavior.
> > >
> > > This option makes sense to me, but I can't speak to the change of
> > > documented behavior. However, looking at the code, it seems like if we do this
> > > the "max" & "fail" counters become effectively the same. "fail" would
> > > not provide much value then.
> >
> > Right.
> >
> > > I wonder if it makes sense to have both, and clarify that "fail" -
> > > "max" would be non-limit based failures (e.g. ran out of swap space),
> > > or would this cause confusion as to whether those non-limit failures
> > > were transient (THP fallback) or eventual?
> >
> > If we add the fallback events, the user can calculate it. I wouldn't
> > split the fail counter itself. There are too many reasons why swap can
> > fail, half of them implementation-defined (as in the ARM example).
> >
> > So I think I'll send patches either way to:
> >
> > 1. Fix the hierarchical accounting of the events to make it consistent
> >    with other counters.
> >
> > 2. Add THP swap/fallback counts to memory.stat
> 
> Sounds good, thanks!
> 
> >
> > We could consider excluding THP fallbacks from the fail count. But it
> > seems really wrong for the cgroup controller to start classifying
> > individual types of failures in the swap layer and make decisions on
> > how to report them to the user. The cgroup controller really
> > shouldn't be in the business of making up its own MM events; it should
> > provide per-cgroup accounting of native MM events. And nobody has felt
> > the need to add a native swap failure count yet.
> >
> > So I'd argue we should either remove the swap fail count altogether,
> > for all the reasons mentioned, or just leave it as is, as a documented
> > interface that is unfortunately already out the door.
> >
> > Thoughts?
> 
> Agreed. We should either report all failures, or only the failures
> specific to cgroup limits (which the max count already tracks).
> 
> Between leaving the fail count as-is and removing it completely, I
> don't feel strongly either way. If we add THP swap fallbacks to
> memory.stat, the fail counter becomes more understandable, since you
> can break it down into (max, THP swap fallback, others), with others
> *probably* meaning swap is full. Arguably, the interesting (or
> cgroup-relevant) components are already there regardless. Still, the
> counter might be useful from a monitoring perspective: if a memcg
> OOMs, for example, a high fail count shows us that it tried to swap
> multiple times but failed; we can then look at the max count and the
> THP fallbacks to understand where the failure is coming from.
> 
> I would say we can leave it as-is, but whatever you prefer.

+1

On one hand, the fewer counters the better. On the other, removing
things is an API break, which should have a really good reason. So
let's probably leave it as it is.

^ permalink raw reply	[flat|nested] 24+ messages in thread


end of thread, other threads:[~2023-02-07 22:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-02 15:56 [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits Johannes Weiner
2023-02-02 18:27 ` Shakeel Butt
2023-02-02 18:30 ` Yosry Ahmed
2023-02-06 16:18   ` Michal Koutný
2023-02-07 16:54     ` Johannes Weiner
2023-02-07 19:09   ` Johannes Weiner
2023-02-07 19:21     ` Yosry Ahmed
2023-02-07 22:14       ` Roman Gushchin
2023-02-03 19:00 ` Roman Gushchin
2023-02-03 19:07   ` Yang Shi
2023-02-03 19:19     ` Roman Gushchin
2023-02-07 16:52       ` Johannes Weiner
