* [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-02 15:56 UTC
  To: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo
  Cc: linux-mm, cgroups, linux-kernel, Christian Brauner

Christian reports the following situation in a cgroup that doesn't
have memory.swap.max configured:

  $ cat memory.swap.events
  high 0
  max 0
  fail 6218

Upon closer examination, this is an ARM64 machine that doesn't support
swapping out THPs. In that case, the first get_swap_page() fails, and
the kernel falls back to splitting the THP and swapping the 4k
constituents one by one. /proc/vmstat confirms this with a high rate
of thp_swpout_fallback events.
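
For reference, a condensed sketch of that fallback path, modeled on
the reclaim code in mm/vmscan.c (the helper name and the trimmed
locking/error handling are mine, not the kernel's):

static bool try_swap_out(struct folio *folio, struct list_head *folio_list)
{
        /* Try to allocate swap for the folio (a 2M cluster for a THP). */
        if (add_to_swap(folio))
                return true;
        if (!folio_test_large(folio))
                return false;   /* genuine failure for a 4k page */
        /* Fall back: split the THP, queueing its tail pages for reclaim. */
        if (split_folio_to_list(folio, folio_list))
                return false;   /* split failed, keep the folio */
        count_vm_event(THP_SWPOUT_FALLBACK);    /* what /proc/vmstat shows */
        return add_to_swap(folio);      /* retry with the now-small folio */
}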

While the behavior can ultimately be explained, it's unexpected and
confusing. I see three choices for how to address this:

a) Specifically exclude THP fallbacks from being counted, as the
   failure is transient and the memory is ultimately swapped.

   Arguably, though, the user would like to know if their cgroup's
   swap limit is causing high rates of THP splitting during swapout.

b) Only count cgroup swap events when they are actually due to a
   cgroup's own limit. Exclude failures that are due to physical swap
   shortage or other system-level conditions (like !THP_SWAP). Also
   count them at the level where the limit is configured, which may be
   above the local cgroup that holds the page-to-be-swapped.

   This is in line with how memory.swap.high, memory.high and
   memory.max events are counted.

   However, it's a change in documented behavior.

c) Leave it as is. The documentation says system-level events are
   counted, so stick to that.

   This is the conservative option, but isn't very user friendly.
   Cgroup events are usually due to a local control choice made by the
   user. Mixing in events that are beyond the user's control makes it
   difficult to identify root causes and configure the system properly.

Implement option b).
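
For context, the mechanism this relies on, condensed from
mm/page_counter.c with the watermark and protection bookkeeping
trimmed: on failure, page_counter_try_charge() reports the ancestor
counter whose limit was hit through @fail, and mem_cgroup_from_counter()
(a container_of() wrapper in mm/memcontrol.c) maps that counter back
to its memcg, so the event can be counted at the level where the limit
is actually configured:

bool page_counter_try_charge(struct page_counter *counter,
                             unsigned long nr_pages,
                             struct page_counter **fail)
{
        struct page_counter *c;

        for (c = counter; c; c = c->parent) {
                long new = atomic_long_add_return(nr_pages, &c->usage);

                if (new > c->max) {
                        atomic_long_sub(nr_pages, &c->usage);
                        *fail = c;      /* the counter whose limit was hit */
                        goto failed;
                }
        }
        return true;

failed:
        /* Unwind the partial charge below the level that failed. */
        for (c = counter; c != *fail; c = c->parent)
                page_counter_cancel(c, nr_pages);
        return false;
}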

Reported-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  6 +++---
 mm/memcontrol.c                         | 12 +++++-------
 mm/swap_slots.c                         |  2 +-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index c8ae7c897f14..a8ffb89a4169 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1605,9 +1605,9 @@ PAGE_SIZE multiple when read back.
 		failed.
 
 	  fail
-		The number of times swap allocation failed either
-		because of running out of swap system-wide or max
-		limit.
+
+		The number of times swap allocation failed because of
+		the max limit.
 
 	When reduced under the current usage, the existing swap
 	entries are reclaimed gradually and the swap usage may stay
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..c2a6206ce84b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7470,17 +7470,15 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
 	if (!memcg)
 		return 0;
 
-	if (!entry.val) {
-		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
-		return 0;
-	}
-
 	memcg = mem_cgroup_id_get_online(memcg);
 
 	if (!mem_cgroup_is_root(memcg) &&
 	    !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
-		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
-		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+		struct mem_cgroup *swap_over_limit;
+
+		swap_over_limit = mem_cgroup_from_counter(counter, swap);
+		memcg_memory_event(swap_over_limit, MEMCG_SWAP_MAX);
+		memcg_memory_event(swap_over_limit, MEMCG_SWAP_FAIL);
 		mem_cgroup_id_put(memcg);
 		return -ENOMEM;
 	}
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..66076bd60e2b 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -342,7 +342,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 
 	get_swap_pages(1, &entry, 1);
 out:
-	if (mem_cgroup_try_charge_swap(folio, entry)) {
+	if (entry.val && mem_cgroup_try_charge_swap(folio, entry) < 0) {
 		put_swap_folio(folio, entry);
 		entry.val = 0;
 	}
-- 
2.39.1


* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Shakeel Butt @ 2023-02-02 18:27 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Roman Gushchin, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Christian Brauner

On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Christian reports the following situation in a cgroup that doesn't
> have memory.swap.max configured:
>
>   $ cat memory.swap.events
>   high 0
>   max 0
>   fail 6218
>
> Upon closer examination, this is an ARM64 machine that doesn't support
> swapping out THPs. In that case, the first get_swap_page() fails, and
> the kernel falls back to splitting the THP and swapping the 4k
> constituents one by one. /proc/vmstat confirms this with a high rate
> of thp_swpout_fallback events.
>
> While the behavior can ultimately be explained, it's unexpected and
> confusing. I see three choices for how to address this:
>
> a) Specifically exclude THP fallbacks from being counted, as the
>    failure is transient and the memory is ultimately swapped.
>
>    Arguably, though, the user would like to know if their cgroup's
>    swap limit is causing high rates of THP splitting during swapout.
>
> b) Only count cgroup swap events when they are actually due to a
>    cgroup's own limit. Exclude failures that are due to physical swap
>    shortage or other system-level conditions (like !THP_SWAP). Also
>    count them at the level where the limit is configured, which may be
>    above the local cgroup that holds the page-to-be-swapped.
>
>    This is in line with how memory.swap.high, memory.high and
>    memory.max events are counted.
>
>    However, it's a change in documented behavior.
>
> c) Leave it as is. The documentation says system-level events are
>    counted, so stick to that.
>
>    This is the conservative option, but isn't very user friendly.
>    Cgroup events are usually due to a local control choice made by the
>    user. Mixing in events that are beyond the user's control makes it
>    difficult to identify root causes and configure the system properly.
>
> Implement option b).

I prefer option b too.

>
> Reported-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

I think we should CC stable as well for early exposure.

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Yosry Ahmed @ 2023-02-02 18:30 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Christian reports the following situation in a cgroup that doesn't
> have memory.swap.max configured:
>
>   $ cat memory.swap.events
>   high 0
>   max 0
>   fail 6218
>
> Upon closer examination, this is an ARM64 machine that doesn't support
> swapping out THPs. In that case, the first get_swap_page() fails, and
> the kernel falls back to splitting the THP and swapping the 4k
> constituents one by one. /proc/vmstat confirms this with a high rate
> of thp_swpout_fallback events.
>
> While the behavior can ultimately be explained, it's unexpected and
> confusing. I see three choices for how to address this:
>
> a) Specifically exclude THP fallbacks from being counted, as the
>    failure is transient and the memory is ultimately swapped.
>
>    Arguably, though, the user would like to know if their cgroup's
>    swap limit is causing high rates of THP splitting during swapout.

We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
completeness) to memcg events for this if/when a use case arises,
right?

>
> b) Only count cgroup swap events when they are actually due to a
>    cgroup's own limit. Exclude failures that are due to physical swap
>    shortage or other system-level conditions (like !THP_SWAP). Also
>    count them at the level where the limit is configured, which may be
>    above the local cgroup that holds the page-to-be-swapped.
>
>    This is in line with how memory.swap.high, memory.high and
>    memory.max events are counted.
>
>    However, it's a change in documented behavior.

This option makes sense to me, but I can't speak to the change of
documented behavior. However, looking at the code, it seems like if we do this
the "max" & "fail" counters become effectively the same. "fail" would
not provide much value then.

I wonder if it makes sense to have both, and clarify that "fail" -
"max" would be non-limit based failures (e.g. ran out of swap space),
or would this cause confusion as to whether those non-limit failures
were transient (THP fallback) or eventual?


>
> c) Leave it as is. The documentation says system-level events are
>    counted, so stick to that.
>
>    This is the conservative option, but isn't very user friendly.
>    Cgroup events are usually due to a local control choice made by the
>    user. Mixing in events that are beyond the user's control makes it
>    difficult to identify root causes and configure the system properly.
>
> Implement option b).
>
> Reported-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  6 +++---
>  mm/memcontrol.c                         | 12 +++++-------
>  mm/swap_slots.c                         |  2 +-
>  3 files changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index c8ae7c897f14..a8ffb89a4169 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1605,9 +1605,9 @@ PAGE_SIZE multiple when read back.
>                 failed.
>
>           fail
> -               The number of times swap allocation failed either
> -               because of running out of swap system-wide or max
> -               limit.
> +
> +               The number of times swap allocation failed because of
> +               the max limit.
>
>         When reduced under the current usage, the existing swap
>         entries are reclaimed gradually and the swap usage may stay
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..c2a6206ce84b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7470,17 +7470,15 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
>         if (!memcg)
>                 return 0;
>
> -       if (!entry.val) {
> -               memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
> -               return 0;
> -       }
> -
>         memcg = mem_cgroup_id_get_online(memcg);
>
>         if (!mem_cgroup_is_root(memcg) &&
>             !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
> -               memcg_memory_event(memcg, MEMCG_SWAP_MAX);
> -               memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
> +               struct mem_cgroup *swap_over_limit;
> +
> +               swap_over_limit = mem_cgroup_from_counter(counter, swap);
> +               memcg_memory_event(swap_over_limit, MEMCG_SWAP_MAX);
> +               memcg_memory_event(swap_over_limit, MEMCG_SWAP_FAIL);
>                 mem_cgroup_id_put(memcg);
>                 return -ENOMEM;
>         }
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..66076bd60e2b 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -342,7 +342,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>
>         get_swap_pages(1, &entry, 1);
>  out:
> -       if (mem_cgroup_try_charge_swap(folio, entry)) {
> +       if (entry.val && mem_cgroup_try_charge_swap(folio, entry) < 0) {
>                 put_swap_folio(folio, entry);
>                 entry.val = 0;
>         }
> --
> 2.39.1
>
>

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Roman Gushchin @ 2023-02-03 19:00 UTC
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm, cgroups,
	linux-kernel, Christian Brauner

On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> Christian reports the following situation in a cgroup that doesn't
> have memory.swap.max configured:
> 
>   $ cat memory.swap.events
>   high 0
>   max 0
>   fail 6218
> 
> Upon closer examination, this is an ARM64 machine that doesn't support
> swapping out THPs.

Do we expect it to be added any time soon, or is it caused by some
system limitation?

> In that case, the first get_swap_page() fails, and
> the kernel falls back to splitting the THP and swapping the 4k
> constituents one by one. /proc/vmstat confirms this with a high rate
> of thp_swpout_fallback events.
> 
> While the behavior can ultimately be explained, it's unexpected and
> confusing. I see three choices for how to address this:
>
> a) Specifically exclude THP fallbacks from being counted, as the
>    failure is transient and the memory is ultimately swapped.
> 
>    Arguably, though, the user would like to know if their cgroup's
>    swap limit is causing high rates of THP splitting during swapout.

I agree, but it's probably better to reflect it in the form of a per-memcg
THP split failure counter (e.g. in memory.stat), not as swapout failures.
Overall, option a) looks preferable to me, especially if the arm64
limitation will be fixed in the long run.

> 
> b) Only count cgroup swap events when they are actually due to a
>    cgroup's own limit. Exclude failures that are due to physical swap
>    shortage or other system-level conditions (like !THP_SWAP). Also
>    count them at the level where the limit is configured, which may be
>    above the local cgroup that holds the page-to-be-swapped.
> 
>    This is in line with how memory.swap.high, memory.high and
>    memory.max events are counted.
> 
>    However, it's a change in documented behavior.

I'm not sure about this option: I can easily imagine a setup with a
memcg-specific swap space, which would require setting an artificial
memory.swap.max to get the fail counter working. On the other hand,
it's not a deal breaker.

Thanks!

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Yang Shi @ 2023-02-03 19:07 UTC
  To: Roman Gushchin
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Fri, Feb 3, 2023 at 11:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> > Christian reports the following situation in a cgroup that doesn't
> > have memory.swap.max configured:
> >
> >   $ cat memory.swap.events
> >   high 0
> >   max 0
> >   fail 6218
> >
> > Upon closer examination, this is an ARM64 machine that doesn't support
> > swapping out THPs.
>
> Do we expect it to be added any time soon, or is it caused by some
> system limitation?

AFAIK, it has been supported since 6.0. See commit d0637c505f8a1

>
> > In that case, the first get_swap_page() fails, and
> > the kernel falls back to splitting the THP and swapping the 4k
> > constituents one by one. /proc/vmstat confirms this with a high rate
> > of thp_swpout_fallback events.
> >
> > While the behavior can ultimately be explained, it's unexpected and
> > confusing. I see three choices for how to address this:
> >
> > a) Specifically exclude THP fallbacks from being counted, as the
> >    failure is transient and the memory is ultimately swapped.
> >
> >    Arguably, though, the user would like to know if their cgroup's
> >    swap limit is causing high rates of THP splitting during swapout.
>
> I agree, but it's probably better to reflect it in the form of a per-memcg
> THP split failure counter (e.g. in memory.stat), not as swapout failures.
> Overall, option a) looks preferable to me, especially if the arm64
> limitation will be fixed in the long run.
>
> >
> > b) Only count cgroup swap events when they are actually due to a
> >    cgroup's own limit. Exclude failures that are due to physical swap
> >    shortage or other system-level conditions (like !THP_SWAP). Also
> >    count them at the level where the limit is configured, which may be
> >    above the local cgroup that holds the page-to-be-swapped.
> >
> >    This is in line with how memory.swap.high, memory.high and
> >    memory.max events are counted.
> >
> >    However, it's a change in documented behavior.
>
> I'm not sure about this option: I can easily imagine a setup with a
> memcg-specific swap space, which would require setting an artificial
> memory.swap.max to get the fail counter working. On the other side not a deal
> breaker.
>
> Thanks!
>

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Roman Gushchin @ 2023-02-03 19:19 UTC
  To: Yang Shi
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Fri, Feb 03, 2023 at 11:07:30AM -0800, Yang Shi wrote:
> On Fri, Feb 3, 2023 at 11:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> > > Christian reports the following situation in a cgroup that doesn't
> > > have memory.swap.max configured:
> > >
> > >   $ cat memory.swap.events
> > >   high 0
> > >   max 0
> > >   fail 6218
> > >
> > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > swapping out THPs.
> >
> > Do we expect it to be added any time soon, or is it caused by some
> > system limitation?
> 
> AFAIK, it has been supported since 6.0. See commit d0637c505f8a1

Great, thank you for the link!
Then it looks like we have even fewer reasons to change the interface.

Thanks!

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Michal Koutný @ 2023-02-06 16:18 UTC
  To: Yosry Ahmed
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Roman Gushchin,
	Tejun Heo, linux-mm, cgroups, linux-kernel, Christian Brauner

On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed <yosryahmed@google.com> wrote:
> > b) Only count cgroup swap events when they are actually due to a
> >    cgroup's own limit. Exclude failures that are due to physical swap
> >    shortage or other system-level conditions (like !THP_SWAP). Also
> >    count them at the level where the limit is configured, which may be
> >    above the local cgroup that holds the page-to-be-swapped.
> >
> >    This is in line with how memory.swap.high, memory.high and
> >    memory.max events are counted.
> >
> >    However, it's a change in documented behavior.
> 
> This option makes sense to me, but I can't speak to the change of
> documented behavior. However, looking at the code, it seems like if we do this
> the "max" & "fail" counters become effectively the same. "fail" would
> not provide much value then.
> 
> I wonder if it makes sense to have both, and clarify that "fail" -
> "max" would be non-limit based failures (e.g. ran out of swap space),
> or would this cause confusion as to whether those non-limit failures
> were transient (THP fallback) or eventual?

I somewhat second this.

Perhaps the patch (and arguments) could be split in two:
1) count .max events at the respective limit's level (for consistency with the other limits),
2) redefine (remove?) memory.swap.fail events?

Michal

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-07 16:52 UTC
  To: Roman Gushchin
  Cc: Yang Shi, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Fri, Feb 03, 2023 at 11:19:32AM -0800, Roman Gushchin wrote:
> On Fri, Feb 03, 2023 at 11:07:30AM -0800, Yang Shi wrote:
> > On Fri, Feb 3, 2023 at 11:00 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Thu, Feb 02, 2023 at 10:56:26AM -0500, Johannes Weiner wrote:
> > > > Christian reports the following situation in a cgroup that doesn't
> > > > have memory.swap.max configured:
> > > >
> > > >   $ cat memory.swap.events
> > > >   high 0
> > > >   max 0
> > > >   fail 6218
> > > >
> > > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > > swapping out THPs.
> > >
> > > Do we expect it to be added any time soon, or is it caused by some
> > > system limitation?
> > 
> > AFAIK, it has been supported since 6.0. See commit d0637c505f8a1
> 
> Great, thank you for the link!
> Then it looks like we have even fewer reasons to change the interface.

Yes, ARM supports it now. But the point wasn't necessarily to fix this
because of ARM. THP swap can fall back due to plenty of other reasons,
for example fragmentation. It always falls back on swapfiles since
they don't have the cluster allocator that bdevs have.
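
For illustration, roughly how that looks in get_swap_pages() in
mm/swapfile.c (condensed, with device iteration and locking trimmed;
the _sketch suffix marks this as a paraphrase, not the verbatim code):

static int get_swap_pages_sketch(struct swap_info_struct *si, int n_goal,
                                 swp_entry_t swp_entries[], int entry_size)
{
        int n_ret = 0;

        if (entry_size == SWAPFILE_CLUSTER) {   /* THP-sized request */
                /* Only block-device swap has the cluster allocator. */
                if (si->flags & SWP_BLKDEV)
                        n_ret = swap_alloc_cluster(si, swp_entries);
                /* Swapfile: n_ret stays 0 and the caller must split. */
        } else {
                n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
                                            n_goal, swp_entries);
        }
        return n_ret;
}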

The broader point was that we show failures in the cgroup event
counter that have nothing to do with the cgroup's configuration.

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-07 16:54 UTC
  To: Michal Koutný
  Cc: Yosry Ahmed, Michal Hocko, Shakeel Butt, Roman Gushchin,
	Tejun Heo, linux-mm, cgroups, linux-kernel, Christian Brauner

On Mon, Feb 06, 2023 at 05:18:43PM +0100, Michal Koutný wrote:
> On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed <yosryahmed@google.com> wrote:
> > > b) Only count cgroup swap events when they are actually due to a
> > >    cgroup's own limit. Exclude failures that are due to physical swap
> > >    shortage or other system-level conditions (like !THP_SWAP). Also
> > >    count them at the level where the limit is configured, which may be
> > >    above the local cgroup that holds the page-to-be-swapped.
> > >
> > >    This is in line with how memory.swap.high, memory.high and
> > >    memory.max events are counted.
> > >
> > >    However, it's a change in documented behavior.
> > 
> > This option makes sense to me, but I can't speak to the change of
> > documented behavior. However, looking at the code, it seems like if we do this
> > the "max" & "fail" counters become effectively the same. "fail" would
> > not provide much value then.
> > 
> > I wonder if it makes sense to have both, and clarify that "fail" -
> > "max" would be non-limit based failures (e.g. ran out of swap space),
> > or would this cause confusion as to whether those non-limit failures
> > were transient (THP fallback) or eventual?
> 
> I somewhat second this.
> 
> Perhaps the patch (and arguments) could be split in two:
> 1) count .max events at the respective limit's level (for consistency with the other limits),

Okay, I'll split this one out. It's good to have regardless of what we
do with the fail counter.

Thanks

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
From: Johannes Weiner @ 2023-02-07 19:09 UTC
  To: Yosry Ahmed
  Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Christian reports the following situation in a cgroup that doesn't
> > have memory.swap.max configured:
> >
> >   $ cat memory.swap.events
> >   high 0
> >   max 0
> >   fail 6218
> >
> > Upon closer examination, this is an ARM64 machine that doesn't support
> > swapping out THPs. In that case, the first get_swap_page() fails, and
> > the kernel falls back to splitting the THP and swapping the 4k
> > constituents one by one. /proc/vmstat confirms this with a high rate
> > of thp_swpout_fallback events.
> >
> > While the behavior can ultimately be explained, it's unexpected and
> > confusing. I see three choices for how to address this:
> >
> > a) Specifically exclude THP fallbacks from being counted, as the
> >    failure is transient and the memory is ultimately swapped.
> >
> >    Arguably, though, the user would like to know if their cgroup's
> >    swap limit is causing high rates of THP splitting during swapout.
> 
> We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> completeness) to memcg events for this if/when a use case arises,
> right?

Yes, we can add that to memory.stat.

> > b) Only count cgroup swap events when they are actually due to a
> >    cgroup's own limit. Exclude failures that are due to physical swap
> >    shortage or other system-level conditions (like !THP_SWAP). Also
> >    count them at the level where the limit is configured, which may be
> >    above the local cgroup that holds the page-to-be-swapped.
> >
> >    This is in line with how memory.swap.high, memory.high and
> >    memory.max events are counted.
> >
> >    However, it's a change in documented behavior.
> 
> This option makes sense to me, but I can't speak to the change of
> documented behavior. However, looking at the code, it seems like if we do this
> the "max" & "fail" counters become effectively the same. "fail" would
> not provide much value then.

Right.

> I wonder if it makes sense to have both, and clarify that "fail" -
> "max" would be non-limit based failures (e.g. ran out of swap space),
> or would this cause confusion as to whether those non-limit failures
> were transient (THP fallback) or eventual?

If we add the fallback events, the user can calculate it. I wouldn't
split the fail counter itself. There are too many reasons why swap can
fail, half of them implementation-defined (as in the ARM example).

So I think I'll send patches either way to:

1. Fix the hierarchical accounting of the events to make it consistent
   with other counters.

2. Add THP swap/fallback counts to memory.stat
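
A rough sketch of what 2. could look like, assuming the events would
be exposed through the memcg_vm_event_stat[] table that memory.stat is
generated from in mm/memcontrol.c (not a posted patch, entries elided):

static const unsigned int memcg_vm_event_stat[] = {
        PGFAULT,
        PGMAJFAULT,
        /* ... existing entries ... */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        THP_SWPOUT,             /* THPs swapped out in one piece */
        THP_SWPOUT_FALLBACK,    /* THPs split before swapout */
#endif
};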

We could consider excluding THP fallbacks from the fail count. But it
seems really wrong for the cgroup controller to start classifying
individual types of failures in the swap layer and make decisions on
how to report them to the user. The cgroup controller really shouldn't
be in the business of making up its own MM events; it should provide
per-cgroup accounting of native MM events. And nobody has felt the need
to add native swap failure counts yet.

So I'd argue we should either remove the swap fail count altogether
for all the reasons mentioned, or just leave it as is: a documented
interface that is unfortunately already out the door.

Thoughts?

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
@ 2023-02-07 19:21       ` Yosry Ahmed
  0 siblings, 0 replies; 24+ messages in thread
From: Yosry Ahmed @ 2023-02-07 19:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Shakeel Butt, Roman Gushchin, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Tue, Feb 7, 2023 at 11:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> > On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Christian reports the following situation in a cgroup that doesn't
> > > have memory.swap.max configured:
> > >
> > >   $ cat memory.swap.events
> > >   high 0
> > >   max 0
> > >   fail 6218
> > >
> > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > swapping out THPs. In that case, the first get_swap_page() fails, and
> > > the kernel falls back to splitting the THP and swapping the 4k
> > > constituents one by one. /proc/vmstat confirms this with a high rate
> > > of thp_swpout_fallback events.
> > >
> > > While the behavior can ultimately be explained, it's unexpected and
> > > confusing. I see three choices how to address this:
> > >
> > > a) Specifically exclude THP fallbacks from being counted, as the
> > >    failure is transient and the memory is ultimately swapped.
> > >
> > >    Arguably, though, the user would like to know if their cgroup's
> > >    swap limit is causing high rates of THP splitting during swapout.
> >
> > We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> > completeness) to memcg events for this if/when a use case arises,
> > right?
>
> Yes, we can add that to memory.stat.
>
> > > b) Only count cgroup swap events when they are actually due to a
> > >    cgroup's own limit. Exclude failures that are due to physical swap
> > >    shortage or other system-level conditions (like !THP_SWAP). Also
> > >    count them at the level where the limit is configured, which may be
> > >    above the local cgroup that holds the page-to-be-swapped.
> > >
> > >    This is in line with how memory.swap.high, memory.high and
> > >    memory.max events are counted.
> > >
> > >    However, it's a change in documented behavior.
> >
> > This option makes sense to me, but I can't speak to the change of
> > documented behavior. However, looking at the code, it seems like if we do this
> > the "max" & "fail" counters become effectively the same. "fail" would
> > not provide much value then.
>
> Right.
>
> > I wonder if it makes sense to have both, and clarify that "fail" -
> > "max" would be non-limit based failures (e.g. ran out of swap space),
> > or would this cause confusion as to whether those non-limit failures
> > were transient (THP fallback) or eventual?
>
> If we add the fallback events, the user can calculate it. I wouldn't
> split the fail counter itself. There are too many reasons why swap can
> fail, half of them implementation-defined (as in the ARM example).
>
> So I think I'll send patches either way to:
>
> 1. Fix the hierarchical accounting of the events to make it consistent
>    with other counters.
>
> 2. Add THP swap/fallback counts to memory.stat

Sounds good, thanks!

>
> We could consider excluding THP fallbacks from the fail count. But it
> seems really wrong for the cgroup controller to start classifying
> individual types of failures in the swap layer and make decisions on
> how to report them to the user. The cgroup controller really
> shouldn't be in the business of making up its own MM events; it should
> provide per-cgroup accounting of native MM events. And nobody has felt
> the need to add a native swap failure count yet.
> 
> So I'd argue we should either remove the swap fail count altogether,
> for all the reasons mentioned, or just leave it as is, as a documented
> interface that is unfortunately already out the door.
>
> Thoughts?

Agreed. We should either report all failures, or only the failures
specific to cgroup limits (which the max count already tracks).

Between leaving the fail count as-is and removing it completely, I
don't feel strongly either way. If we add THP swap fallbacks to
memory.stat, the fail counter becomes more understandable, since you
can break it down into (max, THP swap fallback, others), with others
*probably* meaning swap is full. Arguably, the interesting (or
cgroup-relevant) components are already there regardless. Still, the
counter might be useful from a monitoring perspective: if a memcg
OOMs, for example, a high fail count shows us that it tried to swap
multiple times but failed; we can then look at the max count and the
THP fallbacks to understand where the failure is coming from.
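
As a rough illustration of that breakdown, a monitoring tool could do
something like the program below. This is a hypothetical sketch: it
assumes the proposed thp_swpout_fallback entry has actually landed in
memory.stat, and /sys/fs/cgroup/test is just an example path:

  /*
   * Hypothetical helper: split a cgroup's swap "fail" count into limit
   * hits, THP swapout fallbacks, and everything else (probably swap
   * being full). The components are sampled separately, so they may
   * not add up exactly on a busy system.
   */
  #include <stdio.h>
  #include <string.h>

  static unsigned long long read_counter(const char *path, const char *key)
  {
  	unsigned long long val;
  	char name[64];
  	FILE *f = fopen(path, "r");

  	if (!f)
  		return 0;
  	/* Both files use flat "name value" lines. */
  	while (fscanf(f, "%63s %llu", name, &val) == 2) {
  		if (!strcmp(name, key)) {
  			fclose(f);
  			return val;
  		}
  	}
  	fclose(f);
  	return 0;
  }

  int main(void)
  {
  	const char *cg = "/sys/fs/cgroup/test";	/* example cgroup */
  	unsigned long long fail, max, fallback;
  	char path[256];

  	snprintf(path, sizeof(path), "%s/memory.swap.events", cg);
  	fail = read_counter(path, "fail");
  	max = read_counter(path, "max");

  	snprintf(path, sizeof(path), "%s/memory.stat", cg);
  	fallback = read_counter(path, "thp_swpout_fallback");

  	printf("max: %llu thp fallback: %llu other: %llu\n",
  	       max, fallback, fail - max - fallback);
  	return 0;
  }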

I would say we can leave it as-is, but whatever you prefer.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits
@ 2023-02-07 22:14         ` Roman Gushchin
  0 siblings, 0 replies; 24+ messages in thread
From: Roman Gushchin @ 2023-02-07 22:14 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Tejun Heo, linux-mm,
	cgroups, linux-kernel, Christian Brauner

On Tue, Feb 07, 2023 at 11:21:15AM -0800, Yosry Ahmed wrote:
> On Tue, Feb 7, 2023 at 11:09 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Feb 02, 2023 at 10:30:40AM -0800, Yosry Ahmed wrote:
> > > On Thu, Feb 2, 2023 at 7:56 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > Christian reports the following situation in a cgroup that doesn't
> > > > have memory.swap.max configured:
> > > >
> > > >   $ cat memory.swap.events
> > > >   high 0
> > > >   max 0
> > > >   fail 6218
> > > >
> > > > Upon closer examination, this is an ARM64 machine that doesn't support
> > > > swapping out THPs. In that case, the first get_swap_page() fails, and
> > > > the kernel falls back to splitting the THP and swapping the 4k
> > > > constituents one by one. /proc/vmstat confirms this with a high rate
> > > > of thp_swpout_fallback events.
> > > >
> > > > While the behavior can ultimately be explained, it's unexpected and
> > > > confusing. I see three choices how to address this:
> > > >
> > > > a) Specifically exclude THP fallbacks from being counted, as the
> > > >    failure is transient and the memory is ultimately swapped.
> > > >
> > > >    Arguably, though, the user would like to know if their cgroup's
> > > >    swap limit is causing high rates of THP splitting during swapout.
> > >
> > > We have the option to add THP_SWPOUT_FALLBACK (and THP_SWPOUT for
> > > completeness) to memcg events for this if/when a use case arises,
> > > right?
> >
> > Yes, we can add that to memory.stat.
> >
> > > > b) Only count cgroup swap events when they are actually due to a
> > > >    cgroup's own limit. Exclude failures that are due to physical swap
> > > >    shortage or other system-level conditions (like !THP_SWAP). Also
> > > >    count them at the level where the limit is configured, which may be
> > > >    above the local cgroup that holds the page-to-be-swapped.
> > > >
> > > >    This is in line with how memory.swap.high, memory.high and
> > > >    memory.max events are counted.
> > > >
> > > >    However, it's a change in documented behavior.
> > >
> > > This option makes sense to me, but I can't speak to the change of
> > > documented behavior. However, looking at the code, it seems like if we do this
> > > the "max" & "fail" counters become effectively the same. "fail" would
> > > not provide much value then.
> >
> > Right.
> >
> > > I wonder if it makes sense to have both, and clarify that "fail" -
> > > "max" would be non-limit based failures (e.g. ran out of swap space),
> > > or would this cause confusion as to whether those non-limit failures
> > > were transient (THP fallback) or eventual?
> >
> > If we add the fallback events, the user can calculate it. I wouldn't
> > split the fail counter itself. There are too many reasons why swap can
> > fail, half of them implementation-defined (as in the ARM example).
> >
> > So I think I'll send patches either way to:
> >
> > 1. Fix the hierarchical accounting of the events to make it consistent
> >    with other counters.
> >
> > 2. Add THP swap/fallback counts to memory.stat
> 
> Sounds good, thanks!
> 
> >
> > We could consider excluding THP fallbacks from the fail count. But it
> > seems really wrong for the cgroup controller to start classifying
> > individual types of failures in the swap layer and make decisions on
> > how to report them to the user. The cgroup controller really
> > shouldn't be in the business of making up its own MM events; it should
> > provide per-cgroup accounting of native MM events. And nobody has felt
> > the need to add a native swap failure count yet.
> >
> > So I'd argue we should either remove the swap fail count altogether,
> > for all the reasons mentioned, or just leave it as is, as a documented
> > interface that is unfortunately already out the door.
> >
> > Thoughts?
> 
> Agreed. We should either report all failures, or only the failures
> specific to cgroup limits (which the max count already tracks).
> 
> Between leaving the fail count as-is and removing it completely, I
> don't feel strongly either way. If we add THP swap fallbacks to
> memory.stat, the fail counter becomes more understandable, since you
> can break it down into (max, THP swap fallback, others), with others
> *probably* meaning swap is full. Arguably, the interesting (or
> cgroup-relevant) components are already there regardless. Still, the
> counter might be useful from a monitoring perspective: if a memcg
> OOMs, for example, a high fail count shows us that it tried to swap
> multiple times but failed; we can then look at the max count and the
> THP fallbacks to understand where the failure is coming from.
> 
> I would say we can leave it as-is, but whatever you prefer.

+1

On one hand, the fewer counters the better. On the other, removing
things is an API break, which should have a really good reason. So
let's probably leave it as it is.

^ permalink raw reply	[flat|nested] 24+ messages in thread


end of thread, other threads:[~2023-02-07 22:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-02 15:56 [RFC PATCH] mm: memcontrol: don't account swap failures not due to cgroup limits Johannes Weiner
2023-02-02 18:27 ` Shakeel Butt
2023-02-02 18:30 ` Yosry Ahmed
2023-02-06 16:18   ` Michal Koutný
2023-02-07 16:54     ` Johannes Weiner
2023-02-07 19:09   ` Johannes Weiner
2023-02-07 19:21     ` Yosry Ahmed
2023-02-07 22:14       ` Roman Gushchin
2023-02-03 19:00 ` Roman Gushchin
2023-02-03 19:07   ` Yang Shi
2023-02-03 19:19     ` Roman Gushchin
2023-02-07 16:52       ` Johannes Weiner
