linux-mm.kvack.org archive mirror
* [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover
@ 2021-02-12 17:01 Muchun Song
  2021-02-12 17:01 ` [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page Muchun Song
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Muchun Song @ 2021-02-12 17:01 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: cgroups, linux-mm, linux-kernel, Muchun Song

memcg_oom_recover() almost never does anything beyond the test (because
oom_disabled is rarely used), so calling it is just a waste of cycles in
some hot paths (e.g. kmem uncharge). And it is very small, so it is
better to make it inline. Also, the memcg parameter can never be NULL,
so the NULL check can be removed as well.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c035846c7a4..7afca9677693 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1925,7 +1925,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
 	return autoremove_wake_function(wait, mode, sync, arg);
 }
 
-static void memcg_oom_recover(struct mem_cgroup *memcg)
+static inline void memcg_oom_recover(struct mem_cgroup *memcg)
 {
 	/*
 	 * For the following lockless ->under_oom test, the only required
@@ -1935,7 +1935,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 	 * achieved by invoking mem_cgroup_mark_under_oom() before
 	 * triggering notification.
 	 */
-	if (memcg && memcg->under_oom)
+	if (memcg->under_oom)
 		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page
  2021-02-12 17:01 [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Muchun Song
@ 2021-02-12 17:01 ` Muchun Song
  2021-02-15  9:37   ` Michal Hocko
  2021-02-12 17:01 ` [PATCH 3/4] mm: memcontrol: bail out early when id is zero Muchun Song
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2021-02-12 17:01 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: cgroups, linux-mm, linux-kernel, Muchun Song

When we uncharge a page, we wake up OOM victims if the memcg OOM
handling is outsourced to userspace. uncharge_batch() does that for
normal and kmem pages but not for slab pages, which is likely an
omission. So add the missing memcg_oom_recover() call to
__memcg_kmem_uncharge(). Also, memory.oom_control only exists on
cgroup v1, so guard the test of memcg->under_oom with
cgroup_subsys_on_dfl(memory_cgrp_subsys).

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7afca9677693..a3f26522765a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3126,8 +3126,10 @@ static int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
  */
 static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
 		page_counter_uncharge(&memcg->kmem, nr_pages);
+		memcg_oom_recover(memcg);
+	}
 
 	refill_stock(memcg, nr_pages);
 }
@@ -6806,11 +6808,15 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 
 	if (!mem_cgroup_is_root(ug->memcg)) {
 		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
-		if (do_memsw_account())
-			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
-			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
-		memcg_oom_recover(ug->memcg);
+		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+			if (!cgroup_memory_noswap)
+				page_counter_uncharge(&ug->memcg->memsw,
+						      ug->nr_pages);
+			if (ug->nr_kmem)
+				page_counter_uncharge(&ug->memcg->kmem,
+						      ug->nr_kmem);
+			memcg_oom_recover(ug->memcg);
+		}
 	}
 
 	local_irq_save(flags);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] mm: memcontrol: bail out early when id is zero
  2021-02-12 17:01 [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Muchun Song
  2021-02-12 17:01 ` [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page Muchun Song
@ 2021-02-12 17:01 ` Muchun Song
  2021-02-15  9:39   ` Michal Hocko
  2021-02-12 17:01 ` [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2 Muchun Song
  2021-02-15  9:24 ` [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Michal Hocko
  3 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2021-02-12 17:01 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: cgroups, linux-mm, linux-kernel, Muchun Song

A valid memcg ID can never be zero, but zero can still be passed to
mem_cgroup_from_id(), in which case the idr_find() lookup is pointless
and wastes CPU cycles.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3f26522765a..68ed4b297c13 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5173,6 +5173,9 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
 struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
+	/* The memcg ID cannot be zero. */
+	if (id == 0)
+		return NULL;
 	return idr_find(&mem_cgroup_idr, id);
 }
 
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-12 17:01 [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Muchun Song
  2021-02-12 17:01 ` [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page Muchun Song
  2021-02-12 17:01 ` [PATCH 3/4] mm: memcontrol: bail out early when id is zero Muchun Song
@ 2021-02-12 17:01 ` Muchun Song
  2021-02-12 18:56   ` Shakeel Butt
                     ` (2 more replies)
  2021-02-15  9:24 ` [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Michal Hocko
  3 siblings, 3 replies; 22+ messages in thread
From: Muchun Song @ 2021-02-12 17:01 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm
  Cc: cgroups, linux-mm, linux-kernel, Muchun Song

On cgroup v2, the swap counter charges the actual number of swap
entries. If a swap cache page is charged successfully, we then uncharge
the swap counter, which is wrong on cgroup v2 because the swap entry
has not been freed yet.

Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c737c8f05992..be6bc5044150 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 	memcg_check_events(memcg, page);
 	local_irq_enable();
 
-	if (PageSwapCache(page)) {
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 		/*
 		 * The swap entry might not get freed for a long time,
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-12 17:01 ` [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2 Muchun Song
@ 2021-02-12 18:56   ` Shakeel Butt
  2021-02-13  6:48     ` [External] " Muchun Song
  2021-02-15  9:47   ` Michal Hocko
  2021-02-16 16:59   ` Johannes Weiner
  2 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2021-02-12 18:56 UTC (permalink / raw)
  To: Muchun Song, Huang Ying, Tim Chen
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Cgroups, Linux MM, LKML

CCing more folks.

On Fri, Feb 12, 2021 at 9:14 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> The swap charges the actual number of swap entries on cgroup v2.
> If a swap cache page is charged successful, and then we uncharge
> the swap counter. It is wrong on cgroup v2. Because the swap
> entry is not freed.
>
> Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

What's the user visible impact of this change?

One impact I can see is that without this patch meminfo's (SwapTotal -
SwapFree) is larger than the sum of top level memory.swap.current.
This change will reduce that gap.

BTW, what about the per-cpu slots_ret cache? Should we call
mem_cgroup_uncharge_swap() before putting an entry into that cache
after this change?
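
(For context, a simplified sketch of the cache referred to above, based
on mm/swap_slots.c:free_swap_slot() from this era; the flush-when-full
and cache-deactivation handling is left out, so treat the details as
approximate:)

int free_swap_slot(swp_entry_t entry)
{
	struct swap_slots_cache *cache = raw_cpu_ptr(&swp_slots);

	if (likely(use_swap_slot_cache && cache->slots_ret)) {
		spin_lock_irq(&cache->free_lock);
		/*
		 * The question above: should mem_cgroup_uncharge_swap(entry, 1)
		 * be called around here, before the entry is parked in the
		 * per-cpu cache, rather than only when swapcache_free_entries()
		 * eventually releases it?
		 */
		cache->slots_ret[cache->n_ret++] = entry;
		spin_unlock_irq(&cache->free_lock);
	} else {
		swapcache_free_entries(&entry, 1);
	}
	return 0;
}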

> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c737c8f05992..be6bc5044150 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>         memcg_check_events(memcg, page);
>         local_irq_enable();
>
> -       if (PageSwapCache(page)) {
> +       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
>                 swp_entry_t entry = { .val = page_private(page) };
>                 /*
>                  * The swap entry might not get freed for a long time,
> --
> 2.11.0
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-12 18:56   ` Shakeel Butt
@ 2021-02-13  6:48     ` Muchun Song
  2021-02-16 17:16       ` Shakeel Butt
  0 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2021-02-13  6:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Huang Ying, Tim Chen, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Cgroups, Linux MM, LKML

On Sat, Feb 13, 2021 at 2:57 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> CCing more folks.
>
> On Fri, Feb 12, 2021 at 9:14 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > The swap charges the actual number of swap entries on cgroup v2.
> > If a swap cache page is charged successful, and then we uncharge
> > the swap counter. It is wrong on cgroup v2. Because the swap
> > entry is not freed.
> >
> > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> What's the user visible impact of this change?

IIUC, I think that we cannot properly limit swap usage to
memory.swap.max on cgroup v2.

  cd /sys/fs/cgroup/
  mkdir test
  cd test
  echo 8192 > memory.max
  echo 4096 > memory.swap.max

OK. Now we limit swap to 1 page and memory to 2 pages.
Firstly, we allocate 1 page from this memory cgroup and
swap this page to swap disk. We can see:

  memory.current: 0
  memory.swap.current: 1

Then we touch this page: it is swapped in, the swap cache page is
charged to the memory counter, and the swap counter is uncharged.

  memory.current: 1
  memory.swap.current: 0 (but actually we use a swap entry)

Then we allocate another 1 page from this memory cgroup.

  memory.current: 2
  memory.swap.current: 0 (but actually we use a swap entry)

If we then swap those 2 pages out to swap disk, we can charge and swap
both of them successfully. Right? Maybe I am wrong.

>
> One impact I can see is that without this patch meminfo's (SwapTotal -
> SwapFree) is larger than the sum of top level memory.swap.current.
> This change will reduce that gap.
>
> BTW what about per-cpu slots_ret cache? Should we call
> mem_cgroup_uncharge_swap() before putting in the cache after this
> change?
>
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c737c8f05992..be6bc5044150 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >         memcg_check_events(memcg, page);
> >         local_irq_enable();
> >
> > -       if (PageSwapCache(page)) {
> > +       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
> >                 swp_entry_t entry = { .val = page_private(page) };
> >                 /*
> >                  * The swap entry might not get freed for a long time,
> > --
> > 2.11.0
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover
  2021-02-12 17:01 [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Muchun Song
                   ` (2 preceding siblings ...)
  2021-02-12 17:01 ` [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2 Muchun Song
@ 2021-02-15  9:24 ` Michal Hocko
  3 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-02-15  9:24 UTC (permalink / raw)
  To: Muchun Song; +Cc: hannes, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

On Sat 13-02-21 01:01:56, Muchun Song wrote:
> The memcg_oom_recover() almost never do anything but the test (because
> oom_disabled is a rarely used) is just waste of cycles in some hot
> paths (e.g. kmem uncharge). And it is very small, so it is better to
> make it inline. Also, the parameter of memcg cannot be NULL, so removing
> the check can reduce useless check.

You probably wanted to make this patch follow the second one in the
series, as there is no oom recover from the kmem uncharge path now.
Also, I believe I've asked you to split the memcg check into its own
patch.

Regarding the inlining, I would add it along with a static key check in
memcg_oom_recover.
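
(Purely as an illustration of that suggestion, assuming a new,
hypothetical memcg_oom_disabled_key that gets flipped when a cgroup
uses memory.oom_control to disable the kernel OOM killer, it could
look roughly like this:)

static DEFINE_STATIC_KEY_FALSE(memcg_oom_disabled_key);

static inline void memcg_oom_recover(struct mem_cgroup *memcg)
{
	/* Skip even the ->under_oom load unless oom_control is in use. */
	if (!static_branch_unlikely(&memcg_oom_disabled_key))
		return;

	if (memcg->under_oom)
		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}

/*
 * The key would be raised/dropped with static_branch_inc()/_dec()
 * whenever oom_kill_disable is toggled for a cgroup.
 */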

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8c035846c7a4..7afca9677693 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1925,7 +1925,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
>  	return autoremove_wake_function(wait, mode, sync, arg);
>  }
>  
> -static void memcg_oom_recover(struct mem_cgroup *memcg)
> +static inline void memcg_oom_recover(struct mem_cgroup *memcg)
>  {
>  	/*
>  	 * For the following lockless ->under_oom test, the only required
> @@ -1935,7 +1935,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  	 * achieved by invoking mem_cgroup_mark_under_oom() before
>  	 * triggering notification.
>  	 */
> -	if (memcg && memcg->under_oom)
> +	if (memcg->under_oom)
>  		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
>  }
>  
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page
  2021-02-12 17:01 ` [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page Muchun Song
@ 2021-02-15  9:37   ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-02-15  9:37 UTC (permalink / raw)
  To: Muchun Song; +Cc: hannes, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

On Sat 13-02-21 01:01:57, Muchun Song wrote:
> When we uncharge a page, we wake up oom victims when the memcg oom
> handling is outsourced to the userspace. The uncharge_batch do that
> for normal and kmem pages but not slab pages. It is likely an
> omission. So add the missing memcg_oom_recover() to
> __memcg_kmem_uncharge(). And the function of memory.oom_control
> is only suitable for cgroup v1. So guard this test (memcg->under_oom)
> by the cgroup_subsys_on_dfl(memory_cgrp_subsys).

User visible effects please? I believe there are unlikely to be any. I
do not have any data at hand, but I would expect that freeing slab
pages wouldn't really contribute much to helping a memcg out of oom
without an external intervention in the oom_disabled case. If that is
the case then make it explicit in the changelog. If you have workloads
which do see suboptimal behavior then please mention that as well.
This is important for future readers to understand the code and the
motivation behind it.

Also, now I guess I can see why you decided not to do the cgroup v2
check directly in memcg_oom_recover. You do not want to repeat the
check in paths which already do the check for you. That is fine. Apart
from the uncharging path, none of the others is really a hot path, so
this is likely a reasonable decision.

I have a minor comment below.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/memcontrol.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7afca9677693..a3f26522765a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3126,8 +3126,10 @@ static int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
>   */
>  static void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
>  {
> -	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
>  		page_counter_uncharge(&memcg->kmem, nr_pages);
> +		memcg_oom_recover(memcg);
> +	}
>  
>  	refill_stock(memcg, nr_pages);
>  }
> @@ -6806,11 +6808,15 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>  
>  	if (!mem_cgroup_is_root(ug->memcg)) {
>  		page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
> -		if (do_memsw_account())
> -			page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
> -		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
> -			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
> -		memcg_oom_recover(ug->memcg);
> +		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> +			if (!cgroup_memory_noswap)
> +				page_counter_uncharge(&ug->memcg->memsw,
> +						      ug->nr_pages);

This is functionally equivalent, but I am not sure it makes future
maintenance any easier. The do_memsw_account() check is used in many
other places, and it is a general helper which you have open coded into
its constituent checks here. This makes any future changes more tricky.
Is this micro-optimization worth it?
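
(For reference, the helper being open coded here reads roughly as
follows in mm/memcontrol.c of this era; quoted from memory, so treat
the exact wording as approximate:)

/* Whether legacy memory+swap accounting is active. */
static bool do_memsw_account(void)
{
	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
	       !cgroup_memory_noswap;
}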

> +			if (ug->nr_kmem)
> +				page_counter_uncharge(&ug->memcg->kmem,
> +						      ug->nr_kmem);
> +			memcg_oom_recover(ug->memcg);
> +		}
>  	}
>  
>  	local_irq_save(flags);
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] mm: memcontrol: bail out early when id is zero
  2021-02-12 17:01 ` [PATCH 3/4] mm: memcontrol: bail out early when id is zero Muchun Song
@ 2021-02-15  9:39   ` Michal Hocko
  2021-02-15 10:09     ` [External] " Muchun Song
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2021-02-15  9:39 UTC (permalink / raw)
  To: Muchun Song; +Cc: hannes, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

On Sat 13-02-21 01:01:58, Muchun Song wrote:
> The memcg ID cannot be zero, but we can pass zero to mem_cgroup_from_id,
> so idr_find() is pointless and wastes CPU cycles.

Is it possible for this to happen at all? If not, why should we add a
test to _all_ invocations?

> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/memcontrol.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a3f26522765a..68ed4b297c13 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5173,6 +5173,9 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
>  struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
>  {
>  	WARN_ON_ONCE(!rcu_read_lock_held());
> +	/* The memcg ID cannot be zero. */
> +	if (id == 0)
> +		return NULL;
>  	return idr_find(&mem_cgroup_idr, id);
>  }
>  
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-12 17:01 ` [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2 Muchun Song
  2021-02-12 18:56   ` Shakeel Butt
@ 2021-02-15  9:47   ` Michal Hocko
  2021-02-15 10:15     ` [External] " Muchun Song
  2021-02-16 16:59   ` Johannes Weiner
  2 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2021-02-15  9:47 UTC (permalink / raw)
  To: Muchun Song; +Cc: hannes, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

On Sat 13-02-21 01:01:59, Muchun Song wrote:
> The swap charges the actual number of swap entries on cgroup v2.
> If a swap cache page is charged successful, and then we uncharge
> the swap counter. It is wrong on cgroup v2. Because the swap
> entry is not freed.

Is there any actual problem though? Can you describe the specific
scenario, please? Swap cache charge lifetime is a bit tricky, and I
have to confess I have to relearn it every time I need to understand
it. The patch would be much easier to review if the changelog were
more specific.

> Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c737c8f05992..be6bc5044150 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>  	memcg_check_events(memcg, page);
>  	local_irq_enable();
>  
> -	if (PageSwapCache(page)) {
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
>  		swp_entry_t entry = { .val = page_private(page) };
>  		/*
>  		 * The swap entry might not get freed for a long time,
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 3/4] mm: memcontrol: bail out early when id is zero
  2021-02-15  9:39   ` Michal Hocko
@ 2021-02-15 10:09     ` Muchun Song
  2021-02-15 10:27       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2021-02-15 10:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Mon, Feb 15, 2021 at 5:39 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sat 13-02-21 01:01:58, Muchun Song wrote:
> > The memcg ID cannot be zero, but we can pass zero to mem_cgroup_from_id,
> > so idr_find() is pointless and wastes CPU cycles.
>
> Is this possible at all to happen? If not why should we add a test for
> _all_ invocations?

Yeah, this can indeed happen. If we allocate a new swap cache page
and charge it via mem_cgroup_charge(), charging the page also
uncharges the swap counter via mem_cgroup_uncharge_swap(). When the
swap entry is later actually freed, we call mem_cgroup_uncharge_swap()
again, and in this routine zero can be passed to mem_cgroup_from_id().
Right?
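
(A simplified sketch of the path described above, based on
mm/memcontrol.c of this era; the body is abridged and the details
should be treated as approximate:)

void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
{
	struct mem_cgroup *memcg;
	unsigned short id;

	/*
	 * Clear the swap_cgroup record and fetch the memcg ID it held.
	 * If the record was already cleared by the uncharge done at
	 * swapin time (mem_cgroup_charge() -> mem_cgroup_uncharge_swap()),
	 * this returns 0 ...
	 */
	id = swap_cgroup_record(entry, 0, nr_pages);
	rcu_read_lock();
	/* ... and mem_cgroup_from_id() is then called with id == 0. */
	memcg = mem_cgroup_from_id(id);
	if (memcg) {
		/* uncharge memcg->swap (v2) or memcg->memsw (v1), etc. */
	}
	rcu_read_unlock();
}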


>
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/memcontrol.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a3f26522765a..68ed4b297c13 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5173,6 +5173,9 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
> >  struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> >  {
> >       WARN_ON_ONCE(!rcu_read_lock_held());
> > +     /* The memcg ID cannot be zero. */
> > +     if (id == 0)
> > +             return NULL;
> >       return idr_find(&mem_cgroup_idr, id);
> >  }
> >
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-15  9:47   ` Michal Hocko
@ 2021-02-15 10:15     ` Muchun Song
  2021-02-15 10:24       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Muchun Song @ 2021-02-15 10:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Mon, Feb 15, 2021 at 5:47 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sat 13-02-21 01:01:59, Muchun Song wrote:
> > The swap charges the actual number of swap entries on cgroup v2.
> > If a swap cache page is charged successful, and then we uncharge
> > the swap counter. It is wrong on cgroup v2. Because the swap
> > entry is not freed.
>
> Is there any actual problem though? Can you describe the specific
> scenario please? Swap cache charge life time is a bit tricky and I have
> to confess I have to relearn it every time I need to understand it. The
> patch would be much more easier to review if the changelog was much more
> specific.

I copied the reply to Shakeel here. :-)

IIUC, I think that we cannot properly limit swap usage to
memory.swap.max on cgroup v2.

  cd /sys/fs/cgroup/
  mkdir test
  cd test
  echo 8192 > memory.max
  echo 4096 > memory.swap.max

OK. Now we limit swap to 1 page and memory to 2 pages.
Firstly, we allocate 1 page from this memory cgroup and
swap this page to swap disk. We can see:

  memory.current: 0
  memory.swap.current: 1

Then we touch this page: it is swapped in, the swap cache page is
charged to the memory counter, and the swap counter is uncharged.

  memory.current: 1
  memory.swap.current: 0 (but actually we use a swap entry)

Then we allocate another 1 page from this memory cgroup.

  memory.current: 2
  memory.swap.current: 0 (but actually we use a swap entry)

If we then swap those 2 pages out to swap disk, we can charge and swap
both of them successfully. Right? Maybe I am wrong.

>
> > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c737c8f05992..be6bc5044150 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >       memcg_check_events(memcg, page);
> >       local_irq_enable();
> >
> > -     if (PageSwapCache(page)) {
> > +     if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
> >               swp_entry_t entry = { .val = page_private(page) };
> >               /*
> >                * The swap entry might not get freed for a long time,
> > --
> > 2.11.0
>
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-15 10:15     ` [External] " Muchun Song
@ 2021-02-15 10:24       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-02-15 10:24 UTC (permalink / raw)
  To: Muchun Song
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Mon 15-02-21 18:15:36, Muchun Song wrote:
> On Mon, Feb 15, 2021 at 5:47 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sat 13-02-21 01:01:59, Muchun Song wrote:
> > > The swap charges the actual number of swap entries on cgroup v2.
> > > If a swap cache page is charged successful, and then we uncharge
> > > the swap counter. It is wrong on cgroup v2. Because the swap
> > > entry is not freed.
> >
> > Is there any actual problem though? Can you describe the specific
> > scenario please? Swap cache charge life time is a bit tricky and I have
> > to confess I have to relearn it every time I need to understand it. The
> > patch would be much more easier to review if the changelog was much more
> > specific.
> 
> I copied the reply to Shakeel here. :-)

Not much help. I had seen that reply before answering your patch.
I really have to say I find it rather annoying to constantly get unclear
and cryptic responses from you. This is a subtle area! Make your claims
clear and specific so they can be verified.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 3/4] mm: memcontrol: bail out early when id is zero
  2021-02-15 10:09     ` [External] " Muchun Song
@ 2021-02-15 10:27       ` Michal Hocko
  2021-02-15 11:34         ` Muchun Song
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2021-02-15 10:27 UTC (permalink / raw)
  To: Muchun Song
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Mon 15-02-21 18:09:44, Muchun Song wrote:
> On Mon, Feb 15, 2021 at 5:39 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sat 13-02-21 01:01:58, Muchun Song wrote:
> > > The memcg ID cannot be zero, but we can pass zero to mem_cgroup_from_id,
> > > so idr_find() is pointless and wastes CPU cycles.
> >
> > Is this possible at all to happen? If not why should we add a test for
> > _all_ invocations?
> 
> Yeah, this indeed can happen. If we allocate a new swap cache page
> and charge it via mem_cgroup_charge, then the page will uncharge
> the swap counter via mem_cgroup_uncharge_swap. When the swap
> entry is indeed freed, we will call mem_cgroup_uncharge_swap again,
> In this routine, we can pass zero to mem_cgroup_from_id. Right?

If the above claim is correct, which I would need to double check, then
it should have been part of the changelog! Please think of your poor
reviewers and the time they have to invest in the review.

I would also like to see your waste-of-CPU-cycles argument backed by
something. Are we talking about cycles due to an additional function
call? Is this really something we should even care about?

> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > >  mm/memcontrol.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index a3f26522765a..68ed4b297c13 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -5173,6 +5173,9 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
> > >  struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> > >  {
> > >       WARN_ON_ONCE(!rcu_read_lock_held());
> > > +     /* The memcg ID cannot be zero. */
> > > +     if (id == 0)
> > > +             return NULL;
> > >       return idr_find(&mem_cgroup_idr, id);
> > >  }
> > >
> > > --
> > > 2.11.0
> >
> > --
> > Michal Hocko
> > SUSE Labs

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 3/4] mm: memcontrol: bail out early when id is zero
  2021-02-15 10:27       ` Michal Hocko
@ 2021-02-15 11:34         ` Muchun Song
  0 siblings, 0 replies; 22+ messages in thread
From: Muchun Song @ 2021-02-15 11:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Mon, Feb 15, 2021 at 6:27 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 15-02-21 18:09:44, Muchun Song wrote:
> > On Mon, Feb 15, 2021 at 5:39 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sat 13-02-21 01:01:58, Muchun Song wrote:
> > > > The memcg ID cannot be zero, but we can pass zero to mem_cgroup_from_id,
> > > > so idr_find() is pointless and wastes CPU cycles.
> > >
> > > Is this possible at all to happen? If not why should we add a test for
> > > _all_ invocations?
> >
> > Yeah, this indeed can happen. If we allocate a new swap cache page
> > and charge it via mem_cgroup_charge, then the page will uncharge
> > the swap counter via mem_cgroup_uncharge_swap. When the swap
> > entry is indeed freed, we will call mem_cgroup_uncharge_swap again,
> > In this routine, we can pass zero to mem_cgroup_from_id. Right?
>
> If the above claim is correct, which I would need to double check then
> it should have been part of the changelog! Please think of your poor
> reviewers and the time they have to invest into the review.

An easy way to check this may be adding a printk to
mem_cgroup_from_id() that fires when the parameter is zero.
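
(A minimal sketch of such debug instrumentation, not meant as part of
the series:)

struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
{
	WARN_ON_ONCE(!rcu_read_lock_held());
	if (unlikely(id == 0))
		pr_info("%s: called with id == 0\n", __func__);
	return idr_find(&mem_cgroup_idr, id);
}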

>
> I would also like to see your waste of CPU cycles argument to be backed
> by something. Are we talking about cycles due to an additional function

Yeah, when the parameter is already zero, idr_find() must return NULL,
so I think the additional function call is unnecessary. I added a
printk to mem_cgroup_from_id() and found that the parameter can be
zero several times.

> call? Is this really something we should even care about?

Maybe not. Just my thoughts.

Thanks.

>
> > > >
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > ---
> > > >  mm/memcontrol.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index a3f26522765a..68ed4b297c13 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5173,6 +5173,9 @@ static inline void mem_cgroup_id_put(struct mem_cgroup *memcg)
> > > >  struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> > > >  {
> > > >       WARN_ON_ONCE(!rcu_read_lock_held());
> > > > +     /* The memcg ID cannot be zero. */
> > > > +     if (id == 0)
> > > > +             return NULL;
> > > >       return idr_find(&mem_cgroup_idr, id);
> > > >  }
> > > >
> > > > --
> > > > 2.11.0
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-12 17:01 ` [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2 Muchun Song
  2021-02-12 18:56   ` Shakeel Butt
  2021-02-15  9:47   ` Michal Hocko
@ 2021-02-16 16:59   ` Johannes Weiner
  2021-02-16 17:17     ` Shakeel Butt
                       ` (3 more replies)
  2 siblings, 4 replies; 22+ messages in thread
From: Johannes Weiner @ 2021-02-16 16:59 UTC (permalink / raw)
  To: Muchun Song; +Cc: mhocko, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

Hello Muchun,

On Sat, Feb 13, 2021 at 01:01:59AM +0800, Muchun Song wrote:
> The swap charges the actual number of swap entries on cgroup v2.
> If a swap cache page is charged successful, and then we uncharge
> the swap counter. It is wrong on cgroup v2. Because the swap
> entry is not freed.

The patch makes sense to me. But this code is a bit tricky, so we
should add more documentation about how it works and what the problem
is.

How about this for the changelog?

---
mm: memcontrol: fix swap undercounting for shared pages in cgroup2

When shared pages are swapped in partially, we can have some page
tables referencing the in-memory page and some referencing the swap
slot. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
differently due to the nature of how they account memory and swap:

Cgroup1 has a unified memory+swap counter that tracks a data page
regardless whether it's in-core or swapped out. On swapin, we transfer
the charge from the swap entry to the newly allocated swapcache page,
even though the swap entry might stick around for a while. That's why
we have a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().

Cgroup2 tracks memory and swap as separate, independent resources and
thus has split memory and swap counters. On swapin, we charge the
newly allocated swapcache page as memory, while the swap slot in turn
must remain charged to the swap counter as long as its allocated too.

The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol:
make swap tracking an integral part of memory control"), because it
accidentally removed the do_memsw_account() check in the branch inside
mem_cgroup_uncharge() that was supposed to tell the difference between
the charge transfer in cgroup1 and the separate counters in cgroup2.

As a result, cgroup2 currently undercounts consumed swap when shared
pages are partially swapped back in. This in turn allows a cgroup to
consume more swap than its configured limit intends.

Add the do_memsw_account() check back to fix this problem.
---

> Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> ---
>  mm/memcontrol.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c737c8f05992..be6bc5044150 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>  	memcg_check_events(memcg, page);
>  	local_irq_enable();
>  
> -	if (PageSwapCache(page)) {
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {

It's more descriptive to use do_memsw_account() here, IMO.

We should also add a comment. How about this above the branch?

	/*
	 * Cgroup1's unified memory+swap counter has been charged with the
	 * new swapcache page, finish the transfer by uncharging the swap
	 * slot. The swap slot would also get uncharged when it dies, but
	 * for shared pages it can stick around indefinitely and we'd count
	 * the page twice the entire time.
	 *
	 * Cgroup2 has separate resource counters for memory and swap,
	 * so this is a non-issue here. Memory and swap charge lifetimes
	 * correspond 1:1 to page and swap slot lifetimes: we charge the
	 * page to memory here, and uncharge swap when the slot is freed.
	 */
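
(Putting both suggestions together, a sketch of how the final hunk
might look; the surrounding code and nr_pages are from the function
being patched above, and the comment is abbreviated here:)

	/*
	 * Cgroup1's unified memory+swap counter has been charged with
	 * the new swapcache page, finish the transfer by uncharging the
	 * swap slot. Cgroup2 has separate counters for memory and swap,
	 * so nothing to do here: swap is uncharged when the slot itself
	 * is freed.
	 */
	if (do_memsw_account() && PageSwapCache(page)) {
		swp_entry_t entry = { .val = page_private(page) };
		/*
		 * The swap entry might not get freed for a long time,
		 * let's not wait for it.  The page already received a
		 * memory+swap charge, drop the swap entry duplicate.
		 */
		mem_cgroup_uncharge_swap(entry, nr_pages);
	}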


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-13  6:48     ` [External] " Muchun Song
@ 2021-02-16 17:16       ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-02-16 17:16 UTC (permalink / raw)
  To: Muchun Song
  Cc: Huang Ying, Tim Chen, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Cgroups, Linux MM, LKML

On Fri, Feb 12, 2021 at 10:48 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Sat, Feb 13, 2021 at 2:57 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > CCing more folks.
> >
> > On Fri, Feb 12, 2021 at 9:14 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > >
> > > The swap charges the actual number of swap entries on cgroup v2.
> > > If a swap cache page is charged successful, and then we uncharge
> > > the swap counter. It is wrong on cgroup v2. Because the swap
> > > entry is not freed.
> > >
> > > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >
> > What's the user visible impact of this change?
>
> IIUC, I think that we cannot limit the swap to memory.swap.max
> on cgroup v2.
>
>   cd /sys/fs/cgroup/
>   mkdir test
>   cd test
>   echo 8192 > memory.max
>   echo 4096 > memory.swap.max
>
> OK. Now we limit swap to 1 page and memory to 2 pages.
> Firstly, we allocate 1 page from this memory cgroup and
> swap this page to swap disk. We can see:
>
>   memory.current: 0
>   memory.swap.current: 1
>
> Then we touch this page, we will swap in and charge
> the swap cache page to the memory counter and uncharge
> the swap counter.
>
>   memory.current: 1
>   memory.swap.current: 0 (but actually we use a swap entry)
>
> Then we allocate another 1 page from this memory cgroup.
>
>   memory.current: 2
>   memory.swap.current: 0 (but actually we use a swap entry)
>
> If we swap those 2 pages to swap disk. We can charge and swap
> those 2 pages successfully. Right? Maybe I am wrong.
>

I was trying to repro this but couldn't, and later remembered that swap
on zram skips the swapcache and thus is not impacted by this issue.
This is reproducible with swap on disk, and I see Johannes has already
described it in good detail.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-16 16:59   ` Johannes Weiner
@ 2021-02-16 17:17     ` Shakeel Butt
  2021-02-16 17:19     ` Michal Hocko
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-02-16 17:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Muchun Song, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Cgroups, Linux MM, LKML

On Tue, Feb 16, 2021 at 9:05 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Muchun,
>
> On Sat, Feb 13, 2021 at 01:01:59AM +0800, Muchun Song wrote:
> > The swap charges the actual number of swap entries on cgroup v2.
> > If a swap cache page is charged successful, and then we uncharge
> > the swap counter. It is wrong on cgroup v2. Because the swap
> > entry is not freed.
>
> The patch makes sense to me. But this code is a bit tricky, we should
> add more documentation to how it works and what the problem is.
>
> How about this for the changelog?
>
> ---
> mm: memcontrol: fix swap undercounting for shared pages in cgroup2
>
> When shared pages are swapped in partially, we can have some page
> tables referencing the in-memory page and some referencing the swap
> slot. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
> differently due to the nature of how they account memory and swap:
>
> Cgroup1 has a unified memory+swap counter that tracks a data page
> regardless whether it's in-core or swapped out. On swapin, we transfer
> the charge from the swap entry to the newly allocated swapcache page,
> even though the swap entry might stick around for a while. That's why
> we have a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().
>
> Cgroup2 tracks memory and swap as separate, independent resources and
> thus has split memory and swap counters. On swapin, we charge the
> newly allocated swapcache page as memory, while the swap slot in turn
> must remain charged to the swap counter as long as its allocated too.
>
> The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol:
> make swap tracking an integral part of memory control"), because it
> accidentally removed the do_memsw_account() check in the branch inside
> mem_cgroup_uncharge() that was supposed to tell the difference between
> the charge transfer in cgroup1 and the separate counters in cgroup2.
>
> As a result, cgroup2 currently undercounts consumed swap when shared
> pages are partially swapped back in. This in turn allows a cgroup to
> consume more swap than its configured limit intends.
>
> Add the do_memsw_account() check back to fix this problem.
> ---
>
> > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

With Johannes's suggestions:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

>
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c737c8f05992..be6bc5044150 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >       memcg_check_events(memcg, page);
> >       local_irq_enable();
> >
> > -     if (PageSwapCache(page)) {
> > +     if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
>
> It's more descriptive to use do_memsw_account() here, IMO.
>
> We should also add a comment. How about this above the branch?
>
>         /*
>          * Cgroup1's unified memory+swap counter has been charged with the
>          * new swapcache page, finish the transfer by uncharging the swap
>          * slot. The swap slot would also get uncharged when it dies, but
>          * for shared pages it can stick around indefinitely and we'd count
>          * the page twice the entire time.
>          *
>          * Cgroup2 has separate resource counters for memory and swap,
>          * so this is a non-issue here. Memory and swap charge lifetimes
>          * correspond 1:1 to page and swap slot lifetimes: we charge the
>          * page to memory here, and uncharge swap when the slot is freed.
>          */


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-16 16:59   ` Johannes Weiner
  2021-02-16 17:17     ` Shakeel Butt
@ 2021-02-16 17:19     ` Michal Hocko
  2021-02-16 17:28     ` Johannes Weiner
  2021-02-17  5:11     ` Muchun Song
  3 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2021-02-16 17:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Muchun Song, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

On Tue 16-02-21 11:59:00, Johannes Weiner wrote:
> Hello Muchun,
> 
> On Sat, Feb 13, 2021 at 01:01:59AM +0800, Muchun Song wrote:
> > The swap charges the actual number of swap entries on cgroup v2.
> > If a swap cache page is charged successful, and then we uncharge
> > the swap counter. It is wrong on cgroup v2. Because the swap
> > entry is not freed.
> 
> The patch makes sense to me. But this code is a bit tricky, we should
> add more documentation to how it works and what the problem is.
> 
> How about this for the changelog?
> 
> ---
> mm: memcontrol: fix swap undercounting for shared pages in cgroup2
> 
> When shared pages are swapped in partially, we can have some page
> tables referencing the in-memory page and some referencing the swap
> slot. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
> differently due to the nature of how they account memory and swap:
> 
> Cgroup1 has a unified memory+swap counter that tracks a data page
> regardless whether it's in-core or swapped out. On swapin, we transfer
> the charge from the swap entry to the newly allocated swapcache page,
> even though the swap entry might stick around for a while. That's why
> we have a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().
> 
> Cgroup2 tracks memory and swap as separate, independent resources and
> thus has split memory and swap counters. On swapin, we charge the
> newly allocated swapcache page as memory, while the swap slot in turn
> must remain charged to the swap counter as long as its allocated too.
> 
> The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol:
> make swap tracking an integral part of memory control"), because it
> accidentally removed the do_memsw_account() check in the branch inside
> mem_cgroup_uncharge() that was supposed to tell the difference between
> the charge transfer in cgroup1 and the separate counters in cgroup2.
> 
> As a result, cgroup2 currently undercounts consumed swap when shared
> pages are partially swapped back in. This in turn allows a cgroup to
> consume more swap than its configured limit intends.
> 
> Add the do_memsw_account() check back to fix this problem.

Yes, this clarifies both the issue and the subtlety of the accounting.
Thanks a lot Johannes! This is a great example of how changelogs should
really look.

> ---
> 
> > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c737c8f05992..be6bc5044150 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >  	memcg_check_events(memcg, page);
> >  	local_irq_enable();
> >  
> > -	if (PageSwapCache(page)) {
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
> 
> It's more descriptive to use do_memsw_account() here, IMO.
> 
> We should also add a comment. How about this above the branch?
> 
> 	/*
> 	 * Cgroup1's unified memory+swap counter has been charged with the
> 	 * new swapcache page, finish the transfer by uncharging the swap
> 	 * slot. The swap slot would also get uncharged when it dies, but
> 	 * for shared pages it can stick around indefinitely and we'd count
> 	 * the page twice the entire time.
> 	 *
> 	 * Cgroup2 has separate resource counters for memory and swap,
> 	 * so this is a non-issue here. Memory and swap charge lifetimes
> 	 * correspond 1:1 to page and swap slot lifetimes: we charge the
> 	 * page to memory here, and uncharge swap when the slot is freed.
> 	 */

Yes very helpful.

With the changelog update and the comment
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-16 16:59   ` Johannes Weiner
  2021-02-16 17:17     ` Shakeel Butt
  2021-02-16 17:19     ` Michal Hocko
@ 2021-02-16 17:28     ` Johannes Weiner
  2021-02-17  5:15       ` [External] " Muchun Song
  2021-02-17  5:11     ` Muchun Song
  3 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2021-02-16 17:28 UTC (permalink / raw)
  To: Muchun Song; +Cc: mhocko, vdavydov.dev, akpm, cgroups, linux-mm, linux-kernel

On Tue, Feb 16, 2021 at 11:59:01AM -0500, Johannes Weiner wrote:
> Hello Muchun,
> 
> On Sat, Feb 13, 2021 at 01:01:59AM +0800, Muchun Song wrote:
> > The swap charges the actual number of swap entries on cgroup v2.
> > If a swap cache page is charged successful, and then we uncharge
> > the swap counter. It is wrong on cgroup v2. Because the swap
> > entry is not freed.
> 
> The patch makes sense to me. But this code is a bit tricky, we should
> add more documentation to how it works and what the problem is.
> 
> How about this for the changelog?
> 
> ---
> mm: memcontrol: fix swap undercounting for shared pages in cgroup2

Come to think of it, this isn't just for shared pages. We also hold
on to the swap slot as long as the page is read-only, not mlocked, and
swap isn't full. So scratch "for shared pages" here.

> When shared pages are swapped in partially, we can have some page
> tables referencing the in-memory page and some referencing the swap
> slot. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
> differently due to the nature of how they account memory and swap:

Correction:

When pages are swapped in, the VM may retain the swap copy to avoid
repeated writes in the future. It's also retained if shared pages are
faulted back in some processes, but not in others. During that time we
have an in-memory copy of the page, as well as an on-swap
copy. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
differently due to the nature of how they account memory and swap:

> Cgroup1 has a unified memory+swap counter that tracks a data page
> regardless whether it's in-core or swapped out. On swapin, we transfer
> the charge from the swap entry to the newly allocated swapcache page,
> even though the swap entry might stick around for a while. That's why
> we have a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().
> 
> Cgroup2 tracks memory and swap as separate, independent resources and
> thus has split memory and swap counters. On swapin, we charge the
> newly allocated swapcache page as memory, while the swap slot in turn
> must remain charged to the swap counter as long as its allocated too.
> 
> The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol:
> make swap tracking an integral part of memory control"), because it
> accidentally removed the do_memsw_account() check in the branch inside
> mem_cgroup_uncharge() that was supposed to tell the difference between
> the charge transfer in cgroup1 and the separate counters in cgroup2.
> 
> As a result, cgroup2 currently undercounts consumed swap when shared
> pages are partially swapped back in. This in turn allows a cgroup to
> consume more swap than its configured limit intends.

Correction:

As a result, cgroup2 currently undercounts retained swap to varying
degrees: swap slots are cached up to 50% of the configured limit or
total available swap space; partially faulted back shared pages are
only limited by physical capacity. This in turn allows cgroups to
significantly overconsume their allotted swap space.

> Add the do_memsw_account() check back to fix this problem.
> ---
> 
> > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

I also think we should tag stable on this one. The potential
accounting error is quite large and, even without concrete examples,
is likely to cause problems for swap management in the real world.

Cc: stable@vger.kernel.org # 5.8+

> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c737c8f05992..be6bc5044150 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >  	memcg_check_events(memcg, page);
> >  	local_irq_enable();
> >  
> > -	if (PageSwapCache(page)) {
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
> 
> It's more descriptive to use do_memsw_account() here, IMO.
> 
> We should also add a comment. How about this above the branch?
> 
> 	/*
> 	 * Cgroup1's unified memory+swap counter has been charged with the
> 	 * new swapcache page, finish the transfer by uncharging the swap
> 	 * slot. The swap slot would also get uncharged when it dies, but
> 	 * for shared pages it can stick around indefinitely and we'd count
> 	 * the page twice the entire time.

-for shared pages

> 	 * Cgroup2 has separate resource counters for memory and swap,
> 	 * so this is a non-issue here. Memory and swap charge lifetimes
> 	 * correspond 1:1 to page and swap slot lifetimes: we charge the
> 	 * page to memory here, and uncharge swap when the slot is freed.
> 	 */


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-16 16:59   ` Johannes Weiner
                       ` (2 preceding siblings ...)
  2021-02-16 17:28     ` Johannes Weiner
@ 2021-02-17  5:11     ` Muchun Song
  3 siblings, 0 replies; 22+ messages in thread
From: Muchun Song @ 2021-02-17  5:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Wed, Feb 17, 2021 at 12:59 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Muchun,
>
> On Sat, Feb 13, 2021 at 01:01:59AM +0800, Muchun Song wrote:
> > The swap charges the actual number of swap entries on cgroup v2.
> > If a swap cache page is charged successful, and then we uncharge
> > the swap counter. It is wrong on cgroup v2. Because the swap
> > entry is not freed.
>
> The patch makes sense to me. But this code is a bit tricky, we should
> add more documentation to how it works and what the problem is.
>
> How about this for the changelog?
>
> ---
> mm: memcontrol: fix swap undercounting for shared pages in cgroup2
>
> When shared pages are swapped in partially, we can have some page
> tables referencing the in-memory page and some referencing the swap
> slot. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
> differently due to the nature of how they account memory and swap:
>
> Cgroup1 has a unified memory+swap counter that tracks a data page
> regardless whether it's in-core or swapped out. On swapin, we transfer
> the charge from the swap entry to the newly allocated swapcache page,
> even though the swap entry might stick around for a while. That's why
> we have a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().
>
> Cgroup2 tracks memory and swap as separate, independent resources and
> thus has split memory and swap counters. On swapin, we charge the
> newly allocated swapcache page as memory, while the swap slot in turn
> must remain charged to the swap counter as long as its allocated too.
>
> The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol:
> make swap tracking an integral part of memory control"), because it
> accidentally removed the do_memsw_account() check in the branch inside
> mem_cgroup_uncharge() that was supposed to tell the difference between
> the charge transfer in cgroup1 and the separate counters in cgroup2.
>
> As a result, cgroup2 currently undercounts consumed swap when shared
> pages are partially swapped back in. This in turn allows a cgroup to
> consume more swap than its configured limit intends.
>
> Add the do_memsw_account() check back to fix this problem.

Thanks. Sorry for my poor English. I should improve my changelog.

> ---
>
> > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks.

>
> > ---
> >  mm/memcontrol.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c737c8f05992..be6bc5044150 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >       memcg_check_events(memcg, page);
> >       local_irq_enable();
> >
> > -     if (PageSwapCache(page)) {
> > +     if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
>
> It's more descriptive to use do_memsw_account() here, IMO.
>
> We should also add a comment. How about this above the branch?

I will add a comment here. Thanks.

>
>         /*
>          * Cgroup1's unified memory+swap counter has been charged with the
>          * new swapcache page, finish the transfer by uncharging the swap
>          * slot. The swap slot would also get uncharged when it dies, but
>          * for shared pages it can stick around indefinitely and we'd count
>          * the page twice the entire time.
>          *
>          * Cgroup2 has separate resource counters for memory and swap,
>          * so this is a non-issue here. Memory and swap charge lifetimes
>          * correspond 1:1 to page and swap slot lifetimes: we charge the
>          * page to memory here, and uncharge swap when the slot is freed.
>          */
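
Folding both suggestions together, the tail of mem_cgroup_charge() would
then look roughly like this (just a sketch to confirm my understanding,
the local variable details may not match the tree exactly):

	if (do_memsw_account() && PageSwapCache(page)) {
		swp_entry_t entry = { .val = page_private(page) };

		/*
		 * Cgroup1 only: the unified memsw counter already holds
		 * the charge for this page, so drop the swap slot's
		 * duplicate charge here. On cgroup2 the slot stays
		 * charged to the swap counter until it is freed.
		 */
		mem_cgroup_uncharge_swap(entry, nr_pages);
	}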


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [External] Re: [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2
  2021-02-16 17:28     ` Johannes Weiner
@ 2021-02-17  5:15       ` Muchun Song
  0 siblings, 0 replies; 22+ messages in thread
From: Muchun Song @ 2021-02-17  5:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Cgroups,
	Linux Memory Management List, LKML

On Wed, Feb 17, 2021 at 1:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Feb 16, 2021 at 11:59:01AM -0500, Johannes Weiner wrote:
> > Hello Muchun,
> >
> > On Sat, Feb 13, 2021 at 01:01:59AM +0800, Muchun Song wrote:
> > > On cgroup v2, the swap counter is charged with the actual number of
> > > swap entries. If a swap cache page is charged successfully and we
> > > then uncharge the swap counter, that is wrong on cgroup v2, because
> > > the swap entry has not been freed.
> >
> > The patch makes sense to me. But this code is a bit tricky, we should
> > add more documentation to how it works and what the problem is.
> >
> > How about this for the changelog?
> >
> > ---
> > mm: memcontrol: fix swap undercounting for shared pages in cgroup2
>
> Coming to think of it, this isn't just for shared pages. We also hold
> on to the swap slot as long as the page is read-only, not mlocked, and
> swap isn't full. So scratch "for shared pages" here.

OK. Will do.

>
> > When shared pages are swapped in partially, we can have some page
> > tables referencing the in-memory page and some referencing the swap
> > slot. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
> > differently due to the nature of how they account memory and swap:
>
> Correction:
>
> When pages are swapped in, the VM may retain the swap copy to avoid
> repeated writes in the future. It's also retained if shared pages are
> faulted back in some processes, but not in others. During that time we
> have an in-memory copy of the page, as well as an on-swap
> copy. Cgroup1 and cgroup2 handle these overlapping lifetimes slightly
> differently due to the nature of how they account memory and swap:

Thanks a lot.

>
> > Cgroup1 has a unified memory+swap counter that tracks a data page
> > regardless of whether it's in-core or swapped out. On swapin, we transfer
> > the charge from the swap entry to the newly allocated swapcache page,
> > even though the swap entry might stick around for a while. That's why
> > we have a mem_cgroup_uncharge_swap() call inside mem_cgroup_charge().
> >
> > Cgroup2 tracks memory and swap as separate, independent resources and
> > thus has split memory and swap counters. On swapin, we charge the
> > newly allocated swapcache page as memory, while the swap slot in turn
> > must remain charged to the swap counter as long as it's allocated, too.
> >
> > The cgroup2 logic was broken by commit 2d1c498072de ("mm: memcontrol:
> > make swap tracking an integral part of memory control"), because it
> > accidentally removed the do_memsw_account() check in the branch inside
> > mem_cgroup_charge() that was supposed to tell the difference between
> > the charge transfer in cgroup1 and the separate counters in cgroup2.
> >
> > As a result, cgroup2 currently undercounts consumed swap when shared
> > pages are partially swapped back in. This in turn allows a cgroup to
> > consume more swap than its configured limit intends.
>
> Correction:
>
> As a result, cgroup2 currently undercounts retained swap to varying
> degrees: swap slots are cached up to 50% of the configured limit or
> total available swap space; partially faulted back shared pages are
> only limited by physical capacity. This in turn allows cgroups to
> significantly overconsume their allotted swap space.

Thanks a lot.
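
(For my own understanding: the "50% of the configured limit" comes from
the mem_cgroup_swap_full() heuristic, which in simplified form checks
something like the following for each ancestor memcg -- simplified, not
the literal code:

	/* treat swap as full once usage reaches half the configured limit */
	if (page_counter_read(&memcg->swap) * 2 >= READ_ONCE(memcg->swap.max))
		return true;

so retained swap copies are only trimmed once that threshold is crossed.)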

>
> > Add the do_memsw_account() check back to fix this problem.
> > ---
> >
> > > Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of memory control")
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> I also think we should tag stable on this one. The potential
> accounting error is quite large and, even without concrete examples,
> is likely to cause problems for swap management in the real world.
>
> Cc: stable@vger.kernel.org # 5.8+

OK. I will add a Cc stable tag. Thanks.

>
> > >  mm/memcontrol.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index c737c8f05992..be6bc5044150 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6753,7 +6753,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> > >     memcg_check_events(memcg, page);
> > >     local_irq_enable();
> > >
> > > -   if (PageSwapCache(page)) {
> > > +   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && PageSwapCache(page)) {
> >
> > It's more descriptive to use do_memsw_account() here, IMO.
> >
> > We should also add a comment. How about this above the branch?
> >
> >       /*
> >        * Cgroup1's unified memory+swap counter has been charged with the
> >        * new swapcache page, finish the transfer by uncharging the swap
> >        * slot. The swap slot would also get uncharged when it dies, but
> >        * for shared pages it can stick around indefinitely and we'd count
> >        * the page twice the entire time.
>
> -for shared pages
>
> >        * Cgroup2 has separate resource counters for memory and swap,
> >        * so this is a non-issue here. Memory and swap charge lifetimes
> >        * correspond 1:1 to page and swap slot lifetimes: we charge the
> >        * page to memory here, and uncharge swap when the slot is freed.
> >        */
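
For completeness, the asymmetry is also visible on the swap-uncharge
side. Heavily simplified (only the relevant counter operations from
mem_cgroup_uncharge_swap(), not the literal function):

	/* when a swap slot is finally freed, give the pages back */
	if (!mem_cgroup_is_root(memcg)) {
		if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
			page_counter_uncharge(&memcg->swap, nr_pages);	/* cgroup2 */
		else
			page_counter_uncharge(&memcg->memsw, nr_pages);	/* cgroup1 */
	}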


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-02-17  5:15 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-12 17:01 [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Muchun Song
2021-02-12 17:01 ` [PATCH 2/4] mm: memcontrol: add missing memcg_oom_recover() when uncharge slab page Muchun Song
2021-02-15  9:37   ` Michal Hocko
2021-02-12 17:01 ` [PATCH 3/4] mm: memcontrol: bail out early when id is zero Muchun Song
2021-02-15  9:39   ` Michal Hocko
2021-02-15 10:09     ` [External] " Muchun Song
2021-02-15 10:27       ` Michal Hocko
2021-02-15 11:34         ` Muchun Song
2021-02-12 17:01 ` [PATCH 4/4] mm: memcontrol: fix swap uncharge on cgroup v2 Muchun Song
2021-02-12 18:56   ` Shakeel Butt
2021-02-13  6:48     ` [External] " Muchun Song
2021-02-16 17:16       ` Shakeel Butt
2021-02-15  9:47   ` Michal Hocko
2021-02-15 10:15     ` [External] " Muchun Song
2021-02-15 10:24       ` Michal Hocko
2021-02-16 16:59   ` Johannes Weiner
2021-02-16 17:17     ` Shakeel Butt
2021-02-16 17:19     ` Michal Hocko
2021-02-16 17:28     ` Johannes Weiner
2021-02-17  5:15       ` [External] " Muchun Song
2021-02-17  5:11     ` Muchun Song
2021-02-15  9:24 ` [PATCH 1/4] mm: memcontrol: remove memcg check from memcg_oom_recover Michal Hocko
