linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
@ 2012-10-10 14:11 Michal Hocko
  2012-10-10 20:50 ` David Rientjes
  2012-10-15  9:11 ` [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Kamezawa Hiroyuki
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-10 14:11 UTC (permalink / raw)
  To: linux-mm
  Cc: David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

Hi,
I am sending the patch below as an RFC because I am not entirely happy
about myself and maybe somebody can come up with a different approach
which would be less hackish.
As a background, I have noticed that memcg OOM killer kills a wrong
tasks while playing with memory.swappiness==0 in a small group (e.g.
50M). I have multiple anon mem eaters which fault in more than the hard
limit. OOM killer kills the last executed task:

# mem_eater spawns one process per parameter, mmaps the given size and
# faults memory in in parallel (all of them are synced to start together)
./mem_eater anon:50M anon:20M anon:20M anon:20M
10571: anon_eater for 20971520B
10570: anon_eater for 52428800B
10573: anon_eater for 20971520B
10572: anon_eater for 20971520B
10573: done with status 9
10571: done with status 0
10572: done with status 9
10570: done with status 9

[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 5706]     0  5706     4955      556      13        0             0 bash
[10569]     0 10569     1015      134       6        0             0 mem_eater
[10570]     0 10570    13815     4118      15        0             0 mem_eater
[10571]     0 10571     6135     5140      16        0             0 mem_eater
[10572]     0 10572     6135       22       7        0             0 mem_eater
[10573]     0 10573     6135     3541      14        0             0 mem_eater
Memory cgroup out of memory: Kill process 10573 (mem_eater) score 0 or sacrifice child
Killed process 10573 (mem_eater) total-vm:24540kB, anon-rss:14028kB, file-rss:136kB
[...]
[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 5706]     0  5706     4955      556      13        0             0 bash
[10569]     0 10569     1015      134       6        0             0 mem_eater
[10570]     0 10570    13815    10267      27        0             0 mem_eater
[10572]     0 10572     6135     2519      12        0             0 mem_eater
Memory cgroup out of memory: Kill process 10572 (mem_eater) score 0 or sacrifice child
Killed process 10572 (mem_eater) total-vm:24540kB, anon-rss:9940kB, file-rss:136kB
[...]
[ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 5706]     0  5706     4955      556      13        0             0 bash
[10569]     0 10569     1015      134       6        0             0 mem_eater
[10570]     0 10570    13815    12773      31        0             0 mem_eater
Memory cgroup out of memory: Kill process 10570 (mem_eater) score 2 or sacrifice child
Killed process 10570 (mem_eater) total-vm:55260kB, anon-rss:50956kB, file-rss:136kB

As you can see 50M (pid:10570) is killed as the last one while 20M ones
are killed first. See the patch for more details about the problem.
As I state in the changelog the very same issue is present in the global
oom killer as well but it is much less probable as the amount of swap is
usualy much smaller than the available RAM and I think it is not worth
considering.

---
>From 445c2ced957cd77cbfca44d0e3f5056fed252a34 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 10 Oct 2012 15:46:54 +0200
Subject: [PATCH] memcg: oom: fix totalpages calculation for swappiness==0

oom_badness takes totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).

This is usually correct but since fe35004f (mm: avoid swapping out
with swappiness==0) we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller
than the available swap because the used memory (capped by the limit)
is negligible comparing to totalpages so the resulting score is too
small. A wrong process might be selected as result.

The same issue exists for the global oom killer as well but it is not
that problematic as the amount of the RAM is usually much bigger than
the swap space.

The problem can be worked around by checking swappiness==0 and not
considering swap at all.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..93a7e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
 	u64 limit;
-	u64 memsw;
 
 	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-	limit += total_swap_pages << PAGE_SHIFT;
 
-	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
 	/*
-	 * If memsw is finite and limits the amount of swap space available
-	 * to this memcg, return that limit.
+	 * Do not consider swap space if we cannot swap due to swappiness
 	 */
-	return min(limit, memsw);
+	if (mem_cgroup_swappiness(memcg)) {
+		u64 memsw;
+
+		limit += total_swap_pages << PAGE_SHIFT;
+		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+
+		/*
+		 * If memsw is finite and limits the amount of swap space
+		 * available to this memcg, return that limit.
+		 */
+		limit = min(limit, memsw);
+	}
+
+	return limit;
 }
 
 void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-10 14:11 [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Michal Hocko
@ 2012-10-10 20:50 ` David Rientjes
  2012-10-11  8:50   ` Michal Hocko
  2012-10-15  9:11 ` [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Kamezawa Hiroyuki
  1 sibling, 1 reply; 22+ messages in thread
From: David Rientjes @ 2012-10-10 20:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Johannes Weiner, LKML

On Wed, 10 Oct 2012, Michal Hocko wrote:

> Hi,
> I am sending the patch below as an RFC because I am not entirely happy
> about myself and maybe somebody can come up with a different approach
> which would be less hackish.

I don't see this as hackish, if memory.swappiness limits access to swap 
then this shouldn't be factored into the calculation, and that's what your 
patch fixes.

The reason why the process with the largest rss isn't killed in this case 
is because all processes have CAP_SYS_ADMIN so they get a 3% bonus; when 
factoring swap into the calculation and subtracting 3% from the score in 
oom_badness(), they all end up having an internal score of 1 so they are 
all considered equal.  It appears like the cgroup_iter_next() iteration 
for memcg ooms does this in reverse order, which is actually helpful so it 
will select the task that is newer.

The only suggestion I have to make is specify this is for 
memory.swappiness in the patch title, otherwise:

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-10 20:50 ` David Rientjes
@ 2012-10-11  8:50   ` Michal Hocko
  2012-10-11  8:57     ` [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0 Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2012-10-11  8:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Johannes Weiner, LKML

On Wed 10-10-12 13:50:21, David Rientjes wrote:
> On Wed, 10 Oct 2012, Michal Hocko wrote:
> 
> > Hi,
> > I am sending the patch below as an RFC because I am not entirely happy
> > about myself and maybe somebody can come up with a different approach
> > which would be less hackish.
> 
> I don't see this as hackish, 

I didn't like how swappiness spreads outside of the LRU scanning code...

> if memory.swappiness limits access to swap then this shouldn't be
> factored into the calculation, and that's what your patch fixes.
> 
> The reason why the process with the largest rss isn't killed in this case 
> is because all processes have CAP_SYS_ADMIN so they get a 3% bonus;

OK I should have mentioned that I have tested it as root which makes a
big difference with the current upstream as totalpages are considered
only if adj!=0. 
I have originally seen the problem in 3.0 kernel (with fe35004f applied)
where the calculation is different (missing a7f638f9) and we always
consider total_pages there so it doesn't depend on root or oom_score_adj.

> when factoring swap into the calculation and subtracting 3% from
> the score in oom_badness(), they all end up having an internal
> score of 1 so they are all considered equal.  It appears like the
> cgroup_iter_next() iteration for memcg ooms does this in reverse
> order, which is actually helpful so it will select the task that is
> newer.
> 
> The only suggestion I have to make is specify this is for 
> memory.swappiness in the patch title, otherwise:

OK. I will also update the changelog to mention oom_score_adj and
CAP_SYS_ADMIN, mark the patch for stable and repost it.

> Acked-by: David Rientjes <rientjes@google.com>

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11  8:50   ` Michal Hocko
@ 2012-10-11  8:57     ` Michal Hocko
  2012-10-11  9:13       ` Michal Hocko
                         ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-11  8:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

oom_badness takes totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).

This is usually correct but since fe35004f (mm: avoid swapping out
with swappiness==0) we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller than
the available swap because the used memory (capped by the limit) is
negligible comparing to totalpages so the resulting score is too small
if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
A wrong process might be selected as result.

The same issue exists for the global oom killer as well but it is not
that problematic as the amount of the RAM is usually much bigger than
the swap space.

The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: stable [3.5+]
---
 mm/memcontrol.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..93a7e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
 	u64 limit;
-	u64 memsw;
 
 	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-	limit += total_swap_pages << PAGE_SHIFT;
 
-	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
 	/*
-	 * If memsw is finite and limits the amount of swap space available
-	 * to this memcg, return that limit.
+	 * Do not consider swap space if we cannot swap due to swappiness
 	 */
-	return min(limit, memsw);
+	if (mem_cgroup_swappiness(memcg)) {
+		u64 memsw;
+
+		limit += total_swap_pages << PAGE_SHIFT;
+		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+
+		/*
+		 * If memsw is finite and limits the amount of swap space
+		 * available to this memcg, return that limit.
+		 */
+		limit = min(limit, memsw);
+	}
+
+	return limit;
 }
 
 void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11  8:57     ` [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0 Michal Hocko
@ 2012-10-11  9:13       ` Michal Hocko
  2012-10-11 12:20       ` Johannes Weiner
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-11  9:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Thu 11-10-12 10:57:39, Michal Hocko wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
> 
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
> 
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
> 
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: stable [3.5+]

I have just realized that fe35004f (introduced in 3.5-rc1) has been
backported to 3.2 and 3.4 stable kernels so this should be [3.2+]

> ---
>  mm/memcontrol.c |   21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..93a7e36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
>  static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
>  {
>  	u64 limit;
> -	u64 memsw;
>  
>  	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> -	limit += total_swap_pages << PAGE_SHIFT;
>  
> -	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
>  	/*
> -	 * If memsw is finite and limits the amount of swap space available
> -	 * to this memcg, return that limit.
> +	 * Do not consider swap space if we cannot swap due to swappiness
>  	 */
> -	return min(limit, memsw);
> +	if (mem_cgroup_swappiness(memcg)) {
> +		u64 memsw;
> +
> +		limit += total_swap_pages << PAGE_SHIFT;
> +		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> +
> +		/*
> +		 * If memsw is finite and limits the amount of swap space
> +		 * available to this memcg, return that limit.
> +		 */
> +		limit = min(limit, memsw);
> +	}
> +
> +	return limit;
>  }
>  
>  void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -- 
> 1.7.10.4
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11  8:57     ` [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0 Michal Hocko
  2012-10-11  9:13       ` Michal Hocko
@ 2012-10-11 12:20       ` Johannes Weiner
  2012-10-12 13:01         ` Michal Hocko
  2012-10-11 22:36       ` KOSAKI Motohiro
  2012-10-15 22:04       ` [PATCH v2] " Michal Hocko
  3 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2012-10-11 12:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, David Rientjes, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, LKML

On Thu, Oct 11, 2012 at 10:57:39AM +0200, Michal Hocko wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
> 
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
> 
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
> 
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: stable [3.5+]

I also don't think it's hackish, the limit depends very much on
whether reclaim can swap, so it's natural that swappiness shows up
here.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11  8:57     ` [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0 Michal Hocko
  2012-10-11  9:13       ` Michal Hocko
  2012-10-11 12:20       ` Johannes Weiner
@ 2012-10-11 22:36       ` KOSAKI Motohiro
  2012-10-12 13:01         ` Michal Hocko
  2012-10-15 22:04       ` [PATCH v2] " Michal Hocko
  3 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-10-11 22:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, David Rientjes, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Thu, Oct 11, 2012 at 4:57 AM, Michal Hocko <mhocko@suse.cz> wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: stable [3.5+]

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11 12:20       ` Johannes Weiner
@ 2012-10-12 13:01         ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-12 13:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, David Rientjes, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, LKML

On Thu 11-10-12 08:20:37, Johannes Weiner wrote:
> On Thu, Oct 11, 2012 at 10:57:39AM +0200, Michal Hocko wrote:
> > oom_badness takes totalpages argument which says how many pages are
> > available and it uses it as a base for the score calculation. The value
> > is calculated by mem_cgroup_get_limit which considers both limit and
> > total_swap_pages (resp. memsw portion of it).
> > 
> > This is usually correct but since fe35004f (mm: avoid swapping out
> > with swappiness==0) we do not swap when swappiness is 0 which means
> > that we cannot really use up all the totalpages pages. This in turn
> > confuses oom score calculation if the memcg limit is much smaller than
> > the available swap because the used memory (capped by the limit) is
> > negligible comparing to totalpages so the resulting score is too small
> > if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> > A wrong process might be selected as result.
> > 
> > The same issue exists for the global oom killer as well but it is not
> > that problematic as the amount of the RAM is usually much bigger than
> > the swap space.
> > 
> > The problem can be worked around by checking mem_cgroup_swappiness==0
> > and not considering swap at all in such a case.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > Acked-by: David Rientjes <rientjes@google.com>
> > Cc: stable [3.5+]
> 
> I also don't think it's hackish, the limit depends very much on
> whether reclaim can swap, so it's natural that swappiness shows up
> here.

OK, maybe I was just a bit over sensitive here. The other reason I
didn't like it is that swappiness might change over time we some of the
tasks could be already swapped out. oom_score already considers
MM_SWAPENTS but this just tells the number of swapped out ptes not the
pages. So we could still kill something that is resident with much
smaller memory footprint. But this is a different issue and probably a
corner case.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11 22:36       ` KOSAKI Motohiro
@ 2012-10-12 13:01         ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-12 13:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-mm, David Rientjes, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Thu 11-10-12 18:36:26, KOSAKI Motohiro wrote:
> On Thu, Oct 11, 2012 at 4:57 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > oom_badness takes totalpages argument which says how many pages are
> > available and it uses it as a base for the score calculation. The value
> > is calculated by mem_cgroup_get_limit which considers both limit and
> > total_swap_pages (resp. memsw portion of it).
> >
> > This is usually correct but since fe35004f (mm: avoid swapping out
> > with swappiness==0) we do not swap when swappiness is 0 which means
> > that we cannot really use up all the totalpages pages. This in turn
> > confuses oom score calculation if the memcg limit is much smaller than
> > the available swap because the used memory (capped by the limit) is
> > negligible comparing to totalpages so the resulting score is too small
> > if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> > A wrong process might be selected as result.
> >
> > The same issue exists for the global oom killer as well but it is not
> > that problematic as the amount of the RAM is usually much bigger than
> > the swap space.
> >
> > The problem can be worked around by checking mem_cgroup_swappiness==0
> > and not considering swap at all in such a case.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > Acked-by: David Rientjes <rientjes@google.com>
> > Cc: stable [3.5+]
> 
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-10 14:11 [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Michal Hocko
  2012-10-10 20:50 ` David Rientjes
@ 2012-10-15  9:11 ` Kamezawa Hiroyuki
  2012-10-15  9:49   ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-15  9:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, Johannes Weiner, LKML

(2012/10/10 23:11), Michal Hocko wrote:
> Hi,
> I am sending the patch below as an RFC because I am not entirely happy
> about myself and maybe somebody can come up with a different approach
> which would be less hackish.
> As a background, I have noticed that memcg OOM killer kills a wrong
> tasks while playing with memory.swappiness==0 in a small group (e.g.
> 50M). I have multiple anon mem eaters which fault in more than the hard
> limit. OOM killer kills the last executed task:
>
> # mem_eater spawns one process per parameter, mmaps the given size and
> # faults memory in in parallel (all of them are synced to start together)
> ./mem_eater anon:50M anon:20M anon:20M anon:20M
> 10571: anon_eater for 20971520B
> 10570: anon_eater for 52428800B
> 10573: anon_eater for 20971520B
> 10572: anon_eater for 20971520B
> 10573: done with status 9
> 10571: done with status 0
> 10572: done with status 9
> 10570: done with status 9
>
> [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> [ 5706]     0  5706     4955      556      13        0             0 bash
> [10569]     0 10569     1015      134       6        0             0 mem_eater
> [10570]     0 10570    13815     4118      15        0             0 mem_eater
> [10571]     0 10571     6135     5140      16        0             0 mem_eater
> [10572]     0 10572     6135       22       7        0             0 mem_eater
> [10573]     0 10573     6135     3541      14        0             0 mem_eater
> Memory cgroup out of memory: Kill process 10573 (mem_eater) score 0 or sacrifice child
> Killed process 10573 (mem_eater) total-vm:24540kB, anon-rss:14028kB, file-rss:136kB
> [...]
> [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> [ 5706]     0  5706     4955      556      13        0             0 bash
> [10569]     0 10569     1015      134       6        0             0 mem_eater
> [10570]     0 10570    13815    10267      27        0             0 mem_eater
> [10572]     0 10572     6135     2519      12        0             0 mem_eater
> Memory cgroup out of memory: Kill process 10572 (mem_eater) score 0 or sacrifice child
> Killed process 10572 (mem_eater) total-vm:24540kB, anon-rss:9940kB, file-rss:136kB
> [...]
> [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> [ 5706]     0  5706     4955      556      13        0             0 bash
> [10569]     0 10569     1015      134       6        0             0 mem_eater
> [10570]     0 10570    13815    12773      31        0             0 mem_eater
> Memory cgroup out of memory: Kill process 10570 (mem_eater) score 2 or sacrifice child
> Killed process 10570 (mem_eater) total-vm:55260kB, anon-rss:50956kB, file-rss:136kB
>
> As you can see 50M (pid:10570) is killed as the last one while 20M ones
> are killed first. See the patch for more details about the problem.
> As I state in the changelog the very same issue is present in the global
> oom killer as well but it is much less probable as the amount of swap is
> usualy much smaller than the available RAM and I think it is not worth
> considering.
>
> ---
>  From 445c2ced957cd77cbfca44d0e3f5056fed252a34 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 10 Oct 2012 15:46:54 +0200
> Subject: [PATCH] memcg: oom: fix totalpages calculation for swappiness==0
>
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller
> than the available swap because the used memory (capped by the limit)
> is negligible comparing to totalpages so the resulting score is too
> small. A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking swappiness==0 and not
> considering swap at all.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>@jp.fujitsu.com>

Hm...where should we describe this behavior....
Documentation/cgroup/memory.txt "5.3 swappiness" ?

Anyway, the patch itself seems good.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-15  9:11 ` [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Kamezawa Hiroyuki
@ 2012-10-15  9:49   ` Michal Hocko
  2012-10-15 14:25     ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2012-10-15  9:49 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, Johannes Weiner, LKML

On Mon 15-10-12 18:11:24, KAMEZAWA Hiroyuki wrote:
> (2012/10/10 23:11), Michal Hocko wrote:
[...]
> > From 445c2ced957cd77cbfca44d0e3f5056fed252a34 Mon Sep 17 00:00:00 2001
> >From: Michal Hocko <mhocko@suse.cz>
> >Date: Wed, 10 Oct 2012 15:46:54 +0200
> >Subject: [PATCH] memcg: oom: fix totalpages calculation for swappiness==0
> >
> >oom_badness takes totalpages argument which says how many pages are
> >available and it uses it as a base for the score calculation. The value
> >is calculated by mem_cgroup_get_limit which considers both limit and
> >total_swap_pages (resp. memsw portion of it).
> >
> >This is usually correct but since fe35004f (mm: avoid swapping out
> >with swappiness==0) we do not swap when swappiness is 0 which means
> >that we cannot really use up all the totalpages pages. This in turn
> >confuses oom score calculation if the memcg limit is much smaller
> >than the available swap because the used memory (capped by the limit)
> >is negligible comparing to totalpages so the resulting score is too
> >small. A wrong process might be selected as result.
> >
> >The same issue exists for the global oom killer as well but it is not
> >that problematic as the amount of the RAM is usually much bigger than
> >the swap space.
> >
> >The problem can be worked around by checking swappiness==0 and not
> >considering swap at all.
> >
> >Signed-off-by: Michal Hocko <mhocko@suse.cz>@jp.fujitsu.com>
> 
> Hm...where should we describe this behavior....
> Documentation/cgroup/memory.txt "5.3 swappiness" ?

Hmm. The swappiness behavior is consistent with the global knob. On the
other hand the visible effects are still "stronger" as the environment
is much more constrained with memcgs so the corner cases are hit more
frequently. But this is somehow expected so I am not sure whether we
need to be explicit about this one.
Maybe we could be more explicit about the swappiness==0 behavior in
Documentation/sysctl/vm.txt because the current description is quite
vague as it doesn't say anything about the range. Maybe a patch bellow
will help to clarify this?

> Anyway, the patch itself seems good.
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks!

---
>From 712995bc656cb7ad278aad45974b9e23fb524498 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe swappiness more precisely

since fe35004f (mm: avoid swapping out with swappiness==0) reclaim
stopped swapping out anon pages completely when 0 value is used.
Although this is somehow expected it hasn't been done for a really long
time this way and so it is probably better to be explicit about the
effect.
While we are at it also mention the upper limit and its effect.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/sysctl/vm.txt |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..308fd77 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -640,6 +640,9 @@ swappiness
 This control is used to define how aggressive the kernel will swap
 memory pages.  Higher values will increase agressiveness, lower values
 decrease the amount of swap.
+The value can be used from the [0, 100] range, where 0 means no swapping
+at all (even if there is a swap storage enabled) while 100 means that
+anonymous pages are reclaimed in the same rate as file pages.
 
 The default value is 60.
 
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-15  9:49   ` Michal Hocko
@ 2012-10-15 14:25     ` KOSAKI Motohiro
  2012-10-15 14:47       ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-10-15 14:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kamezawa Hiroyuki, linux-mm, David Rientjes, Johannes Weiner, LKML

> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 078701f..308fd77 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -640,6 +640,9 @@ swappiness
>  This control is used to define how aggressive the kernel will swap
>  memory pages.  Higher values will increase agressiveness, lower values
>  decrease the amount of swap.
> +The value can be used from the [0, 100] range, where 0 means no swapping
> +at all (even if there is a swap storage enabled) while 100 means that
> +anonymous pages are reclaimed in the same rate as file pages.

I think this only correct when memcg. Even if swappiness==0, global reclaim swap
out anon pages before oom.

see below.

get_scan_count()
(snip)
	if (global_reclaim(sc)) {
		free  = zone_page_state(zone, NR_FREE_PAGES);
		/* If we have very few page cache pages,
		   force-scan anon pages. */
		if (unlikely(file + free <= high_wmark_pages(zone))) {
			fraction[0] = 1;
			fraction[1] = 0;
			denominator = 1;
			goto out;
		}
	}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-15 14:25     ` KOSAKI Motohiro
@ 2012-10-15 14:47       ` Michal Hocko
  2012-10-15 22:33         ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2012-10-15 14:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Kamezawa Hiroyuki, linux-mm, David Rientjes, Johannes Weiner, LKML

On Mon 15-10-12 10:25:14, KOSAKI Motohiro wrote:
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 078701f..308fd77 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -640,6 +640,9 @@ swappiness
> >  This control is used to define how aggressive the kernel will swap
> >  memory pages.  Higher values will increase agressiveness, lower values
> >  decrease the amount of swap.
> > +The value can be used from the [0, 100] range, where 0 means no swapping
> > +at all (even if there is a swap storage enabled) while 100 means that
> > +anonymous pages are reclaimed in the same rate as file pages.
> 
> I think this only correct when memcg. Even if swappiness==0, global reclaim swap
> out anon pages before oom.

Right you are (we really do swap when the file pages are really
low)! Sorry about the confusion. I kind of became if(global_reclaim)
block blind...

Then this really needs a memcg specific documentation fix. What about
the following?
---
>From 59a60705abd2faf9e266a4270bbf302001845588 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe memcg swappiness more precisely

since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
stopped swapping out anon pages completely when 0 value is used.
Although this is somehow expected it hasn't been done for a really long
time this way and so it is probably better to be explicit about the
effect. Moreover global reclaim swapps out even when swappiness is 0
to prevent from OOM killer.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..71c4da4 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -466,6 +466,10 @@ Note:
 5.3 swappiness
 
 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
+Please note that unlike the global swappiness, memcg knob set to 0
+really prevents from any swapping even if there is a swap storage
+available. This might lead to memcg OOM killer if there are no file
+pages to reclaim.
 
 Following cgroups' swappiness can't be changed.
 - root cgroup (uses /proc/sys/vm/swappiness).
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-11  8:57     ` [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0 Michal Hocko
                         ` (2 preceding siblings ...)
  2012-10-11 22:36       ` KOSAKI Motohiro
@ 2012-10-15 22:04       ` Michal Hocko
  2012-10-15 22:07         ` [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0 Michal Hocko
  2012-11-07 22:10         ` [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0 Andrew Morton
  3 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-15 22:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

As Kosaki correctly pointed out, the glogal reclaim doesn't have this
issue because we _do_ swap on swappinnes==0 so the swap space has
to be considered. So the v2 is just acks + changelog fix.

Changes since v1
- drop a note about global swappiness affected as well from the
  changelog
- stable needs 3.2+ rather than 3.5+ because the fe35004f has been
  backported to stable
---
>From c2ae4849f09dbfda6b61472c6dd1fd8c2fe8ac81 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 10 Oct 2012 15:46:54 +0200
Subject: [PATCH] memcg: oom: fix totalpages calculation for
 memory.swappiness==0

oom_badness takes totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).

This is usually correct but since fe35004f (mm: avoid swapping out
with swappiness==0) we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller than
the available swap because the used memory (capped by the limit) is
negligible comparing to totalpages so the resulting score is too small
if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
A wrong process might be selected as result.

The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: stable [3.2+]
---
 mm/memcontrol.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..93a7e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
 static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 {
 	u64 limit;
-	u64 memsw;
 
 	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
-	limit += total_swap_pages << PAGE_SHIFT;
 
-	memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
 	/*
-	 * If memsw is finite and limits the amount of swap space available
-	 * to this memcg, return that limit.
+	 * Do not consider swap space if we cannot swap due to swappiness
 	 */
-	return min(limit, memsw);
+	if (mem_cgroup_swappiness(memcg)) {
+		u64 memsw;
+
+		limit += total_swap_pages << PAGE_SHIFT;
+		memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+
+		/*
+		 * If memsw is finite and limits the amount of swap space
+		 * available to this memcg, return that limit.
+		 */
+		limit = min(limit, memsw);
+	}
+
+	return limit;
 }
 
 void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0
  2012-10-15 22:04       ` [PATCH v2] " Michal Hocko
@ 2012-10-15 22:07         ` Michal Hocko
  2012-10-16  0:51           ` Kamezawa Hiroyuki
  2012-10-16  0:54           ` David Rientjes
  2012-11-07 22:10         ` [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0 Andrew Morton
  1 sibling, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2012-10-15 22:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

And a follow up for memcg.swappiness documentation which is more
specific about spwappiness==0 meaning.
---
>From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe memcg swappiness more precisely

since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
stopped swapping out anon pages completely when 0 value is used.
Although this is somehow expected it hasn't been done for a really long
time this way and so it is probably better to be explicit about the
effect. Moreover global reclaim swapps out even when swappiness is 0
to prevent from OOM killer.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..71c4da4 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -466,6 +466,10 @@ Note:
 5.3 swappiness
 
 Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
+Please note that unlike the global swappiness, memcg knob set to 0
+really prevents from any swapping even if there is a swap storage
+available. This might lead to memcg OOM killer if there are no file
+pages to reclaim.
 
 Following cgroups' swappiness can't be changed.
 - root cgroup (uses /proc/sys/vm/swappiness).
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0
  2012-10-15 14:47       ` Michal Hocko
@ 2012-10-15 22:33         ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2012-10-15 22:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kamezawa Hiroyuki, linux-mm, David Rientjes, Johannes Weiner, LKML

>> I think this only correct when memcg. Even if swappiness==0, global reclaim swap
>> out anon pages before oom.
>
> Right you are (we really do swap when the file pages are really
> low)! Sorry about the confusion. I kind of became if(global_reclaim)
> block blind...
>
> Then this really needs a memcg specific documentation fix. What about
> the following?
> ---
> From 59a60705abd2faf9e266a4270bbf302001845588 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  Documentation/cgroups/memory.txt |    4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index c07f7b4..71c4da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -466,6 +466,10 @@ Note:
>  5.3 swappiness
>
>  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
> +Please note that unlike the global swappiness, memcg knob set to 0
> +really prevents from any swapping even if there is a swap storage
> +available. This might lead to memcg OOM killer if there are no file
> +pages to reclaim.

Pretty good to me. Thank you!

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0
  2012-10-15 22:07         ` [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0 Michal Hocko
@ 2012-10-16  0:51           ` Kamezawa Hiroyuki
  2012-10-16  0:54           ` David Rientjes
  1 sibling, 0 replies; 22+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-16  0:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, David Rientjes, KOSAKI Motohiro,
	Johannes Weiner, LKML

(2012/10/16 7:07), Michal Hocko wrote:
> And a follow up for memcg.swappiness documentation which is more
> specific about spwappiness==0 meaning.
> ---
>  From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Nice :)
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
>   Documentation/cgroups/memory.txt |    4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index c07f7b4..71c4da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -466,6 +466,10 @@ Note:
>   5.3 swappiness
>
>   Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
> +Please note that unlike the global swappiness, memcg knob set to 0
> +really prevents from any swapping even if there is a swap storage
> +available. This might lead to memcg OOM killer if there are no file
> +pages to reclaim.
>
>   Following cgroups' swappiness can't be changed.
>   - root cgroup (uses /proc/sys/vm/swappiness).
>



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0
  2012-10-15 22:07         ` [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0 Michal Hocko
  2012-10-16  0:51           ` Kamezawa Hiroyuki
@ 2012-10-16  0:54           ` David Rientjes
  1 sibling, 0 replies; 22+ messages in thread
From: David Rientjes @ 2012-10-16  0:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Tue, 16 Oct 2012, Michal Hocko wrote:

> And a follow up for memcg.swappiness documentation which is more
> specific about spwappiness==0 meaning.
> ---
> From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
> 
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-10-15 22:04       ` [PATCH v2] " Michal Hocko
  2012-10-15 22:07         ` [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0 Michal Hocko
@ 2012-11-07 22:10         ` Andrew Morton
  2012-11-07 22:46           ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2012-11-07 22:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Tue, 16 Oct 2012 00:04:08 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> As Kosaki correctly pointed out, the glogal reclaim doesn't have this
> issue because we _do_ swap on swappinnes==0 so the swap space has
> to be considered. So the v2 is just acks + changelog fix.
> 
> Changes since v1
> - drop a note about global swappiness affected as well from the
>   changelog
> - stable needs 3.2+ rather than 3.5+ because the fe35004f has been
>   backported to stable
> ---
> >From c2ae4849f09dbfda6b61472c6dd1fd8c2fe8ac81 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Wed, 10 Oct 2012 15:46:54 +0200
> Subject: [PATCH] memcg: oom: fix totalpages calculation for
>  memory.swappiness==0
> 
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
> 
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
> 
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: stable [3.2+]

That's "Cc: <stable@vger.kernel.org>", please.

It's unobvious from the changelog that a -stable backport is really
needed.  The bug looks pretty obscure and has been there for a long
time.  Realistically, is anyone likely to hurt from this?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-11-07 22:10         ` [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0 Andrew Morton
@ 2012-11-07 22:46           ` Michal Hocko
  2012-11-07 22:53             ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2012-11-07 22:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Wed 07-11-12 14:10:25, Andrew Morton wrote:
> On Tue, 16 Oct 2012 00:04:08 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > As Kosaki correctly pointed out, the glogal reclaim doesn't have this
> > issue because we _do_ swap on swappinnes==0 so the swap space has
> > to be considered. So the v2 is just acks + changelog fix.
> > 
> > Changes since v1
> > - drop a note about global swappiness affected as well from the
> >   changelog
> > - stable needs 3.2+ rather than 3.5+ because the fe35004f has been
> >   backported to stable
> > ---
> > >From c2ae4849f09dbfda6b61472c6dd1fd8c2fe8ac81 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Wed, 10 Oct 2012 15:46:54 +0200
> > Subject: [PATCH] memcg: oom: fix totalpages calculation for
> >  memory.swappiness==0
> > 
> > oom_badness takes totalpages argument which says how many pages are
> > available and it uses it as a base for the score calculation. The value
> > is calculated by mem_cgroup_get_limit which considers both limit and
> > total_swap_pages (resp. memsw portion of it).
> > 
> > This is usually correct but since fe35004f (mm: avoid swapping out
> > with swappiness==0) we do not swap when swappiness is 0 which means
> > that we cannot really use up all the totalpages pages. This in turn
> > confuses oom score calculation if the memcg limit is much smaller than
> > the available swap because the used memory (capped by the limit) is
> > negligible comparing to totalpages so the resulting score is too small
> > if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> > A wrong process might be selected as result.
> > 
> > The problem can be worked around by checking mem_cgroup_swappiness==0
> > and not considering swap at all in such a case.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > Acked-by: David Rientjes <rientjes@google.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Cc: stable [3.2+]
> 
> That's "Cc: <stable@vger.kernel.org>", please.

Will do next time.

> It's unobvious from the changelog that a -stable backport is really
> needed.  The bug looks pretty obscure and has been there for a long
> time.

Yes but it is not _that_ long since fe35004f made it into stable trees
(e.g. 3.2.29).
The reason why we probably do not see many reports is because people
didn't get used to swappiness==0 really works these days - especially
with memcg where it means _really_ no swapping.

> Realistically, is anyone likely to hurt from this?

The primary motivation for the fix was a real report by a customer.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-11-07 22:46           ` Michal Hocko
@ 2012-11-07 22:53             ` Andrew Morton
  2012-11-08  8:35               ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2012-11-07 22:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Wed, 7 Nov 2012 23:46:40 +0100
Michal Hocko <mhocko@suse.cz> wrote:

> > Realistically, is anyone likely to hurt from this?
> 
> The primary motivation for the fix was a real report by a customer.

Describe it please and I'll copy it to the changelog.

So that Greg can understand why we sent this at him and so that others
who hit the same problem (or any other problem!) will be able to
determine whether this patch might solve it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0
  2012-11-07 22:53             ` Andrew Morton
@ 2012-11-08  8:35               ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2012-11-08  8:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Johannes Weiner, LKML

On Wed 07-11-12 14:53:40, Andrew Morton wrote:
> On Wed, 7 Nov 2012 23:46:40 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > > Realistically, is anyone likely to hurt from this?
> > 
> > The primary motivation for the fix was a real report by a customer.
> 
> Describe it please and I'll copy it to the changelog.

The original issue (a wrong tasks get killed in a small group and memcg
swappiness=0) has been reported on top of our 3.0 based kernel (with
fe35004f backported). I have tried to replicate it by the test case
mentioned https://lkml.org/lkml/2012/10/10/223.

As David correctly pointed out (https://lkml.org/lkml/2012/10/10/418)
the significant role played the fact that all the processes in the group
have CAP_SYS_ADMIN but oom_score_adj has the similar effect. 
Say there is 2G of swap space which is 524288 pages. If you add
CAP_SYS_ADMIN bonus then you have -15728 score for the bias. This means
that all tasks with less than 60M get the minimum score and it is tasks
ordering which determines who gets killed as a result.

To summarize it. Users of small groups (relatively to the swap size)
with CAP_SYS_ADMIN tasks resp. oom_score_adj are affected the most
others might see an unexpected oom_badness calculation.
Whether this is a workload which is representative, I don't know but
I think that it is worth fixing and pushing to stable as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-11-08  8:36 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-10 14:11 [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Michal Hocko
2012-10-10 20:50 ` David Rientjes
2012-10-11  8:50   ` Michal Hocko
2012-10-11  8:57     ` [PATCH] memcg: oom: fix totalpages calculation for memory.swappiness==0 Michal Hocko
2012-10-11  9:13       ` Michal Hocko
2012-10-11 12:20       ` Johannes Weiner
2012-10-12 13:01         ` Michal Hocko
2012-10-11 22:36       ` KOSAKI Motohiro
2012-10-12 13:01         ` Michal Hocko
2012-10-15 22:04       ` [PATCH v2] " Michal Hocko
2012-10-15 22:07         ` [PATCH] doc: describe memcg swappiness more precisely memory.swappiness==0 Michal Hocko
2012-10-16  0:51           ` Kamezawa Hiroyuki
2012-10-16  0:54           ` David Rientjes
2012-11-07 22:10         ` [PATCH v2] memcg: oom: fix totalpages calculation for memory.swappiness==0 Andrew Morton
2012-11-07 22:46           ` Michal Hocko
2012-11-07 22:53             ` Andrew Morton
2012-11-08  8:35               ` Michal Hocko
2012-10-15  9:11 ` [RFC PATCH] memcg: oom: fix totalpages calculation for swappiness==0 Kamezawa Hiroyuki
2012-10-15  9:49   ` Michal Hocko
2012-10-15 14:25     ` KOSAKI Motohiro
2012-10-15 14:47       ` Michal Hocko
2012-10-15 22:33         ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).