* [PATCH -mm] memcg: prevent from OOM with too many dirty pages
@ 2012-06-19 14:50 ` Michal Hocko
From: Michal Hocko @ 2012-06-19 14:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen, Hugh Dickins,
	Johannes Weiner, Fengguang Wu

Current implementation of dirty pages throttling is not memcg aware which makes
it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
hard limit is small and so the lists are scanned faster than pages written
back.

This patch fixes the problem by throttling the allocating process (possibly
a writer) during the hard limit reclaim by waiting on PageReclaim pages.
We are waiting only for PageReclaim pages because those are the pages
that made one full round over LRU and that means that the writeback is much
slower than scanning.
The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.
We are seeing this happening during nightly backups which are placed into
containers to prevent from eviction of the real working set.
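
For context, PG_reclaim is set (among other places) by reclaim itself
when it starts writeback on a dirty page. A trimmed sketch of the
pageout() path in mm/vmscan.c of that era (abbreviated, error handling
and the writeback range setup omitted, so not an exact copy):

	if (clear_page_dirty_for_io(page)) {
		int res;
		struct writeback_control wbc = {
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= SWAP_CLUSTER_MAX,
			.for_reclaim	= 1,
		};

		/*
		 * Tag the page so that end_page_writeback() rotates it to
		 * the tail of the inactive LRU once the I/O completes.
		 */
		SetPageReclaim(page);
		res = mapping->a_ops->writepage(page, &wbc);
	}

Meeting a page that is still under writeback *and* PageReclaim on a later
pass therefore means the scanner went around the whole LRU before the I/O
finished, which is exactly the signal this patch keys off.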

The change affects only memcg reclaim and only when we encounter PageReclaim
pages which is a signal that the reclaim doesn't catch up on with the writers
so somebody should be throttled. This could be potentially unfair because it
could be somebody else from the group who gets throttled on behalf of the
writer but as writers need to allocate as well and they allocate in higher rate
the probability that only innocent processes would be penalized is not that
high.

I have tested this change by a simple dd copying /dev/zero to tmpfs or ext3
running under small memcg (1G copy under 5M, 60M, 300M and 2G containers) and
dd got killed by OOM killer every time. With the patch I could run the dd with
the same size under 5M controller without any OOM.
The issue is more visible with slower devices for output.

* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c978ce4..7cccd81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,9 +720,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
-			nr_writeback++;
-			unlock_page(page);
-			goto keep;
+			/*
+			 * memcg doesn't have any dirty pages throttling so we
+			 * could easily OOM just because too many pages are in
+			 * writeback from reclaim and there is nothing else to
+			 * reclaim.
+			 */
+			if (PageReclaim(page)
+					&& may_enter_fs && !global_reclaim(sc))
+				wait_on_page_writeback(page);
+			else {
+				nr_writeback++;
+				unlock_page(page);
+				goto keep;
+			}
 		}
 
 		references = page_check_references(page, sc);
-- 
1.7.10


* Re: [PATCH -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-19 14:50 ` Michal Hocko
@ 2012-06-19 22:00   ` Andrew Morton
From: Andrew Morton @ 2012-06-19 22:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen, Hugh Dickins,
	Johannes Weiner, Fengguang Wu

On Tue, 19 Jun 2012 16:50:04 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> Current implementation of dirty pages throttling is not memcg aware which makes
> it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> hard limit is small and so the lists are scanned faster than pages written
> back.

This is a bit hard to parse.  I changed it to

: The current implementation of dirty pages throttling is not memcg aware
: which makes it easy to have memcg LRUs full of dirty pages.  Without
: throttling, these LRUs can be scanned faster than the rate of writeback,
: leading to memcg OOM conditions when the hard limit is small.

does that still say what you meant to say?

> The solution is far from being ideal - long term solution is memcg aware
> dirty throttling - but it is meant to be a band aid until we have a real
> fix.

Fair enough I guess.  The fix is small and simple and if it makes the
kernel better, why not?

Would like to see a few more acks though.  Why hasn't everyone been
hitting this?

> We are seeing this happening during nightly backups which are placed into
> containers to prevent from eviction of the real working set.

Well that's a trick which we want to work well.  It's a killer
featurelet for people who wonder what all this memcg crap is for ;)

> The change affects only memcg reclaim and only when we encounter PageReclaim
> pages which is a signal that the reclaim doesn't catch up on with the writers
> so somebody should be throttled. This could be potentially unfair because it
> could be somebody else from the group who gets throttled on behalf of the
> writer but as writers need to allocate as well and they allocate in higher rate
> the probability that only innocent processes would be penalized is not that
> high.

OK.

> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -720,9 +720,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>  
>  		if (PageWriteback(page)) {
> -			nr_writeback++;
> -			unlock_page(page);
> -			goto keep;
> +			/*
> +			 * memcg doesn't have any dirty pages throttling so we
> +			 * could easily OOM just because too many pages are in
> +			 * writeback from reclaim and there is nothing else to
> +			 * reclaim.
> +			 */
> +			if (PageReclaim(page)
> +					&& may_enter_fs && !global_reclaim(sc))
> +				wait_on_page_writeback(page);
> +			else {
> +				nr_writeback++;
> +				unlock_page(page);
> +				goto keep;
> +			}

A couple of things here.

With my gcc and CONFIG_CGROUP_MEM_RES_CTLR=n (for gawd's sake can we
please rename this to CONFIG_MEMCG?), this:

--- a/mm/vmscan.c~memcg-prevent-from-oom-with-too-many-dirty-pages-fix
+++ a/mm/vmscan.c
@@ -726,8 +726,8 @@ static unsigned long shrink_page_list(st
 			 * writeback from reclaim and there is nothing else to
 			 * reclaim.
 			 */
-			if (PageReclaim(page)
-					&& may_enter_fs && !global_reclaim(sc))
+			if (!global_reclaim(sc) && PageReclaim(page) &&
+					may_enter_fs)
 				wait_on_page_writeback(page);
 			else {
 				nr_writeback++;


reduces vmscan.o's .text by 48 bytes(!).  Because the compiler can
avoid generating any code for PageReclaim() and perhaps the
may_enter_fs test.  Because global_reclaim() evaluates to constant
true.  Do you think that's an improvement?
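
For reference, global_reclaim() is just a cheap predicate on the
scan_control; roughly how it read in mm/vmscan.c around this time
(paraphrased, not an exact copy):

	#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	static bool global_reclaim(struct scan_control *sc)
	{
		/* targeted (memcg) reclaim sets target_mem_cgroup */
		return !sc->target_mem_cgroup;
	}
	#else
	static bool global_reclaim(struct scan_control *sc)
	{
		/* without memcg compiled in, all reclaim is global */
		return true;
	}
	#endif

With CONFIG_CGROUP_MEM_RES_CTLR=n the function is constant true, so a
condition that begins with !global_reclaim(sc) is folded away completely
and no code needs to be emitted for the PageReclaim() and may_enter_fs
tests that follow it.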

Also, why do we test may_enter_fs here?  I should have been able to
work out your reasoning from either code comments or changelogging but
I cannot (bad).  I don't *think* there's a deadlock issue here?  If the
page is now under writeback, that writeback *will* complete?

Finally, I wonder if there should be some timeout of that wait.  I
don't know why, but I wouldn't be surprised if we hit some glitch which
causes us to add one!



* Re: [PATCH -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-19 22:00   ` Andrew Morton
@ 2012-06-20  8:27     ` Michal Hocko
From: Michal Hocko @ 2012-06-20  8:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen, Hugh Dickins,
	Johannes Weiner, Fengguang Wu

On Tue 19-06-12 15:00:14, Andrew Morton wrote:
> On Tue, 19 Jun 2012 16:50:04 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Current implementation of dirty pages throttling is not memcg aware which makes
> > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > hard limit is small and so the lists are scanned faster than pages written
> > back.
> 
> This is a bit hard to parse.  I changed it to
> 
> : The current implementation of dirty pages throttling is not memcg aware
> : which makes it easy to have memcg LRUs full of dirty pages.  Without
> : throttling, these LRUs can be scanned faster than the rate of writeback,
> : leading to memcg OOM conditions when the hard limit is small.
> 
> does that still say what you meant to say?

Yes, Thanks!

> > The solution is far from being ideal - long term solution is memcg aware
> > dirty throttling - but it is meant to be a band aid until we have a real
> > fix.
> 
> Fair enough I guess.  The fix is small and simple and if it makes the
> kernel better, why not?
> 
> Would like to see a few more acks though. 

> Why hasn't everyone been hitting this?

Because you need a very small hard limit and heavy writers. We have seen
some complaints in the past but our answer was "make the limit bigger".

[...]
> A couple of things here.
> 
> With my gcc and CONFIG_CGROUP_MEM_RES_CTLR=n (for gawd's sake can we
> please rename this to CONFIG_MEMCG?), this:
> 
> --- a/mm/vmscan.c~memcg-prevent-from-oom-with-too-many-dirty-pages-fix
> +++ a/mm/vmscan.c
> @@ -726,8 +726,8 @@ static unsigned long shrink_page_list(st
>  			 * writeback from reclaim and there is nothing else to
>  			 * reclaim.
>  			 */
> -			if (PageReclaim(page)
> -					&& may_enter_fs && !global_reclaim(sc))
> +			if (!global_reclaim(sc) && PageReclaim(page) &&
> +					may_enter_fs)
>  				wait_on_page_writeback(page);
>  			else {
>  				nr_writeback++;
> 
> 
> reduces vmscan.o's .text by 48 bytes(!).  Because the compiler can
> avoid generating any code for PageReclaim() and perhaps the
> may_enter_fs test.  Because global_reclaim() evaluates to constant
> true.  Do you think that's an improvement?

Yes you are right. We should optimize for the non-memcg case.

> Also, why do we test may_enter_fs here?  I should have been able to
> work out your reasoning from either code comments or changelogging but
> I cannot (bad).  I don't *think* there's a deadlock issue here?  If the
> page is now under writeback, that writeback *will* complete?

Good question.  To be honest I mimicked what synchronous lumpy reclaim
did. You are right that we cannot deadlock here because writeback has
already been started.  But when I was digging back into history I found
this: https://lkml.org/lkml/2007/7/30/344

But now that I am thinking about it some more, memcg (hard limit) reclaim
is different and we shouldn't end up with a !may_enter_fs allocation here
because all those allocations are for page cache or anon pages.
So I guess we can drop the may_enter_fs part.
Thanks for pointing it out.

> Finally, I wonder if there should be some timeout of that wait.  I
> don't know why, but I wouldn't be surprised if we hit some glitch which
> causes us to add one!

As you said, the writeback will eventually complete so we will not wait
forever. I have played with slow USB storage and saw only small stalls,
which are much better than an OOM.
Johannes was worried about stalls when we hit PageReclaim pages while
there are still a lot of clean pages to reclaim; in that case we would
stall without any good reason. This situation is rather hard to simulate
even with artificial loads, so we concluded that there is room for
additional improvement here but the band aid is worth having on its own.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-19 22:00   ` Andrew Morton
@ 2012-06-20  9:20     ` Mel Gorman
From: Mel Gorman @ 2012-06-20  9:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen, Hugh Dickins,
	Johannes Weiner, Fengguang Wu

On Tue, Jun 19, 2012 at 03:00:14PM -0700, Andrew Morton wrote:
> On Tue, 19 Jun 2012 16:50:04 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Current implementation of dirty pages throttling is not memcg aware which makes
> > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > hard limit is small and so the lists are scanned faster than pages written
> > back.
> 
> This is a bit hard to parse.  I changed it to
> 
> : The current implementation of dirty pages throttling is not memcg aware
> : which makes it easy to have memcg LRUs full of dirty pages.  Without
> : throttling, these LRUs can be scanned faster than the rate of writeback,
> : leading to memcg OOM conditions when the hard limit is small.
> 
> does that still say what you meant to say?
> 
> > The solution is far from being ideal - long term solution is memcg aware
> > dirty throttling - but it is meant to be a band aid until we have a real
> > fix.
> 
> Fair enough I guess.  The fix is small and simple and if it makes the
> kernel better, why not?
> 
> Would like to see a few more acks though.  Why hasn't everyone been
> hitting this?
> 

I had been quiet because Acks from people in the same company tend to not
carry much weight.

I think this patch is appropriate. It is not necessarily the *best*
and potentially there is a better solution out there which is why I think
people have been reluctant to ack it. However, some of the better solutions
also had corner cases where they could simply break again or require a lot
of new infrastructure such as dirty-limit tracking within memcgs that we
are just not ready for.  This patch may not be subtle but it fixes a very
annoying issue that currently makes memcg dangerous to use for workloads
that dirty a lot of their memory. When the all singing all dancing fix
exists then it can be reverted if necessary but from me;

Reviewed-by: Mel Gorman <mgorman@suse.de>

Some caveats with may_enter_fs below.

> > We are seeing this happening during nightly backups which are placed into
> > containers to prevent from eviction of the real working set.
> 
> Well that's a trick which we want to work well.  It's a killer
> featurelet for people who wonder what all this memcg crap is for ;)
> 

Turns out people get really pissed when their straight-forward workload
blows up.

> > +			/*
> > +			 * memcg doesn't have any dirty pages throttling so we
> > +			 * could easily OOM just because too many pages are in
> > +			 * writeback from reclaim and there is nothing else to
> > +			 * reclaim.
> > +			 */
> > +			if (PageReclaim(page)
> > +					&& may_enter_fs && !global_reclaim(sc))
> > +				wait_on_page_writeback(page);
> > +			else {
> > +				nr_writeback++;
> > +				unlock_page(page);
> > +				goto keep;
> > +			}
> 
> A couple of things here.
> 
> With my gcc and CONFIG_CGROUP_MEM_RES_CTLR=n (for gawd's sake can we
> please rename this to CONFIG_MEMCG?), this:
> 
> --- a/mm/vmscan.c~memcg-prevent-from-oom-with-too-many-dirty-pages-fix
> +++ a/mm/vmscan.c
> @@ -726,8 +726,8 @@ static unsigned long shrink_page_list(st
>  			 * writeback from reclaim and there is nothing else to
>  			 * reclaim.
>  			 */
> -			if (PageReclaim(page)
> -					&& may_enter_fs && !global_reclaim(sc))
> +			if (!global_reclaim(sc) && PageReclaim(page) &&
> +					may_enter_fs)
>  				wait_on_page_writeback(page);
>  			else {
>  				nr_writeback++;
> 
> 
> reduces vmscan.o's .text by 48 bytes(!).  Because the compiler can
> avoid generating any code for PageReclaim() and perhaps the
> may_enter_fs test.  Because global_reclaim() evaluates to constant
> true.  Do you think that's an improvement?
> 

Looks functionally equivalent to me so why not get the 48 bytes!

> Also, why do we test may_enter_fs here? 

I think this is partially my fault because it's based on a similar test
lumpy reclaim used to do and I at least didn't reconsider it properly during
review. Back then, there were two reasons for the may_enter_fs check. The
first was to avoid processes like kjournald ever stalling on page writeback
because it caused the system to "stutter". The more relevant reason was
that callers that lacked may_enter_fs were also likely to fail lumpy
reclaim if they could not write dirty pages and wait on them, so it was
better to give up or move to another block.

In the context of memcg reclaim there should be no concern about kernel
threads getting stuck on writeback and it does not have the same problem
as lumpy reclaim had with being unable to writeout pages. IMO, the check
is safe to drop. Michal?

> Finally, I wonder if there should be some timeout of that wait.  I
> don't know why, but I wouldn't be surprised if we hit some glitch which
> causes us to add one!
> 

If we hit such a situation it means that flush is no longer working which
is interesting in itself. I guess one possibility where it can occur is
if we hit global dirty limits (or memcg dirty limits when they exist)
and the page is backed by NFS that is disconnected. That would stall here
potentially forever but it's already the case that a system that hits its
dirty limits with a disconnected NFS is in trouble and a timeout here will
not do much to help.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-20  9:20     ` Mel Gorman
@ 2012-06-20  9:55       ` Fengguang Wu
From: Fengguang Wu @ 2012-06-20  9:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Michal Hocko, linux-mm, linux-kernel,
	KAMEZAWA Hiroyuki, Minchan Kim, Rik van Riel, Ying Han,
	Greg Thelen, Hugh Dickins, Johannes Weiner

> > Finally, I wonder if there should be some timeout of that wait.  I
> > don't know why, but I wouldn't be surprised if we hit some glitch which
> > causes us to add one!
> > 
> 
> If we hit such a situation it means that flush is no longer working which
> is interesting in itself. I guess one possibility where it can occur is
> if we hit global dirty limits (or memcg dirty limits when they exist)
> and the page is backed by NFS that is disconnected. That would stall here
> potentially forever but it's already the case that a system that hits its
> dirty limits with a disconnected NFS is in trouble and a timeout here will
> not do much to help.

Agreed. I've run into such cases where I cannot even log in locally
because the shell is blocked trying to write even 1 byte at startup time.
Any open shells are also stalled on writing to .bash_history etc.

Thanks,
Fengguang


* Re: [PATCH -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-20  9:20     ` Mel Gorman
@ 2012-06-20  9:59       ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-06-20  9:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen, Hugh Dickins,
	Johannes Weiner, Fengguang Wu

On Wed 20-06-12 10:20:11, Mel Gorman wrote:
> On Tue, Jun 19, 2012 at 03:00:14PM -0700, Andrew Morton wrote:
> > On Tue, 19 Jun 2012 16:50:04 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > Current implementation of dirty pages throttling is not memcg aware which makes
> > > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > > hard limit is small and so the lists are scanned faster than pages written
> > > back.
> > 
> > This is a bit hard to parse.  I changed it to
> > 
> > : The current implementation of dirty pages throttling is not memcg aware
> > : which makes it easy to have memcg LRUs full of dirty pages.  Without
> > : throttling, these LRUs can be scanned faster than the rate of writeback,
> > : leading to memcg OOM conditions when the hard limit is small.
> > 
> > does that still say what you meant to say?
> > 
> > > The solution is far from being ideal - long term solution is memcg aware
> > > dirty throttling - but it is meant to be a band aid until we have a real
> > > fix.
> > 
> > Fair enough I guess.  The fix is small and simple and if it makes the
> > kernel better, why not?
> > 
> > Would like to see a few more acks though.  Why hasn't everyone been
> > hitting this?
> > 
> 
> I had been quiet because Acks from people in the same company tend to not
> carry much weight.
> 
> I think this patch is appropriate. It is not necessarily the *best*
> and potentially there is a better solution out there which is why I think
> people have been reluctent to ack it. However, some of the better solutions
> also had corner cases where they could simply break again or require a lot
> of new infrastructure such as dirty-limit tracking within memcgs that we
> are just not ready for.  This patch may not be subtle but it fixes a very
> annoying issue that currently makes memcg dangerous to use for workloads
> that dirty a lot of their memory. When the all singing all dancing fix
> exists then it can be reverted if necessary but from me;
> 
> Reviewed-by: Mel Gorman <mgorman@suse.de>

Thanks, I will respin the patch and send v2.

[...]
> > Also, why do we test may_enter_fs here? 
> 
> I think this is partially my fault because it's based on a similar test
> lumpy reclaim used to do and I at least didn't reconsider it properly during
> review. Back then, there were two reasons for the may_enter_fs check. The
> first was to avoid processes like kjournald ever stalling on page writeback
> because it caused the system to "stutter". The more relevant reason was
> because callers that lacked may_enter_fs were also likely to fail lumpy
> reclaim if they could not write dirty pages and wait on them so it was
> better to give up or move to another block.
> 
> In the context of memcg reclaim there should be no concern about kernel
> threads getting stuck on writeback and it does not have the same problem
> as lumpy reclaim had with being unable to writeout pages. IMO, the check
> is safe to drop. Michal?

Yes, as I wrote in the other email: memcg reclaim is about LRU pages so
the may_enter_fs check is not needed here.

> 
> > Finally, I wonder if there should be some timeout of that wait.  I
> > don't know why, but I wouldn't be surprised if we hit some glitch which
> > causes us to add one!
> > 
> 
> If we hit such a situation it means that flush is no longer working which
> is interesting in itself. I guess one possibility where it can occur is
> if we hit global dirty limits (or memcg dirty limits when they exist)
> and the page is backed by NFS that is disconnected. That would stall here
> potentially forever but it's already the case that a system that hits its
> dirty limits with a disconnected NFS is in trouble and a timeout here will
> not do much to help.

Agreed.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-19 22:00   ` Andrew Morton
@ 2012-06-20 10:11     ` Michal Hocko
From: Michal Hocko @ 2012-06-20 10:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen, Hugh Dickins,
	Johannes Weiner, Fengguang Wu

Hi Andrew,
here is an updated version, in case it is easier for you to drop the
previous one.

Changes since v1:
* added Mel's Reviewed-by
* updated changelog as per Andrew
* updated the condition to be optimized for no-memcg case
---
From 72b61c1c5da8039a7ba31d3f98420e6649ab73b8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Wed, 20 Jun 2012 12:06:07 +0200
Subject: [PATCH] memcg: prevent from OOM with too many dirty pages

The current implementation of dirty pages throttling is not memcg aware
which makes it easy to have memcg LRUs full of dirty pages.  Without
throttling, these LRUs can be scanned faster than the rate of writeback,
leading to memcg OOM conditions when the hard limit is small.

This patch fixes the problem by throttling the allocating process (possibly
a writer) during the hard limit reclaim by waiting on PageReclaim pages.
We are waiting only for PageReclaim pages because those are the pages
that made one full round over LRU and that means that the writeback is much
slower than scanning.
The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.
We are seeing this happening during nightly backups which are placed into
containers to prevent from eviction of the real working set.

The change affects only memcg reclaim and only when we encounter PageReclaim
pages which is a signal that the reclaim doesn't catch up on with the writers
so somebody should be throttled. This could be potentially unfair because it
could be somebody else from the group who gets throttled on behalf of the
writer but as writers need to allocate as well and they allocate in higher rate
the probability that only innocent processes would be penalized is not that
high.

I have tested this change by a simple dd copying /dev/zero to tmpfs or ext3
running under small memcg (1G copy under 5M, 60M, 300M and 2G containers) and
dd got killed by OOM killer every time. With the patch I could run the dd with
the same size under 5M controller without any OOM.
The issue is more visible with slower devices for output.

* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
[akpm@linux-foundation.org: tweak changelog, reordered the test to
 optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
Reviewed-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c978ce4..3b6cc22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,9 +720,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
-			nr_writeback++;
-			unlock_page(page);
-			goto keep;
+			/*
+			 * memcg doesn't have any dirty pages throttling so we
+			 * could easily OOM just because too many pages are in
+			 * writeback from reclaim and there is nothing else to
+			 * reclaim.
+			 */
+			if (!global_reclaim(sc) && PageReclaim(page))
+				wait_on_page_writeback(page);
+			else {
+				nr_writeback++;
+				unlock_page(page);
+				goto keep;
+			}
 		}
 
 		references = page_check_references(page, sc);
-- 
1.7.10

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-20 10:11     ` Michal Hocko
@ 2012-07-12  1:57       ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-12  1:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

Hi Michal,

On Wed, 20 Jun 2012, Michal Hocko wrote:
> Hi Andrew,
> here is an updated version if it is easier for you to drop the previous
> one.
> changes since v1
> * added Mel's Reviewed-by
> * updated changelog as per Andrew
> * updated the condition to be optimized for no-memcg case

I mentioned in Johannes's [03/11] thread a couple of days ago, that
I was having a problem with your wait_on_page_writeback() in mmotm.

It turns out that your original patch was fine, but you let dark angels
whisper into your ear, to persuade you to remove the "&& may_enter_fs".

Part of my load builds kernels on extN over loop over tmpfs: loop does
mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))
because it knows it will deadlock, if the loop thread enters reclaim,
and reclaim tries to write back a dirty page, one which needs the loop
thread to perform the write.

With the may_enter_fs check restored, all is well.  I don't entirely
like your patch: I think it would be much better to wait in the same
place as the wait_iff_congested(), when the pages gathered have been
sent for writing and unlocked and putback and freed; and I also wonder
if it should go beyond the !global_reclaim case for swap pages, because
they don't participate in dirty limiting.

But those are things I should investigate later - I did write a patch
like that before, when I was having some unexpected OOM trouble with a
private kernel; but my OOMs then were because of something silly that
I'd left out, and I'm not at present sure if we have a problem in this
regard or not.

The important thing is to get the may_enter_fs back into your patch:
I can't really Sign-off the below because it's yours, but
Acked-by: Hugh Dickins <hughd@google.com>
---

 mm/vmscan.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 3.5-rc6-mm1/mm/vmscan.c	2012-07-11 14:42:13.668335884 -0700
+++ linux/mm/vmscan.c	2012-07-11 16:01:20.712814127 -0700
@@ -726,7 +726,8 @@ static unsigned long shrink_page_list(st
 			 * writeback from reclaim and there is nothing else to
 			 * reclaim.
 			 */
-			if (!global_reclaim(sc) && PageReclaim(page))
+			if (!global_reclaim(sc) && PageReclaim(page) &&
+					may_enter_fs)
 				wait_on_page_writeback(page);
 			else {
 				nr_writeback++;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12  1:57       ` Hugh Dickins
@ 2012-07-12  2:21         ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2012-07-12  2:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Wed, 11 Jul 2012 18:57:43 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> --- 3.5-rc6-mm1/mm/vmscan.c	2012-07-11 14:42:13.668335884 -0700
> +++ linux/mm/vmscan.c	2012-07-11 16:01:20.712814127 -0700
> @@ -726,7 +726,8 @@ static unsigned long shrink_page_list(st
>  			 * writeback from reclaim and there is nothing else to
>  			 * reclaim.
>  			 */
> -			if (!global_reclaim(sc) && PageReclaim(page))
> +			if (!global_reclaim(sc) && PageReclaim(page) &&
> +					may_enter_fs)
>  				wait_on_page_writeback(page);
>  			else {
>  				nr_writeback++;

um, that may_enter_fs test got removed because nobody knew why it was
there.  Nobody knew why it was there because it was undocumented.  Do
you see where I'm going with this?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12  2:21         ` Andrew Morton
@ 2012-07-12  3:13           ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-12  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Wed, 11 Jul 2012, Andrew Morton wrote:
> On Wed, 11 Jul 2012 18:57:43 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:
> 
> > --- 3.5-rc6-mm1/mm/vmscan.c	2012-07-11 14:42:13.668335884 -0700
> > +++ linux/mm/vmscan.c	2012-07-11 16:01:20.712814127 -0700
> > @@ -726,7 +726,8 @@ static unsigned long shrink_page_list(st
> >  			 * writeback from reclaim and there is nothing else to
> >  			 * reclaim.
> >  			 */
> > -			if (!global_reclaim(sc) && PageReclaim(page))
> > +			if (!global_reclaim(sc) && PageReclaim(page) &&
> > +					may_enter_fs)
> >  				wait_on_page_writeback(page);
> >  			else {
> >  				nr_writeback++;
> 
> um, that may_enter_fs test got removed because nobody knew why it was
> there.  Nobody knew why it was there because it was undocumented.  Do
> you see where I'm going with this?

I was hoping you might do that bit ;)  Here's my display of ignorance:

--- 3.5-rc6-mm1/mm/vmscan.c	2012-07-11 14:42:13.668335884 -0700
+++ linux/mm/vmscan.c	2012-07-11 20:09:33.182829986 -0700
@@ -725,8 +725,15 @@ static unsigned long shrink_page_list(st
 			 * could easily OOM just because too many pages are in
 			 * writeback from reclaim and there is nothing else to
 			 * reclaim.
+			 *
+			 * Check may_enter_fs, certainly because a loop driver
+			 * thread might enter reclaim, and deadlock if it waits
+			 * on a page for which it is needed to do the write
+			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
+			 * but more thought would probably show more reasons.
 			 */
-			if (!global_reclaim(sc) && PageReclaim(page))
+			if (!global_reclaim(sc) && PageReclaim(page) &&
+					may_enter_fs)
 				wait_on_page_writeback(page);
 			else {
 				nr_writeback++;

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12  1:57       ` Hugh Dickins
@ 2012-07-12  7:05         ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-07-12  7:05 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Wed 11-07-12 18:57:43, Hugh Dickins wrote:
> Hi Michal,

Hi,

> 
> On Wed, 20 Jun 2012, Michal Hocko wrote:
> > Hi Andrew,
> > here is an updated version if it is easier for you to drop the previous
> > one.
> > changes since v1
> > * added Mel's Reviewed-by
> > * updated changelog as per Andrew
> > * updated the condition to be optimized for no-memcg case
> 
> I mentioned in Johannes's [03/11] thread a couple of days ago, that
> I was having a problem with your wait_on_page_writeback() in mmotm.
> 
> It turns out that your original patch was fine, but you let dark angels
> whisper into your ear, to persuade you to remove the "&& may_enter_fs".
> 
> Part of my load builds kernels on extN over loop over tmpfs: loop does
> mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))
> because it knows it will deadlock, if the loop thread enters reclaim,
> and reclaim tries to write back a dirty page, one which needs the loop
> thread to perform the write.

Good catch! I have totally missed the loop driver.

> With the may_enter_fs check restored, all is well.  I don't entirely
> like your patch: I think it would be much better to wait in the same
> place as the wait_iff_congested(), when the pages gathered have been
> sent for writing and unlocked and putback and freed; 

I guess you mean
	if (nr_writeback && nr_writeback >=
                        (nr_taken >> (DEF_PRIORITY - sc->priority)))
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

I have tried to hook here but it has some issues. First of all we do not
know how long we should wait. Waiting for specific pages sounded more
event based and more precise.

We can surely do better but I wanted to stop the OOM first without any
other possible side effects on the global reclaim. I have tried to make
the band aid as simple as possible. Memcg dirty pages accounting is
forming already so we are one (tiny) step closer to the throttling.
 
> and I also wonder if it should go beyond the !global_reclaim case for
> swap pages, because they don't participate in dirty limiting.

Worth a separate patch?

> But those are things I should investigate later - I did write a patch
> like that before, when I was having some unexpected OOM trouble with a
> private kernel; but my OOMs then were because of something silly that
> I'd left out, and I'm not at present sure if we have a problem in this
> regard or not.
> 
> The important thing is to get the may_enter_fs back into your patch:
> I can't really Sign-off the below because it's yours, but
> Acked-by: Hugh Dickins <hughd@google.com>

Thanks a lot Hugh!

When we are back to the patch. Is it going into 3.5? I hope so and I
think it is really worth stable as well. Andrew?

> ---
> 
>  mm/vmscan.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 3.5-rc6-mm1/mm/vmscan.c	2012-07-11 14:42:13.668335884 -0700
> +++ linux/mm/vmscan.c	2012-07-11 16:01:20.712814127 -0700
> @@ -726,7 +726,8 @@ static unsigned long shrink_page_list(st
>  			 * writeback from reclaim and there is nothing else to
>  			 * reclaim.
>  			 */
> -			if (!global_reclaim(sc) && PageReclaim(page))
> +			if (!global_reclaim(sc) && PageReclaim(page) &&
> +					may_enter_fs)
>  				wait_on_page_writeback(page);
>  			else {
>  				nr_writeback++;

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12  7:05         ` Michal Hocko
@ 2012-07-12 21:13           ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2012-07-12 21:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Thu, 12 Jul 2012 09:05:01 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> When we are back to the patch. Is it going into 3.5? I hope so and I
> think it is really worth stable as well. Andrew?

What patch.   "memcg: prevent OOM with too many dirty pages"?

I wasn't planning on 3.5, given the way it's been churning around.  How
about we put it into 3.6 and tag it for a -stable backport, so it gets
a bit of a run in mainline before we inflict it upon -stable users?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12 21:13           ` Andrew Morton
@ 2012-07-12 22:42             ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-12 22:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Thu, 12 Jul 2012, Andrew Morton wrote:
> On Thu, 12 Jul 2012 09:05:01 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > When we are back to the patch. Is it going into 3.5? I hope so and I
> > think it is really worth stable as well. Andrew?
> 
> What patch.   "memcg: prevent OOM with too many dirty pages"?

Yes.

> 
> I wasn't planning on 3.5, given the way it's been churning around.

I don't know if you had been intending to send it in for 3.5 earlier;
but I'm sorry if my late intervention on may_enter_fs has delayed it.

> How
> about we put it into 3.6 and tag it for a -stable backport, so it gets
> a bit of a run in mainline before we inflict it upon -stable users?

That sounds good enough to me, but does fall short of Michal's hope.

Hugh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12 22:42             ` Hugh Dickins
@ 2012-07-13  8:21               ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-07-13  8:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Thu 12-07-12 15:42:53, Hugh Dickins wrote:
> On Thu, 12 Jul 2012, Andrew Morton wrote:
> > On Thu, 12 Jul 2012 09:05:01 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > When we are back to the patch. Is it going into 3.5? I hope so and I
> > > think it is really worth stable as well. Andrew?
> > 
> > What patch.   "memcg: prevent OOM with too many dirty pages"?
> 
> Yes.
> 
> > 
> > I wasn't planning on 3.5, given the way it's been churning around.
> 
> I don't know if you had been intending to send it in for 3.5 earlier;
> but I'm sorry if my late intervention on may_enter_fs has delayed it.

Well, I should have investigated more when the question came up...
 
> > How about we put it into 3.6 and tag it for a -stable backport, so
> > it gets a bit of a run in mainline before we inflict it upon -stable
> > users?
> 
> That sounds good enough to me, but does fall short of Michal's hope.

I would be happier if it went into 3.5 already, because the problem (OOM
on too many dirty pages) is real and long-standing (basically since forever).
We have had the patch in SLES11-SP2 for quite some time (the original one
with the may_enter_fs check) and it has helped a lot.
The patch was designed as a band aid primarily because it is very simple
that way, in the hope that the real fix will come later.
The decision is up to you, Andrew, but I vote for pushing it as soon as
possible and trying to come up with something more clever for 3.6.

> 
> Hugh

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-12  7:05         ` Michal Hocko
@ 2012-07-16  8:10           ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-16  8:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Thu, 12 Jul 2012, Michal Hocko wrote:
> On Wed 11-07-12 18:57:43, Hugh Dickins wrote:
> > 
> > I mentioned in Johannes's [03/11] thread a couple of days ago, that
> > I was having a problem with your wait_on_page_writeback() in mmotm.
> > 
> > It turns out that your original patch was fine, but you let dark angels
> > whisper into your ear, to persuade you to remove the "&& may_enter_fs".
> > 
> > Part of my load builds kernels on extN over loop over tmpfs: loop does
> > mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))
> > because it knows it will deadlock, if the loop thread enters reclaim,
> > and reclaim tries to write back a dirty page, one which needs the loop
> > thread to perform the write.
> 
> Good catch! I have totally missed the loop driver.
> 
> > With the may_enter_fs check restored, all is well.

Not as well as I thought when I wrote that: but those issues I'll deal
with in separate mail (and my alternative patch was no better).

> > I don't entirely
> > like your patch: I think it would be much better to wait in the same
> > place as the wait_iff_congested(), when the pages gathered have been
> > sent for writing and unlocked and putback and freed; 
> 
> I guess you mean
> 	if (nr_writeback && nr_writeback >=
>                         (nr_taken >> (DEF_PRIORITY - sc->priority)))
>                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

Yes, I've appended the patch I was meaning below; but although it's
the way I had approached the issue, I don't in practice see any better
behaviour from mine than from yours.  So unless a good reason appears
later, to do it my way instead of yours, let's just forget about mine.

> 
> I have tried to hook here but it has some issues. First of all we do not
> know how long we should wait. Waiting for specific pages sounded more
> event based and more precise.
> 
> We can surely do better but I wanted to stop the OOM first without any
> other possible side effects on the global reclaim. I have tried to make
> the band aid as simple as possible. Memcg dirty pages accounting is
> forming already so we are one (tiny) step closer to the throttling.
>  
> > and I also wonder if it should go beyond the !global_reclaim case for
> > swap pages, because they don't participate in dirty limiting.
> 
> Worth a separate patch?

If I could ever generate a suitable testcase, yes.  But in practice,
the only way I've managed to generate such a preponderance of swapping
over file reclaim, is by using memcgs, which your patch already catches.
And if there actually is the swapping issue I suggest, then it's been
around for a very long time, apparently without complaint.

Here is the patch I had in mind: I'm posting it as illustration, so we
can look back to it in the archives if necessary; but it's definitely
not signed off - I've seen no practical advantage over yours, so we can
probably just forget about this one now.

But more mail to follow, returning to yours...

Hugh

p.s. KAMEZAWA-san, if you wonder why you're suddenly brought into this
conversation, it's because there was a typo in your email address before.

--- 3.5-rc6/vmscan.c	2012-06-03 06:42:11.000000000 -0700
+++ linux/vmscan.c	2012-07-13 11:53:20.372087273 -0700
@@ -675,7 +675,8 @@ static unsigned long shrink_page_list(st
 				      struct zone *zone,
 				      struct scan_control *sc,
 				      unsigned long *ret_nr_dirty,
-				      unsigned long *ret_nr_writeback)
+				      unsigned long *ret_nr_writeback,
+				      struct page **slow_page)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -720,6 +721,27 @@ static unsigned long shrink_page_list(st
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
+			/*
+			 * memcg doesn't have any dirty pages throttling so we
+			 * could easily OOM just because too many pages are in
+			 * writeback from reclaim and there is nothing else to
+			 * reclaim.  Nor is swap subject to dirty throttling.
+			 *
+			 * Check may_enter_fs, certainly because a loop driver
+			 * thread might enter reclaim, and deadlock if it waits
+			 * on a page for which it is needed to do the write
+			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
+			 * but more thought would probably show more reasons.
+			 *
+			 * Just use one page per shrink for this: wait on its
+			 * writeback once we have done the rest.  If device is
+			 * slow, in due course we shall choose one of its pages.
+			 */
+			if (!*slow_page && may_enter_fs && PageReclaim(page) &&
+			    (PageSwapCache(page) || !global_reclaim(sc))) {
+				*slow_page = page;
+				get_page(page);
+			}
 			nr_writeback++;
 			unlock_page(page);
 			goto keep;
@@ -1208,6 +1230,7 @@ shrink_inactive_list(unsigned long nr_to
 	int file = is_file_lru(lru);
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	struct page *slow_page = NULL;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1245,7 +1268,7 @@ shrink_inactive_list(unsigned long nr_to
 		return 0;
 
 	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
-						&nr_dirty, &nr_writeback);
+					&nr_dirty, &nr_writeback, &slow_page);
 
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1292,8 +1315,13 @@ shrink_inactive_list(unsigned long nr_to
 	 *                     isolated page is PageWriteback
 	 */
 	if (nr_writeback && nr_writeback >=
-			(nr_taken >> (DEF_PRIORITY - sc->priority)))
+			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+		if (slow_page && PageReclaim(slow_page))
+			wait_on_page_writeback(slow_page);
+	}
+	if (slow_page)
+		put_page(slow_page);
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-13  8:21               ` Michal Hocko
@ 2012-07-16  8:30                 ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-16  8:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Fri, 13 Jul 2012, Michal Hocko wrote:
> On Thu 12-07-12 15:42:53, Hugh Dickins wrote:
> > On Thu, 12 Jul 2012, Andrew Morton wrote:
> > > 
> > > I wasn't planning on 3.5, given the way it's been churning around.
> > 
> > I don't know if you had been intending to send it in for 3.5 earlier;
> > but I'm sorry if my late intervention on may_enter_fs has delayed it.
> 
> Well, I should have investigated more when the question came up...
>  
> > > How about we put it into 3.6 and tag it for a -stable backport, so
> > > it gets a bit of a run in mainline before we inflict it upon -stable
> > > users?
> > 
> > That sounds good enough to me, but does fall short of Michal's hope.
> 
> I would be happier if it went into 3.5 already, because the problem (OOM
> on too many dirty pages) is real and long-standing (basically since forever).
> We have had the patch in SLES11-SP2 for quite some time (the original one
> with the may_enter_fs check) and it has helped a lot.
> The patch was designed as a band aid primarily because it is very simple
> that way, in the hope that the real fix will come later.
> The decision is up to you, Andrew, but I vote for pushing it as soon as
> possible and trying to come up with something more clever for 3.6.

Once I got to trying dd in memcg to FS on USB stick, yes, I very much
agree that the problem is real and well worth fixing, and that your
patch takes us most of the way there.

But Andrew's caution has proved to be well founded: in the last
few days I've found several problems with it.

I guess it makes more sense to go into detail in the patch I'm about
to send, fixing up what is (I think) currently in mmotm.

But in brief: my insistence on may_enter_fs actually took us backwards
on ext4, because that does __GFP_NOFS page allocations when writing.
I still don't understand how this showed up in none of my testing at
the end of the week, and only hit me today (er, yesterday).  But not
as big a problem as I thought at first, because loop also turns off
__GFP_IO, so we can go by that instead.

And though I found your patch works most of the time, one in five
or ten attempts would OOM just as before: we actually have a problem
also with PageWriteback pages which are not PageReclaim, but the
answer is to mark those PageReclaim.

Patch follows separately in a moment.  I'm pretty happy with it now,
but I've not yet tried xfs, btrfs, vfat, tmpfs.  I notice now that
you specifically describe testing on ext3, but don't mention ext4:
I wonder if you got bogged down in the problems I've fixed on that.

Hugh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
  2012-07-16  8:30                 ` Hugh Dickins
@ 2012-07-16  8:35                   ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-16  8:35 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

The may_enter_fs test turns out to be too restrictive: though I saw
no problem with it when testing on 3.5-rc6, it very soon OOMed when
I tested on 3.5-rc6-mm1.  I don't know what the difference there is,
perhaps I just slightly changed the way I started off the testing:
dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync
repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick.

ext4 (and gfs2 and xfs) turn out to allocate new pages for writing
with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear
to me why the transaction needs to be started even before allocating
pagecache memory.  But it may not be worth worrying about these days:
if direct reclaim avoids FS writeback, does __GFP_FS now mean anything?

Anyway, we insisted on the may_enter_fs test to avoid hangs with the
loop device; but since that also masks off __GFP_IO, we can test for
__GFP_IO directly, ignoring may_enter_fs and __GFP_FS.
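
To make the gfp reasoning concrete, here is a rough sketch of the
->write_begin allocation path - not the literal mm/filemap.c code - the
point being that AOP_FLAG_NOFS clears __GFP_FS alone, while the loop
driver's mapping_set_gfp_mask() quoted earlier in the thread clears both
__GFP_FS and __GFP_IO:

	/* sketch: pagecache allocation for ->write_begin with AOP_FLAG_NOFS */
	gfp_t gfp_mask = mapping_gfp_mask(mapping);
	if (flags & AOP_FLAG_NOFS)		/* ext4/gfs2/xfs hold a transaction */
		gfp_mask &= ~__GFP_FS;		/* __GFP_IO stays set */
	page = __page_cache_alloc(gfp_mask);	/* may enter memcg reclaim */

So testing sc->gfp_mask & __GFP_IO still throttles those writers, and
still steers clear of the loop thread.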

But even so, the test still OOMs sometimes: when originally testing
on 3.5-rc6, it OOMed about one time in five or ten; when testing
just now on 3.5-rc6-mm1, it OOMed on the first iteration.

This residual problem comes from an accumulation of pages under
ordinary writeback, not marked PageReclaim, so rightly not causing
the memcg check to wait on their writeback: these too can prevent
shrink_page_list() from freeing any pages, so many times that memcg
reclaim fails and OOMs.

Deal with these in the same way as direct reclaim now deals with
dirty FS pages: mark them PageReclaim.  It is appropriate to rotate
these to tail of list when writepage completes, but more importantly,
the PageReclaim flag makes memcg reclaim wait on them if encountered
again.  Increment NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.

Setting PageReclaim here may occasionally race with end_page_writeback()
clearing it: lru_deactivate_fn() already faced the same race, and
correctly concluded that the window is small and the issue non-critical.

With these changes, the test runs indefinitely without OOMing on ext4,
ext3 and ext2: I'll move on to test with other filesystems later.

Trivia: invert conditions for a clearer block without an else,
and goto keep_locked to do the unlock_page.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org (along with the patch it fixes)
---
Incremental on top of what I believe you presently have in mmotm:
better folded in on top of Michal's original and the may_enter_fs "fix".

 mm/vmscan.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

--- mmotm/mm/vmscan.c	2012-07-14 18:43:46.618738947 -0700
+++ linux/mm/vmscan.c	2012-07-15 19:28:50.038830668 -0700
@@ -723,23 +723,38 @@ static unsigned long shrink_page_list(st
 			/*
 			 * memcg doesn't have any dirty pages throttling so we
 			 * could easily OOM just because too many pages are in
-			 * writeback from reclaim and there is nothing else to
-			 * reclaim.
+			 * writeback and there is nothing else to reclaim.
 			 *
-			 * Check may_enter_fs, certainly because a loop driver
+			 * Check __GFP_IO, certainly because a loop driver
 			 * thread might enter reclaim, and deadlock if it waits
 			 * on a page for which it is needed to do the write
 			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
 			 * but more thought would probably show more reasons.
+			 *
+			 * Don't require __GFP_FS, since we're not going into
+			 * the FS, just waiting on its writeback completion.
+			 * Worryingly, ext4 gfs2 and xfs allocate pages with
+			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
+			 * testing may_enter_fs here is liable to OOM on them.
 			 */
-			if (!global_reclaim(sc) && PageReclaim(page) &&
-					may_enter_fs)
-				wait_on_page_writeback(page);
-			else {
+			if (global_reclaim(sc) ||
+			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+				/*
+				 * This is slightly racy - end_page_writeback()
+				 * might have just cleared PageReclaim, then
+				 * setting PageReclaim here end up interpreted
+				 * as PageReadahead - but that does not matter
+				 * enough to care.  What we do want is for this
+				 * page to have PageReclaim set next time memcg
+				 * reclaim reaches the tests above, so it will
+				 * then wait_on_page_writeback() to avoid OOM;
+				 * and it's also appropriate in global reclaim.
+				 */
+				SetPageReclaim(page);
 				nr_writeback++;
-				unlock_page(page);
-				goto keep;
+				goto keep_locked;
 			}
+			wait_on_page_writeback(page);
 		}
 
 		references = page_check_references(page, sc);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 -mm] memcg: prevent from OOM with too many dirty pages
  2012-07-16  8:10           ` Hugh Dickins
@ 2012-07-16  8:48             ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-07-16  8:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Mon 16-07-12 01:10:47, Hugh Dickins wrote:
> On Thu, 12 Jul 2012, Michal Hocko wrote:
> > On Wed 11-07-12 18:57:43, Hugh Dickins wrote:
> > > 
> > > I mentioned in Johannes's [03/11] thread a couple of days ago, that
> > > I was having a problem with your wait_on_page_writeback() in mmotm.
> > > 
> > > It turns out that your original patch was fine, but you let dark angels
> > > whisper into your ear, to persuade you to remove the "&& may_enter_fs".
> > > 
> > > Part of my load builds kernels on extN over loop over tmpfs: loop does
> > > mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS))
> > > because it knows it will deadlock, if the loop thread enters reclaim,
> > > and reclaim tries to write back a dirty page, one which needs the loop
> > > thread to perform the write.
> > 
> > Good catch! I have totally missed the loop driver.
> > 
> > > With the may_enter_fs check restored, all is well.
> 
> Not as well as I thought when I wrote that: but those issues I'll deal
> with in separate mail (and my alternative patch was no better).
> 
> > > I don't entirely
> > > like your patch: I think it would be much better to wait in the same
> > > place as the wait_iff_congested(), when the pages gathered have been
> > > sent for writing and unlocked and putback and freed; 
> > 
> > I guess you mean
> > 	if (nr_writeback && nr_writeback >=
> >                         (nr_taken >> (DEF_PRIORITY - sc->priority)))
> >                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> 
> Yes, I've appended the patch I was meaning below; but although it's
> the way I had approached the issue, I don't in practice see any better
> behaviour from mine than from yours.  So unless a good reason appears
> later, to do it my way instead of yours, let's just forget about mine.

OK

> > I have tried to hook here but it has some issues. First of all we do not
> > know how long we should wait. Waiting for specific pages sounded more
> > event based and more precise.
> > 
> > We can surely do better but I wanted to stop the OOM first without any
> > other possible side effects on the global reclaim. I have tried to make
> > the band aid as simple as possible. Memcg dirty pages accounting is
> > forming already so we are one (tiny) step closer to the throttling.
> >  
> > > and I also wonder if it should go beyond the !global_reclaim case for
> > > swap pages, because they don't participate in dirty limiting.
> > 
> > Worth a separate patch?
> 
> If I could ever generate a suitable testcase, yes.  But in practice,
> the only way I've managed to generate such a preponderance of swapping
> over file reclaim, is by using memcgs, which your patch already catches.
> And if there actually is the swapping issue I suggest, then it's been
> around for a very long time, apparently without complaint.
> 
> Here is the patch I had in mind: I'm posting it as illustration, so we
> can look back to it in the archives if necessary; but it's definitely
> not signed-off, I've seen no practical advantage over yours, probably
> we just forget about this one below now.
> 
> But more mail to follow, returning to yours...
> 
> Hugh
> 
> p.s. KAMEZAWA-san, if you wonder why you're suddenly brought into this
> conversation, it's because there was a typo in your email address before.

Sorry, my fault. I misspelled the domain (jp.fujtisu.com).

> --- 3.5-rc6/vmscan.c	2012-06-03 06:42:11.000000000 -0700
> +++ linux/vmscan.c	2012-07-13 11:53:20.372087273 -0700
> @@ -675,7 +675,8 @@ static unsigned long shrink_page_list(st
>  				      struct zone *zone,
>  				      struct scan_control *sc,
>  				      unsigned long *ret_nr_dirty,
> -				      unsigned long *ret_nr_writeback)
> +				      unsigned long *ret_nr_writeback,
> +				      struct page **slow_page)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> @@ -720,6 +721,27 @@ static unsigned long shrink_page_list(st
>  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>  
>  		if (PageWriteback(page)) {
> +			/*
> +			 * memcg doesn't have any dirty pages throttling so we
> +			 * could easily OOM just because too many pages are in
> +			 * writeback from reclaim and there is nothing else to
> +			 * reclaim.  Nor is swap subject to dirty throttling.
> +			 *
> +			 * Check may_enter_fs, certainly because a loop driver
> +			 * thread might enter reclaim, and deadlock if it waits
> +			 * on a page for which it is needed to do the write
> +			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
> +			 * but more thought would probably show more reasons.
> +			 *
> +			 * Just use one page per shrink for this: wait on its
> +			 * writeback once we have done the rest.  If device is
> +			 * slow, in due course we shall choose one of its pages.
> +			 */
> +			if (!*slow_page && may_enter_fs && PageReclaim(page) &&
> +			    (PageSwapCache(page) || !global_reclaim(sc))) {
> +				*slow_page = page;
> +				get_page(page);
> +			}
>  			nr_writeback++;
>  			unlock_page(page);
>  			goto keep;
> @@ -1208,6 +1230,7 @@ shrink_inactive_list(unsigned long nr_to
>  	int file = is_file_lru(lru);
>  	struct zone *zone = lruvec_zone(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	struct page *slow_page = NULL;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1245,7 +1268,7 @@ shrink_inactive_list(unsigned long nr_to
>  		return 0;
>  
>  	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
> -						&nr_dirty, &nr_writeback);
> +					&nr_dirty, &nr_writeback, &slow_page);
>  
>  	spin_lock_irq(&zone->lru_lock);
>  
> @@ -1292,8 +1315,13 @@ shrink_inactive_list(unsigned long nr_to
>  	 *                     isolated page is PageWriteback
>  	 */
>  	if (nr_writeback && nr_writeback >=
> -			(nr_taken >> (DEF_PRIORITY - sc->priority)))
> +			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
>  		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +		if (slow_page && PageReclaim(slow_page))
> +			wait_on_page_writeback(slow_page);
> +	}
> +	if (slow_page)
> +		put_page(slow_page);

Hmm. This relies on another round of shrinking, because even if we wait
for the page it doesn't add up to nr_reclaimed. Not a big deal in
practice, I guess, because those pages will be rotated and seen in the
next loop. We are reclaiming with priority 0, so we scan the whole list,
and should gather SWAP_CLUSTER_MAX pages sooner or later, so the patch
seems to be correct.
It should even cope with the sudden latency issue of hitting a random
PageReclaim page in the middle of the LRU, mentioned by Johannes. I
wasn't able to trigger that issue though, and I think it is more
theoretical than real.

Anyway, thanks for looking into this. It's good to see that there is
another approach as well, so that we can compare.

>  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>  		zone_idx(zone),

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
  2012-07-16  8:35                   ` Hugh Dickins
@ 2012-07-16  9:26                     ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-07-16  9:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Mon 16-07-12 01:35:34, Hugh Dickins wrote:
> The may_enter_fs test turns out to be too restrictive: though I saw
> no problem with it when testing on 3.5-rc6, it very soon OOMed when
> I tested on 3.5-rc6-mm1.  I don't know what the difference there is,
> perhaps I just slightly changed the way I started off the testing:
> dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync
> repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick.
> 
> ext4 (and gfs2 and xfs) turn out to allocate new pages for writing
> with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear
> to me why the transaction needs to be started even before allocating
> pagecache memory.  But it may not be worth worrying about these days:
> if direct reclaim avoids FS writeback, does __GFP_FS now mean anything?
> 
> Anyway, we insisted on the may_enter_fs test to avoid hangs with the
> loop device; but since that also masks off __GFP_IO, we can test for
> __GFP_IO directly, ignoring may_enter_fs and __GFP_FS.
> 
> But even so, the test still OOMs sometimes: when originally testing
> on 3.5-rc6, it OOMed about one time in five or ten; when testing
> just now on 3.5-rc6-mm1, it OOMed on the first iteration.
> 
> This residual problem comes from an accumulation of pages under
> ordinary writeback, not marked PageReclaim, so rightly not causing
> the memcg check to wait on their writeback: these too can prevent
> shrink_page_list() from freeing any pages, so many times that memcg
> reclaim fails and OOMs.

I guess you managed to trigger this with the 20M limit, right? I have tested
with different group sizes but the writeback didn't trigger for most of
them and all the dirty data were flushed from the reclaim. Have you used
any special setting for the dirty ratio? Or was it with xfs (IIUC that one
does ignore writeback from the direct reclaim completely).
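For reference, the knobs meant here are the global writeback sysctls; a
hypothetical non-default setting, just to show which ones are involved:

  sysctl vm.dirty_ratio=40
  sysctl vm.dirty_background_ratio=20

or the corresponding files under /proc/sys/vm/.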

> Deal with these in the same way as direct reclaim now deals with
> dirty FS pages: mark them PageReclaim.  It is appropriate to rotate
> these to tail of list when writepage completes, but more importantly,
> the PageReclaim flag makes memcg reclaim wait on them if encountered
> again.  Increment NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
> 
> Setting PageReclaim here may occasionally race with end_page_writeback()
> clearing it: lru_deactivate_fn() already faced the same race, and
> correctly concluded that the window is small and the issue non-critical.
> 
> With these changes, the test runs indefinitely without OOMing on ext4,
> ext3 and ext2: I'll move on to test with other filesystems later.
> 
> Trivia: invert conditions for a clearer block without an else,
> and goto keep_locked to do the unlock_page.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: stable@vger.kernel.org (along with the patch it fixes)

Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks Hugh

> ---
> Incremental on top of what I believe you presently have in mmotm:
> better folded in on top of Michal's original and the may_enter_fs "fix".
> 
>  mm/vmscan.c |   33 ++++++++++++++++++++++++---------
>  1 file changed, 24 insertions(+), 9 deletions(-)
> 
> --- mmotm/mm/vmscan.c	2012-07-14 18:43:46.618738947 -0700
> +++ linux/mm/vmscan.c	2012-07-15 19:28:50.038830668 -0700
> @@ -723,23 +723,38 @@ static unsigned long shrink_page_list(st
>  			/*
>  			 * memcg doesn't have any dirty pages throttling so we
>  			 * could easily OOM just because too many pages are in
> -			 * writeback from reclaim and there is nothing else to
> -			 * reclaim.
> +			 * writeback and there is nothing else to reclaim.
>  			 *
> -			 * Check may_enter_fs, certainly because a loop driver
> +			 * Check __GFP_IO, certainly because a loop driver
>  			 * thread might enter reclaim, and deadlock if it waits
>  			 * on a page for which it is needed to do the write
>  			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
>  			 * but more thought would probably show more reasons.
> +			 *
> +			 * Don't require __GFP_FS, since we're not going into
> +			 * the FS, just waiting on its writeback completion.
> +			 * Worryingly, ext4 gfs2 and xfs allocate pages with
> +			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
> +			 * testing may_enter_fs here is liable to OOM on them.
>  			 */
> -			if (!global_reclaim(sc) && PageReclaim(page) &&
> -					may_enter_fs)
> -				wait_on_page_writeback(page);
> -			else {
> +			if (global_reclaim(sc) ||
> +			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
> +				/*
> +				 * This is slightly racy - end_page_writeback()
> +				 * might have just cleared PageReclaim, then
> +				 * setting PageReclaim here end up interpreted
> +				 * as PageReadahead - but that does not matter
> +				 * enough to care.  What we do want is for this
> +				 * page to have PageReclaim set next time memcg
> +				 * reclaim reaches the tests above, so it will
> +				 * then wait_on_page_writeback() to avoid OOM;
> +				 * and it's also appropriate in global reclaim.
> +				 */
> +				SetPageReclaim(page);
>  				nr_writeback++;
> -				unlock_page(page);
> -				goto keep;
> +				goto keep_locked;
>  			}
> +			wait_on_page_writeback(page);
>  		}
>  
>  		references = page_check_references(page, sc);

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
  2012-07-16  8:35                   ` Hugh Dickins
@ 2012-07-16 21:08                     ` Andrew Morton
  -1 siblings, 0 replies; 44+ messages in thread
From: Andrew Morton @ 2012-07-16 21:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Mon, 16 Jul 2012 01:35:34 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> Incremental on top of what I believe you presently have in mmotm:
> better folded in on top of Michal's original and the may_enter_fs "fix".

I think I'll keep it as a separate patch, actually.  This is a pretty
tricky and error-prone area and all those details in the changelog may
prove useful next time this code explodes.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
  2012-07-16  9:26                     ` Michal Hocko
@ 2012-07-17  4:52                       ` Hugh Dickins
  -1 siblings, 0 replies; 44+ messages in thread
From: Hugh Dickins @ 2012-07-17  4:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Mon, 16 Jul 2012, Michal Hocko wrote:
> On Mon 16-07-12 01:35:34, Hugh Dickins wrote:
> > But even so, the test still OOMs sometimes: when originally testing
> > on 3.5-rc6, it OOMed about one time in five or ten; when testing
> > just now on 3.5-rc6-mm1, it OOMed on the first iteration.
> > 
> > This residual problem comes from an accumulation of pages under
> > ordinary writeback, not marked PageReclaim, so rightly not causing
> > the memcg check to wait on their writeback: these too can prevent
> > shrink_page_list() from freeing any pages, so many times that memcg
> > reclaim fails and OOMs.
> 
> I guess you managed to trigger this with the 20M limit, right?

That's right.

> I have tested
> with different group sizes but the writeback didn't trigger for most of
> them and all the dirty data were flushed from the reclaim.

I didn't examine writeback stats to confirm, but I guess that just
occasionally it managed to come in and do enough work to confound us.

> Have you used any special setting for the dirty ratio?

No, I wasn't imaginative enough to try that.

> Or was it with xfs (IIUC that one
> does ignore writeback from the direct reclaim completely).

No, just ext4 at that point.

I have since tested the final patch with ext4, ext3 (by ext3 driver
and by ext4 driver), ext2 (by ext2 driver and by ext4 driver), xfs,
btrfs, vfat, tmpfs (with swap on the USB stick) and block device:
about an hour on each, no surprises, all okay.
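For what it's worth, driving ext3 or ext2 through the ext4 code is just a
matter of the filesystem type given at mount time, assuming the ext4
driver is built to handle the older formats; the device and mount point
here are illustrative only:

  mkfs.ext3 /dev/sdb1               # on-disk format stays ext3
  mount -t ext4 /dev/sdb1 /mnt      # but the ext4 driver services it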

But I didn't experiment beyond the 20M memcg.

Hugh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
  2012-07-17  4:52                       ` Hugh Dickins
@ 2012-07-17  6:33                         ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2012-07-17  6:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Johannes Weiner, Fengguang Wu

On Mon 16-07-12 21:52:51, Hugh Dickins wrote:
> On Mon, 16 Jul 2012, Michal Hocko wrote:
> > On Mon 16-07-12 01:35:34, Hugh Dickins wrote:
> > > But even so, the test still OOMs sometimes: when originally testing
> > > on 3.5-rc6, it OOMed about one time in five or ten; when testing
> > > just now on 3.5-rc6-mm1, it OOMed on the first iteration.
> > > 
> > > This residual problem comes from an accumulation of pages under
> > > ordinary writeback, not marked PageReclaim, so rightly not causing
> > > the memcg check to wait on their writeback: these too can prevent
> > > shrink_page_list() from freeing any pages, so many times that memcg
> > > reclaim fails and OOMs.
> > 
> > I guess you managed to trigger this with the 20M limit, right?
> 
> That's right.
> 
> > I have tested
> > with different group sizes but the writeback didn't trigger for most of
> > them and all the dirty data were flushed from the reclaim.
> 
> I didn't examine writeback stats to confirm, but I guess that just
> occasionally it managed to come in and do enough work to confound us.
> 
> > Have you used any special setting for the dirty ratio?
> 
> No, I wasn't imaginative enough to try that.
> 
> > Or was it with xfs (IIUC that one
> > does ignore writeback from the direct reclaim completely).
> 
> No, just ext4 at that point.
> 
> I have since tested the final patch with ext4, ext3 (by ext3 driver
> and by ext4 driver), ext2 (by ext2 driver and by ext4 driver), xfs,
> btrfs, vfat, tmpfs (with swap on the USB stick) and block device:
> about an hour on each, no surprises, all okay.
> 
> But I didn't experiment beyond the 20M memcg.

Great coverage anyway. Thanks a lot Hugh!

> 
> Hugh

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2012-07-17  6:33 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-19 14:50 [PATCH -mm] memcg: prevent from OOM with too many dirty pages Michal Hocko
2012-06-19 14:50 ` Michal Hocko
2012-06-19 22:00 ` Andrew Morton
2012-06-19 22:00   ` Andrew Morton
2012-06-20  8:27   ` Michal Hocko
2012-06-20  8:27     ` Michal Hocko
2012-06-20  9:20   ` Mel Gorman
2012-06-20  9:20     ` Mel Gorman
2012-06-20  9:55     ` Fengguang Wu
2012-06-20  9:55       ` Fengguang Wu
2012-06-20  9:59     ` Michal Hocko
2012-06-20  9:59       ` Michal Hocko
2012-06-20 10:11   ` [PATCH v2 " Michal Hocko
2012-06-20 10:11     ` Michal Hocko
2012-07-12  1:57     ` Hugh Dickins
2012-07-12  1:57       ` Hugh Dickins
2012-07-12  2:21       ` Andrew Morton
2012-07-12  2:21         ` Andrew Morton
2012-07-12  3:13         ` Hugh Dickins
2012-07-12  3:13           ` Hugh Dickins
2012-07-12  7:05       ` Michal Hocko
2012-07-12  7:05         ` Michal Hocko
2012-07-12 21:13         ` Andrew Morton
2012-07-12 21:13           ` Andrew Morton
2012-07-12 22:42           ` Hugh Dickins
2012-07-12 22:42             ` Hugh Dickins
2012-07-13  8:21             ` Michal Hocko
2012-07-13  8:21               ` Michal Hocko
2012-07-16  8:30               ` Hugh Dickins
2012-07-16  8:30                 ` Hugh Dickins
2012-07-16  8:35                 ` [PATCH mmotm] memcg: further prevent " Hugh Dickins
2012-07-16  8:35                   ` Hugh Dickins
2012-07-16  9:26                   ` Michal Hocko
2012-07-16  9:26                     ` Michal Hocko
2012-07-17  4:52                     ` Hugh Dickins
2012-07-17  4:52                       ` Hugh Dickins
2012-07-17  6:33                       ` Michal Hocko
2012-07-17  6:33                         ` Michal Hocko
2012-07-16 21:08                   ` Andrew Morton
2012-07-16 21:08                     ` Andrew Morton
2012-07-16  8:10         ` [PATCH v2 -mm] memcg: prevent from " Hugh Dickins
2012-07-16  8:10           ` Hugh Dickins
2012-07-16  8:48           ` Michal Hocko
2012-07-16  8:48             ` Michal Hocko
