All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, Michal Hocko <mhocko@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan@kernel.org>,
	Rik van Riel <riel@redhat.com>, Ying Han <yinghan@google.com>,
	Greg Thelen <gthelen@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Fengguang Wu <fengguang.wu@intel.com>
Subject: [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
Date: Mon, 16 Jul 2012 01:35:34 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.00.1207160131120.3936@eggly.anvils> (raw)
In-Reply-To: <alpine.LSU.2.00.1207160111280.3936@eggly.anvils>

The may_enter_fs test turns out to be too restrictive: though I saw
no problem with it when testing on 3.5-rc6, it very soon OOMed when
I tested on 3.5-rc6-mm1.  I don't know what the difference there is,
perhaps I just slightly changed the way I started off the testing:
dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync
repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick.

ext4 (and gfs2 and xfs) turn out to allocate new pages for writing
with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear
to me why the transaction needs to be started even before allocating
pagecache memory.  But it may not be worth worrying about these days:
if direct reclaim avoids FS writeback, does __GFP_FS now mean anything?

Anyway, we insisted on the may_enter_fs test to avoid hangs with the
loop device; but since that also masks off __GFP_IO, we can test for
__GFP_IO directly, ignoring may_enter_fs and __GFP_FS.

But even so, the test still OOMs sometimes: when originally testing
on 3.5-rc6, it OOMed about one time in five or ten; when testing
just now on 3.5-rc6-mm1, it OOMed on the first iteration.

This residual problem comes from an accumulation of pages under
ordinary writeback, not marked PageReclaim, so rightly not causing
the memcg check to wait on their writeback: these too can prevent
shrink_page_list() from freeing any pages, so many times that memcg
reclaim fails and OOMs.

Deal with these in the same way as direct reclaim now deals with
dirty FS pages: mark them PageReclaim.  It is appropriate to rotate
these to tail of list when writepage completes, but more importantly,
the PageReclaim flag makes memcg reclaim wait on them if encountered
again.  Increment NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.

Setting PageReclaim here may occasionally race with end_page_writeback()
clearing it: lru_deactivate_fn() already faced the same race, and
correctly concluded that the window is small and the issue non-critical.

With these changes, the test runs indefinitely without OOMing on ext4,
ext3 and ext2: I'll move on to test with other filesystems later.

Trivia: invert conditions for a clearer block without an else,
and goto keep_locked to do the unlock_page.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org (along with the patch it fixes)
---
Incremental on top of what I believe you presently have in mmotm:
better folded in on top of Michal's original and the may_enter_fs "fix".

 mm/vmscan.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

--- mmotm/mm/vmscan.c	2012-07-14 18:43:46.618738947 -0700
+++ linux/mm/vmscan.c	2012-07-15 19:28:50.038830668 -0700
@@ -723,23 +723,38 @@ static unsigned long shrink_page_list(st
 			/*
 			 * memcg doesn't have any dirty pages throttling so we
 			 * could easily OOM just because too many pages are in
-			 * writeback from reclaim and there is nothing else to
-			 * reclaim.
+			 * writeback and there is nothing else to reclaim.
 			 *
-			 * Check may_enter_fs, certainly because a loop driver
+			 * Check __GFP_IO, certainly because a loop driver
 			 * thread might enter reclaim, and deadlock if it waits
 			 * on a page for which it is needed to do the write
 			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
 			 * but more thought would probably show more reasons.
+			 *
+			 * Don't require __GFP_FS, since we're not going into
+			 * the FS, just waiting on its writeback completion.
+			 * Worryingly, ext4 gfs2 and xfs allocate pages with
+			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
+			 * testing may_enter_fs here is liable to OOM on them.
 			 */
-			if (!global_reclaim(sc) && PageReclaim(page) &&
-					may_enter_fs)
-				wait_on_page_writeback(page);
-			else {
+			if (global_reclaim(sc) ||
+			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+				/*
+				 * This is slightly racy - end_page_writeback()
+				 * might have just cleared PageReclaim, then
+				 * setting PageReclaim here end up interpreted
+				 * as PageReadahead - but that does not matter
+				 * enough to care.  What we do want is for this
+				 * page to have PageReclaim set next time memcg
+				 * reclaim reaches the tests above, so it will
+				 * then wait_on_page_writeback() to avoid OOM;
+				 * and it's also appropriate in global reclaim.
+				 */
+				SetPageReclaim(page);
 				nr_writeback++;
-				unlock_page(page);
-				goto keep;
+				goto keep_locked;
 			}
+			wait_on_page_writeback(page);
 		}
 
 		references = page_check_references(page, sc);

WARNING: multiple messages have this Message-ID (diff)
From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, Michal Hocko <mhocko@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Mel Gorman <mgorman@suse.de>, Minchan Kim <minchan@kernel.org>,
	Rik van Riel <riel@redhat.com>, Ying Han <yinghan@google.com>,
	Greg Thelen <gthelen@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Fengguang Wu <fengguang.wu@intel.com>
Subject: [PATCH mmotm] memcg: further prevent OOM with too many dirty pages
Date: Mon, 16 Jul 2012 01:35:34 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.00.1207160131120.3936@eggly.anvils> (raw)
In-Reply-To: <alpine.LSU.2.00.1207160111280.3936@eggly.anvils>

The may_enter_fs test turns out to be too restrictive: though I saw
no problem with it when testing on 3.5-rc6, it very soon OOMed when
I tested on 3.5-rc6-mm1.  I don't know what the difference there is,
perhaps I just slightly changed the way I started off the testing:
dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync
repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick.

ext4 (and gfs2 and xfs) turn out to allocate new pages for writing
with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear
to me why the transaction needs to be started even before allocating
pagecache memory.  But it may not be worth worrying about these days:
if direct reclaim avoids FS writeback, does __GFP_FS now mean anything?

Anyway, we insisted on the may_enter_fs test to avoid hangs with the
loop device; but since that also masks off __GFP_IO, we can test for
__GFP_IO directly, ignoring may_enter_fs and __GFP_FS.

But even so, the test still OOMs sometimes: when originally testing
on 3.5-rc6, it OOMed about one time in five or ten; when testing
just now on 3.5-rc6-mm1, it OOMed on the first iteration.

This residual problem comes from an accumulation of pages under
ordinary writeback, not marked PageReclaim, so rightly not causing
the memcg check to wait on their writeback: these too can prevent
shrink_page_list() from freeing any pages, so many times that memcg
reclaim fails and OOMs.

Deal with these in the same way as direct reclaim now deals with
dirty FS pages: mark them PageReclaim.  It is appropriate to rotate
these to tail of list when writepage completes, but more importantly,
the PageReclaim flag makes memcg reclaim wait on them if encountered
again.  Increment NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.

Setting PageReclaim here may occasionally race with end_page_writeback()
clearing it: lru_deactivate_fn() already faced the same race, and
correctly concluded that the window is small and the issue non-critical.

With these changes, the test runs indefinitely without OOMing on ext4,
ext3 and ext2: I'll move on to test with other filesystems later.

Trivia: invert conditions for a clearer block without an else,
and goto keep_locked to do the unlock_page.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org (along with the patch it fixes)
---
Incremental on top of what I believe you presently have in mmotm:
better folded in on top of Michal's original and the may_enter_fs "fix".

 mm/vmscan.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

--- mmotm/mm/vmscan.c	2012-07-14 18:43:46.618738947 -0700
+++ linux/mm/vmscan.c	2012-07-15 19:28:50.038830668 -0700
@@ -723,23 +723,38 @@ static unsigned long shrink_page_list(st
 			/*
 			 * memcg doesn't have any dirty pages throttling so we
 			 * could easily OOM just because too many pages are in
-			 * writeback from reclaim and there is nothing else to
-			 * reclaim.
+			 * writeback and there is nothing else to reclaim.
 			 *
-			 * Check may_enter_fs, certainly because a loop driver
+			 * Check __GFP_IO, certainly because a loop driver
 			 * thread might enter reclaim, and deadlock if it waits
 			 * on a page for which it is needed to do the write
 			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
 			 * but more thought would probably show more reasons.
+			 *
+			 * Don't require __GFP_FS, since we're not going into
+			 * the FS, just waiting on its writeback completion.
+			 * Worryingly, ext4 gfs2 and xfs allocate pages with
+			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
+			 * testing may_enter_fs here is liable to OOM on them.
 			 */
-			if (!global_reclaim(sc) && PageReclaim(page) &&
-					may_enter_fs)
-				wait_on_page_writeback(page);
-			else {
+			if (global_reclaim(sc) ||
+			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+				/*
+				 * This is slightly racy - end_page_writeback()
+				 * might have just cleared PageReclaim, then
+				 * setting PageReclaim here end up interpreted
+				 * as PageReadahead - but that does not matter
+				 * enough to care.  What we do want is for this
+				 * page to have PageReclaim set next time memcg
+				 * reclaim reaches the tests above, so it will
+				 * then wait_on_page_writeback() to avoid OOM;
+				 * and it's also appropriate in global reclaim.
+				 */
+				SetPageReclaim(page);
 				nr_writeback++;
-				unlock_page(page);
-				goto keep;
+				goto keep_locked;
 			}
+			wait_on_page_writeback(page);
 		}
 
 		references = page_check_references(page, sc);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2012-07-16  8:36 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-19 14:50 [PATCH -mm] memcg: prevent from OOM with too many dirty pages Michal Hocko
2012-06-19 14:50 ` Michal Hocko
2012-06-19 22:00 ` Andrew Morton
2012-06-19 22:00   ` Andrew Morton
2012-06-20  8:27   ` Michal Hocko
2012-06-20  8:27     ` Michal Hocko
2012-06-20  9:20   ` Mel Gorman
2012-06-20  9:20     ` Mel Gorman
2012-06-20  9:55     ` Fengguang Wu
2012-06-20  9:55       ` Fengguang Wu
2012-06-20  9:59     ` Michal Hocko
2012-06-20  9:59       ` Michal Hocko
2012-06-20 10:11   ` [PATCH v2 " Michal Hocko
2012-06-20 10:11     ` Michal Hocko
2012-07-12  1:57     ` Hugh Dickins
2012-07-12  1:57       ` Hugh Dickins
2012-07-12  2:21       ` Andrew Morton
2012-07-12  2:21         ` Andrew Morton
2012-07-12  3:13         ` Hugh Dickins
2012-07-12  3:13           ` Hugh Dickins
2012-07-12  7:05       ` Michal Hocko
2012-07-12  7:05         ` Michal Hocko
2012-07-12 21:13         ` Andrew Morton
2012-07-12 21:13           ` Andrew Morton
2012-07-12 22:42           ` Hugh Dickins
2012-07-12 22:42             ` Hugh Dickins
2012-07-13  8:21             ` Michal Hocko
2012-07-13  8:21               ` Michal Hocko
2012-07-16  8:30               ` Hugh Dickins
2012-07-16  8:30                 ` Hugh Dickins
2012-07-16  8:35                 ` Hugh Dickins [this message]
2012-07-16  8:35                   ` [PATCH mmotm] memcg: further prevent " Hugh Dickins
2012-07-16  9:26                   ` Michal Hocko
2012-07-16  9:26                     ` Michal Hocko
2012-07-17  4:52                     ` Hugh Dickins
2012-07-17  4:52                       ` Hugh Dickins
2012-07-17  6:33                       ` Michal Hocko
2012-07-17  6:33                         ` Michal Hocko
2012-07-16 21:08                   ` Andrew Morton
2012-07-16 21:08                     ` Andrew Morton
2012-07-16  8:10         ` [PATCH v2 -mm] memcg: prevent from " Hugh Dickins
2012-07-16  8:10           ` Hugh Dickins
2012-07-16  8:48           ` Michal Hocko
2012-07-16  8:48             ` Michal Hocko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.00.1207160131120.3936@eggly.anvils \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=fengguang.wu@intel.com \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=riel@redhat.com \
    --cc=yinghan@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.