* [RFC -mm] memcg: prevent from OOM with too many dirty pages
@ 2012-05-28 15:38 Michal Hocko
  2012-05-29  3:08 ` Fengguang Wu
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2012-05-28 15:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki, Mel Gorman,
	Minchan Kim, Rik van Riel, Ying Han, Johannes Weiner,
	Greg Thelen, Hugh Dickins

Current implementation of dirty pages throttling is not memcg aware which makes
it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
hard limit is small and so the lists are scanned faster than pages written
back.

This patch fixes the problem by throttling the allocating process (possibly
a writer) during the hard limit reclaim by waiting on PageReclaim pages.
We are waiting only for PageReclaim pages because those are the pages
that made one full round over LRU and that means that the writeback is much
slower than scanning.
The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.
We are seeing this happening during nightly backups which are placed into
containers to prevent from eviction of the real working set.

The change affects only memcg reclaim, and only when we encounter PageReclaim
pages, which is a signal that reclaim is not keeping up with the writers and
somebody should be throttled. This could potentially be unfair because somebody
else from the group might get throttled on behalf of the writer, but as writers
need to allocate as well, and they do so at a higher rate, the probability
that only innocent processes would be penalized is not that high.

I have tested this change with a simple dd copying /dev/zero to tmpfs or ext3
running under a small memcg (a 1G copy under 5M, 60M, 300M and 2G containers);
dd got killed by the OOM killer every time. With the patch I could run the dd
with the same size under the 5M controller without any OOM.
The issue is more visible with slower output devices.

* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/vmscan.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c978ce4..7cccd81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,9 +720,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
-			nr_writeback++;
-			unlock_page(page);
-			goto keep;
+			/*
+			 * memcg doesn't have any dirty pages throttling so we
+			 * could easily OOM just because too many pages are in
+			 * writeback from reclaim and there is nothing else to
+			 * reclaim.
+			 */
+			if (PageReclaim(page)
+					&& may_enter_fs && !global_reclaim(sc))
+				wait_on_page_writeback(page);
+			else {
+				nr_writeback++;
+				unlock_page(page);
+				goto keep;
+			}
 		}
 
 		references = page_check_references(page, sc);
-- 
1.7.10



* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-28 15:38 [RFC -mm] memcg: prevent from OOM with too many dirty pages Michal Hocko
@ 2012-05-29  3:08 ` Fengguang Wu
  2012-05-29  7:28   ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Fengguang Wu @ 2012-05-29  3:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrew Morton, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Johannes Weiner,
	Greg Thelen, Hugh Dickins

Hi Michal,

On Mon, May 28, 2012 at 05:38:55PM +0200, Michal Hocko wrote:
> Current implementation of dirty pages throttling is not memcg aware which makes
> it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> hard limit is small and so the lists are scanned faster than pages written
> back.
> 
> This patch fixes the problem by throttling the allocating process (possibly
> a writer) during the hard limit reclaim by waiting on PageReclaim pages.
> We are waiting only for PageReclaim pages because those are the pages
> that made one full round over LRU and that means that the writeback is much
> slower than scanning.
> The solution is far from being ideal - long term solution is memcg aware
> dirty throttling - but it is meant to be a band aid until we have a real
> fix.

IMHO it's still an important "band aid" -- perhaps worthwhile for
sending to Greg's stable trees. Because it fixes a really important
use case: it enables the users to put backups into a small memcg.

The user-visible change is:

        the backup program gets OOM killed
=>
        it now runs, although a bit slow and bumpy


Talking about the more comprehensive fix, I'm sorry for delaying this
patch for several months:

        [PATCH 6/9] vmscan: dirty reclaim throttling
        http://thread.gmane.org/gmane.linux.kernel.mm/74582

Now that Jan's iput() avoidance patchset has just been merged by
Linus, I'll rebase the patchset on top of his great work.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c978ce4..7cccd81 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -720,9 +720,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>  
>  		if (PageWriteback(page)) {
> -			nr_writeback++;
> -			unlock_page(page);
> -			goto keep;
> +			/*
> +			 * memcg doesn't have any dirty pages throttling so we
> +			 * could easily OOM just because too many pages are in
> +			 * writeback from reclaim and there is nothing else to
> +			 * reclaim.
> +			 */
> +			if (PageReclaim(page)
> +					&& may_enter_fs && !global_reclaim(sc))
> +				wait_on_page_writeback(page);
> +			else {
> +				nr_writeback++;
> +				unlock_page(page);
> +				goto keep;
> +			}
>  		}
>  
>  		references = page_check_references(page, sc);

Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>

This is all good for memcg. I'd suggest sending it to 3.5 and 3.4.x as well.

Thanks,
Fengguang


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29  3:08 ` Fengguang Wu
@ 2012-05-29  7:28   ` Johannes Weiner
  2012-05-29  8:48     ` Fengguang Wu
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2012-05-29  7:28 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Tue, May 29, 2012 at 11:08:57AM +0800, Fengguang Wu wrote:
> Hi Michal,
> 
> On Mon, May 28, 2012 at 05:38:55PM +0200, Michal Hocko wrote:
> > Current implementation of dirty pages throttling is not memcg aware which makes
> > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > hard limit is small and so the lists are scanned faster than pages written
> > back.
> > 
> > This patch fixes the problem by throttling the allocating process (possibly
> > a writer) during the hard limit reclaim by waiting on PageReclaim pages.
> > We are waiting only for PageReclaim pages because those are the pages
> > that made one full round over LRU and that means that the writeback is much
> > slower than scanning.
> > The solution is far from being ideal - long term solution is memcg aware
> > dirty throttling - but it is meant to be a band aid until we have a real
> > fix.
> 
> IMHO it's still an important "band aid" -- perhaps worthwhile for
> sending to Greg's stable trees. Because it fixes a really important
> use case: it enables the users to put backups into a small memcg.
> 
> The users visible changes are:
> 
>         the backup program get OOM killed
> =>
>         it runs now, although being a bit slow and bumpy

The problem is workloads that /don't/ have excessive dirty pages, but
instantiate clean page cache at a much faster rate than writeback can
clean the few dirties.  The dirty/writeback pages reach the end of the
lru several times while there are always easily reclaimable pages
around.

This was the rationale for introducing the backoff function that
considers the dirty page percentage of all pages looked at (bottom of
shrink_active_list) and removing all other sleeps that didn't look at
the bigger picture and made problems.  I'd hate for them to come back.

On the other hand, is there a chance to make this backoff function
work for memcgs?  Right now it only applies to the global case to not
mark a whole zone congested because of some dirty pages on a single
memcg LRU.  But maybe it can work by considering congestion on a
per-lruvec basis rather than per-zone?


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29  7:28   ` Johannes Weiner
@ 2012-05-29  8:48     ` Fengguang Wu
  2012-05-29  9:35       ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Fengguang Wu @ 2012-05-29  8:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Tue, May 29, 2012 at 09:28:53AM +0200, Johannes Weiner wrote:
> On Tue, May 29, 2012 at 11:08:57AM +0800, Fengguang Wu wrote:
> > Hi Michal,
> > 
> > On Mon, May 28, 2012 at 05:38:55PM +0200, Michal Hocko wrote:
> > > Current implementation of dirty pages throttling is not memcg aware which makes
> > > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > > hard limit is small and so the lists are scanned faster than pages written
> > > back.
> > > 
> > > This patch fixes the problem by throttling the allocating process (possibly
> > > a writer) during the hard limit reclaim by waiting on PageReclaim pages.
> > > We are waiting only for PageReclaim pages because those are the pages
> > > that made one full round over LRU and that means that the writeback is much
> > > slower than scanning.
> > > The solution is far from being ideal - long term solution is memcg aware
> > > dirty throttling - but it is meant to be a band aid until we have a real
> > > fix.
> > 
> > IMHO it's still an important "band aid" -- perhaps worthwhile for
> > sending to Greg's stable trees. Because it fixes a really important
> > use case: it enables the users to put backups into a small memcg.
> > 
> > The users visible changes are:
> > 
> >         the backup program get OOM killed
> > =>
> >         it runs now, although being a bit slow and bumpy
> 
> The problem is workloads that /don't/ have excessive dirty pages, but
> instantiate clean page cache at a much faster rate than writeback can
> clean the few dirties.  The dirty/writeback pages reach the end of the
> lru several times while there are always easily reclaimable pages
> around.

Good point!

> This was the rationale for introducing the backoff function that
> considers the dirty page percentage of all pages looked at (bottom of
> shrink_active_list) and removing all other sleeps that didn't look at
> the bigger picture and made problems.  I'd hate for them to come back.
> 
> On the other hand, is there a chance to make this backoff function
> work for memcgs?  Right now it only applies to the global case to not
> mark a whole zone congested because of some dirty pages on a single
> memcg LRU.  But maybe it can work by considering congestion on a
> per-lruvec basis rather than per-zone?

Johannes, would you paste the backoff code? Sorry I'm not sure about
the exact logic you are talking. 

As for this patch, can it be improved by adding some test like
(priority < DEF_PRIORITY/2)? That should reasonably filter out the
"fast read rotating dirty pages fast" situation and still avoid OOM
for "heavy write inside small memcg".

Thanks,
Fengguang


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29  8:48     ` Fengguang Wu
@ 2012-05-29  9:35       ` Johannes Weiner
  2012-05-29 10:21         ` Fengguang Wu
                           ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Johannes Weiner @ 2012-05-29  9:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Tue, May 29, 2012 at 04:48:48PM +0800, Fengguang Wu wrote:
> On Tue, May 29, 2012 at 09:28:53AM +0200, Johannes Weiner wrote:
> > On Tue, May 29, 2012 at 11:08:57AM +0800, Fengguang Wu wrote:
> > > Hi Michal,
> > > 
> > > On Mon, May 28, 2012 at 05:38:55PM +0200, Michal Hocko wrote:
> > > > Current implementation of dirty pages throttling is not memcg aware which makes
> > > > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > > > hard limit is small and so the lists are scanned faster than pages written
> > > > back.
> > > > 
> > > > This patch fixes the problem by throttling the allocating process (possibly
> > > > a writer) during the hard limit reclaim by waiting on PageReclaim pages.
> > > > We are waiting only for PageReclaim pages because those are the pages
> > > > that made one full round over LRU and that means that the writeback is much
> > > > slower than scanning.
> > > > The solution is far from being ideal - long term solution is memcg aware
> > > > dirty throttling - but it is meant to be a band aid until we have a real
> > > > fix.
> > > 
> > > IMHO it's still an important "band aid" -- perhaps worthwhile for
> > > sending to Greg's stable trees. Because it fixes a really important
> > > use case: it enables the users to put backups into a small memcg.
> > > 
> > > The users visible changes are:
> > > 
> > >         the backup program get OOM killed
> > > =>
> > >         it runs now, although being a bit slow and bumpy
> > 
> > The problem is workloads that /don't/ have excessive dirty pages, but
> > instantiate clean page cache at a much faster rate than writeback can
> > clean the few dirties.  The dirty/writeback pages reach the end of the
> > lru several times while there are always easily reclaimable pages
> > around.
> 
> Good point!
> 
> > This was the rationale for introducing the backoff function that
> > considers the dirty page percentage of all pages looked at (bottom of
> > shrink_active_list) and removing all other sleeps that didn't look at
> > the bigger picture and made problems.  I'd hate for them to come back.
> > 
> > On the other hand, is there a chance to make this backoff function
> > work for memcgs?  Right now it only applies to the global case to not
> > mark a whole zone congested because of some dirty pages on a single
> > memcg LRU.  But maybe it can work by considering congestion on a
> > per-lruvec basis rather than per-zone?
> 
> Johannes, would you paste the backoff code? Sorry I'm not sure about
> the exact logic you are talking.

Sure, it's this guy here:

        /*
         * If reclaim is isolating dirty pages under writeback, it implies
         * that the long-lived page allocation rate is exceeding the page
         * laundering rate. Either the global limits are not being effective
         * at throttling processes due to the page distribution throughout
         * zones or there is heavy usage of a slow backing device. The
         * only option is to throttle from reclaim context which is not ideal
         * as there is no guarantee the dirtying process is throttled in the
         * same way balance_dirty_pages() manages.
         *
         * This scales the number of dirty pages that must be under writeback
         * before throttling depending on priority. It is a simple backoff
         * function that has the most effect in the range DEF_PRIORITY to
         * DEF_PRIORITY-2 which is the priority reclaim is considered to be
         * in trouble and reclaim is considered to be in trouble.
         *
         * DEF_PRIORITY   100% isolated pages must be PageWriteback to throttle
         * DEF_PRIORITY-1  50% must be PageWriteback
         * DEF_PRIORITY-2  25% must be PageWriteback, kswapd in trouble
         * ...
         * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
         *                     isolated page is PageWriteback
         */
        if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

But the problem is the part declaring the zone congested:

        /*
         * Tag a zone as congested if all the dirty pages encountered were
         * backed by a congested BDI. In this case, reclaimers should just
         * back off and wait for congestion to clear because further reclaim
         * will encounter the same problem
         */
        if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
                zone_set_flag(mz->zone, ZONE_CONGESTED);

Note the global_reclaim().  It would be nice to have these two operate
against the lruvec of sc->target_mem_cgroup and mz->zone instead.  The
problem is that ZONE_CONGESTED clearing happens in kswapd alone, which
is not necessarily involved in a memcg-constrained load, so we need to
find clearing sites that work for both global and memcg reclaim.
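
Very roughly, and purely as a sketch with made-up names (there is no
per-lruvec congested state today), the two hunks could turn into
something like:

	/* set side: tag the lruvec instead of the whole zone */
	if (nr_dirty && nr_dirty == nr_congested)
		set_bit(LRUVEC_CONGESTED, &lruvec->flags);

	/* backoff side: only stall when this lruvec is marked congested */
	if (nr_writeback &&
	    nr_writeback >= (nr_taken >> (DEF_PRIORITY - priority)) &&
	    test_bit(LRUVEC_CONGESTED, &lruvec->flags))
		congestion_wait(BLK_RW_ASYNC, HZ/10);

	/* clear side: the open question -- it must be reached by both
	 * global and memcg reclaim, e.g. when a batch gets reclaimed
	 * without running into dirty or writeback pages */
	if (!nr_dirty && !nr_writeback)
		clear_bit(LRUVEC_CONGESTED, &lruvec->flags);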

> As for this patch, can it be improved by adding some test like
> (priority < DEF_PRIORITY/2)? That should reasonably filter out the
> "fast read rotating dirty pages fast" situation and still avoid OOM
> for "heavy write inside small memcg".

I think we tried these thresholds for global sync reclaim, too, but
couldn't find the right value.  IIRC, we tried to strike a balance
between excessive stalls and wasting CPU, but obviously the CPU
wasting is not a concern because that is completely uninhibited right
now for memcg reclaim.  So it may be an improvement if I didn't miss
anything.  Maybe Mel remembers more?

It'd still be preferable to keep the differences between memcg and
global reclaim at a minimum, though, and extend the dirty throttling
we already have.


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29  9:35       ` Johannes Weiner
@ 2012-05-29 10:21         ` Fengguang Wu
  2012-05-29 13:32         ` Mel Gorman
  2012-05-29 13:51         ` Michal Hocko
  2 siblings, 0 replies; 15+ messages in thread
From: Fengguang Wu @ 2012-05-29 10:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Tue, May 29, 2012 at 11:35:11AM +0200, Johannes Weiner wrote:
> On Tue, May 29, 2012 at 04:48:48PM +0800, Fengguang Wu wrote:
> > On Tue, May 29, 2012 at 09:28:53AM +0200, Johannes Weiner wrote:
> > > On Tue, May 29, 2012 at 11:08:57AM +0800, Fengguang Wu wrote:
> > > > Hi Michal,
> > > > 
> > > > On Mon, May 28, 2012 at 05:38:55PM +0200, Michal Hocko wrote:
> > > > > Current implementation of dirty pages throttling is not memcg aware which makes
> > > > > it easy to have LRUs full of dirty pages which might lead to memcg OOM if the
> > > > > hard limit is small and so the lists are scanned faster than pages written
> > > > > back.
> > > > > 
> > > > > This patch fixes the problem by throttling the allocating process (possibly
> > > > > a writer) during the hard limit reclaim by waiting on PageReclaim pages.
> > > > > We are waiting only for PageReclaim pages because those are the pages
> > > > > that made one full round over LRU and that means that the writeback is much
> > > > > slower than scanning.
> > > > > The solution is far from being ideal - long term solution is memcg aware
> > > > > dirty throttling - but it is meant to be a band aid until we have a real
> > > > > fix.
> > > > 
> > > > IMHO it's still an important "band aid" -- perhaps worthwhile for
> > > > sending to Greg's stable trees. Because it fixes a really important
> > > > use case: it enables the users to put backups into a small memcg.
> > > > 
> > > > The users visible changes are:
> > > > 
> > > >         the backup program get OOM killed
> > > > =>
> > > >         it runs now, although being a bit slow and bumpy
> > > 
> > > The problem is workloads that /don't/ have excessive dirty pages, but
> > > instantiate clean page cache at a much faster rate than writeback can
> > > clean the few dirties.  The dirty/writeback pages reach the end of the
> > > lru several times while there are always easily reclaimable pages
> > > around.
> > 
> > Good point!
> > 
> > > This was the rationale for introducing the backoff function that
> > > considers the dirty page percentage of all pages looked at (bottom of
> > > shrink_active_list) and removing all other sleeps that didn't look at
> > > the bigger picture and made problems.  I'd hate for them to come back.
> > > 
> > > On the other hand, is there a chance to make this backoff function
> > > work for memcgs?  Right now it only applies to the global case to not
> > > mark a whole zone congested because of some dirty pages on a single
> > > memcg LRU.  But maybe it can work by considering congestion on a
> > > per-lruvec basis rather than per-zone?
> > 
> > Johannes, would you paste the backoff code? Sorry I'm not sure about
> > the exact logic you are talking.
> 
> Sure, it's this guy here:

Yeah I knew this code, but it's in shrink_inactive_list() ;)

>         /*
>          * If reclaim is isolating dirty pages under writeback, it implies
>          * that the long-lived page allocation rate is exceeding the page
>          * laundering rate. Either the global limits are not being effective
>          * at throttling processes due to the page distribution throughout
>          * zones or there is heavy usage of a slow backing device. The
>          * only option is to throttle from reclaim context which is not ideal
>          * as there is no guarantee the dirtying process is throttled in the
>          * same way balance_dirty_pages() manages.
>          *
>          * This scales the number of dirty pages that must be under writeback
>          * before throttling depending on priority. It is a simple backoff
>          * function that has the most effect in the range DEF_PRIORITY to
>          * DEF_PRIORITY-2 which is the priority reclaim is considered to be
>          * in trouble and reclaim is considered to be in trouble.
>          *
>          * DEF_PRIORITY   100% isolated pages must be PageWriteback to throttle
>          * DEF_PRIORITY-1  50% must be PageWriteback
>          * DEF_PRIORITY-2  25% must be PageWriteback, kswapd in trouble
>          * ...
>          * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
>          *                     isolated page is PageWriteback
>          */
>         if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
>                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> 
> But the problem is the part declaring the zone congested:
> 
>         /*
>          * Tag a zone as congested if all the dirty pages encountered were
>          * backed by a congested BDI. In this case, reclaimers should just
>          * back off and wait for congestion to clear because further reclaim
>          * will encounter the same problem
>          */
>         if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
>                 zone_set_flag(mz->zone, ZONE_CONGESTED);
> 
> Note the global_reclaim().  It would be nice to have these two operate
> against the lruvec of sc->target_mem_cgroup and mz->zone instead.  The
> problem is that ZONE_CONGESTED clearing happens in kswapd alone, which
> is not necessarily involved in a memcg-constrained load, so we need to
> find clearing sites that work for both global and memcg reclaim.

The problem of the above backoff logic is, both the conditions

>         if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))

and

>         if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))

are based on local nr_writeback/nr_dirty values. "local" means inside
one SWAP_CLUSTER_MAX=32 batch. So if there is a continuous run of 32
dirty/writeback pages in the LRU, which is a common case even if there
are less than 20% dirty pages, the above conditions could accidentally
evaluate to true.

So in the long term, we may consider the opposite way: replace it with
a (PageReclaim && priority < X) test, where the priority test is more
global in scope.

For now, "priority" is not very stable. I often observe it being
knocked down to small values (eg. 5) due to the uneven distribution of
dirty pages over the LRU. But once we put dirty pages on a standalone
LRU list, "priority" will no longer bounce up and down that often,
since it will no longer be so easily affected by the distribution of
dirty pages.

> > As for this patch, can it be improved by adding some test like
> > (priority < DEF_PRIORITY/2)? That should reasonably filter out the
> > "fast read rotating dirty pages fast" situation and still avoid OOM
> > for "heavy write inside small memcg".
> 
> I think we tried these thresholds for global sync reclaim, too, but
> couldn't find the right value.  IIRC, we tried to strike a balance
> between excessive stalls and wasting CPU, but obviously the CPU
> wasting is not a concern because that is completely uninhibited right
> now for memcg reclaim.  So it may be an improvement if I didn't miss
> anything.  Maybe Mel remembers more?
> 
> It'd still be preferrable to keep the differences between memcg and
> global reclaim at a minimum, though, and extend the dirty throttling
> we already have.

Yeah, we'll be introducing yet another magic value... Here we make
things simple by limiting the goal to avoiding OOM in a small memcg and
ignoring other CPU/stall issues. For this target, it seems good to
choose a very low priority. For example, (priority < 3), which means we
have scanned 1/(2^2) = 25% of the pages and found them dirty/writeback,
slightly more than the 20% global dirty limit.

Thanks,
Fengguang


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29  9:35       ` Johannes Weiner
  2012-05-29 10:21         ` Fengguang Wu
@ 2012-05-29 13:32         ` Mel Gorman
  2012-05-29 13:51         ` Michal Hocko
  2 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2012-05-29 13:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Fengguang Wu, Michal Hocko, linux-mm, linux-kernel,
	Andrew Morton, KAMEZAWA Hiroyuki, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

I'm afraid this response will be a little rushed. I'm offline from
Thursday for almost a week and trying to get other bits and pieces tied
up before then.

On Tue, May 29, 2012 at 11:35:11AM +0200, Johannes Weiner wrote:
> > > <SNIP>
> > > This was the rationale for introducing the backoff function that
> > > considers the dirty page percentage of all pages looked at (bottom of
> > > shrink_active_list) and removing all other sleeps that didn't look at
> > > the bigger picture and made problems.  I'd hate for them to come back.
> > > 
> > > On the other hand, is there a chance to make this backoff function
> > > work for memcgs?  Right now it only applies to the global case to not
> > > mark a whole zone congested because of some dirty pages on a single
> > > memcg LRU.  But maybe it can work by considering congestion on a
> > > per-lruvec basis rather than per-zone?
> > 
> > Johannes, would you paste the backoff code? Sorry I'm not sure about
> > the exact logic you are talking.
> 
> Sure, it's this guy here:
> 
>         /*
>          * If reclaim is isolating dirty pages under writeback, it implies
>          * that the long-lived page allocation rate is exceeding the page
>          * laundering rate. Either the global limits are not being effective
>          * at throttling processes due to the page distribution throughout
>          * zones or there is heavy usage of a slow backing device. The
>          * only option is to throttle from reclaim context which is not ideal
>          * as there is no guarantee the dirtying process is throttled in the
>          * same way balance_dirty_pages() manages.
>          *
>          * This scales the number of dirty pages that must be under writeback
>          * before throttling depending on priority. It is a simple backoff
>          * function that has the most effect in the range DEF_PRIORITY to
>          * DEF_PRIORITY-2 which is the priority reclaim is considered to be
>          * in trouble and reclaim is considered to be in trouble.
>          *
>          * DEF_PRIORITY   100% isolated pages must be PageWriteback to throttle
>          * DEF_PRIORITY-1  50% must be PageWriteback
>          * DEF_PRIORITY-2  25% must be PageWriteback, kswapd in trouble
>          * ...
>          * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
>          *                     isolated page is PageWriteback
>          */
>         if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
>                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> 
> But the problem is the part declaring the zone congested:
> 
>         /*
>          * Tag a zone as congested if all the dirty pages encountered were
>          * backed by a congested BDI. In this case, reclaimers should just
>          * back off and wait for congestion to clear because further reclaim
>          * will encounter the same problem
>          */
>         if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
>                 zone_set_flag(mz->zone, ZONE_CONGESTED);
> 
> Note the global_reclaim().  It would be nice to have these two operate
> against the lruvec of sc->target_mem_cgroup and mz->zone instead.  The
> problem is that ZONE_CONGESTED clearing happens in kswapd alone, which
> is not necessarily involved in a memcg-constrained load, so we need to
> find clearing sites that work for both global and memcg reclaim.
> 

I talked with Michal about this a bit offline and I hope I represent the
discussion fairly.

In my opinion it would be quite complex but you could create
a MEMCG_CONGESTED that is set under similar circumstances to
ZONE_CONGESTED. The ideal would be to clear it when enough writeback had
happened within a memcg to be below the limits but the accounting for
that just isn't there. An alternative would be to clear the flag if *any*
page within that memcg gets cleaned which is clumsy but better than nothing
when we cannot depend on kswapd.

Once that is in place I can see two ways of acting based on the congestion
information:

1. wait_on_page_writeback iff MEMCG_CONGESTED as Michal's patch does

2. Teach wait_iff_congested() about sleeping based on either the zone or
  the memcg, i.e. sleep for either HZ/10 or until the MEMCG_CONGESTED flag
  gets cleared. It would then need to loop again after sleeping to ensure
  it didn't go OOM (a rough sketch of this follows below).
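
With made-up names (MEMCG_CONGESTED and the memcg->flags word below do
not exist), option 2 might look roughly like:

	static void memcg_wait_iff_congested(struct mem_cgroup *memcg,
					     struct zone *zone, long timeout)
	{
		unsigned long expire = jiffies + timeout;

		/* global reclaim keeps the existing zone-based backoff */
		if (!memcg) {
			wait_iff_congested(zone, BLK_RW_ASYNC, timeout);
			return;
		}

		/* only stall if reclaim marked this memcg as congested */
		if (!test_bit(MEMCG_CONGESTED, &memcg->flags))
			return;

		/*
		 * Crude stand-in for "sleep until the flag is cleared":
		 * poll the flag in small congestion_wait() chunks up to
		 * the full timeout.  A real version would rather sleep
		 * on a waitqueue woken by whatever clears the flag.
		 */
		while (time_before(jiffies, expire) &&
		       test_bit(MEMCG_CONGESTED, &memcg->flags))
			congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}

The caller would still need to recheck reclaim progress after the stall,
as noted above, to make sure it does not go OOM.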

Of course, an important thing to remember is that even *if* this works
better, it shouldn't stop us from starting with Michal's patch as a
not-very-pretty-but-avoids-KABLAMO solution and merging that to mainline and
-stable. Then we can either build upon it or revert it when a proper solution
is found. Don't ignore something that works and fixes a serious problem just
because something better may or may not exist :)

> > As for this patch, can it be improved by adding some test like
> > (priority < DEF_PRIORITY/2)? That should reasonably filter out the
> > "fast read rotating dirty pages fast" situation and still avoid OOM
> > for "heavy write inside small memcg".
> 
> I think we tried these thresholds for global sync reclaim, too, but
> couldn't find the right value. 

Yes.

> IIRC, we tried to strike a balance
> between excessive stalls and wasting CPU, but obviously the CPU
> wasting is not a concern because that is completely uninhibited right
> now for memcg reclaim.  So it may be an improvement if I didn't miss
> anything.  Maybe Mel remembers more?
> 

I think that would just be fiddling around with thresholds, and how effective
it is would depend on too many factors such as memcg size and the speed of the
backing storage. I would prefer that memcgs behave like global reclaim in this
respect: set a congested flag, block iff it is set, and if a task does block,
wake it up again when that flag is cleared.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29  9:35       ` Johannes Weiner
  2012-05-29 10:21         ` Fengguang Wu
  2012-05-29 13:32         ` Mel Gorman
@ 2012-05-29 13:51         ` Michal Hocko
  2012-05-31  9:09           ` Michal Hocko
  2012-05-31 15:18           ` Fengguang Wu
  2 siblings, 2 replies; 15+ messages in thread
From: Michal Hocko @ 2012-05-29 13:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Fengguang Wu, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Tue 29-05-12 11:35:11, Johannes Weiner wrote:
[...]
>         if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
>                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> 
> But the problem is the part declaring the zone congested:
> 
>         /*
>          * Tag a zone as congested if all the dirty pages encountered were
>          * backed by a congested BDI. In this case, reclaimers should just
>          * back off and wait for congestion to clear because further reclaim
>          * will encounter the same problem
>          */
>         if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
>                 zone_set_flag(mz->zone, ZONE_CONGESTED);
> 
> Note the global_reclaim().  It would be nice to have these two operate
> against the lruvec of sc->target_mem_cgroup and mz->zone instead.  The
> problem is that ZONE_CONGESTED clearing happens in kswapd alone, which
> is not necessarily involved in a memcg-constrained load, so we need to
> find clearing sites that work for both global and memcg reclaim.

OK, I have tried it with a simpler approach:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c978ce4..e45cf2a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1294,8 +1294,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 *                     isolated page is PageWriteback
 	 */
 	if (nr_writeback && nr_writeback >=
-			(nr_taken >> (DEF_PRIORITY - sc->priority)))
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
+		if (global_reclaim(sc))
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+		else
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+	}
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),

without a 'lruvec-zone' congestion flag and it worked reasonably well, for
my testcase at least (no OOM). We could still stall even if we managed
to write back pages in the meantime, but we should at least prevent
the problem you are mentioning (most of the time).

The issue with lruvec-zone tagging is, as you mentioned, that the
flag clearing places are not that easy to get right because we do
not have anything like zone_watermark_ok in a memcg context. I am even
wondering whether it is possible without per-memcg dirty accounting.

To be honest, I was considering congestion waiting at the beginning as
well, but I hate using an arbitrary timeout when we are, in fact, waiting
for a specific event.
Nevertheless I do acknowledge your concern about accidentally hitting
PageReclaim pages in the middle of the LRU because of clean page cache,
which would lead to unnecessary stalls.

I have updated the test case to do a parallel read with the write (read
from an existing file, same size, out=/dev/null) and compared the results:

* congestion_wait approach
==========================
* input file on a tmpfs so the read should be really fast:
----------------------------------------------------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
read 1048576000 bytes (1.0 GB) copied, 0.785611 s, 1.3 GB/s
write 1048576000 bytes (1.0 GB) copied, 27.4083 s, 38.3 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
read 1048576000 bytes (1.0 GB) copied, 0.844437 s, 1.2 GB/s
write 1048576000 bytes (1.0 GB) copied, 29.9868 s, 35.0 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
read 1048576000 bytes (1.0 GB) copied, 0.793694 s, 1.3 GB/s
write 1048576000 bytes (1.0 GB) copied, 21.3534 s, 49.1 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
read 1048576000 bytes (1.0 GB) copied, 1.44286 s, 727 MB/s
write 1048576000 bytes (1.0 GB) copied, 20.8535 s, 50.3 MB/s

* input file on the ext3 (same partition)
-----------------------------------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
write 1048576000 bytes (1.0 GB) copied, 49.7673 s, 21.1 MB/s
read 1048576000 bytes (1.0 GB) copied, 59.5391 s, 17.6 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
write 1048576000 bytes (1.0 GB) copied, 36.8087 s, 28.5 MB/s
read 1048576000 bytes (1.0 GB) copied, 50.1079 s, 20.9 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
write 1048576000 bytes (1.0 GB) copied, 29.9918 s, 35.0 MB/s
read 1048576000 bytes (1.0 GB) copied, 47.2997 s, 22.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
write 1048576000 bytes (1.0 GB) copied, 27.6548 s, 37.9 MB/s
read 1048576000 bytes (1.0 GB) copied, 41.6577 s, 25.2 MB/s

* PageReclaim approach [percentages relative to the congestion_wait run = 100%]
======================
* input file on a tmpfs:
------------------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
read 1048576000 bytes (1.0 GB) copied, 0.820246 s, 1.3 GB/s	[104.4%]
write 1048576000 bytes (1.0 GB) copied, 28.6641 s, 36.6 MB/s	[104.5%]
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
read 1048576000 bytes (1.0 GB) copied, 0.858179 s, 1.2 GB/s	[101.6%]
write 1048576000 bytes (1.0 GB) copied, 32.4644 s, 32.3 MB/s	[108.2%]
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
read 1048576000 bytes (1.0 GB) copied, 0.853459 s, 1.2 GB/s	[107.5%]
write 1048576000 bytes (1.0 GB) copied, 25.0716 s, 41.8 MB/s	[117.4%]
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
read 1048576000 bytes (1.0 GB) copied, 0.854251 s, 1.2 GB/s	[ 59.2%]
write 1048576000 bytes (1.0 GB) copied, 14.7382 s, 71.1 MB/s	[ 70.7%]

* input file on the ext3 (same partition)
-----------------------------------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
read 1048576000 bytes (1.0 GB) copied, 57.1462 s, 18.3 MB/s	[114.8%]
write 1048576000 bytes (1.0 GB) copied, 64.8275 s, 16.2 MB/s	[108.9%]
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
write 1048576000 bytes (1.0 GB) copied, 37.4216 s, 28.0 MB/s	[101.7%]
read 1048576000 bytes (1.0 GB) copied, 49.3022 s, 21.3 MB/s	[ 98.4%]
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
write 1048576000 bytes (1.0 GB) copied, 30.2872 s, 34.6 MB/s	[101.0%]
read 1048576000 bytes (1.0 GB) copied, 48.9104 s, 21.4 MB/s	[103.4%]
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
write 1048576000 bytes (1.0 GB) copied, 21.1995 s, 49.5 MB/s	[ 76.7%]
read 1048576000 bytes (1.0 GB) copied, 49.1416 s, 21.3 MB/s	[118.8%]

As a conclusion congestion wait performs better (even though I haven't
done repeated testing to see what is the deviation) when the
reader/writer size doesn't fit into the memcg, while it performs much
worse (at least for writer) if it does fit.

I will play with that some more
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29 13:51         ` Michal Hocko
@ 2012-05-31  9:09           ` Michal Hocko
  2012-06-01  8:37             ` Michal Hocko
  2012-05-31 15:18           ` Fengguang Wu
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2012-05-31  9:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Fengguang Wu, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 4934 bytes --]

On Tue 29-05-12 15:51:01, Michal Hocko wrote:
[...]
> OK, I have tried it with a simpler approach:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c978ce4..e45cf2a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1294,8 +1294,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	 *                     isolated page is PageWriteback
>  	 */
>  	if (nr_writeback && nr_writeback >=
> -			(nr_taken >> (DEF_PRIORITY - sc->priority)))
> -		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
> +		if (global_reclaim(sc))
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +		else
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	}
>  
>  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>  		zone_idx(zone),
> 
[...]
> As a conclusion congestion wait performs better (even though I haven't
> done repeated testing to see what is the deviation) when the
> reader/writer size doesn't fit into the memcg, while it performs much
> worse (at least for writer) if it does fit.
> 
> I will play with that some more

I have, yet again, updated the test. I am writing data to a USB stick
(with ext3, mounted in sync mode) which writes 1G in 274.518s (3.8MB/s),
so the storage is really slow. The parallel read is performed from tmpfs
and from a local ext3 partition (the testing script is attached).
We start with the write so that the LRUs already hold some dirty pages
when the read starts and fills the LRU up with clean page cache.
congestion wait:
================
* ext3 (reader)                         avg      std/avg
** Write
5M	412.128	334.944	337.708	339.457	356.0593 [10.51%]
60M	566.652	321.607	492.025	317.942	424.5565 [29.39%]
300M	318.437	315.321	319.515	314.981	317.0635 [0.71%]
2G	317.777	314.8	318.657	319.409	317.6608 [0.64%]

** Read
5M	40.1829	40.8907	48.8362	40.0535	42.4908  [9.99%]
60M	15.4104	16.1693	18.9162	16.0049	16.6252  [9.39%]
300M	17.0376	15.6721	15.6137	15.756	16.0199  [4.25%]
2G	15.3718	17.3714	15.3873	15.4554	15.8965  [6.19%]

* Tmpfs (reader)
** Write
5M	324.425	327.395	573.688	314.884	385.0980 [32.68%]
60M	464.578	317.084	375.191	318.947	368.9500 [18.76%]
300M	316.885	323.759	317.212	318.149	319.0013 [1.01%]
2G	317.276	318.148	318.97	316.897	317.8228 [0.29%]

** Read
5M	0.9241	0.8620	0.9391	1.2922	1.0044   [19.39%]
60M	0.8753	0.8359	1.0072	1.3317	1.0125   [22.23%]
300M	0.9825	0.8143	0.9864	0.8692	0.9131   [9.35%]
2G	0.9990	0.8281	1.0312	0.9034	0.9404   [9.83%]


PageReclaim:
=============
* ext3 (reader)
** Write                                avg      std/avg  comparison
                                                         (cong is 100%)
5M	313.08	319.924	325.206	325.149	320.8398 [1.79%]  90.11%
60M	314.135	415.245	502.157	313.776	386.3283 [23.50%] 91.00%
300M	313.718	320.448	315.663	316.714	316.6358 [0.89%]  99.87%
2G	317.591	316.67	316.285	316.624	316.7925 [0.18%]  99.73%

** Read
5M	19.0228	20.6743	17.2508	17.5946	18.6356	 [8.37%]  43.86%
60M	17.3657	15.6402	16.5168	15.5601	16.2707	 [5.22%]  97.87%
300M	17.1986	15.7616	19.5163	16.9544	17.3577	 [9.05%]  108.35%
2G	15.6696	15.5384	15.4381	15.2454	15.4729	 [1.16%]  97.34%

* Tmpfs (reader)
** Write
5M	317.303	314.366	316.508	318.883	316.7650 [0.59%]  82.26%
60M	579.952	666.606	660.021	655.346	640.4813 [6.34%]  173.60%
300M	318.494	318.64	319.516	316.79	318.3600 [0.36%]  99.80%
2G	315.935	318.069	321.097	320.329	318.8575 [0.73%]  100.33%

** Read  
5M	0.8415	0.8550	0.7892	0.8515	0.8343	 [3.67%]  83.07%
60M	0.8536	0.8685	0.8237	0.8805	0.8565	 [2.86%]  84.60%
300M	0.8309	0.8724	0.8553	0.8577	0.8541	 [2.01%]  93.53%
2G	0.8427	0.8468	0.8325	1.4658	0.9970	 [31.36%] 106.01%

Variance (std/avg) seems to be lower for both reads and writes with
the PageReclaim approach, and comparing the averages it is mostly
better (especially for reads) or within the noise.
There are two "peaks" in the numbers, though:
* 60M cgroup write performance when reading from tmpfs. While the read
behaved well with the PageReclaim patch (actually much better than with
congestion_wait), the writer stalled a lot.
* 5M cgroup read performance when reading from ext3, where the
congestion_wait approach fell flat while PageReclaim did better for
both read and write.

So I guess the conclusion could be that the two approaches are
comparable. Both of them can lead to stalling, but they mostly do well,
which is much better than hitting the OOM killer. We can do much
better but that would require conditional sleeping.
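
By "conditional sleeping" I mean something along these lines -- a pure
sketch, none of the memcg fields or hooks below exist today:

	/* hypothetical additions to struct mem_cgroup */
	wait_queue_head_t	writeback_wait;	/* init_waitqueue_head() on creation */
	atomic_long_t		writeback_done;	/* number of completed writebacks */

	/*
	 * Reclaim side: sleep until some writeback for this memcg
	 * completes, but never longer than HZ/10.
	 */
	static void memcg_wait_for_writeback(struct mem_cgroup *memcg)
	{
		long done = atomic_long_read(&memcg->writeback_done);

		wait_event_timeout(memcg->writeback_wait,
				   atomic_long_read(&memcg->writeback_done) != done,
				   HZ / 10);
	}

	/*
	 * Writeback completion side: this needs a hook which can attribute
	 * the completed page to its memcg, and that accounting does not
	 * exist today.
	 */
	static void memcg_writeback_done(struct mem_cgroup *memcg)
	{
		atomic_long_inc(&memcg->writeback_done);
		if (waitqueue_active(&memcg->writeback_wait))
			wake_up(&memcg->writeback_wait);
	}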

How do people feel about going with the simpler approach for now (even
for stable kernels, as the problem is real and long-standing) and working
on the conditional part as a follow-up?
Which way would be preferable? I can post a full patch for the
congestion wait approach if you are interested. I do not care much as
both of them fix the problem.
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

[-- Attachment #2: cgroup_cache_oom_test.sh --]
[-- Type: application/x-sh, Size: 940 bytes --]


* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-29 13:51         ` Michal Hocko
  2012-05-31  9:09           ` Michal Hocko
@ 2012-05-31 15:18           ` Fengguang Wu
       [not found]             ` <20120531153249.GD12809@tiehlicka.suse.cz>
  1 sibling, 1 reply; 15+ messages in thread
From: Fengguang Wu @ 2012-05-31 15:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 3745 bytes --]

On Tue, May 29, 2012 at 03:51:01PM +0200, Michal Hocko wrote:
> On Tue 29-05-12 11:35:11, Johannes Weiner wrote:
> [...]
> >         if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
> >                 wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > 
> > But the problem is the part declaring the zone congested:
> > 
> >         /*
> >          * Tag a zone as congested if all the dirty pages encountered were
> >          * backed by a congested BDI. In this case, reclaimers should just
> >          * back off and wait for congestion to clear because further reclaim
> >          * will encounter the same problem
> >          */
> >         if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
> >                 zone_set_flag(mz->zone, ZONE_CONGESTED);
> > 
> > Note the global_reclaim().  It would be nice to have these two operate
> > against the lruvec of sc->target_mem_cgroup and mz->zone instead.  The
> > problem is that ZONE_CONGESTED clearing happens in kswapd alone, which
> > is not necessarily involved in a memcg-constrained load, so we need to
> > find clearing sites that work for both global and memcg reclaim.
> 
> OK, I have tried it with a simpler approach:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c978ce4..e45cf2a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1294,8 +1294,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	 *                     isolated page is PageWriteback
>  	 */
>  	if (nr_writeback && nr_writeback >=
> -			(nr_taken >> (DEF_PRIORITY - sc->priority)))
> -		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
> +		if (global_reclaim(sc))
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +		else
> +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	}
>  
>  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>  		zone_idx(zone),
> 
> without a 'lruvec-zone' congestion flag and it worked reasonably well, for
> my testcase at least (no OOM). We could still stall even if we managed
> to write back pages in the meantime, but we should at least prevent
> the problem you are mentioning (most of the time).
> 
> The issue with lruvec-zone tagging is, as you mentioned, that the
> flag clearing places are not that easy to get right because we do
> not have anything like zone_watermark_ok in a memcg context. I am even
> wondering whether it is possible without per-memcg dirty accounting.
> 
> To be honest, I was considering congestion waiting at the beginning as
> well, but I hate using an arbitrary timeout when we are, in fact, waiting
> for a specific event.
> Nevertheless I do acknowledge your concern about accidentally hitting
> PageReclaim pages in the middle of the LRU because of clean page cache,
> which would lead to unnecessary stalls.

Hi Michal,

Now the only concern is to confirm whether the patch will impact
interactive performance when there are not so many dirty pages in the
memcg.

For example, running a dd write to disk plus several other dd's reading
from either disk or a sparse file.

There is no dirty accounting for memcg; however, if you run the workload
in one single 100MB memcg, the global dirty page count in /proc/vmstat
will be exactly the number of dirty pages inside that memcg. Thus we can
create situations with eg. 10%, 30%, 50% dirty pages inside the memcg
and watch how well your patch performs.

I happen to have a debug patch for showing the number of page reclaim
stalls.  It applies cleanly to 3.4, and you'll need to add accounting
to your new code. If it shows low stall numbers in the cases of 10-30%
dirty pages, even when they are quickly rotated by fast reads, we may
go ahead with any of the approaches :-)

Thanks,
Fengguang

[-- Attachment #2: mm-debugfs-vmscan-stalls-0.patch --]
[-- Type: text/x-diff, Size: 4543 bytes --]

Subject: mm: create /debug/vm for page reclaim stalls
Date: Fri Sep 10 13:05:57 CST 2010

Create /debug/vm/ -- a convenient place for kernel hackers to play with
VM variables.

The counters exported here track page reclaim stalls:
nr_reclaim_wait_writeback, nr_reclaim_wait_congested, nr_congestion_wait
and nr_migrate_wait_writeback.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   10 ++++++++++
 mm/internal.h    |    5 +++++
 mm/migrate.c     |    3 +++
 mm/vmscan.c      |   45 +++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 61 insertions(+), 2 deletions(-)

--- linux.orig/mm/vmscan.c	2012-05-31 22:43:42.239868770 +0800
+++ linux/mm/vmscan.c	2012-05-31 22:43:49.815868950 +0800
@@ -759,6 +759,8 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
+u32 nr_reclaim_wait_writeback;
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -820,9 +822,10 @@ static unsigned long shrink_page_list(st
 			 * for the IO to complete.
 			 */
 			if ((sc->reclaim_mode & RECLAIM_MODE_SYNC) &&
-			    may_enter_fs)
+			    may_enter_fs) {
 				wait_on_page_writeback(page);
-			else {
+				nr_reclaim_wait_writeback++;
+			} else {
 				unlock_page(page);
 				goto keep_lumpy;
 			}
@@ -3660,3 +3663,41 @@ void scan_unevictable_unregister_node(st
 	device_remove_file(&node->dev, &dev_attr_scan_unevictable_pages);
 }
 #endif
+
+#if defined(CONFIG_DEBUG_FS)
+#include <linux/debugfs.h>
+
+static struct dentry *vm_debug_root;
+
+static int __init vm_debug_init(void)
+{
+	struct dentry *dentry;
+
+	vm_debug_root = debugfs_create_dir("vm", NULL);
+	if (!vm_debug_root)
+		goto fail;
+
+#ifdef CONFIG_MIGRATION
+	dentry = debugfs_create_u32("nr_migrate_wait_writeback", 0644,
+				    vm_debug_root, &nr_migrate_wait_writeback);
+#endif
+
+	dentry = debugfs_create_u32("nr_reclaim_wait_writeback", 0644,
+				    vm_debug_root, &nr_reclaim_wait_writeback);
+
+	dentry = debugfs_create_u32("nr_reclaim_wait_congested", 0644,
+				    vm_debug_root, &nr_reclaim_wait_congested);
+
+	dentry = debugfs_create_u32("nr_congestion_wait", 0644,
+				    vm_debug_root, &nr_congestion_wait);
+
+	if (!dentry)
+		goto fail;
+
+	return 0;
+fail:
+	return -ENOMEM;
+}
+
+module_init(vm_debug_init);
+#endif /* CONFIG_DEBUG_FS */
--- linux.orig/mm/migrate.c	2012-05-31 22:43:42.215868770 +0800
+++ linux/mm/migrate.c	2012-05-31 22:43:49.815868950 +0800
@@ -674,6 +674,8 @@ static int move_to_new_page(struct page
 	return rc;
 }
 
+u32 nr_migrate_wait_writeback;
+
 static int __unmap_and_move(struct page *page, struct page *newpage,
 			int force, bool offlining, enum migrate_mode mode)
 {
@@ -742,6 +744,7 @@ static int __unmap_and_move(struct page
 		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
+		nr_migrate_wait_writeback++;
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
--- linux.orig/mm/internal.h	2012-05-31 22:43:42.231868771 +0800
+++ linux/mm/internal.h	2012-05-31 22:43:49.815868950 +0800
@@ -309,3 +309,8 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
+extern u32 nr_migrate_wait_writeback;
+extern u32 nr_reclaim_wait_congested;
+extern u32 nr_congestion_wait;
+
--- linux.orig/mm/backing-dev.c	2012-05-31 22:43:42.223868770 +0800
+++ linux/mm/backing-dev.c	2012-05-31 22:43:49.815868950 +0800
@@ -12,6 +12,8 @@
 #include <linux/device.h>
 #include <trace/events/writeback.h>
 
+#include "internal.h"
+
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
 struct backing_dev_info default_backing_dev_info = {
@@ -805,6 +807,9 @@ void set_bdi_congested(struct backing_de
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+u32 nr_reclaim_wait_congested;
+u32 nr_congestion_wait;
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -825,6 +830,10 @@ long congestion_wait(int sync, long time
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
+	nr_congestion_wait++;
+	trace_printk("%pS %pS\n",
+		     __builtin_return_address(0),
+		     __builtin_return_address(1));
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
 
@@ -879,6 +888,7 @@ long wait_iff_congested(struct zone *zon
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
+	nr_reclaim_wait_congested++;
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
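
For reference, once debugfs is mounted (at /sys/kernel/debug on most systems,
or /debug as the changelog assumes -- the exact mount point is not dictated by
this patch), the new counters can be read like any other debugfs u32, e.g.:

  $ cat /sys/kernel/debug/vm/nr_reclaim_wait_writeback
  $ cat /sys/kernel/debug/vm/nr_congestion_wait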

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
       [not found]                     ` <20120531182509.GA22539@tiehlicka.suse.cz>
@ 2012-06-01  1:33                       ` Fengguang Wu
  0 siblings, 0 replies; 15+ messages in thread
From: Fengguang Wu @ 2012-06-01  1:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Mel Gorman, linux-mm, linux-kernel,
	Andrew Morton, KAMEZAWA Hiroyuki, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

[restore CC list]

On Thu, May 31, 2012 at 08:25:09PM +0200, Michal Hocko wrote:
> On Fri 01-06-12 00:01:29, Fengguang Wu wrote:
> > On Thu, May 31, 2012 at 05:49:00PM +0200, Michal Hocko wrote:
> > > On Thu 31-05-12 23:42:48, Fengguang Wu wrote:
> > > > On Thu, May 31, 2012 at 05:32:49PM +0200, Michal Hocko wrote:
> > > > > JFYI: You might have missed https://lkml.org/lkml/2012/5/31/122 because
> > > > > intel mail server returned with "we do not like shell scripts as a
> > > > > attachment".
> > > > 
> > > > Yeah, I did miss it. A quick question: does the PageReclaim patch you
> > > > tested include the (priority < 3) test, or is it the originally posted
> > > > patch?
> > > 
> > > It's the original patch.
> > 
> > OK, that's fine. I suspect that even adding (priority < 3) can help
> > only in some situations. In the others, the effect will be that writeback
> > pages accumulate on the LRU until they start to throttle page reclaims
> > from both reads and writes. There will have to be some throttling somewhere.
> 
> Yes, I agree. But it could help at least with sporadic "hey, this is a PageReclaim"
> issues. I just didn't like to push priority into shrink_page_list without
> it; the justification for that is quite hard.

Yeah, priority is just a rule of thumb: "there may be lots of
dirty/writeback pages or other pressure if priority goes low".
And it's already been used this way in shrink_page_list().

Considering that it's also targeted for a -stable merge, we do need
a very strict condition to safeguard against regressions in other cases.
This is also true for the wait_iff_congested() scheme.

> > Subject: mm: pass __GFP_WRITE to memcg charge and reclaim routines
> > 
> > __GFP_WRITE will be tested in vmscan to find out the write tasks.
> > 
> > For good interactive performance, we try to focus dirty reclaim waits on
> > them and avoid blocking unrelated tasks.
> > 
> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> 
> I will have a look at this one tomorrow with a clean head.

OK. The usage in my mind is

        if (PageWriteback(page) && PageReclaim(page))
+               if ((sc->gfp_mask & __GFP_WRITE) || (priority < 3))
                        do some dirty throttling

But note that it only detects writes to new pages (i.e. a simple dd).
Overwrites of already cached clean pages cannot be detected this way.
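
Putting the pieces together, a sketch of how that could look in the memcg
branch of shrink_page_list() -- assuming __GFP_WRITE really survives into
sc->gfp_mask as in the quoted __GFP_WRITE patch and that sc->priority counts
down from DEF_PRIORITY as usual. This is only the direction being discussed,
not a posted patch:

        if (PageWriteback(page) && PageReclaim(page) &&
            may_enter_fs && !global_reclaim(sc) &&
            ((sc->gfp_mask & __GFP_WRITE) || sc->priority < 3)) {
                /* likely a writer, or reclaim clearly cannot keep up */
                wait_on_page_writeback(page);
        }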

Thanks,
Fengguang

> > ---
> >  include/linux/gfp.h |    2 +-
> >  mm/filemap.c        |   17 ++++++++++-------
> >  2 files changed, 11 insertions(+), 8 deletions(-)
> > 
> > --- linux.orig/include/linux/gfp.h	2012-03-02 14:06:47.501765703 +0800
> > +++ linux/include/linux/gfp.h	2012-03-02 14:07:39.921766949 +0800
> > @@ -129,7 +129,7 @@ struct vm_area_struct;
> >  /* Control page allocator reclaim behavior */
> >  #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
> >  			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
> > -			__GFP_NORETRY|__GFP_NOMEMALLOC)
> > +			__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_WRITE)
> >  
> >  /* Control slab gfp mask during early boot */
> >  #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
> > --- linux.orig/mm/filemap.c	2012-03-02 14:07:21.000000000 +0800
> > +++ linux/mm/filemap.c	2012-03-02 14:07:53.709767277 +0800
> > @@ -2339,23 +2339,26 @@ struct page *grab_cache_page_write_begin
> >  	int status;
> >  	gfp_t gfp_mask;
> >  	struct page *page;
> > -	gfp_t gfp_notmask = 0;
> > +	gfp_t lru_gfp_mask = GFP_KERNEL;
> >  
> >  	gfp_mask = mapping_gfp_mask(mapping);
> > -	if (mapping_cap_account_dirty(mapping))
> > +	if (mapping_cap_account_dirty(mapping)) {
> >  		gfp_mask |= __GFP_WRITE;
> > -	if (flags & AOP_FLAG_NOFS)
> > -		gfp_notmask = __GFP_FS;
> > +		lru_gfp_mask |= __GFP_WRITE;
> > +	}
> > +	if (flags & AOP_FLAG_NOFS) {
> > +		gfp_mask &= ~__GFP_FS;
> > +		lru_gfp_mask &= ~__GFP_FS;
> > +	}
> >  repeat:
> >  	page = find_lock_page(mapping, index);
> >  	if (page)
> >  		goto found;
> >  
> > -	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
> > +	page = __page_cache_alloc(gfp_mask);
> >  	if (!page)
> >  		return NULL;
> > -	status = add_to_page_cache_lru(page, mapping, index,
> > -						GFP_KERNEL & ~gfp_notmask);
> > +	status = add_to_page_cache_lru(page, mapping, index, lru_gfp_mask);
> >  	if (unlikely(status)) {
> >  		page_cache_release(page);
> >  		if (status == -EEXIST)
> 
> 
> -- 
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9    
> Czech Republic

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-05-31  9:09           ` Michal Hocko
@ 2012-06-01  8:37             ` Michal Hocko
  2012-06-07 14:45               ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2012-06-01  8:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Fengguang Wu, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Thu 31-05-12 11:09:57, Michal Hocko wrote:
> On Tue 29-05-12 15:51:01, Michal Hocko wrote:
> [...]
> > OK, I have tried it with a simpler approach:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c978ce4..e45cf2a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1294,8 +1294,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	 *                     isolated page is PageWriteback
> >  	 */
> >  	if (nr_writeback && nr_writeback >=
> > -			(nr_taken >> (DEF_PRIORITY - sc->priority)))
> > -		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > +			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
> > +		if (global_reclaim(sc))
> > +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > +		else
> > +			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +	}
> >  
> >  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> >  		zone_idx(zone),
> > 
> [...]
> > In conclusion, congestion wait performs better (even though I haven't
> > done repeated testing to see what the deviation is) when the
> > reader/writer size doesn't fit into the memcg, while it performs much
> > worse (at least for the writer) if it does fit.
> > 
> > I will play with that some more
> 
> I have, yet again, updated the test. I am writing data to a USB stick
> (with ext3, mounted in sync mode) which writes 1G in 274.518s
> (3.8MB/s), so the storage is really slow. The parallel read is performed
> from tmpfs and from a local ext3 partition (the testing script is attached).
> We start with the write so that the LRUs will have some dirty pages when the
> read starts and fills the LRU up with clean page cache.
> 
> congestion wait:
> ================
> * ext3 (reader)                         avg      std/avg
> ** Write
> 5M	412.128	334.944	337.708	339.457	356.0593 [10.51%]
> 60M	566.652	321.607	492.025	317.942	424.5565 [29.39%]
> 300M	318.437	315.321	319.515	314.981	317.0635 [0.71%]
> 2G	317.777	314.8	318.657	319.409	317.6608 [0.64%]
> 
> ** Read
> 5M	40.1829	40.8907	48.8362	40.0535	42.4908  [9.99%]
> 60M	15.4104	16.1693	18.9162	16.0049	16.6252  [9.39%]
> 300M	17.0376	15.6721	15.6137	15.756	16.0199  [4.25%]
> 2G	15.3718	17.3714	15.3873	15.4554	15.8965  [6.19%]
> 
> * Tmpfs (reader)
> ** Write
> 5M	324.425	327.395	573.688	314.884	385.0980 [32.68%]
> 60M	464.578	317.084	375.191	318.947	368.9500 [18.76%]
> 300M	316.885	323.759	317.212	318.149	319.0013 [1.01%]
> 2G	317.276	318.148	318.97	316.897	317.8228 [0.29%]
> 
> ** Read
> 5M	0.9241	0.8620	0.9391	1.2922	1.0044   [19.39%]
> 60M	0.8753	0.8359	1.0072	1.3317	1.0125   [22.23%]
> 300M	0.9825	0.8143	0.9864	0.8692	0.9131   [9.35%]
> 2G	0.9990	0.8281	1.0312	0.9034	0.9404   [9.83%]
> 
> 
> PageReclaim:
> =============
> * ext3 (reader)
> ** Write                                avg      std/avg  comparison
>                                                          (cong is 100%)
> 5M	313.08	319.924	325.206	325.149	320.8398 [1.79%]  90.11%
> 60M	314.135	415.245	502.157	313.776	386.3283 [23.50%] 91.00%
> 300M	313.718	320.448	315.663	316.714	316.6358 [0.89%]  99.87%
> 2G	317.591	316.67	316.285	316.624	316.7925 [0.18%]  99.73%
> 
> ** Read
> 5M	19.0228	20.6743	17.2508	17.5946	18.6356	 [8.37%]  43.86%
> 60M	17.3657	15.6402	16.5168	15.5601	16.2707	 [5.22%]  97.87%
> 300M	17.1986	15.7616	19.5163	16.9544	17.3577	 [9.05%]  108.35%
> 2G	15.6696	15.5384	15.4381	15.2454	15.4729	 [1.16%]  97.34%
> 
> * Tmpfs (reader)
> ** Write
> 5M	317.303	314.366	316.508	318.883	316.7650 [0.59%]  82.26%
> 60M	579.952	666.606	660.021	655.346	640.4813 [6.34%]  173.60%
> 300M	318.494	318.64	319.516	316.79	318.3600 [0.36%]  99.80%
> 2G	315.935	318.069	321.097	320.329	318.8575 [0.73%]  100.33%
> 
> ** Read  
> 5M	0.8415	0.8550	0.7892	0.8515	0.8343	 [3.67%]  83.07%
> 60M	0.8536	0.8685	0.8237	0.8805	0.8565	 [2.86%]  84.60%
> 300M	0.8309	0.8724	0.8553	0.8577	0.8541	 [2.01%]  93.53%
> 2G	0.8427	0.8468	0.8325	1.4658	0.9970	 [31.36%] 106.01%

And I have just finished a test without any patch (current memcg-devel tree).
Surprisingly enough, the OOM killer didn't trigger in this setup (the storage
is probably too slow):

					avg	std/avg		comparison      comparison 
                                                        	(cong is 100%)	(page reclaim 100%)
ext3 (reader)
** Write
5M	329.953	319.305	705.561	338.379	423.2995 [44.49%]	118.88%		131.93%
60M	320.940	529.418	314.126	552.817	429.3253 [30.16%]	101.12%		111.13%
300M	315.600	318.759	314.052	313.366	315.4443 [0.76%]	99.49%		99.62%
2G	316.799	313.328	316.605	317.873	316.151  [0.62%]	99.52%		99.80%

** Read	
5M	17.2729	15.9298	15.5007	15.7594	16.1157	[4.91%]		37.93%		86.48%
60M	16.0478	15.8576	16.7704	16.9675	16.4108	[3.29%]		98.71%		100.86%
300M	15.7392	15.5122	15.5084	15.6455	15.6013	[0.72%]		97.39%		89.88%
2G	15.3784	15.3592	15.5804	15.6464	15.4911	[0.93%]		97.45%		100.12%

Tmpfs (reader)
** write
5M	313.910	504.897	699.040	352.671	467.6295 [37.40%]	121.43%		147.63%
60M	654.229	316.980	316.147	651.824	484.7950 [40.07%]	131.40%		75.69%
300M	315.442	317.248	316.668	316.163	316.3803 [0.24%]	99.18%		99.38%
2G	316.971	315.687	316.283	316.879	316.4550 [0.19%]	99.57%		99.25%

** read
5M	0.8013	1.1041	0.8345	0.8223	0.8906	[16.06%]	88.67%		106.74%
60M	0.8312	0.7887	0.8577	0.8273	0.8262	[3.44%]		81.60%		96.46%
300M	1.1530	0.8674	1.1260	1.1116	1.0645	[12.45%]	116.58%		124.64%
2G	0.8318	0.8323	0.8897	0.8278	0.8454	[3.50%]		89.89%		84.79%

Write performance is within the noise. Sometimes the patched kernel does
much better, especially for the small groups.
Read performance is more interesting. We seem to regress, although the
PageReclaim approach seems to regress less than congestion_wait.
The biggest drop seems to be congestion wait with the ext3 reader and the
5M cgroup (there was no big peak during that run, ~10% std/avg, and the
performance is steady without any patches as well).

More detailed statistics (max/min - the worst/best performance).
	comparison (cong is 100%)	comparison (page reclaim 100%)			
	max	min	median		max	min	median
* ext3
** Write
5M	171.20%	95.33%	98.70%		216.96%	101.99%	103.61%
60M	97.56%	98.80%	104.51%		110.09%	100.11%	116.59%
300M	99.76%	99.49%	99.35%		99.47%	99.89%	99.57%
2G	99.52%	99.53%	99.52%		100.09%	99.07%	100.02%

** Read					
5M	35.37%	38.70%	39.09%		83.55%	89.85%	86.54%
60M	89.70%	102.90%	102.00%		97.71%	101.91%	102.06%
300M	92.38%	99.33%	99.14%		80.65%	98.39%	91.23%
2G	90.07%	99.92%	100.38%		99.85%	100.75%	99.94%

* Tmpfs					
** write
5M	121.85%	99.69%	131.57%		219.22%	99.85%	135.30%
60M	140.82%	99.70%	139.57%		98.14%	54.51%	73.65%
300M	97.99%	99.54%	99.60%		99.29%	99.57%	99.32%
2G	99.37%	99.62%	99.64%		98.72%	99.92%	99.18%

** read				
5M	85.44%	92.96%	88.92%		129.13%	101.54%	97.87%
60M	64.41%	94.35%	88.10%		97.41%	95.75%	96.31%
300M	116.89%	106.52%	120.84%		132.17%	104.39%	130.63%
2G	86.27%	99.96%	87.47%		60.69%	99.44%	98.49%

These numbers show that PageReclaim gives us slightly better results
than congestion wait. There are no such big drops (like the 5M ext3 read
or the 60M tmpfs read).
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-01  8:37             ` Michal Hocko
@ 2012-06-07 14:45               ` Michal Hocko
  2012-06-14  7:27                 ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2012-06-07 14:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Fengguang Wu, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Fri 01-06-12 10:37:30, Michal Hocko wrote:
[...]
> More detailed statistics (max/min - the worst/best performance).
> 	comparison (cong is 100%)	comparison (page reclaim 100%)			
> 	max	min	median		max	min	median
> * ext3
> ** Write
> 5M	171.20%	95.33%	98.70%		216.96%	101.99%	103.61%
> 60M	97.56%	98.80%	104.51%		110.09%	100.11%	116.59%
> 300M	99.76%	99.49%	99.35%		99.47%	99.89%	99.57%
> 2G	99.52%	99.53%	99.52%		100.09%	99.07%	100.02%
> 
> ** Read					
> 5M	35.37%	38.70%	39.09%		83.55%	89.85%	86.54%
> 60M	89.70%	102.90%	102.00%		97.71%	101.91%	102.06%
> 300M	92.38%	99.33%	99.14%		80.65%	98.39%	91.23%
> 2G	90.07%	99.92%	100.38%		99.85%	100.75%	99.94%
> 
> * Tmpfs					
> ** write
> 5M	121.85%	99.69%	131.57%		219.22%	99.85%	135.30%
> 60M	140.82%	99.70%	139.57%		98.14%	54.51%	73.65%
> 300M	97.99%	99.54%	99.60%		99.29%	99.57%	99.32%
> 2G	99.37%	99.62%	99.64%		98.72%	99.92%	99.18%
> 
> ** read				
> 5M	85.44%	92.96%	88.92%		129.13%	101.54%	97.87%
> 60M	64.41%	94.35%	88.10%		97.41%	95.75%	96.31%
> 300M	116.89%	106.52%	120.84%		132.17%	104.39%	130.63%
> 2G	86.27%	99.96%	87.47%		60.69%	99.44%	98.49%

I have played with the patch below, but in the end it didn't show much
difference, and in some cases we even end up doing worse.

Here is the no_patch/patched comparison:

	comparison (page reclaim is 100%)
* ext3  avg	max	min	median
** Write
5M    	81.49%	77.53%	101.91%	76.60%
60M   	98.60%	95.58%	101.40%	99.62%
300M  	101.68%	102.05%	101.19%	101.73%
2G    	102.20%	102.25%	102.12%	102.22%
				
** Read  				
5M    	103.94%	105.14%	103.95%	103.32%
60M   	105.26%	107.91%	103.15%	104.95%
300M  	104.83%	107.86%	101.65%	104.88%
2G    	102.67%	101.26%	102.83%	103.35%

* Tmpfs
** Write
5M    	107.68%	119.66%	105.26%	102.78%
60M   	122.16%	138.51%	103.62%	121.09%
300M  	101.03%	100.67%	101.11%	101.17%
2G    	101.82%	101.66%	101.87%	101.87%
				
** Read			
5M    	102.47%	124.02%	98.05%	92.57%
60M   	103.62%	121.03%	96.97%	96.52%
300M  	98.90%	118.92%	102.64%	86.19%
2G    	83.50%	76.34%	97.36%	81.92%

I am not sure it really makes sense to play with the priority here. Any
value we would end up with would be just a wild guess based on mostly
artificial workloads. So I think it makes sense to go with the original
version of the PageReclaim patch without any further fiddling with the
priority.

Is this sufficient to go with the patch or do people still have concerns
which would block the patch from merging?

---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7cccd81..a240bdf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -726,7 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * writeback from reclaim and there is nothing else to
 			 * reclaim.
 			 */
-			if (PageReclaim(page)
+			if (PageReclaim(page) && sc->priority < DEF_PRIORITY - 3
 					&& may_enter_fs && !global_reclaim(sc))
 				wait_on_page_writeback(page);
 			else {
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-07 14:45               ` Michal Hocko
@ 2012-06-14  7:27                 ` Johannes Weiner
  2012-06-14 10:13                   ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2012-06-14  7:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Fengguang Wu, linux-mm, linux-kernel, Andrew Morton,
	KAMEZAWA Hiroyuki, Mel Gorman, Minchan Kim, Rik van Riel,
	Ying Han, Greg Thelen, Hugh Dickins

On Thu, Jun 07, 2012 at 04:45:56PM +0200, Michal Hocko wrote:
> On Fri 01-06-12 10:37:30, Michal Hocko wrote:
> [...]
> > More detailed statistics (max/min - the worst/best performance).
> > 	comparison (cong is 100%)	comparison (page reclaim 100%)			
> > 	max	min	median		max	min	median
> > * ext3
> > ** Write
> > 5M	171.20%	95.33%	98.70%		216.96%	101.99%	103.61%
> > 60M	97.56%	98.80%	104.51%		110.09%	100.11%	116.59%
> > 300M	99.76%	99.49%	99.35%		99.47%	99.89%	99.57%
> > 2G	99.52%	99.53%	99.52%		100.09%	99.07%	100.02%
> > 
> > ** Read					
> > 5M	35.37%	38.70%	39.09%		83.55%	89.85%	86.54%
> > 60M	89.70%	102.90%	102.00%		97.71%	101.91%	102.06%
> > 300M	92.38%	99.33%	99.14%		80.65%	98.39%	91.23%
> > 2G	90.07%	99.92%	100.38%		99.85%	100.75%	99.94%
> > 
> > * Tmpfs					
> > ** write
> > 5M	121.85%	99.69%	131.57%		219.22%	99.85%	135.30%
> > 60M	140.82%	99.70%	139.57%		98.14%	54.51%	73.65%
> > 300M	97.99%	99.54%	99.60%		99.29%	99.57%	99.32%
> > 2G	99.37%	99.62%	99.64%		98.72%	99.92%	99.18%
> > 
> > ** read				
> > 5M	85.44%	92.96%	88.92%		129.13%	101.54%	97.87%
> > 60M	64.41%	94.35%	88.10%		97.41%	95.75%	96.31%
> > 300M	116.89%	106.52%	120.84%		132.17%	104.39%	130.63%
> > 2G	86.27%	99.96%	87.47%		60.69%	99.44%	98.49%
> 
> I have played with the patch below, but in the end it didn't show much
> difference, and in some cases we even end up doing worse.
> 
> Here is the no_patch/patched comparison:
> 
> 	comparison (page reclaim is 100%)
> * ext3  avg	max	min	median
> ** Write
> 5M    	81.49%	77.53%	101.91%	76.60%
> 60M   	98.60%	95.58%	101.40%	99.62%
> 300M  	101.68%	102.05%	101.19%	101.73%
> 2G    	102.20%	102.25%	102.12%	102.22%
> 				
> ** Read  				
> 5M    	103.94%	105.14%	103.95%	103.32%
> 60M   	105.26%	107.91%	103.15%	104.95%
> 300M  	104.83%	107.86%	101.65%	104.88%
> 2G    	102.67%	101.26%	102.83%	103.35%
> 
> * Tmpfs
> ** Write
> 5M    	107.68%	119.66%	105.26%	102.78%
> 60M   	122.16%	138.51%	103.62%	121.09%
> 300M  	101.03%	100.67%	101.11%	101.17%
> 2G    	101.82%	101.66%	101.87%	101.87%
> 				
> ** Read			
> 5M    	102.47%	124.02%	98.05%	92.57%
> 60M   	103.62%	121.03%	96.97%	96.52%
> 300M  	98.90%	118.92%	102.64%	86.19%
> 2G    	83.50%	76.34%	97.36%	81.92%
> 
> I am not sure it really makes sense to play with the priority here. Any
> value we would end up with would be just a wild guess based on mostly
> artificial workloads. So I think it makes sense to go with the original
> version of the PageReclaim patch without any further fiddling with the
> priority.
> 
> Is this sufficient to go with the patch or do people still have concerns
> which would block the patch from merging?

No, let's go for it.  It's a net improvement as it stands.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC -mm] memcg: prevent from OOM with too many dirty pages
  2012-06-14  7:27                 ` Johannes Weiner
@ 2012-06-14 10:13                   ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2012-06-14 10:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, linux-mm, linux-kernel, KAMEZAWA Hiroyuki,
	Mel Gorman, Minchan Kim, Rik van Riel, Ying Han, Greg Thelen,
	Hugh Dickins, Johannes Weiner

On Thu 14-06-12 09:27:55, Johannes Weiner wrote:
> On Thu, Jun 07, 2012 at 04:45:56PM +0200, Michal Hocko wrote:
> > On Fri 01-06-12 10:37:30, Michal Hocko wrote:
> > [...]
> > > More detailed statistics (max/min - the worst/best performance).
> > > 	comparison (cong is 100%)	comparison (page reclaim 100%)			
> > > 	max	min	median		max	min	median
> > > * ext3
> > > ** Write
> > > 5M	171.20%	95.33%	98.70%		216.96%	101.99%	103.61%
> > > 60M	97.56%	98.80%	104.51%		110.09%	100.11%	116.59%
> > > 300M	99.76%	99.49%	99.35%		99.47%	99.89%	99.57%
> > > 2G	99.52%	99.53%	99.52%		100.09%	99.07%	100.02%
> > > 
> > > ** Read					
> > > 5M	35.37%	38.70%	39.09%		83.55%	89.85%	86.54%
> > > 60M	89.70%	102.90%	102.00%		97.71%	101.91%	102.06%
> > > 300M	92.38%	99.33%	99.14%		80.65%	98.39%	91.23%
> > > 2G	90.07%	99.92%	100.38%		99.85%	100.75%	99.94%
> > > 
> > > * Tmpfs					
> > > ** write
> > > 5M	121.85%	99.69%	131.57%		219.22%	99.85%	135.30%
> > > 60M	140.82%	99.70%	139.57%		98.14%	54.51%	73.65%
> > > 300M	97.99%	99.54%	99.60%		99.29%	99.57%	99.32%
> > > 2G	99.37%	99.62%	99.64%		98.72%	99.92%	99.18%
> > > 
> > > ** read				
> > > 5M	85.44%	92.96%	88.92%		129.13%	101.54%	97.87%
> > > 60M	64.41%	94.35%	88.10%		97.41%	95.75%	96.31%
> > > 300M	116.89%	106.52%	120.84%		132.17%	104.39%	130.63%
> > > 2G	86.27%	99.96%	87.47%		60.69%	99.44%	98.49%
> > 
> > I have played with the patch below, but in the end it didn't show much
> > difference, and in some cases we even end up doing worse.
> > 
> > Here is the no_patch/patched comparison:
> > 
> > 	comparison (page reclaim is 100%)
> > * ext3  avg	max	min	median
> > ** Write
> > 5M    	81.49%	77.53%	101.91%	76.60%
> > 60M   	98.60%	95.58%	101.40%	99.62%
> > 300M  	101.68%	102.05%	101.19%	101.73%
> > 2G    	102.20%	102.25%	102.12%	102.22%
> > 				
> > ** Read  				
> > 5M    	103.94%	105.14%	103.95%	103.32%
> > 60M   	105.26%	107.91%	103.15%	104.95%
> > 300M  	104.83%	107.86%	101.65%	104.88%
> > 2G    	102.67%	101.26%	102.83%	103.35%
> > 
> > * Tmpfs
> > ** Write
> > 5M    	107.68%	119.66%	105.26%	102.78%
> > 60M   	122.16%	138.51%	103.62%	121.09%
> > 300M  	101.03%	100.67%	101.11%	101.17%
> > 2G    	101.82%	101.66%	101.87%	101.87%
> > 				
> > ** Read			
> > 5M    	102.47%	124.02%	98.05%	92.57%
> > 60M   	103.62%	121.03%	96.97%	96.52%
> > 300M  	98.90%	118.92%	102.64%	86.19%
> > 2G    	83.50%	76.34%	97.36%	81.92%
> > 
> > I am not sure it really makes sense to play with the priority here. Any
> > value we would end up with would be just a wild guess based on mostly
> > artificial workloads. So I think it makes sense to go with the original
> > version of the PageReclaim patch without any further fiddling with the
> > priority.
> > 
> > Is this sufficient to go with the patch or do people still have concerns
> > which would block the patch from merging?
> 
> No, let's go for it.  It's a net improvement as it stands.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks Johannes!

Andrew, do you want me to resend the patch?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-06-14 10:13 UTC | newest]

Thread overview: 15+ messages
2012-05-28 15:38 [RFC -mm] memcg: prevent from OOM with too many dirty pages Michal Hocko
2012-05-29  3:08 ` Fengguang Wu
2012-05-29  7:28   ` Johannes Weiner
2012-05-29  8:48     ` Fengguang Wu
2012-05-29  9:35       ` Johannes Weiner
2012-05-29 10:21         ` Fengguang Wu
2012-05-29 13:32         ` Mel Gorman
2012-05-29 13:51         ` Michal Hocko
2012-05-31  9:09           ` Michal Hocko
2012-06-01  8:37             ` Michal Hocko
2012-06-07 14:45               ` Michal Hocko
2012-06-14  7:27                 ` Johannes Weiner
2012-06-14 10:13                   ` Michal Hocko
2012-05-31 15:18           ` Fengguang Wu
     [not found]             ` <20120531153249.GD12809@tiehlicka.suse.cz>
     [not found]               ` <20120531154248.GA32734@localhost>
     [not found]                 ` <20120531154859.GA20546@tiehlicka.suse.cz>
     [not found]                   ` <20120531160129.GA439@localhost>
     [not found]                     ` <20120531182509.GA22539@tiehlicka.suse.cz>
2012-06-01  1:33                       ` Fengguang Wu
