All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Andi Kleen <andi@firstfloor.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [patch v3] swap: virtual swap readahead
Date: Wed, 10 Jun 2009 09:45:08 +0200	[thread overview]
Message-ID: <20090610074508.GA1960@cmpxchg.org> (raw)
In-Reply-To: <20090610050342.GA8867@localhost>

Hi Fengguang,

On Wed, Jun 10, 2009 at 01:03:42PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 03:37:02AM +0800, Johannes Weiner wrote:
> > On Tue, Jun 09, 2009 at 09:01:28PM +0200, Johannes Weiner wrote:
> > > [resend with lists cc'd, sorry]
> > 
> > [and fixed Hugh's email.  crap]
> > 
> > > Hi,
> > > 
> > > here is a new iteration of the virtual swap readahead.  Per Hugh's
> > > suggestion, I moved the pte collecting to the callsite and thus out
> > > ouf swap code.  Unfortunately, I had to bound page_cluster due to an
> > > array of that many swap entries on the stack, but I think it is better
> > > to limit the cluster size to a sane maximum than using dynamic
> > > allocation for this purpose.
> 
> Hi Johannes,
> 
> When stress testing your patch, I found it triggered many OOM kills.
> Around the time of last OOMs, the memory usage is:
> 
>              total       used       free     shared    buffers     cached
> Mem:           474        468          5          0          0        239
> -/+ buffers/cache:        229        244
> Swap:         1023        221        802

Wow, that really confused me for a second as we shouldn't read more
pages ahead than without the patch, probably even less under stress.

So the problem has to be a runaway reading.  And indeed, severe
stupidity here:

+       window = cluster << PAGE_SHIFT;
+       min = addr & ~(window - 1);
+       max = min + cluster;
+       /*
+        * To keep the locking/highpte mapping simple, stay
+        * within the PTE range of one PMD entry.
+        */
+       limit = addr & PMD_MASK;
+       if (limit > min)
+               min = limit;
+       limit = pmd_addr_end(addr, max);
+       if (limit < max)
+               max = limit;
+       limit = max - min;

The mistake is at the initial calculation of max.  It should be

	max = min + window;

The resulting problem is that min could get bigger than max when
cluster is bigger than PMD_SHIFT.  Did you use page_cluster == 5?

The initial min is aligned to a value below the PMD boundary and max
based on it with a too small offset, staying below the PMD boundary as
well.  When min is rounded up, this becomes a bit large:

	limit = max - min;

So if my brain is already functioning, fixing the initial max should
be enough because either

	o window is smaller than PMD_SIZE, than we won't round down
	below a PMD boundary in the first place or

	o window is bigger than PMD_SIZE, than we can round down below
	a PMD boundary but adding window to that is garuanteed to
	cross the boundary again

and thus max is always bigger than min.

Fengguang, does this make sense?  If so, the patch below should fix
it.

Thank you,

	Hannes

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2467,7 +2467,7 @@ static int swap_readahead_ptes(struct mm
 
 	window = cluster << PAGE_SHIFT;
 	min = addr & ~(window - 1);
-	max = min + cluster;
+	max = min + window;
 	/*
 	 * To keep the locking/highpte mapping simple, stay
 	 * within the PTE range of one PMD entry.

WARNING: multiple messages have this Message-ID (diff)
From: Johannes Weiner <hannes@cmpxchg.org>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Rik van Riel <riel@redhat.com>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Andi Kleen <andi@firstfloor.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Minchan Kim <minchan.kim@gmail.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [patch v3] swap: virtual swap readahead
Date: Wed, 10 Jun 2009 09:45:08 +0200	[thread overview]
Message-ID: <20090610074508.GA1960@cmpxchg.org> (raw)
In-Reply-To: <20090610050342.GA8867@localhost>

Hi Fengguang,

On Wed, Jun 10, 2009 at 01:03:42PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 03:37:02AM +0800, Johannes Weiner wrote:
> > On Tue, Jun 09, 2009 at 09:01:28PM +0200, Johannes Weiner wrote:
> > > [resend with lists cc'd, sorry]
> > 
> > [and fixed Hugh's email.  crap]
> > 
> > > Hi,
> > > 
> > > here is a new iteration of the virtual swap readahead.  Per Hugh's
> > > suggestion, I moved the pte collecting to the callsite and thus out
> > > ouf swap code.  Unfortunately, I had to bound page_cluster due to an
> > > array of that many swap entries on the stack, but I think it is better
> > > to limit the cluster size to a sane maximum than using dynamic
> > > allocation for this purpose.
> 
> Hi Johannes,
> 
> When stress testing your patch, I found it triggered many OOM kills.
> Around the time of last OOMs, the memory usage is:
> 
>              total       used       free     shared    buffers     cached
> Mem:           474        468          5          0          0        239
> -/+ buffers/cache:        229        244
> Swap:         1023        221        802

Wow, that really confused me for a second as we shouldn't read more
pages ahead than without the patch, probably even less under stress.

So the problem has to be a runaway reading.  And indeed, severe
stupidity here:

+       window = cluster << PAGE_SHIFT;
+       min = addr & ~(window - 1);
+       max = min + cluster;
+       /*
+        * To keep the locking/highpte mapping simple, stay
+        * within the PTE range of one PMD entry.
+        */
+       limit = addr & PMD_MASK;
+       if (limit > min)
+               min = limit;
+       limit = pmd_addr_end(addr, max);
+       if (limit < max)
+               max = limit;
+       limit = max - min;

The mistake is at the initial calculation of max.  It should be

	max = min + window;

The resulting problem is that min could get bigger than max when
cluster is bigger than PMD_SHIFT.  Did you use page_cluster == 5?

The initial min is aligned to a value below the PMD boundary and max
based on it with a too small offset, staying below the PMD boundary as
well.  When min is rounded up, this becomes a bit large:

	limit = max - min;

So if my brain is already functioning, fixing the initial max should
be enough because either

	o window is smaller than PMD_SIZE, than we won't round down
	below a PMD boundary in the first place or

	o window is bigger than PMD_SIZE, than we can round down below
	a PMD boundary but adding window to that is garuanteed to
	cross the boundary again

and thus max is always bigger than min.

Fengguang, does this make sense?  If so, the patch below should fix
it.

Thank you,

	Hannes

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2467,7 +2467,7 @@ static int swap_readahead_ptes(struct mm
 
 	window = cluster << PAGE_SHIFT;
 	min = addr & ~(window - 1);
-	max = min + cluster;
+	max = min + window;
 	/*
 	 * To keep the locking/highpte mapping simple, stay
 	 * within the PTE range of one PMD entry.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2009-06-10  7:47 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-09 19:01 [patch v3] swap: virtual swap readahead Johannes Weiner
2009-06-09 19:01 ` Johannes Weiner
2009-06-09 19:37 ` Johannes Weiner
2009-06-09 19:37   ` Johannes Weiner
2009-06-10  5:03   ` Wu Fengguang
2009-06-10  5:03     ` Wu Fengguang
2009-06-10  7:45     ` Johannes Weiner [this message]
2009-06-10  7:45       ` Johannes Weiner
2009-06-10  8:11       ` Wu Fengguang
2009-06-10  8:11         ` Wu Fengguang
2009-06-10  8:32         ` KAMEZAWA Hiroyuki
2009-06-10  8:32           ` KAMEZAWA Hiroyuki
2009-06-10  8:56           ` Wu Fengguang
2009-06-10  8:56             ` Wu Fengguang
2009-06-10  9:42             ` Peter Zijlstra
2009-06-10  9:42               ` Peter Zijlstra
2009-06-10  9:59               ` Wu Fengguang
2009-06-10  9:59                 ` Wu Fengguang
2009-06-10 10:05                 ` Peter Zijlstra
2009-06-10 10:05                   ` Peter Zijlstra
2009-06-10 11:32                   ` Wu Fengguang
2009-06-10 11:32                     ` Wu Fengguang
2009-06-10 17:25                     ` Jesse Barnes
2009-06-10 17:25                       ` Jesse Barnes
2009-06-11  5:22                       ` Wu Fengguang
2009-06-11  5:22                         ` Wu Fengguang
2009-06-11 10:17                         ` Johannes Weiner
2009-06-11 10:17                           ` Johannes Weiner
2009-06-12  1:59                           ` Wu Fengguang
2009-06-12  1:59                             ` Wu Fengguang
2009-06-15 18:22                             ` Johannes Weiner
2009-06-15 18:22                               ` Johannes Weiner
2009-06-18  9:19                               ` Wu Fengguang
2009-06-18  9:19                                 ` Wu Fengguang
2009-06-18 13:01                                 ` Johannes Weiner
2009-06-18 13:01                                   ` Johannes Weiner
2009-06-19  3:30                                   ` Wu Fengguang
2009-06-19  3:30                                     ` Wu Fengguang
2009-06-21 18:07                                   ` Hugh Dickins
2009-06-21 18:07                                     ` Hugh Dickins
2009-06-21 18:37                                     ` Johannes Weiner
2009-06-21 18:37                                       ` Johannes Weiner
2009-06-10  9:30           ` Johannes Weiner
2009-06-10  9:30             ` Johannes Weiner
2009-06-10  6:39   ` KAMEZAWA Hiroyuki
2009-06-10  6:39     ` KAMEZAWA Hiroyuki
2009-06-11  5:31 ` KAMEZAWA Hiroyuki
2009-06-11  5:31   ` KAMEZAWA Hiroyuki
2009-06-17 22:41   ` Johannes Weiner
2009-06-17 22:41     ` Johannes Weiner
2009-06-18  9:29     ` Wu Fengguang
2009-06-18  9:29       ` Wu Fengguang
2009-06-18 13:09       ` Johannes Weiner
2009-06-18 13:09         ` Johannes Weiner
2009-06-19  3:17         ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090610074508.GA1960@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=fengguang.wu@intel.com \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.