From: Johannes Weiner <hannes@cmpxchg.org>
To: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org, Rik van Riel <riel@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Minchan Kim <minchan.kim@gmail.com>,
	Hugh Dickins <hughd@google.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 5/5] mm: refault distance-based file cache sizing
Date: Tue, 1 May 2012 17:38:25 +0200
Message-ID: <20120501153825.GA4837@cmpxchg.org>
In-Reply-To: <20120501141330.GA2207@barrios>

On Tue, May 01, 2012 at 11:13:30PM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Tue, May 01, 2012 at 10:41:53AM +0200, Johannes Weiner wrote:
> > To protect frequently used page cache (workingset) from bursts of less
> > frequently used or one-shot cache, page cache pages are managed on two
> > linked lists.  The inactive list is where all cache starts out on
> > fault and ends on reclaim.  Pages that get accessed another time while
> > on the inactive list get promoted to the active list to protect them
> > from reclaim.
> > 
> > Right now we have two main problems.
> > 
> > One stems from NUMA allocation decisions and how the page allocator
> > and kswapd interact.  The two of them can enter into a perfect loop
> > where kswapd reclaims from the preferred zone of a task, allowing the
> > task to continuously allocate from that zone.  Or, the node distance
> > can lead the allocator to do direct zone reclaim to stay in the
> > preferred zone.  This may be good for locality, but the task has only
> 
> Understood.
> 
> > the inactive space of that one zone to get its memory activated.
> > Forcing the allocator to spread out to lower zones in the right
> > situation makes the difference between continuous IO to serve the
> > workingset, or taking the numa cost but serving fully from memory.
> 
> It's hard for me to parse this.
> Could you elaborate on it?
> It would be good if you could give an example.

Say your Normal zone is 4G (DMA32 also 4G) and you have 2G of active
file pages in Normal and DMA32 is full of other stuff.  Now you access
a new 6G file repeatedly.  First it allocates from Normal (preferred),
then tries DMA32 (full), wakes up kswapd and retries all zones.  If
kswapd then frees pages at roughly the same pace as the allocator
allocates from Normal, kswapd never goes to sleep and evicts pages
from the 6G file before they can get accessed a second time.  Even
though the 6G file could fit in memory (4G Normal + 4G DMA32), the
allocator only uses the 4G Normal zone.
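
As a rough illustration of the example above, here is a minimal Python
sketch (a toy model, not kernel code).  It assumes one unit stands for
1G, treats the inactive list as a plain FIFO, and ignores promotion to
the active list; the function name simulate() and its parameters are
hypothetical.  It compares the 2G of inactive space left in Normal with
the 6G that would be available if DMA32 were used as well.

```python
from collections import OrderedDict


def simulate(file_units, inactive_capacity, passes):
    """Read a file sequentially `passes` times through a FIFO inactive list.

    Returns (hits, faults).  A hit means the unit was still on the
    inactive list when it was accessed again.
    """
    inactive = OrderedDict()
    hits = 0
    faults = 0
    for _ in range(passes):
        for unit in range(file_units):
            if unit in inactive:
                hits += 1
                continue
            faults += 1
            if len(inactive) >= inactive_capacity:
                # Evict the oldest entry, as reclaim does at the list tail.
                inactive.popitem(last=False)
            inactive[unit] = True
    return hits, faults


# 6G file, 2G of inactive space in Normal only: every access faults.
print(simulate(6, 2, 3))
# 6G file, 6G of inactive space (Normal plus DMA32): only the first pass faults.
print(simulate(6, 6, 3))
```

In this model the small list never produces a single hit, no matter how
many passes are made, because each unit is evicted before the loop comes
back around to it.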

The same applies if you have a load that would fit in the memory of
two nodes but the node distance leads the allocator to do
zone_reclaim() and force the pages to stay in one node, again
preventing the load from being fully cached in memory, which is much
more expensive than the foreign node cost.

> > up to half of memory, and don't recognize workingset changes that are
> > bigger than half of memory.
> 
> Workingset change?
> You mean if the new workingset is bigger than half of memory and it
> behaves like a stream before being touched again, we could cache only
> part of the workingset because the head pages of the workingset would
> be discarded by its tail pages in the inactive list?

Spot-on.  I called that 'tail-chasing' in my notes :-)  You are in a
perpetual loop of evicting pages that you will need again within a
couple hundred page faults.  Those couple hundred page faults are the
refault distance; my code is able to detect these loops and increase
the space available to the inactive list to end them, if possible.
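
The idea of measuring a refault distance can be sketched as follows.
This is a minimal Python model under stated assumptions, not the code
from the series: it assumes that a fault counter is snapshotted when a
page is evicted and compared against the counter when the same page
faults back in.  The class RefaultTracker and all of its attributes are
hypothetical names.

```python
from collections import OrderedDict


class RefaultTracker:
    """Toy FIFO inactive list that remembers when each page was evicted."""

    def __init__(self, inactive_capacity):
        self.capacity = inactive_capacity
        self.inactive = OrderedDict()
        self.shadows = {}  # evicted page -> fault counter at eviction time
        self.faults = 0

    def access(self, page):
        """Return the refault distance if `page` was evicted earlier, else None."""
        if page in self.inactive:
            return None  # still cached, no fault
        self.faults += 1
        distance = None
        if page in self.shadows:
            # Faults that happened between eviction and this refault.
            distance = self.faults - self.shadows.pop(page)
        if len(self.inactive) >= self.capacity:
            victim, _ = self.inactive.popitem(last=False)
            self.shadows[victim] = self.faults
        self.inactive[page] = True
        return distance


tracker = RefaultTracker(inactive_capacity=2)
first_pass = [tracker.access(p) for p in range(6)]
second_pass = [tracker.access(p) for p in range(6)]
print(first_pass)
print(second_pass)
```

In this toy model, looping over 6 pages with room for 2 gives a
distance of 4 on every refault, which is exactly the amount of extra
inactive space that would have kept the page cached.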

This is the whole principle of the series.

If such a loop is recognized in a single zone, the allocator goes for
lower zones to increase the inactive space.  If such a loop is
recognized over all allowed zones in the zonelist, the active lists
are shrunk to increase the inactive space.
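
One way to read the two rules above is as a small decision function.
The Python sketch below is only an interpretation under assumptions:
the per-zone "thrashing" flag stands in for whatever loop detection the
code performs, and choose_action() and its return values are
hypothetical names, not part of the series.

```python
def choose_action(zonelist):
    """Pick an action from per-zone loop detection results.

    `zonelist` is a list of (zone_name, thrashing) pairs in allocation
    order, preferred zone first.
    """
    for zone_name, thrashing in zonelist:
        if not thrashing:
            # A zone without a detected loop can still provide
            # inactive space, so allocate there.
            return ("allocate", zone_name)
    # The loop was detected in every allowed zone: the only way to
    # gain inactive space is to shrink the active lists.
    return ("shrink_active", None)


print(choose_action([("Normal", True), ("DMA32", False)]))
print(choose_action([("Normal", True), ("DMA32", True)]))
```

The first call models a loop in the preferred zone only, so the sketch
falls back to the lower zone; the second models a loop across the whole
zonelist, so it asks for the active lists to be shrunk.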
