From: Johannes Weiner <jweiner@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@redhat.com>,
	"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
	Christoph Hellwig <hch@infradead.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Mel Gorman <mgorman@suse.de>,
	Wu Fengguang <fengguang.wu@intel.com>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Date: Mon, 11 Jul 2011 19:20:50 +0200	[thread overview]
Message-ID: <20110711172050.GA2849@redhat.com> (raw)
In-Reply-To: <20110708095456.GI1026@dastard>

On Fri, Jul 08, 2011 at 07:54:56PM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> > On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > We have to remember that memory reclaim is doing LRU reclaim and the
> > > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > > to operate in the same direction (oldest to youngest) for the same
> > > purpose.  The fundamental problem that occurs when memory reclaim
> > > starts writing pages back from the LRU is this:
> > > 
> > > 	- memory reclaim has run ahead of IO writeback -
> > > 
> > > The LRU usually looks like this:
> > > 
> > > 	oldest					youngest
> > > 	+---------------+---------------+--------------+
> > > 	clean		writeback	dirty
> > > 			^		^
> > > 			|		|
> > > 			|		Where flusher will next work from
> > > 			|		Where kswapd is working from
> > > 			|
> > > 			IO submitted by flusher, waiting on completion
> > > 
> > > 
> > > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > > got ahead of writeback without being throttled - it's passed over
> > > all the pages currently under writeback and is trying to write back
> > > pages that are *newer* than what writeback is working on. IOWs, it
> > > starts trying to do the job of the flusher threads, and it does that
> > > very badly.
> > > 
> > > The $100 question is *why is it getting ahead of writeback*?
> > 
> > Unless you have a purely sequential writer, the LRU order is - at
> > least in theory - diverging away from the writeback order.
> 
> Which is the root cause of the IO collapse that writeback from the
> LRU causes, yes?
> 
> > According to the reasoning behind generational garbage collection,
> > they should in fact be inverse to each other.  The oldest pages still
> > in use are the most likely to be still needed in the future.
> > 
> > In practice we only make a generational distinction between used-once
> > and used-many, which manifests in the inactive and the active list.
> > But still, when reclaim starts off with a localized writer, the oldest
> > pages are likely to be at the end of the active list.
> 
> Yet the file pages on the active list are unlikely to be dirty -
> overwrite-in-place cache hot workloads are pretty scarce in my
> experience. Hence writeback of dirty pages from the active LRU is
> unlikely to be a problem.

Just to clarify, I looked at this too much from the reclaim POV, where
use-once applies to full pages, not bytes.

Even if you never overwrite the same bytes twice, two subsequent
write()s that land in the same page will activate it.

Are your workloads writing in perfectly page-aligned chunks?
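
To make the non-aligned case concrete, here is a toy sketch (plain
Python, made-up byte counts) of how two back-to-back 3000-byte
write()s map onto 4096-byte pages -- the second write touches page 0 a
second time, which is exactly the second reference that activates it:

```python
PAGE_SIZE = 4096

def pages_touched(offset, length):
    """Return the set of page indices covered by a write at byte offset."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return set(range(first, last + 1))

# Two sequential, non-page-aligned 3000-byte write()s to the same file.
w1 = pages_touched(0, 3000)      # {0}
w2 = pages_touched(3000, 3000)   # {0, 1} -- touches page 0 again

# Page 0 is referenced by both writes; only perfectly page-aligned
# chunks avoid this double reference.
twice_referenced = w1 & w2       # {0}
```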

This effect may build up slowly, but every page that is written from
the active list makes room for a dirty page on the inactive list wrt
the dirty limit.  I.e. without the active pages, you have 10-20% dirty
pages at the head of the inactive list (default dirty ratio), or an
80-90% clean tail, and for every page cleaned, a new dirty page can
appear at the inactive head.
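
As a back-of-the-envelope illustration of that headstart (hypothetical
page counts, assuming a 20% dirty ratio):

```python
# Toy arithmetic for the flusher's headstart over the reclaimer.
total_pages = 1000        # file pages on the LRU (made-up number)
dirty_ratio = 0.20        # e.g. vm.dirty_ratio = 20

# Inactive-only case: the dirty pages sit at the inactive head,
# behind a long clean tail that reclaim can chew through first.
dirty = int(total_pages * dirty_ratio)           # 200
clean_headstart = total_pages - dirty            # 800

# With, say, half the pages parked on the (unscanned) active list,
# the same dirty pages now hide behind a much shorter clean tail.
active = 500
clean_headstart_with_active = total_pages - dirty - active   # 300
```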

But taking the active list into account, some of these clean pages
are taken away from the headstart the flusher has over the reclaimer;
they sit on the active list instead.  For every page cleaned, a new dirty page can
appear at the inactive head, plus a few deactivated clean pages.

Now, the active list is no longer scanned until it grows bigger than
the inactive list, giving the flushers plenty of time to clean the
pages on it and let them accumulate even while memory pressure is
already occurring.  For every page cleaned, a new dirty page can
appear at the inactive head, plus a LOT of deactivated clean pages.
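
The heuristic in question -- scan the active file list only once it
outgrows the inactive one -- can be sketched like this (a toy model of
the logic the commit quoted below introduced, not the actual kernel
code):

```python
def inactive_list_is_low(nr_active, nr_inactive):
    """Toy model: the active file list is only scanned when the
    inactive list has shrunk below it."""
    return nr_active > nr_inactive

# Streaming IO keeps the inactive list large, so the active list
# sits untouched and clean pages accumulate on it ...
assert not inactive_list_is_low(nr_active=200, nr_inactive=800)
# ... until reclaim has eaten the inactive list down far enough.
assert inactive_list_is_low(nr_active=200, nr_inactive=100)
```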

So when memory needs to be reclaimed, the LRU lists in those three
scenarios look like this:

	inactive-only: [CCCCCCCCDD][]

	active-small:  [CCCCCCDD][CC]

	active-huge:   [CCCDD][CCCCC]

where the third scenario is the most likely for the reclaimer to run
into dirty pages.
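
A toy model of the three layouts (hypothetical page counts) makes the
same point numerically -- how many clean pages reclaim can free before
it first hits a dirty one:

```python
# 'C' = clean page, 'D' = dirty page.  Each pair is (inactive list,
# active list), oldest page first; reclaim scans the inactive list
# from its oldest end.
scenarios = {
    "inactive-only": ("CCCCCCCCDD", ""),
    "active-small":  ("CCCCCCDD",   "CC"),
    "active-huge":   ("CCCDD",      "CCCCC"),
}

def clean_headstart(inactive):
    """Clean pages reclaim can free before meeting the first dirty one."""
    return inactive.index("D") if "D" in inactive else len(inactive)

for name, (inactive, _active) in scenarios.items():
    print(f"{name}: {clean_headstart(inactive)} clean pages before dirty")
# The larger the unscanned active list, the shorter the clean runway,
# and the sooner the reclaimer runs into dirty pages.
```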

I CC'd Rik for reclaim wizardry.  But if I am not completely off with
this, there is a chance that the change that let the active list grow
unscanned may actually have contributed to this single-page writing
problem becoming worse?

commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
Author: Rik van Riel <riel@redhat.com>
Date:   Tue Jun 16 15:32:28 2009 -0700

    vmscan: evict use-once pages first
    
    When the file LRU lists are dominated by streaming IO pages, evict those
    pages first, before considering evicting other pages.
    
    This should be safe from deadlocks or performance problems
    because only three things can happen to an inactive file page:
    
    1) referenced twice and promoted to the active list
    2) evicted by the pageout code
    3) under IO, after which it will get evicted or promoted
    
    The pages freed in this way can either be reused for streaming IO, or
    allocated for something else.  If the pages are used for streaming IO,
    this pageout pattern continues.  Otherwise, we will fall back to the
    normal pageout pattern.
    
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Reported-by: Elladan <elladan@eskimo.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
