From: Dave Chinner <david@fromorbit.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Chris Mason <chris.mason@oracle.com>,
	Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Christoph Hellwig <hch@infradead.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
Date: Tue, 15 Jun 2010 13:20:34 +1000	[thread overview]
Message-ID: <20100615032034.GR6590@dastard> (raw)
In-Reply-To: <20100614183957.ad0cdb58.akpm@linux-foundation.org>

On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 09:11:44 +1000
> > > > 10-15% reduction in seeks on simple kernel compile workloads. This
> > > > shows that if we optimise IO patterns at higher layers where the
> > > > sort window is much, much larger than in the IO scheduler, then
> > > > overall system performance improves....
> > > 
> > > Yup.
> > > 
> > > But then, this all really should be done at the block layer so other
> > > io-submitting-paths can benefit from it.
> > 
> > That was what we did in the past with really, really deep IO
> > scheduler queues. That leads to IO latency and OOM problems because
> > we could lock gigabytes of memory away under IO and take minutes to
> > clean it.
> > 
> > Besides, there really isn't the right context in the block layer to
> > be able to queue and prioritise large amounts of IO without
> > significant penalties to some higher layer operation.
> > 
> > > IOW, maybe "the sort queue is the submission queue" wasn't a good idea.
> > 
> > Perhaps, but IMO sorting should be done where the context allows it
> > to be done most efficiently. Sorting is most effective whenever a
> > significant queue of IO is formed, whether it be in the filesystem,
> > the VFS, the VM or the block layer because the IO stack is very much
> > a GIGO queue.
> > 
> > Simply put, there's nothing the lower layers can do to optimise bad
> > IO patterns from the higher layers because they have small sort
> > windows which are necessary to keep IO latency in check. Hence if
> > the higher layers feed the lower layers crap they simply don't have
> > the context or depth to perform the same level of optimisations we
> > can do easily higher up the stack.
> > 
> > IOWs, IMO anywhere there is a context with significant queue of IO,
> > that's where we should be doing a better job of sorting before that
> > IO is dispatched to the lower layers. This is still no guarantee of
> > better IO (e.g. if the filesystem fragments the file) but it does
> > give the lower layers a far better chance at optimal allocation and
> > scheduling of IO...
> 
> None of what you said had much to do with what I said.
> 
> What you've described are implementation problems in the current block
> layer because it conflates "sorting" with "queueing".  I'm saying "fix
> that".

You can't sort until you've queued.

> And...  sorting at the block layer will always be superior to sorting
> at the pagecache layer because the block layer sorts at the physical
> block level and can handle not-well-laid-out files and can sort and merge
> pages from different address_spaces.

Yes, it can do that. And it can still do that even if the higher
layers sort their I/O dispatch better.

Filesystems try very hard to allocate adjacent logical offsets in a
file in adjacent physical blocks on disk - that's the whole point of
extent-indexed filesystems. Hence with modern filesystems there is
generally a direct correlation between the page {mapping,index}
tuple and the physical location of the mapped block.

i.e. there is generally zero physical correlation between pages in
different mappings, but there is a high physical correlation
between the index of pages on the same mapping. Hence by sorting
where we have a {mapping,index} context, we push out IO that is
much more likely to be in contiguous physical chunks than the
current random page shootdown produces.

We optimise applications to use these sorts of correlations all the
time to improve IO patterns. Why can't we make the same sort of
optimisations to the IO that the VM issues?

> Still, I suspect none of it will improve anything anyway.  Those pages
> are still dirty, possibly-locked and need to go to disk.  It doesn't
> matter from the MM POV whether they sit in some VM list or in the
> request queue.

Oh, but it does.....

> Possibly there may be some benefit to not putting so many of these
> unreclaimable pages into the queue all at the same time.  But
> that's a shortcoming in the block code: we should be able to shove
> arbitrary numbers of dirty page (segments) into the queue and not gum
> the system up.  Don't try to work around that in the VM.

I think you know perfectly well why the system gums up when we
increase block layer queue depth: it's the fact that the _VM_ relies
on block layer queue congestion to limit the amount of dirty memory
in the system.

We've got a feedback loop between the block layer and the VM that
only works if block device queues are kept shallow. Keeping the
number of dirty pages under control is a VM responsibility, but the
VM meets it by putting limitations on the block layer to ensure that
it works correctly.  If you want the block layer to have deep queues,
then someone needs to fix the VM not to require knowledge of the
internal operation of the block layer for correct operation.

Adding a few lines of code to sort a list in the VM is far, far
easier than redesigning the write throttling code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

