From: Dave Chinner <david@fromorbit.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Chris Mason <chris.mason@oracle.com>, Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>, Johannes Weiner <hannes@cmpxchg.org>, Christoph Hellwig <hch@infradead.org>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
Date: Tue, 15 Jun 2010 10:39:43 +1000
Message-ID: <20100615003943.GK6590@dastard>
In-Reply-To: <20100614162143.04783749.akpm@linux-foundation.org>

On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 09:11:44 +1000
> Dave Chinner <david@fromorbit.com> wrote:
>
> > On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> > > Page reclaim cleans individual pages using a_ops->writepage() because, from
> > > the VM perspective, it is known that pages in a particular zone must be freed
> > > soon, it considers the target page to be the oldest, and it does not want
> > > to wait while background flushers clean other pages. From a filesystem
> > > perspective this is extremely inefficient as it generates a very seeky
> > > IO pattern, leading to the perverse situation where it can take longer to
> > > clean all dirty pages than it would have otherwise.
> > >
> > > This patch queues all dirty pages at once to maximise the chances that
> > > the write requests get merged efficiently. It also makes the next patch,
> > > which avoids writeout from direct reclaim, more straightforward.
> >
> > Seeing as you have a list of pages for IO, perhaps they could be sorted
> > before issuing ->writepage on them.
> >
> > That is, while this patch issues all the IO in one hit, it doesn't
> > change the order in which the IO is issued - it is still issued in
> > LRU order.
> > Given that the pages are issued in a short period of time now,
> > rather than across a longer scan period, it is likely that it will
> > not be any faster as:
> >
> >	a) IO will not be started as soon, and
> >	b) the IO scheduler still only has a small re-ordering
> >	   window and will choke just as much on random IO patterns.
> >
> > However, there is a list_sort() function that could be used to sort
> > the list; sorting the list of pages by mapping, and by page->index
> > within the mapping, would result in all the pages on each mapping
> > being sent down in ascending offset order at once - exactly how
> > filesystems want IO to be sent to them. Perhaps this is a simple
> > improvement that can be made to this code that will make a big
> > difference to worst-case performance.
> >
> > FWIW, I did this for delayed metadata buffer writeback in XFS
> > recently (i.e. sort the queue of (potentially tens of thousands of)
> > buffers into ascending block order before dispatch) and that showed a
> > 10-15% reduction in seeks on simple kernel compile workloads. This
> > shows that if we optimise IO patterns at higher layers, where the
> > sort window is much, much larger than in the IO scheduler, then
> > overall system performance improves....
>
> Yup.
>
> But then, this all really should be done at the block layer so other
> io-submitting-paths can benefit from it.

That was what we did in the past with really, really deep IO scheduler
queues. That leads to IO latency and OOM problems, because we could lock
gigabytes of memory away under IO and take minutes to clean it. Besides,
there really isn't the right context in the block layer to be able to
queue and prioritise large amounts of IO without significant penalties
to some higher layer operation.

> IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

Perhaps, but IMO sorting should be done where the context allows it to
be done most efficiently.
Sorting is most effective whenever a significant queue of IO forms,
whether in the filesystem, the VFS, the VM or the block layer, because
the IO stack is very much a GIGO queue. Simply put, there's nothing the
lower layers can do to optimise bad IO patterns from the higher layers,
because they have small sort windows, which are necessary to keep IO
latency in check. Hence if the higher layers feed the lower layers crap,
the lower layers simply don't have the context or depth to perform the
same level of optimisations we can do easily higher up the stack.

IOWs, IMO anywhere there is a context with a significant queue of IO,
that's where we should be doing a better job of sorting before that IO
is dispatched to the lower layers. This is still no guarantee of better
IO (e.g. if the filesystem fragments the file), but it does give the
lower layers a far better chance at optimal allocation and scheduling
of IO...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com