From: Wu Fengguang <fengguang.wu@intel.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"Li, Shaohua" <shaohua.li@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"richard@rsk.demon.co.uk" <richard@rsk.demon.co.uk>,
	"jens.axboe@oracle.com" <jens.axboe@oracle.com>
Subject: Re: regression in page writeback
Date: Mon, 28 Sep 2009 15:15:07 +0800
Message-ID: <20090928071507.GA20068@localhost>
In-Reply-To: <20090928010700.GE9464@discord.disaster>

On Mon, Sep 28, 2009 at 09:07:00AM +0800, Dave Chinner wrote:
> On Fri, Sep 25, 2009 at 02:45:03PM +0800, Wu Fengguang wrote:
> > On Fri, Sep 25, 2009 at 01:04:13PM +0800, Dave Chinner wrote:
> > > On Thu, Sep 24, 2009 at 08:38:20PM -0400, Chris Mason wrote:
> > > > On Fri, Sep 25, 2009 at 10:11:17AM +1000, Dave Chinner wrote:
> > > > > On Thu, Sep 24, 2009 at 11:15:08AM +0800, Wu Fengguang wrote:
> > > > > > On Wed, Sep 23, 2009 at 10:00:58PM +0800, Chris Mason wrote:
> > > > > > > The only place that actually honors the congestion flag is pdflush.
> > > > > > > It's trivial to get pdflush backed up and make it sit down without
> > > > > > > making any progress because once the queue congests, pdflush goes away.
> > > > > > 
> > > > > > Right. I guess that's more or less intentional - to give lowest priority
> > > > > > to periodic/background writeback.
> > > > > 
> > > > > IMO, this is the wrong design. Background writeback should
> > > > > have higher CPU/scheduler priority than normal tasks. If there are
> > > > > sufficient dirty pages in the system for background writeback to
> > > > > be active, it should be running *now* to start as much IO as it can
> > > > > without being held up by other, lower priority tasks.
> > > > 
> > > > I'd say that an fsync from mutt or vi should be done at a higher prio
> > > > than a background streaming writer.
> > > 
> > > I don't think you caught everything I said - synchronous IO is
> > > un-throttled.
> > 
> > O_SYNC writes may be un-throttled in theory; in practice, however,
> > they seem to be throttled:
> > 
> >   generic_file_aio_write
> >     __generic_file_aio_write
> >       generic_file_buffered_write
> >         generic_perform_write
> >           balance_dirty_pages_ratelimited
> >     generic_write_sync
> > 
> > Do you mean some other code path?
> 
> In the context of the setup I was talking about, I meant that sync
> IO _should_ be unthrottled because it is self-throttling by its
> very nature. The current code makes no differentiation between the
> two.

Yes, O_SYNC writers are double throttled now: once in
balance_dirty_pages() on the buffered write path, and again in the
synchronous wait in generic_write_sync().
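
To illustrate (an untested sketch; the helper name is made up), the
first throttle point could be skipped for O_SYNC writers, who will
block in generic_write_sync() anyway:

/*
 * Untested sketch: an O_SYNC writer is already throttled by the
 * synchronous wait in generic_write_sync(), so the dirty throttling
 * on the buffered write path could be skipped for it.
 */
static void balance_dirty_pages_unless_sync(struct file *file,
					    struct address_space *mapping)
{
	if (file->f_flags & O_SYNC)
		return;		/* self-throttling */

	balance_dirty_pages_ratelimited(mapping);
}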

> > > Background writeback should dump async IO to the elevator as fast as
> > > it can, then get the hell out of the way. If you've got a UP system,
> > > then the fsync can't be issued at the same time pdflush is running
> > > (same as right now), and if you've got an MP system then fsync can
> > > run at the same time.
> > 
> > I think you are right for system wide sync.
> > 
> > System wide sync seems to always wait for the queued bdi writeback
> > work items to finish, which should be fine in terms of efficiency,
> > except that sync could end up doing more work and even livelock.
> > 
> > > On the premise that sync IO is unthrottled and given that elevators
> > > queue and issue sync IO separately from async writes, fsync latency
> > > would be entirely derived from the elevator queuing behaviour, not
> > > the CPU priority of pdflush.
> > 
> > It's not exactly CPU priority, but queue fullness priority.
> 
> That's exactly what I implied. The elevator manages the
> queue fullness and decides when to block background or
> foreground writes. The problem is, the elevator can't make a sane
> scheduling decision because it can't tell the difference between
> async and sync IO, because we don't propagate that information to
> the block layer from the VFS.
> 
> We have all the smarts in the block layer interface to distinguish
> between sync and async IO and the elevators do smart stuff with this
> information. But by throwing away that information at the VFS level,
> we hamstring the elevator scheduler because it never sees any
> "synchronous" write IO for data writes. Hence any synchronous data
> write gets stuck in the same queue with all the background stuff
> and doesn't get priority.

Yes, this is a problem. We may also need to add priority awareness to
get_request_wait() to get a complete solution.
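
As for propagating the sync/async distinction down from the VFS,
maybe something as simple as this would do for the writepage paths
that go through submit_bh() (untested sketch; wbc_write_op() is a
hypothetical helper):

/*
 * Untested sketch: derive the block layer write op from the
 * writeback mode, so that WB_SYNC_ALL writes (fsync, sync) reach
 * the elevator flagged as synchronous IO.
 */
static inline int wbc_write_op(struct writeback_control *wbc)
{
	return wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE;
}

/* e.g. in a filesystem's ->writepage: */
submit_bh(wbc_write_op(wbc), bh);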

> Hence right now if you issue an fsync or pageout, it's a crap shoot
> as to whether the elevator will schedule it first or last behind
> other IO. The fact that they then ignore congestion is relying on a
> side effect to stop background writeback and allow the fsync to
> monopolise the elevator. It is not predictable and hence IO patterns
> under load will change all the time regardless of whether the system
> is in a steady state or not.
>
> IMO there are architectural failings from top to bottom in the
> writeback stack - while people are interested in fixing stuff, I
> figured that they should be pointed out to give y'all something to
> think about...

Thanks, your information helps a lot.

> > fsync operations always use nonblocking=0, so in fact they _used to_
> > enjoy better priority than pdflush. The same goes for vmscan pageout,
> > which calls writepage directly. Neither will back off on a congested
> > bdi.
> > 
> > So when fsync/pageout requests come in, they will always be served
> > first.
> 
> pageout is so horribly inefficient from an IO perspective it is not
> funny. It is one of the reasons Linux sucks so much when under
> memory pressure. It basically causes the system to do random 4k
> writeback of dirty pages (and lumpy reclaim can make it
> synchronous!). 
> 
> pageout needs an enema, and preferably it should defer to background
> writeback to clean pages. background writeback will clean pages
> much, much faster than the random crap that pageout spews at the
> disk right now.
> 
> Given that I can basically lock up my 2.6.30-based laptop for 10-15
> minutes at a time with the disk running flat out in low memory
> situations simply by starting to copy a large file(*), I think that
> the way we currently handle dirty page writeback needs a bit of a
> rethink.
> 
> (*) I had this happen 4-5 times last week moving VM images around on
> my laptop, and it involved the Linux VM switching between pageout
> and swapping to make more memory available while the copy was
> hammering the same drive with dirty pages from foreground writeback.
> It made for extremely fragmented files when the machine finally
> recovered because of the non-sequential writeback patterns on the
> single file being copied.  You can't tell me that this is sane,
> desirable behaviour, and this is the sort of problem that I want
> sorted out. I don't believe it can be fixed by maintaining the
> number of uncoordinated, competing writeback mechanisms we currently
> have.

I imagined some lumpy pageout policy would help, but didn't realize
it was such a severe problem that could show up in daily desktop
workloads.

Below is a quick patch: when pageout() hits a dirty page, it now also
tries to write out the contiguous run of dirty pages following it, up
to 512KB in total, so that reclaim issues larger sequential IO instead
of single random 4k writes. Any comments?

> > Small random IOs may hurt a bit though.
> 
> They *always* hurt, and under load, that appears to be the common IO
> pattern that Linux is generating....

Thanks,
Fengguang
---

vmscan: lumpy pageout

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |   77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 9 deletions(-)

--- linux.orig/mm/vmscan.c	2009-09-28 11:45:48.000000000 +0800
+++ linux/mm/vmscan.c	2009-09-28 14:45:19.000000000 +0800
@@ -344,6 +344,70 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+#define LUMPY_PAGEOUT_PAGES	(512 * 1024 / PAGE_CACHE_SIZE)
+
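+/*
+ * Write out @page together with the run of dirty pages immediately
+ * following it in @mapping, up to LUMPY_PAGEOUT_PAGES pages in
+ * total, so that pageout issues larger sequential IO instead of
+ * random 4k writes.
+ */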
+static pageout_t try_lumpy_pageout(struct page *page,
+				   struct address_space *mapping,
+				   struct writeback_control *wbc)
+{
+	struct page *pages[PAGEVEC_SIZE];
+	pgoff_t start;
+	int total;
+	int count;
+	int i;
+	int err;
+	int res = 0;
+
+	page_cache_get(page);
+	pages[0] = page;
+	i = 0;
+	count = 1;
+	start = page->index + 1;
+
+	for (total = LUMPY_PAGEOUT_PAGES; total > 0; total--) {
+		if (i >= count) {
+			i = 0;
+			count = find_get_pages(mapping, start,
+					       min(total, PAGEVEC_SIZE), pages);
+			if (!count)
+				break;
+
+			/* stop if the pages found are not contiguous */
+			if (start + count - 1 != pages[count - 1]->index)
+				break;
+
+			start += count;
+		}
+
+		page = pages[i];
+		if (!PageDirty(page))
+			break;
+		if (!PageActive(page))
+			SetPageReclaim(page);
+		err = mapping->a_ops->writepage(page, wbc);
+		if (err < 0)
+			handle_write_error(mapping, page, err);
+		if (err == AOP_WRITEPAGE_ACTIVATE) {
+			ClearPageReclaim(page);
+			res = PAGE_ACTIVATE;
+			break;
+		}
+		page_cache_release(page);
+		i++;
+	}
+
+	for (; i < count; i++)
+		page_cache_release(pages[i]);
+
+	return res;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -392,21 +456,16 @@ static pageout_t pageout(struct page *pa
 		int res;
 		struct writeback_control wbc = {
 			.sync_mode = WB_SYNC_NONE,
-			.nr_to_write = SWAP_CLUSTER_MAX,
+			.nr_to_write = LUMPY_PAGEOUT_PAGES,
 			.range_start = 0,
 			.range_end = LLONG_MAX,
 			.nonblocking = 1,
 			.for_reclaim = 1,
 		};
 
-		SetPageReclaim(page);
-		res = mapping->a_ops->writepage(page, &wbc);
-		if (res < 0)
-			handle_write_error(mapping, page, res);
-		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			ClearPageReclaim(page);
-			return PAGE_ACTIVATE;
-		}
+		res = try_lumpy_pageout(page, mapping, &wbc);
+		if (res == PAGE_ACTIVATE)
+			return res;
 
 		/*
 		 * Wait on writeback if requested to. This happens when
