All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wu Fengguang <fengguang.wu@intel.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"xfs@oss.sgi.com" <xfs@oss.sgi.com>, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <jweiner@redhat.com>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Date: Fri, 1 Jul 2011 23:41:36 +0800	[thread overview]
Message-ID: <20110701154136.GA17881@localhost> (raw)
In-Reply-To: <20110701093305.GA28531@infradead.org>

Christoph,

On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> Johannes, Mel, Wu,
> 
> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
> 
> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
> 
> Any chance we could the writeback vs kswap behaviour sorted out a bit
> better finally?

I once tried this approach:

http://www.spinics.net/lists/linux-mm/msg09202.html

It used a list structure that is not linearly scalable, however that
part should be independently improvable when necessary.

The real problem was, it seem to not very effective in my test runs.
I found many ->nr_pages works queued before the ->inode works, which
effectively makes the flusher working on more dispersed pages rather
than focusing on the dirty pages encountered in LRU reclaim.

So for the patch to work efficiently, we'll need to first merge the
->nr_pages works and make them lower priority than the ->inode works.

Thanks,
Fengguang

> Some excerpts from the previous discussion:
> 
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem.  That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case i could
> > reproduce the problem on.
> > 
> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> > 
> > --- 180.out     2010-04-28 15:00:22.000000000 +1000
> > +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> > @@ -1 +1,9 @@
> >  QA output created by 180
> > +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> > +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> > +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> > +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> > +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> > +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> > +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> > +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
> > 
> > $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> > 0x244093 10473472 81
> > 0x244098 10371072 86
> > 0x244099 10104832 87
> > 0x24409a 10125312 88
> > 0x24409b 10469376 89
> > 0x24409c 10240000 90
> > 0x24409d 10362880 91
> > 0x24409e 10366976 92
> > 
> > So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> > call in the trace (got a separate patch for that) is:
> > 
> >            <...>-393   [000] 696245.229559: xfs_ilock_nowait:     dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >            <...>-393   [000] 696245.229560: xfs_setfilesize:      dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> >            <...>-393   [000] 696245.229561: xfs_iunlock:          dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > 
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> > 
> > It is very clear that from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> > 
> > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > 78
> > 
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> > 
> > $ ps -ef |grep [k]swap
> > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " |wc -l
> > 799
> > 
> > Oh, now that is too close to just be a co-incidence. We're getting
> > significant amounts of random page writeback from the the ends of
> > the LRUs done by the VM.
> > 
> > <sigh>
> 
> 
> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good.  I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> > 
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with it's disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > 
> > When kswapd starts doing writeback from the LRU, the iops rate goes
> > through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> > throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> > the IOPS stays as high as it does - maybe that is why I saw this and
> > you haven't.
> > 
> > As it is, the kswapd writeback behaviour is utterly atrocious and,
> > ultimately, quite easy to provoke. I wish the MM folk would fix that
> > goddamn problem already - we've only been complaining about it for
> > the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> > even consider removing the .writepage clustering...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

WARNING: multiple messages have this Message-ID (diff)
From: Wu Fengguang <fengguang.wu@intel.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <jweiner@redhat.com>,
	Dave Chinner <david@fromorbit.com>,
	"xfs@oss.sgi.com" <xfs@oss.sgi.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Date: Fri, 1 Jul 2011 23:41:36 +0800	[thread overview]
Message-ID: <20110701154136.GA17881@localhost> (raw)
In-Reply-To: <20110701093305.GA28531@infradead.org>

Christoph,

On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> Johannes, Mel, Wu,
> 
> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
> 
> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
> 
> Any chance we could the writeback vs kswap behaviour sorted out a bit
> better finally?

I once tried this approach:

http://www.spinics.net/lists/linux-mm/msg09202.html

It used a list structure that is not linearly scalable, however that
part should be independently improvable when necessary.

The real problem was, it seem to not very effective in my test runs.
I found many ->nr_pages works queued before the ->inode works, which
effectively makes the flusher working on more dispersed pages rather
than focusing on the dirty pages encountered in LRU reclaim.

So for the patch to work efficiently, we'll need to first merge the
->nr_pages works and make them lower priority than the ->inode works.

Thanks,
Fengguang

> Some excerpts from the previous discussion:
> 
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem.  That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case i could
> > reproduce the problem on.
> > 
> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> > 
> > --- 180.out     2010-04-28 15:00:22.000000000 +1000
> > +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> > @@ -1 +1,9 @@
> >  QA output created by 180
> > +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> > +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> > +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> > +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> > +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> > +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> > +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> > +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
> > 
> > $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> > 0x244093 10473472 81
> > 0x244098 10371072 86
> > 0x244099 10104832 87
> > 0x24409a 10125312 88
> > 0x24409b 10469376 89
> > 0x24409c 10240000 90
> > 0x24409d 10362880 91
> > 0x24409e 10366976 92
> > 
> > So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> > call in the trace (got a separate patch for that) is:
> > 
> >            <...>-393   [000] 696245.229559: xfs_ilock_nowait:     dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >            <...>-393   [000] 696245.229560: xfs_setfilesize:      dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> >            <...>-393   [000] 696245.229561: xfs_iunlock:          dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > 
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> > 
> > It is very clear that from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> > 
> > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > 78
> > 
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> > 
> > $ ps -ef |grep [k]swap
> > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " |wc -l
> > 799
> > 
> > Oh, now that is too close to just be a co-incidence. We're getting
> > significant amounts of random page writeback from the the ends of
> > the LRUs done by the VM.
> > 
> > <sigh>
> 
> 
> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good.  I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> > 
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with it's disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > 
> > When kswapd starts doing writeback from the LRU, the iops rate goes
> > through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> > throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> > the IOPS stays as high as it does - maybe that is why I saw this and
> > you haven't.
> > 
> > As it is, the kswapd writeback behaviour is utterly atrocious and,
> > ultimately, quite easy to provoke. I wish the MM folk would fix that
> > goddamn problem already - we've only been complaining about it for
> > the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> > even consider removing the .writepage clustering...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2011-07-01 15:41 UTC|newest]

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-29 14:01 [PATCH 00/27] patch queue for Linux 3.1 Christoph Hellwig
2011-06-29 14:01 ` [PATCH 01/27] xfs: PF_FSTRANS should never be set in ->writepage Christoph Hellwig
2011-06-30  1:34   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 02/27] xfs: remove the unused ilock_nowait codepath in writepage Christoph Hellwig
2011-06-30  0:15   ` Dave Chinner
2011-06-30  1:26     ` Dave Chinner
2011-06-30  6:55     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
2011-06-30  2:00   ` Dave Chinner
2011-06-30  2:48     ` Dave Chinner
2011-06-30  6:57     ` Christoph Hellwig
2011-07-01  2:22   ` Dave Chinner
2011-07-01  4:18     ` Dave Chinner
2011-07-01  8:59       ` Christoph Hellwig
2011-07-01  9:20         ` Dave Chinner
2011-07-01  9:33       ` Christoph Hellwig
2011-07-01  9:33         ` Christoph Hellwig
2011-07-01 14:59         ` Mel Gorman
2011-07-01 14:59           ` Mel Gorman
2011-07-01 15:15           ` Christoph Hellwig
2011-07-01 15:15             ` Christoph Hellwig
2011-07-02  2:42           ` Dave Chinner
2011-07-02  2:42             ` Dave Chinner
2011-07-05 14:10             ` Mel Gorman
2011-07-05 14:10               ` Mel Gorman
2011-07-05 15:55               ` Dave Chinner
2011-07-05 15:55                 ` Dave Chinner
2011-07-11 10:26             ` Christoph Hellwig
2011-07-11 10:26               ` Christoph Hellwig
2011-07-01 15:41         ` Wu Fengguang [this message]
2011-07-01 15:41           ` Wu Fengguang
2011-07-04  3:25           ` Dave Chinner
2011-07-04  3:25             ` Dave Chinner
2011-07-05 14:34             ` Mel Gorman
2011-07-05 14:34               ` Mel Gorman
2011-07-06  1:23               ` Dave Chinner
2011-07-06  1:23                 ` Dave Chinner
2011-07-11 11:10               ` Christoph Hellwig
2011-07-11 11:10                 ` Christoph Hellwig
2011-07-06  4:53             ` Wu Fengguang
2011-07-06  4:53               ` Wu Fengguang
2011-07-06  6:47               ` Minchan Kim
2011-07-06  6:47                 ` Minchan Kim
2011-07-06  7:17               ` Dave Chinner
2011-07-06  7:17                 ` Dave Chinner
2011-07-06 15:12             ` Johannes Weiner
2011-07-06 15:12               ` Johannes Weiner
2011-07-08  9:54               ` Dave Chinner
2011-07-08  9:54                 ` Dave Chinner
2011-07-11 17:20                 ` Johannes Weiner
2011-07-11 17:20                   ` Johannes Weiner
2011-07-11 17:24                   ` Christoph Hellwig
2011-07-11 17:24                     ` Christoph Hellwig
2011-07-11 19:09                   ` Rik van Riel
2011-07-11 19:09                     ` Rik van Riel
2011-07-01  8:51     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 04/27] xfs: cleanup xfs_add_to_ioend Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-30  2:00   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 05/27] xfs: work around bogus gcc warning in xfs_allocbt_init_cursor Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-29 14:01 ` [PATCH 06/27] xfs: split xfs_setattr Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-30  7:03     ` Christoph Hellwig
2011-06-30 12:28       ` Alex Elder
2011-06-30  2:11   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 08/27] xfs: kill xfs_itruncate_start Christoph Hellwig
2011-06-29 22:13   ` Alex Elder
2011-06-29 14:01 ` [PATCH 09/27] xfs: split xfs_itruncate_finish Christoph Hellwig
2011-06-30  2:44   ` Dave Chinner
2011-06-30  7:18     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 10/27] xfs: improve sync behaviour in the fact of aggressive dirtying Christoph Hellwig
2011-06-30  2:52   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 11/27] xfs: fix filesystsem freeze race in xfs_trans_alloc Christoph Hellwig
2011-06-30  2:59   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 12/27] xfs: remove i_transp Christoph Hellwig
2011-06-30  3:00   ` Dave Chinner
2011-06-29 14:01 ` [PATCH 13/27] xfs: factor out xfs_dir2_leaf_find_entry Christoph Hellwig
2011-06-30  6:11   ` Dave Chinner
2011-06-30  7:34     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 14/27] xfs: cleanup shortform directory inode number handling Christoph Hellwig
2011-06-30  6:35   ` Dave Chinner
2011-06-30  7:39     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 15/27] xfs: kill struct xfs_dir2_sf Christoph Hellwig
2011-06-30  7:04   ` Dave Chinner
2011-06-30  7:09     ` Christoph Hellwig
2011-06-29 14:01 ` [PATCH 16/27] xfs: cleanup the defintion of struct xfs_dir2_sf_entry Christoph Hellwig
2011-06-29 14:01 ` [PATCH 17/27] xfs: avoid usage of struct xfs_dir2_block Christoph Hellwig
2011-06-29 14:01 ` [PATCH 18/27] xfs: kill " Christoph Hellwig
2011-06-29 14:01 ` [PATCH 19/27] xfs: avoid usage of struct xfs_dir2_data Christoph Hellwig
2011-06-29 14:01 ` [PATCH 20/27] xfs: kill " Christoph Hellwig
2011-06-29 14:01 ` [PATCH 21/27] xfs: cleanup the defintion of struct xfs_dir2_data_entry Christoph Hellwig
2011-06-29 14:01 ` [PATCH 22/27] xfs: cleanup struct xfs_dir2_leaf Christoph Hellwig
2011-06-29 14:01 ` [PATCH 23/27] xfs: remove the unused xfs_bufhash structure Christoph Hellwig
2011-06-29 14:01 ` [PATCH 24/27] xfs: clean up buffer locking helpers Christoph Hellwig
2011-06-29 14:01 ` [PATCH 25/27] xfs: return the buffer locked from xfs_buf_get_uncached Christoph Hellwig
2011-06-29 14:01 ` [PATCH 26/27] xfs: cleanup I/O-related buffer flags Christoph Hellwig
2011-06-29 14:01 ` [PATCH 27/27] xfs: avoid a few disk cache flushes Christoph Hellwig
2011-06-30  6:36 ` [PATCH 00/27] patch queue for Linux 3.1 Dave Chinner
2011-06-30  6:50   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110701154136.GA17881@localhost \
    --to=fengguang.wu@intel.com \
    --cc=hch@infradead.org \
    --cc=jweiner@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.