Date: Thu, 30 Jun 2011 12:48:45 +1000
From: Dave Chinner <david@fromorbit.com>
To: Christoph Hellwig
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Message-ID: <20110630024845.GC561@dastard>
References: <20110629140109.003209430@bombadil.infradead.org>
 <20110629140336.950805096@bombadil.infradead.org>
 <20110630020013.GX561@dastard>
In-Reply-To: <20110630020013.GX561@dastard>

On Thu, Jun 30, 2011 at 12:00:13PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2011 at 10:01:12AM -0400, Christoph Hellwig wrote:
> > Instead of implementing our own writeback clustering, use write_cache_pages
> > to do it for us. This means the guts of the current writepage implementation
> > become a new helper used both for implementing ->writepage and as a callback
> > to write_cache_pages for ->writepages. A new struct xfs_writeback_ctx
> > is used to track block mapping state and the ioend chain over multiple
> > invocations of it.
> >
> > The advantage over the old code is that we avoid a double pagevec lookup,
> > and more efficient handling of extent boundaries inside a page for
> > small blocksize filesystems, as well as having less XFS-specific code.
>
> Yes, it should be, but I can't actually measure any noticeable CPU
> usage difference @800MB/s writeback. The profiles change shape
> around the changed code, but overall CPU usage does not change. I
> think this is because the second pagevec lookup is pretty much
> free: the radix tree is already hot in cache when we do the second
> lookup...
>
> > The downside is that we don't do writeback clustering when called from
> > kswapd anymore, but that is a case that should be avoided anyway. Note
> > that we still convert the whole delalloc range from ->writepage, so
> > the on-disk allocation pattern is not affected.
>
> All the more reason to ensure the mm subsystem doesn't do this....
>
> .....
> >  error:
> > -	if (iohead)
> > -		xfs_cancel_ioend(iohead);
> > -
> > -	if (err == -EAGAIN)
> > -		goto redirty;
> > -
>
> Should this EAGAIN handling be dealt with in the
> removing-the-non-blocking-mode patch?
>
> > +STATIC int
> >  xfs_vm_writepages(
> >  	struct address_space	*mapping,
> >  	struct writeback_control *wbc)
> >  {
> > +	struct xfs_writeback_ctx ctx = { };
> > +	int			ret;
> > +
> >  	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > -	return generic_writepages(mapping, wbc);
> > +
> > +	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);
> > +
> > +	if (ctx.iohead) {
> > +		if (ret)
> > +			xfs_cancel_ioend(ctx.iohead);
> > +		else
> > +			xfs_submit_ioend(wbc, ctx.iohead);
> > +	}
>
> I think this error handling does not work.
> If we have put pages into the ioend (i.e. successful ->writepage
> calls) and then have a ->writepage call fail, all the pages under
> writeback (i.e. those on the ioend) will remain in that state, and
> never get written back (so move into the clean state) or redirtied
> (so written again later).
>
> xfs_cancel_ioend() was only ever called for the first page sent down
> to ->writepage, and on error that page was redirtied separately.
> Hence it doesn't handle this case at all, as it never occurs in the
> existing code.
>
> I'd suggest that regardless of whether an error is returned here,
> the existence of ctx.iohead indicates a valid ioend that needs to be
> submitted....
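Concretely, that would amount to something like the following rough,
untested sketch, reusing the names from the quoted hunk (how the
->writepage callback redirties failed pages is left out here and is
an assumption):

STATIC int
xfs_vm_writepages(
	struct address_space	*mapping,
	struct writeback_control *wbc)
{
	struct xfs_writeback_ctx ctx = { };
	int			ret;

	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);

	ret = write_cache_pages(mapping, wbc, __xfs_vm_writepage, &ctx);

	/*
	 * Any pages attached to the ioend are already marked as under
	 * writeback. Even if a later ->writepage call failed, the
	 * ioend we built is valid and must be submitted; otherwise
	 * those pages never complete writeback and anything waiting
	 * on them blocks forever.
	 */
	if (ctx.iohead)
		xfs_submit_ioend(wbc, ctx.iohead);

	return ret;
}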
I think I just tripped this. I'm running a 1k block size filesystem,
and test 224 has hung waiting on IO completion after ->writepage
errors:

[ 2850.300979] XFS (vdb): Mounting Filesystem
[ 2850.310069] XFS (vdb): Ending clean mount
[ 2867.246341] Filesystem "vdb": reserve blocks depleted! Consider increasing reserve pool size.
[ 2867.247652] XFS (vdb): page discard on page ffffea0000257b40, inode 0x1c6, offset 1187840.
[ 2867.254135] XFS (vdb): page discard on page ffffea0000025f40, inode 0x423, offset 1839104.
[ 2867.256289] XFS (vdb): page discard on page ffffea0000a21aa0, inode 0x34e, offset 28672.
[ 2867.258845] XFS (vdb): page discard on page ffffea00001830d0, inode 0xe5, offset 3637248.
[ 2867.260637] XFS (vdb): page discard on page ffffea0000776af8, inode 0x132, offset 6283264.
[ 2867.269380] XFS (vdb): page discard on page ffffea00009d5d38, inode 0xf1, offset 5632000.
[ 2867.277851] XFS (vdb): page discard on page ffffea0000017e60, inode 0x27a, offset 32768.
[ 2867.281165] XFS (vdb): page discard on page ffffea0000258278, inode 0x274, offset 32768.
[ 2867.282802] XFS (vdb): page discard on page ffffea00009a3c60, inode 0x48a, offset 32768.
[ 2867.284166] XFS (vdb): page discard on page ffffea0000cc7808, inode 0x42e, offset 32768.
[ 2867.287138] XFS (vdb): page discard on page ffffea00004d4440, inode 0x4e0, offset 32768.
[ 2867.288500] XFS (vdb): page discard on page ffffea0000b34978, inode 0x4cd, offset 32768.
[ 2867.289381] XFS (vdb): page discard on page ffffea00003f40f8, inode 0x4c4, offset 155648.
[ 2867.291536] XFS (vdb): page discard on page ffffea0000023578, inode 0x4c7, offset 32768.
[ 2867.300880] XFS (vdb): page discard on page ffffea00005276e8, inode 0x4cc, offset 32768.
[ 2867.318819] XFS (vdb): page discard on page ffffea0000777230, inode 0x449, offset 8581120.
[ 4701.141666] SysRq : Show Blocked State
[ 4701.142093]   task                        PC stack   pid father
[ 4701.142707] dd              D ffff8800076edbe8     0 14211   8946 0x00000000
[ 4701.143509]  ffff88002b03fa58 0000000000000086 ffffea00002db598 ffffea0000000000
[ 4701.144009]  ffff88002b03f9d8 ffffffff81113a35 ffff8800076ed860 0000000000010f80
[ 4701.144009]  ffff88002b03ffd8 ffff88002b03e010 ffff88002b03ffd8 0000000000010f80
[ 4701.144009] Call Trace:
[ 4701.144009]  [] ? __free_pages+0x35/0x40
[ 4701.144009]  [] ? default_spin_lock_flags+0x9/0x10
[ 4701.144009]  [] ? __lock_page+0x70/0x70
[ 4701.144009]  [] io_schedule+0x60/0x80
[ 4701.144009]  [] sleep_on_page+0xe/0x20
[ 4701.144009]  [] __wait_on_bit+0x5f/0x90
[ 4701.144009]  [] wait_on_page_bit+0x73/0x80
[ 4701.144009]  [] ? autoremove_wake_function+0x40/0x40
[ 4701.144009]  [] ? pagevec_lookup_tag+0x25/0x40
[ 4701.144009]  [] filemap_fdatawait_range+0x112/0x1a0
[ 4701.144009]  [] xfs_wait_on_pages+0x59/0x80
[ 4701.144009]  [] xfs_flush_pages+0x8d/0xb0
[ 4701.144009]  [] xfs_file_buffered_aio_write+0x104/0x190
[ 4701.144009]  [] ? do_page_fault+0x1e8/0x450
[ 4701.144009]  [] xfs_file_aio_write+0x1bf/0x300
[ 4701.144009]  [] ? path_openat+0x104/0x3f0
[ 4701.144009]  [] do_sync_write+0xda/0x120
[ 4701.144009]  [] ? security_file_permission+0x23/0x90
[ 4701.144009]  [] vfs_write+0xc8/0x180
[ 4701.144009]  [] sys_write+0x51/0x90
[ 4701.144009]  [] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs