From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail6.bemta12.messagelabs.com (mail6.bemta12.messagelabs.com [216.82.250.247])
	by kanga.kvack.org (Postfix) with ESMTP id 138E09000C2
	for ; Wed, 6 Jul 2011 11:13:12 -0400 (EDT)
Date: Wed, 6 Jul 2011 17:12:29 +0200
From: Johannes Weiner
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Message-ID: <20110706151229.GA1998@redhat.com>
References: <20110629140109.003209430@bombadil.infradead.org>
 <20110629140336.950805096@bombadil.infradead.org>
 <20110701022248.GM561@dastard>
 <20110701041851.GN561@dastard>
 <20110701093305.GA28531@infradead.org>
 <20110701154136.GA17881@localhost>
 <20110704032534.GD1026@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20110704032534.GD1026@dastard>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, "xfs@oss.sgi.com", "linux-mm@kvack.org"

On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> >
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > >
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > >
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > >
> > > Any chance we could get the writeback vs kswapd behaviour sorted out
> > > a bit better finally?
> >
> > I once tried this approach:
> >
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> >
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
>
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.
>
> > The real problem was, it seemed to not be very effective in my test
> > runs.  I found many ->nr_pages work items queued before the ->inode
> > work items, which effectively makes the flusher work on more dispersed
> > pages rather than focusing on the dirty pages encountered in LRU
> > reclaim.
>
> But that's really just an implementation issue related to how you
> tried to solve the problem.  That could be addressed.
>
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO perspective.
>
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback.  IOWs, both are
> trying to operate in the same direction (oldest to youngest) for the
> same purpose.
> The fundamental problem that occurs when memory reclaim starts
> writing pages back from the LRU is this:
>
>     - memory reclaim has run ahead of IO writeback -
>
> The LRU usually looks like this:
>
>     oldest                                 youngest
>     +---------------+---------------+--------------+
>     clean           writeback       dirty
>                     ^               ^
>                     |               |
>                     |               Where flusher will next work from
>                     |               Where kswapd is working from
>                     |
>                     IO submitted by flusher, waiting on completion
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on.  IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
>
> The $100 question is *why is it getting ahead of writeback*?

Unless you have a purely sequential writer, the LRU order is - at
least in theory - diverging away from the writeback order.

According to the reasoning behind generational garbage collection,
they should in fact be inverse to each other.  The oldest pages still
in use are the most likely to be still needed in the future.

In practice we only make a generational distinction between used-once
and used-many, which manifests in the inactive and the active list.
But still, when reclaim starts off with a localized writer, the oldest
pages are likely to be at the end of the active list.

So pages from the inactive list are likely to be written in the right
order, but at the same time active pages are even older, thus written
before them.  Memory reclaim starts with the inactive pages, and this
is why it gets ahead.

Then there is also the case where a fast writer pushes dirty pages to
the end of the LRU list, of course, but you already said that this is
not applicable to your workload.

My point is that I don't think it's unexpected that dirty pages come
off the inactive list in practice.  It just sucks how we handle them.
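For readers unfamiliar with the interface named in the patch subject:
write_cache_pages() walks a mapping's dirty pages in ascending
file-offset order, honouring wbc->nr_to_write and the requested range,
and calls a per-page callback for each one, which is what lets a
filesystem drop its own clustering loop from ->writepage.  The sketch
below only illustrates that general wiring - it mirrors what the
kernel's generic_writepages() does and is not the actual XFS patch;
the example_* names are hypothetical.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/*
 * Hypothetical per-page callback.  write_cache_pages() hands us each
 * dirty page, locked, in ascending file-offset order; here we simply
 * forward to the mapping's own ->writepage, which is essentially what
 * generic_writepages() does via its internal helper.
 */
static int example_writepage_cb(struct page *page,
				struct writeback_control *wbc, void *data)
{
	struct address_space *mapping = data;
	int ret = mapping->a_ops->writepage(page, wbc);

	mapping_set_error(mapping, ret);
	return ret;
}

/*
 * Hypothetical ->writepages implementation: delegate the dirty-page
 * walk and clustering to write_cache_pages() instead of open-coding
 * it around ->writepage.
 */
static int example_writepages(struct address_space *mapping,
			      struct writeback_control *wbc)
{
	return write_cache_pages(mapping, wbc, example_writepage_cb, mapping);
}

A real filesystem callback would typically accumulate contiguous dirty
pages into larger I/Os rather than calling ->writepage per page; the
offset-ordered walk that write_cache_pages() provides is what makes
that clustering possible without the filesystem's own lookahead.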