From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 4 Jul 2011 13:25:34 +1000
From: Dave Chinner
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Message-ID: <20110704032534.GD1026@dastard>
References: <20110629140109.003209430@bombadil.infradead.org>
 <20110629140336.950805096@bombadil.infradead.org>
 <20110701022248.GM561@dastard>
 <20110701041851.GN561@dastard>
 <20110701093305.GA28531@infradead.org>
 <20110701154136.GA17881@localhost>
In-Reply-To: <20110701154136.GA17881@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
To: Wu Fengguang
Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, "xfs@oss.sgi.com",
 "linux-mm@kvack.org"

On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> Christoph,
>
> On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
> >
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> >
> > As part of investigating the behaviour he found out that we're still
> > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > pretty bad behaviour in general, but it also means we really can't
> > just remove the writeback clustering in writepage given how much
> > I/O is still done through that.
> >
> > Any chance we could get the writeback vs kswapd behaviour sorted out
> > a bit better finally?
>
> I once tried this approach:
>
> http://www.spinics.net/lists/linux-mm/msg09202.html
>
> It used a list structure that is not linearly scalable, however that
> part should be independently improvable when necessary.
I don't think that handing random writeback to the flusher thread is much better than doing random writeback directly.  Yes, you added some clustering, but I still don't think writing specific pages is the best solution.

> The real problem was, it seemed not very effective in my test runs.
> I found many ->nr_pages works queued before the ->inode works, which
> effectively makes the flusher work on more dispersed pages rather
> than focusing on the dirty pages encountered in LRU reclaim.

But that's really just an implementation issue related to how you tried to solve the problem.  That could be addressed.

However, what I'm questioning is whether we should even care what page memory reclaim wants to write - it seems to make fundamentally bad decisions from an IO perspective.

We have to remember that memory reclaim is doing LRU reclaim and the flusher threads are doing "oldest first" writeback.  IOWs, both are trying to operate in the same direction (oldest to youngest) for the same purpose.  The fundamental problem that occurs when memory reclaim starts writing pages back from the LRU is this:

	- memory reclaim has run ahead of IO writeback -

The LRU usually looks like this:

        oldest                                  youngest
        +---------------+---------------+--------------+
        clean           writeback       dirty
                        ^               ^
                        |               |
                        |               Where flusher will next work from
                        |               Where kswapd is working from
                        |
                        IO submitted by flusher, waiting on completion

If memory reclaim is hitting dirty pages on the LRU, it means it has got ahead of writeback without being throttled - it's passed over all the pages currently under writeback and is trying to write back pages that are *newer* than what writeback is working on.  IOWs, it starts trying to do the job of the flusher threads, and it does that very badly.

The $100 question is *why is it getting ahead of writeback*?

From a brief look at the vmscan code, it appears that scanning does not throttle/block until reclaim priority has got pretty high.  That means at low priority reclaim, it *skips pages under writeback*.  However, if it comes across a dirty page, it will trigger writeback of the page.

Now call me crazy, but if we've already got a large number of pages under writeback, why would we want to *start more IO* when clearly the system is taking care of cleaning pages already and all we have to do is wait for a short while to get clean pages ready for reclaim?

Indeed, I added this quick hack to prevent the VM from doing writeback via pageout until after it starts blocking on writeback pages:

@@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+				goto keep_locked;
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)

IOWs, we don't write pages from kswapd unless there is no IO writeback going on at all (we have waited on all the writeback pages, or none exist) and there are dirty pages on the LRU.

This doesn't completely stop the IO collapse (it looks like foreground throttling is the other cause, which IO-less write throttling fixes), but the collapse was significantly reduced in duration and intensity by removing kswapd writeback.  In fact, the IO rate only dropped to ~60MB/s instead of 30MB/s, and the improvement is easily measured by the runtime of the test:

                        run 1   run 2   run 3
3.0-rc5-vanilla         135s    137s    138s
3.0-rc5-patched         117s    115s    115s

That's a pretty massive improvement for a 2-line patch. ;)  I expect the IO-less write throttling patchset will further improve this.
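For reference, this is roughly where that two-line hunk sits and the asymmetry it is papering over.  The snippet below paraphrases the 3.0-era shrink_page_list() flow for illustration only - it is not a verbatim copy, and the surrounding labels and structure are approximate:

	if (PageWriteback(page)) {
		/*
		 * Low priority reclaim skips pages already under IO;
		 * only RECLAIM_MODE_SYNC reclaim blocks on them.
		 */
		if (may_enter_fs && (sc->reclaim_mode & RECLAIM_MODE_SYNC))
			wait_on_page_writeback(page);
		else
			goto keep_locked;
	}

	if (PageDirty(page)) {
		nr_dirty++;
		/* the hack above adds its RECLAIM_MODE_SYNC check here */
		if (references == PAGEREF_RECLAIM_CLEAN)
			goto keep_locked;
		if (!may_enter_fs)
			goto keep_locked;
		if (!sc->may_writepage)
			goto keep_locked;
		/* otherwise reclaim issues its own IO via pageout() */
	}

i.e. a page under writeback is left alone at low priority, but a dirty page gets pushed into ->writepage straight from the LRU walk, which is exactly the "start more IO while IO is already in flight" behaviour described above.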
FWIW, the nr_vmscan_write values changed like this:

                        run 1   run 2   run 3
3.0-rc5-vanilla         6751    6893    6465
3.0-rc5-patched         0       0       0

These results support my argument that memory reclaim should not be doing dirty page writeback at all - deferring writeback to the writeback infrastructure and just waiting for it to complete appropriately is the Right Thing To Do.  i.e. IO-less memory reclaim works better than the current code for the same reason IO-less write throttling works better than the current code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com