From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail6.bemta12.messagelabs.com (mail6.bemta12.messagelabs.com [216.82.250.247])
	by kanga.kvack.org (Postfix) with ESMTP id 138E09000C2
	for ; Wed, 6 Jul 2011 11:13:12 -0400 (EDT)
Date: Wed, 6 Jul 2011 17:12:29 +0200
From: Johannes Weiner
Subject: Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
Message-ID: <20110706151229.GA1998@redhat.com>
References: <20110629140109.003209430@bombadil.infradead.org>
 <20110629140336.950805096@bombadil.infradead.org>
 <20110701022248.GM561@dastard>
 <20110701041851.GN561@dastard>
 <20110701093305.GA28531@infradead.org>
 <20110701154136.GA17881@localhost>
 <20110704032534.GD1026@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20110704032534.GD1026@dastard>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, "xfs@oss.sgi.com", "linux-mm@kvack.org"

On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> >
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > >
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > >
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd.  Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > >
> > > Any chance we could get the writeback vs kswapd behaviour sorted out
> > > a bit better finally?
> >
> > I once tried this approach:
> >
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> >
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
>
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly.  Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.
>
> > The real problem was, it seemed to not be very effective in my test
> > runs.  I found many ->nr_pages work items queued before the ->inode
> > work items, which effectively makes the flusher work on more dispersed
> > pages rather than focusing on the dirty pages encountered in LRU
> > reclaim.
>
> But that's really just an implementation issue related to how you
> tried to solve the problem.  That could be addressed.
>
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO perspective.
>
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback.  IOWs, both are
> trying to operate in the same direction (oldest to youngest) for the
> same purpose.
> The fundamental problem that occurs when memory reclaim starts
> writing pages back from the LRU is this:
>
>     - memory reclaim has run ahead of IO writeback -
>
> The LRU usually looks like this:
>
>     oldest                                 youngest
>     +---------------+---------------+--------------+
>     clean           writeback       dirty
>                     ^               ^
>                     |               |
>                     |               Where flusher will next work from
>                     |               Where kswapd is working from
>                     |
>                     IO submitted by flusher, waiting on completion
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on.  IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
>
> The $100 question is *why is it getting ahead of writeback*?

Unless you have a purely sequential writer, the LRU order is - at
least in theory - diverging away from the writeback order.

According to the reasoning behind generational garbage collection,
they should in fact be inverse to each other.  The oldest pages still
in use are the most likely to be still needed in the future.

In practice we only make a generational distinction between used-once
and used-many, which manifests in the inactive and the active list.
But still, when reclaim starts off with a localized writer, the oldest
pages are likely to be at the end of the active list.

So pages from the inactive list are likely to be written in the right
order, but at the same time active pages are even older, thus written
before them.  Memory reclaim starts with the inactive pages, and this
is why it gets ahead.

Then there is also the case where a fast writer pushes dirty pages to
the end of the LRU list, of course, but you already said that this is
not applicable to your workload.

My point is that I don't think it's unexpected that dirty pages come
off the inactive list in practice.  It just sucks how we handle them.
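For readers unfamiliar with the interface named in the patch subject:
write_cache_pages() walks a mapping's dirty pages in ascending
file-offset order, honouring wbc->nr_to_write and the requested range,
and calls a per-page callback for each one, which is what lets a
filesystem drop its own clustering loop from ->writepage.  The sketch
below only illustrates that general wiring - it mirrors what the
kernel's generic_writepages() does and is not the actual XFS patch;
the example_* names are hypothetical.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/*
 * Hypothetical per-page callback.  write_cache_pages() hands us each
 * dirty page, locked, in ascending file-offset order; here we simply
 * forward to the mapping's own ->writepage, which is essentially what
 * generic_writepages() does via its internal helper.
 */
static int example_writepage_cb(struct page *page,
				struct writeback_control *wbc, void *data)
{
	struct address_space *mapping = data;
	int ret = mapping->a_ops->writepage(page, wbc);

	mapping_set_error(mapping, ret);
	return ret;
}

/*
 * Hypothetical ->writepages implementation: delegate the dirty-page
 * walk and clustering to write_cache_pages() instead of open-coding
 * it around ->writepage.
 */
static int example_writepages(struct address_space *mapping,
			      struct writeback_control *wbc)
{
	return write_cache_pages(mapping, wbc, example_writepage_cb, mapping);
}

A real filesystem callback would typically accumulate contiguous dirty
pages into larger I/Os rather than calling ->writepage per page; the
offset-ordered walk that write_cache_pages() provides is what makes
that clustering possible without the filesystem's own lookahead.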