From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755798AbcECN5P (ORCPT <rfc822;w@1wt.eu>);
	Tue, 3 May 2016 09:57:15 -0400
Received: from mx2.suse.de ([195.135.220.15]:35863 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750709AbcECN5N (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Tue, 3 May 2016 09:57:13 -0400
Date: Tue, 3 May 2016 15:57:10 +0200
From: Jan Kara <jack@suse.cz>
To: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>, Jens Axboe <axboe@kernel.dk>,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
        linux-block@vger.kernel.org, dchinner@redhat.com,
        sedat.dilek@gmail.com
Subject: Re: [PATCHSET v5] Make background writeback great again for the
 first time
Message-ID: <20160503135710.GC25436@quack2.suse.cz>
References: <20160427180105.GA17362@quack2.suse.cz>
 <5721021E.8060006@fb.com>
 <20160427203708.GA25397@kernel.dk>
 <20160427205915.GC25397@kernel.dk>
 <20160428115401.GD17362@quack2.suse.cz>
 <57225A91.50002@kernel.dk>
 <20160503121719.GA25436@quack2.suse.cz>
 <20160503124011.igocaapb2nvnjj3o@floor.masoncoding.com>
 <20160503130609.GB25436@quack2.suse.cz>
 <20160503134240.6p65qehgv5uruxt4@floor.masoncoding.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160503134240.6p65qehgv5uruxt4@floor.masoncoding.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 03-05-16 09:42:40, Chris Mason wrote:
> On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> > On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > > >>+	if (rwb->queue_depth == 1) {
> > > > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > > > >>+		rwb->wb_background = 1;
> > > > > >
> > > > > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > > > > >off the wb_max == 1 value. However even with that fixed no luck :(:
> > > > > 
> > > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > > > > max/normal/bg are 2/2/1 and 1/1/1 if we step down.
> > > > > 
> > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > > > >Runtime: 105.126 107.125 105.641
> > > > > >
> > > > > >So about the same as before. I'll try to debug this later today...
> > > > > 
> > > > > Thanks, I'm very interested in what you find!
> > > > 
> > > > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > > > more exactly ext4 without delayed allocation) for the test. The throttling
> > > > of background writes gave more priority to writes from the journalling
> > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > > journalling thread ended up having to do more data writeback to be able to
> > > > commit a transaction (due to requirements of data=ordered mode) and it is
> > > > less efficient at that than the normal flusher thread.
> > > > 
> > > > So this is an example where throttling background writeback effectively
> > > > just pushes more work into another context which does it less efficiently
> > > > and indirectly makes everyone wait for it. ext3 has always been sensitive to
> > > > issues like this. ext4 is using delayed allocation and thus only data
> > > > writes into holes end up being part of a transaction -> simple dd test case
> > > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > > the numbers with and without your patch are exactly the same.
> > > > 
> > > > The question remains how common it is for throttling of background
> > > > writeback to also delay something else. I'll schedule a couple of
> > > > benchmarks to measure the impact of your patches for a wider range of workloads
> > > > (but sadly on a pretty limited set of hw). If ext3 is the only one seeing
> > > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > > doing something rather stupid (but inherent in its journal design) and we
> > > > have a way to deal with this either by enabling delayed allocation or by
> > > > turning off the writeback throttling...
> > > 
> > > At least in the case of io that we know is going to be data=ordered, we
> > > can bump the prio of those pages?
> > 
> > But how would the flusher thread, which is submitting IO, know that? We would
> > have to somehow mark inodes that are part of the running transaction and
> > the flusher thread could give more priority to such writeback - e.g. by using
> > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> > it could be doable.
> 
> This would be specific to the data=ordered code in the FS.  If there's
> some way to test for an inode or a page's status in the data=ordered
> list, the FS writepages call could flag the IO as higher prio?

Oh, right, we could do that. I can experiment with that later.
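
To make the idea concrete, here is a rough userspace sketch, not kernel
code and not a patch. It only models the decision discussed above: writeback
for an inode whose data is on the data=ordered list gets a higher priority
than ordinary background writeback. Every name in it (struct model_inode,
in_ordered_list, enum wb_prio, pick_wb_prio) is hypothetical and made up
for illustration; how a filesystem would really test for that state is
exactly what still needs experimenting.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for illustration only. None of these names
 * are actual kernel API.
 */
enum wb_prio {
	WB_PRIO_BACKGROUND,	/* may be throttled like any background write */
	WB_PRIO_SYNC		/* should not be throttled */
};

struct model_inode {
	unsigned long ino;
	/* Assumed flag: data must reach disk before the transaction commits. */
	bool in_ordered_list;
};

/*
 * Pick the priority the (modelled) writepages path would submit IO with.
 * Inodes on the ordered list would otherwise be written by the journalling
 * thread at commit time, so they are bumped to the higher priority.
 */
static enum wb_prio pick_wb_prio(const struct model_inode *inode)
{
	return inode->in_ordered_list ? WB_PRIO_SYNC : WB_PRIO_BACKGROUND;
}
```

With this model, an inode with in_ordered_list set maps to WB_PRIO_SYNC and
any other inode stays at WB_PRIO_BACKGROUND, so only the writes that would
otherwise be pushed onto the journalling thread escape the throttling.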

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR