From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>, Jan Kara <jack@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Josef Bacik <jbacik@fb.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Neil Brown <neilb@suse.de>, Christoph Hellwig <hch@lst.de>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
Date: Thu, 17 Sep 2015 14:30:08 +1000	[thread overview]
Message-ID: <20150917043008.GP3902@dastard> (raw)
In-Reply-To: <20150917034859.GC8624@ret.masoncoding.com>

On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > [cc Tejun]
> > 
> > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > #  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
> > #       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > #       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > #       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > 
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0        80000         4096     106938.0           543310
> >      0       160000         4096     102922.7           476362
> >      0       240000         4096     107182.9           538206
> >      0       320000         4096     107871.7           619821
> >      0       400000         4096      99255.6           622021
> >      0       480000         4096     103217.8           609943
> >      0       560000         4096      96544.2           640988
> >      0       640000         4096     100347.3           676237
> >      0       720000         4096      87534.8           483495
> >      0       800000         4096      72577.5          2556920
> >      0       880000         4096      97569.0           646996
> > 
> > <RAM fills here, sustained performance is now dependent on writeback>
> 
> I think too many variables have changed here.
> 
> My numbers:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0       160000         4096     356407.1          1458461
>      0       320000         4096     368755.1          1030047
>      0       480000         4096     358736.8           992123
>      0       640000         4096     361912.5          1009566
>      0       800000         4096     342851.4          1004152

<snip>
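[The fs_mark tables above are easiest to compare run-to-run with a small script. A hypothetical sketch that pulls (Count, Files/sec) pairs out of the output:]

```python
def parse_fs_mark(text):
    """Parse fs_mark result lines into (count, files_per_sec) tuples.

    Data rows have five columns: FSUse% Count Size Files/sec AppOverhead.
    Header and comment lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0].isdigit():
            rows.append((int(parts[1]), float(parts[3])))
    return rows

sample = """FSUse%        Count         Size    Files/sec     App Overhead
     0       160000         4096     356407.1          1458461
     0       320000         4096     368755.1          1030047"""
rows = parse_fs_mark(sample)
```

[Plotting Files/sec against Count makes the fall-off at the dirty limits obvious at a glance.]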

> I can push the dirty threshold lower to try and make sure we end up in
> the hard dirty limits but none of this is going to be related to the
> plugging patch.

The point of this test is to drive writeback as hard as possible,
not to measure how fast we can create files in memory.  i.e. if the
test isn't pushing the dirty limits on your machines, then it really
isn't putting a meaningful load on writeback, and so the plugging
won't make significant difference because writeback isn't IO
bound....
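[For reference, one way to force the test into the hard dirty limits early is to pin the thresholds to absolute byte values; a sketch with illustrative numbers, not the settings used in these runs:]

```
# /etc/sysctl.d/99-writeback-test.conf (illustrative values only)
# Setting the *_bytes knobs overrides the corresponding *_ratio knobs.
vm.dirty_background_bytes = 104857600    # start background writeback at 100MB
vm.dirty_bytes = 209715200               # throttle dirtiers hard at 200MB
```

[With limits that low, file creation becomes writeback-bound almost immediately instead of only after RAM fills.]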

> I do see lower numbers if I let the test run even
> longer, but there are a lot of things in the way that can slow it down
> as the filesystem gets that big.

Sure, that's why I hit the dirty limits early in the test - so it
measures steady state performance before the fs gets to any
significant scalability limits....

> > The baseline of no plugging is a full 3 minutes faster than the
> > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > that, sustaining between 25-30,000 IOPS and throughput of
> > 130-150MB/s.  Hence, while Linus' patch does change the IO patterns,
> > it does not result in a performance improvement like the original
> > plugging patch did.
> 
> How consistent is this across runs?

That's what I'm trying to work out. I didn't report it until I got
consistently bad results - the numbers I reported were from the
third time I ran the comparison, and they were representative and
reproducible. I also ran my inode creation workload, which is similar
(but has no data writeback, so it doesn't go through the writeback paths at
all) and that shows no change in performance, so this problem
(whatever it is) is only manifesting itself through data
writeback....

The only measurable change I've noticed in my monitoring graphs is
that there is a lot more iowait time than I normally see, even when
the plugging appears to be working as desired. That's what I'm
trying to track down now, and once I've got to the bottom of that I
should have some idea of where the performance has gone....
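[FWIW, iowait can be sampled independently of the graphing setup with a quick script; a hypothetical sketch that reads the aggregate cpu line from /proc/stat:]

```python
import time

def iowait_fraction(interval=0.5):
    """Return the fraction of CPU time spent in iowait over `interval` seconds.

    /proc/stat's first line is: cpu user nice system idle iowait irq softirq ...
    (values in jiffies, summed across all CPUs).
    """
    def sample():
        with open("/proc/stat") as f:
            vals = list(map(int, f.readline().split()[1:]))
        return vals[4], sum(vals)   # iowait jiffies, total jiffies
    io1, tot1 = sample()
    time.sleep(interval)
    io2, tot2 = sample()
    dt = tot2 - tot1
    return (io2 - io1) / dt if dt else 0.0
```

[Sampling this in a loop while fs_mark runs shows whether the plugged and unplugged kernels differ in time spent waiting on IO.]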

As it is, there are a bunch of other things going wrong with
4.3-rc1+ right now that I'm working through - I haven't updated my
kernel tree for 10 days because I've been away on holidays so I'm
doing my usual "-rc1 is broken again" dance that I do every release
cycle.  (e.g. every second boot hangs because systemd appears to be
waiting for iscsi devices to appear without first starting the iscsi
target daemon. That never happened before today; every new kernel I've
booted today has hung on the first cold boot of the VM.)

> > IOWs, what we are seeing here is that the baseline writeback
> > performance has regressed quite significantly since I took these
> > numbers back on 3.17.  I'm running on exactly the same test setup;
> > the only difference is the kernel and so the current kernel baseline
> > is ~20% slower than the baseline numbers I have in my patch.
> 
> All of this in a VM, I'd much rather see this reproduced on bare metal.
> I've had really consistent results with VMs in the past, but there is a
> huge amount of code between 3.17 and now.

I'm pretty sure it's not the VM - with the locking fix Linus
mentioned in place, everything else I've looked at is within
measurement error compared to 4.2. The only
thing that is out of whack from a performance POV is data
writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
