From: Chris Mason <clm@fb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Josef Bacik <jbacik@fb.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Neil Brown <neilb@suse.de>, Christoph Hellwig <hch@lst.de>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
Date: Wed, 16 Sep 2015 23:48:59 -0400
Message-ID: <20150917034859.GC8624@ret.masoncoding.com>
In-Reply-To: <20150917003738.GN3902@dastard>

On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
> 
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > > 
> > > > Thanks for the data Chris. So I guess we are fine with what's currently in,
> > > > right?
> > > 
> > > Looks like it works well to me.
> > 
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
> 
> I ran some tests in the background while reading other email.....
> 
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
> 
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
> 
> http://permalink.gmane.org/gmane.linux.kernel/1768786
> 
> The test:
> 
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> #  ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
> #       Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> #       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> #       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> #       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> #       Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> #       App overhead is time in microseconds spent in the test not doing file writing related system calls.
> 
> FSUse%        Count         Size    Files/sec     App Overhead
>      0        80000         4096     106938.0           543310
>      0       160000         4096     102922.7           476362
>      0       240000         4096     107182.9           538206
>      0       320000         4096     107871.7           619821
>      0       400000         4096      99255.6           622021
>      0       480000         4096     103217.8           609943
>      0       560000         4096      96544.2           640988
>      0       640000         4096     100347.3           676237
>      0       720000         4096      87534.8           483495
>      0       800000         4096      72577.5          2556920
>      0       880000         4096      97569.0           646996
> 
> <RAM fills here, sustained performance is now dependent on writeback>

I think too many variables have changed here.

My numbers:

FSUse%        Count         Size    Files/sec     App Overhead
     0       160000         4096     356407.1          1458461
     0       320000         4096     368755.1          1030047
     0       480000         4096     358736.8           992123
     0       640000         4096     361912.5          1009566
     0       800000         4096     342851.4          1004152
     0       960000         4096     358357.2           996014
     0      1120000         4096     338025.8          1004412
     0      1280000         4096     354440.3           997380
     0      1440000         4096     335225.9          1000222
     0      1600000         4096     278786.1          1164962
     0      1760000         4096     268161.4          1205255
     0      1920000         4096     259158.0          1298054
     0      2080000         4096     276939.1          1219411
     0      2240000         4096     252385.1          1245496
     0      2400000         4096     280674.1          1189161
     0      2560000         4096     290155.4          1141941
     0      2720000         4096     280842.2          1179964
     0      2880000         4096     272446.4          1155527
     0      3040000         4096     268827.4          1235095
     0      3200000         4096     251767.1          1250006
     0      3360000         4096     248339.8          1235471
     0      3520000         4096     267129.9          1200834
     0      3680000         4096     257320.7          1244854
     0      3840000         4096     233540.8          1267764
     0      4000000         4096     269237.0          1216324
     0      4160000         4096     249787.6          1291767
     0      4320000         4096     256185.7          1253776
     0      4480000         4096     257849.7          1212953
     0      4640000         4096     253933.9          1181216
     0      4800000         4096     263567.2          1233937
     0      4960000         4096     255666.4          1231802
     0      5120000         4096     257083.2          1282893
     0      5280000         4096     254285.0          1229031
     0      5440000         4096     265561.6          1219472
     0      5600000         4096     266374.1          1229886
     0      5760000         4096     241003.7          1257064
     0      5920000         4096     245047.4          1298330
     0      6080000         4096     254771.7          1257241
     0      6240000         4096     254355.2          1261006
     0      6400000         4096     254800.4          1201074
     0      6560000         4096     262794.5          1234816
     0      6720000         4096     248103.0          1287921
     0      6880000         4096     231397.3          1291224
     0      7040000         4096     227898.0          1285359
     0      7200000         4096     227279.6          1296340
     0      7360000         4096     232561.5          1748248
     0      7520000         4096     231055.3          1169373
     0      7680000         4096     245738.5          1121856
     0      7840000         4096     234961.7          1147035
     0      8000000         4096     243973.0          1152202
     0      8160000         4096     246292.6          1169527
     0      8320000         4096     249433.2          1197921
     0      8480000         4096     222576.0          1253650
     0      8640000         4096     239407.5          1263257
     0      8800000         4096     246037.1          1218109
     0      8960000         4096     242306.5          1293567
     0      9120000         4096     238525.9          3745133
     0      9280000         4096     269869.5          1159541
     0      9440000         4096     266447.1          4794719
     0      9600000         4096     265748.9          1161584
     0      9760000         4096     269067.8          1149918
     0      9920000         4096     248896.2          1164112
     0     10080000         4096     261342.9          1174536
     0     10240000         4096     254778.3          1225425
     0     10400000         4096     257702.2          1211634
     0     10560000         4096     233972.5          1203665
     0     10720000         4096     232647.1          1197486
     0     10880000         4096     242320.6          1203984

I can push the dirty thresholds lower to try to make sure we end up
against the hard dirty limit, but none of this is going to be related
to the plugging patch.  I do see lower numbers if I let the test run
even longer, but there are a lot of other things that can slow it down
as the filesystem gets that big.

I'll try again with lower ratios.
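
Concretely, something like this is what I mean by pushing the
thresholds down (the exact values here are only an example, not what
the test rig normally runs with):

  # example values only: ~1GB hard limit, ~256MB background limit
  sysctl -w vm.dirty_bytes=1073741824
  sysctl -w vm.dirty_background_bytes=268435456

That should put the run up against the hard dirty limit long before
RAM fills, so sustained performance depends on writeback the whole
way through.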

[ ... ]

> The baseline of no plugging is a full 3 minutes faster than the
> plugging behaviour of Linus' patch. The IO behaviour demonstrates
> that, sustaining between 25-30,000 IOPS and throughput of
> 130-150MB/s.  Hence, while Linus' patch does change the IO patterns,
> it does not result in a performance improvement like the original
> plugging patch did.
> 

How consistent is this across runs?

> So I went back and had a look at my original patch, which I've been
> using locally for a couple of years and was similar to the original
> commit. It has this description from when I last updated the perf
> numbers from testing done on 3.17:
> 
> | Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
> | metadata CRCs enabled.
> | 
> | Test:
> | 
> | $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
> | /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
> | /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
> | /mnt/scratch/6  -d  /mnt/scratch/7
> | 
> | Result:
> |                 wall    sys     create rate     Physical write IO
> |                 time    CPU     (avg files/s)    IOPS   Bandwidth
> |                 -----   -----   -------------   ------  ---------
> | unpatched       5m54s   15m32s  32,500+/-2200   28,000  150MB/s
> | patched         3m19s   13m28s  52,900+/-1800    1,500  280MB/s
> | improvement     -43.8%  -13.3%    +62.7%        -94.6%  +86.6%
> 
> IOWs, what we are seeing here is that the baseline writeback
> performance has regressed quite significantly since I took these
> numbers back on 3.17.  I'm running on exactly the same test setup;
> the only difference is the kernel and so the current kernel baseline
> is ~20% slower than the baseline numbers I have in my patch.

All of this is in a VM, and I'd much rather see it reproduced on bare
metal.  I've had really consistent results with VMs in the past, but
there is a huge amount of code between 3.17 and now.
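
If the baseline really has gone backwards since 3.17, a bisect (or at
least a path-limited git log) is probably the fastest way to narrow it
down.  Roughly, and assuming v3.17 is still a good reference point:

  # treat v3.17 as known good and the current baseline as bad
  git bisect start
  git bisect bad HEAD
  git bisect good v3.17
  # after each build+boot, rerun the fs_mark test and mark the result
  git bisect good    # or: git bisect bad

  # or just eyeball the writeback changes since then:
  git log --oneline v3.17..HEAD -- fs/fs-writeback.c mm/page-writeback.c

That is a lot of kernels to build, which is another reason I'd rather
see the numbers confirmed on bare metal first.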

-chris
