Date: Sat, 5 Sep 2015 08:03:12 +1000
From: Dave Chinner
To: Peter Zijlstra
Cc: Linus Torvalds, Linux Kernel Mailing List, Waiman Long, Ingo Molnar
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
Message-ID: <20150904220312.GC3902@dastard>
References: <20150904054820.GY3902@dastard>
 <20150904073917.GA18489@twins.programming.kicks-ass.net>
 <20150904081234.GA3902@dastard>
 <20150904113233.GT3644@twins.programming.kicks-ass.net>
In-Reply-To: <20150904113233.GT3644@twins.programming.kicks-ass.net>

On Fri, Sep 04, 2015 at 01:32:33PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> > You probably don't even need a VM to reproduce it - that would
> > certainly be an interesting counterpoint if it didn't....
>
> Even though you managed to restore your DEBUG_SPINLOCK performance by
> changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
> actual hardware just to test.
>
> [ Note: In any case, I would recommend you use (or at least try)
> PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
> for performance; the test-and-set fallback really wasn't meant as a
> performance option (although it clearly sucks worse than expected).

I will try it, but that can happen when I've got a bit of spare
time...

> Pre-qspinlock, your setup would have used regular ticket locks on
> vCPUs, which mostly works as long as there is almost no vCPU
> preemption; if you overload your machine such that the vCPU threads
> get preempted, that will implode into silly-land. ]

I don't tend to overload the host CPUs - all my test loads are IO
bound - so this has never really been a problem I've noticed in the
past.

> So on to native performance:
>
> - IVB-EX, 4-socket, 15 cores per socket, hyperthreaded, for a total
>   of 120 CPUs
> - 1.1T of md-stripe (5x200GB) SSDs
> - Linux v4.2 (distro-style .config)
> - Debian "testing" base system
> - xfsprogs v3.2.1
>
> # mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0

If you use xfsprogs v3.2.4 (current Debian unstable), these are the
default options.
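i.e. assuming those defaults, the whole -m argument can be dropped
and a bare

$ mkfs.xfs -f /dev/md0

should give you the same crc=1,finobt=1 geometry - verify against
the geometry mkfs prints, quoted below, if in doubt.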
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1
> data     =                       bsize=4096   blocks=293038720, imaxpct=5
>          =                       sunit=128    swidth=640 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=143088, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
>
> # ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
>         -d /mnt/scratch/0 -d /mnt/scratch/1 \
>         -d /mnt/scratch/2 -d /mnt/scratch/3 \
>         -d /mnt/scratch/4 -d /mnt/scratch/5 \
>         -d /mnt/scratch/6 -d /mnt/scratch/7 \
>         -d /mnt/scratch/8 -d /mnt/scratch/9 \
>         -d /mnt/scratch/10 -d /mnt/scratch/11 \
>         -d /mnt/scratch/12 -d /mnt/scratch/13 \
>         -d /mnt/scratch/14 -d /mnt/scratch/15
>
> Regular v4.2 (qspinlock) does:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      6400000            0     286491.9          3500179
>      0      7200000            0     293229.5          3963140
>      0      8000000            0     271182.4          3708212
>      0      8800000            0     300592.0          3595722
>
> Modified v4.2 (ticket) does:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      6400000            0     310419.6          3343821
>      0      7200000            0     348346.5          4721133
>      0      8000000            0     328098.2          3235753
>      0      8800000            0     316765.3          3238971
>
> Which shows that qspinlock is clearly slower, even for these large-ish
> NUMA boxes where it was supposed to be better.

Be careful just reading the throughput numbers like that. You can
have the files/s number go down but the benchmark wall time get
faster, because the userspace portion runs faster (i.e. CPU cache
residency effects). In this case, however, the userspace time is
down by 5-10% and the files/s is up by 5-10%, so (without knowing
the wall time) I'd say these numbers are significant....

FWIW, you've got a lot more CPUs than I have - you can scale up the
parallelism of the workload by increasing the number of working
directories (i.e. -d options). You'd also need to scale up the
amount of allocation concurrency in XFS - 32 AGs will be the
limiting factor for any more workload concurrency - i.e. use "-d
agcount=" on the mkfs.xfs command line to increase the AG count.
For artificial scalability testing like this, you want the AG count
to be at least 2x the number of directories you are working in
concurrently. A quick sketch of what that looks like for 32
directories is further down.

> Clearly our benchmarks used before this were not sufficient, and more
> work needs to be done.
>
> Also, I note that after running to completion, there is only 14G of
> actual data on the device, so you don't need silly large storage to run
> this -- I expect your previous 275G quote was due to XFS populating the
> sparse file with meta-data or something along those lines.

Yeah, that would have been after lots of other work had been done on
the sparse file I use to back the 500TB filesystem I test on in the
VM. Currently:

$ ls -lh /mnt/fast-ssd
total 61G
-rw------- 1 root root 500T Sep  4 19:36 vm-500t.img
$ df -h /mnt/fast-ssd
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        400G   61G  340G  16% /mnt/fast-ssd
$

I'm using 61GB of space in the file that backs the 500TB device I'm
testing against. Every so often I punch out the file so that it gets
laid out again - I usually do that after running btrfs testing, as
btrfs fragments the crap out of the backing file even with extent
size hints set to minimise the fragmentation...
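("Punching out" the file here means hole-punching its entire
allocated range so it goes back to being fully sparse and is laid
out afresh on the next write. As a minimal sketch - assuming the
host filesystem supports hole punching, and using the vm-500t.img
path from above; the exact command is illustrative rather than
lifted from my scripts:

$ xfs_io -c "fpunch 0 500t" /mnt/fast-ssd/vm-500t.img

That frees all the allocated blocks but leaves the 500T file size
intact, so the guest still sees the same device.)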
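And to make the scaling suggestion above concrete, here's a sketch
of driving the same workload with 32 working directories - the
agcount of 64 just follows the 2x rule, and none of the other
numbers are specifically tuned:

# 64 AGs = 2x the 32 working directories
mkfs.xfs -f -m "crc=1,finobt=1" -d agcount=64 /dev/md0
mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
mkdir -p /mnt/scratch/{0..31}

# build the 32 -d options instead of typing them all out
dirs=""
for i in `seq 0 31`; do dirs="$dirs -d /mnt/scratch/$i"; done
./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 $dirs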
> Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)

That's why I do it in parallel - step 6 of my test script is:

echo removing files
# one rm per working directory, all running concurrently
for f in /mnt/scratch/* ; do time rm -rf "$f" & done
wait

And so:

.....
removing files
real    4m2.752s
user    0m3.387s
sys     2m56.801s
....
real    4m17.326s
user    0m3.333s
sys     2m57.831s
$

It takes a lot less than forever :)

Really, the fsmark run is just one part of my concurrent XFS inode
test script, which takes about 20 minutes to run in total. It does:

Prep: mkfs, mount
1. run fsmark to create inodes in parallel
2. run xfs_repair with maximum concurrency
3. run multi-threaded bulkstat
4. run concurrent find+stat
5. run concurrent ls -R
6. run concurrent rm -rf

It stresses all sorts of stuff:

- Steps 1 and 6 stress the XFS inode allocation and transaction
  subsystems - it runs at about 400-500,000 transaction commits a
  second here.
- Step 2 absolutely thrashes the mmap_sem from userspace due to the
  memory demand and concurrent access patterns of xfs_repair.
- Step 3 is a cold-cache inode traversal - it pushes close to a
  million inodes/second through the slab caches. It puts a hell of a
  lot of load on the inode and xfs_buf slab caches, the xfs_buf slab
  shrinker and all the VFS inode instantiation and teardown paths.
  It is currently limited in scalability by inode_sb_list_lock
  contention.
- Steps 4 and 5 do different types of directory traversal, putting
  heavy demand on the XFS buffer cache and inode cache shrinkers to
  work effectively.

I have several variants - small files, different filesystems,
different directory structures, etc. - because they all stress
different aspects of filesystem and core infrastructure. It's found
locking regressions. It's found mm/ subsystem regressions. It's
found writeback regressions. It's found all sorts of bugs in my
code over the years - it's a very useful test, so I keep using it. ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com