Date: Sat, 5 Sep 2015 08:03:12 +1000
From: Dave Chinner
To: Peter Zijlstra
Cc: Linus Torvalds, Linux Kernel Mailing List, Waiman Long, Ingo Molnar
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
Message-ID: <20150904220312.GC3902@dastard>
References: <20150904054820.GY3902@dastard>
 <20150904073917.GA18489@twins.programming.kicks-ass.net>
 <20150904081234.GA3902@dastard>
 <20150904113233.GT3644@twins.programming.kicks-ass.net>
In-Reply-To: <20150904113233.GT3644@twins.programming.kicks-ass.net>

On Fri, Sep 04, 2015 at 01:32:33PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> > You probably don't even need a VM to reproduce it - that would
> > certainly be an interesting counterpoint if it didn't....
>
> Even though you managed to restore your DEBUG_SPINLOCK performance by
> changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
> actual hardware just to test.
>
> [ Note: In any case, I would recommend you use (or at least try)
> PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
> for performance; the test-and-set fallback really wasn't meant as a
> performance option (although it clearly sucks worse than expected).

I will try it, but that can happen when I've got a bit of spare
time...

> Pre-qspinlock, your setup would have used regular ticket locks on
> vCPUs, which mostly works as long as there is almost no vCPU
> preemption; if you overload your machine such that the vCPU threads
> get preempted, that will implode into silly-land. ]

I don't tend to overload the host CPUs - all my test loads are IO
bound - so this has never really been a problem I've noticed in the
past.

> So on to native performance:
>
> - IVB-EX, 4-socket, 15 cores per socket, hyperthreaded, for a total
>   of 120 CPUs
> - 1.1T of md-stripe (5x200GB) SSDs
> - Linux v4.2 (distro-style .config)
> - Debian "testing" base system
> - xfsprogs v3.2.1
>
> # mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0

If you use xfsprogs v3.2.4 (current Debian unstable), these are the
default options.
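i.e. assuming those defaults, the whole -m argument can be dropped
and a bare

$ mkfs.xfs -f /dev/md0

should give you the same crc=1,finobt=1 geometry - verify against
the geometry mkfs prints, quoted below, if in doubt.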
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1
> data     =                       bsize=4096   blocks=293038720, imaxpct=5
>          =                       sunit=128    swidth=640 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=143088, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
>
> # ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
>         -d /mnt/scratch/0 -d /mnt/scratch/1 \
>         -d /mnt/scratch/2 -d /mnt/scratch/3 \
>         -d /mnt/scratch/4 -d /mnt/scratch/5 \
>         -d /mnt/scratch/6 -d /mnt/scratch/7 \
>         -d /mnt/scratch/8 -d /mnt/scratch/9 \
>         -d /mnt/scratch/10 -d /mnt/scratch/11 \
>         -d /mnt/scratch/12 -d /mnt/scratch/13 \
>         -d /mnt/scratch/14 -d /mnt/scratch/15
>
> Regular v4.2 (qspinlock) does:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      6400000            0     286491.9          3500179
>      0      7200000            0     293229.5          3963140
>      0      8000000            0     271182.4          3708212
>      0      8800000            0     300592.0          3595722
>
> Modified v4.2 (ticket) does:
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0      6400000            0     310419.6          3343821
>      0      7200000            0     348346.5          4721133
>      0      8000000            0     328098.2          3235753
>      0      8800000            0     316765.3          3238971
>
> Which shows that qspinlock is clearly slower, even for these large-ish
> NUMA boxes where it was supposed to be better.

Be careful just reading the throughput numbers like that. You can
have the files/s number go down but the benchmark wall time get
faster, because the userspace portion runs faster (i.e. CPU cache
residency effects). In this case, however, the userspace time is
down by 5-10% and the files/s is up by 5-10%, so (without knowing
the wall time) I'd say these numbers are significant....

FWIW, you've got a lot more CPUs than I have - you can scale up the
parallelism of the workload by increasing the number of working
directories (i.e. -d options). You'd also need to scale up the
amount of allocation concurrency in XFS - 32 AGs will be the
limiting factor for any more workload concurrency - i.e. use "-d
agcount=" on the mkfs.xfs command line to increase the AG count.
For artificial scalability testing like this, you want the AG count
to be at least 2x the number of directories you are working in
concurrently. A quick sketch of what that looks like for 32
directories is further down.

> Clearly our benchmarks used before this were not sufficient, and more
> work needs to be done.
>
> Also, I note that after running to completion, there is only 14G of
> actual data on the device, so you don't need silly large storage to run
> this -- I expect your previous 275G quote was due to XFS populating the
> sparse file with meta-data or something along those lines.

Yeah, that would have been after lots of other work had been done on
the sparse file I use to back the 500TB filesystem I test on in the
VM. Currently:

$ ls -lh /mnt/fast-ssd
total 61G
-rw------- 1 root root 500T Sep  4 19:36 vm-500t.img
$ df -h /mnt/fast-ssd
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        400G   61G  340G  16% /mnt/fast-ssd
$

I'm using 61GB of space in the file that backs the 500TB device I'm
testing against. Every so often I punch out the file so that it gets
laid out again - I usually do that after running btrfs testing, as
btrfs fragments the crap out of the backing file even with extent
size hints set to minimise the fragmentation...
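("Punching out" the file here means hole-punching its entire
allocated range so it goes back to being fully sparse and is laid
out afresh on the next write. As a minimal sketch - assuming the
host filesystem supports hole punching, and using the vm-500t.img
path from above; the exact command is illustrative rather than
lifted from my scripts:

$ xfs_io -c "fpunch 0 500t" /mnt/fast-ssd/vm-500t.img

That frees all the allocated blocks but leaves the 500T file size
intact, so the guest still sees the same device.)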
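And to make the scaling suggestion above concrete, here's a sketch
of driving the same workload with 32 working directories - the
agcount of 64 just follows the 2x rule, and none of the other
numbers are specifically tuned:

# 64 AGs = 2x the 32 working directories
mkfs.xfs -f -m "crc=1,finobt=1" -d agcount=64 /dev/md0
mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
mkdir -p /mnt/scratch/{0..31}

# build the 32 -d options instead of typing them all out
dirs=""
for i in `seq 0 31`; do dirs="$dirs -d /mnt/scratch/$i"; done
./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 $dirs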
> Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)

That's why I do it in parallel - step 6 of my test script is:

echo removing files
# one rm per working directory, all running concurrently
for f in /mnt/scratch/* ; do time rm -rf "$f" & done
wait

And so:

.....
removing files
real    4m2.752s
user    0m3.387s
sys     2m56.801s
....
real    4m17.326s
user    0m3.333s
sys     2m57.831s
$

It takes a lot less than forever :)

Really, the fsmark run is just one part of my concurrent XFS inode
test script, which takes about 20 minutes to run in total. It does:

Prep: mkfs, mount
1. run fsmark to create inodes in parallel
2. run xfs_repair with maximum concurrency
3. run multi-threaded bulkstat
4. run concurrent find+stat
5. run concurrent ls -R
6. run concurrent rm -rf

It stresses all sorts of stuff:

- Steps 1 and 6 stress the XFS inode allocation and transaction
  subsystems - it runs at about 400-500,000 transaction commits a
  second here.
- Step 2 absolutely thrashes the mmap_sem from userspace due to the
  memory demand and concurrent access patterns of xfs_repair.
- Step 3 is a cold-cache inode traversal - it pushes close to a
  million inodes/second through the slab caches. It puts a hell of a
  lot of load on the inode and xfs_buf slab caches, the xfs_buf slab
  shrinker and all the VFS inode instantiation and teardown paths.
  It is currently limited in scalability by inode_sb_list_lock
  contention.
- Steps 4 and 5 do different types of directory traversal, putting
  heavy demand on the XFS buffer cache and inode cache shrinkers to
  work effectively.

I have several variants - small files, different filesystems,
different directory structures, etc. - because they all stress
different aspects of filesystem and core infrastructure. It's found
locking regressions. It's found mm/ subsystem regressions. It's
found writeback regressions. It's found all sorts of bugs in my
code over the years - it's a very useful test, so I keep using it. ;)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com