Date: Fri, 4 Sep 2015 13:32:33 +0200
From: Peter Zijlstra
To: Dave Chinner
Cc: Linus Torvalds, Linux Kernel Mailing List, Waiman Long, Ingo Molnar
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
Message-ID: <20150904113233.GT3644@twins.programming.kicks-ass.net>
References: <20150904054820.GY3902@dastard>
 <20150904073917.GA18489@twins.programming.kicks-ass.net>
 <20150904081234.GA3902@dastard>
In-Reply-To: <20150904081234.GA3902@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2012-12-30)

On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> You probably don't even need a VM to reproduce it - that would
> certainly be an interesting counterpoint if it didn't....

Even though you managed to restore your DEBUG_SPINLOCK performance by
changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
actual hardware just to test.

[ Note: In any case, I would recommend you use (or at least try)
  PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
  for performance; the test-and-set fallback really wasn't meant as a
  performance option (although it clearly sucks worse than expected).
  A rough sketch of that fallback follows the numbers below.

  Pre qspinlock, your setup would have used regular ticket locks on
  vCPUs, which mostly works as long as there is almost no vCPU
  preemption; if you overload your machine such that the vCPU threads
  get preempted, that will implode into silly-land -- see the toy
  ticket-lock sketch at the end of this mail. ]

So on to native performance:

 - IVB-EX, 4-socket, 15 cores per socket, hyperthreaded, for a total
   of 120 CPUs
 - 1.1T of md-stripe (5x200GB) SSDs
 - Linux v4.2 (distro style .config)
 - Debian "testing" base system
 - xfsprogs v3.2.1

# mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=293038720, imaxpct=5
         =                       sunit=128    swidth=640 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=143088, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch

# ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
        -d /mnt/scratch/0 -d /mnt/scratch/1 \
        -d /mnt/scratch/2 -d /mnt/scratch/3 \
        -d /mnt/scratch/4 -d /mnt/scratch/5 \
        -d /mnt/scratch/6 -d /mnt/scratch/7 \
        -d /mnt/scratch/8 -d /mnt/scratch/9 \
        -d /mnt/scratch/10 -d /mnt/scratch/11 \
        -d /mnt/scratch/12 -d /mnt/scratch/13 \
        -d /mnt/scratch/14 -d /mnt/scratch/15

Regular v4.2 (qspinlock) does (FSUse% Count Size Files/sec App Overhead):

0      6400000            0     286491.9       3500179
0      7200000            0     293229.5       3963140
0      8000000            0     271182.4       3708212
0      8800000            0     300592.0       3595722

Modified v4.2 (ticket) does:

0      6400000            0     310419.6       3343821
0      7200000            0     348346.5       4721133
0      8000000            0     328098.2       3235753
0      8800000            0     316765.3       3238971

Which shows that qspinlock is clearly slower, even on these large-ish
NUMA boxes where it was supposed to be better.
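To be concrete about the fallback mentioned in the note above: it looks
roughly like the below -- paraphrased from memory of the v4.2 x86
qspinlock header, not a verbatim quote -- and Dave's experiment amounts
to using __delay(1) in the spin loop instead of cpu_relax():

/* Rough paraphrase of the v4.2 x86 test-and-set fallback that is used
 * when running on a hypervisor without paravirt spinlock support. */
static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
	/* Bare metal: fall through to the normal queued slow path. */
	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	/* Plain global test-and-set; no queueing, no fairness. */
	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
		cpu_relax();	/* Dave's tweak: __delay(1) here instead */

	return true;
}

Every waiter bangs on the same cacheline with atomic_cmpxchg(), so this
scales poorly with CPU count; the one thing it has going for it is that
a preempted waiter never blocks anybody else, which is why it is the
safe (if slow) choice inside a VM without PARAVIRT_SPINLOCKS.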
Clearly the benchmarks we used before this were not sufficient, and
more work needs to be done.

Also, I note that after running to completion there is only 14G of
actual data on the device, so you don't need silly large storage to run
this (which makes sense: 32 loops x 16 dirs x 50000 zero-length files
is ~25.6M inodes, and at isize=512 that is on the order of 13G of inode
data alone). I expect your previous 275G quote was due to XFS
populating the sparse file with meta-data or something along those
lines.

Further note, rm -rf /mnt/scratch0/* takes for bloody ever :-)
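PS. Since the ticket-lock-under-vCPU-preemption point above is easy to
gloss over, here is a minimal, purely illustrative sketch (plain C11
atomics, *not* the kernel's arch_spinlock_t) of why strict FIFO
ordering goes bad once waiters can be preempted:

#include <stdatomic.h>

/* Toy ticket lock; initialise both fields to 0 before use. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket that currently owns the lock */
};

static void ticket_lock_acquire(struct ticket_lock *lock)
{
	/* Take a ticket; FIFO order is decided right here. */
	unsigned int me = atomic_fetch_add(&lock->next, 1);

	/* We may only proceed when our number comes up.  If the vCPU
	 * holding the ticket ahead of ours is preempted, we -- and
	 * everyone queued behind us -- spin until it runs again. */
	while (atomic_load(&lock->owner) != me)
		;	/* cpu_relax() / pause in real code */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
	/* Pass the lock to the next ticket in line. */
	atomic_fetch_add(&lock->owner, 1);
}

The queued slow path of qspinlock has the same FIFO property, which is
why the paravirt variant (where waiters can halt and be kicked) exists
for VMs in the first place.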