Date: Fri, 4 Sep 2015 13:32:33 +0200
From: Peter Zijlstra
To: Dave Chinner
Cc: Linus Torvalds, Linux Kernel Mailing List, Waiman Long, Ingo Molnar
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
Message-ID: <20150904113233.GT3644@twins.programming.kicks-ass.net>
References: <20150904054820.GY3902@dastard>
 <20150904073917.GA18489@twins.programming.kicks-ass.net>
 <20150904081234.GA3902@dastard>
In-Reply-To: <20150904081234.GA3902@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2012-12-30)

On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> You probably don't even need a VM to reproduce it - that would
> certainly be an interesting counterpoint if it didn't....

Even though you managed to restore your DEBUG_SPINLOCK performance by
changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
actual hardware just to test.

[ Note: In any case, I would recommend you use (or at least try)
  PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
  for performance; the test-and-set fallback really wasn't meant as a
  performance option (although it clearly sucks worse than expected).
  A rough sketch of that fallback follows the numbers below.

  Pre qspinlock, your setup would have used regular ticket locks on
  vCPUs, which mostly works as long as there is almost no vCPU
  preemption; if you overload your machine such that the vCPU threads
  get preempted, that will implode into silly-land -- see the toy
  ticket-lock sketch at the end of this mail. ]

So on to native performance:

 - IVB-EX, 4-socket, 15 cores per socket, hyperthreaded, for a total
   of 120 CPUs
 - 1.1T of md-stripe (5x200GB) SSDs
 - Linux v4.2 (distro style .config)
 - Debian "testing" base system
 - xfsprogs v3.2.1

# mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=293038720, imaxpct=5
         =                       sunit=128    swidth=640 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=143088, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch

# ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
        -d /mnt/scratch/0 -d /mnt/scratch/1 \
        -d /mnt/scratch/2 -d /mnt/scratch/3 \
        -d /mnt/scratch/4 -d /mnt/scratch/5 \
        -d /mnt/scratch/6 -d /mnt/scratch/7 \
        -d /mnt/scratch/8 -d /mnt/scratch/9 \
        -d /mnt/scratch/10 -d /mnt/scratch/11 \
        -d /mnt/scratch/12 -d /mnt/scratch/13 \
        -d /mnt/scratch/14 -d /mnt/scratch/15

Regular v4.2 (qspinlock) does (FSUse% Count Size Files/sec App Overhead):

0      6400000            0     286491.9       3500179
0      7200000            0     293229.5       3963140
0      8000000            0     271182.4       3708212
0      8800000            0     300592.0       3595722

Modified v4.2 (ticket) does:

0      6400000            0     310419.6       3343821
0      7200000            0     348346.5       4721133
0      8000000            0     328098.2       3235753
0      8800000            0     316765.3       3238971

Which shows that qspinlock is clearly slower, even on these large-ish
NUMA boxes where it was supposed to be better.
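To be concrete about the fallback mentioned in the note above: it looks
roughly like the below -- paraphrased from memory of the v4.2 x86
qspinlock header, not a verbatim quote -- and Dave's experiment amounts
to using __delay(1) in the spin loop instead of cpu_relax():

/* Rough paraphrase of the v4.2 x86 test-and-set fallback that is used
 * when running on a hypervisor without paravirt spinlock support. */
static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
	/* Bare metal: fall through to the normal queued slow path. */
	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	/* Plain global test-and-set; no queueing, no fairness. */
	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
		cpu_relax();	/* Dave's tweak: __delay(1) here instead */

	return true;
}

Every waiter bangs on the same cacheline with atomic_cmpxchg(), so this
scales poorly with CPU count; the one thing it has going for it is that
a preempted waiter never blocks anybody else, which is why it is the
safe (if slow) choice inside a VM without PARAVIRT_SPINLOCKS.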
Clearly the benchmarks we used before this were not sufficient, and
more work needs to be done.

Also, I note that after running to completion there is only 14G of
actual data on the device, so you don't need silly large storage to run
this (which makes sense: 32 loops x 16 dirs x 50000 zero-length files
is ~25.6M inodes, and at isize=512 that is on the order of 13G of inode
data alone). I expect your previous 275G quote was due to XFS
populating the sparse file with meta-data or something along those
lines.

Further note, rm -rf /mnt/scratch0/* takes for bloody ever :-)
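PS. Since the ticket-lock-under-vCPU-preemption point above is easy to
gloss over, here is a minimal, purely illustrative sketch (plain C11
atomics, *not* the kernel's arch_spinlock_t) of why strict FIFO
ordering goes bad once waiters can be preempted:

#include <stdatomic.h>

/* Toy ticket lock; initialise both fields to 0 before use. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket that currently owns the lock */
};

static void ticket_lock_acquire(struct ticket_lock *lock)
{
	/* Take a ticket; FIFO order is decided right here. */
	unsigned int me = atomic_fetch_add(&lock->next, 1);

	/* We may only proceed when our number comes up.  If the vCPU
	 * holding the ticket ahead of ours is preempted, we -- and
	 * everyone queued behind us -- spin until it runs again. */
	while (atomic_load(&lock->owner) != me)
		;	/* cpu_relax() / pause in real code */
}

static void ticket_lock_release(struct ticket_lock *lock)
{
	/* Pass the lock to the next ticket in line. */
	atomic_fetch_add(&lock->owner, 1);
}

The queued slow path of qspinlock has the same FIFO property, which is
why the paravirt variant (where waiters can halt and be kicked) exists
for VMs in the first place.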