From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754627AbbI2Arf (ORCPT <rfc822;w@1wt.eu>);
	Mon, 28 Sep 2015 20:47:35 -0400
Received: from mail-yk0-f180.google.com ([209.85.160.180]:34214 "EHLO
	mail-yk0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754098AbbI2Are convert rfc822-to-8bit (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 28 Sep 2015 20:47:34 -0400
MIME-Version: 1.0
In-Reply-To: <56094F05.4090809@hpe.com>
References: <20150904054820.GY3902@dastard>
	<CA+55aFyuob5iOOptzdD1W7gsxcrUGkgU50UoLA+Aq29-jO0KSw@mail.gmail.com>
	<20150904073917.GA18489@twins.programming.kicks-ass.net>
	<20150904081234.GA3902@dastard>
	<20150904113233.GT3644@twins.programming.kicks-ass.net>
	<CAC=cRTOraeOeu3Z8C1qx6w=GMSzD_4VevrEzn0mMhrqy=7n3wQ@mail.gmail.com>
	<56094F05.4090809@hpe.com>
Date: Tue, 29 Sep 2015 08:47:33 +0800
Message-ID: <CAC=cRTO8W5seDWyTZr4e_yV0LjRvqF+VXnnp-2bxjTSgWBZeQw@mail.gmail.com>
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
From: huang ying <huang.ying.caritas@gmail.com>
To: Waiman Long <waiman.long@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>, Dave Chinner <david@fromorbit.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Waiman Long <Waiman.Long@hp.com>, Ingo Molnar <mingo@kernel.org>,
        ying.huang@linux.intel.com
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi, Waiman,

On Mon, Sep 28, 2015 at 10:30 PM, Waiman Long <waiman.long@hpe.com> wrote:
>
> On 09/28/2015 04:54 AM, huang ying wrote:
>
> Hi, Peter
>
> On Fri, Sep 4, 2015 at 7:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
>> > You probably don't even need a VM to reproduce it - that would
>> > certainly be an interesting counterpoint if it didn't....
>>
>> Even though you managed to restore your DEBUG_SPINLOCK performance by
>> changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
>> actual hardware just to test.
>>
>> [ Note: In any case, I would recommend you use (or at least try)
>>   PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking for
>>   performance, the test-and-set fallback really wasn't meant as a
>>   performance option (although it clearly sucks worse than expected).
>>
>>   Pre qspinlock, your setup would have used regular ticket locks on
>>   vCPUs, which mostly works as long as there is almost no vCPU
>>   preemption, if you overload your machine such that the vCPU threads
>>   get preempted that will implode into silly-land. ]
>>
>> So on to native performance:
>>
>>  - IVB-EX, 4-socket, 15 core, hyperthreaded, for a total of 120 CPUs
>>  - 1.1T of md-stripe (5x200GB) SSDs
>>  - Linux v4.2 (distro style .config)
>>  - Debian "testing" base system
>>  - xfsprogs v3.2.1
>>
>>
>> # mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
>> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
>> log stripe unit adjusted to 32KiB
>> meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
>>          =                       sectsz=512   attr=2, projid32bit=1
>>          =                       crc=1        finobt=1
>> data     =                       bsize=4096   blocks=293038720, imaxpct=5
>>          =                       sunit=128    swidth=640 blks
>> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
>> log      =internal log           bsize=4096   blocks=143088, version=2
>>          =                       sectsz=512   sunit=8 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>> # mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
>>
>> # ./fs_mark  -D  10000  -S0  -n  50000  -s  0  -L  32 \
>>          -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
>>          -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
>>          -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
>>          -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
>>          -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
>>          -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
>>          -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
>>          -d  /mnt/scratch/14  -d  /mnt/scratch/15 \
>>
>>
>> Regular v4.2 (qspinlock) does:
>>
>>      0      6400000            0     286491.9          3500179
>>      0      7200000            0     293229.5          3963140
>>      0      8000000            0     271182.4          3708212
>>      0      8800000            0     300592.0          3595722
>>
>> Modified v4.2 (ticket) does:
>>
>>      0      6400000            0     310419.6          3343821
>>      0      7200000            0     348346.5          4721133
>>      0      8000000            0     328098.2          3235753
>>      0      8800000            0     316765.3          3238971
>>
>>
>
> Does "modified v4.2 (ticket)" mean you just removed ARCH_USE_QUEUED_SPINLOCKS from the config file when building the 2 kernels in the above test? Your config file is for 4.1. If you compare a 4.1 kernel with a 4.2 kernel, there are a lot more changes than just the qspinlock switch.

I think you are confusing PeterZ's test with my test.  PeterZ's test
compares v4.2 with a modified v4.2, and he didn't post his
configuration.  My test compares
fc934d40178ad4e551a17e2733241d9f29fddd70 and
68722101ec3a0e179408a13708dd020e04f54aab, so my configuration
(attached to my previous email) is for v4.1.

> Could you also use the perf command to profile the 2 cases to see where the performance bottleneck is?
>
>
>> Which shows that qspinlock is clearly slower, even for these large-ish
>> NUMA boxes where it was supposed to be better.
>>
>> Clearly our benchmarks used before this were not sufficient, and more
>> work needs to be done.
>>
>>
>> Also, I note that after running to completion, there is only 14G of
>> actual data on the device, so you don't need silly large storage to run
>> this -- I expect your previous 275G quote was due to XFS populating the
>> sparse file with meta-data or something along those lines.
>>
>> Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)
>
>
> We are trying to reproduce your regression in our test environment (LKP).  We tested fs_mark with the following command line:
>
> # mkfs -t xfs /dev/ram0
> # mount -t xfs -o nobarrier,inode64 /dev/ram0 /fs/ram0
> # ./fs_mark -d /fs/ram0/1 -d /fs/ram0/2 -d /fs/ram0/3 -d /fs/ram0/4 -d /fs/ram0/5 -d /fs/ram0/6 -d /fs/ram0/7 -d /fs/ram0/8 -d /fs/ram0/9 -d /fs/ram0/10 -d /fs/ram0/11 -d /fs/ram0/12 -d /fs/ram0/13 -d /fs/ram0/14 -d /fs/ram0/15 -d /fs/ram0/16 -D 10000 -N 5 -n 49152 -L 32 -S 0 -s 0
>
> The test was run on an IVB-EX box, with a ramdisk.  We tested two commits,
>
> fc934d40178ad4e551a17e2733241d9f29fddd70
> 68722101ec3a0e179408a13708dd020e04f54aab
>
> I think they were the commits before and after introducing the qspinlock.  The test results show no regressions:
>
> fc934d40178ad4e5 68722101ec3a0e179408a13708
> ---------------- --------------------------
>          %stddev     %change         %stddev
>              \          |                \
>   13214787 ±  0%      -1.0%   13088679 ±  1%  fsmark.app_overhead
>      36895 ±  0%      -0.1%      36841 ±  0%  fsmark.files_per_sec
>     687.69 ±  0%      +0.1%     688.68 ±  0%  fsmark.time.elapsed_time
>     687.69 ±  0%      +0.1%     688.68 ±  0%  fsmark.time.elapsed_time.max
>     208.00 ±  0%      +0.0%     208.00 ±  0%  fsmark.time.file_system_inputs
>       8.00 ±  0%      +0.0%       8.00 ±  0%  fsmark.time.file_system_outputs
>       6627 ±  1%      +0.3%       6647 ±  1%  fsmark.time.involuntary_context_switches
>      10904 ±  0%      +0.0%      10904 ±  0%  fsmark.time.maximum_resident_set_size
>     307635 ±  0%      -1.0%     304646 ±  0%  fsmark.time.minor_page_faults
>       4096 ±  0%      +0.0%       4096 ±  0%  fsmark.time.page_size
>     338.33 ±  0%      +0.5%     340.00 ±  0%  fsmark.time.percent_of_cpu_this_job_got
>       2119 ±  0%      +0.5%       2130 ±  0%  fsmark.time.system_time
>     211.90 ±  0%      +1.4%     214.94 ±  0%  fsmark.time.user_time
>   14193260 ±  0%      +0.6%   14284812 ±  0%  fsmark.time.voluntary_context_switches
>
> Could you give us some help on how to reproduce this regression? Could you provide your kernel configuration?  Ours is attached to this email. Or could you help point out other differences in our configuration?
>
>
> The regression that was previously reported only happens when run in a VM without PARAVIRT_SPINLOCKS. With bare metal, you won't see that.

Yes.  The regression reported in the first email of the thread is for
a VM.  But the regression reported by PeterZ is on bare metal.  I am
trying to reproduce that one.
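
As a rough yardstick for the size of the bare metal gap, the files/sec
figures in PeterZ's output quoted above can be averaged.  This is only a
minimal sketch: it assumes the fourth column of each fs_mark line is
files/sec, and it uses only the four samples per kernel that were
quoted, so the result is an estimate and not a measurement of my own.

```python
# Files/sec values copied from the fs_mark lines PeterZ quoted above.
# Assumption: the 4th column of each fs_mark output line is files/sec.
qspinlock = [286491.9, 293229.5, 271182.4, 300592.0]  # regular v4.2
ticket = [310419.6, 348346.5, 328098.2, 316765.3]     # modified v4.2


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def relative_change(new, old):
    """Percent change of `new` relative to `old`."""
    return (new - old) / old * 100.0


q_mean = mean(qspinlock)
t_mean = mean(ticket)
change = relative_change(q_mean, t_mean)

print(f"qspinlock mean: {q_mean:.1f} files/sec")
print(f"ticket mean:    {t_mean:.1f} files/sec")
print(f"qspinlock vs ticket: {change:+.1f}%")
```

With these samples the qspinlock mean comes out roughly 12% below the
ticket mean, against the -0.1% change in fsmark.files_per_sec that we
see in LKP.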

Best Regards,
Huang, Ying