From: Waiman Long <waiman.long@hpe.com>
To: huang ying <huang.ying.caritas@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Dave Chinner <david@fromorbit.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Waiman Long <Waiman.Long@hp.com>, Ingo Molnar <mingo@kernel.org>,
	ying.huang@linux.intel.com
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
Date: Mon, 28 Sep 2015 22:57:12 -0400	[thread overview]
Message-ID: <5609FE08.8080104@hpe.com> (raw)
In-Reply-To: <CAC=cRTO8W5seDWyTZr4e_yV0LjRvqF+VXnnp-2bxjTSgWBZeQw@mail.gmail.com>

On 09/28/2015 08:47 PM, huang ying wrote:
> Hi, Waiman,
>
>> On Mon, Sep 28, 2015 at 10:30 PM, Waiman Long <waiman.long@hpe.com> wrote:
>> On 09/28/2015 04:54 AM, huang ying wrote:
>>
>> Hi, Peter
>>
>> On Fri, Sep 4, 2015 at 7:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
>>>> You probably don't even need a VM to reproduce it - that would
>>>> certainly be an interesting counterpoint if it didn't....
>>> Even though you managed to restore your DEBUG_SPINLOCK performance by
>>> changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
>>> actual hardware just to test.
>>>
>>> [ Note: In any case, I would recommend you use (or at least try)
>>>    PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking
>>>    for performance; the test-and-set fallback really wasn't meant as a
>>>    performance option (although it clearly sucks worse than expected).
>>>
>>>    Pre-qspinlock, your setup would have used regular ticket locks on
>>>    vCPUs, which mostly work as long as there is almost no vCPU
>>>    preemption; if you overload your machine such that the vCPU threads
>>>    get preempted, that will implode into silly-land. ]
>>>
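For anyone following along, the test-and-set fallback Peter refers to
looks roughly like this in v4.2 (paraphrased from memory; see
arch/x86/include/asm/qspinlock.h in the v4.2 tree for the exact code).
As I understand it, Dave's DEBUG_SPINLOCK experiment simply replaced
the cpu_relax() in the spin loop with __delay(1):

static inline bool virt_queued_spin_lock(struct qspinlock *lock)
{
        /* On bare metal, fall through to the normal queued code. */
        if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
                return false;

        /*
         * In a guest, degrade to a plain test-and-set lock: spin on
         * cmpxchg until the lock word is clear.  Dave's experiment put
         * __delay(1) here in place of cpu_relax() to back off between
         * attempts.
         */
        while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
                cpu_relax();

        return true;
}

With PARAVIRT_SPINLOCKS enabled, a guest takes the paravirt slow path
instead of this unfair test-and-set loop, which is why Peter recommends
it for VMs above.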
>>> So on to native performance:
>>>
>>>   - IVB-EX, 4-socket, 15 cores per socket, hyperthreaded, for a total of 120 CPUs
>>>   - 1.1T of md-stripe (5x200GB) SSDs
>>>   - Linux v4.2 (distro style .config)
>>>   - Debian "testing" base system
>>>   - xfsprogs v3.2.1
>>>
>>>
>>> # mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
>>> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
>>> log stripe unit adjusted to 32KiB
>>> meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
>>>           =                       sectsz=512   attr=2, projid32bit=1
>>>           =                       crc=1        finobt=1
>>> data     =                       bsize=4096   blocks=293038720, imaxpct=5
>>>           =                       sunit=128    swidth=640 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
>>> log      =internal log           bsize=4096   blocks=143088, version=2
>>>           =                       sectsz=512   sunit=8 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>
>>> # mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch
>>>
>>> # ./fs_mark  -D  10000  -S0  -n  50000  -s  0  -L  32 \
>>>           -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
>>>           -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
>>>           -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
>>>           -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
>>>           -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
>>>           -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
>>>           -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
>>>           -d  /mnt/scratch/14  -d  /mnt/scratch/15 \
>>>
>>>
>>> Regular v4.2 (qspinlock) does:
>>>
>>>       0      6400000            0     286491.9          3500179
>>>       0      7200000            0     293229.5          3963140
>>>       0      8000000            0     271182.4          3708212
>>>       0      8800000            0     300592.0          3595722
>>>
>>> Modified v4.2 (ticket) does:
>>>
>>>       0      6400000            0     310419.6          3343821
>>>       0      7200000            0     348346.5          4721133
>>>       0      8000000            0     328098.2          3235753
>>>       0      8800000            0     316765.3          3238971
>>>
>>>
>> Is the "modified v4.2 (ticket)" means you are just removing ARCH_USE_QUEUED_SPINLOCKS from the config file when building the 2 kernels in the above test? Your config file is for 4.1. If you compare a 4.1 kernel with 4.2 kernel, there are lot more changes than just the qspinlock switch.
> I think you are mixing up PeterZ's test and my test.  PeterZ's test
> compares v4.2 with a modified v4.2, and he didn't post his
> configuration.  My test is for
> fc934d40178ad4e551a17e2733241d9f29fddd70 and
> 68722101ec3a0e179408a13708dd020e04f54aab, so my configuration
> (attached to the previous email) is for v4.1.

Yes, I am sorry that I misread the quoted part.

>> Could you also use the perf command to profile the 2 cases to see where the performance bottleneck is?
>>
>>
>>> Which shows that qspinlock is clearly slower, even for these large-ish
>>> NUMA boxes where it was supposed to be better.
>>>
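To put these numbers in context: the uncontended qspinlock fast path is
a single cmpxchg, and only contended acquisitions enter the MCS-style
queueing slow path.  Roughly, paraphrased from
include/asm-generic/qspinlock.h (the real code has more to it):

static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
        u32 val;

        /* Uncontended: take the lock with one cmpxchg and return. */
        val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
        if (likely(val == 0))
                return;

        /*
         * Contended: set the pending bit or queue on a per-CPU MCS
         * node and spin locally until we reach the head of the queue.
         */
        queued_spin_lock_slowpath(lock, val);
}

The ticket locks it replaced spin globally on the lock word instead, so
I would expect any difference in these fs_mark numbers to show up in
the slow path under contention, which is what a perf profile of the two
kernels should tell us.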
>>> Clearly the benchmarks we used before this were not sufficient, and
>>> more work needs to be done.
>>>
>>>
>>> Also, I note that after running to completion, there is only 14G of
>>> actual data on the device, so you don't need silly large storage to run
>>> this -- I expect your previous 275G quote was due to XFS populating the
>>> sparse file with meta-data or something along those lines.
>>>
>>> Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)
>>
>> We are trying to reproduce your regression in our test environment (LKP).  We tested fs_mark with the following command line:
>>
>> # mkfs -t xfs /dev/ram0
>> # mount -t xfs -o nobarrier,inode64 /dev/ram0 /fs/ram0
>> # ./fs_mark -d /fs/ram0/1 -d /fs/ram0/2 -d /fs/ram0/3 -d /fs/ram0/4 \
>>     -d /fs/ram0/5 -d /fs/ram0/6 -d /fs/ram0/7 -d /fs/ram0/8 \
>>     -d /fs/ram0/9 -d /fs/ram0/10 -d /fs/ram0/11 -d /fs/ram0/12 \
>>     -d /fs/ram0/13 -d /fs/ram0/14 -d /fs/ram0/15 -d /fs/ram0/16 \
>>     -D 10000 -N 5 -n 49152 -L 32 -S 0 -s 0
>>
>> The test was run on an IVB-EX box, with a ramdisk.  We tested two commits,
>>
>> fc934d40178ad4e551a17e2733241d9f29fddd70
>> 68722101ec3a0e179408a13708dd020e04f54aab
>>
>> I think they were the commits before and after introducing the qspinlock.  The test results show no regressions:
>>
>> fc934d40178ad4e5 68722101ec3a0e179408a13708
>> ---------------- --------------------------
>>           %stddev     %change         %stddev
>>               \          |                \
>>    13214787 ±  0%      -1.0%   13088679 ±  1%  fsmark.app_overhead
>>       36895 ±  0%      -0.1%      36841 ±  0%  fsmark.files_per_sec
>>      687.69 ±  0%      +0.1%     688.68 ±  0%  fsmark.time.elapsed_time
>>      687.69 ±  0%      +0.1%     688.68 ±  0%  fsmark.time.elapsed_time.max
>>      208.00 ±  0%      +0.0%     208.00 ±  0%  fsmark.time.file_system_inputs
>>        8.00 ±  0%      +0.0%       8.00 ±  0%  fsmark.time.file_system_outputs
>>        6627 ±  1%      +0.3%       6647 ±  1%  fsmark.time.involuntary_context_switches
>>       10904 ±  0%      +0.0%      10904 ±  0%  fsmark.time.maximum_resident_set_size
>>      307635 ±  0%      -1.0%     304646 ±  0%  fsmark.time.minor_page_faults
>>        4096 ±  0%      +0.0%       4096 ±  0%  fsmark.time.page_size
>>      338.33 ±  0%      +0.5%     340.00 ±  0%  fsmark.time.percent_of_cpu_this_job_got
>>        2119 ±  0%      +0.5%       2130 ±  0%  fsmark.time.system_time
>>      211.90 ±  0%      +1.4%     214.94 ±  0%  fsmark.time.user_time
>>    14193260 ±  0%      +0.6%   14284812 ±  0%  fsmark.time.voluntary_context_switches
>>
>> Could you give us some help on how to reproduce this regression? Could you provide your kernel configuration?  Ours is attached to this email.  Or could you help point out other differences between our configurations?
>>
>>
>> The regression that was previously reported only happens when running in a VM without PARAVIRT_SPINLOCKS. On bare metal, you won't see it.
> Yes.  The regression reported in the first email of the thread is for
> a VM.  But the regression reported by PeterZ is for bare metal.  I am
> trying to reproduce that one.
>
> Best Regards,
> Huang, Ying

I see. Thanks for the clarification.

Cheers,
Longman


Thread overview: 32+ messages
2015-09-04  5:48 [4.2, Regression] Queued spinlocks cause major XFS performance regression Dave Chinner
2015-09-04  6:39 ` Linus Torvalds
2015-09-04  7:11   ` Dave Chinner
2015-09-04  7:31     ` Juergen Gross
2015-09-04  7:55     ` Peter Zijlstra
2015-09-04  8:29     ` Dave Chinner
2015-09-04 15:05       ` Linus Torvalds
2015-09-04 15:14         ` Peter Zijlstra
2015-09-04 15:21           ` Linus Torvalds
2015-09-04 15:30             ` Peter Zijlstra
2015-09-04 15:54               ` Peter Zijlstra
2015-09-10  2:06                 ` Waiman Long
2015-09-04 15:58               ` Linus Torvalds
2015-09-05 17:45                 ` Peter Zijlstra
2015-09-04 15:25           ` Peter Zijlstra
2015-09-06 23:32             ` Dave Chinner
2015-09-07  0:05             ` Davidlohr Bueso
2015-09-07  6:57               ` Peter Zijlstra
2015-09-07 20:45                 ` Linus Torvalds
2015-09-08  6:37                   ` Davidlohr Bueso
2015-09-08 10:05                   ` Peter Zijlstra
2015-09-08 17:45                     ` Linus Torvalds
2015-09-13 10:55             ` [tip:locking/core] locking/qspinlock/x86: Fix performance regression under unaccelerated VMs tip-bot for Peter Zijlstra
2015-09-04  7:39   ` [4.2, Regression] Queued spinlocks cause major XFS performance regression Peter Zijlstra
2015-09-04  8:12     ` Dave Chinner
2015-09-04 11:32       ` Peter Zijlstra
2015-09-04 22:03         ` Dave Chinner
2015-09-06 23:47         ` Dave Chinner
2015-09-10  2:09           ` Waiman Long
     [not found]         ` <CAC=cRTOraeOeu3Z8C1qx6w=GMSzD_4VevrEzn0mMhrqy=7n3wQ@mail.gmail.com>
     [not found]           ` <56094F05.4090809@hpe.com>
2015-09-29  0:47             ` huang ying
2015-09-29  2:57               ` Waiman Long [this message]
2015-09-10  2:01 ` Waiman Long
