linux-xfs.vger.kernel.org archive mirror
From: Shaokun Zhang <zhangshaokun@hisilicon.com>
To: Dave Chinner <david@fromorbit.com>
Cc: <linux-xfs@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	Yang Guo <guoyang2@huawei.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>
Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0
Date: Fri, 8 Nov 2019 13:58:56 +0800	[thread overview]
Message-ID: <d627883a-850c-1ec4-e057-cf9e9b47c50e@hisilicon.com> (raw)
In-Reply-To: <20191106212041.GF4614@dread.disaster.area>

Hi Dave,

On 2019/11/7 5:20, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 12:03, Dave Chinner wrote:
>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>> Hi Dave,
>>>>
>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>>>
>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>> whether the counter is less than 0, and it is an expensive function.
>>>>>> Let's check it only when delta < 0; this will be good for xfs's performance.
>>>>>
>>>>> Hmmm. I don't recall this as being expensive.
>>>>>
>>>>
>>>> Sorry about the misunderstanding information in commit message.
>>>>
>>>>> How did you find this? Can you please always document how you found
>>>>
>>>> If the user creates millions of files and then deletes them, we found that
>>>> __percpu_counter_compare cost 5.78% of CPU time. You are right that it itself
>>>> is not expensive, but it calls __percpu_counter_sum, which takes a spinlock
>>>> and reads every other cpu's count. perf record -g was used to profile it:
>>>>
>>>> - 5.88%     0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>>>    - 5.86% xfs_mod_ifree
>>>>       - 5.78% __percpu_counter_compare
>>>>            5.61% __percpu_counter_sum
>>>
>>> Interesting. Your workload is hitting the slow path, which I most
>>> certainly do not see when creating lots of files. What's your
>>> workload?
>>>
>>
>> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
>> while the test is single threaded, as follows:
>> ./mdtest -I 10  -z 6 -b 8 -d /mnt/ -t -c 2
> 
> What version and where do I get it?

You can get mdtest from github: https://github.com/LLNL/mdtest.

> 
> Hmmm - isn't mdtest an MPI benchmark intended for highly concurrent
> metadata workload testing? How representative is it of your actual
> production workload? Is that single threaded?
> 

We just use mdtest to measure the performance of the file system; it is not
representative of our actual production workload, and it is single threaded. But we
also found that the slow path is taken when we remove a directory containing many
files, with a command like:
rm -rf xxx
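
For reference, the path we are profiling looks roughly like this (a simplified
sketch of xfs_mod_ifree in fs/xfs/xfs_mount.c as we read it; exact details may
differ):

--
int
xfs_mod_ifree(
	struct xfs_mount	*mp,
	int64_t			delta)
{
	percpu_counter_add(&mp->m_ifree, delta);

	/*
	 * This compare falls back to a full per-cpu sum whenever the
	 * counter is within batch * num_online_cpus() of zero, and that
	 * sum is where the __percpu_counter_sum time in the profile goes.
	 */
	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
		ASSERT(0);
		percpu_counter_add(&mp->m_ifree, -delta);
		return -EINVAL;
	}
	return 0;
}
--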

>> xfs info:
>> meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks
> 
> only 4 AGs, which explains the lack of free inodes - there isn't
> enough concurrency in the filesystem layout to push the free inode
> count in all AGs beyond the batchsize * num_online_cpus().
> 
> i.e. single threaded workloads typically drain the free inode count
> all the way down to zero before new inodes are allocated. Workloads
> that are highly concurrent allocate from lots of AGs at once,
> leaving free inodes in every AG that is not currently being actively
> allocated out of.
> 
> As a test, can you remake that test filesystem with "-d agcount=32"
> and see if the overhead you are seeing disappears?
> 

We remade the filesystem with "-d agcount=32" and it still enters the slow path
most of the time. Printing batch * num_online_cpus() shows that the threshold is
actually 32768, because percpu_counter_batch is initialized to 256 when there are
128 cpu cores. We then tried agcount=1024, and it still takes the slow path
frequently because there are usually far fewer than 32768 free inodes.
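
To make the numbers concrete, the check we keep falling into looks roughly like
this (paraphrased from lib/percpu_counter.c from memory, so the exact code may
differ):

--
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
{
	s64	count;

	count = percpu_counter_read(fbc);
	/* Fast path: the rough count is far enough from rhs to be trusted. */
	if (abs(count - rhs) > (batch * num_online_cpus()))
		return count > rhs ? 1 : -1;

	/* Slow path: take the lock and sum every cpu's local counter. */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	else if (count < rhs)
		return -1;
	return 0;
}
--

With percpu_counter_batch = 256 and 128 online cpus, the fast path needs the
counter to be more than 32768 away from zero, and m_ifree here rarely is.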

>>> files and you have lots of idle CPU and hence the inode allocation
>>> is not clearing the fast path batch threshold on the ifree counter.
>>> And because you have lots of CPUs, the cost of a sum is very
>>> expensive compared to running single threaded creates. That's my
>>> current hypothesis based on what I see on my workloads that
>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>
>>
>> Agreed, we added some debug info in xfs_mod_ifree and found that most of the
>> time m_ifree.count < batch * num_online_cpus(), because we have 128 online
>> cpus and m_ifree.count is only around 999.
> 
> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
> path. 32 AGs may well push the count over this threshold, so it's
> definitely worth trying....
> 

Yes, we tried it and found that the threshold is actually 32768, because
percpu_counter_batch is initialized to 2 * num_online_cpus() (256 here), not 32.
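
i.e. roughly (again sketching lib/percpu_counter.c from memory; the exact form may
differ):

--
/* cpu hotplug callback that sizes the default batch */
static int compute_batch_value(unsigned int cpu)
{
	int nr = num_online_cpus();

	percpu_counter_batch = max(32, nr * 2);
	return 0;
}
--

which gives 2 * 128 = 256 here, hence the 256 * 128 = 32768 threshold.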

>>> FWIW, the profiles I took came from running this on 16 and 32p
>>> machines:
>>>
>>> --
>>> dirs=""
>>> for i in `seq 1 $THREADS`; do
>>>         dirs="$dirs -d /mnt/scratch/$i"
>>> done
>>>
>>> cycles=$((512 / $THREADS))
>>>
>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>> --
>>>
>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>> image:
>>>
>>> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>>>          =                       reflink=1
>>> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>
>>> That's allocating enough inodes to keep the free inode counter
>>> entirely out of the slow path...
>>
>> percpu_counter_read, which reads the rough count, still causes cache
>> synchronization cost when other cpus have changed the count. Maybe it is better
>> not to call percpu_counter_compare at all when it can be avoided.
> 
> Depends.  Sometimes we trade off ultimate single threaded
> performance and efficiency for substantially better scalability.
> i.e. if we lose 5% on single threaded performance but gain 10x on
> concurrent workloads, then that is a good tradeoff to make.
> 

Agreed. What I mean is that when delta > 0 there is no need to call
percpu_counter_compare in xfs_mod_ifree/icount at all, since adding to the counter
can never make it go negative; a sketch of the idea is below.
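
To illustrate (just a sketch of the idea, not the exact patch as posted):

--
int
xfs_mod_ifree(
	struct xfs_mount	*mp,
	int64_t			delta)
{
	percpu_counter_add(&mp->m_ifree, delta);

	/*
	 * Adding to the counter can never drive it negative, so the
	 * potentially expensive compare is only needed when delta < 0.
	 */
	if (delta < 0 && percpu_counter_compare(&mp->m_ifree, 0) < 0) {
		ASSERT(0);
		percpu_counter_add(&mp->m_ifree, -delta);
		return -EINVAL;
	}
	return 0;
}
--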

Thanks,
Shaokun

> Cheers,
> 
> Dave.
> 


Thread overview: 13+ messages
2019-11-04 11:29 [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0 Shaokun Zhang
2019-11-04 15:25 ` Christoph Hellwig
2019-11-05  3:08   ` Shaokun Zhang
2019-11-04 20:49 ` Dave Chinner
2019-11-05  3:26   ` Shaokun Zhang
2019-11-05  4:03     ` Dave Chinner
2019-11-06  6:00       ` Shaokun Zhang
2019-11-06 21:20         ` Dave Chinner
2019-11-08  5:58           ` Shaokun Zhang [this message]
2019-11-15  9:16             ` Shaokun Zhang
2019-11-18  8:12             ` Dave Chinner
2019-11-20 21:08               ` Dave Chinner
2019-11-21  6:56                 ` Christoph Hellwig
