From: Shaokun Zhang
Subject: Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0
To: Dave Chinner
CC: Yang Guo, "Darrick J. Wong", "Christoph Hellwig"
Date: Fri, 15 Nov 2019 17:16:21 +0800
Message-ID: <724125af-dfff-c0e0-93f2-2da7a2fe19cb@hisilicon.com>
References: <1572866980-13001-1-git-send-email-zhangshaokun@hisilicon.com> <20191104204909.GB4614@dread.disaster.area> <20191105040325.GC4614@dread.disaster.area> <675693c2-8600-1cbd-ce50-5696c45c6cd9@hisilicon.com> <20191106212041.GF4614@dread.disaster.area>
X-Mailing-List: linux-xfs@vger.kernel.org

Hi Dave,

With configuration "-d agcount=32", it also enters the slow path frequently when
there are 128 cpu cores. Any thoughts about this issue? Can we remove the debug
check entirely, as Christoph suggested?

Thanks,
Shaokun

On 2019/11/8 13:58, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
>> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>>> Hi Dave,
>>>
>>> On 2019/11/5 12:03, Dave Chinner wrote:
>>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>>> Hi Dave,
>>>>>
>>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>>> From: Yang Guo
>>>>>>>
>>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>>> whether the counter is less than 0, and it is an expensive function.
>>>>>>> Let's check it only when delta < 0; it will be good for xfs's performance.
>>>>>>
>>>>>> Hmmm. I don't recall this as being expensive.
>>>>>>
>>>>>
>>>>> Sorry about the misleading information in the commit message.
>>>>>
>>>>>> How did you find this? Can you please always document how you found
>>>>>
>>>>> If a user creates millions of files and then deletes them, we found that
>>>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it is
>>>>> not expensive in itself, but it calls __percpu_counter_sum, which will take
>>>>> a spin_lock and read the other cpus' counts.
>>>>> perf record -g is used to profile it:
>>>>>
>>>>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
>>>>>    - 5.86% xfs_mod_ifree
>>>>>       - 5.78% __percpu_counter_compare
>>>>>            5.61% __percpu_counter_sum
>>>>
>>>> Interesting. Your workload is hitting the slow path, which I most
>>>> certainly do not see when creating lots of files. What's your
>>>> workload?
>>>>
>>>
>>> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
>>> while the test is a single thread, as follows:
>>> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
>>
>> What version and where do I get it?
>
> You can get the mdtest from github: https://github.com/LLNL/mdtest.
>
>>
>> Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
>> metadata workload testing? How representative is it of your actual
>> production workload? Is that single threaded?
>>
>
> We just use mdtest to test the performance of a file system; it is not representative
> of the actual workload, and it's single threaded. But we also find that it goes to the
> slow path when we remove a dir with many files. The cmd is below:
> rm -rf xxx.
>
>>> xfs info:
>>> meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks
>>
>> only 4 AGs, which explains the lack of free inodes - there isn't
>> enough concurrency in the filesystem layout to push the free inode
>> count in all AGs beyond the batchsize * num_online_cpus().
>>
>> i.e. single threaded workloads typically drain the free inode count
>> all the way down to zero before new inodes are allocated. Workloads
>> that are highly concurrent allocate from lots of AGs at once,
>> leaving free inodes in every AG that is not currently being actively
>> allocated out of.
>>
>> As a test, can you remake that test filesystem with "-d agcount=32"
>> and see if the overhead you are seeing disappears?
>>
>
> We remade the filesystem with "-d agcount=32" and it still enters the slow path
> most of the time. We printed batch * num_online_cpus() and found that it's 32768.
> This is because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> Then we changed to agcount=1024, and it still goes to the slow path frequently because
> most of the time there are fewer than 32768 free inodes.
>
>>>> files and you have lots of idle CPU and hence the inode allocation
>>>> is not clearing the fast path batch threshold on the ifree counter.
>>>> And because you have lots of CPUs, the cost of a sum is very
>>>> expensive compared to running single threaded creates. That's my
>>>> current hypothesis based on what I see on my workloads that
>>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>>
>>>
>>> Agreed, we added some debug info in xfs_mod_ifree and found that most of the time
>>> m_ifree.count < batch * num_online_cpus(), because we have 128 online
>>> cpus and m_ifree.count is around 999.
>>
>> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
>> path. 32 AGs may well push the count over this threshold, so it's
>> definitely worth trying....
>>
>
> Yes, we tried it and found that the threshold was 32768, because percpu_counter_batch
> was initialized to 2 * num_online_cpus().
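
The two threshold figures quoted above (roughly 4000 versus 32768) differ only in the batch value that is plugged in. A minimal Python sketch of that arithmetic follows, assuming the batch is 2 * num_online_cpus() with a floor of 32 (the floor is an assumption, taken from the 32 used in the earlier estimate). The function names are hypothetical and this is not kernel code:

```python
def model_batch(online_cpus: int) -> int:
    """Assumed batch size: 2 * online CPUs, never below 32 (assumption)."""
    return max(32, 2 * online_cpus)


def slow_path_threshold(online_cpus: int) -> int:
    """Distance from the compared value below which the rough count is
    not trusted and a precise sum is needed: batch * online CPUs."""
    return model_batch(online_cpus) * online_cpus


if __name__ == "__main__":
    for cpus in (16, 32, 128):
        print(cpus, model_batch(cpus), slow_path_threshold(cpus))
```

Under these assumptions the threshold grows quadratically once 2 * num_online_cpus() exceeds 32, which is why the 128-cpu machine needs far more free inodes than the earlier estimate suggested.
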
>
>>>> FWIW, the profiles I took came from running this on 16 and 32p
>>>> machines:
>>>>
>>>> --
>>>> dirs=""
>>>> for i in `seq 1 $THREADS`; do
>>>>     dirs="$dirs -d /mnt/scratch/$i"
>>>> done
>>>>
>>>> cycles=$((512 / $THREADS))
>>>>
>>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>>> --
>>>>
>>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>>> image:
>>>>
>>>> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>>>>          =                       reflink=1
>>>> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>>>>          =                       sunit=0      swidth=0 blks
>>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>>
>>>> That's allocating enough inodes to keep the free inode counter
>>>> entirely out of the slow path...
>>>
>>> percpu_counter_read, which reads the count, will incur a cache synchronization
>>> cost if another cpu changes the count. Maybe it's better not to call
>>> percpu_counter_compare if possible.
>>
>> Depends. Sometimes we trade off ultimate single threaded
>> performance and efficiency for substantially better scalability.
>> i.e. if we lose 5% on single threaded performance but gain 10x on
>> concurrent workloads, then that is a good tradeoff to make.
>>
>
> Agreed. I mean that when delta > 0, there is no need to call percpu_counter_compare in
> xfs_mod_ifree/icount.
>
> Thanks,
> Shaokun
>
>> Cheers,
>>
>> Dave.
>>
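
The fast path, the slow path, and the proposed "only check when delta < 0" guard discussed in this thread can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: `ModelCounter`, `mod_ifree`, and the `guard` flag are hypothetical names, locking is omitted, and the model only counts how often a precise sum would be needed.

```python
class ModelCounter:
    """Toy model of a per-cpu counter: a rough global count plus per-cpu deltas."""

    def __init__(self, online_cpus: int, batch: int):
        self.count = 0                    # rough global count
        self.percpu = [0] * online_cpus   # per-cpu deltas not yet folded in
        self.batch = batch
        self.sums = 0                     # number of precise (slow path) sums

    def add(self, cpu: int, amount: int) -> None:
        self.percpu[cpu] += amount
        # Fold the local delta into the global count once it reaches the batch.
        if abs(self.percpu[cpu]) >= self.batch:
            self.count += self.percpu[cpu]
            self.percpu[cpu] = 0

    def precise_sum(self) -> int:
        self.sums += 1                    # stands in for the lock + walk over all cpus
        return self.count + sum(self.percpu)

    def compare(self, rhs: int) -> int:
        rough = self.count
        # Fast path: the rough count is far enough from rhs to be trusted.
        if abs(rough - rhs) > self.batch * len(self.percpu):
            return 1 if rough > rhs else -1
        # Slow path: a precise sum is needed.
        total = self.precise_sum()
        return (total > rhs) - (total < rhs)


def mod_ifree(counter: ModelCounter, cpu: int, delta: int, guard: bool) -> bool:
    """Apply delta; reject it if the counter would go negative.

    With guard=True the underflow check is skipped for delta >= 0,
    which is the idea proposed in the patch under discussion.
    """
    counter.add(cpu, delta)
    if guard and delta >= 0:
        return True
    if counter.compare(0) < 0:
        counter.add(cpu, -delta)          # undo the change
        return False
    return True
```

With 128 cpus and a batch of 256, 1000 single increments on one cpu never move the rough count past the 32768 threshold, so the unguarded variant in this model performs a precise sum on every call, while the guarded variant performs none until a decrement arrives.
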