From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlastimil Babka
To: vbabka@suse.cz
Cc: Catalin.Marinas@arm.com, akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
 bharata@linux.ibm.com, cl@linux.com, guro@fb.com, hannes@cmpxchg.org,
 iamjoonsoo.kim@lge.com, jannh@google.com, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, mhocko@kernel.org, rientjes@google.com, shakeelb@google.com,
 vincent.guittot@linaro.org, will@kernel.org, Mel Gorman, stable@vger.kernel.org
Subject: [PATCH] mm, slub: better heuristic for number of cpus when calculating slab order
Date: Mon, 8 Feb 2021 14:41:08 +0100
Message-Id: <20210208134108.22286-1-vbabka@suse.cz>
X-Mailer: git-send-email 2.30.0
MIME-Version: 1.0

When creating a new kmem cache, SLUB determines how large the slab pages will
be based on a number of inputs, including the number of CPUs in the system.
Larger slab pages mean that more objects can be allocated/freed from per-cpu
slabs before accessing shared structures, but also potentially more memory can
be wasted due to low slab usage and fragmentation. The rough idea of using the
number of CPUs is that larger systems will be more likely to benefit from
reduced contention, and also should have enough memory to spare.

The number of CPUs used to be determined as nr_cpu_ids, which is the number of
possible cpus, but on some systems many will never be onlined, thus commit
045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page
order") changed it to num_online_cpus(). However, for kmem caches created
early, before CPUs are onlined, this may lead to permanently low slab page
sizes.

Vincent reports a regression [1] of hackbench on arm64 systems:

> I'm facing significant performances regression on a large arm64 server
> system (224 CPUs). Regressions is also present on small arm64 system
> (8 CPUs) but in a far smaller order of magnitude
>
> On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
> v5.11-rc4 : 9.135sec (+/- 0.45%)
> v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
> v5.10: 3.136sec (+/- 0.40%)

Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
page allocator contention:

> i.e. the patch incurs a 7% to 32% performance penalty. This bisected
> cleanly yesterday when I was looking for the regression and then found
> the thread.
>
> Numerous caches change size. For example, kmalloc-512 goes from order-0
> (vanilla) to order-2 with the revert.
>
> So mostly this is down to the number of times SLUB calls into the page
> allocator which only caches order-0 pages on a per-cpu basis.

Clearly num_online_cpus() doesn't work too early in bootup. We could change
the order dynamically in a memory hotplug callback, but runtime order changing
for existing kmem caches has already been shown to be dangerous, and was
removed in 32a6f409b693 ("mm, slub: remove runtime allocation order changes").
It could be resurrected in a safe manner with some effort, but to fix the
regression we need something simpler.

We could use num_present_cpus(), which should be the number of physically
present CPUs even before they are onlined. That would work for PowerPC [3],
which triggered the original commit, but that still doesn't work on arm64 [4],
as explained in [5].

So this patch tries to determine the best available value without specific
arch knowledge:

- num_present_cpus() if the number is larger than 1, as that means the arch
  is likely setting it properly
- nr_cpu_ids otherwise

This should fix the reported regressions while also keeping the effect of
045ab8c9487b for PowerPC systems. It's possible there are configurations where
num_present_cpus() is 1 during boot while nr_cpu_ids is at the same time
bloated, so these (if they exist) would keep the large orders based on
nr_cpu_ids, as was the case before 045ab8c9487b.
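
Not part of the patch, but to make the effect of the heuristic concrete, below
is a minimal userspace sketch of the min_objects starting point picked by
calculate_order(). fls_approx() is only a stand-in for the kernel's fls(), and
the 256 possible CPUs is an arbitrary example value; the 224-CPU figure comes
from the arm64 report in [1]:

#include <stdio.h>

/* Rough stand-in for the kernel's fls(): position of the highest set bit, 1-based. */
static int fls_approx(unsigned int x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

/* Mimics the proposed heuristic for the min_objects starting point. */
static unsigned int min_objects_for(unsigned int present, unsigned int possible)
{
	unsigned int nr_cpus = present;

	/* Don't trust a present count of 1; fall back to the possible count. */
	if (nr_cpus <= 1)
		nr_cpus = possible;

	return 4 * (fls_approx(nr_cpus) + 1);
}

int main(void)
{
	/* 224 CPUs as in the arm64 report [1]; 256 possible is an example value. */
	printf("present=224          -> min_objects=%u\n", min_objects_for(224, 256));
	printf("present=1 (early)    -> min_objects=%u\n", min_objects_for(1, 256));
	printf("online=1  (old code) -> min_objects=%u\n", 4 * (fls_approx(1) + 1));
	return 0;
}

With 224 present CPUs the starting point is 36 objects; when the arch reports
only 1 present CPU at early boot, falling back to nr_cpu_ids (256 in this
example) gives 40, instead of the 8 that num_online_cpus() currently yields
and that leads to the permanently small orders described above.
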
[1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
[3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
[5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/

Fixes: 045ab8c9487b ("mm/slub: let number of online CPUs determine the slub page order")
Reported-by: Vincent Guittot
Reported-by: Mel Gorman
Cc: stable@vger.kernel.org
Signed-off-by: Vlastimil Babka
---
OK, this is a 5.11 regression, so we should try to fix it by 5.12. I've also
Cc'd stable for that reason although it's not a crash fix.

We can still try later to replace this with a safe order update in hotplug
callbacks, but that's infeasible for 5.12.

 mm/slub.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 176b1cb0d006..8fc9190e6cb3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3454,6 +3454,7 @@ static inline int calculate_order(unsigned int size)
 	unsigned int order;
 	unsigned int min_objects;
 	unsigned int max_objects;
+	unsigned int nr_cpus;
 
 	/*
 	 * Attempt to find best configuration for a slab. This
@@ -3464,8 +3465,21 @@ static inline int calculate_order(unsigned int size)
 	 * we reduce the minimum objects required in a slab.
 	 */
 	min_objects = slub_min_objects;
-	if (!min_objects)
-		min_objects = 4 * (fls(num_online_cpus()) + 1);
+	if (!min_objects) {
+		/*
+		 * Some architectures will only update present cpus when
+		 * onlining them, so don't trust the number if it's just 1. But
+		 * we also don't want to use nr_cpu_ids always, as on some other
+		 * architectures, there can be many possible cpus, but never
+		 * onlined. Here we compromise between trying to avoid too high
+		 * order on systems that appear larger than they are, and too
+		 * low order on systems that appear smaller than they are.
+		 */
+		nr_cpus = num_present_cpus();
+		if (nr_cpus <= 1)
+			nr_cpus = nr_cpu_ids;
+		min_objects = 4 * (fls(nr_cpus) + 1);
+	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
 
-- 
2.30.0