To: Xunlei Pang, Christoph Lameter, Pekka Enberg, Roman Gushchin, Konstantin Khlebnikov, David Rientjes, Matthew Wilcox, Shu Ming, Andrew Morton
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Wen Yang, James Wang
References: <1615967692-80524-1-git-send-email-xlpang@linux.alibaba.com> <1615967692-80524-2-git-send-email-xlpang@linux.alibaba.com>
From: Vlastimil Babka
Subject: Re: [PATCH v4 1/3] mm/slub: Introduce two counters for partial objects
Message-ID: <322e2b18-e529-3004-c19a-8c4a3b97c532@suse.cz>
Date: Wed, 17 Mar 2021 19:45:50 +0100
In-Reply-To: <1615967692-80524-2-git-send-email-xlpang@linux.alibaba.com>

On 3/17/21 8:54 AM, Xunlei Pang wrote:
> The node list_lock in count_partial() spends long time iterating
> in case of large amount of partial page lists, which can cause
> thunder herd effect to the list_lock contention.
>
> We have HSF RT(High-speed Service Framework Response-Time) monitors,
> the RT figures fluctuated randomly, then we deployed a tool detecting
> "irq off" and "preempt off" to dump the culprit's calltrace, capturing
> the list_lock cost nearly 100ms with irq off issued by "ss", this also
> caused network timeouts.
>
> This patch introduces two counters to maintain the actual number
> of partial objects dynamically instead of iterating the partial
> page lists with list_lock held.
>
> New counters of kmem_cache_node: partial_free_objs, partial_total_objs.
> The main operations are under list_lock in slow path, its performance
> impact is expected to be minimal except the __slab_free() path.
>
> The only concern of introducing partial counter is that partial_free_objs
> may cause cacheline contention and false sharing issues in case of same
> SLUB concurrent __slab_free(), so define it to be a percpu counter and
> places it carefully.

Hm I wonder, is it possible that this will eventually overflow/underflow the
counter on some CPU? (I guess practically only on 32bit). Maybe the operations
that are already done under n->list_lock should flush the percpu counter to a
shared counter?

...

> @@ -3039,6 +3066,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  		head, new.counters,
>  		"__slab_free"));
>
> +	if (!was_frozen && prior) {
> +		if (n)
> +			__update_partial_free(n, cnt);
> +		else
> +			__update_partial_free(get_node(s, page_to_nid(page)), cnt);
> +	}

I would guess this is the part that makes your measurements notice that
(although tiny) difference. We didn't need to obtain the node pointer before
and now we do. And that is really done just for the per-node breakdown in
"objects" and "objects_partial" files under /sys/kernel/slab - distinguishing
nodes is not needed for /proc/slabinfo. So that kinda justifies putting this
under a new CONFIG as you did.
Although perhaps somebody interested in these kinds of stats would enable
CONFIG_SLUB_STATS anyway, so that's still an option to use instead of
introducing a new oddly specific CONFIG? At least until somebody comes up and
presents a use case where they want the per-node breakdowns in /sys but cannot
afford CONFIG_SLUB_STATS.

But I'm also still thinking about simply counting all free objects (for the
purposes of accurate counts in /proc/slabinfo) as a percpu variable in
struct kmem_cache itself. That would basically put this_cpu_add() in all the
fast paths, but AFAICS thanks to the segment register it doesn't mean disabling
interrupts nor a LOCK operation, so maybe it wouldn't be that bad? And it
shouldn't need to deal with these node pointers. So maybe that would be
acceptable for CONFIG_SLUB_DEBUG? Guess I'll have to try...
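
To make the flush idea above a bit more concrete, here is a rough userspace
sketch, not kernel code and not what the patch does. It assumes a fixed
NR_CPUS, and the names struct partial_counter, pc_add_fast(),
pc_flush_locked() and pc_read() are all made up for illustration. C11 atomics
stand in for this_cpu_add(), so unlike the real thing the fast path here is
not free of atomic operations. The point is only that folding each per-CPU
delta into a shared total whenever the lock is already held keeps the deltas
small, while a reader still gets the right sum.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 4	/* hypothetical fixed CPU count for this sketch */

/* Hypothetical stand-in for the partial_free_objs bookkeeping. */
struct partial_counter {
	long shared;			/* flushed total; caller holds the lock */
	atomic_long delta[NR_CPUS];	/* per-CPU deltas, may go negative */
};

static void pc_init(struct partial_counter *pc)
{
	pc->shared = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		atomic_init(&pc->delta[cpu], 0);
}

/* Lockless path (think __slab_free()): touch only this CPU's delta. */
static void pc_add_fast(struct partial_counter *pc, int cpu, long cnt)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	atomic_fetch_add_explicit(&pc->delta[cpu], cnt, memory_order_relaxed);
}

/*
 * Slow path: the caller is assumed to already hold the lock that stands
 * in for n->list_lock. Fold this CPU's delta into the shared total and
 * reset it, so the per-CPU value cannot drift towards overflow.
 */
static void pc_flush_locked(struct partial_counter *pc, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	pc->shared += atomic_exchange_explicit(&pc->delta[cpu], 0,
					       memory_order_relaxed);
}

/* Reader (think slabinfo): shared total plus all outstanding deltas. */
static long pc_read(struct partial_counter *pc)
{
	long sum = pc->shared;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += atomic_load_explicit(&pc->delta[cpu],
					    memory_order_relaxed);
	return sum;
}
```

A caller would use pc_add_fast() from the lockless free path, call
pc_flush_locked() from any path that already takes the lock, and call
pc_read() when the total is reported. Whether the extra work in the locked
paths is acceptable would of course have to be measured.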