From: "chenjun (AM)"
To: Vlastimil Babka, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com, penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, Hyeonggon Yoo <42.hyeyoo@gmail.com>
CC: "xuqiang (M)", "Wangkefeng (OS Kernel Lab)"
Subject: Re: [PATCH] mm/slub: Reduce memory consumption in extreme scenarios
Date: Fri, 17 Mar 2023 11:32:15 +0000
Message-ID: <344c7521d72e4107b451c19b329e9864@huawei.com>
References: <20230314123403.100158-1-chenjun102@huawei.com> <0cad1ff3-8339-a3eb-fc36-c8bda1392451@suse.cz>

On 2023/3/14 22:41, Vlastimil Babka wrote:
> On 3/14/23 13:34, Chen Jun wrote:
>> When kmalloc_node() is called without __GFP_THISNODE and the target node
>> lacks sufficient memory, SLUB allocates a folio from a different node
>> other than the requested node, instead of taking a partial slab from it.
>>
>> However, since the allocated folio does not belong to the requested
>> node, it is deactivated and added to the partial slab list of the node
>> it belongs to.
>>
>> This behavior can result in excessive memory usage when the requested
>> node has insufficient memory, as SLUB will repeatedly allocate folios
>> from other nodes without reusing the previously allocated ones.
>>
>> To prevent memory wastage,
>> when (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE):
>> 1) try to get a partial slab from target node with __GFP_THISNODE.
>> 2) if 1) failed, try to allocate a new slab from target node with
>>    __GFP_THISNODE.
>> 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint.
>>
>> when node == NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
>> remains unchanged.
>>
>> On qemu with 4 numa nodes and each numa has 1G memory. Write a test ko
>> to call kmalloc_node(196, GFP_KERNEL, 3) for (4 * 1024 + 4) * 1024 times.
>>
>> cat /proc/slabinfo shows:
>> kmalloc-256 4200530 13519712 256 32 2 : tunables..
>>
>> after this patch,
>> cat /proc/slabinfo shows:
>> kmalloc-256 4200558 4200768 256 32 2 : tunables..
>>
>> Signed-off-by: Chen Jun
>> ---
>>  mm/slub.c | 22 +++++++++++++++++++---
>>  1 file changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 39327e98fce3..32e436957e03 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>>  	searchnode = numa_mem_id();
>>  
>>  	object = get_partial_node(s, get_node(s, searchnode), pc);
>> -	if (object || node != NUMA_NO_NODE)
>> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>>  		return object;
>>  
>>  	return get_any_partial(s, pc);
>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	struct slab *slab;
>>  	unsigned long flags;
>>  	struct partial_context pc;
>> +	bool try_thisnode = true;
>>  
>>  	stat(s, ALLOC_SLOWPATH);
>>  
>> @@ -3181,8 +3182,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	}
>>  
>>  new_objects:
>> -
>>  	pc.flags = gfpflags;
>> +
>> +	/*
>> +	 * when (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE)
>> +	 * 1) try to get a partial slab from target node with __GFP_THISNODE.
>> +	 * 2) if 1) failed, try to allocate a new slab from target node with
>> +	 *    __GFP_THISNODE.
>> +	 * 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint.
>> +	 */
>> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>> +		pc.flags |= __GFP_THISNODE;
>
> Hmm I'm thinking we should also perhaps remove direct reclaim possibilities
> from the attempt 2). In your qemu test it should make no difference, as it
> fills everything with kernel memory that is not reclaimable. But in practice
> the target node might be filled with user memory, and I think it's better to
> quickly allocate on a different node than spend time in direct reclaim. So
> the following should work I think?
>
> pc.flags = GFP_NOWAIT | __GFP_NOWARN |__GFP_THISNODE
>

Hmm, should it be:

pc.flags |= GFP_NOWAIT | __GFP_NOWARN |__GFP_THISNODE
         ^
>> +
>>  	pc.slab = &slab;
>>  	pc.orig_size = orig_size;
>>  	freelist = get_partial(s, node, &pc);
>> @@ -3190,10 +3201,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  		goto check_new_slab;
>>  
>>  	slub_put_cpu_ptr(s->cpu_slab);
>> -	slab = new_slab(s, gfpflags, node);
>> +	slab = new_slab(s, pc.flags, node);
>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>  
>>  	if (unlikely(!slab)) {
>> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>> +			try_thisnode = false;
>> +			goto new_objects;
>> +		}
>> +
>>  		slab_out_of_memory(s, gfpflags, node);
>>  		return NULL;
>>  	}
>
>