From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=kLjY=4I=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-5.3 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY,
	USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id F420EC11D09
	for <linux-mm@archiver.kernel.org>; Thu, 20 Feb 2020 13:54:01 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id AED17208C4
	for <linux-mm@archiver.kernel.org>; Thu, 20 Feb 2020 13:54:01 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org AED17208C4
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 32BD96B0005; Thu, 20 Feb 2020 08:54:01 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 2DC1D6B0006; Thu, 20 Feb 2020 08:54:01 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 1CB356B0007; Thu, 20 Feb 2020 08:54:01 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0186.hostedemail.com [216.40.44.186])
	by kanga.kvack.org (Postfix) with ESMTP id 009E56B0005
	for <linux-mm@kvack.org>; Thu, 20 Feb 2020 08:54:00 -0500 (EST)
Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 944B3441E
	for <linux-mm@kvack.org>; Thu, 20 Feb 2020 13:54:00 +0000 (UTC)
X-FDA: 76510649040.10.shoes48_7a20e37bbe920
X-HE-Tag: shoes48_7a20e37bbe920
X-Filterd-Recvd-Size: 7314
Received: from out30-44.freemail.mail.aliyun.com (out30-44.freemail.mail.aliyun.com [115.124.30.44])
	by imf28.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Thu, 20 Feb 2020 13:53:58 +0000 (UTC)
X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=wenyang@linux.alibaba.com;NM=1;PH=DS;RN=9;SR=0;TI=SMTPD_---0TqSK9Y._1582206806;
Received: from IT-C02W23QPG8WN.local(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0TqSK9Y._1582206806)
          by smtp.aliyun-inc.com(127.0.0.1);
          Thu, 20 Feb 2020 21:53:48 +0800
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
To: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
 Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>,
 David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>,
 Xunlei Pang <xlpang@linux.alibaba.com>, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20200201031502.92218-1-wenyang@linux.alibaba.com>
 <20200212145247.bf89431272038de53dd9d975@linux-foundation.org>
 <b42f7daa-4aea-1cf8-5bbb-2cd5d48b4e9a@linux.alibaba.com>
 <20200218205312.GA3156@carbon>
From: Wen Yang <wenyang@linux.alibaba.com>
Message-ID: <cb36f3e5-c01c-a99d-9230-af52f806e227@linux.alibaba.com>
Date: Thu, 20 Feb 2020 21:53:26 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0)
 Gecko/20100101 Thunderbird/68.1.1
MIME-Version: 1.0
In-Reply-To: <20200218205312.GA3156@carbon>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


On 2020/2/19 4:53 =E4=B8=8A=E5=8D=88, Roman Gushchin wrote:
> On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
>>
>>
>> On 2020/2/13 6:52 =E4=B8=8A=E5=8D=88, Andrew Morton wrote:
>>> On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.co=
m> wrote:
>>>
>>>> The lock, protecting the node partial list, is taken when couting th=
e free
>>>> objects resident in that list. It introduces locking contention when=
 the
>>>> page(s) is moved between CPU and node partial lists in allocation pa=
th
>>>> on another CPU. So reading "/proc/slabinfo" can possibily block the =
slab
>>>> allocation on another CPU for a while, 200ms in extreme cases. If th=
e
>>>> slab object is to carry network packet, targeting the far-end disk a=
rray,
>>>> it causes block IO jitter issue.
>>>>
>>>> This fixes the block IO jitter issue by caching the total inuse obje=
cts in
>>>> the node in advance. The value is retrieved without taking the node =
partial
>>>> list lock on reading "/proc/slabinfo".
>>>>
>>>> ...
>>>>
>>>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, st=
ruct page *page)
>>>>    static void discard_slab(struct kmem_cache *s, struct page *page)
>>>>    {
>>>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>>>> +	int inuse =3D page->objects;
>>>> +
>>>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>>>
>>> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
>>>
>>> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_nod=
e
>>> and save a function argument.
>>>
>>> If yes then why?
>>>
>>
>> Thanks for your comments.
>> We are happy to improve this patch based on your suggestions.
>>
>>
>> When the user reads /proc/slabinfo, in order to obtain the active_objs
>> information, the kernel traverses all slabs and executes the following=
 code
>> snippet:
>> static unsigned long count_partial(struct kmem_cache_node *n,
>>                                          int (*get_count)(struct page =
*))
>> {
>>          unsigned long flags;
>>          unsigned long x =3D 0;
>>          struct page *page;
>>
>>          spin_lock_irqsave(&n->list_lock, flags);
>>          list_for_each_entry(page, &n->partial, slab_list)
>>                  x +=3D get_count(page);
>>          spin_unlock_irqrestore(&n->list_lock, flags);
>>          return x;
>> }
>>
>> It may cause performance issues.
>>
>> Christoph suggested "you could cache the value in the userspace applic=
ation?
>> Why is this value read continually?", But reading the /proc/slabinfo i=
s
>> initiated by the user program. As a cloud provider, we cannot control =
user
>> behavior. If a user program inadvertently executes cat /proc/slabinfo,=
 it
>> may affect other user programs.
>>
>> As Christoph said: "The count is not needed for any operations. Just f=
or the
>> slabinfo output. The value has no operational value for the allocator
>> itself. So why use extra logic to track it in potentially performance
>> critical paths?"
>>
>> In this way, could we show the approximate value of active_objs in the
>> /proc/slabinfo?
>>
>> Based on the following information:
>> In the discard_slab() function, page->inuse is equal to page->total_ob=
jects;
>> In the allocate_slab() function, page->inuse is also equal to
>> page->total_objects (with one exception: for kmem_cache_node, page-> i=
nuse
>> equals 1);
>> page->inuse will only change continuously when the obj is constantly
>> allocated or released. (This should be the performance critical path
>> emphasized by Christoph)
>>
>> When users query the global slabinfo information, we may use total_obj=
ects
>> to approximate active_objs.
>=20
> Well, from one point of view, it makes no sense, because the ratio betw=
een
> these two numbers is very meaningful: it's the slab utilization rate.
>=20
> On the other side, with enabled per-cpu partial lists active_objs has
> nothing to do with the reality anyway, so I agree with you, calling
> count_partial() is almost useless.
>=20
> That said, I wonder if the right thing to do is something like the patc=
h below?
>=20
> Thanks!
>=20
> Roman
>=20
> --
>=20
> diff --git a/mm/slub.c b/mm/slub.c
> index 1d644143f93e..ba0505e75ecc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct=
 kmem_cache_node *n)
>   static unsigned long count_partial(struct kmem_cache_node *n,
>                                          int (*get_count)(struct page *=
))
>   {
> -       unsigned long flags;
>          unsigned long x =3D 0;
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
> +       unsigned long flags;
>          struct page *page;
>  =20
>          spin_lock_irqsave(&n->list_lock, flags);
>          list_for_each_entry(page, &n->partial, slab_list)
>                  x +=3D get_count(page);
>          spin_unlock_irqrestore(&n->list_lock, flags);
> +#endif
>          return x;
>   }
>   #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
>=20

Hi Roman,

Thanks for your comments.

In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and=20
can improve the performance of the cloud server, as follows:

   default y
   depends on SLUB && SMP
   bool "SLUB per cpu partial cache"
   help
     Per cpu partial caches accelerate objects allocation and freeing
     that is local to a processor at the price of more indeterminism
     in the latency of the free. On overflow these caches will be cleared
     which requires the taking of locks that may cause latency spikes.
     Typically one would choose no for a realtime system.


Best wishes,
Wen