From: Wen Yang <wenyang@linux.alibaba.com>
To: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Xunlei Pang <xlpang@linux.alibaba.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
Date: Sat, 22 Feb 2020 14:55:39 +0800 [thread overview]
Message-ID: <98c84f23-0636-a877-a96d-d6e58d540aa4@linux.alibaba.com> (raw)
In-Reply-To: <20200220154036.GA191388@carbon.dhcp.thefacebook.com>
On 2020/2/20 11:40 下午, Roman Gushchin wrote:
> On Thu, Feb 20, 2020 at 09:53:26PM +0800, Wen Yang wrote:
>>
>>
>> On 2020/2/19 4:53 上午, Roman Gushchin wrote:
>>> On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
>>>>
>>>>
>>>> On 2020/2/13 6:52 上午, Andrew Morton wrote:
>>>>> On Sat, 1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
>>>>>
>>>>>> The lock, protecting the node partial list, is taken when couting the free
>>>>>> objects resident in that list. It introduces locking contention when the
>>>>>> page(s) is moved between CPU and node partial lists in allocation path
>>>>>> on another CPU. So reading "/proc/slabinfo" can possibily block the slab
>>>>>> allocation on another CPU for a while, 200ms in extreme cases. If the
>>>>>> slab object is to carry network packet, targeting the far-end disk array,
>>>>>> it causes block IO jitter issue.
>>>>>>
>>>>>> This fixes the block IO jitter issue by caching the total inuse objects in
>>>>>> the node in advance. The value is retrieved without taking the node partial
>>>>>> list lock on reading "/proc/slabinfo".
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>>>>> static void discard_slab(struct kmem_cache *s, struct page *page)
>>>>>> {
>>>>>> - dec_slabs_node(s, page_to_nid(page), page->objects);
>>>>>> + int inuse = page->objects;
>>>>>> +
>>>>>> + dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>>>>>
>>>>> Is this right? dec_slabs_node(..., page->objects, page->objects)?
>>>>>
>>>>> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
>>>>> and save a function argument.
>>>>>
>>>>> If yes then why?
>>>>>
>>>>
>>>> Thanks for your comments.
>>>> We are happy to improve this patch based on your suggestions.
>>>>
>>>>
>>>> When the user reads /proc/slabinfo, in order to obtain the active_objs
>>>> information, the kernel traverses all slabs and executes the following code
>>>> snippet:
>>>> static unsigned long count_partial(struct kmem_cache_node *n,
>>>> int (*get_count)(struct page *))
>>>> {
>>>> unsigned long flags;
>>>> unsigned long x = 0;
>>>> struct page *page;
>>>>
>>>> spin_lock_irqsave(&n->list_lock, flags);
>>>> list_for_each_entry(page, &n->partial, slab_list)
>>>> x += get_count(page);
>>>> spin_unlock_irqrestore(&n->list_lock, flags);
>>>> return x;
>>>> }
>>>>
>>>> It may cause performance issues.
>>>>
>>>> Christoph suggested "you could cache the value in the userspace application?
>>>> Why is this value read continually?", But reading the /proc/slabinfo is
>>>> initiated by the user program. As a cloud provider, we cannot control user
>>>> behavior. If a user program inadvertently executes cat /proc/slabinfo, it
>>>> may affect other user programs.
>>>>
>>>> As Christoph said: "The count is not needed for any operations. Just for the
>>>> slabinfo output. The value has no operational value for the allocator
>>>> itself. So why use extra logic to track it in potentially performance
>>>> critical paths?"
>>>>
>>>> In this way, could we show the approximate value of active_objs in the
>>>> /proc/slabinfo?
>>>>
>>>> Based on the following information:
>>>> In the discard_slab() function, page->inuse is equal to page->total_objects;
>>>> In the allocate_slab() function, page->inuse is also equal to
>>>> page->total_objects (with one exception: for kmem_cache_node, page-> inuse
>>>> equals 1);
>>>> page->inuse will only change continuously when the obj is constantly
>>>> allocated or released. (This should be the performance critical path
>>>> emphasized by Christoph)
>>>>
>>>> When users query the global slabinfo information, we may use total_objects
>>>> to approximate active_objs.
>>>
>>> Well, from one point of view, it makes no sense, because the ratio between
>>> these two numbers is very meaningful: it's the slab utilization rate.
>>>
>>> On the other side, with enabled per-cpu partial lists active_objs has
>>> nothing to do with the reality anyway, so I agree with you, calling
>>> count_partial() is almost useless.
>>>
>>> That said, I wonder if the right thing to do is something like the patch below?
>>>
>>> Thanks!
>>>
>>> Roman
>>>
>>> --
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 1d644143f93e..ba0505e75ecc 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
>>> static unsigned long count_partial(struct kmem_cache_node *n,
>>> int (*get_count)(struct page *))
>>> {
>>> - unsigned long flags;
>>> unsigned long x = 0;
>>> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>>> + unsigned long flags;
>>> struct page *page;
>>> spin_lock_irqsave(&n->list_lock, flags);
>>> list_for_each_entry(page, &n->partial, slab_list)
>>> x += get_count(page);
>>> spin_unlock_irqrestore(&n->list_lock, flags);
>>> +#endif
>>> return x;
>>> }
>>> #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
>>>
>>
>> Hi Roman,
>>
>> Thanks for your comments.
>>
>> In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and can
>> improve the performance of the cloud server, as follows:
>
> Hello, Wen!
>
> That's exactly my point: if CONFIG_SLUB_CPU_PARTIAL is on, count_partial() is useless
> anyway because the returned number is far from the reality. So if we define
> active_objects == total_objects, as you basically suggest, we do not introduce any
> regression. Actually I think it's even preferable to show the unrealistic uniform 100%
> slab utilization rather than some very high but incorrect value.
>
> And on real-time systems uncontrolled readings of /proc/slabinfo is less
> of a concern, I hope.
>
> Thank you!
>
Great!
We only need to correct a typo to achieve this goal, as follows:
Change #ifdef CONFIG_SLUB_CPU_PARTIAL to #ifndef CONFIG_SLUB_CPU_PARTIAL
We will continue testing and send the modified patch soon.
Thank you very much.
next prev parent reply other threads:[~2020-02-22 6:55 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-02-01 3:15 [PATCH] mm/slub: Detach node lock from counting free objects Wen Yang
2020-02-08 3:03 ` Wen Yang
2020-02-08 21:41 ` Christopher Lameter
2020-02-12 22:56 ` Andrew Morton
2020-02-14 2:16 ` Christopher Lameter
2020-02-12 22:52 ` Andrew Morton
2020-02-16 4:15 ` Wen Yang
2020-02-18 20:53 ` Roman Gushchin
2020-02-20 13:53 ` Wen Yang
2020-02-20 15:40 ` Roman Gushchin
2020-02-22 6:55 ` Wen Yang [this message]
2020-02-24 17:01 ` Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=98c84f23-0636-a877-a96d-d6e58d540aa4@linux.alibaba.com \
--to=wenyang@linux.alibaba.com \
--cc=akpm@linux-foundation.org \
--cc=cl@linux.com \
--cc=guro@fb.com \
--cc=iamjoonsoo.kim@lge.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=penberg@kernel.org \
--cc=rientjes@google.com \
--cc=xlpang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).