From: Wen Yang <wenyang@linux.alibaba.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Xunlei Pang <xlpang@linux.alibaba.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
Date: Sun, 16 Feb 2020 12:15:54 +0800	[thread overview]
Message-ID: <b42f7daa-4aea-1cf8-5bbb-2cd5d48b4e9a@linux.alibaba.com> (raw)
In-Reply-To: <20200212145247.bf89431272038de53dd9d975@linux-foundation.org>



On 2020/2/13 6:52 AM, Andrew Morton wrote:
> On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> 
>> The lock protecting the node partial list is taken when counting the free
>> objects resident in that list. This introduces lock contention when page(s)
>> are moved between the CPU and node partial lists in the allocation path
>> on another CPU. So reading "/proc/slabinfo" can possibly block slab
>> allocation on another CPU for a while, 200ms in extreme cases. If the
>> slab object is to carry a network packet, targeting a far-end disk array,
>> this causes a block IO jitter issue.
>>
>> This fixes the block IO jitter issue by caching the total inuse objects in
>> the node in advance. The value is then retrieved without taking the node
>> partial list lock when reading "/proc/slabinfo".
>>
>> ...
>>
>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>   
>>   static void discard_slab(struct kmem_cache *s, struct page *page)
>>   {
>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>> +	int inuse = page->objects;
>> +
>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> 
> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> 
> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> and save a function argument.
> 
> If yes then why?
> 

Thanks for your comments.
We are happy to improve this patch based on your suggestions.


When a user reads /proc/slabinfo, the kernel has to compute the active_objs 
value; to do so it traverses the partial slabs of every node and executes 
the following code snippet:
static unsigned long count_partial(struct kmem_cache_node *n,
                                         int (*get_count)(struct page *))
{
         unsigned long flags;
         unsigned long x = 0;
         struct page *page;

         /* n->list_lock is held across the walk of the whole partial list */
         spin_lock_irqsave(&n->list_lock, flags);
         list_for_each_entry(page, &n->partial, slab_list)
                 x += get_count(page);
         spin_unlock_irqrestore(&n->list_lock, flags);
         return x;
}

Holding n->list_lock for the whole walk may cause performance issues: the 
allocation path on another CPU takes the same lock when moving page(s) 
between the CPU and node partial lists, so a long traversal can block slab 
allocation there, 200ms in extreme cases.
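
For context, this is where the allocation slow path contends on the same 
lock; an elided fragment of get_partial_node() from mm/slub.c of roughly 
that era (not part of the patch, shortened to the locking structure):

static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
                              struct kmem_cache_cpu *c, gfp_t flags)
{
        struct page *page, *page2;
        void *object = NULL;
        ...
        spin_lock(&n->list_lock);
        list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
                /* move page(s) from the node partial list to this CPU */
                ...
        }
        spin_unlock(&n->list_lock);
        return object;
}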

Christoph suggested: "you could cache the value in the userspace 
application? Why is this value read continually?" But reads of 
/proc/slabinfo are initiated by user programs. As a cloud provider, we 
cannot control user behavior; if a user program inadvertently runs 
cat /proc/slabinfo, it may affect other user programs.

As Christoph said: "The count is not needed for any operations. Just for 
the slabinfo output. The value has no operational value for the 
allocator itself. So why use extra logic to track it in potentially 
performance critical paths?"

Given that, could we show an approximate value of active_objs in 
/proc/slabinfo instead?

The approximation rests on the following observations:

- In the discard_slab() function, page->inuse is equal to page->objects;
- In the allocate_slab() function, page->inuse is also set equal to 
  page->objects (with one exception: for kmem_cache_node itself, 
  page->inuse equals 1);
- page->inuse only changes while objects are actually being allocated or 
  freed, and that is the performance-critical path Christoph emphasized.
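
For reference, here is the fragment of allocate_slab() in mm/slub.c of that 
era that sets this up (elided, not part of the patch): a newly allocated 
slab is handed to the CPU frozen, with all of its objects accounted as in 
use from the node's point of view. (The kmem_cache_node exception is the 
early boot path, early_kmem_cache_node_alloc(), which overrides 
page->inuse to 1.)

static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
        struct page *page;
        ...
        page->inuse = page->objects;
        page->frozen = 1;
        ...
        return page;
}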

When users query the global slabinfo information, we may use 
total_objects to approximate active_objs.

The modified patch would then be as follows:

diff --git a/mm/slub.c b/mm/slub.c
index a0b335d..ef0e6ac 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5900,17 +5900,15 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
  {
         unsigned long nr_slabs = 0;
         unsigned long nr_objs = 0;
-       unsigned long nr_free = 0;
         int node;
         struct kmem_cache_node *n;

         for_each_kmem_cache_node(s, node, n) {
                 nr_slabs += node_nr_slabs(n);
                 nr_objs += node_nr_objs(n);
-               nr_free += count_partial(n, count_free);
         }

-       sinfo->active_objs = nr_objs - nr_free;
+       sinfo->active_objs = nr_objs;
         sinfo->num_objs = nr_objs;
         sinfo->active_slabs = nr_slabs;
         sinfo->num_slabs = nr_slabs;
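
For reference, node_nr_objs() is already a lockless atomic read when 
CONFIG_SLUB_DEBUG is enabled, so with this change get_slabinfo() no longer 
touches n->list_lock at all:

static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
{
        return atomic_long_read(&n->total_objects);
}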


In addition, when a user really needs the precise number of active objects 
in a particular slab cache, they can query just that cache through an 
interface similar to the following, which avoids traversing all the slabs.

# cat /sys/kernel/slab/kmalloc-512/total_objects
1472 N0=1472
# cat /sys/kernel/slab/kmalloc-512/objects
1311 N0=1311

or
# cat /sys/kernel/slab/kmalloc-8k/total_objects
60 N0=60
# cat /sys/kernel/slab/kmalloc-8k/objects
60 N0=60
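
(In the kmalloc-512 example, total_objects - objects = 1472 - 1311 = 161, 
i.e. about 161 free objects; for kmalloc-8k the two values match, so the 
approximation is exact there.)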


Best wishes,
Wen

>>   	free_slab(s, page);
>>   }
>>   


