linux-mm.kvack.org archive mirror
From: Roman Gushchin <guro@fb.com>
To: Wen Yang <wenyang@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Xunlei Pang <xlpang@linux.alibaba.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
Date: Mon, 24 Feb 2020 09:01:09 -0800	[thread overview]
Message-ID: <20200224170109.GA484928@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <98c84f23-0636-a877-a96d-d6e58d540aa4@linux.alibaba.com>

On Sat, Feb 22, 2020 at 02:55:39PM +0800, Wen Yang wrote:
> 
> 
> > On 2020/2/20 11:40 PM, Roman Gushchin wrote:
> > On Thu, Feb 20, 2020 at 09:53:26PM +0800, Wen Yang wrote:
> > > 
> > > 
> > > > On 2020/2/19 4:53 AM, Roman Gushchin wrote:
> > > > On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
> > > > > 
> > > > > 
> > > > > > On 2020/2/13 6:52 AM, Andrew Morton wrote:
> > > > > > On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> > > > > > 
> > > > > > > The lock protecting the node partial list is taken when counting the free
> > > > > > > objects resident in that list. This introduces lock contention when pages
> > > > > > > are moved between CPU and node partial lists in the allocation path on
> > > > > > > another CPU, so reading "/proc/slabinfo" can possibly block slab
> > > > > > > allocation on another CPU for a while, 200ms in extreme cases. If the
> > > > > > > slab object carries a network packet targeting a far-end disk array,
> > > > > > > this causes block IO jitter.
> > > > > > > 
> > > > > > > This fixes the block IO jitter by caching the total number of inuse objects
> > > > > > > per node in advance, so the value can be retrieved without taking the node
> > > > > > > partial list lock when reading "/proc/slabinfo".
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
> > > > > > >     static void discard_slab(struct kmem_cache *s, struct page *page)
> > > > > > >     {
> > > > > > > -	dec_slabs_node(s, page_to_nid(page), page->objects);
> > > > > > > +	int inuse = page->objects;
> > > > > > > +
> > > > > > > +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> > > > > > 
> > > > > > Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> > > > > > 
> > > > > > If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> > > > > > and save a function argument.
> > > > > > 
> > > > > > If yes then why?
> > > > > > 
> > > > > 
> > > > > Thanks for your comments.
> > > > > We are happy to improve this patch based on your suggestions.
> > > > > 
> > > > > 
> > > > > When the user reads /proc/slabinfo, in order to obtain the active_objs
> > > > > information, the kernel traverses all slabs and executes the following code
> > > > > snippet:
> > > > > static unsigned long count_partial(struct kmem_cache_node *n,
> > > > >                                           int (*get_count)(struct page *))
> > > > > {
> > > > >           unsigned long flags;
> > > > >           unsigned long x = 0;
> > > > >           struct page *page;
> > > > > 
> > > > >           spin_lock_irqsave(&n->list_lock, flags);
> > > > >           list_for_each_entry(page, &n->partial, slab_list)
> > > > >                   x += get_count(page);
> > > > >           spin_unlock_irqrestore(&n->list_lock, flags);
> > > > >           return x;
> > > > > }
> > > > > 
> > > > > It may cause performance issues.
> > > > > 
> > > > > Christoph suggested: "you could cache the value in the userspace application?
> > > > > Why is this value read continually?" But reading /proc/slabinfo is
> > > > > initiated by the user program. As a cloud provider, we cannot control user
> > > > > behavior. If a user program inadvertently executes cat /proc/slabinfo, it
> > > > > may affect other user programs.
> > > > > 
> > > > > As Christoph said: "The count is not needed for any operations. Just for the
> > > > > slabinfo output. The value has no operational value for the allocator
> > > > > itself. So why use extra logic to track it in potentially performance
> > > > > critical paths?"
> > > > > 
> > > > > Given this, could we show an approximate value of active_objs in
> > > > > /proc/slabinfo?
> > > > > 
> > > > > Based on the following information:
> > > > > In the discard_slab() function, page->inuse is equal to page->total_objects;
> > > > > in the allocate_slab() function, page->inuse is also equal to
> > > > > page->total_objects (with one exception: for kmem_cache_node, page->inuse
> > > > > equals 1);
> > > > > page->inuse changes continuously only while objects are being
> > > > > allocated or freed. (This should be the performance-critical path
> > > > > emphasized by Christoph.)
> > > > > 
> > > > > When users query the global slabinfo information, we may use total_objects
> > > > > to approximate active_objs.
> > > > 
> > > > Well, from one point of view, it makes no sense, because the ratio between
> > > > these two numbers is very meaningful: it's the slab utilization rate.
> > > > 
> > > > On the other side, with enabled per-cpu partial lists active_objs has
> > > > nothing to do with the reality anyway, so I agree with you, calling
> > > > count_partial() is almost useless.
> > > > 
> > > > That said, I wonder if the right thing to do is something like the patch below?
> > > > 
> > > > Thanks!
> > > > 
> > > > Roman
> > > > 
> > > > --
> > > > 
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > index 1d644143f93e..ba0505e75ecc 100644
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
> > > >    static unsigned long count_partial(struct kmem_cache_node *n,
> > > >                                           int (*get_count)(struct page *))
> > > >    {
> > > > -       unsigned long flags;
> > > >           unsigned long x = 0;
> > > > +#ifdef CONFIG_SLUB_CPU_PARTIAL
> > > > +       unsigned long flags;
> > > >           struct page *page;
> > > >           spin_lock_irqsave(&n->list_lock, flags);
> > > >           list_for_each_entry(page, &n->partial, slab_list)
> > > >                   x += get_count(page);
> > > >           spin_unlock_irqrestore(&n->list_lock, flags);
> > > > +#endif
> > > >           return x;
> > > >    }
> > > >    #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
> > > > 
> > > 
> > > Hi Roman,
> > > 
> > > Thanks for your comments.
> > > 
> > > In the server scenario, SLUB_CPU_PARTIAL is turned on by default and can
> > > improve the performance of cloud servers, as follows:
> > 
> > Hello, Wen!
> > 
> > That's exactly my point: if CONFIG_SLUB_CPU_PARTIAL is on, count_partial() is useless
> > anyway because the returned number is far from reality. So if we define
> > active_objects == total_objects, as you basically suggest, we do not introduce any
> > regression. Actually I think it's even preferable to show an unrealistic uniform 100%
> > slab utilization rather than some very high but incorrect value.
> > 
> > And on real-time systems uncontrolled readings of /proc/slabinfo are less
> > of a concern, I hope.
> > 
> > Thank you!
> > 
> 
> Great!
> We only need to correct a typo to achieve this goal, as follows:
> Change #ifdef CONFIG_SLUB_CPU_PARTIAL to #ifndef CONFIG_SLUB_CPU_PARTIAL

Yes, you're obviously right.

Thanks!



Thread overview: 12+ messages
2020-02-01  3:15 [PATCH] mm/slub: Detach node lock from counting free objects Wen Yang
2020-02-08  3:03 ` Wen Yang
2020-02-08 21:41   ` Christopher Lameter
2020-02-12 22:56     ` Andrew Morton
2020-02-14  2:16       ` Christopher Lameter
2020-02-12 22:52 ` Andrew Morton
2020-02-16  4:15   ` Wen Yang
2020-02-18 20:53     ` Roman Gushchin
2020-02-20 13:53       ` Wen Yang
2020-02-20 15:40         ` Roman Gushchin
2020-02-22  6:55           ` Wen Yang
2020-02-24 17:01             ` Roman Gushchin [this message]
