linux-mm.kvack.org archive mirror
From: Masayoshi Mizuma <msys.mizuma@gmail.com>
To: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: memcg: performance degradation since v5.9
Date: Fri, 9 Apr 2021 12:05:41 -0400
Message-ID: <20210409160541.4tfkeex7mcfrwras@gabell>
In-Reply-To: <YG9tW1h9VSJcir+Y@carbon.dhcp.thefacebook.com>

On Thu, Apr 08, 2021 at 01:53:47PM -0700, Roman Gushchin wrote:
> On Thu, Apr 08, 2021 at 03:39:48PM -0400, Masayoshi Mizuma wrote:
> > Hello,
> > 
> > I detected a performance degradation in a PostgreSQL benchmark [1],
> > and the issue seems to be related to object-level memory cgroup [2].
> > I would appreciate it if you could give me some ideas to solve it.
> > 
> > The benchmark reports transactions per second (tps), and the tps on v5.9
> > and later kernels is about 10%-20% lower than on v5.8.
> > 
> > The benchmark calls sendto() and recvfrom() repeatedly,
> > and these system calls take longer than on v5.8.
> > The result of perf trace of the benchmark is as follows:
> > 
> >   - v5.8
> > 
> >    syscall            calls  errors  total       min       avg       max       stddev
> >                                      (msec)    (msec)    (msec)    (msec)        (%)
> >    --------------- --------  ------ -------- --------- --------- ---------     ------
> >    sendto            699574      0  2595.220     0.001     0.004     0.462      0.03%
> >    recvfrom         1391089 694427  2163.458     0.001     0.002     0.442      0.04%
> > 
> >   - v5.9
> > 
> >    syscall            calls  errors  total       min       avg       max       stddev
> >                                      (msec)    (msec)    (msec)    (msec)        (%)
> >    --------------- --------  ------ -------- --------- --------- ---------     ------
> >    sendto            699187      0  3316.948     0.002     0.005     0.044      0.02%
> >    recvfrom         1397042 698828  2464.995     0.001     0.002     0.025      0.04%
> > 
> >   - v5.12-rc6
> > 
> >    syscall            calls  errors  total       min       avg       max       stddev
> >                                      (msec)    (msec)    (msec)    (msec)        (%)
> >    --------------- --------  ------ -------- --------- --------- ---------     ------
> >    sendto            699445      0  3015.642     0.002     0.004     0.027      0.02%
> >    recvfrom         1395929 697909  2338.783     0.001     0.002     0.024      0.03%
> > 
> > I bisected the kernel and found that the patch series that adds
> > object-level memory cgroup support causes the degradation.
> > 
> > I confirmed the delay with a kernel module that just runs
> > kmem_cache_alloc()/kmem_cache_free() as follows. The duration is about
> > 2-3 times longer than on v5.8.
> > 
> >    dummy_cache = KMEM_CACHE(dummy, SLAB_ACCOUNT);
> >    for (i = 0; i < 100000000; i++)
> >    {
> >            p = kmem_cache_alloc(dummy_cache, GFP_KERNEL);
> >            kmem_cache_free(dummy_cache, p);
> >    }
> > 
> > It seems that the object accounting work in slab_pre_alloc_hook() and
> > slab_post_alloc_hook() is the overhead.
> > 
> > The cgroup.memory=nokmem kernel parameter doesn't work for my case because
> > it disables all kmem accounting.
> > 
> > The degradation goes away when I apply a patch (at the bottom of this email)
> > that adds a kernel parameter to fall back to page-level accounting.
> > However, I'm not sure it's a good approach...
> 
> Hello Masayoshi!
> 
> Thank you for the report!

Hi!

> 
> It's not a secret that per-object accounting is more expensive than a per-page
> allocation. I had micro-benchmark results similar to yours: accounted
> allocations are about 2x slower. But in general it tends to not affect real
> workloads, because the cost of allocations is still low and tends to be only
> a small fraction of the whole cpu load. And because it brings significant
> benefits (40%+ slab memory savings, less fragmentation, a more stable working
> set, etc.), real workloads tend to perform on par or better.
> 
> So my first question is whether you see the regression in any real workload
> or only in the benchmark.

It's only the benchmark so far. I'll let you know if I see the issue with a
real workload.

> 
> Second, I'll try to take a look into the benchmark to figure out why it's
> affected so badly, but I'm not sure we can easily fix it. If you have any
> ideas what kind of objects the benchmark is allocating in big numbers,
> please let me know.

The benchmark calls sendto() and recvfrom() on a Unix domain socket
repeatedly, and kmem_cache_alloc_node()/kmem_cache_free() are called
to allocate and free the socket buffers.
The call graph for the allocation is as follows.

  do_syscall_64
    __x64_sys_sendto
      __sys_sendto
        sock_sendmsg
          unix_stream_sendmsg
            sock_alloc_send_pskb
              alloc_skb_with_frags
                __alloc_skb
                  kmem_cache_alloc_node

kmem_cache_alloc_node()/kmem_cache_free() are called about 1,400,000 times
during the benchmark. The object size is 216 bytes and the GFP flags are 0x400cc0:
 ___GFP_ACCOUNT | ___GFP_KSWAPD_RECLAIM | ___GFP_DIRECT_RECLAIM | ___GFP_FS | ___GFP_IO

I got the data with the following bpftrace script.

  # cat kmem.bt 
  #!/usr/bin/env bpftrace

  tracepoint:kmem:kmem_cache_alloc_node /comm == "pgbench"/
  {
	@alloc[comm, args->bytes_req, args->bytes_alloc, args->gfp_flags] = count();
  }

  tracepoint:kmem:kmem_cache_free /comm == "pgbench"/
  {
	@free[comm] = count();
  }
  # ./kmem.bt 
  Attaching 2 probes...
  ^C

  @alloc[pgbench, 11784, 11840, 3264]: 1
  @alloc[pgbench, 216, 256, 3264]: 23
  @alloc[pgbench, 216, 256, 4197568]: 1400046

  @free[pgbench]: 1400560

  # 

I hope this helps...

Thanks!
Masa

