linux-mm.kvack.org archive mirror
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	Dave Hansen <dave.hansen@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
Date: Wed, 4 Mar 2020 12:52:29 -0800	[thread overview]
Message-ID: <20d941d9-0411-049f-76e0-f9bb9e07f6af@linux.intel.com> (raw)
In-Reply-To: <20200214104541.GT31689@dhcp22.suse.cz>

On 2/14/20 2:45 AM, Michal Hocko wrote:
> On Wed 05-02-20 10:34:57, Tim Chen wrote:
>> Topic: Memory cgroups, whether you like it or not
>>
>> 1. Memory cgroup counters scalability
>>
>> Recently, benchmark teams at Intel were running some bare-metal
>> benchmarks.  To our great surprise, we saw lots of memcg activity in
>> the profiles.  When we asked the benchmark team, they did not even
>> realize they were using memory cgroups.  They were fond of running all
>> their benchmarks in containers that just happened to use memory cgroups
>> by default.  What were previously problems only for memory cgroup users
>> are quickly becoming a problem for everyone.
>>
>> There are memory cgroup counters that are read in page management
>> paths and scale poorly when read.  These counters are per-cpu and
>> need to be summed over all CPUs in lruvec_page_state_local() to get
>> the overall value for the memory cgroup.  This leads to scalability
>> problems on systems with large numbers of CPUs.  For example, we've
>> seen 14+% of kernel CPU time consumed in snapshot_refaults().  We have
>> also encountered a similar issue recently when computing the
>> lru_size[1].
>>
>> We'd like to do some brainstorming to see if there are ways to make
>> such accounting more scalable.  For example, not all users of such
>> counters need precise counts, and approximate counts that are updated
>> lazily could be used instead.
> 
> Please make sure to prepare numbers based on the current upstream kernel
> so that we have some grounds to base the discussion on. Ideally post
> them into the email.


Here's a profile from a 5.2-based kernel with some memory tiering
modifications.  It shows snapshot_refaults() consuming a big chunk of CPU
cycles to gather the refault stats stored in the root memcg lruvec's
WORKINGSET_ACTIVATE field.

We have to read #memcg x #ncpu local counters to get a complete refault
snapshot, so the computation scales poorly as the number of memcgs and
CPUs grows.
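
To make that cost concrete, here is a deliberately simplified userspace
model of what the read side has to do.  This is not the kernel code (the
real path goes through lruvec_page_state_local() and the memcg iteration
in snapshot_refaults()), and the sizes below are made up, but the nested
loop over memcgs and CPUs is the shape of the problem:

/*
 * Simplified userspace model of summing per-cpu memcg counters.
 * Not kernel code -- it only illustrates the #memcg x #ncpu read
 * cost that every snapshot has to pay.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS    48        /* 2 sockets x 24 cores, as in the test below */
#define NR_MEMCGS  1000      /* containers can easily create many memcgs */

/* One per-cpu counter per memcg, as with WORKINGSET_ACTIVATE. */
static long pcpu_count[NR_MEMCGS][NR_CPUS];

/* Read side: every snapshot walks every memcg and every CPU. */
static long snapshot_all(void)
{
        long total = 0;

        for (int cg = 0; cg < NR_MEMCGS; cg++)
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        total += pcpu_count[cg][cpu];  /* O(#memcg x #ncpu) */
        return total;
}

int main(void)
{
        /* Fake some per-cpu activity so the sum is non-trivial. */
        for (int cg = 0; cg < NR_MEMCGS; cg++)
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        pcpu_count[cg][cpu] = random() % 16;

        printf("snapshot = %ld\n", snapshot_all());
        return 0;
}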

We'll be recollecting some of the data based on the 5.5 kernel and will
post those numbers when they become available.

The percentages in the profile below are relative to kernel CPU cycles;
kernel time itself accounted for 31% of all CPU cycles.  The MySQL
workload ran on a 2-socket system with 24 cores per socket.


    14.22%  mysqld           [kernel.kallsyms]         [k] snapshot_refaults
            |
            ---snapshot_refaults
               do_try_to_free_pages
               try_to_free_pages
               __alloc_pages_slowpath
               __alloc_pages_nodemask
               |
               |--14.07%--alloc_pages_vma
               |          |
               |           --14.06%--__handle_mm_fault
               |                     handle_mm_fault
               |                     |
               |                     |--12.57%--__get_user_pages
               |                     |          get_user_pages_unlocked
               |                     |          get_user_pages_fast
               |                     |          iov_iter_get_pages
               |                     |          do_blockdev_direct_IO
               |                     |          ext4_direct_IO
               |                     |          generic_file_read_iter
               |                     |          |
               |                     |          |--12.16%--new_sync_read
               |                     |          |          vfs_read
               |                     |          |          ksys_pread64
               |                     |          |          do_syscall_64
               |                     |          |          entry_SYSCALL_64_after_hwframe
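
On the "approximate counts updated lazily" direction from the topic
proposal above: one possible shape, sketched here as a userspace model
with hypothetical names (not a patch, and similar in spirit to the
batching done by lib/percpu_counter.c), is to fold per-cpu deltas into a
shared total only once they exceed a batch threshold, so readers can use
the shared total directly instead of walking every CPU:

/*
 * Sketch of a batched, approximate per-cpu counter (userspace model,
 * hypothetical names).  Writers stay per-cpu and fold into the shared
 * total only when the local delta exceeds BATCH; readers take the
 * shared total as-is, with staleness bounded by NR_CPUS * BATCH.
 */
#include <stdio.h>

#define NR_CPUS  48
#define BATCH    64

struct approx_counter {
        long total;               /* shared, approximately up to date */
        long delta[NR_CPUS];      /* per-cpu pending contributions */
};

static void counter_add(struct approx_counter *c, int cpu, long val)
{
        c->delta[cpu] += val;
        if (c->delta[cpu] >= BATCH || c->delta[cpu] <= -BATCH) {
                c->total += c->delta[cpu];  /* would be atomic in-kernel */
                c->delta[cpu] = 0;
        }
}

/* Cheap read: no walk over all CPUs, bounded staleness instead. */
static long counter_read_approx(const struct approx_counter *c)
{
        return c->total;
}

int main(void)
{
        struct approx_counter c = { 0 };

        for (int i = 0; i < 10000; i++)
                counter_add(&c, i % NR_CPUS, 1);

        printf("approx = %ld (exact: 10000, max error: %d)\n",
               counter_read_approx(&c), NR_CPUS * BATCH);
        return 0;
}

Whether bounded error like this is acceptable would of course depend on
the consumer; a refault snapshot may tolerate it better than charge
accounting does.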
 
Thanks.

Tim




