linux-mm.kvack.org archive mirror
* [LSF/MM TOPIC] Memory cgroups, whether you like it or not
@ 2020-02-05 18:34 Tim Chen
  2020-02-14 10:45 ` [Lsf-pc] " Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Tim Chen @ 2020-02-05 18:34 UTC (permalink / raw)
  To: lsf-pc, linux-mm; +Cc: Dave Hansen, Dan Williams, Huang Ying

Topic: Memory cgroups, whether you like it or not

1. Memory cgroup counters scalability

Recently, benchmark teams at Intel were running some bare-metal
benchmarks.  To our great surprise, we saw lots of memcg activity in
the profiles.  When we asked the benchmark team, they did not even
realize they were using memory cgroups.  They were fond of running all
their benchmarks in containers that just happened to use memory cgroups
by default.  What were previously problems only for memory cgroup users
are quickly becoming a problem for everyone.

There are memory cgroup counters, read in page management paths, that
scale poorly when read.  These counters are per-CPU based and need to
be summed over all CPUs to get the overall value for the memory cgroup
in the lruvec_page_state_local() function.  This leads to scalability
problems on systems with large numbers of CPUs.  For example, we've
seen 14+% of kernel CPU time consumed in snapshot_refaults().  We have
also encountered a similar issue recently when computing the lru_size [1].
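
To illustrate where the cost comes from, here is a minimal sketch of
the per-CPU summation pattern in kernel-style C.  It is not the actual
upstream code; NR_STAT_ITEMS and the structure/function names are
simplified stand-ins for the real memcg/lruvec statistics:

#include <linux/cpumask.h>
#include <linux/percpu.h>

#define NR_STAT_ITEMS   32      /* stand-in for the real stat item enum */

/* Illustrative per-CPU stat storage, one instance per memory cgroup. */
struct memcg_stat_pcpu {
        long count[NR_STAT_ITEMS];
};

struct memcg_stat {
        struct memcg_stat_pcpu __percpu *pcpu;
};

/*
 * Reading one counter folds the per-CPU deltas over every possible
 * CPU, so a single read is O(nr_cpus), and a full refault snapshot
 * performs one such read per memory cgroup.
 */
static long memcg_read_stat(struct memcg_stat *stat, int idx)
{
        long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu_ptr(stat->pcpu, cpu)->count[idx];

        return sum;
}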

We'd like to do some brainstorming to see if there are ways to make
such accounting more scalable.  For example, not all users of such
counters need precise counts, and approximate counts that are updated
lazily could be used instead.

[1] https://lore.kernel.org/linux-mm/a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com/ 
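
One hedged sketch of the "approximate counts that are updated lazily"
idea above, in the spirit of the per-CPU threshold batching already
used for the node/zone vmstat counters (all names here are
illustrative, not existing kernel APIs): keep per-CPU deltas and fold
them into a shared total only once they exceed a batch threshold, so
reads become O(1) at the cost of bounded staleness.

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/percpu.h>

#define STAT_BATCH      64      /* max per-CPU error before flushing */

struct approx_counter {
        atomic_long_t   total;          /* cheap to read, slightly stale */
        long __percpu   *pending;       /* per-CPU unflushed delta */
};

static void approx_counter_add(struct approx_counter *c, long delta)
{
        long *p = get_cpu_ptr(c->pending);

        *p += delta;
        if (abs(*p) >= STAT_BATCH) {    /* lazy flush into the shared total */
                atomic_long_add(*p, &c->total);
                *p = 0;
        }
        put_cpu_ptr(c->pending);
}

static long approx_counter_read(struct approx_counter *c)
{
        /* Off by at most nr_cpu_ids * STAT_BATCH, but O(1) to read. */
        return atomic_long_read(&c->total);
}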

2. Tiered memory accounting and management

Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
than others, but a byte of media has about the same cost whether it
is close or far.  But, with new memory tiers such as High-Bandwidth
Memory or Persistent Memory, there is a choice between fast/expensive
and slow/cheap.  The current memory cgroups, however, still live in
the old model: there is only one set of limits, which implies that all
memory has the same cost.  We would like to extend memory cgroups to
comprehend different memory tiers and give users a way to choose a mix
between fast/expensive and slow/cheap.

We would like to propose that, for systems with multiple memory tiers,
we add per memory cgroup accounting of each cgroup's usage of top tier
memory.  Such top tier memory is a precious resource, where it makes
sense to impose soft limits.  We can then start to actively demote the
top tier memory used by cgroups that exceed their allowance when the
system experiences memory pressure in the top tier.
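
Purely as a sketch of the proposed accounting, and not something that
exists upstream (only struct page_counter and its charge/read helpers
are real kernel primitives; the rest is hypothetical):

#include <linux/page_counter.h>

/* Hypothetical per-memcg state for the proposal above. */
struct mem_cgroup_toptier {
        struct page_counter     usage;          /* top tier pages charged */
        unsigned long           soft_limit;     /* allowance before demotion */
};

/* Charge path: called when a page is charged on a top tier node. */
static void toptier_charge(struct mem_cgroup_toptier *tt, unsigned long nr_pages)
{
        page_counter_charge(&tt->usage, nr_pages);
}

/*
 * Under top tier memory pressure, background reclaim would pick the
 * cgroups most in excess of their allowance and demote their cold
 * pages to the slower tier.
 */
static bool toptier_over_soft_limit(struct mem_cgroup_toptier *tt)
{
        return page_counter_read(&tt->usage) > tt->soft_limit;
}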

There is existing infrastructure for per-cgroup memory soft limits that
we can leverage to implement such a scheme.  We'd like to find out
whether this approach makes sense to people working on systems with
multiple memory tiers.

* Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
  2020-02-05 18:34 [LSF/MM TOPIC] Memory cgroups, whether you like it or not Tim Chen
@ 2020-02-14 10:45 ` Michal Hocko
  2020-02-20 16:06   ` Kirill A. Shutemov
  2020-03-04 20:52   ` Tim Chen
  0 siblings, 2 replies; 7+ messages in thread
From: Michal Hocko @ 2020-02-14 10:45 UTC (permalink / raw)
  To: Tim Chen; +Cc: lsf-pc, linux-mm, Dave Hansen, Dan Williams, Huang Ying

On Wed 05-02-20 10:34:57, Tim Chen wrote:
> Topic: Memory cgroups, whether you like it or not
> 
> 1. Memory cgroup counters scalability
> 
> Recently, benchmark teams at Intel were running some bare-metal
> benchmarks.  To our great surprise, we saw lots of memcg activity in
> the profiles.  When we asked the benchmark team, they did not even
> realize they were using memory cgroups.  They were fond of running all
> their benchmarks in containers that just happened to use memory cgroups
> by default.  What were previously problems only for memory cgroup users
> are quickly becoming a problem for everyone.
> 
> There are memory cgroup counters, read in page management paths, that
> scale poorly when read.  These counters are per-CPU based and need to
> be summed over all CPUs to get the overall value for the memory cgroup
> in the lruvec_page_state_local() function.  This leads to scalability
> problems on systems with large numbers of CPUs.  For example, we've
> seen 14+% of kernel CPU time consumed in snapshot_refaults().  We have
> also encountered a similar issue recently when computing the lru_size [1].
> 
> We'd like to do some brainstorming to see if there are ways to make
> such accounting more scalable.  For example, not all users of such
> counters need precise counts, and approximate counts that are updated
> lazily could be used instead.

Please make sure to prepare numbers based on the current upstream kernel
so that we have some grounds to base the discussion on. Ideally post
them into the email.

> 
> [1] https://lore.kernel.org/linux-mm/a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com/ 
> 
> 2. Tiered memory accounting and management
> 
> Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
> than others, but a byte of media has about the same cost whether it
> is close or far.  But, with new memory tiers such as High-Bandwidth
> Memory or Persistent Memory, there is a choice between fast/expensive
> and slow/cheap.  The current memory cgroups, however, still live in
> the old model: there is only one set of limits, which implies that all
> memory has the same cost.  We would like to extend memory cgroups to
> comprehend different memory tiers and give users a way to choose a mix
> between fast/expensive and slow/cheap.
> 
> We would like to propose that, for systems with multiple memory tiers,
> we add per memory cgroup accounting of each cgroup's usage of top tier
> memory.  Such top tier memory is a precious resource, where it makes
> sense to impose soft limits.  We can then start to actively demote the
> top tier memory used by cgroups that exceed their allowance when the
> system experiences memory pressure in the top tier.

A similar topic was discussed last year. See https://lwn.net/Articles/787326/

> There is existing infrastructure for per-cgroup memory soft limits that
> we can leverage to implement such a scheme.  We'd like to find out
> whether this approach makes sense to people working on systems with
> multiple memory tiers.

Soft limit is dead.

-- 
Michal Hocko
SUSE Labs



* Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
  2020-02-14 10:45 ` [Lsf-pc] " Michal Hocko
@ 2020-02-20 16:06   ` Kirill A. Shutemov
  2020-02-20 16:19     ` Michal Hocko
  2020-03-04 20:52   ` Tim Chen
  1 sibling, 1 reply; 7+ messages in thread
From: Kirill A. Shutemov @ 2020-02-20 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tim Chen, lsf-pc, linux-mm, Dave Hansen, Dan Williams, Huang Ying

On Fri, Feb 14, 2020 at 11:45:41AM +0100, Michal Hocko wrote:
> On Wed 05-02-20 10:34:57, Tim Chen wrote:
> > There is existing infrastructure for per-cgroup memory soft limits that
> > we can leverage to implement such a scheme.  We'd like to find out
> > whether this approach makes sense to people working on systems with
> > multiple memory tiers.
> 
> Soft limit is dead.

Michal, could you remind us what the deal is with the soft limit?  Why is it dead?

-- 
 Kirill A. Shutemov



* Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
  2020-02-20 16:06   ` Kirill A. Shutemov
@ 2020-02-20 16:19     ` Michal Hocko
  2020-02-20 22:16       ` Tim Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2020-02-20 16:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Tim Chen, lsf-pc, linux-mm, Dave Hansen, Dan Williams, Huang Ying

On Thu 20-02-20 19:06:04, Kirill A. Shutemov wrote:
> On Fri, Feb 14, 2020 at 11:45:41AM +0100, Michal Hocko wrote:
> > On Wed 05-02-20 10:34:57, Tim Chen wrote:
> > > There is existing infrastructure for per-cgroup memory soft limits that
> > > we can leverage to implement such a scheme.  We'd like to find out
> > > whether this approach makes sense to people working on systems with
> > > multiple memory tiers.
> > 
> > Soft limit is dead.
> 
> Michal, could you remind us what the deal is with the soft limit?  Why is it dead?

Because of its very disruptive semantics, essentially the way it was
grafted into the normal reclaim.  It is a priority-0 reclaim round to
shrink the hierarchy which is most in excess before we do a normal
reclaim.  This can lead to over-reclaim, long stalls, etc.

There were a lot of discussions on that matter on the mailing list a
few years back.  We tried to make the semantics more reasonable but
failed, and the result is essentially the new cgroup v2 interface.
-- 
Michal Hocko
SUSE Labs



* Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
  2020-02-20 16:19     ` Michal Hocko
@ 2020-02-20 22:16       ` Tim Chen
  2020-02-21  8:42         ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Tim Chen @ 2020-02-20 22:16 UTC (permalink / raw)
  To: Michal Hocko, Kirill A. Shutemov
  Cc: lsf-pc, linux-mm, Dave Hansen, Dan Williams, Huang Ying

On 2/20/20 8:19 AM, Michal Hocko wrote:

>>
>> Michal, could you remind us what the deal is with the soft limit?  Why is it dead?
> 
> Because of its very disruptive semantics, essentially the way it was
> grafted into the normal reclaim.  It is a priority-0 reclaim round to
> shrink the hierarchy which is most in excess before we do a normal
> reclaim.  This can lead to over-reclaim, long stalls, etc.

Thanks for the explanation.  I wonder if a few factors could mitigate the
stalls in the tiered memory context:

1. Demotion of pages from top tier memory to second tier memory is much
faster than reclaiming them and swapping them out.

2. Demotion targets pages that are colder and less active.

3. If we engage page demotion mostly in the background, say via kswapd,
and not in the direct reclaim path, we can avoid long stalls during
page allocation.  If the memory pressure on the top tier memory is
severe, memory could instead be allocated from the second tier memory
node to avoid stalling.
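
To make point 3 a bit more concrete, here is a hedged sketch of the
allocation-side fallback.  The helper, the explicit node-id parameters
and the flag handling are illustrative assumptions, not an existing
kernel path:

#include <linux/gfp.h>

static struct page *alloc_page_tiered(gfp_t gfp, unsigned int order,
                                      int toptier_nid, int second_tier_nid)
{
        struct page *page;

        /* Try the fast tier first, but without entering direct reclaim. */
        page = alloc_pages_node(toptier_nid,
                                (gfp | __GFP_THISNODE | __GFP_NOWARN) &
                                        ~__GFP_DIRECT_RECLAIM,
                                order);
        if (page)
                return page;

        /* Top tier is tight: take slower memory rather than a long stall. */
        return alloc_pages_node(second_tier_nid, gfp | __GFP_THISNODE, order);
}

Clearing __GFP_DIRECT_RECLAIM still lets kswapd be woken (when the
caller's gfp includes __GFP_KSWAPD_RECLAIM), so demotion can proceed in
the background while the allocation itself does not stall.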

The stalls could still prove to be problematic.  We're implementing
prototypes and will have a better idea of workload latencies once we
can collect data.

Tim




* Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
  2020-02-20 22:16       ` Tim Chen
@ 2020-02-21  8:42         ` Michal Hocko
  0 siblings, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2020-02-21  8:42 UTC (permalink / raw)
  To: Tim Chen
  Cc: Kirill A. Shutemov, lsf-pc, linux-mm, Dave Hansen, Dan Williams,
	Huang Ying

On Thu 20-02-20 14:16:02, Tim Chen wrote:
> On 2/20/20 8:19 AM, Michal Hocko wrote:
> 
> >>
> >> Michal, could you remind us what the deal is with the soft limit?  Why is it dead?
> > 
> > Because of its very disruptive semantics, essentially the way it was
> > grafted into the normal reclaim.  It is a priority-0 reclaim round to
> > shrink the hierarchy which is most in excess before we do a normal
> > reclaim.  This can lead to over-reclaim, long stalls, etc.
> 
> Thanks for the explanation.  I wonder if a few factors could mitigate the
> stalls in the tiered memory context:
> 
> 1. Demotion of pages from top tier memory to second tier memory is much
> faster than reclaiming them and swapping them out.

You could have accumulated a lot of soft limit excess before it is
reclaimed. So I do not think the speed of the demotion is the primary
factor.

> 2. Demotion targets pages that are colder and less active.
> 
> 3. If we engage page demotion mostly in the background, say via kswapd,
> and not in the direct reclaim path, we can avoid long stalls during
> page allocation.  If the memory pressure on the top tier memory is
> severe, memory could instead be allocated from the second tier memory
> node to avoid stalling.
> 
> The stalls could still prove to be problematic.  We're implementing
> prototypes and will have a better idea of workload latencies once we
> can collect data.

I would really encourage you not to hook into the soft limit reclaim,
even if you somehow manage to reduce the problem with stalls, for at
least three reasons:
1) soft limit is not going to be added to cgroup v2 because there is a
   different API to achieve pro-active reclaim
2) soft limit is not aware of the memory you are reclaiming, so using it
   for tiered memory sounds like a bad fit to me.
3) changing the semantics of an existing interface is always
   troublesome.  Please have a look at the mailing list archives from
   the last time we attempted that.
   See e.g. http://lkml.kernel.org/r/1371557387-22434-1-git-send-email-mhocko@suse.cz

Anyway, it is hard to comment without knowing the details of how you
actually want to use the soft limit for different memory types and
their balancing.
-- 
Michal Hocko
SUSE Labs



* Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
  2020-02-14 10:45 ` [Lsf-pc] " Michal Hocko
  2020-02-20 16:06   ` Kirill A. Shutemov
@ 2020-03-04 20:52   ` Tim Chen
  1 sibling, 0 replies; 7+ messages in thread
From: Tim Chen @ 2020-03-04 20:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm, Dave Hansen, Dan Williams, Huang Ying

On 2/14/20 2:45 AM, Michal Hocko wrote:
> On Wed 05-02-20 10:34:57, Tim Chen wrote:
>> Topic: Memory cgroups, whether you like it or not
>>
>> 1. Memory cgroup counters scalability
>>
>> Recently, benchmark teams at Intel were running some bare-metal
>> benchmarks.  To our great surprise, we saw lots of memcg activity in
>> the profiles.  When we asked the benchmark team, they did not even
>> realize they were using memory cgroups.  They were fond of running all
>> their benchmarks in containers that just happened to use memory cgroups
>> by default.  What were previously problems only for memory cgroup users
>> are quickly becoming a problem for everyone.
>>
>> There are memory cgroup counters, read in page management paths, that
>> scale poorly when read.  These counters are per-CPU based and need to
>> be summed over all CPUs to get the overall value for the memory cgroup
>> in the lruvec_page_state_local() function.  This leads to scalability
>> problems on systems with large numbers of CPUs.  For example, we've
>> seen 14+% of kernel CPU time consumed in snapshot_refaults().  We have
>> also encountered a similar issue recently when computing the lru_size [1].
>>
>> We'd like to do some brainstorming to see if there are ways to make
>> such accounting more scalable.  For example, not all users of such
>> counters need precise counts, and approximate counts that are updated
>> lazily could be used instead.
> 
> Please make sure to prepare numbers based on the current upstream kernel
> so that we have some grounds to base the discussion on. Ideally post
> them into the email.


Here's a profile on a 5.2-based kernel with some memory tiering
modifications.  It shows that snapshot_refaults() consumes a big chunk
of CPU cycles gathering the refault stats stored in the root memcg's
lruvec WORKINGSET_ACTIVATE field.

We have to read #memcg x #ncpu local counters to get the complete
refault snapshot, so the computation scales poorly with increasing
numbers of memcgs and CPUs.
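
As a purely illustrative calculation (the actual memcg count on this
system was not recorded here): with, say, 200 memcgs and 96 logical
CPUs, a single snapshot pass reads 200 x 96 = 19,200 per-CPU counters,
and the profile below shows this work being repeated on every
do_try_to_free_pages() call.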

We'll be re-collecting the data on a 5.5-based kernel.  I will post
those numbers when they become available.

The CPU cycle percentages below are fractions of kernel CPU cycles, and
kernel time consumed 31% of all CPU cycles.  The MySQL workload ran on
a 2-socket system with 24 cores per socket.


    14.22%  mysqld           [kernel.kallsyms]         [k] snapshot_refaults
            |
            ---snapshot_refaults
               do_try_to_free_pages
               try_to_free_pages
               __alloc_pages_slowpath
               __alloc_pages_nodemask
               |
               |--14.07%--alloc_pages_vma
               |          |
               |           --14.06%--__handle_mm_fault
               |                     handle_mm_fault
               |                     |
               |                     |--12.57%--__get_user_pages
               |                     |          get_user_pages_unlocked
               |                     |          get_user_pages_fast
               |                     |          iov_iter_get_pages
               |                     |          do_blockdev_direct_IO
               |                     |          ext4_direct_IO
               |                     |          generic_file_read_iter
               |                     |          |
               |                     |          |--12.16%--new_sync_read
               |                     |          |          vfs_read
               |                     |          |          ksys_pread64
               |                     |          |          do_syscall_64
               |                     |          |          entry_SYSCALL_64_after_hwframe
 
Thanks.

Tim




