Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not

From: Michal Hocko <mhocko@kernel.org>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	Dave Hansen <dave.hansen@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Memory cgroups, whether you like it or not
Date: Fri, 14 Feb 2020 11:45:41 +0100
Message-ID: <20200214104541.GT31689@dhcp22.suse.cz>
In-Reply-To: <e481a83e-4100-d144-80f5-6c97a77b9c8b@linux.intel.com>

On Wed 05-02-20 10:34:57, Tim Chen wrote:
> Topic: Memory cgroups, whether you like it or not
> 
> 1. Memory cgroup counters scalability
> 
> Recently, benchmark teams at Intel were running some bare-metal
> benchmarks.  To our great surprise, we saw lots of memcg activity in
> the profiles.  When we asked the benchmark team, they did not even
> realize they were using memory cgroups.  They were fond of running all
> their benchmarks in containers that just happened to use memory cgroups
> by default.  What were previously problems only for memory cgroup users
> are quickly becoming a problem for everyone.
> 
> There are mem cgroup counters that are read in page management paths
> which scale poorly when read.  These counters are per cpu based and
> need to be summed over all CPUs to get the overall value for the mem
> cgroup in the lruvec_page_state_local() function.  This led to
> scalability problems on systems with large numbers of CPUs.  For example,
> we’ve seen 14+% kernel CPU time consumed in snapshot_refaults().  We have
> also encountered a similar issue recently when computing the lru_size[1].
> 
> We'd like to do some brainstorming to see if there are ways to make
> such accounting more scalable.  For example, not all users of such
> counters need precise counts, so approximate counts that are updated
> lazily could be used instead.

Please make sure to prepare numbers based on the current upstream kernel
so that we have some grounds to base the discussion on. Ideally post
them into the email.
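
As a reference point for that discussion, here is a minimal user-space
sketch of the lazily updated approximate counter idea quoted above. It is
not kernel code and not something proposed in this thread; approx_counter,
counter_add(), counter_read_approx(), counter_read_exact(), NR_CPUS and
BATCH are hypothetical names chosen for illustration. Each CPU accumulates
a private delta and folds it into a shared total only once the delta
reaches BATCH, so a reader can take the shared total in O(1) instead of
summing over every CPU, at the cost of an error of at most
NR_CPUS * (BATCH - 1).

```c
#include <stdlib.h>

#define NR_CPUS 4   /* hypothetical CPU count for this sketch */
#define BATCH   8   /* hypothetical per-CPU slack before folding */

struct approx_counter {
	long global;          /* shared total, updated lazily */
	long local[NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/* Update path: touches only the caller's per-CPU slot in the common case. */
void counter_add(struct approx_counter *c, int cpu, long delta)
{
	c->local[cpu] += delta;
	if (labs(c->local[cpu]) >= BATCH) {
		/* Fold the accumulated delta into the shared total. */
		c->global += c->local[cpu];
		c->local[cpu] = 0;
	}
}

/* Cheap read: O(1), may lag by up to NR_CPUS * (BATCH - 1). */
long counter_read_approx(const struct approx_counter *c)
{
	return c->global;
}

/* Precise read: O(NR_CPUS), the pattern that scales poorly when hot. */
long counter_read_exact(const struct approx_counter *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

A real implementation would also need atomic or locked updates of the
shared total and protection against preemption while touching the per-CPU
slot; both are left out here to keep the trade-off visible. Callers that
need a precise value would use counter_read_exact() and pay the per-CPU
summation cost, while the others could use counter_read_approx().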

> 
> [1] https://lore.kernel.org/linux-mm/a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com/ 
> 
> 2. Tiered memory accounting and management
> 
> Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
> than others, but a byte of media has about the same cost whether it
> is close or far.  But, with new memory tiers such as High-Bandwidth
> Memory or Persistent Memory, there is a choice between fast/expensive
> and slow/cheap.  But, the current memory cgroups still live in the
> old model. There is only one set of limits, and it implies that all
> memory has the same cost.  We would like to extend memory cgroups to
> comprehend different memory tiers to give users a way to choose a mix
> between fast/expensive and slow/cheap.
> 
> We would like to propose that, for systems with multiple memory tiers,
> we add per-cgroup accounting of each memory cgroup's usage of the
> top-tier memory.  Such top-tier memory is a precious resource, where it
> makes sense to impose soft limits.  We can start to actively demote the
> top-tier memory used by cgroups that exceed their allowance when the
> system experiences memory pressure in the top-tier memory.

A similar topic was discussed last year. See https://lwn.net/Articles/787326/

> There is existing infrastructure for per-cgroup memory soft limits that
> we can leverage to implement such a scheme.  We'd like to find out if
> this approach makes sense for people working on systems with multiple
> memory tiers.

Soft limit is dead.

-- 
Michal Hocko
SUSE Labs