[LSF/MM TOPIC] Memory cgroups, whether you like it or not

From: Tim Chen <tim.c.chen@linux.intel.com>
To: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org
Cc: Dave Hansen <dave.hansen@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Huang Ying <ying.huang@intel.com>
Subject: [LSF/MM TOPIC] Memory cgroups, whether you like it or not
Date: Wed, 5 Feb 2020 10:34:57 -0800
Message-ID: <e481a83e-4100-d144-80f5-6c97a77b9c8b@linux.intel.com>

Topic: Memory cgroups, whether you like it or not

1. Memory cgroup counters scalability

Recently, benchmark teams at Intel were running some bare-metal
benchmarks.  To our great surprise, we saw lots of memcg activity in
the profiles.  When we asked the benchmark team, they did not even
realize they were using memory cgroups.  They were fond of running all
their benchmarks in containers that just happened to use memory cgroups
by default.  What were previously problems only for memory cgroup users
are quickly becoming a problem for everyone.

Some memory cgroup counters are read in page management paths, and
reading them scales poorly.  These counters are kept per CPU and must
be summed over all CPUs to get the overall value for the memory cgroup,
as is done in lruvec_page_state_local().  This leads to scalability
problems on systems with large numbers of CPUs.  For example, we've
seen 14+% kernel CPU time consumed in snapshot_refaults().  We also
encountered a similar issue recently when computing the lru_size[1].
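
To make the cost concrete, here is a minimal userspace sketch of a
per-CPU counter whose exact read has to walk every CPU.  It is not the
kernel code: NR_CPUS, struct pcpu_counter, pcpu_add() and
pcpu_read_exact() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical; large systems have hundreds */

/* Hypothetical per-CPU counter: one slot per CPU, no shared state. */
struct pcpu_counter {
	long slot[NR_CPUS];
};

/* Update path: touches only the local CPU's slot, so it stays cheap. */
static void pcpu_add(struct pcpu_counter *c, size_t cpu, long delta)
{
	assert(cpu < NR_CPUS);
	c->slot[cpu] += delta;
}

/* Read path: visits every CPU's slot, so cost grows with CPU count. */
static long pcpu_read_exact(const struct pcpu_counter *c)
{
	long sum = 0;

	for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->slot[cpu];
	return sum;
}
```

Updates are local, but each read costs one memory access per CPU, and
in a real system those slots typically sit in other CPUs' caches.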

We would like to brainstorm ways to make such accounting more
scalable.  For example, not all users of these counters need precise
counts; some could use approximate counts that are updated lazily.
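
One possible shape for a lazily updated approximate count is a batched
counter: each CPU accumulates a private delta and folds it into a
shared total only once the delta reaches a threshold.  The sketch
below is a single-threaded userspace illustration under that
assumption, not a proposed patch; struct lazy_counter, lazy_add(),
lazy_read_approx(), NR_CPUS and BATCH are hypothetical names, and a
real implementation would need atomics or locking on the shared total.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical CPU count */
#define BATCH   32	/* hypothetical flush threshold per CPU */

struct lazy_counter {
	long total;		/* shared, approximate value */
	long local[NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Fold the local delta into the shared total only when it gets large. */
static void lazy_add(struct lazy_counter *c, size_t cpu, long delta)
{
	assert(cpu < NR_CPUS);

	long v = c->local[cpu] + delta;

	if (v >= BATCH || v <= -BATCH) {
		c->total += v;	/* would need to be atomic in real code */
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* O(1) read: off from the exact sum by less than NR_CPUS * BATCH. */
static long lazy_read_approx(const struct lazy_counter *c)
{
	return c->total;
}
```

The read no longer depends on the number of CPUs; the price is a
bounded error, which is acceptable only for users that do not need
precise counts.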

[1] https://lore.kernel.org/linux-mm/a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com/

2. Tiered memory accounting and management

Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
than other DRAM, but a byte of media has about the same cost whether
it is close or far.  With new memory tiers such as High-Bandwidth
Memory or Persistent Memory, however, there is a choice between
fast/expensive and slow/cheap.  The current memory cgroups still live
in the old model: there is only one set of limits, which implies that
all memory has the same cost.  We would like to extend memory cgroups
to comprehend different memory tiers and give users a way to choose a
mix between fast/expensive and slow/cheap.

We would like to propose that, for systems with multiple memory tiers,
we add per-cgroup accounting of each memory cgroup's usage of top tier
memory.  Top tier memory is a precious resource, so it makes sense to
impose soft limits on it.  We can then actively demote the top tier
memory used by cgroups that exceed their allowance when the system
experiences memory pressure in the top tier.
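
As a rough illustration of the proposal above, the following userspace
sketch tracks top tier usage per cgroup and, under top tier pressure,
demotes from the cgroup that is furthest over its soft limit.  Every
name here (struct tier_cgroup, toptier_excess(),
pick_demotion_victim(), demote_pages()) is hypothetical, and choosing
the largest excess first is an assumed policy, not something settled
in this proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup top tier accounting, in pages. */
struct tier_cgroup {
	const char *name;
	unsigned long toptier_usage;		/* pages charged in top tier */
	unsigned long toptier_soft_limit;	/* top tier allowance */
};

/* Pages by which the cgroup exceeds its top tier allowance. */
static unsigned long toptier_excess(const struct tier_cgroup *cg)
{
	return cg->toptier_usage > cg->toptier_soft_limit ?
	       cg->toptier_usage - cg->toptier_soft_limit : 0;
}

/* Return the cgroup with the largest excess, or NULL if none is over. */
static struct tier_cgroup *pick_demotion_victim(struct tier_cgroup *cgs,
						size_t n)
{
	struct tier_cgroup *victim = NULL;
	unsigned long worst = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long excess = toptier_excess(&cgs[i]);

		if (excess > worst) {
			worst = excess;
			victim = &cgs[i];
		}
	}
	return victim;
}

/* Demote up to nr pages, never going below the soft limit. */
static unsigned long demote_pages(struct tier_cgroup *cg, unsigned long nr)
{
	unsigned long excess = toptier_excess(cg);
	unsigned long n = nr < excess ? nr : excess;

	cg->toptier_usage -= n;	/* pages now live in a lower tier */
	return n;
}
```

Cgroups at or under their allowance are never picked, which matches
the soft (rather than hard) nature of the limit: it only matters when
the top tier is under pressure.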

There is existing infrastructure for per-cgroup memory soft limits
that we can leverage to implement such a scheme.  We would like to
find out whether this approach makes sense to people working on
systems with multiple memory tiers.