linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Ying Huang <ying.huang@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	David Rientjes <rientjes@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
Date: Thu, 8 Apr 2021 13:52:33 +0200	[thread overview]
Message-ID: <YG7ugXZZ9BcXyGGk@dhcp22.suse.cz> (raw)
In-Reply-To: <c615a610-eb4b-7e1e-16d1-4bc12938b08a@linux.intel.com>

On Wed 07-04-21 15:33:26, Tim Chen wrote:
> 
> 
> On 4/6/21 2:08 AM, Michal Hocko wrote:
> > On Mon 05-04-21 10:08:24, Tim Chen wrote:
> > [...]
> >> To make fine grain cgroup based management of the precious top tier
> >> DRAM memory possible, this patchset adds a few new features:
> >> 1. Provides memory monitors on the amount of top tier memory used per cgroup 
> >>    and by the system as a whole.
> >> 2. Applies soft limits on the top tier memory each cgroup uses 
> >> 3. Enables kswapd to demote top tier pages from cgroup with excess top
> >>    tier memory usages.
> > 
> 
> Michal,
> 
> Thanks for giving your feedback.  Much appreciated.
> 
> > Could you be more specific on how this interface is supposed to be used?
> 
> We created a README section on the cgroup control part of this patchset:
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.71&id=20f20be02671384470c7cd8f66b56a9061a4071f
> to illustrate how this interface should be used.

I have to confess I didn't get to look at demotion patches yet.

> The top tier memory used is reported in
> 
> memory.toptier_usage_in_bytes
> 
> The amount of top tier memory usable by each cgroup without
> triggering page reclaim is controlled by the
> 
> memory.toptier_soft_limit_in_bytes 

Are you trying to say that soft limit acts as some sort of guarantee?
Does that mean that if the memcg is under memory pressure top tiear
memory is opted out from any reclaim if the usage is not in excess?

From you previous email it sounds more like the limit is evaluated on
the global memory pressure to balance specific memcgs which are in
excess when trying to reclaim/demote a toptier numa node.

Soft limit reclaim has several problems. Those are historical and
therefore the behavior cannot be changed. E.g. go after the biggest
excessed memcg (with priority 0 - aka potential full LRU scan) and then
continue with a normal reclaim. This can be really disruptive to the top
user.

So you can likely define a more sane semantic. E.g. push back memcgs
proporitional to their excess but then we have two different soft limits
behavior which is bad as well. I am not really sure there is a sensible
way out by (ab)using soft limit here.

Also I am not really sure how this is going to be used in practice.
There is no soft limit by default. So opting in would effectivelly
discriminate those memcgs. There has been a similar problem with the
soft limit we have in general. Is this really what you are looing for?
What would be a typical usecase?

[...]
> >> The patchset is based on cgroup v1 interface. One shortcoming of the v1
> >> interface is the limit on the cgroup is a soft limit, so a cgroup can
> >> exceed the limit quite a bit before reclaim before page demotion reins
> >> it in. 
> > 
> > I have to say that I dislike abusing soft limit reclaim for this. In the
> > past we have learned that the existing implementation is unfixable and
> > changing the existing semantic impossible due to backward compatibility.
> > So I would really prefer the soft limit just find its rest rather than
> > see new potential usecases.
> 
> Do you think we can reuse some of the existing soft reclaim machinery
> for the v2 interface?
> 
> More particularly, can we treat memory_toptier.high in cgroup v2 as a soft limit?

No, you should follow existing limits semantics. High limit acts as a
allocation throttling interface.

> We sort how much each mem cgroup exceeds memory_toptier.high and
> go after the cgroup that have largest excess first for page demotion.
> Will appreciate if you can shed some insights on what could go wrong
> with such an approach. 

This cannot work as a thorttling interface.
 
> > I haven't really looked into details of this patchset but from a cursory
> > look it seems like you are actually introducing a NUMA aware limits into
> > memcg that would control consumption from some nodes differently than
> > other nodes. This would be rather alien concept to the existing memcg
> > infrastructure IMO. It looks like it is fusing borders between memcg and
> > cputset controllers.
> 
> Want to make sure I understand what you mean by NUMA aware limits.
> Yes, in the patch set, it does treat the NUMA nodes differently.
> We are putting constraint on the "top tier" RAM nodes vs the lower
> tier PMEM nodes.  Is this what you mean?

What I am trying to say (and I have brought that up when demotion has been
discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
The specific technology shouldn't be imprinted into the interface.
Fundamentally you are trying to balance memory among NUMA nodes as we do
not have other abstraction to use. So rather than talking about top,
secondary, nth tier we have different NUMA nodes with different
characteristics and you want to express your "priorities" for them.

> I can see it does has
> some flavor of cpuset controller.  In this case, it doesn't explicitly
> set a node as allowed or forbidden as in cpuset, but put some constraints
> on the usage of a group of nodes.  
> 
> Do you have suggestions on alternative controller for allocating tiered memory resource?
 
I am not really sure what would be the best interface to be honest.
Maybe we want to carve this into memcg in some form of node priorities
for the reclaim. Any of the existing limits is numa aware so far. Maybe
we want to say hammer this node more than others if there is a memory
pressure. Not sure that would help you particular usecase though.

> > You also seem to be basing the interface on the very specific usecase.
> > Can we expect that there will be many different tiers requiring their
> > own balancing?
> > 
> 
> You mean more than two tiers of memory? We did think a bit about system
> that has stuff like high bandwidth memory that's faster than DRAM.
> Our thought is usage and freeing of those memory will require 
> explicit assignment (not used by default), so will be outside the
> realm of auto balancing.  So at this point, we think two tiers will be good.

Please keep in mind that once there is an interface it will be
impossible to change in the future. So do not bind yourself to the 2
tier setups that you have in hands right now.

-- 
Michal Hocko
SUSE Labs

  reply	other threads:[~2021-04-08 11:52 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-05 17:08 [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 01/11] mm: Define top tier memory node mask Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 02/11] mm: Add soft memory limit for mem cgroup Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 03/11] mm: Account the top tier memory usage per cgroup Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 04/11] mm: Report top tier memory usage in sysfs Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 05/11] mm: Add soft_limit_top_tier tree for mem cgroup Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 06/11] mm: Handle top tier memory in cgroup soft limit memory tree utilities Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 07/11] mm: Account the total top tier memory in use Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 08/11] mm: Add toptier option for mem_cgroup_soft_limit_reclaim() Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 09/11] mm: Use kswapd to demote pages when toptier memory is tight Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 10/11] mm: Set toptier_scale_factor via sysctl Tim Chen
2021-04-05 17:08 ` [RFC PATCH v1 11/11] mm: Wakeup kswapd if toptier memory need soft reclaim Tim Chen
2021-04-06  9:08 ` [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory Michal Hocko
2021-04-07 22:33   ` Tim Chen
2021-04-08 11:52     ` Michal Hocko [this message]
2021-04-09 23:26       ` Tim Chen
2021-04-12 19:20         ` Shakeel Butt
2021-04-14  8:59           ` Jonathan Cameron
2021-04-15  0:42           ` Tim Chen
2021-04-13  2:15         ` Huang, Ying
2021-04-13  8:33         ` Michal Hocko
2021-04-12 14:03       ` Shakeel Butt
2021-04-08 17:18 ` Shakeel Butt
2021-04-08 18:00   ` Yang Shi
2021-04-08 20:29     ` Shakeel Butt
2021-04-08 20:50       ` Yang Shi
2021-04-12 14:03         ` Shakeel Butt
2021-04-09  7:24       ` Michal Hocko
2021-04-15 22:31         ` Tim Chen
2021-04-16  6:38           ` Michal Hocko
2021-04-14 23:22       ` Tim Chen
2021-04-09  2:58     ` Huang, Ying
2021-04-09 20:50       ` Yang Shi
2021-04-15 22:25   ` Tim Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YG7ugXZZ9BcXyGGk@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=rientjes@google.com \
    --cc=shakeelb@google.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).