From: Shakeel Butt <shakeelb@google.com>
Date: Thu, 17 Jun 2021 11:48:56 -0700
Subject: Re: [LSF/MM TOPIC] Tiered memory accounting and management
To: Yang Shi
Cc: Tim Chen, lsf-pc@lists.linux-foundation.org, Linux MM, Michal Hocko,
    Dan Williams, Dave Hansen, David Rientjes, Wei Xu, Greg Thelen

Thanks Yang for the CC.

On Tue, Jun 15, 2021 at 5:17 PM Yang Shi wrote:
>
> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen wrote:
> >
> > From: Tim Chen
> >
> > Tiered memory accounting and management
> > ------------------------------------------------------------
> > Traditionally, all RAM is DRAM. Some DRAM might be closer/faster
> > than others, but a byte of media has about the same cost whether it
> > is close or far. With new memory tiers such as High-Bandwidth
> > Memory or Persistent Memory, however, there is a choice between
> > fast/expensive and slow/cheap. The current memory cgroups still live
> > in the old model: there is only one set of limits, which implies
> > that all memory has the same cost. We would like to extend memory
> > cgroups to comprehend different memory tiers and give users a way to
> > choose a mix between fast/expensive and slow/cheap.
> >
> > To manage such memory, we will need to account memory usage and
> > impose limits for each kind of memory.
> >
> > A couple of approaches to partitioning memory between cgroups have
> > been discussed previously and are listed below. We would like to use
> > the LSF/MM session to come to a consensus on the approach to take.
> >
> > 1. Per NUMA node limit and accounting for each cgroup.
> > We can assign higher limits on better performing memory nodes for
> > higher priority cgroups.
> >
> > There are some loose ends here that warrant further discussion:
> > (1) A user friendly interface for such limits. Would a proportional
> > weight for the cgroup that translates to an actual absolute limit
> > be more suitable?
> > (2) Memory mis-configurations can occur more easily, as the admin
> > has a much larger number of limits, spread among the cgroups, to
> > manage. Over-restrictive limits can lead to under-utilized and
> > wasted memory and hurt performance.
> > (3) OOM behavior when a cgroup hits its limit.

This (NUMA based limits) is something I was pushing for, but after
discussing it internally with userspace controller devs, I have to back
off from this position. The main feedback I got was that setting one
memory limit is already complicated, and having to set/adjust this many
limits would be horrifying.
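(As a side note, the per-node *accounting* half of option 1 is already
partially there in cgroup v2, since memory.numa_stat breaks the memcg
stats down per NUMA node; it is only the per-node *limits* that would be
new. Below is a minimal sketch of reading it; the cgroup path and the
choice of the "anon" counter are placeholder assumptions, not a
proposal.)

/*
 * Rough sketch only: dump the per-node "anon" usage of one cgroup from
 * cgroup v2's memory.numa_stat. The cgroup path is a made-up example.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/high-prio-job/memory.numa_stat";
	char line[4096];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* Each line looks like: "anon N0=<bytes> N1=<bytes> ..." */
		if (strncmp(line, "anon ", 5) == 0)
			fputs(line, stdout);
	}

	fclose(f);
	return 0;
}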
> > 2. Per memory tier limit and accounting for each cgroup.
> > We can assign higher limits on memory in the better performing
> > memory tier for higher priority cgroups. I previously prototyped a
> > soft limit based implementation to demonstrate the tiered limit
> > idea.
> >
> > There are also a number of issues here:
> > (1) The advantage is that we have fewer limits to deal with,
> > simplifying configuration. However, doubts have been raised by a
> > number of people on whether we can really properly classify the
> > NUMA nodes into memory tiers. There could still be significant
> > performance differences between NUMA nodes even for the same kind
> > of memory. We will also not have the fine-grained control and
> > flexibility that comes with a per NUMA node limit.
> > (2) Will a memory hierarchy defined by promotion/demotion
> > relationships between memory nodes be a viable approach for
> > defining memory tiers?
> >
> > These issues related to the management of systems with multiple
> > kinds of memory can be ironed out in this session.
>
> Thanks for suggesting this topic. I'm interested in the topic and
> would like to attend.
>
> Other than the above points, I'm wondering whether we shall discuss
> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving
> the development, and I have been involved in the early development
> and review, but it seems there are still some open questions
> according to the latest review feedback.
>
> Some other folks may be interested in this topic as well; CC'ed them
> in the thread.

At the moment, "personally" I am more inclined towards a passive
approach to the memcg accounting of memory tiers. By that I mean: let's
start by providing a 'usage' interface and get more production/real-world
data to motivate the 'limit' interfaces. (One minor reason is that
defining the 'limit' interface will force us to decide how tiers are
defined, i.e. a NUMA node, a set of NUMA nodes, or something else.)

IMHO we should focus more on the "aging" of application memory and the
"migration/balance" between the tiers. I don't think the memory reclaim
infrastructure is the right place for these operations (it ignores
unevictable pages and does not maintain accurate ages). What we need is
proactive, continuous aging and balancing. We need something like, with
additions, multi-gen LRUs or DAMON or page idle tracking for aging, and
a new mechanism for balancing which takes ages into account.

To give a more concrete example: let's say we have a system with two
memory tiers and multiple low and high priority jobs. For high priority
jobs, set the allocation try list from high to low tier, and for low
priority jobs the reverse of that (I am not sure if we can do that out
of the box with today's kernel; see the rough sketch at the end of this
mail). In the background we migrate cold memory down the tiers and hot
memory in the reverse direction. In this background mechanism we can
enforce all the different limiting policies, like Yang's original high
and low tier percentages, or something like "X% of the accesses of high
priority jobs should be from the high tier". Basically I am saying that
until we find from production data that this background mechanism is
not strong enough to enforce passive limits, we should delay the
decision on limit interfaces.
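To make the "allocation try list" part of the example above a bit more
concrete: with today's kernel I don't think a true multi-node try list
can be expressed from userspace, but a rough approximation is to make
high priority jobs prefer the fast node and low priority jobs prefer
the slow node, and let the kernel fall back to the other node under
pressure. The node numbers below are assumptions about a two-tier
topology, and this is only a sketch of the idea, not a proposed
interface.

/* Build with: gcc -o tier_prefer tier_prefer.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int fast_node = 0;   /* assumed top-tier (DRAM) node */
	int slow_node = 1;   /* assumed lower-tier (e.g. PMEM) node */
	int high_priority = (argc > 1) && atoi(argv[1]);

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma: NUMA not available\n");
		return 1;
	}

	/*
	 * Prefer the fast node for high priority jobs and the slow node
	 * for low priority ones; allocations still fall back to the other
	 * node when the preferred one is full. Background aging and
	 * promotion/demotion would then rebalance hot/cold pages between
	 * the tiers over time.
	 */
	numa_set_preferred(high_priority ? fast_node : slow_node);

	/* exec or run the actual workload here ... */
	return 0;
}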