From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 8 Oct 2020 17:09:17 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>, Roman Gushchin <guro@fb.com>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Greg Thelen <gthelen@google.com>,
	David Rientjes <rientjes@google.com>,
	Michal =?iso-8859-1?Q?Koutn=FD?= <mkoutny@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>, Cgroups <cgroups@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrea Righi <andrea.righi@canonical.com>,
	SeongJae Park <sjpark@amazon.com>
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Message-ID: <20201008210917.GC163830@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com>
 <20200928210216.GA378894@cmpxchg.org>
 <20200929150444.GG2277@dhcp22.suse.cz>
 <20200929215341.GA408059@cmpxchg.org>
 <CALvZod5eN0PDtKo8SEp1n-xGvgCX9k6-OBGYLT3RmzhA+Q-2hw@mail.gmail.com>
 <20201001143149.GA493631@cmpxchg.org>
 <CALvZod59cU40A3nbQtkP50Ae3g6T2MQSt+q1=O2=Gy9QUzNkbg@mail.gmail.com>
 <20201008145336.GA163830@cmpxchg.org>
 <CALvZod5-EtB0jNi9DXTmLSKrUzK2jXRhW8h6+7sqB356k0t1+g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CALvZod5-EtB0jNi9DXTmLSKrUzK2jXRhW8h6+7sqB356k0t1+g@mail.gmail.com>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Oct 08, 2020 at 08:55:57AM -0700, Shakeel Butt wrote:
> On Thu, Oct 8, 2020 at 7:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote:
> > > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > [snip]
> > > > > >    So instead of asking users for a target size whose suitability
> > > > > >    heavily depends on the kernel's LRU implementation, the readahead
> > > > > >    code, the IO device's capability and general load, why not directly
> > > > > >    ask the user for a pressure level that the workload is comfortable
> > > > > >    with and which captures all of the above factors implicitly? Then
> > > > > >    let the kernel do this feedback loop from a per-cgroup worker.
> > > > >
> > > > > I am assuming here by pressure level you are referring to the PSI like
> > > > > interface e.g. allowing the users to tell about their jobs that X
> > > > > amount of stalls in a fixed time window is tolerable.
> > > >
> > > > Right, essentially the same parameters that psi poll() would take.
> > >
> > > I thought a bit more on the semantics of the psi usage for the
> > > proactive reclaim.
> > >
> > > Suppose I have a top level cgroup A on which I want to enable
> > > proactive reclaim. Which memory psi events should the proactive
> > > reclaim consider?
> > >
> > > The simplest would be the memory.psi at 'A'. However memory.psi is
> > > hierarchical and I would not really want the pressure due to limits in
> > > children of 'A' to impact the proactive reclaim.
> >
> > I don't think pressure from limits down the tree can be separated out,
> > generally. All events are accounted recursively as well. Of course, we
> > remember the reclaim level for evicted entries - but if there is
> > reclaim triggered at A and A/B concurrently, the distribution of who
> > ends up reclaiming the physical pages in A/B is pretty arbitrary/racy.
> >
> > If A/B decides to do its own proactive reclaim with the sublimit, and
> > ends up consuming the pressure budget assigned to proactive reclaim in
> > A, there isn't much that can be done.
> >
> > It's also possible that proactive reclaim in A keeps A/B from hitting
> > its limit in the first place.
> >
> > I have to say, the configuration doesn't really strike me as sensible,
> > though. Limits make sense for doing fixed partitioning: A gets 4G, A/B
> > gets 2G out of that. But if you do proactive reclaim on A you're
> > essentially saying A as a whole is auto-sizing dynamically based on
> > its memory access pattern. I'm not sure what it means to then start
> > doing fixed partitions in the sublevel.
> >
> 
> Think of the scenario where there is an infrastructure owner and a
> large number of job owners. The aim of the infra owner is to reduce
> cost by stuffing as many jobs as possible on the same machine while
> job owners want consistent performance.
> 
> The job owners usually have meta jobs i.e. a set of small jobs that
> run on the same machines and they manage these sub-jobs themselves.
>
> The infra owner wants to do proactive reclaim to trim the current jobs
> without impacting their performance and more importantly to have
> enough memory to land new jobs (We have learned the hard way that
> depending on global reclaim for memory overcommit is really bad for
> isolation).
>
> In the above scenario, the configuration you said might not be
> sensible is entirely possible. This is exactly what we have in prod.

I apologize if my statement was worded too broadly. I fully understand
your motivation and the sub-job structure. My question is more about
which level to run proactive reclaim at when there are sub-domains.

You said you're already using a feedback loop to adjust proactive
reclaim based on refault rates. How do you deal today with one
subgroup potentially having higher refaults due to a limit?

It appears that as soon as the subgroups can age independently, you
also need to treat them independently for proactive reclaim, because
one group hitting its pressure limit says nothing about its sibling.

If you apply equal reclaim on them both based on the independently
pressured subjob, you'll under-reclaim the siblings.

If you apply equal reclaim on them both based on the unpressured
siblings alone, you'll over-pressurize the one with its own limit.
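
To make the two failure modes concrete, here is a toy model in Python.
It is only a sketch: the Group class, the "working set" field and all
the numbers are made up for illustration, and real reclaim has no
direct knowledge of a working set size. A/B stands for the subgroup
that already sits at its own limit, A/C for an unpressured sibling.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Group:
    name: str
    usage_mb: int        # current memory footprint
    working_set_mb: int  # hypothetical: memory the job actually touches

    @property
    def cold_mb(self) -> int:
        # Memory that could be reclaimed without causing refaults.
        return max(0, self.usage_mb - self.working_set_mb)


def reclaim(group: Group, target_mb: int) -> Tuple[int, int]:
    """Return (cold_left_mb, refault_mb) after reclaiming target_mb."""
    cold_left = max(0, group.cold_mb - target_mb)
    refault = max(0, target_mb - group.cold_mb)
    return cold_left, refault


# A/B is pinned at its hard limit with no cold memory left.
limited = Group("A/B", usage_mb=2048, working_set_mb=2048)
# A/C has no limit of its own and carries 1024M of cold memory.
sibling = Group("A/C", usage_mb=2048, working_set_mb=1024)


def apply_equal(target_mb: int) -> dict:
    # One shared reclaim target applied to both subgroups.
    return {g.name: reclaim(g, target_mb) for g in (limited, sibling)}


# Target derived from the pressured subgroup: sibling keeps its cold memory.
from_limited = apply_equal(limited.cold_mb)
# Target derived from the unpressured sibling: limited subgroup refaults.
from_sibling = apply_equal(sibling.cold_mb)
# Each subgroup driven by its own signal.
independent = {g.name: reclaim(g, g.cold_mb) for g in (limited, sibling)}

print(from_limited)
print(from_sibling)
print(independent)
```

In this model only the per-subgroup targets leave neither cold memory
behind nor refaults, which is the point about treating independently
aging subgroups independently.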

This seems independent of the exact metric you're using, and more
about at which level you apply pressure, and whether reclaim
subdomains created through a hard limit can be treated as part of a
larger shared pool or not.
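
For reference on "the same parameters that psi poll() would take"
quoted above: a psi trigger is a stall kind ("some" or "full"), a
stall amount in microseconds and a time window in microseconds,
written to a pressure file and then waited on with poll() for POLLPRI.
The sketch below shows that userspace side in Python. The helper names
are made up, the path is whatever pressure file applies (for example
/proc/pressure/memory or a cgroup's memory.pressure), and the 500ms to
10s window bounds are my reading of the psi documentation, so treat
them as an assumption. It says nothing about how a kernel-side
per-cgroup worker would consume such parameters.

```python
import os
import select

# Window bounds as documented for psi triggers (assumed here).
MIN_WINDOW_US = 500_000
MAX_WINDOW_US = 10_000_000


def psi_trigger(kind: str, stall_us: int, window_us: int) -> str:
    """Format a psi trigger: '<some|full> <stall us> <window us>'."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not MIN_WINDOW_US <= window_us <= MAX_WINDOW_US:
        raise ValueError("window out of range")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall must be positive and fit in the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(path: str, trigger: str, timeout_ms: int) -> bool:
    """Register the trigger on a pressure file and wait for one event.

    Hypothetical helper: needs a kernel with psi enabled and permission
    to write to the pressure file. Returns True if the threshold fired.
    """
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        events = poller.poll(timeout_ms)
        return any(ev & select.POLLPRI for _, ev in events)
    finally:
        os.close(fd)


# 150ms of partial stall within any 1s window.
print(psi_trigger("some", 150_000, 1_000_000))
```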