From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=jIX+=AZ=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-4.0 required=3.0 tests=BAYES_00,MAILING_LIST_MULTI,
	SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 5E4E9C433E0
	for <linux-mm@archiver.kernel.org>; Tue, 14 Jul 2020 15:50:22 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 1C41222224
	for <linux-mm@archiver.kernel.org>; Tue, 14 Jul 2020 15:50:22 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1C41222224
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id B2F8E8D0001; Tue, 14 Jul 2020 11:50:21 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id ADFFC6B0022; Tue, 14 Jul 2020 11:50:21 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 9F8178D0001; Tue, 14 Jul 2020 11:50:21 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0138.hostedemail.com [216.40.44.138])
	by kanga.kvack.org (Postfix) with ESMTP id 8AA676B0010
	for <linux-mm@kvack.org>; Tue, 14 Jul 2020 11:50:21 -0400 (EDT)
Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay04.hostedemail.com (Postfix) with ESMTP id 4021A1E14
	for <linux-mm@kvack.org>; Tue, 14 Jul 2020 15:50:21 +0000 (UTC)
X-FDA: 77037118242.13.coal89_2f0290026ef2
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin13.hostedemail.com (Postfix) with ESMTP id 103FA18140B70
	for <linux-mm@kvack.org>; Tue, 14 Jul 2020 15:50:21 +0000 (UTC)
X-HE-Tag: coal89_2f0290026ef2
X-Filterd-Recvd-Size: 7382
Received: from mail-wr1-f67.google.com (mail-wr1-f67.google.com [209.85.221.67])
	by imf06.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Tue, 14 Jul 2020 15:50:20 +0000 (UTC)
Received: by mail-wr1-f67.google.com with SMTP id f2so22584030wrp.7
        for <linux-mm@kvack.org>; Tue, 14 Jul 2020 08:50:20 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to;
        bh=7UNjTjNj7nBrxc1O4cT43g/Jb9uDjM/8ow+jG2V3JdU=;
        b=KhtScX2BlWm265k5fOVXdW1IAndbVi/kfJv4CE6vcg91a/5vf9b1zPVMMffa24hjT+
         nrPTzJZrqSu7YHFNuEQt6UxNXeD+zwLnRQWgIT5Yg4tUdJdCCbzTVqYyMB2F5CUK0Vuo
         /aYm3wflyNjgB6MdU4d8iyR1CJL6+yC59YFCTrLd8hC+vALs+5RiDZG+OVb5+knaikA5
         /bcYz76LCWBVTM/7prI+A5crtBf2ty2AUi0L4/sEQox5p2/BJfwu4aU6B2Vhp8QG8qbH
         A9Wa/S7o0BSSz5QEckVLgvNvZq9cGf2mY7QOvwl1QYGFULwYsVwklxvF1z7R9w3qMpZ9
         F2ZA==
X-Gm-Message-State: AOAM5311g2rHqM87A+ordooJo+bpdRZdVkuBEBtBldrgReT+2kQxNVgc
	20EGwVe4kTiMciPVb1GoblQ=
X-Google-Smtp-Source: ABdhPJzkKuBEXARsKEFK+BNNFviqFhl02Embfs0hF4rMtyp2khXl4hcBMRqZxUzNHjwX/hgn6kxA4A==
X-Received: by 2002:adf:f2c5:: with SMTP id d5mr6660872wrp.96.1594741819457;
        Tue, 14 Jul 2020 08:50:19 -0700 (PDT)
Received: from localhost (ip-37-188-148-171.eurotel.cz. [37.188.148.171])
        by smtp.gmail.com with ESMTPSA id k14sm29563410wrn.76.2020.07.14.08.50.18
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 14 Jul 2020 08:50:18 -0700 (PDT)
Date: Tue, 14 Jul 2020 17:50:17 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>, Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Linux MM <linux-mm@kvack.org>,
	Kernel Team <kernel-team@fb.com>,
	LKML <linux-kernel@vger.kernel.org>, Domas Mituzas <domas@fb.com>,
	Tejun Heo <tj@kernel.org>, Chris Down <chris@chrisdown.name>
Subject: Re: [PATCH] mm: memcontrol: avoid workload stalls when lowering
 memory.high
Message-ID: <20200714155017.GQ24642@dhcp22.suse.cz>
References: <20200709194718.189231-1-guro@fb.com>
 <20200710122917.GB3022@dhcp22.suse.cz>
 <CALvZod6Yk8QoZjbNkGE8-qeOD187Nu-+VwasoROGZs_UsMgbEQ@mail.gmail.com>
 <20200710184205.GB350256@carbon.dhcp.thefacebook.com>
 <CALvZod45_zVaFhvw-wc9b6-Fth=fZo5Fo6xCwRVkrWC6ZprYyw@mail.gmail.com>
 <20200714084123.GG24642@dhcp22.suse.cz>
 <CALvZod6kw++JnZnyYVg4-u2vNQ7SLMFh3qKG1xu7_AahdmXhdg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CALvZod6kw++JnZnyYVg4-u2vNQ7SLMFh3qKG1xu7_AahdmXhdg@mail.gmail.com>
X-Rspamd-Queue-Id: 103FA18140B70
X-Spamd-Result: default: False [0.00 / 100.00]
X-Rspamd-Server: rspam05
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue 14-07-20 08:32:09, Shakeel Butt wrote:
> On Tue, Jul 14, 2020 at 1:41 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 10-07-20 12:19:37, Shakeel Butt wrote:
> > > On Fri, Jul 10, 2020 at 11:42 AM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Fri, Jul 10, 2020 at 07:12:22AM -0700, Shakeel Butt wrote:
> > > > > On Fri, Jul 10, 2020 at 5:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Thu 09-07-20 12:47:18, Roman Gushchin wrote:
> > > > > > > Memory.high limit is implemented in a way such that the kernel
> > > > > > > penalizes all threads which are allocating a memory over the limit.
> > > > > > > Forcing all threads into the synchronous reclaim and adding some
> > > > > > > artificial delays allows to slow down the memory consumption and
> > > > > > > potentially give some time for userspace oom handlers/resource control
> > > > > > > agents to react.
> > > > > > >
> > > > > > > It works nicely if the memory usage is hitting the limit from below,
> > > > > > > however it works sub-optimal if a user adjusts memory.high to a value
> > > > > > > way below the current memory usage. It basically forces all workload
> > > > > > > threads (doing any memory allocations) into the synchronous reclaim
> > > > > > > and sleep. This makes the workload completely unresponsive for
> > > > > > > a long period of time and can also lead to a system-wide contention on
> > > > > > > lru locks. It can happen even if the workload is not actually tight on
> > > > > > > memory and has, for example, a ton of cold pagecache.
> > > > > > >
> > > > > > > In the current implementation writing to memory.high causes an atomic
> > > > > > > update of page counter's high value followed by an attempt to reclaim
> > > > > > > enough memory to fit into the new limit. To fix the problem described
> > > > > > > above, all we need is to change the order of execution: try to push
> > > > > > > the memory usage under the limit first, and only then set the new
> > > > > > > high limit.
> > > > > >
> > > > > > Shakeel would this help with your pro-active reclaim usecase? It would
> > > > > > require to reset the high limit right after the reclaim returns which is
> > > > > > quite ugly but it would at least not require a completely new interface.
> > > > > > You would simply do
> > > > > >         high = current - to_reclaim
> > > > > >         echo $high > memory.high
> > > > > >         echo infinity > memory.high # To prevent direct reclaim
> > > > > >                                     # allocation stalls
> > > > > >
> > > > >
> > > > > This will reduce the chance of stalls but the interface is still
> > > > > non-delegatable i.e. applications can not change their own memory.high
> > > > > for the use-cases like application controlled proactive reclaim and
> > > > > uswapd.
> > > >
> > > > Can you, please, elaborate a bit more on this? I didn't understand
> > > > why.
> > > >
> > >
> > > Sure. Do we want memory.high a CFTYPE_NS_DELEGATABLE type file? I
> > > don't think so otherwise any job on a system can change their
> > > memory.high and can adversely impact the isolation and memory
> > > scheduling of the system.
> >
> > Is this really the case? There should always be a parent cgroup that
> > overrides the setting.
> 
> Can you explain a bit more? I don't see any requirement of having a
> layer of cgroup between root and the job cgroup. Internally we
> schedule jobs as top level cgroups. There do exist jobs which are a
> combination of other jobs and there we do use an additional layer of
> cgroup (similar to pods running multiple containers in kubernetes).
> Surely we can add a layer for all the jobs but it comes with an
> overhead and at scale that overhead is not negligible.

What I've had in mind is that if you want to delegate then you have the
option to add a layer where you predefine restrictions/guarantees so
that the delegated cgroup under that hierarchy cannot run away. So
configuring the high limit in a delegated cgroup should be reasonably safe.
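
As a rough sketch of such a layer (cgroup v2 assumed; the function names,
cgroup names and sizes are made up for illustration, and handing ownership
of the child to the delegatee is left out):

```shell
# Sketch only: names and sizes are illustrative assumptions.
# $1 cgroup2 mount point, $2 layer name, $3 delegated cgroup name,
# $4 memory.max for the layer
setup_capped_layer() {
    mkdir -p "$1/$2/$3" || return 1
    # Make the memory controller available in the layer, then in its children.
    echo "+memory" > "$1/cgroup.subtree_control" || return 1
    echo "+memory" > "$1/$2/cgroup.subtree_control" || return 1
    # The restriction lives in the layer, outside the delegatee's reach.
    echo "$4" > "$1/$2/memory.max" || return 1
}

# The delegatee may then tune its own high limit below that cap.
# $1 path of the delegated cgroup, $2 new memory.high value
set_own_high() {
    echo "$2" > "$1/memory.high"
}
```

On a real system this would be called along the lines of
"setup_capped_layer /sys/fs/cgroup jobs job1 8G" as root, followed by
"set_own_high /sys/fs/cgroup/jobs/job1 4G" from the job itself.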

> > Also you can always set the hard limit if you do
> > not want to add another layer of cgroup in the hierarchy before
> > delegation. Or am I missing something?
> >
> 
> Yes, we can set memory.max though it has different oom semantics and
> not really a replacement for memory.high.

Right, but you can define a safe cap this way and leave the high
watermark to the delegated cgroup.
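
Without the extra layer this could look roughly as follows. It is only a
sketch: the function name, path, size and uid are assumptions, and whether
a write to memory.high by the new owner is actually honoured depends on
how delegation is configured on the system.

```shell
# Sketch only: path, size and uid are illustrative assumptions.
# $1 path of the job cgroup, $2 hard cap, $3 uid of the delegatee
cap_and_hand_over_high() {
    # Hard cap written by the administrator; the file stays admin-owned.
    echo "$2" > "$1/memory.max" || return 1
    # Hand only the high watermark over to the delegatee.
    chown "$3" "$1/memory.high" || return 1
}
```

For example "cap_and_hand_over_high /sys/fs/cgroup/job1 8G 1000" run as
root, where 1000 is a hypothetical uid of the job owner.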
-- 
Michal Hocko
SUSE Labs