Date: Tue, 18 Aug 2020 09:49:00 -0400
From: Johannes Weiner
To: peterz@infradead.org
Cc: Michal Hocko, Waiman Long, Andrew Morton, Vladimir Davydov,
    Jonathan Corbet, Alexey Dobriyan, Ingo Molnar, Juri Lelli,
    Vincent Guittot,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/8] memcg: Enable fine-grained per process memory control
Message-ID: <20200818134900.GA829964@cmpxchg.org>
References: <20200817140831.30260-1-longman@redhat.com> <20200818091453.GL2674@hirez.programming.kicks-ass.net> <20200818092617.GN28270@dhcp22.suse.cz> <20200818095910.GM2674@hirez.programming.kicks-ass.net> <20200818100516.GO28270@dhcp22.suse.cz> <20200818101844.GO2674@hirez.programming.kicks-ass.net>
In-Reply-To: <20200818101844.GO2674@hirez.programming.kicks-ass.net>

On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote:
> What you need is a feedback loop against the rate of freeing pages, and
> when you near the saturation point, the allocation rate should exactly
> match the freeing rate.

IO throttling solves a slightly different problem. IO occurs in parallel
to the workload's execution stream, and you're trying to take the
workload from dirtying at CPU speed to rate-matching against the
independent IO stream.

With memory allocations, though, freeing happens from inside the
execution stream of the workload. If you throttle allocations, you're
most likely throttling the freeing rate as well. And you'll slow down
reclaim scanning by the same amount as the page references, so it's not
making reclaim any more successful either. The alloc/use/free
(im)balance is an inherent property of the workload, regardless of the
speed at which you execute it.

So the goal here is different. We're not trying to pace the workload
into some form of sustainability. Rather, it's for OOM handling. When we
detect that the workload's alloc/use/free pattern is unsustainable given
the available memory, we slow it down just enough to allow userspace to
implement OOM policy and job priorities (on containerized hosts these
tend to be too complex to express in the kernel's OOM scoring system).

The exponential curve makes it look like we're trying to do some type of
feedback system, but it's really only there to let minor infractions
pass and to throttle unsustainable expansion ruthlessly. Drop-behind
reclaim can be a bit bumpy because we batch on the allocation side as
well as on the reclaim side, hence the fuzz factor there.
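To make the shape of that curve concrete, here is a quick userspace
sketch. It is not the actual memcg code; the quadratic form, the
constants, and the helper names are assumptions chosen only to show the
idea that usage slightly above the limit gets a negligible delay while
runaway expansion gets throttled hard:

/*
 * Illustrative sketch only -- not the kernel implementation. It models
 * a penalty curve where the injected delay grows with the square of the
 * overage above the memory.high limit, so minor infractions pass almost
 * unpunished while unsustainable expansion is slowed down ruthlessly.
 * All constants are made up for demonstration.
 */
#include <stdio.h>

#define PENALTY_SCALE	4096UL	/* assumed: fraction-of-limit granularity */
#define MAX_DELAY_MS	2000UL	/* assumed: cap on a single throttle sleep */

/* Delay (in ms) to inject for a given usage/limit pair. */
static unsigned long throttle_delay_ms(unsigned long usage_pages,
				       unsigned long high_pages)
{
	unsigned long overage, penalty;

	if (high_pages == 0 || usage_pages <= high_pages)
		return 0;

	/* Overage as a fraction of the limit, in 1/PENALTY_SCALE units. */
	overage = (usage_pages - high_pages) * PENALTY_SCALE / high_pages;

	/* Quadratic curve: forgiving near the limit, harsh far above it. */
	penalty = overage * overage / PENALTY_SCALE;

	return penalty < MAX_DELAY_MS ? penalty : MAX_DELAY_MS;
}

int main(void)
{
	unsigned long high = 100000;	/* memory.high, in pages */
	unsigned long usages[] = { 100500, 102000, 110000, 150000, 300000 };

	for (unsigned int i = 0; i < sizeof(usages) / sizeof(usages[0]); i++)
		printf("usage=%lu pages -> delay=%lu ms\n",
		       usages[i], throttle_delay_ms(usages[i], high));
	return 0;
}

With these made-up numbers, 0.5% over the limit yields no delay at all,
10% over yields a few tens of milliseconds, and 3x the limit saturates
at the cap -- the "let minor infractions pass, throttle unsustainable
expansion" behavior described above.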