From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3B37FC4741F for ; Thu, 1 Oct 2020 14:33:35 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 930ED207FB for ; Thu, 1 Oct 2020 14:33:34 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=cmpxchg-org.20150623.gappssmtp.com header.i=@cmpxchg-org.20150623.gappssmtp.com header.b="pFDbQq9H" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 930ED207FB Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=cmpxchg.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 2C6E16B0062; Thu, 1 Oct 2020 10:33:34 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 277B56B0068; Thu, 1 Oct 2020 10:33:34 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 18D876B006C; Thu, 1 Oct 2020 10:33:34 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0201.hostedemail.com [216.40.44.201]) by kanga.kvack.org (Postfix) with ESMTP id DF11E6B0062 for ; Thu, 1 Oct 2020 10:33:33 -0400 (EDT) Received: from smtpin15.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 64E45824999B for ; Thu, 1 Oct 2020 14:33:33 +0000 (UTC) X-FDA: 77323599906.15.frog73_090e9742719c Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin15.hostedemail.com (Postfix) with ESMTP id 3C7961814C6ED for ; Thu, 1 Oct 2020 14:33:33 +0000 (UTC) X-HE-Tag: frog73_090e9742719c X-Filterd-Recvd-Size: 9843 Received: from mail-qk1-f195.google.com (mail-qk1-f195.google.com [209.85.222.195]) by imf40.hostedemail.com (Postfix) with ESMTP for ; Thu, 1 Oct 2020 14:33:32 +0000 (UTC) Received: by mail-qk1-f195.google.com with SMTP id x201so5488428qkb.11 for ; Thu, 01 Oct 2020 07:33:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=jO3BB8sVqiM7vkvwS1XKzZc6N0X8KTzVoqftN8TqfgY=; b=pFDbQq9H6572Zep0T0kiBK6O7gvcomiDy/QnxFgocFInKglLtah5+pLzvUOj3o9TPH +3da9GlUc4tqby6bzqiaELI5lmPN4OaxP9JadjXiNAl5peWgjvHsOEIxXZA8itf2TpYA JNviktKQyP0nsSSMAAW0MxjfMd7rEyANGzxySxkDh3bWE6xluydotD90yrNYtye8EqJ3 BvGaoNvS9pYwjfzCTbatP3d1ysGa5OH4ZqBeiLCC1vB107nF4lHhgKPjdxaGOrd5gfbM k2XnQf/MNMGoG7ONN84busjIoH2e/uYBL+MszKJcAeZxGFSM/C2lshGLybQO/xFPCg06 QReg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=jO3BB8sVqiM7vkvwS1XKzZc6N0X8KTzVoqftN8TqfgY=; b=JEl4mSASm49urB5ponjvrvC3qNR91OpLwbtysQDQHNWH4BnNC/+pLCgznYeVBx0qu8 hKRkaJ+mkg9dhxr/Abbb6XoYXNS+0KCHcd6B4OTZb5mXAUFKeKwOR+tMCnT6auxFeEwn Wahj2V8jx5n0ZQQ5k0V2B3+6NKDw922oVW6KLSH2EQHGlKRPIkPZ9tHBakWo5To1UTkA 3C80Xbzup8wppL9aHSgko3F0uDtA1ZRAaAX2590DzTAoDCZVJCHsAEYLF2pjlT51uhkw CATgkUBplSBdAHzeFy/4R1Cv6QnI7Zg6yuxE9BBpnRusMVDa33YxZdrFaWTz0GVbFkFJ a97w== X-Gm-Message-State: AOAM530XO6cO3xyJN0WcoB+/zVwxM+nb/bkNMQpCEhl6I50ir8VkY1fB ghfW6JSa4BJoECplPxAiECFu1w== X-Google-Smtp-Source: ABdhPJxL3ROIE5EEr8xfyH9JXGHvkoQMCRX5NFXfsVuxPCSEzPGogJNQe2fKUfXyH7bAIptu5sJr8g== X-Received: by 2002:a37:4f90:: with SMTP id d138mr7888531qkb.451.1601562811437; Thu, 01 Oct 2020 07:33:31 -0700 (PDT) Received: from localhost ([2620:10d:c091:480::1:4e22]) by smtp.gmail.com with ESMTPSA id m3sm6483356qkh.10.2020.10.01.07.33.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 01 Oct 2020 07:33:29 -0700 (PDT) Date: Thu, 1 Oct 2020 10:31:49 -0400 From: Johannes Weiner To: Shakeel Butt Cc: Michal Hocko , Roman Gushchin , Yang Shi , Greg Thelen , David Rientjes , Michal =?iso-8859-1?Q?Koutn=FD?= , Andrew Morton , Linux MM , Cgroups , LKML Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface Message-ID: <20201001143149.GA493631@cmpxchg.org> References: <20200909215752.1725525-1-shakeelb@google.com> <20200928210216.GA378894@cmpxchg.org> <20200929150444.GG2277@dhcp22.suse.cz> <20200929215341.GA408059@cmpxchg.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote: > On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner wrote: > > > > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > > > [...] > > > > My take is that a proactive reclaim feature, whose goal is never to > > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > > would ideally have: > > > > > > > > - a pressure or size target specified by userspace but with > > > > enforcement driven inside the kernel from the allocation path > > > > > > > > - the enforcement work NOT be done synchronously by the workload > > > > (something I'd argue we want for *all* memory limits) > > > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > > cgroup's memory allocations causing the work (again something I'd > > > > argue we want in general) > > > > > > > > - a delegatable knob that is independent of setting the maximum size > > > > of a container, as that expresses a different type of policy > > > > > > > > - if size target, self-limiting (ha) enforcement on a pressure > > > > threshold or stop enforcement when the userspace component dies > > > > > > > > Thoughts? > > > > > > Agreed with above points. What do you think about > > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. > > > > I definitely agree with what you wrote in this email for background > > reclaim. Indeed, your description sounds like what I proposed in > > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > > - what's missing from that patch is proper work attribution. > > > > > I assume that you do not want to override memory.high to implement > > > this because that tends to be tricky from the configuration POV as > > > you mentioned above. But a new limit (memory.middle for a lack of a > > > better name) to define the background reclaim sounds like a good fit > > > with above points. > > > > I can see that with a new memory.middle you could kind of sort of do > > both - background reclaim and proactive reclaim. > > > > That said, I do see advantages in keeping them separate: > > > > 1. Background reclaim is essentially an allocation optimization that > > we may want to provide per default, just like kswapd. > > > > Kswapd is tweakable of course, but I think actually few users do, > > and it works pretty well out of the box. It would be nice to > > provide the same thing on a per-cgroup basis per default and not > > ask users to make decisions that we are generally better at making. > > > > 2. Proactive reclaim may actually be better configured through a > > pressure threshold rather than a size target. > > > > As per above, the goal is not to be punitive or containing. The > > goal is to keep the LRUs warm and move the colder pages to disk. > > > > But how aggressively do you run reclaim for this purpose? What > > target value should a user write to such a memory.middle file? > > > > For one, it depends on the job. A batch job, or a less important > > background job, may tolerate higher paging overhead than an > > interactive job. That means more of its pages could be trimmed from > > RAM and reloaded on-demand from disk. > > > > But also, it depends on the storage device. If you move a workload > > from a machine with a slow disk to a machine with a fast disk, you > > can page more data in the same amount of time. That means while > > your workload tolerances stays the same, the faster the disk, the > > more aggressively you can do reclaim and offload memory. > > > > So again, what should a user write to such a control file? > > > > Of course, you can approximate an optimal target size for the > > workload. You can run a manual workingset analysis with page_idle, > > damon, or similar, determine a hot/cold cutoff based on what you > > know about the storage characteristics, then echo a number of pages > > or a size target into a cgroup file and let kernel do the reclaim > > accordingly. The drawbacks are that the kernel LRU may do a > > different hot/cold classification than you did and evict the wrong > > pages, the storage device latencies may vary based on overall IO > > pattern, and two equally warm pages may have very different paging > > overhead depending on whether readahead can avert a major fault or > > not. So it's easy to overshoot the tolerance target and disrupt the > > workload, or undershoot and have stale LRU data, waste memory etc. > > > > You can also do a feedback loop, where you guess an optimal size, > > then adjust based on how much paging overhead the workload is > > experiencing, i.e. memory pressure. The drawbacks are that you have > > to monitor pressure closely and react quickly when the workload is > > expanding, as it can be potentially sensitive to latencies in the > > usec range. This can be tricky to do from userspace. > > > > This is actually what we do in our production i.e. feedback loop to > adjust the next iteration of proactive reclaim. That's what we do also right now. It works reasonably well, the only two pain points are/have been the reaction time under quick workload expansion and inadvertently forcing the workload into direct reclaim. > We eliminated the IO or slow disk issues you mentioned by only > focusing on anon memory and doing zswap. Interesting, may I ask how the file cache is managed in this setup? > > So instead of asking users for a target size whose suitability > > heavily depends on the kernel's LRU implementation, the readahead > > code, the IO device's capability and general load, why not directly > > ask the user for a pressure level that the workload is comfortable > > with and which captures all of the above factors implicitly? Then > > let the kernel do this feedback loop from a per-cgroup worker. > > I am assuming here by pressure level you are referring to the PSI like > interface e.g. allowing the users to tell about their jobs that X > amount of stalls in a fixed time window is tolerable. Right, essentially the same parameters that psi poll() would take.