From: Shakeel Butt
Date: Tue, 22 Sep 2020 11:10:17 -0700
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
To: Michal Hocko, Minchan Kim
Cc: Johannes Weiner, Roman Gushchin, Greg Thelen, David Rientjes, Michal Koutný, Andrew Morton, Linux MM, Cgroups, LKML, Yang Shi
In-Reply-To: <20200922165527.GD12990@dhcp22.suse.cz>
References: <20200909215752.1725525-1-shakeelb@google.com> <20200921163055.GQ12990@dhcp22.suse.cz> <20200922114908.GZ12990@dhcp22.suse.cz> <20200922165527.GD12990@dhcp22.suse.cz>

On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko wrote:
>
> On Tue 22-09-20 08:54:25, Shakeel Butt wrote:
> > On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko wrote:
> > >
> > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote:
> [...]
> > > > Let me add one more point. Even if the high limit reclaim is swift,
> > > > it can still take 100s of usecs. Most of our jobs are anon-only and
> > > > we use zswap. Compressing a page can take a couple of usecs, so 100s
> > > > of usecs in limit reclaim is normal. For latency-sensitive jobs,
> > > > this amount of hiccups does matter.
> > >
> > > Understood. But isn't this an implementation detail of zswap? Can it
> > > offload some of the heavy lifting to a different context and reduce
> > > the general overhead?
> > >
> >
> > Are you saying doing the compression asynchronously? Similar to how
> > the disk-based swap triggers the writeback and puts the page back on
> > the LRU, so the next time reclaim sees it, it will be instantly
> > reclaimed? Or send the batch of pages to be compressed to a different
> > CPU and wait for the completion?
>
> Yes.
> Adding Minchan, if he has more experience/opinion on async swap on
> zram/zswap.
> [...]
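The "send the batch of pages to a different CPU" idea above can be sketched in user space. This is only an illustration of the pattern under discussion, not kernel code: the hot path hands the batch to a worker pool instead of compressing each page inline, and gathers the completions later. The page contents, pool size, and helper names here are all assumptions for the sketch.

```python
# User-space sketch of offloading compression off the hot path, as
# discussed above for zswap: submit a batch to worker threads and
# collect results later, instead of compressing inline per page.
import zlib
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096

def compress_page(page: bytes) -> bytes:
    # Stand-in for the zswap compression step.
    return zlib.compress(page)

def reclaim_batch_async(pages, pool):
    # Submit the whole batch and return immediately; the caller
    # (the "reclaim path") is not blocked on each page.
    return [pool.submit(compress_page, p) for p in pages]

pages = [bytes([i % 256]) * PAGE_SIZE for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = reclaim_batch_async(pages, pool)
    compressed = [f.result() for f in futures]  # gather completions later

assert all(zlib.decompress(c) == p for c, p in zip(compressed, pages))
```

Whether this helps latency in the kernel depends on where the completion is waited for; if the reclaim path still waits synchronously for the whole batch, only the per-page serialization cost is hidden.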
> > > You are right that misconfigured limits can result in problems. But
> > > such a configuration should be quite easy to spot, which is not the
> > > case for targeted reclaim calls which do not leave any footprints
> > > behind. Existing interfaces are also trying not to expose internal
> > > implementation details as much as possible. You are proposing a very
> > > targeted interface to finely control the memory reclaim. There is a
> > > risk that userspace will start depending on a specific reclaim
> > > implementation/behavior and future changes would be prone to
> > > regressions in workloads relying on that. So effectively, any user
> > > space memory reclaimer would need to be tuned to a specific
> > > implementation of the memory reclaim.
> >
> > I don't see the exposure of internal memory reclaim implementation.
> > The interface is very simple: reclaim a given amount of memory. Either
> > the kernel will reclaim less memory or it will over-reclaim. In case
> > of reclaiming less memory, the user space can retry, given there is
> > enough reclaimable memory. For the over-reclaim case, the user space
> > will back off for a longer time. How are the internal reclaim
> > implementation details exposed?
>
> In an ideal world, yes. A feedback mechanism would be independent of the
> particular implementation. But reality tends to disagree quite often.
> Once we provide a tool there will be users using it to the best of
> their knowledge. Very often as a hammer. This is what the history of
> kernel regressions and "we have to revert an obvious fix because
> userspace depends on an undocumented behavior which happened to work
> for some time" has taught us the hard way.
>
> I really do not want to deal with reports where a new heuristic in the
> memory reclaim breaks something just because the reclaim takes slightly
> longer or over/under-reclaims differently, so the existing assumptions
> break and the overall balancing from userspace breaks.
> This might be a shiny exception of course. And please note that I am
> not saying that the interface is completely wrong or unacceptable. I
> just want to be absolutely sure we cannot move forward with the
> existing API space that we have.
>
> So far I have learned that you are primarily working around an
> implementation detail in zswap which is doing the swapout path
> directly in the pageout path.

Wait, how did you reach this conclusion? I have explicitly said that we
are not using uswapd-like functionality in production. We are using this
interface for proactive reclaim, and proactive reclaim is not a
workaround for an implementation detail in zswap.

> That sounds like a very bad reason to add a new interface. You are
> right that there are likely other usecases for this new interface -
> mostly to emulate drop_caches - but I believe those are quite misguided
> as well and we should work harder to help them out to use the existing
> APIs.

I am not really understanding your concern specific to the new API. All
of your concerns (user expectation of reclaim time or over/under-reclaim)
are still possible with the existing API, i.e. memory.high.

> Last but not least the memcg background reclaim is something that
> should be possible without a new interface.

So, it comes down to either adding more functionality/semantics to
memory.high or introducing a new simple interface. I am fine with either
one, but IMO a convoluted memory.high might have a higher maintenance
cost. I can send a patch to add the functionality to memory.high, but I
would like to get Johannes's opinion first.

Shakeel
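[Editor's note] The retry/backoff semantics described for the proposed interface ("either the kernel will reclaim less memory or it will over-reclaim") can be sketched as a user-space loop. This is a hypothetical illustration: the cgroup path, the `memory.reclaim` file (still only proposed in this thread), and the retry/backoff policy are assumptions, not part of the patch. The writer is injectable so the policy itself can be exercised without a real cgroup.

```python
# Sketch of a user-space proactive reclaimer against the proposed
# per-memcg reclaim interface: ask for N bytes, retry if the kernel
# reclaimed less, back off if it over-reclaimed.
import time

CGROUP = "/sys/fs/cgroup/workload"  # assumed cgroup v2 path


def reclaim(nbytes, write=None, retries=3, backoff_s=0.0):
    """Try to reclaim nbytes; return the total actually reclaimed.

    write(n) performs one reclaim request and returns the number of
    bytes reclaimed. The default writes the hypothetical
    memory.reclaim file; tests inject a fake instead.
    """
    if write is None:
        def write(n):
            with open(f"{CGROUP}/memory.reclaim", "w") as f:
                f.write(str(n))
            return n  # assume full success for the sketch

    remaining = nbytes
    for _ in range(retries):
        done = write(remaining)
        if done >= remaining:
            if done > remaining:
                time.sleep(backoff_s)  # over-reclaim: back off longer
            return nbytes - remaining + done
        remaining -= done              # under-reclaim: retry the rest
    return nbytes - remaining
```

With a fake writer that reclaims half of each request, three retries recover 87 of 100 requested bytes; a writer that always succeeds returns the full amount on the first pass. The same loop could drive the memory.high alternative by temporarily lowering the limit and restoring it, which is exactly the extra complexity being weighed above.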