From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A77BFC3A5A7 for ; Wed, 7 Dec 2022 01:56:12 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 277BD8E0005; Tue, 6 Dec 2022 20:56:12 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 2280D8E0001; Tue, 6 Dec 2022 20:56:12 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 0C9778E0005; Tue, 6 Dec 2022 20:56:12 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id ECABF8E0001 for ; Tue, 6 Dec 2022 20:56:11 -0500 (EST) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id A8CEEC0BA0 for ; Wed, 7 Dec 2022 01:56:11 +0000 (UTC) X-FDA: 80213844942.15.A976960 Received: from mail-vs1-f48.google.com (mail-vs1-f48.google.com [209.85.217.48]) by imf24.hostedemail.com (Postfix) with ESMTP id 42398180009 for ; Wed, 7 Dec 2022 01:56:11 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=p8ZTQ2rU; spf=pass (imf24.hostedemail.com: domain of almasrymina@google.com designates 209.85.217.48 as permitted sender) smtp.mailfrom=almasrymina@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1670378171; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mL42ZLeHwCOgm2dvIiRUP/S/P8aaOI55orVdQTEq1so=; b=MRuDDGczKhqbXXVfVsSe9i6eNY3eQWh7rpH9YM0gw/Y5oNJ6d1NIbshftxjN0v+v5WIXjd m3z38wJSgBQi56+Tg+N2ooyLlOtLGUJ7Yrnft0sIEVel/1PKNHBQMCxsV7A9XZ4/M5SqF2 gPlWsbdItSBdeqHJMZRLbPNhDGEdmHs= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=p8ZTQ2rU; spf=pass (imf24.hostedemail.com: domain of almasrymina@google.com designates 209.85.217.48 as permitted sender) smtp.mailfrom=almasrymina@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1670378171; a=rsa-sha256; cv=none; b=TB/A1e2pcWioL0y6e65IUFuHrKlHzsvnPCVX7HmLNdefDQ91gX0oqhh0MEPaDiICz49hjU YsQ4T4s02ZgIVMx4fxd0I7V0fvJ28KEP0n96cYeb5BWnYn6W2so4jYcqiYug6l/CpUDf13 hZblju+/4X08P7Kk02a5nolmLliTbW8= Received: by mail-vs1-f48.google.com with SMTP id i2so15982717vsc.1 for ; Tue, 06 Dec 2022 17:56:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=mL42ZLeHwCOgm2dvIiRUP/S/P8aaOI55orVdQTEq1so=; b=p8ZTQ2rU5P4QOtPRFnTYHuXdD+XDdWS37wPhaV6Vzsl06SEO63uhm5y8g5EY4pVhHt vZ/WZm1jjEnByccyvCIV8pS12Gnll4Vns3pQ1kk8YJWbgGaiw4TrKIGGyqj45oZ1YFFe 0dSIWF7rVcT5bNaqFYEy6JH/GmKfvqQeTdqyZSRbphVYy9hGksGun8HPh+NqpWfPhkiy SdgwG8QD74z9eeJc+IN5LumC8Co5cOyVoWz9UpdZKaUVBSYDPaJRPkyfx0LtPQD1TRzg 7cgd86WssXlHvN31iH8Z7rZ1uNOY7NiiNA0pJE/4KIgCyQKo9XZ0v34+6FaEQbMX65kq z+wQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=mL42ZLeHwCOgm2dvIiRUP/S/P8aaOI55orVdQTEq1so=; b=dJY6QG0RQWcYAyXbaYoQv2lvxuQPszmaR6gqFXDNfV3OlgKs0BGYtpWmDzbsUb34Na o42mxeF2k8Q/ZHXfPBBdVcWsVsSlERtmRgJ+TivrtBjSj9wlu6c4/blhihH4zyoESROs QHx0XTGWd2bQZViA27Kg6G44+k09Cc12zg0gp6+5ckB5IipKy0Cmo9xrldFxct621bNm aaeIptY8GEUj9GAHZH21oIYisMKDgVwaaAlr0N6zscRED8mtep1MBfzl7KitX5Xy19If Ms/Gki1nNuZJT5iqrBR6jhQQYQ3PzeJhbeVLlec9dp41UtLGCPXcLoJdWVDXbQm2GnRG +7UQ== X-Gm-Message-State: ANoB5pkOqKQkDTWWT9YREXdz+SJsR5sYPyRRky0mZr5/B2kdCqJ9KUOv VoXs4gSLSbT/ko8oUJ3w8o+q8R8DliS7oQpuFF+cDA== X-Google-Smtp-Source: AA0mqf7Q5B67Pew3oFXM7oertV7l9BHogxhEnaO+dnmzmB/oGQH6BC7i3f6rXUCoLEzBckkxwigZoGYxLfaVPlELQLI= X-Received: by 2002:a05:6102:54a5:b0:3b0:7462:a88c with SMTP id bk37-20020a05610254a500b003b07462a88cmr34065725vsb.49.1670378170228; Tue, 06 Dec 2022 17:56:10 -0800 (PST) MIME-Version: 1.0 References: <20221206023406.3182800-1-almasrymina@google.com> In-Reply-To: From: Mina Almasry Date: Tue, 6 Dec 2022 17:55:58 -0800 Message-ID: Subject: Re: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems To: Michal Hocko Cc: Andrew Morton , Johannes Weiner , Roman Gushchin , Shakeel Butt , Muchun Song , Huang Ying , Yang Shi , Yosry Ahmed , weixugc@google.com, fvdl@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 42398180009 X-Stat-Signature: fwjwmiex6w3jsog3e9y5rq1p6jjner55 X-Spamd-Result: default: False [-2.90 / 9.00]; BAYES_HAM(-6.00)[100.00%]; SORBS_IRL_BL(3.00)[209.85.217.48:from]; BAD_REP_POLICIES(0.10)[]; RCVD_NO_TLS_LAST(0.10)[]; MIME_GOOD(-0.10)[text/plain]; MIME_TRACE(0.00)[0:+]; RCVD_COUNT_TWO(0.00)[2]; FROM_EQ_ENVFROM(0.00)[]; DMARC_POLICY_ALLOW(0.00)[google.com,reject]; RCPT_COUNT_TWELVE(0.00)[13]; DKIM_TRACE(0.00)[google.com:+]; TO_MATCH_ENVRCPT_SOME(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[linux-mm@kvack.org]; R_DKIM_ALLOW(0.00)[google.com:s=20210112]; ARC_SIGNED(0.00)[hostedemail.com:s=arc-20220608:i=1]; FROM_HAS_DN(0.00)[]; R_SPF_ALLOW(0.00)[+ip4:209.85.128.0/17]; TO_DN_SOME(0.00)[]; ARC_NA(0.00)[] X-Rspam-User: X-HE-Tag: 1670378171-892825 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Dec 6, 2022 at 11:55 AM Michal Hocko wrote: > > On Tue 06-12-22 08:06:51, Mina Almasry wrote: > > On Tue, Dec 6, 2022 at 4:20 AM Michal Hocko wrote: > > > > > > On Mon 05-12-22 18:34:05, Mina Almasry wrote: > > > > commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > reclaim"") enabled demotion in memcg reclaim, which is the right thing > > > > to do, however, it introduced a regression in the behavior of > > > > try_to_free_mem_cgroup_pages(). > > > > > > > > The callers of try_to_free_mem_cgroup_pages() expect it to attempt to > > > > reclaim - not demote - nr_pages from the cgroup. I.e. the memory usage > > > > of the cgroup should reduce by nr_pages. The callers expect > > > > try_to_free_mem_cgroup_pages() to also return the number of pages > > > > reclaimed, not demoted. > > > > > > > > However, what try_to_free_mem_cgroup_pages() actually does is it > > > > unconditionally counts demoted pages as reclaimed pages. So in practice > > > > when it is called it will often demote nr_pages and return the number of > > > > demoted pages to the caller. Demoted pages don't lower the memcg usage, > > > > and so try_to_free_mem_cgroup_pages() is not actually doing what the > > > > callers want it to do. > > > > > > > > Various things work suboptimally on memory tiered systems or don't work > > > > at all due to this: > > > > > > > > - memory.high enforcement likely doesn't work (it just demotes nr_pages > > > > instead of lowering the memcg usage by nr_pages). > > > > - try_charge_memcg() will keep retrying the charge while > > > > try_to_free_mem_cgroup_pages() is just demoting pages and not actually > > > > making any room for the charge. > > > > > > This has been brought up during the review https://lore.kernel.org/all/YoYTEDD+c4GT0xYY@dhcp22.suse.cz/ > > > > > > > Ah, I did indeed miss this. Thanks for the pointer. However I don't > > understand this bit from your email (sorry I'm probably missing > > something): > > > > "I suspect this is rather unlikely situation, though. The last tear > > (without any fallback) should have some memory to reclaim most of > > the time." > > > > Reading the code in try_charge_memcg(), I don't see the last retry for > > try_to_free_mem_cgroup_pages() do anything special. My concern here is > > that try_charge_memcg() calls try_to_free_mem_cgroup_pages() > > MAX_RECLAIM_RETRIES times. Each time that call may demote pages and > > report back that it was able to 'reclaim' memory, but the charge keeps > > failing because the memcg reclaim didn't actually make room for the > > charge. What happens in this case? My understanding is that the memcg > > oom-killer gets wrongly invoked. > > The memcg reclaim shrinks from all zones in the allowed zonelist. In > general from all nodes. So unless the lower tier is outside of this > zonelist then there is a zone to reclaim from which cannot demote. > Correct? > Ah, thanks for pointing this out. I did indeed miss that the memcg reclaim tries to apply pressure equally to all the nodes. With some additional testing I'm able to see what you said: there could be no premature oom kill invocation because generally the memcg reclaim will find some pages to reclaim from lower tier nodes. I do find that the first call to try_to_free_mem_cgroup_pages() sometimes will mostly demote and not do much reclaim. I haven't been able to fully track the cause of that down but I suspect that the first call in my test will find most of the cgroup's memory on top tier nodes. However we do retry a bunch of times before we invoke oom, and in my testing subsequent calls will find plenty of memory in the lower tier nodes that it can reclaim. I'll update the commit message in the next version. > > > > - memory.reclaim has a wonky interface. It advertises to the user it > > > > reclaims the provided amount but it will actually often demote that > > > > amount. > > > > > > > > There may be more effects to this issue. > > > > > > > > To fix these issues I propose shrink_folio_list() to only count pages > > > > demoted from inside of sc->nodemask to outside of sc->nodemask as > > > > 'reclaimed'. > > > > > > Could you expand on why the node mask matters? From the charge point of > > > view it should be completely uninteresting as the charge remains. > > > > > > I suspect we really need to change to reclaim metrics for memcg reclaim. > > > In the memory balancing reclaim we can indeed consider demotions as a > > > reclaim because the memory is freed in the end but for the memcg reclaim > > > we really should be counting discharges instead. No demotion/migration will > > > free up charges. > > > > I think what you're describing is exactly what this patch aims to do. > > I'm proposing an interface change to shrink_folio_list() such that it > > only counts demoted pages as reclaimed iff sc->nodemask is provided by > > the caller and the demotion removed pages from inside sc->nodemask to > > outside sc->nodemask. In this case: > > > > 1. memory balancing reclaim would pass sc->nodemask=nid to > > shrink_folio_list() indicating that it should count pages demoted from > > sc->nodemask as reclaimed. > > > > 2. memcg reclaim would pass sc->nodemask=NULL to shrink_folio_list() > > indicating that it is looking for reclaim across all nodes and no > > demoted pages should count as reclaimed. > > > > Sorry if the commit message was not clear. I can try making it clearer > > in the next version but it's already very long. > > Either I am missing something or I simply do not understand why you are > hooked into nodemask so much. Why cannot we have a simple rule that > only global reclaim considers demotions as nr_reclaimed? > Thanks. I think this approach would work for most callers. My issue here is properly supporting the recently added nodes= arg[1] to memory.reclaim. If the user specifies all nodes or provides no arg, I'd like to treat it as memcg reclaim which doesn't count demotions. If the user provides the top tier nodes, I would like to count demotions as this interface is the way to trigger proactive demotion from top tier nodes. I guess I can check which args the user is passing and decide whether or not to count demotions. But the user right now can specify any combination of nodes, some of them top tier, some lower tier, some in the middle. I can return -EINVAL for that, but that seems like a shame. I thought a generic way to address this was what I'm doing here, i.e. counting pages demoted from the nodemask as reclaimed. Is that not acceptable? Is -EINVAL preferred here? [1] https://lore.kernel.org/linux-mm/87tu2a1v3y.fsf@yhuang6-desk2.ccr.corp.intel.com/ > -- > Michal Hocko > SUSE Labs