From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 4 Dec 2022 01:30:06 -0800
Message-ID: <20221204093008.2620459-1-almasrymina@google.com>
Subject: [PATCH v2] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
From: Mina Almasry <almasrymina@google.com>
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Huang Ying, Yang Shi, Yosry Ahmed, weixugc@google.com,
	fvdl@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim"")
enabled demotion in memcg reclaim, which is the right thing to do, but
introduced a regression in the behavior of try_to_free_mem_cgroup_pages().

The callers of try_to_free_mem_cgroup_pages() expect it to attempt to
reclaim - not demote - nr_pages from the cgroup; i.e. the memory usage of
the cgroup should be reduced by nr_pages. The callers also expect
try_to_free_mem_cgroup_pages() to return the number of pages reclaimed,
not demoted. However, try_to_free_mem_cgroup_pages() unconditionally
counts demoted pages as reclaimed pages, so in practice it will often
demote nr_pages and return the number of demoted pages to the caller.
Demoted pages don't lower the memcg usage as the caller requested.

I suspect various things work suboptimally, or don't work at all, on
memory tiered systems due to this:

- memory.high enforcement likely doesn't work (it just demotes nr_pages
  instead of lowering the memcg usage by nr_pages).
- try_charge_memcg() will keep retrying the charge while
  try_to_free_mem_cgroup_pages() is just demoting pages and not actually
  making any room for the charge.
- memory.reclaim has a wonky interface.
  It advertises to the user that it reclaims the provided amount, but it
  may actually demote that amount instead.

There may be more effects from this issue.

To fix these issues, make shrink_folio_list() only count pages demoted
from inside of sc->nodemask to outside of sc->nodemask as 'reclaimed'.

For callers such as reclaim_high() or try_charge_memcg() that leave
sc->nodemask set to NULL, try_to_free_mem_cgroup_pages() will try to
actually reclaim nr_pages and return the number of pages reclaimed. No
demoted pages count towards the nr_pages requirement.

For callers such as memory_reclaim() that set sc->nodemask,
try_to_free_mem_cgroup_pages() will free nr_pages from that nodemask,
with either demotion or reclaim.

Tested this change using the memory.reclaim interface. With this change,

  echo "1m" > memory.reclaim

will cause freeing of 1m of memory from the cgroup regardless of the
demotions happening inside, and

  echo "1m nodes=0" > memory.reclaim

will cause freeing of 1m of node 0, by demotion if a demotion target is
available, and by reclaim if no demotion target is available.

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

This is developed on top of mm-unstable, largely to test with the
memory.reclaim nodes= arg and ensure the fix is compatible with it.

v2:
- Shortened the commit message a bit.
- Fixed an issue when demotion falls back to other allowed target nodes
  returned by node_get_allowed_targets(), as Wei suggested.
Cc: weixugc@google.com

---
 include/linux/memory-tiers.h |  7 +++++--
 mm/memory-tiers.c            | 10 +++++++++-
 mm/vmscan.c                  | 20 +++++++++++++++++---
 3 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index fc9647b1b4f9..f3f359760fd0 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -38,7 +38,8 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
-void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets,
+			      nodemask_t *demote_from_targets);
 bool node_is_toptier(int node);
 #else
 static inline int next_demotion_node(int node)
@@ -46,7 +47,9 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
-static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+static inline void node_get_allowed_targets(pg_data_t *pgdat,
+					    nodemask_t *targets,
+					    nodemask_t *demote_from_targets)
 {
 	*targets = NODE_MASK_NONE;
 }
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index c734658c6242..7f8f0b5de2b3 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -264,7 +264,8 @@ bool node_is_toptier(int node)
 	return toptier;
 }
 
-void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets,
+			      nodemask_t *demote_from_targets)
 {
 	struct memory_tier *memtier;
 
@@ -280,6 +281,13 @@ void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 	else
 		*targets = NODE_MASK_NONE;
 	rcu_read_unlock();
+
+	/*
+	 * Exclude the demote_from_targets from the allowed targets if we're
+	 * trying to demote from a specific set of nodes.
+	 */
+	if (demote_from_targets)
+		nodes_andnot(*targets, *targets, *demote_from_targets);
 }
 
 /**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b42ac9ad755..97ca0445b5dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1590,7 +1590,8 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      nodemask_t *demote_from_nodemask)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1614,7 +1615,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
-	node_get_allowed_targets(pgdat, &allowed_mask);
+	node_get_allowed_targets(pgdat, &allowed_mask, demote_from_nodemask);
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_page, NULL,
@@ -1653,6 +1654,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	LIST_HEAD(free_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_reclaimed = 0;
+	unsigned int nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -2085,7 +2087,19 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, sc->nodemask);
+
+	/*
+	 * Only count demoted folios as reclaimed if the caller has requested
+	 * demotion from a specific nodemask. In this case pages inside the
+	 * nodemask have been demoted to outside the nodemask and we can count
+	 * these pages as reclaimed. If no nodemask is passed, then the caller
+	 * is requesting reclaim from all memory, which should not count
+	 * demoted pages.
+	 */
+	if (sc->nodemask)
+		nr_reclaimed += nr_demoted;
+
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
-- 
2.39.0.rc0.267.gcb52ba06e7-goog