Date: Thu, 8 Dec 2022 12:54:42 +0100
From: Michal Hocko
To: Mina Almasry
Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Huang Ying, Yang Shi, Yosry Ahmed, weixugc@google.com,
	fvdl@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] [mm-unstable] mm: Fix memcg reclaim on memory tiered systems
References: <20221206023406.3182800-1-almasrymina@google.com>

On Thu 08-12-22 01:00:40, Mina Almasry wrote:
> On Thu, Dec 8, 2022 at 12:09 AM Michal Hocko wrote:
> >
> > On Wed 07-12-22 13:43:55, Mina Almasry wrote:
> > > On Wed, Dec 7, 2022 at 3:12 AM Michal Hocko wrote:
> > [...]
> > > > Anyway a proper nr_reclaimed tracking should be rather straightforward
> > > > but I do not expect to make a big difference in practice
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 026199c047e0..1b7f2d8cb128 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1633,7 +1633,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > >  	LIST_HEAD(ret_folios);
> > > >  	LIST_HEAD(free_folios);
> > > >  	LIST_HEAD(demote_folios);
> > > > -	unsigned int nr_reclaimed = 0;
> > > > +	unsigned int nr_reclaimed = 0, nr_demoted = 0;
> > > >  	unsigned int pgactivate = 0;
> > > >  	bool do_demote_pass;
> > > >  	struct swap_iocb *plug = NULL;
> > > > @@ -2065,8 +2065,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > > >  	}
> > > >  	/* 'folio_list' is always empty here */
> > > >
> > > > -	/* Migrate folios selected for demotion */
> > > > -	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
> > > > +	/*
> > > > +	 * Migrate folios selected for demotion.
> > > > +	 * Do not consider demoted pages to be reclaimed for the memcg reclaim
> > > > +	 * because no charges are really freed during the migration. Global
> > > > +	 * reclaim aims at releasing memory from nodes/zones so consider
> > > > +	 * demotion to reclaim memory.
> > > > +	 */
> > > > +	nr_demoted += demote_folio_list(&demote_folios, pgdat);
> > > > +	if (!cgroup_reclaim(sc))
> > > > +		nr_reclaimed += nr_demoted;
> > > > +
> > > >  	/* Folios that could not be demoted are still in @demote_folios */
> > > >  	if (!list_empty(&demote_folios)) {
> > > >  		/* Folios which weren't demoted go back on @folio_list for retry: */
> > > >
> > > > [...]
> > >
> > > Thank you again, but this patch breaks the memory.reclaim nodes arg
> > > for me. This is my test case. I run it on a machine with 2 memory
> > > tiers.
> > >
> > > Memory tier 1= nodes 0-2
> > > Memory tier 2= node 3
> > >
> > > mkdir -p /sys/fs/cgroup/unified/test
> > > cd /sys/fs/cgroup/unified/test
> > > echo $$ > cgroup.procs
> > > head -c 500m /dev/random > /tmp/testfile
> > > echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> > > echo "1m nodes=0-2" > memory.reclaim
> > >
> > > In my opinion the expected behavior is for the kernel to demote 1mb of
> > > memory from nodes 0-2 to node 3.
> > >
> > > Actual behavior on the tip of mm-unstable is as expected.
> > >
> > > Actual behavior with your patch cherry-picked to mm-unstable is that
> > > the kernel demotes all 500mb of memory from nodes 0-2 to node 3, and
> > > returns -EAGAIN to the user. This may be the correct behavior you're
> > > intending, but it completely breaks the use case I implemented the
> > > nodes= arg for and listed on the commit message of that change.
> >
> > Yes, strictly speaking the behavior is correct albeit unexpected. You
> > have told the kernel to _reclaim_ that much memory but demotion are
> > simply aging handling rather than a reclaim if the demotion target has a
> > lot of memory free.
>
> Yes, by the strict definition of reclaim, you're completely correct.
> But in reality earlier I proposed a patch to the kernel that disables
> demotion in proactive reclaim. That would have been a correct change
> by the strict definition of reclaim, but Johannes informed me that
> meta already has a dependency on proactive reclaim triggering demotion
> and directed me to add a nodes= arg to memory.reclaim to trigger
> demotion instead, to satisfy both use cases. Seems both us and meta
> are using this interface to trigger both reclaim and demotion, despite
> the strict definition of the word?

Well, demotion is a part of aging, and aging is a part of reclaim, so I
believe we want both, with demotion being mostly an implementation
detail. If you want very precise control then the nodemask should
get you there.

[...]
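To illustrate why the test case above ends up demoting everything, here is a toy userspace model of the accounting. This is only a sketch and not kernel code: model_reclaim(), struct reclaim_result and the batch size are made up for illustration, the lower tier is assumed to always have room, and the sizes assume 4 KiB pages (500 MiB = 128000 pages, 1 MiB = 256 pages).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only; every name here is hypothetical, none is a kernel API. */
struct reclaim_result {
	unsigned long nr_reclaimed;	/* pages credited towards the request */
	unsigned long nr_demoted;	/* pages migrated to the lower tier */
	int err;			/* 0 on success, -EAGAIN if target not met */
};

/*
 * Model a request to reclaim 'target' pages from 'resident' top-tier pages,
 * processed in batches of 'batch' pages. Assumption: every batch can be
 * demoted because the lower tier has plenty of free memory.
 * 'count_demotion' selects whether demoted pages count as reclaimed.
 */
static struct reclaim_result model_reclaim(unsigned long resident,
					   unsigned long target,
					   unsigned long batch,
					   bool count_demotion)
{
	struct reclaim_result r = { 0, 0, 0 };

	assert(batch > 0);

	while (r.nr_reclaimed < target && resident > 0) {
		unsigned long n = resident < batch ? resident : batch;

		resident -= n;
		r.nr_demoted += n;
		if (count_demotion)
			r.nr_reclaimed += n;
	}

	/* Nothing left to scan and the request is still not satisfied. */
	if (r.nr_reclaimed < target)
		r.err = -EAGAIN;

	return r;
}
```

In this model, model_reclaim(128000, 256, 32, true) stops after demoting 256 pages, while model_reclaim(128000, 256, 32, false) never makes progress towards the target, demotes all 128000 pages and ends with -EAGAIN, which is the shape of the behavior reported above.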
> > I am worried this will popping up again and again. I thought your nodes
> > subset approach could deal with this but I have overlooked one important
> > thing in your patch. The user provided nodemask controls where to
> > reclaim from but it doesn't constrain demotion targets. Is this
> > intentional? Would it actually make more sense to control demotion by
> > adding demotion nodes into the nodemask?
> >
>
> IMO, yes it is intentional, and no I would not recommend adding
> demotion nodes (I assume you mean adding both demote_from_nodes and
> demote_to_nodes as arg).

What I really mean is to add demotion nodes to the nodemask along with
the set of nodes you want to reclaim from. To me that sounds like a
more natural interface allowing for all sorts of use cases:
- free up demotion targets (only specify demotion nodes in the mask)
- control where to demote (e.g. select specific demotion target(s))
- do not demote at all (leave demotion nodes out of the node mask)

> My opinion is based on 2 reasons:
>
> 1. We control proactive reclaim by looking for nodes/tiers approaching
> pressure and triggering reclaim/demotion from said nodes/tiers. So we
> know the node/tier we would like to reclaim from, but not necessarily
> have a requirement on where the memory should go. I think it should be
> up to the kernel.
> 2. Currently I think most tiered machines will have 2 memory tiers,
> but I think the code is designed to support N memory tiers. What
> happens if N=3 and the user asks you to demote from the top tier nodes
> to the bottom tier nodes (skipping the middle tier)? The user in this
> case is explicitly asking to break the aging pipeline. From my short
> while on the mailing list I see great interest in respecting the aging
> pipeline, so it seems to me the demotion target should be decided by
> the kernel based on the aging pipeline, and the user should not be
> able to override it? I don't know. Maybe there is a valid use case for
> that somewhere.
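To make the nodemask semantic proposed earlier in this mail (demotion targets listed in the same mask as the nodes to reclaim from) more concrete, here is a small userspace sketch. It is not an existing kernel interface: NR_NODES, demotion_targets[] and allowed_targets() are hypothetical, and the topology simply mirrors the test machine quoted in this thread (nodes 0-2 in the top tier, node 3 below).

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES 4	/* hypothetical machine: nodes 0-2 top tier, node 3 lower */

/*
 * Hypothetical demotion topology: bit n of demotion_targets[i] is set
 * when node i may demote to node n. Node 3 has nowhere to demote to.
 */
static const uint32_t demotion_targets[NR_NODES] = {
	1u << 3, 1u << 3, 1u << 3, 0
};

/*
 * Nodes that 'nid' may demote to under the user supplied 'mask'.
 * Only targets which are themselves in the mask are allowed. An empty
 * result means the node is either not selected at all or its memory
 * has to be reclaimed in place rather than demoted.
 */
static uint32_t allowed_targets(int nid, uint32_t mask)
{
	assert(nid >= 0 && nid < NR_NODES);

	if (!(mask & (1u << nid)))
		return 0;	/* node was not selected for reclaim */

	return demotion_targets[nid] & mask;
}
```

With this sketch, a mask of nodes 0-3 (0xF) lets node 0 demote to node 3, a mask of nodes 0-2 (0x7) gives node 0 no demotion target so it would be reclaimed in place, and a mask of node 3 alone (0x8) only frees up the demotion target.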
I agree that the aging should be preserved as much as possible unless
there is an explicit requirement to do otherwise, which might be
something application specific.

It is really hard to anticipate all the use cases at this stage, but we
should keep in mind that the semantics of the interface will be cast in
stone once it is released. As of now I see a major point of confusion in
the nodemask semantics: it pretends to allow fine control, yet the
outcome is largely hard to predict because it makes assumptions about
the reclaim while giving very limited control over demotion.
-- 
Michal Hocko
SUSE Labs