From: "Huang, Ying"
To: Dave Hansen
Cc: , , , ,
Subject: Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
Date: Tue, 30 Jun 2020 15:23:53 +0800
Message-ID: <87v9j9ow3a.fsf@yhuang-dev.intel.com>
In-Reply-To: <20200629234517.A7EC4BD3@viggo.jf.intel.com> (Dave Hansen's message of "Mon, 29 Jun 2020 16:45:17 -0700")
References: <20200629234503.749E5340@viggo.jf.intel.com> <20200629234517.A7EC4BD3@viggo.jf.intel.com>
Hi, Dave,

Dave Hansen writes:

> From: Dave Hansen
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit, like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This patch proposes extending the existing "zone_reclaim_mode"
> (now really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> The implementation here is pretty simple and entirely unoptimized.
> On any memory hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that
> the node_demotion[] array is always ready to be used in case the
> new reclaim mode is enabled.  This recalculation is far from
> optimal, most glaringly that it does not even attempt to figure
> out if nodes are actually coming or going.
>
> Signed-off-by: Dave Hansen
> Cc: Yang Shi
> Cc: David Rientjes
> Cc: Huang Ying
> Cc: Dan Williams
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>  b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>  b/mm/vmscan.c                             |    7 +--
>  3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
> @@ -941,6 +941,7 @@ This is value OR'ed together of
>  1	(bit currently ignored)
>  2	Zone reclaim writes dirty pages out
>  4	Zone reclaim swaps pages
> +8	Zone reclaim migrates pages
>  =	===================================
>
>  zone_reclaim_mode is disabled by default. For file servers or workloads
> @@ -965,3 +966,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.
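A small aside on the documentation hunk above: since vm.zone_reclaim_mode is
an OR'ed value, it might help to show a combined setting in the document.
Purely as an illustration of the bit layout in the table above, user space
could enable unmap plus migrate like this (the program is my own sketch,
not part of the patch):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define RECLAIM_WRITE   (1 << 1)  /* write out dirty pages during reclaim */
#define RECLAIM_UNMAP   (1 << 2)  /* unmap/swap pages during reclaim */
#define RECLAIM_MIGRATE (1 << 3)  /* new bit: migrate pages during reclaim */

int main(void)
{
        char buf[16];
        int fd = open("/proc/sys/vm/zone_reclaim_mode", O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Enable unmap + migrate, i.e. this writes "12". */
        snprintf(buf, sizeof(buf), "%d\n", RECLAIM_UNMAP | RECLAIM_MIGRATE);
        if (write(fd, buf, strlen(buf)) < 0)
                perror("write");
        close(fd);
        return 0;
}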
> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
> @@ -49,6 +49,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>
> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>  	 * Avoid any oddities like cycles that could occur
>  	 * from changes in the topology. This will leave
>  	 * a momentary gap when migration is disabled.
> +	 *
> +	 * This is superfluous for memory offlining since
> +	 * MEM_GOING_OFFLINE does it independently, but it
> +	 * does not hurt to do it a second time.
>  	 */
>  	disable_all_migrate_targets();
>
> @@ -3211,6 +3216,60 @@ again:
>  	/* Is another pass necessary? */
>  	if (!nodes_empty(next_pass))
>  		goto again;
> +}
>
> -	put_online_mems();
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn on RECLAIM_MIGRATE
> + * without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *arg)
> +{
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +		/*
> +		 * Make sure there are not transient states where
> +		 * an offline node is a migration target. This
> +		 * will leave migration disabled until the offline
> +		 * completes and the MEM_OFFLINE case below runs.
> +		 */
> +		disable_all_migrate_targets();
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_ONLINE:
> +		/*
> +		 * Recalculate the target nodes once the node
> +		 * reaches its final state (online or offline).
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_CANCEL_OFFLINE:
> +		/*
> +		 * MEM_GOING_OFFLINE disabled all the migration
> +		 * targets. Reenable them.
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_ONLINE:
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
>  }
> +
> +static int __init migrate_on_reclaim_init(void)
> +{
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +	return 0;
> +}
> +late_initcall(migrate_on_reclaim_init);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>   * ABI.  New bits are OK, but existing bits can never change.
>   */
> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_RSVD    (1<<0)	/* (currently ignored/unused) */
> +#define RECLAIM_WRITE   (1<<1)	/* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP   (1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE (1<<3)	/* Migrate pages during reclaim */
>
>  /*
>   * Priority for NODE_RECLAIM. This determines the fraction of pages

I found that RECLAIM_MIGRATE is defined but never referenced in the patch.
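Just to illustrate what seems to be missing, I would have expected the new
bit to be consumed somewhere in mm/vmscan.c via a check along these lines
(the helper name is my invention; this is a sketch only, nothing like it
exists in the patch):

/* Sketch: gate reclaim-based demotion on the new sysctl bit. */
static inline bool node_reclaim_migrate_enabled(void)
{
        return node_reclaim_mode & RECLAIM_MIGRATE;
}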
If my understanding of the code is correct, shrink_do_demote_mapping() is
called by shrink_page_list(), which is used by both kswapd and direct
reclaim. So as long as the persistent memory node is onlined, reclaim-based
migration will be enabled regardless of node reclaim mode.
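That is, to make the sysctl bit actually control the behavior, I would
expect a guard like the below ahead of the demotion attempt (the placement
and the fallback are my guesses, and the arguments to
shrink_do_demote_mapping() are elided; a sketch only, not the actual call
site):

/*
 * Hypothetical guard in shrink_page_list(): only attempt demotion
 * when the admin opted in via RECLAIM_MIGRATE; otherwise the page
 * falls through to the normal swap-based reclaim path.
 */
if (node_reclaim_migrate_enabled())
        rc = shrink_do_demote_mapping(/* ... */);

Best Regards,
Huang, Ying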