From: "Huang, Ying"
To: peterz@infradead.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton,
	Ingo Molnar, Mel Gorman, Rik van Riel, Johannes Weiner,
	"Matthew Wilcox (Oracle)", Dave Hansen, Andi Kleen, Michal Hocko,
	David Rientjes
Subject: Re: [RFC] autonuma: Migrate on fault among multiple bound nodes
Date: Wed, 16 Sep 2020 16:46:37 +0800
Message-ID: <87pn6mrtw2.fsf@yhuang-dev.intel.com>
In-Reply-To: <20200916081052.GI2674@hirez.programming.kicks-ass.net>
References: <20200916005936.232788-1-ying.huang@intel.com>
	<20200916081052.GI2674@hirez.programming.kicks-ass.net>

Hi, Peter,

Thanks for the comments!

peterz@infradead.org writes:

> On Wed, Sep 16, 2020 at 08:59:36AM +0800, Huang Ying wrote:
>
>> So in this patch, if MPOL_BIND is used to bind the memory of the
>> application to multiple nodes, and in the hint page fault handler
>> both the faulting page node and the accessing node are in the policy
>> nodemask, we try to migrate the page to the accessing node to reduce
>> cross-node accesses.
>
> Seems fair enough..
>
>> Questions:
>>
>> The sysctl knob kernel.numa_balancing can enable/disable the AutoNUMA
>> optimization globally.  And now, it appears that specifying an
>> explicit NUMA memory policy (e.g. via numactl, mbind(), etc.) acts
>> like an implicit per-thread/VMA knob that disables the AutoNUMA
>> optimization for the thread/VMA.  Although this looks like a side
>> effect rather than an API, judging from commit fc3147245d19 ("mm:
>> numa: Limit NUMA scanning to migrate-on-fault VMAs"), it is used by
>> some users?  So the question is: do we need an explicit
>> per-thread/VMA knob to enable/disable the AutoNUMA optimization for
>> a thread/VMA?  Or should we just use the global knob, and either
>> optimize all threads/VMAs as long as the explicitly specified memory
>> policies are respected, or not optimize at all?
>
> I don't understand the question; that commit is not about disabling
> numa balancing, it's about avoiding pointless work and overhead.
> What's the point of scanning memory if you're not going to be allowed
> to move it anyway.

Because this patch enables the moving, the scanning is no longer
pointless, but it may also introduce overhead.
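To make the target scenario concrete, the usage we have in mind is
roughly the following untested userspace sketch (the node numbers and
the helper name are made up for illustration; link with -lnuma):

    #include <numaif.h>     /* set_mempolicy(), MPOL_BIND; link with -lnuma */
    #include <stdio.h>

    /* Bind all future allocations of the calling task to nodes 0 and 1.
     * With this patch, AutoNUMA hint page faults may then migrate pages
     * between node 0 and node 1, toward the CPU that accesses them. */
    static void bind_task_to_nodes_0_and_1(void)
    {
            unsigned long mask = (1UL << 0) | (1UL << 1);

            if (set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8 + 1))
                    perror("set_mempolicy");
    }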
>> Signed-off-by: "Huang, Ying"
>> Cc: Andrew Morton
>> Cc: Ingo Molnar
>> Cc: Mel Gorman
>> Cc: Rik van Riel
>> Cc: Johannes Weiner
>> Cc: "Matthew Wilcox (Oracle)"
>> Cc: Dave Hansen
>> Cc: Andi Kleen
>> Cc: Michal Hocko
>> Cc: David Rientjes
>> ---
>>  mm/mempolicy.c | 43 +++++++++++++++++++++++++++++++------------
>>  1 file changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index eddbe4e56c73..a941eab2de24 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1827,6 +1827,13 @@ static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>>  	return pol;
>>  }
>>
>> +static bool mpol_may_mof(struct mempolicy *pol)
>> +{
>> +	/* May migrate among bound nodes for MPOL_BIND */
>> +	return pol->flags & MPOL_F_MOF ||
>> +		(pol->mode == MPOL_BIND && nodes_weight(pol->v.nodes) > 1);
>> +}
>
> This is weird, why not just set F_MOF on the policy?
>
> In fact, why wouldn't something like:
>
>	mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY);
>
> work today?  Afaict MF_LAZY will unconditionally result in MPOL_F_MOF.

There are some subtle differences,

- LAZY appears unnecessary for the per-task memory policy set via
  set_mempolicy(), while migrating among multiple bound nodes appears
  reasonable as a per-task memory policy.

- LAZY also means moving pages that are not on the bound nodes to the
  bound nodes if memory is available.  Some users may want to do that
  only if should_numa_migrate_memory() returns true.
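For reference, I read the suggestion as something like the untested
sketch below.  MPOL_MF_LAZY comes from the kernel uapi
<linux/mempolicy.h> and may be missing from older numaif.h, and
addr/len stand for an existing mapping.  Note that there is no
equivalent for the per-task case, because set_mempolicy() takes no
maintenance flags at all:

    #include <numaif.h>             /* mbind(); link with -lnuma */

    #ifndef MPOL_MF_LAZY
    #define MPOL_MF_LAZY (1 << 3)   /* kernel uapi value */
    #endif

    /* Bind an existing range to nodes 0 and 1 and ask for lazy migration;
     * MF_LAZY makes the kernel set MPOL_F_MOF on the resulting VMA policy. */
    static long bind_range_lazy(void *addr, unsigned long len)
    {
            unsigned long mask = (1UL << 0) | (1UL << 1);

            return mbind(addr, len, MPOL_BIND, &mask,
                         sizeof(mask) * 8 + 1, MPOL_MF_LAZY);
    }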
>> @@ -2494,20 +2503,30 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>>  		break;
>>
>>  	case MPOL_BIND:
>>  		/*
>> +		 * Allows binding to multiple nodes.  If both current and
>> +		 * accessing nodes are in policy nodemask, migrate to
>> +		 * accessing node to optimize page placement.  Otherwise,
>> +		 * use current page if in policy nodemask or MPOL_F_MOF not
>> +		 * set, else select nearest allowed node, if any.  If no
>> +		 * allowed nodes, use current [!misplaced].
>>  		 */
>> +		if (node_isset(curnid, pol->v.nodes)) {
>> +			if (node_isset(thisnid, pol->v.nodes)) {
>> +				moron = true;
>> +				polnid = thisnid;
>> +			} else {
>> +				goto out;
>> +			}
>> +		} else if (!(pol->flags & MPOL_F_MOF)) {
>>  			goto out;
>> +		} else {
>> +			z = first_zones_zonelist(
>>  				node_zonelist(numa_node_id(), GFP_HIGHUSER),
>>  				gfp_zone(GFP_HIGHUSER),
>>  				&pol->v.nodes);
>> +			polnid = zone_to_nid(z->zone);
>> +		}
>>  		break;
>>
>>  	default:
>
> Did that want to be this instead?  I don't think I follow the other
> changes.
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index eddbe4e56c73..2a64913f9ac6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2501,8 +2501,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  		 * else select nearest allowed node, if any.
>  		 * If no allowed nodes, use current [!misplaced].
>  		 */
> -		if (node_isset(curnid, pol->v.nodes))
> +		if (node_isset(curnid, pol->v.nodes)) {
> +			if (node_isset(thisnid, pol->v.nodes))
> +				goto moron;
>  			goto out;
> +		}
>  		z = first_zones_zonelist(
>  			node_zonelist(numa_node_id(), GFP_HIGHUSER),
>  			gfp_zone(GFP_HIGHUSER),
> @@ -2516,6 +2519,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>
>  	/* Migrate the page towards the node whose CPU is referencing it */
>  	if (pol->flags & MPOL_F_MORON) {
> +moron:
>  		polnid = thisnid;
>
>  		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))

Yes.  This looks better if we can just use F_MOF.

Best Regards,
Huang, Ying
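P.S. If we go the F_MOF route, I guess setting the flag at policy
creation time could be as simple as the untested sketch below (assuming
the nodemask is available at that point, e.g. in mpol_new()):

	/* Untested sketch: mark a multi-node MPOL_BIND policy as
	 * migrate-on-fault when it is created, so that mpol_misplaced()
	 * only needs to test MPOL_F_MOF. */
	if (mode == MPOL_BIND && nodes_weight(*nodes) > 1)
		policy->flags |= MPOL_F_MOF;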