From: "Huang, Ying"
To: Peter Zijlstra
Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Rik van Riel,
	Johannes Weiner, "Matthew Wilcox (Oracle)", Dave Hansen,
	Andi Kleen, Michal Hocko, David Rientjes
Subject: Re: [PATCH -V8 1/3] numa balancing: Migrate on fault among multiple bound nodes
References: <20210106065754.17955-1-ying.huang@intel.com>
	<20210106065754.17955-2-ying.huang@intel.com>
Date: Tue, 12 Jan 2021 14:13:36 +0800
In-Reply-To: <20210106065754.17955-2-ying.huang@intel.com> (Huang Ying's message of
	"Wed, 6 Jan 2021 14:57:52 +0800")
Message-ID: <87bldud6nj.fsf@yhuang-dev.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Hi, Peter,

Huang Ying writes:

> Now, NUMA balancing can only optimize the page placement among the
> NUMA nodes if the default memory policy is used, because an
> explicitly specified memory policy should take precedence.  But this
> seems too strict in some situations.  For example, on a system with 4
> NUMA nodes, if the memory of an application is bound to nodes 0 and
> 1, NUMA balancing could potentially migrate pages between nodes 0
> and 1 to reduce cross-node accesses without breaking the explicit
> memory binding policy.
>
> So in this patch, we add the MPOL_F_NUMA_BALANCING mode flag to
> set_mempolicy() when the mode is MPOL_BIND.  With the flag specified,
> NUMA balancing will be enabled within the thread to optimize the page
> placement within the constraints of the specified memory binding
> policy.  With the newly added flag, the NUMA balancing control
> mechanism becomes:
>
> - The sysctl knob numa_balancing enables/disables NUMA balancing
>   globally.
>
> - Even if sysctl numa_balancing is enabled, NUMA balancing remains
>   disabled by default for memory areas or applications with an
>   explicit memory policy.
>
> - MPOL_F_NUMA_BALANCING can be used to enable NUMA balancing for an
>   application that specifies an explicit memory policy (MPOL_BIND).
>
> Various page placement optimizations based on NUMA balancing can be
> built on top of these flags.  As the first step, in this patch, if
> the memory of the application is bound to multiple nodes (MPOL_BIND)
> and the accessing node is in the policy nodemask, the hint page fault
> handler will try to migrate the page to the accessing node to reduce
> cross-node accesses.
>
> If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
> application on an old kernel version without support for it,
> set_mempolicy() will return -1 and errno will be set to EINVAL.  The
> application can use this behavior to run on both old and new kernel
> versions.
>
> And if the MPOL_F_NUMA_BALANCING flag is specified for a mode other
> than MPOL_BIND, set_mempolicy() will return -1 and errno will be set
> to EINVAL as before, because we don't support NUMA-balancing-based
> optimization for these modes.
>
> In a previous version of the patch, we tried to reuse MPOL_MF_LAZY
> for mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it is not a
> good API/ABI for the purpose of this patch.
>
> And because it's not clear whether it's necessary to enable NUMA
> balancing for a specific memory area inside an application, we only
> add the flag at the thread level (set_mempolicy()) instead of the
> memory area level (mbind()).  We can do that when it becomes
> necessary.
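
As an aside for readers who do not want to link against libnuma: the
same probe-and-fall-back pattern can also be written against the raw
syscall, as in the untested sketch below.  The hand-built nodemask for
nodes 1 and 3, the maxnode idiom, and the availability of
MPOL_F_NUMA_BALANCING in the installed uapi <linux/mempolicy.h> are
illustrative assumptions, not part of the patch itself.

    /*
     * Illustrative sketch only: bind the thread's memory to nodes 1
     * and 3 via the raw set_mempolicy() syscall, asking for NUMA
     * balancing when the kernel supports it and falling back to plain
     * MPOL_BIND on older kernels that reject the flag with EINVAL.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/mempolicy.h>  /* MPOL_BIND, MPOL_F_NUMA_BALANCING (patched headers assumed) */

    int main(void)
    {
            /* Bit mask of allowed nodes: nodes 1 and 3. */
            unsigned long nodemask = (1UL << 1) | (1UL << 3);
            unsigned long maxnode = 8 * sizeof(nodemask) + 1;
            long ret;

            ret = syscall(SYS_set_mempolicy,
                          MPOL_BIND | MPOL_F_NUMA_BALANCING,
                          &nodemask, maxnode);
            if (ret < 0 && errno == EINVAL)  /* old kernel: flag unknown */
                    ret = syscall(SYS_set_mempolicy, MPOL_BIND,
                                  &nodemask, maxnode);
            if (ret < 0) {
                    perror("set_mempolicy");
                    exit(EXIT_FAILURE);
            }
            /* ... run the memory-bound workload here ... */
            return 0;
    }

glibc does not wrap set_mempolicy() itself, which is why the snippet
in the quoted test description below uses the declaration from
libnuma's <numaif.h> instead.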
>
> To test the patch, we ran the following test case on a 4-node machine
> with 192 GB memory (48 GB per node).
>
> 1. Change the pmbench memory accessing benchmark to call
>    set_mempolicy() to bind its memory to nodes 1 and 3 and to enable
>    NUMA balancing.  Some related code snippets are as follows:
>
>      #include <numa.h>
>      #include <numaif.h>
>
>      struct bitmask *bmp;
>      int ret;
>
>      bmp = numa_parse_nodestring("1,3");
>      ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
>                          bmp->maskp, bmp->size + 1);
>      /* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
>      if (ret < 0 && errno == EINVAL)
>              ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
>      if (ret < 0) {
>              perror("Failed to call set_mempolicy");
>              exit(-1);
>      }
>
> 2. Run a memory eater on node 3 to use 40 GB of memory before running
>    pmbench.
>
> 3. Run pmbench with 64 processes; the working-set size of each
>    process is 640 MB, so the total working-set size is 64 * 640 MB =
>    40 GB.  The CPUs and the memory (as in step 1) of all pmbench
>    processes are bound to nodes 1 and 3.  So, after CPU usage is
>    balanced, some pmbench processes running on the CPUs of node 3
>    will access the memory of node 1.
>
> 4. After the pmbench processes have run for 100 seconds, kill the
>    memory eater.  Now it is possible for some pmbench processes to
>    migrate their pages from node 1 to node 3 to reduce cross-node
>    accesses.
>
> Test results show that, with the patch, pages can be migrated from
> node 1 to node 3 after the memory eater is killed, and the pmbench
> score increases by about 17.5%.
>
> Signed-off-by: "Huang, Ying"
> Acked-by: Mel Gorman
> Cc: Andrew Morton
> Cc: Ingo Molnar
> Cc: Rik van Riel
> Cc: Johannes Weiner
> Cc: "Matthew Wilcox (Oracle)"
> Cc: Dave Hansen
> Cc: Andi Kleen
> Cc: Michal Hocko
> Cc: David Rientjes
> Cc: linux-api@vger.kernel.org

It seems that Andrew has no objection to this patch.  Is it possible
for you to merge it through your tree?

Best Regards,
Huang, Ying
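
PS: for completeness, the "memory eater" in step 2 of the quoted test
description can be as simple as the hypothetical sketch below.  The
use of libnuma, the 1 GB chunk size, and the hard-coded node 3 /
40 GB figures (taken from the test description) are illustrative
assumptions, not the actual tool used in the measurements.

    /*
     * Sketch of a memory eater: allocate and touch ~40 GB on node 3,
     * then hold it until the process is killed.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <numa.h>

    #define CHUNK    (1UL << 30)   /* 1 GB per allocation */
    #define TOTAL_GB 40

    int main(void)
    {
            void *chunks[TOTAL_GB];
            int i;

            if (numa_available() < 0) {
                    fprintf(stderr, "NUMA not available\n");
                    return 1;
            }
            for (i = 0; i < TOTAL_GB; i++) {
                    chunks[i] = numa_alloc_onnode(CHUNK, 3);
                    if (!chunks[i]) {
                            perror("numa_alloc_onnode");
                            return 1;
                    }
                    /* Touch the pages so they are actually faulted in on node 3. */
                    memset(chunks[i], 0, CHUNK);
            }
            pause();   /* hold the memory until killed (step 4) */
            return 0;
    }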