From mboxrd@z Thu Jan  1 00:00:00 1970
From: Huang Ying
To: Peter Zijlstra, Mel Gorman
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
 Andrew Morton, Ingo Molnar, Rik van Riel, Johannes Weiner,
 "Matthew Wilcox (Oracle)", Dave Hansen, Andi Kleen, Michal Hocko,
 David Rientjes, linux-api@vger.kernel.org
Subject: [PATCH -V6 RESEND 1/3] numa balancing: Migrate on fault among multiple bound nodes
Date: Wed, 2 Dec 2020 16:42:32 +0800
Message-Id: <20201202084234.15797-2-ying.huang@intel.com>
X-Mailer: git-send-email 2.29.2
In-Reply-To: <20201202084234.15797-1-ying.huang@intel.com>
References: <20201202084234.15797-1-ying.huang@intel.com>

Now, NUMA balancing can only optimize page placement among the NUMA
nodes if the default memory policy is used, because an explicitly
specified memory policy should take precedence.  But this seems too
strict in some situations.  For example, on a system with 4 NUMA nodes,
if the memory of an application is bound to nodes 0 and 1, NUMA
balancing can potentially migrate pages between nodes 0 and 1 to reduce
cross-node accesses without breaking the explicit memory binding
policy.

So in this patch, we add the MPOL_F_NUMA_BALANCING mode flag to
set_mempolicy().  With the flag specified, NUMA balancing will be
enabled within the thread to optimize page placement within the
constraints of the specified memory binding policy.  With the newly
added flag, the NUMA balancing control mechanism becomes:

- The sysctl knob numa_balancing can enable/disable NUMA balancing
  globally.

- Even if sysctl numa_balancing is enabled, NUMA balancing is still
  disabled by default for memory areas or applications with an
  explicit memory policy.

- MPOL_F_NUMA_BALANCING can be used to enable NUMA balancing for an
  application even when it specifies an explicit memory policy.

Various page placement optimizations based on NUMA balancing can be
done with these flags.  As the first step, in this patch, if the memory
of the application is bound to multiple nodes (MPOL_BIND) and the
accessing node is in the policy nodemask when the hint page fault is
handled, the kernel will try to migrate the page to the accessing node
to reduce cross-node accesses.

If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
application on an old kernel version without support for it,
set_mempolicy() will return -1 and errno will be set to EINVAL.  The
application can use this behavior to run on both old and new kernel
versions.

In a previous version of the patch, we tried to reuse MPOL_MF_LAZY for
mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it does not seem
to be a good API/ABI for the purpose of this patch.  And because it is
not clear whether it is necessary to enable NUMA balancing for a
specific memory area inside an application, we only add the flag at the
thread level (set_mempolicy()) instead of the memory area level
(mbind()).  We can do that when it becomes necessary.

To test the patch, we run a test case as follows on a 4-node machine
with 192 GB memory (48 GB per node).

1. Change the pmbench memory accessing benchmark to call
   set_mempolicy() to bind its memory to nodes 1 and 3 and enable NUMA
   balancing.  Some related code snippets are as follows,

     #include <numaif.h>
     #include <numa.h>

	struct bitmask *bmp;
	int ret;

	bmp = numa_parse_nodestring("1,3");
	ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
			    bmp->maskp, bmp->size + 1);
	/* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
	if (ret < 0 && errno == EINVAL)
		ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
	if (ret < 0) {
		perror("Failed to call set_mempolicy");
		exit(-1);
	}

2. Run a memory eater on node 3 to use 40 GB memory before running
   pmbench.

3. Run pmbench with 64 processes; the working-set size of each process
   is 640 MB, so the total working-set size is 64 * 640 MB = 40 GB.
   The CPU and the memory (as in step 1) of all pmbench processes are
   bound to nodes 1 and 3.  So, after CPU usage is balanced, some
   pmbench processes running on the CPUs of node 3 will access the
   memory of node 1.

4. After the pmbench processes have run for 100 seconds, kill the
   memory eater.  Now it is possible for some pmbench processes to
   migrate their pages from node 1 to node 3 to reduce cross-node
   accesses.

Test results show that, with the patch, the pages can be migrated from
node 1 to node 3 after killing the memory eater, and the pmbench score
can increase by about 17.5%.
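As a side note, page placement such as the migration described above
can be observed from user space with the move_pages() syscall wrapper
declared in <numaif.h> (link with -lnuma): passing NULL for the target
node array makes it only report the node each page currently resides
on.  The following is a minimal sketch of that idea; it is not part of
the original test setup, and the buffer size, the 64-node cap, and the
output format are purely illustrative.

	/*
	 * Illustrative sketch (not from the original test): report the
	 * NUMA node of each page in a buffer via move_pages() with
	 * nodes == NULL, which queries placement without migrating.
	 * Build with: gcc -o pagenodes pagenodes.c -lnuma
	 */
	#include <numaif.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		size_t npages = 1024;		/* 4 MB with 4 KB pages; illustrative */
		long node_count[64] = { 0 };	/* assumes fewer than 64 nodes */
		char *buf;
		void **pages;
		int *status;
		size_t i;

		buf = aligned_alloc(page_size, npages * page_size);
		pages = calloc(npages, sizeof(*pages));
		status = calloc(npages, sizeof(*status));
		if (!buf || !pages || !status) {
			perror("allocation failed");
			exit(1);
		}
		memset(buf, 1, npages * page_size);	/* fault the pages in */
		for (i = 0; i < npages; i++)
			pages[i] = buf + i * page_size;

		/* nodes == NULL: only report where each page resides */
		if (move_pages(0 /* self */, npages, pages, NULL, status, 0) < 0) {
			perror("move_pages");
			exit(1);
		}

		for (i = 0; i < npages; i++)
			if (status[i] >= 0 && status[i] < 64)
				node_count[status[i]]++;
		for (i = 0; i < 64; i++)
			if (node_count[i])
				printf("node %zu: %ld pages\n", i, node_count[i]);
		return 0;
	}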
Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: "Matthew Wilcox (Oracle)"
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Michal Hocko
Cc: David Rientjes
Cc: linux-api@vger.kernel.org
---
 include/uapi/linux/mempolicy.h | 4 +++-
 mm/mempolicy.c                 | 9 +++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774af61e..8948467b3992 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -28,12 +28,14 @@ enum {
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
+#define MPOL_F_NUMA_BALANCING	(1 << 13) /* Optimize with NUMA balancing if possible */
 
 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
  * either set_mempolicy() or mbind().
  */
-#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
+#define MPOL_MODE_FLAGS \
+	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3ca4898f3f24..f74d863a9ad3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -875,6 +875,9 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 		goto out;
 	}
 
+	if (new && new->mode == MPOL_BIND && (flags & MPOL_F_NUMA_BALANCING))
+		new->flags |= (MPOL_F_MOF | MPOL_F_MORON);
+
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		mpol_put(new);
@@ -2490,6 +2493,12 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_BIND:
+		/* Optimize placement among multiple nodes via NUMA balancing */
+		if (pol->flags & MPOL_F_MORON) {
+			if (node_isset(thisnid, pol->v.nodes))
+				break;
+			goto out;
+		}
 
 		/*
 		 * allows binding to multiple nodes.
-- 
2.29.2