From: "Huang, Ying"
To: Peter Zijlstra
Cc: Mel Gorman, Andrew Morton, Ingo Molnar, Rik van Riel,
	Johannes Weiner, "Matthew Wilcox (Oracle)", Dave Hansen,
	Andi Kleen, Michal Hocko, David Rientjes
Subject: Re: [PATCH -V8 1/3] numa balancing: Migrate on fault among multiple bound nodes
References: <20210106065754.17955-1-ying.huang@intel.com>
	<20210106065754.17955-2-ying.huang@intel.com>
Date: Tue, 12 Jan 2021 14:13:36 +0800
In-Reply-To: <20210106065754.17955-2-ying.huang@intel.com> (Huang Ying's message of
	"Wed, 6 Jan 2021 14:57:52 +0800")
Message-ID: <87bldud6nj.fsf@yhuang-dev.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Hi, Peter,

Huang Ying writes:

> Now, NUMA balancing can only optimize the page placement among the
> NUMA nodes if the default memory policy is used, because an
> explicitly specified memory policy should take precedence.  But this
> seems too strict in some situations.  For example, on a system with 4
> NUMA nodes, if the memory of an application is bound to nodes 0 and
> 1, NUMA balancing could potentially migrate pages between nodes 0
> and 1 to reduce cross-node accesses without breaking the explicit
> memory binding policy.
>
> So in this patch, we add the MPOL_F_NUMA_BALANCING mode flag to
> set_mempolicy() when the mode is MPOL_BIND.  With the flag specified,
> NUMA balancing will be enabled within the thread to optimize the page
> placement within the constraints of the specified memory binding
> policy.  With the newly added flag, the NUMA balancing control
> mechanism becomes:
>
> - The sysctl knob numa_balancing enables/disables NUMA balancing
>   globally.
>
> - Even if sysctl numa_balancing is enabled, NUMA balancing remains
>   disabled by default for memory areas or applications with an
>   explicit memory policy.
>
> - MPOL_F_NUMA_BALANCING can be used to enable NUMA balancing for an
>   application that specifies an explicit memory policy (MPOL_BIND).
>
> Various page placement optimizations based on NUMA balancing can be
> built on top of these flags.  As the first step, in this patch, if
> the memory of the application is bound to multiple nodes (MPOL_BIND)
> and the accessing node is in the policy nodemask, the hint page fault
> handler will try to migrate the page to the accessing node to reduce
> cross-node accesses.
>
> If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
> application on an old kernel version without support for it,
> set_mempolicy() will return -1 and errno will be set to EINVAL.  The
> application can use this behavior to run on both old and new kernel
> versions.
>
> And if the MPOL_F_NUMA_BALANCING flag is specified for a mode other
> than MPOL_BIND, set_mempolicy() will return -1 and errno will be set
> to EINVAL as before, because we don't support NUMA-balancing-based
> optimization for these modes.
>
> In a previous version of the patch, we tried to reuse MPOL_MF_LAZY
> for mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it is not a
> good API/ABI for the purpose of this patch.
>
> And because it's not clear whether it's necessary to enable NUMA
> balancing for a specific memory area inside an application, we only
> add the flag at the thread level (set_mempolicy()) instead of the
> memory area level (mbind()).  We can do that when it becomes
> necessary.
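
As an aside for readers who do not want to link against libnuma: the
same probe-and-fall-back pattern can also be written against the raw
syscall, as in the untested sketch below.  The hand-built nodemask for
nodes 1 and 3, the maxnode idiom, and the availability of
MPOL_F_NUMA_BALANCING in the installed uapi <linux/mempolicy.h> are
illustrative assumptions, not part of the patch itself.

    /*
     * Illustrative sketch only: bind the thread's memory to nodes 1
     * and 3 via the raw set_mempolicy() syscall, asking for NUMA
     * balancing when the kernel supports it and falling back to plain
     * MPOL_BIND on older kernels that reject the flag with EINVAL.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/mempolicy.h>  /* MPOL_BIND, MPOL_F_NUMA_BALANCING (patched headers assumed) */

    int main(void)
    {
            /* Bit mask of allowed nodes: nodes 1 and 3. */
            unsigned long nodemask = (1UL << 1) | (1UL << 3);
            unsigned long maxnode = 8 * sizeof(nodemask) + 1;
            long ret;

            ret = syscall(SYS_set_mempolicy,
                          MPOL_BIND | MPOL_F_NUMA_BALANCING,
                          &nodemask, maxnode);
            if (ret < 0 && errno == EINVAL)  /* old kernel: flag unknown */
                    ret = syscall(SYS_set_mempolicy, MPOL_BIND,
                                  &nodemask, maxnode);
            if (ret < 0) {
                    perror("set_mempolicy");
                    exit(EXIT_FAILURE);
            }
            /* ... run the memory-bound workload here ... */
            return 0;
    }

glibc does not wrap set_mempolicy() itself, which is why the snippet
in the quoted test description below uses the declaration from
libnuma's <numaif.h> instead.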
>
> To test the patch, we ran the following test case on a 4-node machine
> with 192 GB memory (48 GB per node).
>
> 1. Change the pmbench memory accessing benchmark to call
>    set_mempolicy() to bind its memory to nodes 1 and 3 and to enable
>    NUMA balancing.  Some related code snippets are as follows:
>
>      #include <numa.h>
>      #include <numaif.h>
>
>      struct bitmask *bmp;
>      int ret;
>
>      bmp = numa_parse_nodestring("1,3");
>      ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
>                          bmp->maskp, bmp->size + 1);
>      /* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
>      if (ret < 0 && errno == EINVAL)
>              ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
>      if (ret < 0) {
>              perror("Failed to call set_mempolicy");
>              exit(-1);
>      }
>
> 2. Run a memory eater on node 3 to use 40 GB of memory before running
>    pmbench.
>
> 3. Run pmbench with 64 processes; the working-set size of each
>    process is 640 MB, so the total working-set size is 64 * 640 MB =
>    40 GB.  The CPUs and the memory (as in step 1) of all pmbench
>    processes are bound to nodes 1 and 3.  So, after CPU usage is
>    balanced, some pmbench processes running on the CPUs of node 3
>    will access the memory of node 1.
>
> 4. After the pmbench processes have run for 100 seconds, kill the
>    memory eater.  Now it is possible for some pmbench processes to
>    migrate their pages from node 1 to node 3 to reduce cross-node
>    accesses.
>
> Test results show that, with the patch, pages can be migrated from
> node 1 to node 3 after the memory eater is killed, and the pmbench
> score increases by about 17.5%.
>
> Signed-off-by: "Huang, Ying"
> Acked-by: Mel Gorman
> Cc: Andrew Morton
> Cc: Ingo Molnar
> Cc: Rik van Riel
> Cc: Johannes Weiner
> Cc: "Matthew Wilcox (Oracle)"
> Cc: Dave Hansen
> Cc: Andi Kleen
> Cc: Michal Hocko
> Cc: David Rientjes
> Cc: linux-api@vger.kernel.org

It seems that Andrew has no objection to this patch.  Is it possible
for you to merge it through your tree?

Best Regards,
Huang, Ying
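
PS: for completeness, the "memory eater" in step 2 of the quoted test
description can be as simple as the hypothetical sketch below.  The
use of libnuma, the 1 GB chunk size, and the hard-coded node 3 /
40 GB figures (taken from the test description) are illustrative
assumptions, not the actual tool used in the measurements.

    /*
     * Sketch of a memory eater: allocate and touch ~40 GB on node 3,
     * then hold it until the process is killed.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <numa.h>

    #define CHUNK    (1UL << 30)   /* 1 GB per allocation */
    #define TOTAL_GB 40

    int main(void)
    {
            void *chunks[TOTAL_GB];
            int i;

            if (numa_available() < 0) {
                    fprintf(stderr, "NUMA not available\n");
                    return 1;
            }
            for (i = 0; i < TOTAL_GB; i++) {
                    chunks[i] = numa_alloc_onnode(CHUNK, 3);
                    if (!chunks[i]) {
                            perror("numa_alloc_onnode");
                            return 1;
                    }
                    /* Touch the pages so they are actually faulted in on node 3. */
                    memset(chunks[i], 0, CHUNK);
            }
            pause();   /* hold the memory until killed (step 4) */
            return 0;
    }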