Date: Wed, 24 Jun 2020 12:37:33 -0700
From: Ben Widawsky
To: Michal Hocko
Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter, Dan Williams,
 Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
 Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
 Lee Schermerhorn, Li Xinhai, Mel Gorman, Mike Kravetz, Mina Almasry,
 Tejun Heo, Vlastimil Babka, linux-api@vger.kernel.org
Subject: Re: [PATCH 00/18] multiple preferred nodes
Message-ID: <20200624193733.tqeligjd3pdvrsmi@intel.com>
In-Reply-To: <20200624183917.GW1320@dhcp22.suse.cz>
References: <20200619162425.1052382-1-ben.widawsky@intel.com>
 <20200622070957.GB31426@dhcp22.suse.cz>
 <20200623112048.GR31426@dhcp22.suse.cz>
 <20200623161211.qjup5km5eiisy5wy@intel.com>
 <20200624075216.GC1320@dhcp22.suse.cz>
 <20200624161643.75fkkvsxlmp3bf2e@intel.com>
 <20200624183917.GW1320@dhcp22.suse.cz>

On 20-06-24 20:39:17, Michal Hocko wrote:
> On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > [...]
> > > > > It would also be great to provide a high-level semantic description
> > > > > here. I have very quickly glanced through the patches and they are not
> > > > > really trivial to follow with many incremental steps, so the higher-level
> > > > > intention is easily lost.
> > > > >
> > > > > Do I get it right that the default semantic is essentially
> > > > > 	- allocate the page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > 	  semantic)
> > > > > 	- fall back to a numa-unrestricted allocation with the default
> > > > > 	  numa policy on failure
> > > > >
> > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > the fallback?
> > > >
> > > > tl;dr is: yes, and no usecases.
> > >
> > > OK, then I am wondering why the change has to be so involved. Except for
> > > syscall plumbing the only real change to the allocator path would be
> > > something like
> > >
> > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > {
> > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > 	if (unlikely(policy->mode == MPOL_BIND ||
> > > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > 		return &policy->v.nodes;
> > >
> > > 	return NULL;
> > > }
> > >
> > > alloc_pages_current
> > >
> > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > 	else {
> > > 		gfp_t gfp_attempt = gfp;
> > >
> > > 		/*
> > > 		 * Make sure the first allocation attempt will try hard
> > > 		 * but eventually fail without OOM killer or other
> > > 		 * disruption before falling back to the full nodemask
> > > 		 */
> > > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
> > >
> > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > 				policy_node(gfp, pol, numa_node_id()),
> > > 				policy_nodemask(gfp, pol));
> > > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > > 			page = __alloc_pages_nodemask(gfp, order,
> > > 					numa_node_id(), NULL);
> > > 	}
> > >
> > > 	return page;
> > >
> > > similar (well, slightly more hairy) in alloc_pages_vma
> > >
> > > Or do I miss something that really requires a more involved approach like
> > > building custom zonelists and other larger changes to the allocator?
> >
> > I think I'm missing how this allows selecting from multiple preferred nodes.
> > In this case, when you try to get the page from the freelist, you'll get the
> > zonelist of the preferred node, and when you actually scan through on page
> > allocation, you have no way to filter out the non-preferred nodes. I think the
> > plumbing of multiple nodes has to go all the way through
> > __alloc_pages_nodemask(). But it's possible I've missed the point.
>
> policy_nodemask() will provide the nodemask which will be used as a
> filter on the policy_node.

Ah, gotcha. Enabling independent masks seemed useful, but that was a misstep on
my part. The UAPI can't express independent masks, and the callers of these
functions don't use them yet. So before I actually type it up and find it's
much, much simpler, let me ask: is there any perceived benefit to keeping the
two masks independent?
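
(As an aside, here is a minimal userspace sketch of what the new mode would
look like through set_mempolicy(2), assuming the MPOL_PREFERRED_MANY mode
proposed by this series. The numeric value and the exact fallback behaviour
are assumptions based on the discussion above, not the final uapi; on a kernel
without the series the call simply fails with EINVAL. Build with -lnuma.)

/* Hypothetical usage sketch for MPOL_PREFERRED_MANY; value is illustrative. */
#include <numaif.h>		/* set_mempolicy(), link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5	/* assumption: not yet in installed headers */
#endif

int main(void)
{
	/* Prefer nodes 0 and 2; the kernel may still fall back elsewhere. */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);

	if (set_mempolicy(MPOL_PREFERRED_MANY, &nodemask,
			  8 * sizeof(nodemask))) {
		perror("set_mempolicy");	/* EINVAL on kernels without the series */
		return EXIT_FAILURE;
	}

	/*
	 * Subsequent allocations by this task are first attempted on nodes
	 * 0 and 2 (the __GFP_RETRY_MAYFAIL pass discussed above), then fall
	 * back to the rest of the system if those nodes cannot satisfy them.
	 */
	return EXIT_SUCCESS;
}

The point of the mode, as opposed to plain MPOL_PREFERRED, is simply that the
mask may name more than one node while still allowing the default-policy
fallback.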