Date: Wed, 24 Jun 2020 09:16:43 -0700
From: Ben Widawsky
To: Michal Hocko
Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter, Dan Williams,
 Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
 Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
 Lee Schermerhorn, Li Xinhai, Mel Gorman, Mike Kravetz, Mina Almasry,
 Tejun Heo, Vlastimil Babka, linux-api@vger.kernel.org
Subject: Re: [PATCH 00/18] multiple preferred nodes
Message-ID: <20200624161643.75fkkvsxlmp3bf2e@intel.com>
References: <20200619162425.1052382-1-ben.widawsky@intel.com>
 <20200622070957.GB31426@dhcp22.suse.cz>
 <20200623112048.GR31426@dhcp22.suse.cz>
 <20200623161211.qjup5km5eiisy5wy@intel.com>
 <20200624075216.GC1320@dhcp22.suse.cz>
In-Reply-To: <20200624075216.GC1320@dhcp22.suse.cz>

On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would be also great to provide a high level semantic description
> > > here. I have very quickly glanced through patches and they are not
> > > really trivial to follow with many incremental steps so the higher level
> > > intention is lost easily.
> > >
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > >
> > > Or are there any usecases to modify how hard to keep the preference over
> > > the fallback?
> >
> > tl;dr is: yes, and no usecases.
>
> OK, then I am wondering why the change has to be so involved.
> Except for syscall plumbing the only real change to the allocator path
> would be something like
>
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND ||
> 		     policy->mode == MPOL_PREFERRED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
>
> 	return NULL;
> }
>
> alloc_pages_current
>
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
>
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERRED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
>
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 				numa_node_id(), NULL);
> 	}
>
> 	return page;
>
> similar (well slightly more hairy) in alloc_pages_vma
>
> Or do I miss something that really requires more involved approach like
> building custom zonelists and other larger changes to the allocator?

I think I'm missing how this allows selecting from multiple preferred nodes. In
this case when you try to get the page from the freelist, you'll get the
zonelist of the preferred node, and when you actually scan through on page
allocation, you have no way to filter out the non-preferred nodes. I think the
plumbing of multiple nodes has to go all the way through
__alloc_pages_nodemask(). But it's possible I've missed the point.

I do have a branch where I build a custom zonelist, but that's not the reason
here :-)
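
For anyone following along, the two-pass idea in the quoted snippet can be
modeled in userspace roughly as below. This is only a toy sketch, not kernel
code: toy_free_pages, toy_alloc_masked() and toy_alloc_preferred_many() are
made-up names, a node is reduced to a free-page counter, and a plain bitmask
stands in for nodemask_t. It says nothing about how __alloc_pages_nodemask()
really walks a zonelist; it just shows "try the preferred mask first, then
retry without a mask".

```c
#include <assert.h>

#define TOY_NR_NODES 4

/* Hypothetical per-node free page counters (stand-in for real free lists). */
static int toy_free_pages[TOY_NR_NODES] = { 0, 1, 0, 2 };

/*
 * Walk the nodes starting at `start` (a crude stand-in for walking the
 * local node's zonelist) and take a page from the first node that is
 * allowed by `mask` and still has a free page.  A mask of 0 means
 * "no restriction".  Returns the node id, or -1 if nothing was found.
 */
static int toy_alloc_masked(int start, unsigned int mask)
{
	assert(start >= 0 && start < TOY_NR_NODES);

	for (int i = 0; i < TOY_NR_NODES; i++) {
		int nid = (start + i) % TOY_NR_NODES;

		if (mask && !(mask & (1u << nid)))
			continue;	/* node filtered out by the mask */
		if (toy_free_pages[nid] > 0) {
			toy_free_pages[nid]--;
			return nid;
		}
	}
	return -1;
}

/* First pass: preferred nodes only.  Second pass: any node. */
static int toy_alloc_preferred_many(int local, unsigned int preferred)
{
	int nid = toy_alloc_masked(local, preferred);

	if (nid < 0)
		nid = toy_alloc_masked(local, 0);
	return nid;
}
```

With the caller on node 0 and nodes 1 and 2 preferred (mask 0x6), the first
call in this model is served from node 1. Once the preferred nodes run dry,
later calls fall through to node 3 via the unrestricted pass, and the call
fails only when every node is empty.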