From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 24 Jun 2020 13:55:18 -0700
From: Ben Widawsky
To: Michal Hocko
Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter, Dan Williams,
	Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
	Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
	Lee Schermerhorn, Li Xinhai, Mel Gorman, Mike Kravetz, Mina Almasry,
	Tejun Heo, Vlastimil Babka, linux-api@vger.kernel.org
Subject: Re: [PATCH 00/18] multiple preferred nodes
Message-ID: <20200624205518.tzcvjayntez4ueqw@intel.com>
References: <20200623161211.qjup5km5eiisy5wy@intel.com>
 <20200624075216.GC1320@dhcp22.suse.cz>
 <20200624161643.75fkkvsxlmp3bf2e@intel.com>
 <20200624183917.GW1320@dhcp22.suse.cz>
 <20200624193733.tqeligjd3pdvrsmi@intel.com>
 <20200624195158.GX1320@dhcp22.suse.cz>
 <20200624200140.dypw6snshshzlbwa@intel.com>
 <20200624200750.GY1320@dhcp22.suse.cz>
 <20200624202344.woogq4n3bqkuejty@intel.com>
 <20200624204232.GZ1320@dhcp22.suse.cz>
In-Reply-To: <20200624204232.GZ1320@dhcp22.suse.cz>
List-ID: linux-api@vger.kernel.org

On 20-06-24 22:42:32, Michal Hocko wrote:
> On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> > On 20-06-24 22:07:50, Michal Hocko wrote:
> > > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > [...]
> > > > > > > > > Or do I miss something that really requires a more involved approach like
> > > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > >
> > > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > >
> > > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > > filter on the policy_node.
> > > > > >
> > > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > > don't yet use them.
> > > > > >
> > > > > > So let me ask before I actually type it up and find it's much much simpler: is
> > > > > > there not some perceived benefit to having both masks be independent?
> > > > >
> > > > > I am not sure I follow. Which two masks do you have in mind? The zonelist
> > > > > and the user-provided nodemask?
> > > >
> > > > Internally, a nodemask_t for the preferred nodes, and a nodemask_t for the bound nodes.
> > >
> > > Each mask is local to its policy object.
> >
> > I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
> > policy. Policy decisions are all made beforehand. The question from a few mails
> > ago was whether there is any use in keeping that change to
> > __alloc_pages_nodemask accepting two nodemasks.
>
> It is probably too late for me, because I am still not following what you
> mean. Maybe it would be better to provide pseudo code for what you have in
> mind.
> Anyway, all that I am saying is that for the functionality that you
> propose, _if_ the fallback strategy is fixed, then all you should need
> is to use the preferred nodemask for __alloc_pages_nodemask and a
> fallback allocation to the full set of nodes (NULL nodemask). So you first
> try what the userspace prefers - __GFP_RETRY_MAYFAIL will give you the
> "try hard but do not OOM if the memory is depleted" semantic - and the
> fallback allocation goes all the way to OOM on complete memory depletion.
> So I do not see much point in a custom zonelist for the policy. Maybe as
> a micro-optimization to save some branches here and there.
>
> If you envision usecases which might want to control the fallback
> allocation strategy, then this would get more complex, because you
> would need a sorted list of zones to try. But that would really require
> some solid usecase, and it should build on top of a trivial
> implementation which really is BIND with the fallback.

I will implement what you suggest; I think it's a good suggestion. Here is
what I mean, though:

-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-		       nodemask_t *nodemask);
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
+		       nodemask_t *nodemask);

Is there any value in keeping two nodemasks as part of the interface?