Date: Mon, 29 Jun 2020 12:16:52 +0200
From: Michal Hocko
To: Ben Widawsky
Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter, Dan Williams,
	Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
	Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
	Lee Schermerhorn, Li Xinhai, Mel Gorman, Mike Kravetz, Mina Almasry,
	Tejun Heo, Vlastimil Babka, linux-api@vger.kernel.org
Subject: Re: [PATCH 00/18] multiple preferred nodes
Message-ID: <20200629101652.GG32461@dhcp22.suse.cz>
References: <20200619162425.1052382-1-ben.widawsky@intel.com>
	<20200622070957.GB31426@dhcp22.suse.cz>
	<20200623112048.GR31426@dhcp22.suse.cz>
	<20200623161211.qjup5km5eiisy5wy@intel.com>
	<20200624075216.GC1320@dhcp22.suse.cz>
	<20200626213905.dpu2rgevazmisvhj@intel.com>
In-Reply-To: <20200626213905.dpu2rgevazmisvhj@intel.com>

On Fri 26-06-20 14:39:05, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would also be great to provide a high-level semantic description
> > > > here. I have only glanced through the patches very quickly, and they
> > > > are not trivial to follow through the many incremental steps, so the
> > > > higher-level intention is easily lost.
> > > >
> > > > Do I get it right that the default semantic is essentially
> > > > 	- allocate the page from the given nodemask (with
> > > > 	  __GFP_RETRY_MAYFAIL semantics)
> > > > 	- fall back to a numa-unrestricted allocation with the default
> > > > 	  numa policy on failure
> > > >
> > > > Or are there any usecases that need to tune how hard the preference
> > > > is kept over the fallback?
> > >
> > > tl;dr is: yes, and no usecases.
> >
> > OK, then I am wondering why the change has to be so involved.
> > Except for
> > the syscall plumbing, the only real change to the allocator path would
> > be something like
> >
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > 	if (unlikely(policy->mode == MPOL_BIND ||
> > 		     policy->mode == MPOL_PREFERRED_MANY) &&
> > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > 		return &policy->v.nodes;
> >
> > 	return NULL;
> > }
> >
> > alloc_pages_current
> >
> > 	if (pol->mode == MPOL_INTERLEAVE)
> > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > 	else {
> > 		gfp_t gfp_attempt = gfp;
> >
> > 		/*
> > 		 * Make sure the first allocation attempt will try hard
> > 		 * but eventually fail without the OOM killer or other
> > 		 * disruption before falling back to the full nodemask.
> > 		 */
> > 		if (pol->mode == MPOL_PREFERRED_MANY)
> > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;
> >
> > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > 				policy_node(gfp, pol, numa_node_id()),
> > 				policy_nodemask(gfp, pol));
> > 		if (!page && pol->mode == MPOL_PREFERRED_MANY)
> > 			page = __alloc_pages_nodemask(gfp, order,
> > 					numa_node_id(), NULL);
> > 	}
> >
> > 	return page;
> >
> > with something similar (well, slightly more hairy) in alloc_pages_vma.
> >
> > Or do I miss something that really requires a more involved approach,
> > like building custom zonelists and other larger changes to the
> > allocator?
>
> Hi Michal,
>
> I'm mostly done implementing this change. It looks good, and so far I
> think it's functionally equivalent. One thing, though: above you use
> NULL for the fallback. That actually should not be NULL, because of the
> logic in policy_node to restrict zones and to obey cpusets. I've
> implemented it that way, but I was hoping someone with a deeper
> understanding and more experience could confirm that is the correct
> thing to do.

Cpusets are plumbed into the allocator directly. Have a look at the
__cpuset_zone_allowed call inside get_page_from_freelist. Anyway,
functionally what you are looking for here is that the fallback
allocation behaves exactly as if there were no mempolicy in place, and
that is expressed by the NULL nodemask. The rest is done
automagically...

-- 
Michal Hocko
SUSE Labs
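
For illustration, here is a minimal userspace sketch of the semantics
discussed in the thread: hand the kernel a set of preferred nodes and
let it fall back on its own. This is a hedged sketch, not code from the
series: set_mempolicy(2) and the established MPOL_* modes are the
existing interface, while MPOL_PREFERRED_MANY and the numeric value
used below are assumptions taken from the series under review and may
change before merging.

	/* Build with: gcc sketch.c -lnuma */
	#include <numaif.h>	/* set_mempolicy() wrapper from libnuma */
	#include <stdio.h>
	#include <stdlib.h>

	#ifndef MPOL_PREFERRED_MANY
	#define MPOL_PREFERRED_MANY 5	/* assumed value; not yet in uapi headers */
	#endif

	int main(void)
	{
		/*
		 * Prefer nodes 0 and 2. Per the description above, the
		 * kernel first tries hard within this mask (with
		 * __GFP_RETRY_MAYFAIL semantics) and then falls back to
		 * an unrestricted allocation.
		 */
		unsigned long nodemask = (1UL << 0) | (1UL << 2);

		if (set_mempolicy(MPOL_PREFERRED_MANY, &nodemask,
				  8 * sizeof(nodemask)) != 0) {
			perror("set_mempolicy");
			return EXIT_FAILURE;
		}

		/*
		 * Pages this task first-touches from here on prefer
		 * nodes 0 and 2 but may land elsewhere under pressure.
		 */
		return EXIT_SUCCESS;
	}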
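
On the cpuset point: the check lives in the zone iteration of
get_page_from_freelist() in mm/page_alloc.c, so it applies whether or
not a mempolicy nodemask is passed down. The following is a paraphrased
sketch of that loop from kernels of roughly this era, not a verbatim
excerpt; field and macro names vary between versions.

	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->highest_zoneidx, ac->nodemask) {
		/*
		 * Skip zones outside the task's cpuset even when the
		 * mempolicy nodemask (ac->nodemask) is NULL.
		 */
		if (cpusets_enabled() &&
		    (alloc_flags & ALLOC_CPUSET) &&
		    !__cpuset_zone_allowed(zone, gfp_mask))
			continue;
		/* ... watermark checks and rmqueue() follow ... */
	}

This is why the NULL-nodemask fallback discussed above still respects
cpusets: the restriction is enforced per zone inside the allocator
itself, not through the mempolicy's nodemask.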