Date: Thu, 29 Jul 2021 15:09:18 +0800
From: Feng Tang <feng.tang@intel.com>
To: Michal Hocko
Cc: linux-mm@kvack.org, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	Andrea Arcangeli, Mel Gorman, Mike Kravetz, Randy Dunlap,
	Vlastimil Babka, Andi Kleen, Dan Williams, ying.huang@intel.com,
	Dave Hansen
Subject: Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
Message-ID: <20210729070918.GA96680@shbuild999.sh.intel.com>
References: <1626077374-81682-1-git-send-email-feng.tang@intel.com>
	<1626077374-81682-2-git-send-email-feng.tang@intel.com>
	<20210728141156.GC43486@shbuild999.sh.intel.com>
On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > [Sorry for a late review]
> > 
> > Not at all. Thank you for all your reviews and suggestions from v1
> > to v6!
> > 
> > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > [...]
> > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > >  {
> > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > >  		nd = first_node(policy->nodes);
> > > >  	} else {
> > > >  		/*
> > > 
> > > Do we really want to have the preferred node to be always the first node
> > > in the node mask? Shouldn't that strive for locality as well? Existing
> > > callers already prefer numa_node_id() - aka the local node - and I believe
> > > we shouldn't just throw that away here.
> > 
> > I think it's about the difference between the 'local' and 'prefer/prefer-many'
> > policies. There are different kinds of memory HW: HBM (High Bandwidth
> > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > price, bandwidth, speed etc. A platform may have two, or all three of
> > these types, and there are real use cases which want memory to come
> > from the 'preferred' node/nodes rather than the local node.
> > 
> > And good point about the 'local node': if the 'prefer-many' policy's
> > nodemask has the local node set, we should pick it rather than this
> > 'first_node', and the same semantic also applies to the other
> > several places you pointed out. Or do I misunderstand your point?
> 
> Yeah. Essentially what I am trying to say is that for
> MPOL_PREFERRED_MANY you simply want to return the given node without any
> alteration. That node will be used for the fallback zonelist and the
> nodemask would make sure we won't get out of the policy.

I think I got your point now :)

With the current mainline code, the 'prefer' policy will return the
preferred node. For 'prefer-many', we would like to keep a similar
semantic: the node preference order is 'preferred' > 'local' > all
other nodes.

There is a customer use case whose platform has both DRAM and cheaper,
bigger and slower PMEM. They analyzed the hotness of their huge data
set, and they want to put the huge cold data into PMEM and only fall
back to DRAM as the last step. The HW topology can be simplified like
this:

  Socket 0: Node 0 (CPU + 64GB DRAM), Node 2 (512GB PMEM)
  Socket 1: Node 1 (CPU + 64GB DRAM), Node 3 (512GB PMEM)

E.g. they want to allocate memory for cold application data with the
'prefer-many' policy + a 0xC nodemask (N2+N3 PMEM nodes), so no matter
whether the application is running on Node 0 or Node 1, the 'local'
node only has DRAM, which is not their preference, and they want a
preferred --> local --> others order.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs
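
P.S. To make the use case above concrete, here is a minimal user-space
sketch of how a task could ask for the N2+N3 preference via
set_mempolicy(2). It is only an illustration under stated assumptions:
it assumes a kernel with this series applied, and it defines
MPOL_PREFERRED_MANY locally because released uapi headers may not carry
the new mode yet (the numeric value used here is an assumption and
should be taken from the patched headers).

/* build: gcc -o prefer_many prefer_many.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5	/* assumed value; check the patched uapi headers */
#endif

int main(void)
{
	/* 0xC: prefer the PMEM nodes N2 and N3 from the topology above */
	unsigned long nodemask = (1UL << 2) | (1UL << 3);

	/* set_mempolicy(mode, nodemask, maxnode) sets the calling task's policy */
	if (syscall(SYS_set_mempolicy, MPOL_PREFERRED_MANY, &nodemask,
		    sizeof(nodemask) * 8) != 0) {
		perror("set_mempolicy(MPOL_PREFERRED_MANY)");
		return 1;
	}

	/*
	 * From here on, new allocations for this task prefer N2/N3 and only
	 * fall back to the DRAM nodes (N0/N1) when the PMEM nodes cannot
	 * satisfy the request.
	 */
	printf("policy set: MPOL_PREFERRED_MANY on nodes 2-3\n");
	return 0;
}

The example only sets the task policy; which node the fallback zonelist
starts from at allocation time is exactly what the policy_node()
discussion above is about.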