From: Ben Widawsky
To: linux-mm
Cc: Dave Hansen, Andrew Morton, Ben Widawsky
Subject: [PATCH 05/18] mm/mempolicy: convert single preferred_node to full nodemask
Date: Fri, 19 Jun 2020 09:24:12 -0700
Message-Id: <20200619162425.1052382-6-ben.widawsky@intel.com>
In-Reply-To: <20200619162425.1052382-1-ben.widawsky@intel.com>
References: <20200619162425.1052382-1-ben.widawsky@intel.com>

From: Dave Hansen

The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask. If more than one bit is set, bits
after the first are ignored.
Internally, this is implemented as a single integer:
mempolicy->preferred_node.

This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU. However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location. For instance,
someone might want some high-bandwidth memory but not care about
the CPU next to which it is allocated. Or, they want a cheap,
high-capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode. In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior.

To get that behavior, an MPOL_PREFERRED mode is desirable, but one
that honors multiple nodes set in the nodemask.

The first step in that direction is to be able to internally store
multiple preferred nodes, which is implemented in this patch.

This should not have any functional changes and just switches the
internal representation of mempolicy->preferred_node from an
integer to a nodemask called 'mempolicy->preferred_nodes'.

This is not a pie-in-the-sky dream for an API. This was a response to a
specific ask from more than one group at Intel. Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind. These are known to suffer
   from SIGSEGVs when memory is low on targeted memory "kinds" that
   span more than one node. The MCDRAM on a Xeon Phi in "Cluster on
   Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive and
   fast" (DRAM). However, they do not want to experience allocation
   failures when the targeted type is unavailable.
3. Allocate-then-run. Generally, we let the process scheduler decide
   on which physical CPU to run a task.
   That location provides a default allocation policy, and memory
   availability is not generally considered when placing tasks. For
   situations where memory is valuable and constrained, some users
   want to allocate memory first, *then* allocate compute resources
   close to the allocation. This is the reverse of the normal (CPU)
   model. Accelerators such as GPUs that operate on core-mm-managed
   memory are interested in this model.

v2:
Fix spelling errors in commit message. (Ben)
clang-format. (Ben)
Integrated bit from another patch. (Ben)
Update the docs to reflect the internal data structure change (Ben)
Don't advertise MPOL_PREFERRED_MANY in UAPI until we can handle it (Ben)
Added more to the commit message (Dave)

Cc: Andrew Morton
Signed-off-by: Dave Hansen (v2)
Co-developed-by: Ben Widawsky
Signed-off-by: Ben Widawsky
---
 .../admin-guide/mm/numa_memory_policy.rst     |  6 +--
 include/linux/mempolicy.h                     |  4 +-
 mm/mempolicy.c                                | 40 ++++++++++---------
 3 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a1499c..1ad020c459b8 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -205,9 +205,9 @@ MPOL_PREFERRED
 	of increasing distance from the preferred node based on
 	information provided by the platform firmware.
 
-	Internally, the Preferred policy uses a single node--the
-	preferred_node member of struct mempolicy.  When the internal
-	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
+	Internally, the Preferred policy uses a nodemask--the
+	preferred_nodes member of struct mempolicy.  When the internal
+	mode flag MPOL_F_LOCAL is set, the preferred_nodes are ignored
 	and the policy is interpreted as local allocation.  "Local"
 	allocation policy can be viewed as a Preferred policy that
 	starts at the node containing the cpu where the allocation
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ea9c15b60a96..c66ea9f4c61e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -47,8 +47,8 @@ struct mempolicy {
 	unsigned short mode;	/* See MPOL_* above */
 	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
 	union {
-		short		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave/bind */
+		nodemask_t preferred_nodes; /* preferred */
+		nodemask_t nodes; /* interleave/bind */
 		/* undefined for default */
 	} v;
 	union {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 99e0f3f9c4a6..e0b576838e57 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -205,7 +205,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	else if (nodes_empty(*nodes))
 		return -EINVAL; /* no allowed nodes */
 	else
-		pol->v.preferred_node = first_node(*nodes);
+		pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
 	return 0;
 }
 
@@ -345,22 +345,26 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
 						const nodemask_t *nodes)
 {
 	nodemask_t tmp;
+	nodemask_t preferred_node;
+
+	/* MPOL_PREFERRED uses only the first node in the mask */
+	preferred_node = nodemask_of_node(first_node(*nodes));
 
 	if (pol->flags & MPOL_F_STATIC_NODES) {
 		int node = first_node(pol->w.user_nodemask);
 
 		if (node_isset(node, *nodes)) {
-			pol->v.preferred_node = node;
+			pol->v.preferred_nodes = nodemask_of_node(node);
 			pol->flags &= ~MPOL_F_LOCAL;
 		} else
 			pol->flags |= MPOL_F_LOCAL;
 	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
-		pol->v.preferred_node = first_node(tmp);
+		pol->v.preferred_nodes = tmp;
 	} else if (!(pol->flags & MPOL_F_LOCAL)) {
-		pol->v.preferred_node = node_remap(pol->v.preferred_node,
-						   pol->w.cpuset_mems_allowed,
-						   *nodes);
+		nodes_remap(tmp, pol->v.preferred_nodes,
+			    pol->w.cpuset_mems_allowed, preferred_node);
+		pol->v.preferred_nodes = tmp;
 		pol->w.cpuset_mems_allowed = *nodes;
 	}
 }
@@ -913,7 +917,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 		break;
 	case MPOL_PREFERRED:
 		if (!(p->flags & MPOL_F_LOCAL))
-			node_set(p->v.preferred_node, *nodes);
+			*nodes = p->v.preferred_nodes;
 		/* else return empty node mask for local allocation */
 		break;
 	default:
@@ -1906,9 +1910,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 static int policy_node(gfp_t gfp, struct mempolicy *policy,
 								int nd)
 {
-	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
-		nd = policy->v.preferred_node;
-	else {
+	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) {
+		nd = first_node(policy->v.preferred_nodes);
+	} else {
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
 		 * because we might easily break the expectation to stay on the
@@ -1953,7 +1957,7 @@ unsigned int mempolicy_slab_node(void)
 		/*
 		 * handled MPOL_F_LOCAL above
 		 */
-		return policy->v.preferred_node;
+		return first_node(policy->v.preferred_nodes);
 
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
@@ -2087,7 +2091,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 		if (mempolicy->flags & MPOL_F_LOCAL)
 			nid = numa_node_id();
 		else
-			nid = mempolicy->v.preferred_node;
+			nid = first_node(mempolicy->v.preferred_nodes);
 		init_nodemask_of_node(mask, nid);
 		break;
 
@@ -2225,7 +2229,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
+			hpage_node = first_node(pol->v.preferred_nodes);
 
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2364,7 +2368,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 		/* a's ->flags is the same as b's */
 		if (a->flags & MPOL_F_LOCAL)
 			return true;
-		return a->v.preferred_node == b->v.preferred_node;
+		return nodes_equal(a->v.preferred_nodes, b->v.preferred_nodes);
 	default:
 		BUG();
 		return false;
@@ -2508,7 +2512,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		if (pol->flags & MPOL_F_LOCAL)
 			polnid = numa_node_id();
 		else
-			polnid = pol->v.preferred_node;
+			polnid = first_node(pol->v.preferred_nodes);
 		break;
 
 	case MPOL_BIND:
@@ -2825,7 +2829,7 @@ void __init numa_policy_init(void)
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
 			.flags = MPOL_F_MOF | MPOL_F_MORON,
-			.v = { .preferred_node = nid, },
+			.v = { .preferred_nodes = nodemask_of_node(nid), },
 		};
 	}
 
@@ -2991,7 +2995,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (mode != MPOL_PREFERRED)
 			new->v.nodes = nodes;
 		else if (nodelist)
-			new->v.preferred_node = first_node(nodes);
+			new->v.preferred_nodes = nodemask_of_node(first_node(nodes));
 		else
 			new->flags |= MPOL_F_LOCAL;
 
@@ -3044,7 +3048,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		if (flags & MPOL_F_LOCAL)
 			mode = MPOL_LOCAL;
 		else
-			node_set(pol->v.preferred_node, nodes);
+			nodes_or(nodes, nodes, pol->v.preferred_nodes);
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-- 
2.27.0
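
For readers following the representation change outside the kernel
tree, here is a minimal userspace sketch in C. It is not kernel code:
sketch_nodemask_t and the sketch_* helpers are hypothetical stand-ins
(capped at 64 nodes) for nodemask_t, nodemask_of_node(), first_node()
and nodes_equal(). It only illustrates how a single preferred node can
be stored as a one-bit mask and read back with a first-node lookup,
which is the pattern the hunks above rely on to keep behavior the same.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the kernel's nodemask_t:
 * one bit per NUMA node, limited to 64 nodes for this sketch.
 */
#define SKETCH_MAX_NODES 64

typedef struct {
	uint64_t bits;
} sketch_nodemask_t;

/* Analogue of nodemask_of_node(): a mask with only 'node' set. */
static inline sketch_nodemask_t sketch_mask_of_node(int node)
{
	assert(node >= 0 && node < SKETCH_MAX_NODES);
	return (sketch_nodemask_t){ .bits = UINT64_C(1) << node };
}

/*
 * Analogue of first_node(): the lowest set bit, or SKETCH_MAX_NODES
 * when the mask is empty.
 */
static inline int sketch_first_node(sketch_nodemask_t mask)
{
	for (int node = 0; node < SKETCH_MAX_NODES; node++)
		if (mask.bits & (UINT64_C(1) << node))
			return node;
	return SKETCH_MAX_NODES;
}

/* Analogue of nodes_equal(): compare two masks bit for bit. */
static inline int sketch_nodes_equal(sketch_nodemask_t a, sketch_nodemask_t b)
{
	return a.bits == b.bits;
}
```

With these helpers, storing node 3 as sketch_mask_of_node(3) and
reading it back with sketch_first_node() yields 3 again, much as
policy_node() still returns a single node after the conversion.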