From: Ben Widawsky
To: linux-mm
Cc: Ben Widawsky, Andi Kleen, Andrew Morton, Christoph Lameter, Dan Williams,
    Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
    Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
    Lee Schermerhorn, Li Xinhai, Mel Gorman, Michal Hocko, Mike Kravetz,
    Mina Almasry, Tejun Heo, Vlastimil Babka
Subject: [PATCH 00/18] multiple preferred nodes
Date: Fri, 19 Jun 2020 09:24:07 -0700
Message-Id: <20200619162425.1052382-1-ben.widawsky@intel.com>

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Like the
MPOL_BIND interface, it works over a set of nodes.

Summary:
1-2: Random fixes I found along the way
3-4: Logic to handle many preferred nodes in page allocation
5-9: Plumbing to allow multiple preferred nodes in mempolicy
10-13: Teach page allocation APIs about nodemasks
14: Provide a helper to generate preferred nodemasks
15: Have page allocation callers generate preferred nodemasks
16-17: Flip the switch to have __alloc_pages_nodemask take preferred mask.
18: Expose the new uapi

Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory
usage models, which I've lovingly named:
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
    requirements, allowing preference to be given to all nodes with "fast"
    memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
    perhaps slow memory), but doesn't care which node it runs on. The
    application can prefer a set of nodes and then xpu bind to the local node
    (cpu, accelerator, etc). This reverses how nodes are chosen today, where
    the kernel attempts to use memory local to the CPU whenever possible.
    Instead, this will attempt to use the accelerator local to the memory.
2.  The Tortoise - The administrator (or the application itself) is aware it
    only needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
allocation fails on all nodes in the nodemask.
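To illustrate the difference between bind and the proposed mode, here is a toy
userspace model. It is only a sketch under stated assumptions and is not code
from this series: `struct toy_system`, `toy_alloc_bind()` and
`toy_alloc_preferred_many()` are hypothetical names, nodes are tried in
ascending numeric order (the kernel walks zonelists ordered by distance
instead), and each "allocation" simply takes one page from a per-node counter.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 8

/* Hypothetical toy model, not kernel code: free pages per NUMA node. */
struct toy_system {
	unsigned long free_pages[TOY_MAX_NODES];
};

/* Return the lowest-numbered node in @mask that has a free page, or -1. */
static int toy_first_free_node(const struct toy_system *sys, unsigned long mask)
{
	assert(sys != NULL);

	for (int node = 0; node < TOY_MAX_NODES; node++) {
		if ((mask & (1UL << node)) && sys->free_pages[node] > 0)
			return node;
	}
	return -1;
}

/* Bind-like behaviour: only nodes in @mask are eligible; -1 on failure. */
int toy_alloc_bind(struct toy_system *sys, unsigned long mask)
{
	int node = toy_first_free_node(sys, mask);

	if (node >= 0)
		sys->free_pages[node]--;
	return node;
}

/*
 * Preferred-many-like behaviour: try the preferred nodes first and, if all
 * of them are exhausted, fall back to any remaining node. Returns the node
 * the page came from, or -1 if the whole toy system is out of pages.
 */
int toy_alloc_preferred_many(struct toy_system *sys, unsigned long preferred)
{
	int node = toy_first_free_node(sys, preferred);

	if (node < 0)
		node = toy_first_free_node(sys, ~preferred);
	if (node >= 0)
		sys->free_pages[node]--;
	return node;
}
```

With one free page on node 0, five on node 3, and a mask of 0x3, two
consecutive bind-like allocations return node 0 and then fail, while two
preferred-many-like allocations return node 0 and then fall back to node 3.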
Like MPOL_BIND, a nodemask is given. Inherently this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
   not needed for expected usecases.
2. A flag for bind to allow falling back to other nodes. This confuses the
   notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
   friendlier API as well as a solution for more customized usage. It's unknown
   if it's worth the complexity to support this. Here is sample code for how
   this might work:

> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);

---

Cc: Andi Kleen
Cc: Andrew Morton
Cc: Christoph Lameter
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: David Rientjes
Cc: Jason Gunthorpe
Cc: Johannes Weiner
Cc: Jonathan Corbet
Cc: Kuppuswamy Sathyanarayanan
Cc: Lee Schermerhorn
Cc: Li Xinhai
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Mina Almasry
Cc: Tejun Heo
Cc: Vlastimil Babka

Ben Widawsky (14):
  mm/mempolicy: Add comment for missing LOCAL
  mm/mempolicy: Use node_mem_id() instead of node_id()
  mm/page_alloc: start plumbing multi preferred node
  mm/page_alloc: add preferred pass to page allocation
  mm: Finish handling MPOL_PREFERRED_MANY
  mm: clean up alloc_pages_vma (thp)
  mm: Extract THP hugepage allocation
  mm/mempolicy: Use __alloc_page_node for interleaved
  mm: kill __alloc_pages
  mm/mempolicy: Introduce policy_preferred_nodes()
  mm: convert callers of __alloc_pages_nodemask to pmask
  alloc_pages_nodemask: turn preferred nid into a nodemask
  mm: Use less stack for page allocations
  mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (4):
  mm/mempolicy: convert single preferred_node to full nodemask
  mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  mm/mempolicy: allow preferred code to take a nodemask
  mm/mempolicy: refactor rebind code for PREFERRED_MANY

 .../admin-guide/mm/numa_memory_policy.rst     |  22 +-
 include/linux/gfp.h                           |  19 +-
 include/linux/mempolicy.h                     |   4 +-
 include/linux/migrate.h                       |   4 +-
 include/linux/mmzone.h                        |   3 +
 include/uapi/linux/mempolicy.h                |   6 +-
 mm/hugetlb.c                                  |  10 +-
 mm/internal.h                                 |   1 +
 mm/mempolicy.c                                | 271 +++++++++++++-----
 mm/page_alloc.c                               | 179 +++++++++++-
 10 files changed, 403 insertions(+), 116 deletions(-)

-- 
2.27.0