From: Ben Widawsky
To: linux-mm, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Dave Hansen, Ben Widawsky, Andi Kleen, Andrea Arcangeli,
    Andrew Morton, Dan Williams, Dave Hansen, David Hildenbrand,
    David Rientjes, Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka
Subject: [PATCH v2 00/12] Introduced multi-preference mempolicy
Date: Tue, 30 Jun 2020 14:25:05 -0700
Message-Id: <20200630212517.308045-1-ben.widawsky@intel.com>

Significant changes since v1:
* Dropped patch to replace numa_node_id in some places (mhocko)
* Dropped all the page allocation patches in favor of new mechanism to use
  fallbacks. (mhocko)
* Dropped the special snowflake preferred node algorithm (bwidawsk)
* If the preferred node fails, ALL nodes are rechecked instead of just the
  non-preferred nodes.

In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was.
I'm open to more feedback here, but my slight preference is to use a new API,
as it ensures that people using it are entirely aware of what they're doing
and not accidentally misusing the old interface. (In a similar way to how
MPOL_LOCAL was introduced).

In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I hadn't heard much emphatic support one way or
another, so I've left that too.

v2 Summary:
1: Random fix I found along the way
2-5: Represent node preference as a mask internally
6-7: Treat many preferred like bind
8-11: Handle page allocation for the new policy
12: Enable the uapi

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set
a preference for nodes which will fulfil memory allocation requests. Unlike the
MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo, and
memhog. They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
    requirements, allowing preference to be given to all nodes with "fast"
    memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
    perhaps slow memory), but doesn't care which node it runs on. The
    application can prefer a set of nodes and then xpu bind to the local node
    (cpu, accelerator, etc). This reverses how nodes are chosen today, where
    the kernel attempts to use memory local to the CPU whenever possible.
    Instead, this will attempt to use the accelerator local to the memory.
2. The Tortoise - The administrator (or the application itself) is aware it
   only needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
allocation fails on all nodes in the nodemask.

Like MPOL_BIND, a nodemask is given. Inherently this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
>
> /* Mimic interleave policy, but have fallback */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
   not needed for expected usecases.
2. A flag for bind to allow falling back to other nodes. This confuses the
   notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
   friendlier API as well as a solution for more customized usage. It's unknown
   if it's worth the complexity to support this.
   Here is sample code for how this might work:

> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>

Cc: Andi Kleen
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: David Rientjes
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Randy Dunlap
Cc: Vlastimil Babka

Ben Widawsky (8):
  mm/mempolicy: Add comment for missing LOCAL
  mm/mempolicy: kill v.preferred_nodes
  mm/mempolicy: handle MPOL_PREFERRED_MANY like BIND
  mm/mempolicy: Create a page allocator for policy
  mm/mempolicy: Thread allocation for many preferred
  mm/mempolicy: VMA allocation for many preferred
  mm/mempolicy: huge-page allocation for many preferred
  mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (4):
  mm/mempolicy: convert single preferred_node to full nodemask
  mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  mm/mempolicy: allow preferred code to take a nodemask
  mm/mempolicy: refactor rebind code for PREFERRED_MANY

 .../admin-guide/mm/numa_memory_policy.rst     |  22 +-
 include/linux/mempolicy.h                     |   6 +-
 include/uapi/linux/mempolicy.h                |   6 +-
 mm/hugetlb.c                                  |  20 +-
 mm/mempolicy.c                                | 273 ++++++++++++------
 5 files changed, 222 insertions(+), 105 deletions(-)

-- 
2.27.0
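
For reference, the nodemask values used in the cover letter (0x3 for the first
two nodes, 0xaa for the interleave-like case, and the node list given to
`numactl -P 0,3,4`) can be derived from node IDs as sketched below. This is
only an illustration: `nodemask_from_ids` and `NODEMASK_BITS` are hypothetical
names, not part of the series or of libnuma, and the sketch assumes the whole
mask fits in a single unsigned long. It does not call set_mempolicy(2) itself.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: the mask fits in one unsigned long, as in the examples above. */
#define NODEMASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: build a one-word nodemask from a list of node IDs.
 * Returns 0 on success, or -EINVAL if a node ID does not fit in one word.
 * On failure, *mask is left untouched.
 */
static int nodemask_from_ids(const int *ids, size_t count, unsigned long *mask)
{
	unsigned long m = 0;

	for (size_t i = 0; i < count; i++) {
		if (ids[i] < 0 || (size_t)ids[i] >= NODEMASK_BITS)
			return -EINVAL;
		/* Setting a bit records membership only, not position in ids[]. */
		m |= 1UL << ids[i];
	}

	*mask = m;
	return 0;
}
```

Because each node only sets a bit, the lists {0, 3, 4} and {4, 0, 3} produce
the same mask, which is the loss of ordering noted above. With the series
applied, the resulting word would be passed as the nodemask argument of
set_mempolicy(2), together with the new mode and a suitable maxnode.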