From mboxrd@z Thu Jan 1 00:00:00 1970
From: jglisse@redhat.com
To: linux-mm@kvack.org
Cc: Andrew Morton, linux-kernel@vger.kernel.org,
	Jérôme Glisse, "Rafael J. Wysocki", Ross Zwisler, Dan Williams,
	Dave Hansen, Haggai Eran, Balbir Singh, "Aneesh Kumar K.V",
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe, John Hubbard,
	Ralph Campbell, Michal Hocko, Jonathan Cameron, Mark Hairgrove,
	Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli
Subject: [RFC PATCH 02/14] mm/hms: heterogeneous memory system (HMS) documentation
Date: Mon, 3 Dec 2018 18:34:57 -0500
Message-Id: <20181203233509.20671-3-jglisse@redhat.com>
In-Reply-To: <20181203233509.20671-1-jglisse@redhat.com>
References: <20181203233509.20671-1-jglisse@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Jérôme Glisse

Add documentation explaining what HMS is and what it is for (see patch
content).

Signed-off-by: Jérôme Glisse
Cc: Rafael J. Wysocki
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Dave Hansen
Cc: Haggai Eran
Cc: Balbir Singh
Cc: Aneesh Kumar K.V
Cc: Benjamin Herrenschmidt
Cc: Felix Kuehling
Cc: Philip Yang
Cc: Christian König
Cc: Paul Blinzer
Cc: Logan Gunthorpe
Cc: John Hubbard
Cc: Ralph Campbell
Cc: Michal Hocko
Cc: Jonathan Cameron
Cc: Mark Hairgrove
Cc: Vivek Kini
Cc: Mel Gorman
Cc: Dave Airlie
Cc: Ben Skeggs
Cc: Andrea Arcangeli
---
 Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 246 insertions(+), 29 deletions(-)

diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
index dbf0f71918a9..bd7c9e8e7077 100644
--- a/Documentation/vm/hms.rst
+++ b/Documentation/vm/hms.rst
@@ -4,32 +4,249 @@
 Heterogeneous Memory System (HMS)
 =================================
 
-System with complex memory topology needs a more versatile memory topology
-description than just node where a node is a collection of memory and CPU.
-In heterogeneous memory system we consider four types of object::
- - target: which is any kind of memory
- - initiator: any kind of device or CPU
- - inter-connect: any kind of links that connects target and initiator
- - bridge: a link between two inter-connects
-
-Properties (like bandwidth, latency, bus width, ...) are define per bridge
-and per inter-connect. Property of an inter-connect apply to all initiators
-which are link to that inter-connect. Not all initiators are link to all
-inter-connect and thus not all initiators can access all memory (this apply
-to CPU too ie some CPU might not be able to access all memory).
-
-Bridges allow initiators (that can use the bridge) to access target for
-which they do not have a direct link with (ie they do not share a common
-inter-connect with the target).
-
-Through this four types of object we can describe any kind of system memory
-topology. To expose this to userspace we expose a new sysfs hierarchy (that
-co-exist with the existing one)::
- - /sys/bus/hms/target* all targets in the system
- - /sys/bus/hms/initiator* all initiators in the system
- - /sys/bus/hms/interconnect* all inter-connects in the system
- - /sys/bus/hms/bridge* all bridges in the system
-
-Inside each bridge or inter-connect directory they are symlinks to targets
-and initiators that are linked to that bridge or inter-connect. Properties
-are defined inside bridge and inter-connect directory.
+Heterogeneous memory systems are becoming more and more the norm. In
+those systems there is not only the main system memory of each node to
+consider, but also device memory and/or a memory hierarchy. Device
+memory can come from a device like a GPU or an FPGA, or from a memory
+only device (persistent memory, or a high density memory device).
+
+A memory hierarchy is when you not only have the main memory but also
+other types of memory, like HBM (High Bandwidth Memory, often stacked
+on the CPU or GPU die), persistent memory or high density memory (ie
+something slower than regular DDR DIMMs but much bigger).
+
+On top of this diversity of memories you also have to account for the
+system bus topology, ie how all CPUs and devices are connected to each
+other. Userspace does not care about the exact physical topology, but
+it does care about the topology from a behavior point of view, ie what
+are all the paths between an initiator (anything that can initiate
+memory access, like a CPU, GPU, FPGA, network controller, ...) and a
+target memory, and what are the properties of each of those paths
+(bandwidth, latency, granularity, ...).
+
+This means that it is no longer sufficient to consider a flat view
+of each node in a system: for maximum performance we need to account
+for all of this new memory and also for the system topology. This is
+why this proposal is unlike the HMAT proposal [1], which tries to
+extend the existing NUMA model to new types of memory. Here we are
+tackling a much more profound change that departs from NUMA.
+
+
+One of the reasons for a radical change is that the advance of
+accelerators like GPUs or FPGAs means that the CPU is no longer the
+only place where computation happens. It is becoming more and more
+common for an application to mix and match different accelerators to
+perform its computation. So we can no longer satisfy ourselves with a
+CPU centric and flat view of a system like NUMA and NUMA distance.
+
+
+HMS tackles these problems through three aspects:
+ 1 - Expose complex system topology and the various kinds of memory
+     to user space, so that applications have a standard way and a
+     single place to get all the information they care about.
+ 2 - A new API for user space to bind, or provide hints to the kernel
+     about, which memory to use for a range of virtual addresses (a
+     new hbind() syscall in the spirit of mbind()).
+ 3 - Kernel side changes to the vm policy code to handle the above.
+
+
+The rest of this document is split in 3 sections. The first section
+talks about complex system topology: what it is, how it is used today
+and how to describe it tomorrow. The second section talks about the
+new API to bind, or provide hints to the kernel for, a range of
+virtual addresses. The third section talks about the new mechanism
+that tracks the bindings and hints provided by user space or device
+drivers inside the kernel.
+
+
+1) Complex system topology and how to represent it
+====================================================
+
+Inside a node you can have a complex topology of memory. For instance
+you can have multiple HBM memories in a node, each HBM tied to a set
+of CPUs (all of which are in the same node). This means that those
+CPUs see a hierarchy of memory: the local fast HBM, which is expected
+to be relatively small compared to main memory, and then the main
+memory itself. New memory technologies might also deepen this
+hierarchy with another level of yet slower but gigantic memory (some
+persistent memory technologies might fall into that category). Another
+example is device memory, and devices themselves can have a hierarchy,
+like HBM on top of the device cores plus main device memory.
+
+On top of that you can have multiple paths to access each memory and
+each path can have different properties (latency, bandwidth, ...).
+Also there is not always symmetry, ie some memory might only be
+accessible by some devices or CPUs, ie not accessible by everyone.
+
+So a flat hierarchy for each node is not capable of representing this
+kind of complexity. To simplify the discussion, and because we do not
+want to single out CPUs from devices, from here on out we will use
+initiator to refer to either a CPU or a device. An initiator is any
+kind of CPU or device that can access memory (ie initiate memory
+access).
+
+At this point an example of such a system might help:
+ - 2 nodes and for each node:
+   - 1 CPU per node with 2 complexes of CPU cores per CPU
+   - one HBM memory for each complex of CPU cores (200GB/s)
+   - CPU core complexes are linked to each other (100GB/s)
+   - main memory (90GB/s)
+   - 4 GPUs, each with:
+     - HBM memory for each GPU (1000GB/s) (not CPU accessible)
+     - GDDR memory for each GPU (500GB/s) (CPU accessible)
+     - connected to the CPU root controller (60GB/s)
+     - connected to the other GPUs (even GPUs from the second
+       node) with a GPU link (400GB/s)
+
+In this example we restrict ourselves to bandwidth and ignore bus
+width or latency. This is just to simplify the discussion, but
+obviously they also factor in.
+
+
+Userspace very much would like to know about this information. For
+instance HPC folks have developed complex libraries to manage this and
+there is wide research on the topic [2] [3] [4] [5]. Today most of the
+work is done by hardcoding things for a specific platform, which is
+somewhat acceptable for HPC folks where the platform stays the same
+for a long period of time.
+
+Roughly speaking I see two broad use cases for topology information.
+The first is virtualization, where you want to segment your hardware
+properly for each vm (binding memory, CPUs and GPUs that are all close
+to each other). The second is applications, many of which can
+partition their workload to minimize exchanges between partitions,
+allowing each partition to be bound to a subset of devices and CPUs
+that are close to each other (for maximum locality). Here it is much
+more than just NUMA distance: you can leverage the memory hierarchy
+and the system topology all together (see [2] [3] [4] [5] for more
+references and details).
+
+So this is not exposing topology just for the sake of cool graphs in
+userspace. There are active users of such information today, and if
+we want to grow and broaden the usage we should provide a unified API
+that standardizes how that information is accessible to everyone.
+
+
+One proposal so far to handle new types of memory is to use CPU-less
+nodes for them [6]. While the same idea can apply to device memory, it
+is still hard to describe multiple paths with different properties in
+such a scheme. While it is backward compatible and requires minimal
+changes, it simply can not convey a complex topology (think any kind
+of random graph, not just a tree like graph).
+
+So HMS uses a new way to expose the system topology to userspace. It
+relies on 4 types of objects:
+ - target: any kind of memory (main memory, HBM, device, ...)
+ - initiator: CPU or device (anything that can access memory)
+ - link: anything that links initiators and targets
+ - bridge: anything that allows a group of initiators to access a
+   remote target (ie a target they are not connected to directly
+   through a link)
+
+Properties like bandwidth, latency, ... are all set per bridge and per
+link. All initiators connected to a link can access any target memory
+also connected to the same link, all with the same link properties.
+
+A link does not need to match physical hardware, ie a single physical
+link can be exposed as one or several software links. This allows to
+model devices connected to the same physical link (like PCIE for
+instance) but not with the same characteristics (like the number of
+lanes or the lane speed in PCIE). The reverse is also true, ie a
+single software exposed link can match multiple physical links.
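+
+To make this more concrete, part of the first node of the example
+above (keeping only one of its GPUs) could be described with a set of
+objects along the following lines. This is only an illustration of the
+model: the names, numbering and exact link layout are made up here::
+
+    targets:
+      T0: HBM of CPU core complex 0
+      T1: HBM of CPU core complex 1
+      T2: node main memory
+      T3: GPU0 HBM  (not CPU accessible)
+      T4: GPU0 GDDR (CPU accessible)
+
+    initiators:
+      I0: CPU core complex 0
+      I1: CPU core complex 1
+      I2: GPU0
+
+    links (bandwidth):
+      L0: I0 <-> T0               200GB/s
+      L1: I1 <-> T1               200GB/s
+      L2: I0, I1 <-> T2            90GB/s
+      L3: I2 <-> T3              1000GB/s
+      L4: I2 <-> T4               500GB/s
+      L5: I0, I1, I2 <-> T2, T4    60GB/s (through the CPU root
+                                           controller)
+
+The GPU to GPU connections (400GB/s), including the ones that cross
+over to the second node, would then be described with more links, or
+with bridge objects when only some initiators can use them.
+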
+Bridges allow an initiator to access a remote link. A bridge connects
+two links to each other and is also specific to a list of initiators
+(ie not all initiators connected to each of the two links can use the
+bridge). Bridges have their own properties (bandwidth, latency, ...),
+and the actual value of each property is the lowest common denominator
+between the bridge and each of the two links.
+
+
+This model allows us to describe any kind of directed graph and thus
+any kind of topology we might see in the future. It also makes it
+easier to add new properties to each object type.
+
+Moreover it can be used to expose devices capable of peer to peer
+access between them. For that, simply have all devices capable of
+peer to peer share a common link, or use the bridge object if the
+peer to peer capability only goes one way for instance.
+
+
+HMS uses the above scheme to expose the system topology through sysfs
+under /sys/bus/hms/ with:
+ - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
+   each has a UID and you find the usual values in that folder
+   (node id, size, ...)
+
+ - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
+   (CPU or device), each has a HMS UID but also a CPU id for CPUs
+   (which matches the CPU id in /sys/bus/cpu/); for a device you
+   have a path that can be the PCIE bus ID for instance
+
+ - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
+   UID and a file per property (bandwidth, latency, ...); you also
+   find a symlink to every target and initiator connected to that
+   link.
+
+ - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
+   a UID and a file per property (bandwidth, latency, ...); you
+   also find a symlink to all initiators that can use that bridge.
+
+To help with forward compatibility each object has a version value and
+it is mandatory for user space to only use targets or initiators whose
+version it supports. For instance if user space only knows what
+version 1 means and sees a target with version 2, then user space must
+ignore that target as if it did not exist.
+
+Mandating that allows the addition of new properties that break
+backward compatibility, ie user space must know how a new property
+affects the object to be able to use it safely.
+
+Main memory of each node is exposed under a common target. For now
+device drivers are responsible for registering the memory they want to
+expose through that scheme, but in the future that information might
+come from the system firmware (this is a different discussion).
+
+
+
+2) hbind(): bind ranges of virtual addresses to heterogeneous memory
+======================================================================
+
+Instead of taking a bitmap of nodes like mbind(), hbind() takes an
+array of uids, each uid identifying a unique memory target in the new
+memory topology description. User space also provides an array of
+modifiers. A modifier can be seen as the flags parameter of mbind(),
+but here we use an array so that user space can not only supply a
+modifier but also a value with it. This should allow the API to grow
+more features in the future. The kernel should return -EINVAL if it is
+given an unknown modifier and just ignore the call altogether, forcing
+user space to restrict itself to the modifiers supported by the kernel
+it is running on (I know I am dreaming about well behaved user space).
+
+
+Note that none of this is exclusive of automatic memory placement like
+autonuma. I also believe that we will see something similar to
+autonuma for device memory.
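+
+To make the description above concrete, a call to hbind() could look
+roughly like the following sketch. The prototype, structure layout and
+modifier names here are made up for this document and are not part of
+any proposed ABI::
+
+    /* Hypothetical user space example, none of these names are final. */
+    #include <stddef.h>
+    #include <stdint.h>
+
+    #define HBIND_MOD_MIGRATE 1     /* hypothetical modifier id */
+
+    struct hms_modifier {
+        uint32_t id;        /* eg the hypothetical HBIND_MOD_MIGRATE */
+        uint64_t value;     /* value associated with that modifier */
+    };
+
+    /*
+     * Hypothetical prototype, for illustration only:
+     * long hbind(void *addr, size_t len, const uint64_t *targets,
+     *            unsigned long ntargets, const struct hms_modifier *mods,
+     *            unsigned long nmods);
+     */
+
+    int bind_range_to_hbm(void *addr, size_t len)
+    {
+        /* Target UIDs as found under /sys/bus/hms/devices/v%version-%id-target */
+        uint64_t targets[2] = { 12, 13 };
+        struct hms_modifier mods[1] = {
+            { .id = HBIND_MOD_MIGRATE, .value = 0 },
+        };
+
+        /*
+         * Bind [addr, addr + len) to the two targets. A modifier that
+         * is unknown to the running kernel is expected to fail the
+         * whole call with EINVAL rather than being silently ignored.
+         */
+        return hbind(addr, len, targets, 2, mods, 1);
+    }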
+
+
+3) Tracking and applying heterogeneous memory policies
+========================================================
+
+The current memory policy infrastructure is node oriented. Instead of
+changing that, and risking breakage and regressions, HMS adds a new
+heterogeneous policy tracking infrastructure. The expectation is that
+existing applications can keep using mbind() with all the existing
+infrastructure undisturbed and unaffected, while new applications will
+use the new API and should avoid mixing and matching both (as they can
+achieve the same thing with the new API).
+
+Also the policy is not directly tied to the vma structure, for a few
+reasons:
+ - avoid having to split a vma for a policy that does not cover the
+   full vma
+ - avoid changing too much vma code
+ - avoid growing the vma structure with an extra pointer
+
+The overall design is simple: on an hbind() call, a hms policy
+structure is created for the supplied range and HMS uses the callback
+associated with the target memory. This callback is provided by the
+device driver for device memory, or by core HMS for regular main
+memory. The callback can decide to migrate the range to the target
+memories or to do nothing (this can be influenced by flags provided to
+hbind() too).
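+
+A very rough sketch of the kind of structures this implies is given
+below. It is purely illustrative: the real structure and callback
+names belong to the implementation patches and may differ::
+
+    /* Illustrative only, names and layout are not the actual ones. */
+    struct hms_target;          /* per target object, defined elsewhere */
+
+    struct hms_policy_range {
+        unsigned long start;            /* start of the bound range */
+        unsigned long end;              /* end of the bound range */
+        struct hms_target **targets;    /* targets given to hbind() */
+        unsigned int ntargets;          /* number of targets */
+    };
+
+    struct hms_target_ops {
+        /*
+         * Called when hbind() selects this target for a range. The
+         * callback may migrate the range right away, arrange for a
+         * later migration, or do nothing at all.
+         */
+        int (*bind)(struct hms_target *target, struct mm_struct *mm,
+                    struct hms_policy_range *range, unsigned long flags);
+    };
-- 
2.17.2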