* [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
@ 2018-12-03 23:34 jglisse
  2018-12-03 23:34 ` [RFC PATCH 01/14] mm/hms: heterogeneous memory system (sysfs infrastructure) jglisse
                   ` (16 more replies)
  0 siblings, 17 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler, Keith Busch,
	Dan Williams, Dave Hansen, Haggai Eran, Balbir Singh,
	Aneesh Kumar K . V, Benjamin Herrenschmidt, Felix Kuehling,
	Philip Yang, Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

From: Jérôme Glisse <jglisse@redhat.com>

Heterogeneous memory systems are becoming more and more the norm. In
those systems there is not only the main system memory for each node,
but also device memory and/or a memory hierarchy to consider. Device
memory can come from a device like a GPU, FPGA, ... or from a memory
only device (persistent memory, or high density memory device).

A memory hierarchy is when you have not only the main memory but also
other types of memory like HBM (High Bandwidth Memory, often stacked
on the CPU die or GPU die), persistent memory or high density memory
(ie something slower than regular DDR DIMM but much bigger).

On top of this diversity of memories you also have to account for the
system bus topology, ie how all CPUs and devices are connected to each
other. Userspace does not care about the exact physical topology but
cares about the topology from a behavior point of view, ie what are all
the paths between an initiator (anything that can initiate memory
access like CPU, GPU, FPGA, network controller ...) and a target memory
and what are the properties of each of those paths (bandwidth, latency,
granularity, ...).

This means that it is no longer sufficient to consider a flat view
for each node in a system. For maximum performance we need to
account for all of this new memory and also for the system topology.
This is why this proposal is unlike the HMAT proposal [1], which
tries to extend the existing NUMA for new types of memory. Here we
are tackling a much more profound change that departs from NUMA.


One of the reasons for a radical change is that the advance of
accelerators like GPU or FPGA means that the CPU is no longer the only
place where computation happens. It is becoming more and more common
for an application to mix and match different accelerators to perform
its computation. So we can no longer satisfy ourselves with a CPU
centric and flat view of a system like NUMA and NUMA distance.


This patchset is a proposal to tackle these problems through three
aspects:
    1 - Expose complex system topology and the various kinds of memory
        to user space so that applications have a standard way and a
        single place to get all the information they care about.
    2 - A new API for user space to bind/provide hints to the kernel on
        which memory to use for a range of virtual addresses (a new
        mbind()-like syscall).
    3 - Kernel side changes for vm policy to handle these changes.

This patchset is not an end to end solution but it provides enough
pieces to be useful against nouveau (upstream open source driver for
NVidia GPU). It is intended as a starting point for discussion so
that we can figure out what to do. To avoid having too many topics
to discuss i am not considering memory cgroup for now but it is
definitely something we will want to integrate with.

The rest of this email is split in 3 sections. The first section
talks about complex system topology: what it is, how it is used today
and how to describe it tomorrow. The second section talks about a
new API to bind/provide hints to the kernel for a range of virtual
addresses. The third section talks about a new mechanism to track
binds/hints provided by user space or device drivers inside the kernel.


1) Complex system topology and representing them
------------------------------------------------

Inside a node you can have a complex topology of memory, for instance
you can have multiple HBM memories in a node, each HBM memory tied to a
set of CPUs (all of which are in the same node). This means that you
have a hierarchy of memory for CPUs: the local fast HBM, which is
expected to be relatively small compared to main memory, and then the
main memory. New memory technology might also deepen this hierarchy
with another level of yet slower memory but gigantic in size (some
persistent memory technology might fall into that category). Another
example is device memory, and devices themselves can have a hierarchy
like HBM on top of the device core and main device memory.

On top of that you can have multiple paths to access each memory and
each path can have different properties (latency, bandwidth, ...).
Also there is not always symmetry, ie some memory might only be
accessible by some devices or CPUs and not by everyone.

So a flat hierarchy for each node is not capable of representing this
kind of complexity. To simplify discussion and because we do not want
to single out CPU from device, from here on out we will use initiator
to refer to either CPU or device. An initiator is any kind of CPU or
device that can access memory (ie initiate memory access).

At this point an example of such a system might help:
    - 2 nodes and for each node:
        - 1 CPU per node with 2 complexes of CPU cores per CPU
        - one HBM memory for each complex of CPU cores (200GB/s)
        - CPU core complexes are linked to each other (100GB/s)
        - main memory (90GB/s)
        - 4 GPUs each with:
            - HBM memory for each GPU (1000GB/s) (not CPU accessible)
            - GDDR memory for each GPU (500GB/s) (CPU accessible)
            - connected to CPU root controller (60GB/s)
            - connected to other GPUs (even GPUs from the second
              node) with GPU link (400GB/s)

In this example we restrict ourselves to bandwidth and ignore bus width
or latency. This is just to simplify the discussion but obviously they
also factor in.


Userspace would very much like to know about this information, for
instance HPC folks have developed complex libraries to manage this and
there is wide research on the topic [2] [3] [4] [5]. Today most of
the work is done by hardcoding things for a specific platform, which is
somewhat acceptable for HPC folks where the platform stays the same
for a long period of time. But if we want more ubiquitous support
we should aim to provide the information needed through a standard
kernel API such as the one presented in this patchset.

Roughly speaking i see two broad use cases for topology information.
The first is for virtualization and vm where you want to segment your
hardware properly for each vm (binding memory, CPU and GPU that are
all close to each other). The second is for applications, many of
which can partition their workload to minimize exchange between
partitions, allowing each partition to be bound to a subset of devices
and CPUs that are close to each other (for maximum locality). Here it
is much more than just NUMA distance, you can leverage the memory
hierarchy and the system topology all together (see [2] [3] [4] [5]
for more references and details).

So this is not exposing topology just for the sake of cool graphs in
userspace. There are active users of such information today and if we
want to grow and broaden the usage we should provide a unified API
to standardize how that information is accessible to everyone.


One proposal so far to handle new types of memory is to use CPU less
nodes for those [6]. While the same idea can apply to device memory, it
is still hard to describe multiple paths with different properties in
such a scheme. While it is backward compatible and needs minimal
changes, it simply can not convey complex topology (think any kind of
random graph, not just a tree like graph).

Thus far this kind of system has been used through device specific APIs
and relies on all kinds of system specific quirks. To avoid this getting
out of hand and growing into a bigger mess than it already is, this
patchset tries to provide a common generic API that should fit various
devices (GPU, FPGA, ...).

So this patchset proposes a new way to expose the system topology to
userspace. It relies on 4 types of objects:
    - target: any kind of memory (main memory, HBM, device, ...)
    - initiator: CPU or device (anything that can access memory)
    - link: anything that links initiators and targets
    - bridge: anything that allows a group of initiators to access
      remote targets (ie targets they are not connected with directly
      through a link)

Properties like bandwidth, latency, ... are all set per bridge and per
link. All initiators connected to a link can access any target memory
also connected to the same link, all with the same link properties.

Links do not need to match physical hardware, ie you can have a single
physical link match one or multiple software exposed links. This
allows modeling devices connected to the same physical link (like PCIE
for instance) but not with the same characteristics (like number of
lanes or lane speed in PCIE). The reverse is also true, ie having a
single software exposed link match multiple physical links.

Bridges allow initiators to access remote links. A bridge connects two
links to each other and is also specific to a list of initiators (ie
not all initiators connected to each of the links can use the bridge).
Bridges have their own properties (bandwidth, latency, ...) so that
the actual value for each property is the lowest common denominator
between the bridge and each of the links.
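
To make the "lowest common denominator" rule concrete for bandwidth,
here is a minimal user space sketch. The struct and function names are
made up for illustration and are not part of the patchset. It assumes
the rule means "take the minimum along the path", which is the natural
reading for bandwidth; how latency would combine is not spelled out
above, so it is left out.

```c
#include <assert.h>

/* Hypothetical property set, for illustration only (not from the patchset). */
struct path_props {
	unsigned long bandwidth_gbs;	/* GB/s, higher is better */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Effective bandwidth for an initiator sitting on link `src` that reaches
 * a target on link `dst` through `bridge`: the slowest element on the
 * path bounds the whole path.
 */
unsigned long path_bandwidth(const struct path_props *src,
			     const struct path_props *bridge,
			     const struct path_props *dst)
{
	assert(src && bridge && dst);
	return min_ul(min_ul(src->bandwidth_gbs, bridge->bandwidth_gbs),
		      dst->bandwidth_gbs);
}
```

With the numbers from the example system above, a CPU going from its
main memory link (90GB/s) through the root controller (60GB/s) to GPU
GDDR memory (500GB/s) would be bounded by the 60GB/s hop.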


This model allows describing any kind of directed graph and thus
any kind of topology we might see in the future. It is also easy
to add new properties to each object type.

Moreover it can be used to expose devices capable of doing peer to peer
between them. For that simply have all devices capable of peer to
peer share a common link, or use the bridge object if the peer to
peer capability is only one way for instance.


This patchset uses the above scheme to expose system topology through
sysfs under /sys/bus/hms/ with:
    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
      each has a UID and you can find the usual values in that folder
      (node id, size, ...)

    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
      (CPU or device), each has a HMS UID but also a CPU id for CPU
      (which matches the CPU id in /sys/bus/cpu/). For a device you
      have a path that can be the PCIE BUS ID for instance.

    - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
      UID and a file per property (bandwidth, latency, ...) you also
      find a symlink to every target and initiator connected to that
      link.

    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
      a UID and a file per property (bandwidth, latency, ...) you
      also find a symlink to all initiators that can use that bridge.
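
As an illustration of how user space could consume these directory
names, here is a small sketch that splits a name of the form
v%version-%id-<type> into its parts. The enum, struct and function
names are hypothetical and only the naming scheme itself comes from
the list above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user space mirror of the four object types. */
enum hms_kind {
	HMS_KIND_TARGET,
	HMS_KIND_INITIATOR,
	HMS_KIND_LINK,
	HMS_KIND_BRIDGE,
};

struct hms_name {
	unsigned version;
	unsigned uid;
	enum hms_kind kind;
};

/*
 * Parse a directory name such as "v1-12-target".
 * Returns 0 on success and -1 if the name does not follow the scheme
 * (in which case *out may be partially written and must be ignored).
 */
int hms_parse_name(const char *name, struct hms_name *out)
{
	/* Order must match enum hms_kind. */
	static const char *const kinds[] = {
		"target", "initiator", "link", "bridge",
	};
	char type[16];
	size_t i;

	assert(name && out);
	if (sscanf(name, "v%u-%u-%15s", &out->version, &out->uid, type) != 3)
		return -1;

	for (i = 0; i < sizeof(kinds) / sizeof(kinds[0]); i++) {
		if (strcmp(type, kinds[i]) == 0) {
			out->kind = (enum hms_kind)i;
			return 0;
		}
	}
	return -1;
}
```

A caller would typically run this over the entries returned by
readdir() on /sys/bus/hms/devices/ and skip anything that fails to
parse.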

To help with forward compatibility each object has a version value and
it is mandatory for user space to only use targets or initiators with a
version supported by the user space. For instance if user space only
knows about what version 1 means and sees a target with version 2 then
the user space must ignore that target as if it does not exist.

Mandating that allows the addition of new properties that break
backward compatibility, ie user space must know how this new property
affects the object to be able to use it safely.
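
A sketch of what that rule could look like on the user space side,
assuming a program that only understands version 1. The struct and
helper names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory record for an object read from sysfs. */
struct hms_obj {
	unsigned version;
	unsigned uid;
};

/* The set of versions this (imaginary) program knows how to interpret. */
static int hms_version_supported(unsigned version)
{
	switch (version) {
	case 1:
		return 1;
	default:
		return 0;
	}
}

/*
 * Compact `objs` in place so that only objects with an understood
 * version remain, preserving order. Returns the new element count.
 * Objects with an unknown version are treated as if they did not exist.
 */
size_t hms_drop_unknown_versions(struct hms_obj *objs, size_t n)
{
	size_t i, kept = 0;

	assert(objs || n == 0);
	for (i = 0; i < n; i++) {
		if (hms_version_supported(objs[i].version))
			objs[kept++] = objs[i];
	}
	return kept;
}
```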

This patchset exposes the main memory of each node under a common
target. For now device drivers are responsible for registering the
memory they want to expose through that scheme but in the future that
information might come from the system firmware (this is a different
discussion).



2) hbind() bind range of virtual address to heterogeneous memory
----------------------------------------------------------------

With this new topology description the mbind() API is too limited to
handle which memory to pick. This is why this patchset introduces a new
API: hbind() for heterogeneous bind. The hbind() API allows binding any
kind of target memory (using the HMS target uid), this can be any memory
exposed through HMS ie main memory, HBM, device memory ...

So instead of using a bitmap, hbind() takes an array of uids and each
uid is a unique memory target inside the new memory topology
description. User space also provides an array of modifiers. This
patchset only defines some modifiers. Modifiers can be seen as the flags
parameter of mbind() but here we use an array so that user space can
not only supply a modifier but also a value with it. This should allow
the API to grow more features in the future. The kernel should return
-EINVAL if it is provided with an unknown modifier and just ignore the
call altogether, forcing user space to restrict itself to modifiers
supported by the kernel it is running on (i know i am dreaming about
well behaved user space).
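
The "reject the whole call on any unknown modifier" behavior can be
sketched as below. This is not the real uAPI: the actual encoding
lives in include/uapi/linux/hbind.h of the series and is not
reproduced here. The struct layout and the modifier ids are
placeholders chosen only to show the validation order (check
everything first, act only if all modifiers are known).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up modifier ids, purely illustrative. */
enum {
	EXAMPLE_MOD_MIGRATE = 1,
	EXAMPLE_MOD_STRICT = 2,
};

/* Hypothetical (modifier, value) pair as described in the text above. */
struct example_modifier {
	uint32_t id;
	uint64_t value;
};

static int example_modifier_known(uint32_t id)
{
	return id == EXAMPLE_MOD_MIGRATE || id == EXAMPLE_MOD_STRICT;
}

/*
 * Validate every modifier before doing any work. A single unknown
 * modifier makes the whole request fail with -EINVAL so that nothing
 * is partially applied.
 */
int example_check_modifiers(const struct example_modifier *mods, size_t n)
{
	size_t i;

	assert(mods || n == 0);
	for (i = 0; i < n; i++) {
		if (!example_modifier_known(mods[i].id))
			return -EINVAL;
	}
	return 0;
}
```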


Note that none of this is exclusive of automatic memory placement like
autonuma. I also believe that we will see something similar to autonuma
for device memory. This patchset is just there to provide a new API for
processes that wish to have fine control over their memory placement
because a process should know better than the kernel where to place
things.

This patchset also adds the necessary bits to the nouveau open source
driver for it to expose its memory and to allow a process to bind some
range to the GPU memory. Note that on x86 the GPU memory is not
accessible by the CPU because PCIE does not allow cache coherent access
to device memory. Thus when using PCIE device memory on x86 it is mapped
as swapped out from the CPU POV and any CPU access will trigger a
migration back to main memory (this is all part of HMM and nouveau, not
of this patchset).

This is all done under staging so that we can experiment with the
userspace API for a while before committing to anything. Getting this
right is hard and it might not happen on the first try so instead of
having to support an API forever i would rather have it live behind
staging for people to experiment with and once we feel confident we
have something we can live with then convert it to a syscall.


3) Tracking and applying heterogeneous memory policies
------------------------------------------------------

The current memory policy infrastructure is node oriented. Instead of
changing that and risking breakage and regressions this patchset adds a
new heterogeneous policy tracking infrastructure. The expectation is
that existing applications can keep using mbind() with all the existing
infrastructure undisturbed and unaffected, while new applications
will use the new API and should avoid mixing and matching both (as they
can achieve the same thing with the new API).

Also the policy is not directly tied to the vma structure for a few
reasons:
    - avoid having to split vma for policies that do not cover a full vma
    - avoid changing too much vma code
    - avoid growing the vma structure with an extra pointer
So instead this patchset uses the mmu_notifier API to track vma liveness
(munmap(), mremap(), ...).
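
Since policies are tracked by address range rather than by vma, the
core operation when a notifier fires is a range overlap check. The
sketch below is a plain user space model of that idea, not the code
from this series: it assumes half-open [start, end) ranges kept in a
flat array and assumes that any policy touching an invalidated range
is simply dropped. Whether the real code drops or trims such policies
is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy record covering the half-open range [start, end). */
struct policy_range {
	uintptr_t start;
	uintptr_t end;
};

static int ranges_overlap(const struct policy_range *p,
			  uintptr_t start, uintptr_t end)
{
	return p->start < end && start < p->end;
}

/*
 * Model of an invalidation (for instance after munmap()): remove every
 * policy that overlaps [start, end), compacting the array in place.
 * Returns the number of policies left.
 */
size_t policies_drop_overlapping(struct policy_range *p, size_t n,
				 uintptr_t start, uintptr_t end)
{
	size_t i, kept = 0;

	assert(p || n == 0);
	for (i = 0; i < n; i++) {
		if (!ranges_overlap(&p[i], start, end))
			p[kept++] = p[i];
	}
	return kept;
}
```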

This patchset is not tied to process memory allocation either (as said
at the beginning this is not an end to end patchset but a starting
point). It does however demonstrate how migration to device memory can
work under this scheme (using nouveau as a demonstration vehicle).

The overall design is simple: on an hbind() call a hms policy structure
is created for the supplied range and hms uses the callback associated
with the target memory. This callback is provided by the device driver
for device memory or by core HMS for regular main memory. The callback
can decide to migrate the range to the target memories or do nothing
(this can be influenced by flags provided to hbind() too).
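
The flow described above (record a policy for the range, then hand it
to the callback that belongs to the target) can be modeled in a few
lines. Everything below is a simplified user space sketch with
invented names, not the kernel structures from the series. The sample
callback only counts 4K pages and stands in for whatever a driver
would really do (migrate or nothing).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct example_policy;

/* Hypothetical per-target callback, returns 0 on success. */
typedef int (*example_bind_fn)(const struct example_policy *policy,
			       void *priv);

struct example_target_ops {
	example_bind_fn bind;	/* from the driver, or core code for main memory */
	void *priv;		/* opaque data owned by whoever set `bind` */
};

struct example_policy {
	uintptr_t start;	/* inclusive */
	uintptr_t end;		/* exclusive */
	const struct example_target_ops *target;
};

/* Fill in the policy for [start, end) and let the target act on it. */
int example_policy_apply(struct example_policy *policy,
			 uintptr_t start, uintptr_t end,
			 const struct example_target_ops *target)
{
	assert(policy);
	if (start >= end || !target || !target->bind)
		return -1;

	policy->start = start;
	policy->end = end;
	policy->target = target;
	return target->bind(policy, target->priv);
}

/* Stand-in callback: record how many 4K pages the policy covers. */
int example_count_pages(const struct example_policy *policy, void *priv)
{
	unsigned long *pages = priv;

	*pages = (unsigned long)((policy->end - policy->start) / 4096);
	return 0;
}
```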


Later patches can tie page fault with HMS policy to direct memory
allocation to the right target. For now i would rather postpone that
discussion until a consensus is reached on how to move forward on all
the topics presented in this email. Start small, grow big ;)

Cheers,
Jérôme Glisse

https://cgit.freedesktop.org/~glisse/linux/log/?h=hms-hbind-v01
git://people.freedesktop.org/~glisse/linux hms-hbind-v01


[1] https://lkml.org/lkml/2018/11/15/331
[2] https://arxiv.org/pdf/1704.08273.pdf
[3] https://csmd.ornl.gov/highlight/sharp-unified-memory-allocator-intent-based-memory-allocator-extreme-scale-systems
[4] https://cfwebprod.sandia.gov/cfdocs/CompResearch/docs/Trott-white-paper.pdf
    http://cacs.usc.edu/education/cs653/Edwards-Kokkos-JPDC14.pdf
[5] https://github.com/LLNL/Umpire
    https://umpire.readthedocs.io/en/develop/
[6] https://www.spinics.net/lists/hotplug/msg06171.html

Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: linux-acpi@vger.kernel.org

Jérôme Glisse (14):
  mm/hms: heterogeneous memory system (sysfs infrastructure)
  mm/hms: heterogenenous memory system (HMS) documentation
  mm/hms: add target memory to heterogeneous memory system
    infrastructure
  mm/hms: add initiator to heterogeneous memory system infrastructure
  mm/hms: add link to heterogeneous memory system infrastructure
  mm/hms: add bridge to heterogeneous memory system infrastructure
  mm/hms: register main memory with heterogenenous memory system
  mm/hms: register main CPUs with heterogenenous memory system
  mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS)
  mm/hbind: add heterogeneous memory policy tracking infrastructure
  mm/hbind: add bind command to heterogeneous memory policy
  mm/hbind: add migrate command to hbind() ioctl
  drm/nouveau: register GPU under heterogeneous memory system
  test/hms: tests for heterogeneous memory system

 Documentation/vm/hms.rst                      | 252 ++++++++
 drivers/base/Kconfig                          |  14 +
 drivers/base/Makefile                         |   1 +
 drivers/base/cpu.c                            |   5 +
 drivers/base/hms-bridge.c                     | 197 +++++++
 drivers/base/hms-initiator.c                  | 141 +++++
 drivers/base/hms-link.c                       | 183 ++++++
 drivers/base/hms-target.c                     | 193 +++++++
 drivers/base/hms.c                            | 199 +++++++
 drivers/base/init.c                           |   2 +
 drivers/base/node.c                           |  83 ++-
 drivers/gpu/drm/nouveau/Kbuild                |   1 +
 drivers/gpu/drm/nouveau/nouveau_hms.c         |  80 +++
 drivers/gpu/drm/nouveau/nouveau_hms.h         |  46 ++
 drivers/gpu/drm/nouveau/nouveau_svm.c         |   6 +
 include/linux/cpu.h                           |   4 +
 include/linux/hms.h                           | 219 +++++++
 include/linux/mm_types.h                      |   6 +
 include/linux/node.h                          |   6 +
 include/uapi/linux/hbind.h                    |  73 +++
 kernel/fork.c                                 |   3 +
 mm/Makefile                                   |   1 +
 mm/hms.c                                      | 545 ++++++++++++++++++
 tools/testing/hms/Makefile                    |  17 +
 tools/testing/hms/hbind-create-device-file.sh |  11 +
 tools/testing/hms/test-hms-migrate.c          |  77 +++
 tools/testing/hms/test-hms.c                  | 237 ++++++++
 tools/testing/hms/test-hms.h                  |  67 +++
 28 files changed, 2667 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/vm/hms.rst
 create mode 100644 drivers/base/hms-bridge.c
 create mode 100644 drivers/base/hms-initiator.c
 create mode 100644 drivers/base/hms-link.c
 create mode 100644 drivers/base/hms-target.c
 create mode 100644 drivers/base/hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.h
 create mode 100644 include/linux/hms.h
 create mode 100644 include/uapi/linux/hbind.h
 create mode 100644 mm/hms.c
 create mode 100644 tools/testing/hms/Makefile
 create mode 100755 tools/testing/hms/hbind-create-device-file.sh
 create mode 100644 tools/testing/hms/test-hms-migrate.c
 create mode 100644 tools/testing/hms/test-hms.c
 create mode 100644 tools/testing/hms/test-hms.h

-- 
2.17.2



* [RFC PATCH 01/14] mm/hms: heterogeneous memory system (sysfs infrastructure)
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
@ 2018-12-03 23:34 ` jglisse
  2018-12-03 23:34 ` [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation jglisse
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

Systems with complex memory topology need a more versatile memory
topology description than just nodes, where a node is a collection of
memory and CPU. In a heterogeneous memory system we consider four
types of object:
      - target: which is any kind of memory
      - initiator: any kind of device or CPU
      - link: any kind of link that connects targets and initiators
      - bridge: a bridge between two links (for some initiators)

Properties (like bandwidth, latency, bus width, ...) are defined per
bridge and per link. Properties of a link apply to all initiators which
are connected to that link.

Not all initiators are connected to all links thus not all initiators
can access all target memory (this applies to CPUs too ie some CPUs
might not be able to access all target memory).

Bridges allow initiators (that can use the bridge) to access targets
to which they do not have a direct link.
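
To illustrate the access rule, here is a small stand-alone model
(not code from this patch). It uses bitmasks indexed by small made-up
ids and makes two simplifying assumptions: a bridge works in both
directions for the initiators allowed to use it, and only a single
bridge hop is considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model: bit N set means initiator/target N is on the link. */
struct topo_link {
	uint32_t initiators;
	uint32_t targets;
};

struct topo_bridge {
	size_t link_a;		/* index into topo.links */
	size_t link_b;		/* index into topo.links */
	uint32_t initiators;	/* initiators allowed to use this bridge */
};

struct topo {
	const struct topo_link *links;
	size_t nlinks;
	const struct topo_bridge *bridges;
	size_t nbridges;
};

/* True if `initiator` reaches `target` directly or over one bridge. */
bool topo_can_access(const struct topo *t, unsigned initiator,
		     unsigned target)
{
	uint32_t ini, tgt;
	size_t i;

	assert(t && initiator < 32 && target < 32);
	ini = UINT32_C(1) << initiator;
	tgt = UINT32_C(1) << target;

	/* Direct access: both sit on the same link. */
	for (i = 0; i < t->nlinks; i++) {
		if ((t->links[i].initiators & ini) &&
		    (t->links[i].targets & tgt))
			return true;
	}

	/* One hop: initiator on one side of a usable bridge, target on the other. */
	for (i = 0; i < t->nbridges; i++) {
		const struct topo_bridge *b = &t->bridges[i];
		const struct topo_link *a, *c;

		if (!(b->initiators & ini))
			continue;
		a = &t->links[b->link_a];
		c = &t->links[b->link_b];
		if ((a->initiators & ini) && (c->targets & tgt))
			return true;
		if ((c->initiators & ini) && (a->targets & tgt))
			return true;
	}
	return false;
}
```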

Through these four types of object we can describe any kind of system
memory topology. To expose this to userspace we expose a new sysfs
hierarchy (that co-exists with the existing one):
  - /sys/bus/hms/target/ all targets in the system
  - /sys/bus/hms/initiator all initiators in the system
  - /sys/bus/hms/interconnect all inter-connects in the system
  - /sys/bus/hms/bridge all bridges in the system

Inside each link or bridge directory there are symlinks to the targets
and initiators that are connected to that bridge or link. Properties are
defined inside the link and bridge directories.

This patch only introduces the core HMS infrastructure, each object type
is added by an individual patch.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/hms.rst |  35 +++++++
 drivers/base/Kconfig     |  14 +++
 drivers/base/Makefile    |   1 +
 drivers/base/hms.c       | 199 +++++++++++++++++++++++++++++++++++++++
 drivers/base/init.c      |   2 +
 include/linux/hms.h      |  72 ++++++++++++++
 6 files changed, 323 insertions(+)
 create mode 100644 Documentation/vm/hms.rst
 create mode 100644 drivers/base/hms.c
 create mode 100644 include/linux/hms.h

diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
new file mode 100644
index 000000000000..dbf0f71918a9
--- /dev/null
+++ b/Documentation/vm/hms.rst
@@ -0,0 +1,35 @@
+.. hms:
+
+=================================
+Heterogeneous Memory System (HMS)
+=================================
+
+Systems with complex memory topology need a more versatile memory topology
+description than just nodes, where a node is a collection of memory and CPU.
+In a heterogeneous memory system we consider four types of object::
+   - target: which is any kind of memory
+   - initiator: any kind of device or CPU
+   - inter-connect: any kind of links that connects target and initiator
+   - bridge: a link between two inter-connects
+
+Properties (like bandwidth, latency, bus width, ...) are defined per bridge
+and per inter-connect. Properties of an inter-connect apply to all initiators
+which are linked to that inter-connect. Not all initiators are linked to all
+inter-connects and thus not all initiators can access all memory (this applies
+to CPUs too ie some CPUs might not be able to access all memory).
+
+Bridges allow initiators (that can use the bridge) to access targets to
+which they do not have a direct link (ie they do not share a common
+inter-connect with the target).
+
+Through these four types of object we can describe any kind of system memory
+topology. To expose this to userspace we expose a new sysfs hierarchy (that
+co-exists with the existing one)::
+   - /sys/bus/hms/target* all targets in the system
+   - /sys/bus/hms/initiator* all initiators in the system
+   - /sys/bus/hms/interconnect* all inter-connects in the system
+   - /sys/bus/hms/bridge* all bridges in the system
+
+Inside each bridge or inter-connect directory there are symlinks to targets
+and initiators that are linked to that bridge or inter-connect. Properties
+are defined inside the bridge and inter-connect directories.
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..d46a7d47f316 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -276,4 +276,18 @@ config GENERIC_ARCH_TOPOLOGY
 	  appropriate scaling, sysfs interface for changing capacity values at
 	  runtime.
 
+config HMS
+	bool "Heterogeneous memory system"
+	depends on STAGING
+	default n
+	help
+	  THIS IS AN EXPERIMENTAL API DO NOT RELY ON IT ! IT IS UNSTABLE !
+	
+	  Select HMS if you want to expose heterogeneous memory system to user
+	  space. This will expose a new directory under /sys/class/bus/hms that
+	  provide a description of heterogeneous memory system.
+	
+	  See Documentations/vm/hms.rst for further informations.
+
+
 endmenu
diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 704f44295810..92ebfacbf0dc 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -12,6 +12,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
+obj-$(CONFIG_HMS)	+= hms.o
 obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
diff --git a/drivers/base/hms.c b/drivers/base/hms.c
new file mode 100644
index 000000000000..a145f00a3683
--- /dev/null
+++ b/drivers/base/hms.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#include <linux/capability.h>
+#include <linux/topology.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/hms.h>
+
+
+#define HMS_CLASS_NAME "hms"
+
+static DEFINE_MUTEX(hms_sysfs_mutex);
+
+static struct bus_type hms_subsys = {
+	.name = HMS_CLASS_NAME,
+	.dev_name = NULL,
+};
+
+void hms_object_release(struct hms_object *object)
+{
+	put_device(object->parent);
+}
+
+int hms_object_init(struct hms_object *object, struct device *parent,
+		    enum hms_type type, unsigned version,
+		    void (*device_release)(struct device *device),
+		    const struct attribute_group **device_group)
+{
+	static unsigned uid = 0;
+	int ret;
+
+	mutex_lock(&hms_sysfs_mutex);
+
+	/*
+	 * For now assume we are not going to have more that (2^31)-1 objects
+	 * in a system.
+	 *
+	 * FIXME use something little less naive ...
+	 */
+	object->uid = uid++;
+
+	switch (type) {
+	case HMS_TARGET:
+		dev_set_name(&object->device, "v%u-%u-target",
+			     version, object->uid);
+		break;
+	case HMS_BRIDGE:
+		dev_set_name(&object->device, "v%u-%u-bridge",
+			     version, object->uid);
+		break;
+	case HMS_INITIATOR:
+		dev_set_name(&object->device, "v%u-%u-initiator",
+			     version, object->uid);
+		break;
+	case HMS_LINK:
+		dev_set_name(&object->device, "v%u-%u-link",
+			     version, object->uid);
+		break;
+	default:
+		mutex_unlock(&hms_sysfs_mutex);
+		return -EINVAL;
+	}
+
+	object->type = type;
+	object->version = version;
+	object->device.id = object->uid;
+	object->device.bus = &hms_subsys;
+	object->device.groups = device_group;
+	object->device.release = device_release;
+
+	ret = device_register(&object->device);
+	if (ret)
+		put_device(&object->device);
+	mutex_unlock(&hms_sysfs_mutex);
+
+	if (!ret && parent) {
+		object->parent = parent;
+		get_device(parent);
+
+		sysfs_create_link(&object->device.kobj, &parent->kobj,
+				  kobject_name(&parent->kobj));
+	}
+
+	return ret;
+}
+
+int hms_object_link(struct hms_object *objecta,
+		    struct hms_object *objectb)
+{
+	int ret;
+
+	ret = sysfs_create_link(&objecta->device.kobj,
+				&objectb->device.kobj,
+				kobject_name(&objectb->device.kobj));
+	if (ret)
+		return ret;
+	ret = sysfs_create_link(&objectb->device.kobj,
+				&objecta->device.kobj,
+				kobject_name(&objecta->device.kobj));
+	if (ret) {
+		sysfs_remove_link(&objecta->device.kobj,
+				  kobject_name(&objectb->device.kobj));
+		return ret;
+	}
+
+	return 0;
+}
+
+void hms_object_unlink(struct hms_object *objecta,
+		       struct hms_object *objectb)
+{
+	sysfs_remove_link(&objecta->device.kobj,
+			  kobject_name(&objectb->device.kobj));
+	sysfs_remove_link(&objectb->device.kobj,
+			  kobject_name(&objecta->device.kobj));
+}
+
+struct hms_object *hms_object_get(struct hms_object *object)
+{
+	if (object == NULL)
+		return NULL;
+
+	get_device(&object->device);
+	return object;
+}
+
+void hms_object_put(struct hms_object *object)
+{
+	put_device(object ? &object->device : NULL);
+}
+
+void hms_object_unregister(struct hms_object *object)
+{
+	mutex_lock(&hms_sysfs_mutex);
+	device_unregister(&object->device);
+	mutex_unlock(&hms_sysfs_mutex);
+}
+
+struct hms_object *hms_object_find_locked(unsigned uid)
+{
+	struct device *device;
+
+	device = subsys_find_device_by_id(&hms_subsys, uid, NULL);
+	return device ? to_hms_object(device) : NULL;
+}
+
+struct hms_object *hms_object_find(unsigned uid)
+{
+	struct hms_object *object;
+
+	mutex_lock(&hms_sysfs_mutex);
+	object = hms_object_find_locked(uid);
+	mutex_unlock(&hms_sysfs_mutex);
+	return object;
+}
+
+
+static struct attribute *hms_root_attrs[] = {
+	NULL
+};
+
+static struct attribute_group hms_root_attr_group = {
+	.attrs = hms_root_attrs,
+};
+
+static const struct attribute_group *hms_root_attr_groups[] = {
+	&hms_root_attr_group,
+	NULL,
+};
+
+int __init hms_init(void)
+{
+	int ret;
+
+	ret = subsys_system_register(&hms_subsys, hms_root_attr_groups);
+	if (ret)
+		pr_err("%s() failed: %d\n", __func__, ret);
+
+	return ret;
+}
diff --git a/drivers/base/init.c b/drivers/base/init.c
index 908e6520e804..3b40d5899d66 100644
--- a/drivers/base/init.c
+++ b/drivers/base/init.c
@@ -8,6 +8,7 @@
 #include <linux/init.h>
 #include <linux/memory.h>
 #include <linux/of.h>
+#include <linux/hms.h>
 
 #include "base.h"
 
@@ -34,5 +35,6 @@ void __init driver_init(void)
 	platform_bus_init();
 	cpu_dev_init();
 	memory_dev_init();
+	hms_init();
 	container_dev_init();
 }
diff --git a/include/linux/hms.h b/include/linux/hms.h
new file mode 100644
index 000000000000..1ab288df0158
--- /dev/null
+++ b/include/linux/hms.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#ifndef HMS_H
+#define HMS_H
+#if IS_ENABLED(CONFIG_HMS)
+
+
+#include <linux/device.h>
+
+
+#define to_hms_object(device) container_of(device, struct hms_object, device)
+
+enum hms_type {
+	HMS_BRIDGE,
+	HMS_INITIATOR,
+	HMS_LINK,
+	HMS_TARGET,
+};
+
+struct hms_object {
+	struct device *parent;
+	struct device device;
+	enum hms_type type;
+	unsigned version;
+	unsigned uid;
+};
+
+void hms_object_release(struct hms_object *object);
+int hms_object_init(struct hms_object *object, struct device *parent,
+		    enum hms_type type, unsigned version,
+		    void (*device_release)(struct device *device),
+		    const struct attribute_group **device_group);
+int hms_object_link(struct hms_object *objecta,
+		    struct hms_object *objectb);
+void hms_object_unlink(struct hms_object *objecta,
+		       struct hms_object *objectb);
+struct hms_object *hms_object_get(struct hms_object *object);
+void hms_object_put(struct hms_object *object);
+void hms_object_unregister(struct hms_object *object);
+struct hms_object *hms_object_find_locked(unsigned uid);
+struct hms_object *hms_object_find(unsigned uid);
+
+
+int hms_init(void);
+
+
+#else /* IS_ENABLED(CONFIG_HMS) */
+
+
+static inline int hms_init(void)
+{
+	return 0;
+}
+
+
+#endif /* IS_ENABLED(CONFIG_HMS) */
+#endif /* HMS_H */
-- 
2.17.2



* [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
  2018-12-03 23:34 ` [RFC PATCH 01/14] mm/hms: heterogeneous memory system (sysfs infrastructure) jglisse
@ 2018-12-03 23:34 ` jglisse
  2018-12-04 17:06   ` Andi Kleen
  2018-12-05 10:52   ` Mike Rapoport
  2018-12-03 23:34 ` [RFC PATCH 03/14] mm/hms: add target memory to heterogeneous memory system infrastructure jglisse
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

Add documentation explaining what HMS is and what it is for (see patch content).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
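Not part of the patch: a minimal userspace sketch of how a tool might
consume the naming scheme documented below. It parses a directory name of
the form v<version>-<uid>-<type> as found under /sys/bus/hms/devices/ and
applies the "ignore unknown versions" rule. The names hms_parse_name(),
hms_name_usable() and HMS_SUPPORTED_VERSION are hypothetical, and the
assumption that the tool only understands version 1 is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* The only object version this hypothetical tool understands (assumption). */
#define HMS_SUPPORTED_VERSION 1u

struct hms_name {
	unsigned version;
	unsigned uid;
	char type[16];	/* "target", "initiator", "link" or "bridge" */
};

/*
 * Parse a directory name of the form "v<version>-<uid>-<type>".
 * Returns true if the name parses and ends in a known type suffix.
 */
static bool hms_parse_name(const char *name, struct hms_name *out)
{
	int consumed = 0;

	assert(name != NULL && out != NULL);
	if (sscanf(name, "v%u-%u-%15[a-z]%n", &out->version, &out->uid,
		   out->type, &consumed) != 3)
		return false;
	/* Reject trailing characters after the type suffix. */
	if (name[consumed] != '\0')
		return false;
	return strcmp(out->type, "target") == 0 ||
	       strcmp(out->type, "initiator") == 0 ||
	       strcmp(out->type, "link") == 0 ||
	       strcmp(out->type, "bridge") == 0;
}

/* An object with an unknown version must be treated as if it did not exist. */
static bool hms_name_usable(const struct hms_name *n)
{
	return n->version == HMS_SUPPORTED_VERSION;
}
```

A tool would call hms_parse_name() on each directory entry and skip every
entry for which hms_name_usable() returns false.
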
 Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 246 insertions(+), 29 deletions(-)

diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
index dbf0f71918a9..bd7c9e8e7077 100644
--- a/Documentation/vm/hms.rst
+++ b/Documentation/vm/hms.rst
@@ -4,32 +4,249 @@
 Heterogeneous Memory System (HMS)
 =================================
 
-System with complex memory topology needs a more versatile memory topology
-description than just node where a node is a collection of memory and CPU.
-In heterogeneous memory system we consider four types of object::
-   - target: which is any kind of memory
-   - initiator: any kind of device or CPU
-   - inter-connect: any kind of links that connects target and initiator
-   - bridge: a link between two inter-connects
-
-Properties (like bandwidth, latency, bus width, ...) are define per bridge
-and per inter-connect. Property of an inter-connect apply to all initiators
-which are link to that inter-connect. Not all initiators are link to all
-inter-connect and thus not all initiators can access all memory (this apply
-to CPU too ie some CPU might not be able to access all memory).
-
-Bridges allow initiators (that can use the bridge) to access target for
-which they do not have a direct link with (ie they do not share a common
-inter-connect with the target).
-
-Through this four types of object we can describe any kind of system memory
-topology. To expose this to userspace we expose a new sysfs hierarchy (that
-co-exist with the existing one)::
-   - /sys/bus/hms/target* all targets in the system
-   - /sys/bus/hms/initiator* all initiators in the system
-   - /sys/bus/hms/interconnect* all inter-connects in the system
-   - /sys/bus/hms/bridge* all bridges in the system
-
-Inside each bridge or inter-connect directory they are symlinks to targets
-and initiators that are linked to that bridge or inter-connect. Properties
-are defined inside bridge and inter-connect directory.
+Heterogeneous memory systems are becoming more and more the norm. In
+those systems there is not only the main system memory for each node,
+but also device memory and/or a memory hierarchy to consider. Device
+memory can come from a device like a GPU, FPGA, ... or from a memory
+only device (persistent memory, or high density memory device).
+
+Memory hierarchy is when you not only have the main memory but also
+other types of memory like HBM (High Bandwidth Memory, often stacked
+on the CPU die or GPU die), persistent memory or high density memory
+(ie something slower than regular DDR DIMM but much bigger).
+
+On top of this diversity of memories you also have to account for the
+system bus topology, ie how all CPUs and devices are connected to each
+other. Userspace does not care about the exact physical topology but
+cares about topology from a behavior point of view, ie what are all the
+paths between an initiator (anything that can initiate memory access
+like CPU, GPU, FPGA, network controller ...) and a target memory and
+what are all the properties of each of those paths (bandwidth, latency,
+granularity, ...).
+
+This means that it is no longer sufficient to consider a flat view
+for each node in a system; for maximum performance we need to
+account for all of this new memory and also for the system topology.
+This is why this proposal is unlike the HMAT proposal [1] which
+tries to extend the existing NUMA for new types of memory. Here we
+are tackling a much more profound change that departs from NUMA.
+
+
+One of the reasons for this radical change is that the advance of
+accelerators like GPU or FPGA means that the CPU is no longer the only
+place where computation happens. It is becoming more and more common
+for an application to mix and match different accelerators to
+perform its computation. So we can no longer satisfy ourselves with
+a CPU centric and flat view of a system like NUMA and NUMA distance.
+
+
+HMS tackles these problems through three aspects:
+    1 - Expose complex system topology and various kinds of memory
+        to user space so that applications have a standard way and
+        single place to get all the information they care about.
+    2 - A new API for user space to bind/provide hints to the kernel
+        on which memory to use for a range of virtual addresses (a
+        new mbind()-like syscall, hbind()).
+    3 - Kernel side changes for vm policy to handle these changes
+
+
+The rest of this document is split in 3 sections. The first section
+talks about complex system topology: what it is, how it is used today
+and how to describe it tomorrow. The second section talks about the
+new API to bind/provide hints to the kernel for a range of virtual
+addresses. The third section talks about the new mechanism to track
+binds/hints provided by user space or device drivers inside the kernel.
+
+
+1) Complex system topology and representing them
+================================================
+
+Inside a node you can have a complex topology of memory, for instance
+you can have multiple HBM memories in a node, each HBM memory tied to a
+set of CPUs (all of which are in the same node). This means that you
+have a hierarchy of memory for CPUs: the local fast HBM, which is
+expected to be relatively small compared to main memory, and then the
+main memory. New memory technology might also deepen this hierarchy
+with another level of yet slower memory but gigantic in size (some
+persistent memory technology might fall into that category). Another
+example is device memory, and devices themselves can have a hierarchy
+like HBM on top of the device core and main device memory.
+
+On top of that you can have multiple paths to access each memory and
+each path can have different properties (latency, bandwidth, ...).
+Also there is not always symmetry, ie some memory might only be
+accessible by some devices or CPUs, ie not accessible by everyone.
+
+So a flat hierarchy for each node is not capable of representing this
+kind of complexity. To simplify discussion and because we do not want
+to single out CPU from device, from here on out we will use initiator
+to refer to either CPU or device. An initiator is any kind of CPU or
+device that can access memory (ie initiate memory access).
+
+At this point an example of such a system might help:
+    - 2 nodes and for each node:
+        - 1 CPU per node with 2 complexes of CPU cores per CPU
+        - one HBM memory for each complex of CPU cores (200GB/s)
+        - CPU core complexes are linked to each other (100GB/s)
+        - main memory (90GB/s)
+        - 4 GPUs each with:
+            - HBM memory for each GPU (1000GB/s) (not CPU accessible)
+            - GDDR memory for each GPU (500GB/s) (CPU accessible)
+            - connected to CPU root controller (60GB/s)
+            - connected to other GPUs (even GPUs from the second
+              node) with GPU link (400GB/s)
+
+In this example we restrict ourselves to bandwidth and ignore bus width
+or latency. This is just to simplify the discussion but obviously they
+also factor in.
+
+
+Userspace very much would like to know about this information, for
+instance HPC folks have developed complex libraries to manage this and
+there is wide research on the topic [2] [3] [4] [5]. Today most of
+the work is done by hardcoding things for a specific platform, which is
+somewhat acceptable for HPC folks where the platform stays the same
+for a long period of time.
+
+Roughly speaking I see two broad use cases for topology information.
+The first is virtualization and vm, where you want to segment your
+hardware properly for each vm (binding memory, CPU and GPU that are
+all close to each other). The second is applications, many of which
+can partition their workload to minimize exchange between partitions,
+allowing each partition to be bound to a subset of devices and CPUs
+that are close to each other (for maximum locality). Here it is much
+more than just NUMA distance, you can leverage the memory hierarchy
+and the system topology altogether (see [2] [3] [4] [5] for more
+references and details).
+
+So this is not exposing topology just for the sake of cool graphs in
+userspace. There are active users of such information today and if we
+want to grow and broaden the usage we should provide a unified API
+to standardize how that information is accessible to everyone.
+
+
+One proposal so far to handle new types of memory is to use CPU-less
+nodes for those [6]. While the same idea can apply to device memory, it
+is still hard to describe multiple paths with different properties in
+such a scheme. While it is backward compatible and has minimal changes,
+it simply can not convey complex topology (think any kind of random
+graph, not just a tree like graph).
+
+So HMS uses a new way to expose the system topology to userspace. It
+relies on 4 types of objects:
+    - target: any kind of memory (main memory, HBM, device, ...)
+    - initiator: CPU or device (anything that can access memory)
+    - link: anything that links initiators and targets
+    - bridge: anything that allows a group of initiators to access
+      remote targets (ie targets they are not connected with directly
+      through a link)
+
+Properties like bandwidth, latency, ... are all set per bridge and per
+link. All initiators connected to a link can access any target memory
+also connected to the same link, all with the same link properties.
+
+Links do not need to match physical hardware, ie you can have a single
+physical link match one or multiple software exposed links. This
+allows modeling devices connected to the same physical link (like PCIE
+for instance) but not with the same characteristics (like number of
+lanes or lane speed in PCIE). The reverse is also true, ie having a
+single software exposed link match multiple physical links.
+
+Bridges allow initiators to access remote links. A bridge connects two
+links to each other and is also specific to a list of initiators (ie
+not all initiators connected to each of the links can use the bridge).
+Bridges have their own properties (bandwidth, latency, ...) so that
+the actual value for each property is the most limiting value among
+the bridge and each of the links.
+
+
+This model allows describing any kind of directed graph and thus
+allows describing any kind of topology we might see in the future.
+It is also easier to add new properties to each object type.
+
+Moreover it can be used to expose devices capable of doing peer to peer
+between them. For that simply have all devices capable of peer to
+peer share a common link, or use the bridge object if the peer to
+peer capability is only one way for instance.
+
+
+HMS uses the above scheme to expose system topology through sysfs under
+/sys/bus/hms/ with:
+    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
+      each has a UID and you can find the usual values in that folder
+      (node id, size, ...)
+
+    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
+      (CPU or device), each has a HMS UID but also a CPU id for CPU
+      (which matches the CPU id in /sys/bus/cpu/). For a device you
+      have a path that can be the PCIE BUS ID for instance
+
+    - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
+      UID and a file per property (bandwidth, latency, ...). You also
+      find a symlink to every target and initiator connected to that
+      link.
+
+    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
+      a UID and a file per property (bandwidth, latency, ...). You
+      also find a symlink to all initiators that can use that bridge.
+
+To help with forward compatibility each object has a version value and
+it is mandatory for user space to only use targets or initiators with
+a version supported by the user space. For instance if user space only
+knows about what version 1 means and sees a target with version 2 then
+the user space must ignore that target as if it does not exist.
+
+Mandating that allows the addition of new properties that break
+backward compatibility, ie user space must know how a new property
+affects the object to be able to use it safely.
+
+Main memory of each node is exposed under a common target. For now
+device drivers are responsible for registering the memory they want to
+expose through that scheme but in the future that information might
+come from the system firmware (this is a different discussion).
+
+
+
+2) hbind() bind range of virtual address to heterogeneous memory
+================================================================
+
+So instead of using a bitmap, hbind() takes an array of uids and each
+uid is a unique memory target inside the new memory topology description.
+User space also provides an array of modifiers. Modifiers can be seen as
+the flags parameter of mbind() but here we use an array so that user
+space can not only supply a modifier but also a value with it. This should
+allow the API to grow more features in the future. The kernel should
+return -EINVAL if it is provided with an unknown modifier and just ignore
+the call altogether, forcing user space to restrict itself to modifiers
+supported by the kernel it is running on (I know I am dreaming about well
+behaved user space).
+
+
+Note that none of this is exclusive of automatic memory placement like
+autonuma. I also believe that we will see something similar to autonuma
+for device memory.
+
+
+3) Tracking and applying heterogeneous memory policies
+======================================================
+
+Current memory policy infrastructure is node oriented. Instead of
+changing that and risking breakage and regression, HMS adds a new
+heterogeneous policy tracking infrastructure. The expectation is
+that existing applications can keep using mbind() and all existing
+infrastructure undisturbed and unaffected, while new applications
+will use the new API and should avoid mixing and matching both (as
+they can achieve the same thing with the new API).
+
+Also the policy is not directly tied to the vma structure for a few
+reasons:
+    - avoid having to split vma for policies that do not cover a full vma
+    - avoid changing too much vma code
+    - avoid growing the vma structure with an extra pointer
+
+The overall design is simple: on an hbind() call a hms policy structure
+is created for the supplied range and hms uses the callback associated
+with the target memory. This callback is provided by the device driver
+for device memory or by core HMS for regular main memory. The callback
+can decide to migrate the range to the target memories or do nothing
+(this can be influenced by flags provided to hbind() too).
-- 
2.17.2



* [RFC PATCH 03/14] mm/hms: add target memory to heterogeneous memory system infrastructure
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
  2018-12-03 23:34 ` [RFC PATCH 01/14] mm/hms: heterogeneous memory system (sysfs infrastructure) jglisse
  2018-12-03 23:34 ` [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation jglisse
@ 2018-12-03 23:34 ` jglisse
  2018-12-03 23:34 ` [RFC PATCH 04/14] mm/hms: add initiator " jglisse
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

A target is some kind of memory. It can be regular main memory or some
more specialized memory like a CPU's HBM (High Bandwidth Memory) or some
device's memory.

Some target memory might not be accessible by all initiators (anything
that can trigger memory access). For instance some device memory might
not be accessible by the CPU. This is what makes a system truly
heterogeneous.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
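Not part of the patch: a small userspace model of the size accounting done
by hms_target_add_memory() and hms_target_remove_memory() in this patch,
meant only to make the clamping behavior easy to see. The struct and the
model_* function names are hypothetical stand-ins. The real functions also
take hms_target_mutex around the update, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the size field of struct hms_target. */
struct target_model {
	unsigned long size;
};

/* Grow the target; like the patch, a NULL target is silently ignored. */
static void model_add_memory(struct target_model *t, unsigned long size)
{
	if (t == NULL)
		return;
	t->size += size;	/* like the patch, no overflow check */
}

/* Shrink the target, clamping at zero instead of wrapping around. */
static void model_remove_memory(struct target_model *t, unsigned long size)
{
	unsigned long before;

	if (t == NULL)
		return;
	before = t->size;
	t->size = size < before ? before - size : 0;
	assert(t->size <= before);	/* removal never grows the target */
}
```

Removing more memory than the target currently holds leaves the size at 0
rather than producing a huge wrapped value in the sysfs "size" file.
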
 drivers/base/Makefile     |   2 +-
 drivers/base/hms-target.c | 193 ++++++++++++++++++++++++++++++++++++++
 include/linux/hms.h       |  43 ++++++++-
 3 files changed, 235 insertions(+), 3 deletions(-)
 create mode 100644 drivers/base/hms-target.c

diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 92ebfacbf0dc..8e8092145f18 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -12,7 +12,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
-obj-$(CONFIG_HMS)	+= hms.o
+obj-$(CONFIG_HMS)	+= hms.o hms-target.o
 obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
diff --git a/drivers/base/hms-target.c b/drivers/base/hms-target.c
new file mode 100644
index 000000000000..ce28dfe089a3
--- /dev/null
+++ b/drivers/base/hms-target.c
@@ -0,0 +1,193 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#include <linux/capability.h>
+#include <linux/topology.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/hms.h>
+
+
+static DEFINE_MUTEX(hms_target_mutex);
+
+
+static inline struct hms_target *hms_object_to_target(struct hms_object *object)
+{
+	if (object == NULL)
+		return NULL;
+
+	if (object->type != HMS_TARGET)
+		return NULL;
+	return container_of(object, struct hms_target, object);
+}
+
+static inline struct hms_target *device_to_hms_target(struct device *device)
+{
+	if (device == NULL)
+		return NULL;
+
+	return hms_object_to_target(to_hms_object(device));
+}
+
+struct hms_target *hms_target_find_locked(unsigned uid)
+{
+	struct hms_object *object = hms_object_find_locked(uid);
+	struct hms_target *target;
+
+	target = hms_object_to_target(object);
+	if (target)
+		return target;
+	hms_object_put(object);
+	return NULL;
+}
+
+struct hms_target *hms_target_find(unsigned uid)
+{
+	struct hms_object *object = hms_object_find(uid);
+	struct hms_target *target;
+
+	target = hms_object_to_target(object);
+	if (target)
+		return target;
+	hms_object_put(object);
+	return NULL;
+}
+
+static void hms_target_release(struct device *device)
+{
+	struct hms_target *target = device_to_hms_target(device);
+
+	hms_object_release(&target->object);
+	kfree(target);
+}
+
+static ssize_t hms_target_show_size(struct device *device,
+				    struct device_attribute *attr,
+				    char *buf)
+{
+	struct hms_target *target = device_to_hms_target(device);
+
+	if (target == NULL)
+		return -EINVAL;
+
+	return sprintf(buf, "%lu\n", target->size);
+}
+
+static ssize_t hms_target_show_nid(struct device *device,
+				   struct device_attribute *attr,
+				   char *buf)
+{
+	struct hms_target *target = device_to_hms_target(device);
+
+	if (target == NULL)
+		return -EINVAL;
+
+	return sprintf(buf, "%d\n", target->nid);
+}
+
+static ssize_t hms_target_show_uid(struct device *device,
+				   struct device_attribute *attr,
+				   char *buf)
+{
+	struct hms_target *target = device_to_hms_target(device);
+
+	if (target == NULL)
+		return -EINVAL;
+
+	return sprintf(buf, "%u\n", target->object.uid);
+}
+
+static DEVICE_ATTR(size, 0444, hms_target_show_size, NULL);
+static DEVICE_ATTR(nid, 0444, hms_target_show_nid, NULL);
+static DEVICE_ATTR(uid, 0444, hms_target_show_uid, NULL);
+
+static struct attribute *hms_target_attrs[] = {
+	&dev_attr_size.attr,
+	&dev_attr_nid.attr,
+	&dev_attr_uid.attr,
+	NULL
+};
+
+static struct attribute_group hms_target_attr_group = {
+	.attrs = hms_target_attrs,
+};
+
+static const struct attribute_group *hms_target_attr_groups[] = {
+	&hms_target_attr_group,
+	NULL,
+};
+
+void hms_target_register(struct hms_target **targetp, struct device *parent,
+			 int nid, const struct hms_target_hbind *hbind,
+			 unsigned long size, unsigned version)
+{
+	struct hms_target *target;
+
+	*targetp = NULL;
+	target = kzalloc(sizeof(*target), GFP_KERNEL);
+	if (target == NULL)
+		return;
+
+	target->nid = nid;
+	target->size = size;
+	target->hbind = hbind;
+
+	if (hms_object_init(&target->object, parent, HMS_TARGET, version,
+			    hms_target_release, hms_target_attr_groups)) {
+		kfree(target);
+		target = NULL;
+	}
+
+	*targetp = target;
+}
+EXPORT_SYMBOL(hms_target_register);
+
+void hms_target_add_memory(struct hms_target *target, unsigned long size)
+{
+	if (target) {
+		mutex_lock(&hms_target_mutex);
+		target->size += size;
+		mutex_unlock(&hms_target_mutex);
+	}
+}
+EXPORT_SYMBOL(hms_target_add_memory);
+
+void hms_target_remove_memory(struct hms_target *target, unsigned long size)
+{
+	if (target) {
+		mutex_lock(&hms_target_mutex);
+		target->size = size < target->size ? target->size - size : 0;
+		mutex_unlock(&hms_target_mutex);
+	}
+}
+EXPORT_SYMBOL(hms_target_remove_memory);
+
+void hms_target_unregister(struct hms_target **targetp)
+{
+	struct hms_target *target = *targetp;
+
+	*targetp = NULL;
+	if (target == NULL)
+		return;
+
+	hms_object_unregister(&target->object);
+}
+EXPORT_SYMBOL(hms_target_unregister);
diff --git a/include/linux/hms.h b/include/linux/hms.h
index 1ab288df0158..0568fdf6d479 100644
--- a/include/linux/hms.h
+++ b/include/linux/hms.h
@@ -17,10 +17,21 @@
 /* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
 #ifndef HMS_H
 #define HMS_H
-#if IS_ENABLED(CONFIG_HMS)
-
 
 #include <linux/device.h>
+#include <linux/types.h>
+
+
+struct hms_target;
+
+struct hms_target_hbind {
+	int (*migrate)(struct hms_target *target, struct mm_struct *mm,
+		       unsigned long start, unsigned long end,
+		       unsigned natoms, uint32_t *atoms);
+};
+
+
+#if IS_ENABLED(CONFIG_HMS)
 
 
 #define to_hms_object(device) container_of(device, struct hms_object, device)
@@ -56,12 +67,40 @@ struct hms_object *hms_object_find_locked(unsigned uid);
 struct hms_object *hms_object_find(unsigned uid);
 
 
+struct hms_target {
+	const struct hms_target_hbind *hbind;
+	struct hms_object object;
+	unsigned long size;
+	void *private;
+	int nid;
+};
+
+void hms_target_add_memory(struct hms_target *target, unsigned long size);
+void hms_target_remove_memory(struct hms_target *target, unsigned long size);
+void hms_target_register(struct hms_target **targetp, struct device *parent,
+			 int nid, const struct hms_target_hbind *hbind,
+			 unsigned long size, unsigned version);
+void hms_target_unregister(struct hms_target **targetp);
+struct hms_target *hms_target_find(unsigned uid);
+
+static inline void hms_target_put(struct hms_target *target)
+{
+	hms_object_put(&target->object);
+}
+
+
 int hms_init(void);
 
 
 #else /* IS_ENABLED(CONFIG_HMS) */
 
 
+#define hms_target_add_memory(target, size)
+#define hms_target_remove_memory(target, size)
+#define hms_target_register(targetp, parent, nid, hbind, size, version)
+#define hms_target_unregister(targetp)
+
+
 static inline int hms_init(void)
 {
 	return 0;
-- 
2.17.2



* [RFC PATCH 04/14] mm/hms: add initiator to heterogeneous memory system infrastructure
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (2 preceding siblings ...)
  2018-12-03 23:34 ` [RFC PATCH 03/14] mm/hms: add target memory to heterogeneous memory system infrastructure jglisse
@ 2018-12-03 23:34 ` " jglisse
  2018-12-03 23:35 ` [RFC PATCH 05/14] mm/hms: add link " jglisse
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:34 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

An initiator is anything that can initiate memory access, either a CPU
or a device. Here CPUs and devices are treated as equals.

See Documentation/vm/hms.rst for further details.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/base/Makefile        |   2 +-
 drivers/base/hms-initiator.c | 141 +++++++++++++++++++++++++++++++++++
 include/linux/hms.h          |  15 ++++
 3 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 drivers/base/hms-initiator.c

diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 8e8092145f18..6a1b5ab667bd 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -12,7 +12,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
-obj-$(CONFIG_HMS)	+= hms.o hms-target.o
+obj-$(CONFIG_HMS)	+= hms.o hms-target.o hms-initiator.o
 obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
diff --git a/drivers/base/hms-initiator.c b/drivers/base/hms-initiator.c
new file mode 100644
index 000000000000..08aa519427d6
--- /dev/null
+++ b/drivers/base/hms-initiator.c
@@ -0,0 +1,141 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#include <linux/capability.h>
+#include <linux/topology.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/hms.h>
+
+
+static inline struct hms_initiator *hms_object_to_initiator(struct hms_object *object)
+{
+	if (object == NULL)
+		return NULL;
+
+	if (object->type != HMS_INITIATOR)
+		return NULL;
+	return container_of(object, struct hms_initiator, object);
+}
+
+static inline struct hms_initiator *device_to_hms_initiator(struct device *device)
+{
+	if (device == NULL)
+		return NULL;
+
+	return hms_object_to_initiator(to_hms_object(device));
+}
+
+struct hms_initiator *hms_initiator_find_locked(unsigned uid)
+{
+	struct hms_object *object = hms_object_find_locked(uid);
+	struct hms_initiator *initiator;
+
+	initiator = hms_object_to_initiator(object);
+	if (initiator)
+		return initiator;
+	hms_object_put(object);
+	return NULL;
+}
+
+struct hms_initiator *hms_initiator_find(unsigned uid)
+{
+	struct hms_object *object = hms_object_find(uid);
+	struct hms_initiator *initiator;
+
+	initiator = hms_object_to_initiator(object);
+	if (initiator)
+		return initiator;
+	hms_object_put(object);
+	return NULL;
+}
+
+static void hms_initiator_release(struct device *device)
+{
+	struct hms_initiator *initiator = device_to_hms_initiator(device);
+
+	hms_object_release(&initiator->object);
+	kfree(initiator);
+}
+
+static ssize_t hms_initiator_show_uid(struct device *device,
+				      struct device_attribute *attr,
+				      char *buf)
+{
+	struct hms_initiator *initiator = device_to_hms_initiator(device);
+
+	if (initiator == NULL)
+		return -EINVAL;
+
+	return sprintf(buf, "%d\n", initiator->object.uid);
+}
+
+static DEVICE_ATTR(uid, 0444, hms_initiator_show_uid, NULL);
+
+static struct attribute *hms_initiator_attrs[] = {
+	&dev_attr_uid.attr,
+	NULL
+};
+
+static struct attribute_group hms_initiator_attr_group = {
+	.attrs = hms_initiator_attrs,
+};
+
+static const struct attribute_group *hms_initiator_attr_groups[] = {
+	&hms_initiator_attr_group,
+	NULL,
+};
+
+void hms_initiator_register(struct hms_initiator **initiatorp,
+			    struct device *parent, int nid,
+			    unsigned version)
+{
+	struct hms_initiator *initiator;
+
+	*initiatorp = NULL;
+	initiator = kzalloc(sizeof(*initiator), GFP_KERNEL);
+	if (initiator == NULL)
+		return;
+
+	initiator->nid = nid;
+
+	if (hms_object_init(&initiator->object, parent, HMS_INITIATOR, version,
+			    hms_initiator_release, hms_initiator_attr_groups))
+	{
+		kfree(initiator);
+		initiator = NULL;
+	}
+
+	*initiatorp = initiator;
+}
+EXPORT_SYMBOL(hms_initiator_register);
+
+void hms_initiator_unregister(struct hms_initiator **initiatorp)
+{
+	struct hms_initiator *initiator = *initiatorp;
+
+	*initiatorp = NULL;
+	if (initiator == NULL)
+		return;
+
+	hms_object_unregister(&initiator->object);
+}
+EXPORT_SYMBOL(hms_initiator_unregister);
diff --git a/include/linux/hms.h b/include/linux/hms.h
index 0568fdf6d479..7a2823493f63 100644
--- a/include/linux/hms.h
+++ b/include/linux/hms.h
@@ -67,6 +67,17 @@ struct hms_object *hms_object_find_locked(unsigned uid);
 struct hms_object *hms_object_find(unsigned uid);
 
 
+struct hms_initiator {
+	struct hms_object object;
+	int nid;
+};
+
+void hms_initiator_register(struct hms_initiator **initiatorp,
+			    struct device *parent, int nid,
+			    unsigned version);
+void hms_initiator_unregister(struct hms_initiator **initiatorp);
+
+
 struct hms_target {
 	const struct hms_target_hbind *hbind;
 	struct hms_object object;
@@ -95,6 +106,10 @@ int hms_init(void);
 #else /* IS_ENABLED(CONFIG_HMS) */
 
 
+#define hms_initiator_register(initiatorp, parent, nid, version)
+#define hms_initiator_unregister(initiatorp)
+
+
 #define hms_target_add_memory(target, size)
 #define hms_target_remove_memory(target, size)
 #define hms_target_register(targetp, nid, size)
-- 
2.17.2



* [RFC PATCH 05/14] mm/hms: add link to heterogeneous memory system infrastructure
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (3 preceding siblings ...)
  2018-12-03 23:34 ` [RFC PATCH 04/14] mm/hms: add initiator " jglisse
@ 2018-12-03 23:35 ` " jglisse
  2018-12-03 23:35 ` [RFC PATCH 06/14] mm/hms: add bridge " jglisse
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

A link connects initiators (CPUs or devices) and target memories with
each other. It does not necessarily match one to one with a physical
interconnect, i.e. a given physical interconnect may be presented as
multiple links, or multiple physical interconnects can be presented
as just one link.

What matters is that the properties associated with a link apply to
all initiators and targets listed as connected to that link.

For example, consider the PCIE bus: if all initiators can do peer to
peer with each other, then it can be presented as just one link that
contains all the PCIE devices and the local CPU (i.e. the CPU from
which the PCIE lanes originate). If not all PCIE devices can do peer
to peer with each other, then one link per peer to peer group is
created and the corresponding CPU is added to each.
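
The PCIE grouping rule above can be sketched in plain userspace C. This
is only an illustrative model, not the kernel API from this patch: the
names toy_link and toy_build_links are hypothetical, and it assumes the
caller already knows which peer to peer group each device belongs to,
with group ids numbered densely from 0.

```c
/* Hypothetical userspace model, not the HMS kernel API. */
#define TOY_MAX_LINKS 8

struct toy_link {
	unsigned int devices;	/* bit i set: PCIE device i is on this link */
	int has_local_cpu;	/* local CPU is attached to this link */
};

/*
 * group[i] is the peer to peer group id of device i: devices sharing an
 * id can all reach each other. One link is created per group and the
 * local CPU is added to every link. Returns the number of links, or -1
 * if the input does not fit the toy model.
 */
static int toy_build_links(const int *group, int ndev,
			   struct toy_link *links, int max_links)
{
	int nlinks = 0;
	int i;

	if (ndev < 0 || ndev > 32)
		return -1;

	for (i = 0; i < ndev; i++) {
		int g = group[i];

		if (g < 0 || g >= max_links)
			return -1;
		/* Create any link not seen yet, CPU included by default. */
		while (nlinks <= g) {
			links[nlinks].devices = 0;
			links[nlinks].has_local_cpu = 1;
			nlinks++;
		}
		links[g].devices |= 1u << i;
	}
	return nlinks;
}
```

With four devices all in group 0 the model yields a single link holding
every device; with groups {0, 0, 1, 1} it yields two links, each still
carrying the local CPU.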

See Documentation/vm/hms.rst for details.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/base/Makefile   |   2 +-
 drivers/base/hms-link.c | 183 ++++++++++++++++++++++++++++++++++++++++
 include/linux/hms.h     |  23 +++++
 3 files changed, 207 insertions(+), 1 deletion(-)
 create mode 100644 drivers/base/hms-link.c

diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index 6a1b5ab667bd..b8ff678fdae9 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -12,7 +12,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
-obj-$(CONFIG_HMS)	+= hms.o hms-target.o hms-initiator.o
+obj-$(CONFIG_HMS)	+= hms.o hms-target.o hms-initiator.o hms-link.o
 obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
diff --git a/drivers/base/hms-link.c b/drivers/base/hms-link.c
new file mode 100644
index 000000000000..58f4fdd8977c
--- /dev/null
+++ b/drivers/base/hms-link.c
@@ -0,0 +1,183 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#include <linux/capability.h>
+#include <linux/topology.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/hms.h>
+
+
+struct hms_link *hms_object_to_link(struct hms_object *object)
+{
+	if (object == NULL)
+		return NULL;
+
+	if (object->type != HMS_LINK)
+		return NULL;
+	return container_of(object, struct hms_link, object);
+}
+
+static inline struct hms_link *device_to_hms_link(struct device *device)
+{
+	if (device == NULL)
+		return NULL;
+
+	return hms_object_to_link(to_hms_object(device));
+}
+
+struct hms_link *hms_link_find_locked(unsigned uid)
+{
+	struct hms_object *object = hms_object_find_locked(uid);
+	struct hms_link *link;
+
+	link = hms_object_to_link(object);
+	if (link)
+		return link;
+	hms_object_put(object);
+	return NULL;
+}
+
+struct hms_link *hms_link_find(unsigned uid)
+{
+	struct hms_object *object = hms_object_find(uid);
+	struct hms_link *link;
+
+	link = hms_object_to_link(object);
+	if (link)
+		return link;
+	hms_object_put(object);
+	return NULL;
+}
+
+static void hms_link_release(struct device *device)
+
+{
+	struct hms_link *link = device_to_hms_link(device);
+
+	hms_object_release(&link->object);
+	kfree(link);
+}
+
+static ssize_t hms_link_show_uid(struct device *device,
+				   struct device_attribute *attr,
+				   char *buf)
+{
+	struct hms_link *link = device_to_hms_link(device);
+
+	if (link == NULL)
+		return -EINVAL;
+
+	return sprintf(buf, "%d\n", link->object.uid);
+}
+
+static DEVICE_ATTR(uid, 0444, hms_link_show_uid, NULL);
+
+static struct attribute *hms_link_attrs[] = {
+	&dev_attr_uid.attr,
+	NULL
+};
+
+static struct attribute_group hms_link_attr_group = {
+	.attrs = hms_link_attrs,
+};
+
+static const struct attribute_group *hms_link_attr_groups[] = {
+	&hms_link_attr_group,
+	NULL,
+};
+
+void hms_link_register(struct hms_link **linkp, struct device *parent,
+		       unsigned version)
+{
+	struct hms_link *link;
+
+	*linkp = NULL;
+	link = kzalloc(sizeof(*link), GFP_KERNEL);
+	if (link == NULL)
+		return;
+
+	if (hms_object_init(&link->object, parent, HMS_LINK, version,
+			    hms_link_release, hms_link_attr_groups)) {
+		kfree(link);
+		link = NULL;
+	}
+
+	*linkp = link;
+}
+EXPORT_SYMBOL(hms_link_register);
+
+void hms_unlink_initiator(struct hms_link *link,
+			  struct hms_initiator *initiator)
+{
+	if (link == NULL || initiator == NULL)
+		return;
+	if (link->object.type != HMS_LINK)
+		return;
+	if (initiator->object.type != HMS_INITIATOR)
+		return;
+	hms_object_unlink(&link->object, &initiator->object);
+}
+EXPORT_SYMBOL(hms_unlink_initiator);
+
+void hms_unlink_target(struct hms_link *link, struct hms_target *target)
+{
+	if (link == NULL || target == NULL)
+		return;
+	if (link->object.type != HMS_LINK || target->object.type != HMS_TARGET)
+		return;
+	hms_object_unlink(&link->object, &target->object);
+}
+EXPORT_SYMBOL(hms_unlink_target);
+
+int hms_link_initiator(struct hms_link *link, struct hms_initiator *initiator)
+{
+	if (link == NULL || initiator == NULL)
+		return -EINVAL;
+	if (link->object.type != HMS_LINK)
+		return -EINVAL;
+	if (initiator->object.type != HMS_INITIATOR)
+		return -EINVAL;
+	return hms_object_link(&link->object, &initiator->object);
+}
+EXPORT_SYMBOL(hms_link_initiator);
+
+int hms_link_target(struct hms_link *link, struct hms_target *target)
+{
+	if (link == NULL || target == NULL)
+		return -EINVAL;
+	if (link->object.type != HMS_LINK || target->object.type != HMS_TARGET)
+		return -EINVAL;
+	return hms_object_link(&link->object, &target->object);
+}
+EXPORT_SYMBOL(hms_link_target);
+
+void hms_link_unregister(struct hms_link **linkp)
+{
+	struct hms_link *link = *linkp;
+
+	*linkp = NULL;
+	if (link == NULL)
+		return;
+
+	hms_object_unregister(&link->object);
+}
+EXPORT_SYMBOL(hms_link_unregister);
diff --git a/include/linux/hms.h b/include/linux/hms.h
index 7a2823493f63..2a9e49a2d771 100644
--- a/include/linux/hms.h
+++ b/include/linux/hms.h
@@ -100,6 +100,21 @@ static inline void hms_target_put(struct hms_target *target)
 }
 
 
+struct hms_link {
+	struct hms_object object;
+};
+
+struct hms_link *hms_object_to_link(struct hms_object *object);
+void hms_unlink_initiator(struct hms_link *link,
+			  struct hms_initiator *initiator);
+void hms_unlink_target(struct hms_link *link, struct hms_target *target);
+int hms_link_initiator(struct hms_link *link, struct hms_initiator *initiator);
+int hms_link_target(struct hms_link *link, struct hms_target *target);
+void hms_link_register(struct hms_link **linkp, struct device *parent,
+		       unsigned version);
+void hms_link_unregister(struct hms_link **linkp);
+
+
 int hms_init(void);
 
 
@@ -116,6 +131,14 @@ int hms_init(void);
 #define hms_target_unregister(targetp)
 
 
+#define hms_unlink_initiator(link, initiator)
+#define hms_unlink_target(link, target)
+#define hms_link_initiator(link, initiator)
+#define hms_link_target(link, target)
+#define hms_link_register(linkp, parent, version)
+#define hms_link_unregister(linkp)
+
+
 static inline int hms_init(void)
 {
 	return 0;
-- 
2.17.2



* [RFC PATCH 06/14] mm/hms: add bridge to heterogeneous memory system infrastructure
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (4 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 05/14] mm/hms: add link " jglisse
@ 2018-12-03 23:35 ` " jglisse
  2018-12-03 23:35 ` [RFC PATCH 07/14] mm/hms: register main memory with heterogenenous memory system jglisse
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

A bridge connects two links with each other and applies only to the
listed initiators. Together with links, this allows describing any
kind of system topology, i.e. any kind of directed graph.

Moreover, with bridges userspace can choose to use different bridges
to load balance bandwidth usage across multiple paths between target
memories and initiators. Note that explicit path selection is not
always under the control of userspace; some systems might do load
balancing in hardware.
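
The way a bridge restricts reachability can be sketched with a small
userspace model. This is not the kernel code from this patch: toy_bridge
and toy_can_reach are hypothetical names, links are represented as bits
in a mask, only a single bridge hop is considered, and the bridge is
treated as usable in both directions, which is an assumption of the
sketch.

```c
/* Hypothetical userspace model, not the HMS kernel API. */
struct toy_bridge {
	int linka;		/* index of the first bridged link */
	int linkb;		/* index of the second bridged link */
	unsigned int initiators;	/* bit i set: initiator i may use it */
};

/*
 * initiator_links and target_links are bitmasks of the links the
 * initiator and the target memory sit on. Returns 1 if the initiator
 * can reach the target directly through a shared link, or through one
 * bridge that lists this initiator; returns 0 otherwise.
 */
static int toy_can_reach(int initiator, unsigned int initiator_links,
			 unsigned int target_links,
			 const struct toy_bridge *bridges, int nbridges)
{
	int i;

	/* A shared link needs no bridge at all. */
	if (initiator_links & target_links)
		return 1;

	for (i = 0; i < nbridges; i++) {
		const struct toy_bridge *b = &bridges[i];
		unsigned int a = 1u << b->linka;
		unsigned int c = 1u << b->linkb;

		/* A bridge only applies to the initiators listed on it. */
		if (!(b->initiators & (1u << initiator)))
			continue;
		if ((initiator_links & a) && (target_links & c))
			return 1;
		if ((initiator_links & c) && (target_links & a))
			return 1;
	}
	return 0;
}
```

In this model two initiators on the same link can see different sets of
targets, depending on which bridges list them.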

See Documentation/vm/hms.rst for details.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/base/Makefile     |   2 +-
 drivers/base/hms-bridge.c | 197 ++++++++++++++++++++++++++++++++++++++
 include/linux/hms.h       |  24 +++++
 3 files changed, 222 insertions(+), 1 deletion(-)
 create mode 100644 drivers/base/hms-bridge.c

diff --git a/drivers/base/Makefile b/drivers/base/Makefile
index b8ff678fdae9..62695fdcd32f 100644
--- a/drivers/base/Makefile
+++ b/drivers/base/Makefile
@@ -12,7 +12,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
-obj-$(CONFIG_HMS)	+= hms.o hms-target.o hms-initiator.o hms-link.o
+obj-$(CONFIG_HMS)	+= hms.o hms-target.o hms-initiator.o hms-link.o hms-bridge.o
 obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
diff --git a/drivers/base/hms-bridge.c b/drivers/base/hms-bridge.c
new file mode 100644
index 000000000000..64732e923fba
--- /dev/null
+++ b/drivers/base/hms-bridge.c
@@ -0,0 +1,197 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#include <linux/capability.h>
+#include <linux/topology.h>
+#include <linux/uaccess.h>
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/hms.h>
+
+
+static inline struct hms_bridge *hms_object_to_bridge(struct hms_object *object)
+{
+	if (object == NULL)
+		return NULL;
+
+	if (object->type != HMS_BRIDGE)
+		return NULL;
+	return container_of(object, struct hms_bridge, object);
+}
+
+static inline struct hms_bridge *device_to_hms_bridge(struct device *device)
+{
+	if (device == NULL)
+		return NULL;
+
+	return hms_object_to_bridge(to_hms_object(device));
+}
+
+struct hms_bridge *hms_bridge_find_locked(unsigned uid)
+{
+	struct hms_object *object = hms_object_find_locked(uid);
+	struct hms_bridge *bridge;
+
+	bridge = hms_object_to_bridge(object);
+	if (bridge)
+		return bridge;
+	hms_object_put(object);
+	return NULL;
+}
+
+struct hms_bridge *hms_bridge_find(unsigned uid)
+{
+	struct hms_object *object = hms_object_find(uid);
+	struct hms_bridge *bridge;
+
+	bridge = hms_object_to_bridge(object);
+	if (bridge)
+		return bridge;
+	hms_object_put(object);
+	return NULL;
+}
+
+static void hms_bridge_release(struct device *device)
+{
+	struct hms_bridge *bridge = device_to_hms_bridge(device);
+
+	hms_object_put(&bridge->linka->object);
+	hms_object_put(&bridge->linkb->object);
+	hms_object_release(&bridge->object);
+	kfree(bridge);
+}
+
+static ssize_t hms_bridge_show_uid(struct device *device,
+				   struct device_attribute *attr,
+				   char *buf)
+{
+	struct hms_bridge *bridge = device_to_hms_bridge(device);
+
+	if (bridge == NULL)
+		return -EINVAL;
+
+	return sprintf(buf, "%d\n", bridge->object.uid);
+}
+
+static DEVICE_ATTR(uid, 0444, hms_bridge_show_uid, NULL);
+
+static struct attribute *hms_bridge_attrs[] = {
+	&dev_attr_uid.attr,
+	NULL
+};
+
+static struct attribute_group hms_bridge_attr_group = {
+	.attrs = hms_bridge_attrs,
+};
+
+static const struct attribute_group *hms_bridge_attr_groups[] = {
+	&hms_bridge_attr_group,
+	NULL,
+};
+
+void hms_bridge_register(struct hms_bridge **bridgep,
+			 struct device *parent,
+			 struct hms_link *linka,
+			 struct hms_link *linkb,
+			 unsigned version)
+{
+	struct hms_bridge *bridge;
+	int ret;
+
+	*bridgep = NULL;
+
+	if (linka == NULL || linkb == NULL)
+		return;
+	linka = hms_object_to_link(hms_object_get(&linka->object));
+	linkb = hms_object_to_link(hms_object_get(&linkb->object));
+	if (linka == NULL || linkb == NULL)
+		goto error;
+
+	bridge = kzalloc(sizeof(*bridge), GFP_KERNEL);
+	if (bridge == NULL)
+		goto error;
+
+	if (hms_object_init(&bridge->object, parent, HMS_BRIDGE, version,
+			    hms_bridge_release, hms_bridge_attr_groups)) {
+		kfree(bridge);
+		goto error;
+	}
+
+	bridge->linka = linka;
+	bridge->linkb = linkb;
+
+	ret = hms_object_link(&bridge->object, &linka->object);
+	if (ret) {
+		hms_bridge_unregister(&bridge);
+		return;
+	}
+
+	ret = hms_object_link(&bridge->object, &linkb->object);
+	if (ret) {
+		hms_bridge_unregister(&bridge);
+		return;
+	}
+
+	*bridgep = bridge;
+	return;
+
+error:
+	hms_object_put(&linka->object);
+	hms_object_put(&linkb->object);
+}
+EXPORT_SYMBOL(hms_bridge_register);
+
+void hms_unbridge_initiator(struct hms_bridge *bridge,
+			    struct hms_initiator *initiator)
+{
+	if (bridge == NULL || initiator == NULL)
+		return;
+	if (bridge->object.type != HMS_BRIDGE)
+		return;
+	if (initiator->object.type != HMS_INITIATOR)
+		return;
+	hms_object_unlink(&bridge->object, &initiator->object);
+}
+EXPORT_SYMBOL(hms_unbridge_initiator);
+
+int hms_bridge_initiator(struct hms_bridge *bridge,
+			 struct hms_initiator *initiator)
+{
+	if (bridge == NULL || initiator == NULL)
+		return -EINVAL;
+	if (bridge->object.type != HMS_BRIDGE)
+		return -EINVAL;
+	if (initiator->object.type != HMS_INITIATOR)
+		return -EINVAL;
+	return hms_object_link(&bridge->object, &initiator->object);
+}
+EXPORT_SYMBOL(hms_bridge_initiator);
+
+void hms_bridge_unregister(struct hms_bridge **bridgep)
+{
+	struct hms_bridge *bridge = *bridgep;
+
+	*bridgep = NULL;
+	if (bridge == NULL)
+		return;
+
+	hms_object_unregister(&bridge->object);
+}
+EXPORT_SYMBOL(hms_bridge_unregister);
diff --git a/include/linux/hms.h b/include/linux/hms.h
index 2a9e49a2d771..511b5363d8f2 100644
--- a/include/linux/hms.h
+++ b/include/linux/hms.h
@@ -115,6 +115,24 @@ void hms_link_register(struct hms_link **linkp, struct device *parent,
 void hms_link_unregister(struct hms_link **linkp);
 
 
+struct hms_bridge {
+	struct hms_object object;
+	struct hms_link *linka;
+	struct hms_link *linkb;
+};
+
+void hms_unbridge_initiator(struct hms_bridge *bridge,
+			    struct hms_initiator *initiator);
+int hms_bridge_initiator(struct hms_bridge *bridge,
+			 struct hms_initiator *initiator);
+void hms_bridge_register(struct hms_bridge **bridgep,
+			 struct device *parent,
+			 struct hms_link *linka,
+			 struct hms_link *linkb,
+			 unsigned version);
+void hms_bridge_unregister(struct hms_bridge **bridgep);
+
+
 int hms_init(void);
 
 
@@ -139,6 +157,12 @@ int hms_init(void);
 #define hms_link_unregister(linkp)
 
 
+#define hms_unbridge_initiator(bridge, initiator)
+#define hms_bridge_initiator(bridge, initiator)
+#define hms_bridge_register(bridgep, parent, linka, linkb, version)
+#define hms_bridge_unregister(bridgep)
+
+
 static inline int hms_init(void)
 {
 	return 0;
-- 
2.17.2



* [RFC PATCH 07/14] mm/hms: register main memory with heterogenenous memory system
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (5 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 06/14] mm/hms: add bridge " jglisse
@ 2018-12-03 23:35 ` jglisse
  2018-12-03 23:35 ` [RFC PATCH 08/14] mm/hms: register main CPUs " jglisse
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

Register main memory as a target under the HMS scheme. Memory is
registered per node (one target device per node). We also create a
default link to connect main memory and the CPUs that are in the same
node. For details see Documentation/vm/hms.rst.

This is done to allow applications to use one API for both regular
memory and device memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/base/node.c  | 65 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/node.h |  6 ++++
 2 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..05621ba3cf13 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -323,6 +323,11 @@ static int register_node(struct node *node, int num)
 	if (error)
 		put_device(&node->dev);
 	else {
+		hms_link_register(&node->link, &node->dev, 0);
+		hms_target_register(&node->target, &node->dev,
+				    num, NULL, 0, 0);
+		hms_link_target(node->link, node->target);
+
 		hugetlb_register_node(node);
 
 		compaction_register_node(node);
@@ -339,6 +344,9 @@ static int register_node(struct node *node, int num)
  */
 void unregister_node(struct node *node)
 {
+	hms_target_unregister(&node->target);
+	hms_link_unregister(&node->link);
+
 	hugetlb_unregister_node(node);		/* no-op, if memoryless node */
 
 	device_unregister(&node->dev);
@@ -415,6 +423,9 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
+#if defined(CONFIG_HMS)
+		unsigned long size = PAGE_SIZE;
+#endif
 		int page_nid;
 
 		/*
@@ -445,9 +456,35 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, void *arg)
 		if (ret)
 			return ret;
 
-		return sysfs_create_link_nowarn(&mem_blk->dev.kobj,
+		ret = sysfs_create_link_nowarn(&mem_blk->dev.kobj,
 				&node_devices[nid]->dev.kobj,
 				kobject_name(&node_devices[nid]->dev.kobj));
+		if (ret)
+			return ret;
+
+#if defined(CONFIG_HMS)
+		/*
+		 * Right now I do not see any easier way to get the size in
+		 * bytes of valid memory that is added to this node.
+		 */
+		for (++pfn; pfn <= sect_end_pfn; pfn++) {
+			if (!pfn_present(pfn)) {
+				pfn = round_down(pfn + PAGES_PER_SECTION,
+						PAGES_PER_SECTION) - 1;
+				continue;
+			}
+			page_nid = get_nid_for_pfn(pfn);
+			if (page_nid < 0)
+				continue;
+			if (page_nid != nid)
+				continue;
+			size += PAGE_SIZE;
+		}
+
+		hms_target_add_memory(node_devices[nid]->target, size);
+#endif
+
+		return 0;
 	}
 	/* mem section does not span the specified node */
 	return 0;
@@ -471,6 +508,10 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 	sect_start_pfn = section_nr_to_pfn(phys_index);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
+#if defined(CONFIG_HMS)
+		unsigned long size = 0;
+		int page_nid;
+#endif
 		int nid;
 
 		nid = get_nid_for_pfn(pfn);
@@ -484,6 +525,28 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 			 kobject_name(&mem_blk->dev.kobj));
 		sysfs_remove_link(&mem_blk->dev.kobj,
 			 kobject_name(&node_devices[nid]->dev.kobj));
+
+#if defined(CONFIG_HMS)
+		/*
+		 * Right now I do not see any easier way to get the size in
+		 * bytes of valid memory that is removed from this node.
+		 */
+		for (; pfn <= sect_end_pfn; pfn++) {
+			if (!pfn_present(pfn)) {
+				pfn = round_down(pfn + PAGES_PER_SECTION,
+						PAGES_PER_SECTION) - 1;
+				continue;
+			}
+			page_nid = get_nid_for_pfn(pfn);
+			if (page_nid < 0)
+				continue;
+			if (page_nid != nid)
+				break;
+			size += PAGE_SIZE;
+		}
+
+		hms_target_remove_memory(node_devices[nid]->target, size);
+#endif
 	}
 	NODEMASK_FREE(unlinked_nodes);
 	return 0;
diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..297b01d3c1ed 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -15,6 +15,7 @@
 #ifndef _LINUX_NODE_H_
 #define _LINUX_NODE_H_
 
+#include <linux/hms.h>
 #include <linux/device.h>
 #include <linux/cpumask.h>
 #include <linux/workqueue.h>
@@ -22,6 +23,11 @@
 struct node {
 	struct device	dev;
 
+#if defined(CONFIG_HMS)
+	struct hms_target *target;
+	struct hms_link *link;
+#endif
+
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
-- 
2.17.2



* [RFC PATCH 08/14] mm/hms: register main CPUs with heterogeneous memory system
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (6 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 07/14] mm/hms: register main memory with heterogeneous memory system jglisse
@ 2018-12-03 23:35 ` " jglisse
  2018-12-03 23:35 ` [RFC PATCH 09/14] mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS) jglisse
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

Register CPUs as initiators under the HMS scheme. CPUs are registered
per node (one initiator device per node per CPU). We also add each CPU
to its node's default link so that it is connected to the main memory
of that node. For details see Documentation/vm/hms.rst.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/base/cpu.c  |  5 +++++
 drivers/base/node.c | 18 +++++++++++++++++-
 include/linux/cpu.h |  4 ++++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index eb9443d5bae1..160454bc5c38 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -76,6 +76,8 @@ void unregister_cpu(struct cpu *cpu)
 {
 	int logical_cpu = cpu->dev.id;
 
+	hms_initiator_unregister(&cpu->initiator);
+
 	unregister_cpu_under_node(logical_cpu, cpu_to_node(logical_cpu));
 
 	device_unregister(&cpu->dev);
@@ -392,6 +394,9 @@ int register_cpu(struct cpu *cpu, int num)
 	dev_pm_qos_expose_latency_limit(&cpu->dev,
 					PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
 
+	hms_initiator_register(&cpu->initiator, &cpu->dev,
+			       cpu_to_node(num), 0);
+
 	return 0;
 }
 
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 05621ba3cf13..43f1820cdadb 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -375,9 +375,19 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
 	if (ret)
 		return ret;
 
-	return sysfs_create_link(&obj->kobj,
+	ret = sysfs_create_link(&obj->kobj,
 				 &node_devices[nid]->dev.kobj,
 				 kobject_name(&node_devices[nid]->dev.kobj));
+	if (ret)
+		return ret;
+
+	if (IS_ENABLED(CONFIG_HMS)) {
+		struct cpu *cpu = container_of(obj, struct cpu, dev);
+
+		hms_link_initiator(node_devices[nid]->link, cpu->initiator);
+	}
+
+	return 0;
 }
 
 int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
@@ -396,6 +406,12 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 	sysfs_remove_link(&obj->kobj,
 			  kobject_name(&node_devices[nid]->dev.kobj));
 
+	if (IS_ENABLED(CONFIG_HMS)) {
+		struct cpu *cpu = container_of(obj, struct cpu, dev);
+
+		hms_unlink_initiator(node_devices[nid]->link, cpu->initiator);
+	}
+
 	return 0;
 }
 
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 218df7f4d3e1..1e3a777bfa3d 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -14,6 +14,7 @@
 #ifndef _LINUX_CPU_H_
 #define _LINUX_CPU_H_
 
+#include <linux/hms.h>
 #include <linux/node.h>
 #include <linux/compiler.h>
 #include <linux/cpumask.h>
@@ -27,6 +28,9 @@ struct cpu {
 	int node_id;		/* The node which contains the CPU */
 	int hotpluggable;	/* creates sysfs control file if hotpluggable */
 	struct device dev;
+#if defined(CONFIG_HMS)
+	struct hms_initiator *initiator;
+#endif
 };
 
 extern void boot_cpu_init(void);
-- 
2.17.2



* [RFC PATCH 09/14] mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS)
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (7 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 08/14] mm/hms: register main CPUs " jglisse
@ 2018-12-03 23:35 ` jglisse
  2018-12-03 23:35 ` [RFC PATCH 10/14] mm/hbind: add heterogeneous memory policy tracking infrastructure jglisse
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Haggai Eran, Balbir Singh,
	Aneesh Kumar K . V, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini

From: Jérôme Glisse <jglisse@redhat.com>

With the advance of heterogeneous computing and the new kinds of memory
topology that are becoming more widespread (CPU HBM, persistent memory,
...), we no longer have just a flat memory topology inside a NUMA node.
Instead there is a hierarchy of memory, for instance HBM for the CPU
versus main memory. Moreover, there is also device memory; a good
example is a GPU, which has a large amount of memory (several gigabytes,
and it keeps growing).

In the face of this, the mbind() API is too limited to allow precise
selection of which memory to use inside a node. This is why this patchset
introduces a new API, hbind() (for heterogeneous bind), which allows
binding to any kind of memory, whether it is some specific memory like a
CPU's HBM in a node or some device memory.

Instead of using a bitmap, hbind() takes an array of uids, where each uid
identifies a unique memory target inside the new HMS topology description.
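
To make the command encoding concrete, here is a minimal userspace
sketch of how atoms could be packed and walked. The macros are copied
from the include/uapi/linux/hbind.h added by this patch; the helpers
hbind_atom_make() and hbind_atoms_count() are hypothetical names used
only for illustration, and the bounds check in the walker is an
assumption about what a caller would want, not something the RFC ioctl
loop does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the include/uapi/linux/hbind.h introduced by this patch. */
#define HBIND_ATOM_GET_DWORDS(v) (((v) >> 20) & 0xfff)
#define HBIND_ATOM_SET_DWORDS(v) (((v) & 0xfff) << 20)
#define HBIND_ATOM_GET_CMD(v) ((v) & 0xfffff)
#define HBIND_ATOM_SET_CMD(v) ((v) & 0xfffff)

/*
 * Hypothetical helper: pack a command (low 20 bits) and the number of
 * additional dwords that follow it (next 12 bits) into a single atom.
 */
static uint32_t hbind_atom_make(uint32_t cmd, uint32_t extra_dwords)
{
	return HBIND_ATOM_SET_CMD(cmd) | HBIND_ATOM_SET_DWORDS(extra_dwords);
}

/*
 * Hypothetical helper: walk an atoms array the same way the ioctl loop
 * does (each command occupies 1 + DWORDS entries) and count the commands.
 * Returns -1 if a command claims more dwords than remain in the array.
 */
static int hbind_atoms_count(const uint32_t *atoms, uint32_t natoms)
{
	uint32_t i, ndwords;
	int ncmds = 0;

	assert(atoms != NULL || natoms == 0);

	for (i = 0; i < natoms; i += ndwords) {
		ndwords = 1 + HBIND_ATOM_GET_DWORDS(atoms[i]);
		if (ndwords > natoms - i)
			return -1;
		ncmds++;
	}
	return ncmds;
}
```

The command lives in the low 20 bits and the extra-dword count in the
12 bits above it, so a command with no payload is simply its command
number.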

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: linux-mm@kvack.org
---
 include/uapi/linux/hbind.h |  46 +++++++++++
 mm/Makefile                |   1 +
 mm/hms.c                   | 158 +++++++++++++++++++++++++++++++++++++
 3 files changed, 205 insertions(+)
 create mode 100644 include/uapi/linux/hbind.h
 create mode 100644 mm/hms.c

diff --git a/include/uapi/linux/hbind.h b/include/uapi/linux/hbind.h
new file mode 100644
index 000000000000..a9aba17ab142
--- /dev/null
+++ b/include/uapi/linux/hbind.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#ifndef LINUX_UAPI_HBIND
+#define LINUX_UAPI_HBIND
+
+
+/* For now just freak out if it is bigger than a page. */
+#define HBIND_MAX_TARGETS (4096 / 4)
+#define HBIND_MAX_ATOMS (4096 / 4)
+
+
+struct hbind_params {
+	uint64_t start;
+	uint64_t end;
+	uint32_t ntargets;
+	uint32_t natoms;
+	uint64_t targets;
+	uint64_t atoms;
+};
+
+
+#define HBIND_ATOM_GET_DWORDS(v) (((v) >> 20) & 0xfff)
+#define HBIND_ATOM_SET_DWORDS(v) (((v) & 0xfff) << 20)
+#define HBIND_ATOM_GET_CMD(v) ((v) & 0xfffff)
+#define HBIND_ATOM_SET_CMD(v) ((v) & 0xfffff)
+
+
+#define HBIND_IOCTL		_IOWR('H', 0x00, struct hbind_params)
+
+
+#endif /* LINUX_UAPI_HBIND */
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..0537a95f6cbd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,3 +99,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_HMS) += hms.o
diff --git a/mm/hms.c b/mm/hms.c
new file mode 100644
index 000000000000..bf328bd577dc
--- /dev/null
+++ b/mm/hms.c
@@ -0,0 +1,158 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+/* Heterogeneous memory system (HMS) see Documentation/vm/hms.rst */
+#define pr_fmt(fmt) "hms: " fmt
+
+#include <linux/miscdevice.h>
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/hms.h>
+#include <linux/fs.h>
+
+#include <uapi/linux/hbind.h>
+
+
+#define HBIND_FIX_ARRAY 64
+
+
+static ssize_t hbind_read(struct file *file, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static ssize_t hbind_write(struct file *file, const char __user *buf,
+			 size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	uint32_t *targets, *_dtargets = NULL, _ftargets[HBIND_FIX_ARRAY];
+	uint32_t *atoms, *_datoms = NULL, _fatoms[HBIND_FIX_ARRAY];
+	void __user *uarg = (void __user *)arg;
+	struct hbind_params params;
+	uint32_t i, ndwords;
+	int ret;
+
+	switch(cmd) {
+	case HBIND_IOCTL:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	ret = copy_from_user(&params, uarg, sizeof(params));
+	if (ret)
+		return -EFAULT;
+
+	/* Some sanity checks */
+	params.start &= PAGE_MASK;
+	params.end = PAGE_ALIGN(params.end);
+	if (params.end <= params.start)
+		return -EINVAL;
+
+	/* More sanity checks */
+	if (params.ntargets > HBIND_MAX_TARGETS)
+		return -EINVAL;
+
+	/* We need at least one atom. */
+	if (!params.natoms || params.natoms > HBIND_MAX_ATOMS)
+		return -EINVAL;
+
+	/* Let's allocate memory for parameters. */
+	if (params.ntargets > HBIND_FIX_ARRAY) {
+		_dtargets = kzalloc(4 * params.ntargets, GFP_KERNEL);
+		if (_dtargets == NULL)
+			return -ENOMEM;
+		targets = _dtargets;
+	} else {
+		targets = _ftargets;
+	}
+	if (params.natoms > HBIND_FIX_ARRAY) {
+		_datoms = kzalloc(4 * params.natoms, GFP_KERNEL);
+		if (_datoms == NULL) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		atoms = _datoms;
+	} else {
+		atoms = _fatoms;
+	}
+
+	/* Let's fetch hbind() parameters. */
+	ret = copy_from_user(atoms, (void __user *)params.atoms,
+			     4 * params.natoms) ? -EFAULT : 0;
+	if (ret)
+		goto out;
+	ret = copy_from_user(targets, (void __user *)params.targets,
+			     4 * params.ntargets) ? -EFAULT : 0;
+	if (ret)
+		goto out;
+
+	mmget(current->mm);
+
+	/* Sanity checks atoms and execute them. */
+	for (i = 0, ndwords = 1; i < params.natoms; i += ndwords) {
+		ndwords = 1 + HBIND_ATOM_GET_DWORDS(atoms[i]);
+		switch (HBIND_ATOM_GET_CMD(atoms[i])) {
+		default:
+			ret = -EINVAL;
+			goto out_mm;
+		}
+	}
+
+out_mm:
+	copy_to_user((void __user *)params.atoms, atoms, 4 * params.natoms);
+	mmput(current->mm);
+out:
+	kfree(_dtargets);
+	kfree(_datoms);
+	return ret;
+}
+
+const struct file_operations hbind_fops = {
+	.llseek		= no_llseek,
+	.read		= hbind_read,
+	.write		= hbind_write,
+	.unlocked_ioctl	= hbind_ioctl,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice hbind_device = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.fops = &hbind_fops,
+	.name = "hbind",
+};
+
+int __init hbind_init(void)
+{
+	pr_info("Heterogeneous memory system (HMS) hbind() driver\n");
+	return misc_register(&hbind_device);
+}
+
+void __exit hbind_fini(void)
+{
+	misc_deregister(&hbind_device);
+}
+
+module_init(hbind_init);
+module_exit(hbind_fini);
-- 
2.17.2



* [RFC PATCH 10/14] mm/hbind: add heterogeneous memory policy tracking infrastructure
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (8 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 09/14] mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS) jglisse
@ 2018-12-03 23:35 ` jglisse
  2018-12-03 23:35 ` [RFC PATCH 11/14] mm/hbind: add bind command to heterogeneous memory policy jglisse
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds infrastructure to track heterogeneous memory policy
within the kernel. Policies are defined over ranges of virtual addresses
of a process and are attached to the corresponding mm_struct.

Users can reset a range of virtual addresses to the default policy by
issuing the hbind() default command (HBIND_CMD_DEFAULT) for that range.
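
Resetting a range to the default policy amounts to cutting that range
out of every tracked policy range it overlaps. The following standalone
sketch shows only the interval arithmetic behind the cases handled in
hbind_default_locked() (split in two, trim on either side, or remove
entirely). It is a simplified model, not the kernel code: struct range
and range_cut() are hypothetical names, and no interval tree, locking
or reference counting is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range, mirroring the start/last fields of interval_tree_node. */
struct range {
	unsigned long start;
	unsigned long last;
};

/*
 * Remove [cut.start, cut.last] from r. The surviving pieces are written
 * to out[] and their count (0, 1 or 2) is returned.
 */
static size_t range_cut(struct range r, struct range cut, struct range out[2])
{
	size_t n = 0;

	assert(r.start <= r.last && cut.start <= cut.last);

	/* No overlap: r survives untouched. */
	if (cut.last < r.start || cut.start > r.last) {
		out[n++] = r;
		return n;
	}

	/* Piece to the left of the cut, if any. */
	if (r.start < cut.start)
		out[n++] = (struct range){ r.start, cut.start - 1 };

	/* Piece to the right of the cut, if any. */
	if (r.last > cut.last)
		out[n++] = (struct range){ cut.last + 1, r.last };

	return n;
}
```

Two surviving pieces correspond to the "node is split in 2" case, one
piece to a trim of the start or end, and zero pieces to a policy range
that lies fully inside the reset range.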

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/hms.h        |  46 ++++++
 include/linux/mm_types.h   |   6 +
 include/uapi/linux/hbind.h |   8 +
 kernel/fork.c              |   3 +
 mm/hms.c                   | 306 ++++++++++++++++++++++++++++++++++++-
 5 files changed, 368 insertions(+), 1 deletion(-)

diff --git a/include/linux/hms.h b/include/linux/hms.h
index 511b5363d8f2..f39c390b3afb 100644
--- a/include/linux/hms.h
+++ b/include/linux/hms.h
@@ -20,6 +20,8 @@
 
 #include <linux/device.h>
 #include <linux/types.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
 
 
 struct hms_target;
@@ -34,6 +36,10 @@ struct hms_target_hbind {
 #if IS_ENABLED(CONFIG_HMS)
 
 
+#include <linux/interval_tree.h>
+#include <linux/rwsem.h>
+
+
 #define to_hms_object(device) container_of(device, struct hms_object, device)
 
 enum hms_type {
@@ -133,6 +139,42 @@ void hms_bridge_register(struct hms_bridge **bridgep,
 void hms_bridge_unregister(struct hms_bridge **bridgep);
 
 
+struct hms_policy_targets {
+	struct hms_target **targets;
+	unsigned ntargets;
+	struct kref kref;
+};
+
+struct hms_policy_range {
+	struct hms_policy_targets *ptargets;
+	struct interval_tree_node node;
+	struct kref kref;
+};
+
+struct hms_policy {
+	struct rb_root_cached ranges;
+	struct rw_semaphore sem;
+	struct mmu_notifier mn;
+};
+
+static inline unsigned long hms_policy_range_start(struct hms_policy_range *r)
+{
+	return r->node.start;
+}
+
+static inline unsigned long hms_policy_range_end(struct hms_policy_range *r)
+{
+	return r->node.last + 1;
+}
+
+static inline void hms_policy_init(struct mm_struct *mm)
+{
+	mm->hpolicy = NULL;
+}
+
+void hms_policy_fini(struct mm_struct *mm);
+
+
 int hms_init(void);
 
 
@@ -163,6 +205,10 @@ int hms_init(void);
 #define hms_bridge_unregister(bridgep)
 
 
+#define hms_policy_init(mm)
+#define hms_policy_fini(mm)
+
+
 static inline int hms_init(void)
 {
 	return 0;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..3da91767c689 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -26,6 +26,7 @@ typedef int vm_fault_t;
 
 struct address_space;
 struct mem_cgroup;
+struct hms_policy;
 struct hmm;
 
 /*
@@ -491,6 +492,11 @@ struct mm_struct {
 		/* HMM needs to track a few things per mm */
 		struct hmm *hmm;
 #endif
+
+#if IS_ENABLED(CONFIG_HMS)
+		/* Heterogeneous Memory System policy */
+		struct hms_policy *hpolicy;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/uapi/linux/hbind.h b/include/uapi/linux/hbind.h
index a9aba17ab142..cc4687587f5a 100644
--- a/include/uapi/linux/hbind.h
+++ b/include/uapi/linux/hbind.h
@@ -39,6 +39,14 @@ struct hbind_params {
 #define HBIND_ATOM_GET_CMD(v) ((v) & 0xfffff)
 #define HBIND_ATOM_SET_CMD(v) ((v) & 0xfffff)
 
+/*
+ * HBIND_CMD_DEFAULT: restore the default policy, i.e. undo any previous policy.
+ *
+ * Additional dwords:
+ *      NONE (DWORDS MUST BE 0 !)
+ */
+#define HBIND_CMD_DEFAULT 0
+
 
 #define HBIND_IOCTL		_IOWR('H', 0x00, struct hbind_params)
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff89c7b..bc40edcadc69 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -38,6 +38,7 @@
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hmm.h>
+#include <linux/hms.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -671,6 +672,7 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	hmm_mm_destroy(mm);
+	hms_policy_fini(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
@@ -989,6 +991,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	RCU_INIT_POINTER(mm->exe_file, NULL);
 	mmu_notifier_mm_init(mm);
 	hmm_mm_init(mm);
+	hms_policy_init(mm);
 	init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
diff --git a/mm/hms.c b/mm/hms.c
index bf328bd577dc..be2c4e526f25 100644
--- a/mm/hms.c
+++ b/mm/hms.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/hms.h>
+#include <linux/mm.h>
 #include <linux/fs.h>
 
 #include <uapi/linux/hbind.h>
@@ -31,7 +32,6 @@
 
 #define HBIND_FIX_ARRAY 64
 
-
 static ssize_t hbind_read(struct file *file, char __user *buf,
 			size_t count, loff_t *ppos)
 {
@@ -44,6 +44,300 @@ static ssize_t hbind_write(struct file *file, const char __user *buf,
 	return -EINVAL;
 }
 
+
+static void hms_policy_targets_get(struct hms_policy_targets *ptargets)
+{
+	kref_get(&ptargets->kref);
+}
+
+static void hms_policy_targets_free(struct kref *kref)
+{
+	struct hms_policy_targets *ptargets;
+
+	ptargets = container_of(kref, struct hms_policy_targets, kref);
+	kfree(ptargets->targets);
+	kfree(ptargets);
+}
+
+static void hms_policy_targets_put(struct hms_policy_targets *ptargets)
+{
+	kref_put(&ptargets->kref, &hms_policy_targets_free);
+}
+
+static struct hms_policy_targets* hms_policy_targets_new(const uint32_t *targets,
+							 unsigned ntargets)
+{
+	struct hms_policy_targets *ptargets;
+	void *_targets;
+	unsigned i, c;
+
+	_targets = kzalloc(ntargets * sizeof(void *), GFP_KERNEL);
+	if (_targets == NULL)
+		return NULL;
+
+	ptargets = kmalloc(sizeof(*ptargets), GFP_KERNEL);
+	if (ptargets == NULL) {
+		kfree(_targets);
+		return NULL;
+	}
+
+	kref_init(&ptargets->kref);
+	ptargets->targets = _targets;
+	ptargets->ntargets = ntargets;
+
+	for (i = 0, c = 0; i < ntargets; ++i) {
+		ptargets->targets[c] = hms_target_find(targets[i]);
+		c += !!((long)ptargets->targets[c]);
+	}
+
+	/* Ignore NULL targets[i] */
+	ptargets->ntargets = c;
+
+	if (!c) {
+		/* No valid targets pointless to waste memory ... */
+		hms_policy_targets_put(ptargets);
+		return NULL;
+	}
+
+	return ptargets;
+}
+
+
+static void hms_policy_range_get(struct hms_policy_range *prange)
+{
+	kref_get(&prange->kref);
+}
+
+static void hms_policy_range_free(struct kref *kref)
+{
+	struct hms_policy_range *prange;
+
+	prange = container_of(kref, struct hms_policy_range, kref);
+	hms_policy_targets_put(prange->ptargets);
+	kfree(prange);
+}
+
+static void hms_policy_range_put(struct hms_policy_range *prange)
+{
+	kref_put(&prange->kref, &hms_policy_range_free);
+}
+
+static struct hms_policy_range *hms_policy_range_new(const uint32_t *targets,
+						     unsigned long start,
+						     unsigned long end,
+						     unsigned ntargets)
+{
+	struct hms_policy_targets *ptargets;
+	struct hms_policy_range *prange;
+
+	ptargets = hms_policy_targets_new(targets, ntargets);
+	if (ptargets == NULL)
+		return NULL;
+
+	prange = kmalloc(sizeof(*prange), GFP_KERNEL);
+	if (prange == NULL)
+		return NULL;
+
+	prange->node.start = start & PAGE_MASK;
+	prange->node.last = PAGE_ALIGN(end) - 1;
+	prange->ptargets = ptargets;
+	kref_init(&prange->kref);
+
+	return prange;
+}
+
+static struct hms_policy_range *
+hms_policy_range_dup(struct hms_policy_range *_prange)
+{
+	struct hms_policy_range *prange;
+
+	prange = kmalloc(sizeof(*prange), GFP_KERNEL);
+	if (prange == NULL)
+		return NULL;
+
+	hms_policy_targets_get(_prange->ptargets);
+	prange->node.start = _prange->node.start;
+	prange->node.last = _prange->node.last;
+	prange->ptargets = _prange->ptargets;
+	kref_init(&prange->kref);
+
+	return prange;
+}
+
+
+void hms_policy_fini(struct mm_struct *mm)
+{
+	struct hms_policy *hpolicy = READ_ONCE(mm->hpolicy);
+	struct interval_tree_node *node;
+
+	spin_lock(&mm->page_table_lock);
+	hpolicy = READ_ONCE(mm->hpolicy);
+	mm->hpolicy = NULL;
+	spin_unlock(&mm->page_table_lock);
+
+	/* No active heterogeneous policy structure so nothing to cleanup. */
+	if (hpolicy == NULL)
+		return;
+
+	mmu_notifier_unregister_no_release(&hpolicy->mn, mm);
+
+	down_write(&hpolicy->sem);
+	node = interval_tree_iter_first(&hpolicy->ranges, 0, -1UL);
+	while (node) {
+		struct hms_policy_range *prange;
+		struct interval_tree_node *next;
+
+		prange = container_of(node, struct hms_policy_range, node);
+		next = interval_tree_iter_next(node, 0, -1UL);
+		interval_tree_remove(node, &hpolicy->ranges);
+		hms_policy_range_put(prange);
+		node = next;
+	}
+	up_write(&hpolicy->sem);
+
+	kfree(hpolicy);
+}
+
+
+static int hbind_default_locked(struct hms_policy *hpolicy,
+				struct hbind_params *params)
+{
+	struct interval_tree_node *node;
+	unsigned long start, last;
+	int ret = 0;
+
+	start = params->start;
+	last = params->end - 1UL;
+
+	node = interval_tree_iter_first(&hpolicy->ranges, start, last);
+	while (node) {
+		struct hms_policy_range *prange;
+		struct interval_tree_node *next;
+
+		prange = container_of(node, struct hms_policy_range, node);
+		next = interval_tree_iter_next(node, start, last);
+		if (node->start < start && node->last > last) {
+			/* Node is split in 2 */
+			struct hms_policy_range *_prange;
+			_prange = hms_policy_range_dup(prange);
+			if (_prange == NULL) {
+				ret = -ENOMEM;
+				break;
+			}
+			prange->node.last = start - 1;
+			_prange->node.start = last + 1;
+			interval_tree_insert(&_prange->node, &hpolicy->ranges);
+			break;
+		} else if (node->start < start) {
+			prange->node.last = start - 1;
+		} else if (node->last > last) {
+			prange->node.start = last + 1;
+		} else {
+			/* Fully inside [start, last] */
+			interval_tree_remove(node, &hpolicy->ranges);
+		}
+
+		node = next;
+	}
+
+	return ret;
+}
+
+static int hbind_default(struct mm_struct *mm, struct hbind_params *params,
+			 const uint32_t *targets, uint32_t *atoms)
+{
+	struct hms_policy *hpolicy = READ_ONCE(mm->hpolicy);
+	int ret;
+
+	/* No active heterogeneous policy structure so no range to reset. */
+	if (hpolicy == NULL)
+		return 0;
+
+	down_write(&hpolicy->sem);
+	ret = hbind_default_locked(hpolicy, params);
+	up_write(&hpolicy->sem);
+
+	return ret;
+}
+
+
+static void hms_policy_notifier_release(struct mmu_notifier *mn,
+					struct mm_struct *mm)
+{
+	hms_policy_fini(mm);
+}
+
+static int hms_policy_notifier_invalidate_range_start(struct mmu_notifier *mn,
+				       const struct mmu_notifier_range *range)
+{
+	if (range->event == MMU_NOTIFY_UNMAP) {
+		struct hbind_params params;
+
+		if (!range->blockable)
+			return -EBUSY;
+
+		params.natoms = 0;
+		params.ntargets = 0;
+		params.end = range->end;
+		params.start = range->start;
+		hbind_default(range->mm, &params, NULL, NULL);
+	}
+
+	return 0;
+}
+
+static const struct mmu_notifier_ops hms_policy_notifier_ops = {
+	.release = hms_policy_notifier_release,
+	.invalidate_range_start = hms_policy_notifier_invalidate_range_start,
+};
+
+static struct hms_policy *hms_policy_get(struct mm_struct *mm)
+{
+	struct hms_policy *hpolicy = READ_ONCE(mm->hpolicy);
+	bool mmu_notifier = false;
+
+	/*
+	 * The hpolicy struct can only be freed once the mm_struct goes away,
+	 * hence only pre-allocate if none is attached yet.
+	 */
+	if (hpolicy)
+		return hpolicy;
+
+	hpolicy = kzalloc(sizeof(*hpolicy), GFP_KERNEL);
+	if (hpolicy == NULL)
+		return NULL;
+
+	init_rwsem(&hpolicy->sem);
+
+	spin_lock(&mm->page_table_lock);
+	if (!mm->hpolicy) {
+		mm->hpolicy = hpolicy;
+		mmu_notifier = true;
+		hpolicy = NULL;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	if (mmu_notifier) {
+		int ret;
+
+		mm->hpolicy->mn.ops = &hms_policy_notifier_ops;
+		ret = mmu_notifier_register(&mm->hpolicy->mn, mm);
+		if (ret) {
+			spin_lock(&mm->page_table_lock);
+			hpolicy = mm->hpolicy;
+			mm->hpolicy = NULL;
+			spin_unlock(&mm->page_table_lock);
+		}
+	}
+
+	if (hpolicy)
+		kfree(hpolicy);
+
+	/* At this point mm->hpolicy is valid */
+	return mm->hpolicy;
+}
+
+
 static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
 	uint32_t *targets, *_dtargets = NULL, _ftargets[HBIND_FIX_ARRAY];
@@ -114,6 +408,16 @@ static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	for (i = 0, ndwords = 1; i < params.natoms; i += ndwords) {
 		ndwords = 1 + HBIND_ATOM_GET_DWORDS(atoms[i]);
 		switch (HBIND_ATOM_GET_CMD(atoms[i])) {
+		case HBIND_CMD_DEFAULT:
+			if (ndwords != 1) {
+				ret = -EINVAL;
+				goto out_mm;
+			}
+			ret = hbind_default(current->mm, &params,
+					    targets, atoms);
+			if (ret)
+				goto out_mm;
+			break;
 		default:
 			ret = -EINVAL;
 			goto out_mm;
-- 
2.17.2



* [RFC PATCH 11/14] mm/hbind: add bind command to heterogeneous memory policy
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (9 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 10/14] mm/hbind: add heterogeneous memory policy tracking infrastructure jglisse
@ 2018-12-03 23:35 ` jglisse
  2018-12-03 23:35 ` [RFC PATCH 12/14] mm/hbind: add migrate command to hbind() ioctl jglisse
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds a bind command to the hbind() ioctl. It allows binding
a range of virtual addresses to a given list of target memories. New
memory allocated in the range will try to use memory from the target
list.

Note that this patch does not modify the existing page fault path and
thus does not activate the new heterogeneous policy. Updating the CPU
page fault code path or the device page fault code path (HMM) will be
done in separate patches.

Here we only introduce helpers and infrastructure that will be used by
the page fault code path.
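
As an illustration of what a bind request looks like from userspace,
here is a minimal sketch that fills struct hbind_params for a single
HBIND_CMD_BIND atom. The struct and macros are copied from the uapi
header in this series; hbind_bind_request() is a hypothetical helper,
and the uids passed in are assumed to have been discovered through the
HMS sysfs topology.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copied from include/uapi/linux/hbind.h as of this series. */
struct hbind_params {
	uint64_t start;
	uint64_t end;
	uint32_t ntargets;
	uint32_t natoms;
	uint64_t targets;
	uint64_t atoms;
};

#define HBIND_ATOM_SET_DWORDS(v) (((v) & 0xfff) << 20)
#define HBIND_ATOM_SET_CMD(v) ((v) & 0xfffff)
#define HBIND_CMD_BIND 1

/*
 * Hypothetical helper: describe "bind [addr, addr + len) to these target
 * uids" as one BIND atom with no additional dwords. The caller keeps
 * ownership of uids and atom; params only stores their addresses.
 */
static void hbind_bind_request(struct hbind_params *p, uint32_t *atom,
			       const void *addr, size_t len,
			       const uint32_t *uids, uint32_t nuids)
{
	/* Binding to an empty target list would be pointless. */
	assert(uids != NULL && nuids > 0);

	memset(p, 0, sizeof(*p));
	*atom = HBIND_ATOM_SET_CMD(HBIND_CMD_BIND) | HBIND_ATOM_SET_DWORDS(0);

	p->start = (uint64_t)(uintptr_t)addr;
	p->end = p->start + len;
	p->ntargets = nuids;
	p->natoms = 1;
	p->targets = (uint64_t)(uintptr_t)uids;
	p->atoms = (uint64_t)(uintptr_t)atom;
}
```

The filled structure would then be passed to ioctl(fd, HBIND_IOCTL, &p)
on a file descriptor for the hbind misc device; the exact device path
(presumably /dev/hbind, given the miscdevice name) is an assumption.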

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 include/uapi/linux/hbind.h | 10 ++++++++++
 mm/hms.c                   | 40 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/uapi/linux/hbind.h b/include/uapi/linux/hbind.h
index cc4687587f5a..7bb876954e3f 100644
--- a/include/uapi/linux/hbind.h
+++ b/include/uapi/linux/hbind.h
@@ -47,6 +47,16 @@ struct hbind_params {
  */
 #define HBIND_CMD_DEFAULT 0
 
+/*
+ * HBIND_CMD_BIND strict policy, ie new allocations will come from one of the
+ * listed targets until they run out of memory. Other targets can be used if
+ * none of the listed targets can be accessed by the initiator that faulted.
+ *
+ * Additional dwords:
+ *      NONE (DWORDS MUST BE 0 !)
+ */
+#define HBIND_CMD_BIND 1
+
 
 #define HBIND_IOCTL		_IOWR('H', 0x00, struct hbind_params)
 
diff --git a/mm/hms.c b/mm/hms.c
index be2c4e526f25..6be6f4acdd49 100644
--- a/mm/hms.c
+++ b/mm/hms.c
@@ -338,6 +338,36 @@ static struct hms_policy *hms_policy_get(struct mm_struct *mm)
 }
 
 
+static int hbind_bind(struct mm_struct *mm, struct hbind_params *params,
+		      const uint32_t *targets, uint32_t *atoms)
+{
+	struct hms_policy_range *prange;
+	struct hms_policy *hpolicy;
+	int ret;
+
+	hpolicy = hms_policy_get(mm);
+	if (hpolicy == NULL)
+		return -ENOMEM;
+
+	prange = hms_policy_range_new(targets, params->start, params->end,
+				      params->ntargets);
+	if (prange == NULL)
+		return -ENOMEM;
+
+	down_write(&hpolicy->sem);
+	ret = hbind_default_locked(hpolicy, params);
+	if (ret)
+		goto out;
+
+	interval_tree_insert(&prange->node, &hpolicy->ranges);
+
+out:
+	up_write(&hpolicy->sem);
+
+	return ret;
+}
+
+
 static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
 	uint32_t *targets, *_dtargets = NULL, _ftargets[HBIND_FIX_ARRAY];
@@ -418,6 +448,16 @@ static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 			if (ret)
 				goto out_mm;
 			break;
+		case HBIND_CMD_BIND:
+			if (ndwords != 1) {
+				ret = -EINVAL;
+				goto out_mm;
+			}
+			ret = hbind_bind(current->mm, &params,
+					 targets, atoms);
+			if (ret)
+				goto out_mm;
+			break;
 		default:
 			ret = -EINVAL;
 			goto out_mm;
-- 
2.17.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [RFC PATCH 12/14] mm/hbind: add migrate command to hbind() ioctl
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (10 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 11/14] mm/hbind: add bind command to heterogeneous memory policy jglisse
@ 2018-12-03 23:35 ` jglisse
  2018-12-03 23:35 ` [RFC PATCH 13/14] drm/nouveau: register GPU under heterogeneous memory system jglisse
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli

From: Jérôme Glisse <jglisse@redhat.com>

This patch adds a migrate command to the hbind() ioctl. User space can
use this command to migrate a range of virtual addresses to a list of
target memories.

This does not change the policy for the range; it also ignores any
existing policy range.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
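Note (outside the commit message): hbind_migrate() in this patch stops
trying further targets once atoms[0] reaches the number of pages in the
range. The page count arithmetic can be checked in userspace with the
sketch below. The SKETCH_* macros and sketch_* helpers are hypothetical
names, and a 4 KiB page size is assumed; the kernel uses its own
PAGE_ALIGN(), PAGE_MASK and PAGE_SHIFT.

```c
/* Assumption: 4 KiB pages. The kernel value depends on the architecture. */
#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Round addr up to the next page boundary, like the kernel's PAGE_ALIGN(). */
static unsigned long sketch_page_align(unsigned long addr)
{
	return (addr + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
}

/*
 * Number of pages touched by [start, end): round end up, round start
 * down, then convert the byte size to pages.
 */
static unsigned long sketch_npages(unsigned long start, unsigned long end)
{
	unsigned long size;

	size = sketch_page_align(end) - (start & SKETCH_PAGE_MASK);
	return size >> SKETCH_PAGE_SHIFT;
}
```

For the 64 KiB buffer used by test-hms-migrate.c this gives 16 pages when
the buffer is page aligned. A range that is not aligned, such as
[0x1fff, 0x2001), touches 2 pages even though it is only 2 bytes long.
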
 include/uapi/linux/hbind.h |  9 ++++++++
 mm/hms.c                   | 43 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/hbind.h b/include/uapi/linux/hbind.h
index 7bb876954e3f..ededbba22121 100644
--- a/include/uapi/linux/hbind.h
+++ b/include/uapi/linux/hbind.h
@@ -57,6 +57,15 @@ struct hbind_params {
  */
 #define HBIND_CMD_BIND 1
 
+/*
+ * HBIND_CMD_MIGRATE moves existing memory to the listed target memory. This is
+ * best effort.
+ *
+ * Additional dwords:
+ *      [0] result ie number of pages that have been migrated.
+ */
+#define HBIND_CMD_MIGRATE 2
+
 
 #define HBIND_IOCTL		_IOWR('H', 0x00, struct hbind_params)
 
diff --git a/mm/hms.c b/mm/hms.c
index 6be6f4acdd49..6764908f47bf 100644
--- a/mm/hms.c
+++ b/mm/hms.c
@@ -368,6 +368,39 @@ static int hbind_bind(struct mm_struct *mm, struct hbind_params *params,
 }
 
 
+static int hbind_migrate(struct mm_struct *mm, struct hbind_params *params,
+			 const uint32_t *targets, uint32_t *atoms)
+{
+	unsigned long size, npages;
+	int ret = -EINVAL;
+	unsigned i;
+
+	size = PAGE_ALIGN(params->end) - (params->start & PAGE_MASK);
+	npages = size >> PAGE_SHIFT;
+
+	for (i = 0; i < params->ntargets; ++i) {
+		struct hms_target *target;
+
+		target = hms_target_find(targets[i]);
+		if (target == NULL)
+			continue;
+
+		ret = target->hbind->migrate(target, mm, params->start,
+					     params->end, params->natoms,
+					     atoms);
+		hms_target_put(target);
+
+		if (ret)
+			continue;
+
+		if (atoms[0] >= npages)
+			break;
+	}
+
+	return ret;
+}
+
+
 static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 {
 	uint32_t *targets, *_dtargets = NULL, _ftargets[HBIND_FIX_ARRAY];
@@ -458,6 +491,16 @@ static long hbind_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 			if (ret)
 				goto out_mm;
 			break;
+		case HBIND_CMD_MIGRATE:
+			if (ndwords != 2) {
+				ret = -EINVAL;
+				goto out_mm;
+			}
+			ret = hbind_migrate(current->mm, &params,
+					    targets, atoms);
+			if (ret)
+				goto out_mm;
+			break;
 		default:
 			ret = -EINVAL;
 			goto out_mm;
-- 
2.17.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [RFC PATCH 13/14] drm/nouveau: register GPU under heterogeneous memory system
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (11 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 12/14] mm/hbind: add migrate command to hbind() ioctl jglisse
@ 2018-12-03 23:35 ` jglisse
  2018-12-03 23:35 ` [RFC PATCH 14/14] test/hms: tests for " jglisse
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

This registers the NVidia GPU under the heterogeneous memory system so
that one can use the GPU memory with new syscalls like hbind() for
compute workloads.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
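Note (outside the commit message): nouveau_hms_migrate() walks the
requested range one VMA at a time and skips unmapped holes. The walk can
be sketched in userspace as below, with a sorted array standing in for
the mm's VMA list. All names (struct sketch_range,
sketch_find_intersection(), sketch_walk()) are hypothetical. Unlike the
patch, the sketch also clamps the lower bound to the start of the
mapping; whether nouveau_dmem_migrate_vma() accepts a start address
below vma->vm_start is not visible in this series, so treat that clamp
as an assumption.

```c
#include <stddef.h>

/* Half-open address range [start, end); stands in for a VMA. */
struct sketch_range {
	unsigned long start;
	unsigned long end;
};

/* First mapping (array sorted by address) intersecting [addr, end). */
static const struct sketch_range *
sketch_find_intersection(const struct sketch_range *maps, size_t n,
			 unsigned long addr, unsigned long end)
{
	for (size_t i = 0; i < n; ++i)
		if (maps[i].end > addr && maps[i].start < end)
			return &maps[i];
	return NULL;
}

/*
 * Split [start, end) into the pieces covered by a mapping and store
 * them in out[]. Returns the number of pieces written (at most nout).
 */
static size_t sketch_walk(const struct sketch_range *maps, size_t n,
			  unsigned long start, unsigned long end,
			  struct sketch_range *out, size_t nout)
{
	unsigned long addr = start;
	size_t count = 0;

	while (addr < end && count < nout) {
		const struct sketch_range *m;

		m = sketch_find_intersection(maps, n, addr, end);
		if (m == NULL)
			break;

		/* Clamp the piece to both the mapping and the request. */
		out[count].start = m->start > addr ? m->start : addr;
		out[count].end = m->end < end ? m->end : end;
		addr = out[count].end;
		count++;
	}
	return count;
}
```

With mappings [0x1000, 0x3000) and [0x5000, 0x8000), a request for
[0x2000, 0x6000) yields two pieces: [0x2000, 0x3000) and
[0x5000, 0x6000). The hole between them is skipped.
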
 drivers/gpu/drm/nouveau/Kbuild        |  1 +
 drivers/gpu/drm/nouveau/nouveau_hms.c | 80 +++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_hms.h | 46 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_svm.c |  6 ++
 4 files changed, 133 insertions(+)
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.h

diff --git a/drivers/gpu/drm/nouveau/Kbuild b/drivers/gpu/drm/nouveau/Kbuild
index a826a4df440d..9c1114b4d8a3 100644
--- a/drivers/gpu/drm/nouveau/Kbuild
+++ b/drivers/gpu/drm/nouveau/Kbuild
@@ -37,6 +37,7 @@ nouveau-y += nouveau_prime.o
 nouveau-y += nouveau_sgdma.o
 nouveau-y += nouveau_ttm.o
 nouveau-y += nouveau_vmm.o
+nouveau-$(CONFIG_HMS) += nouveau_hms.o
 
 # DRM - modesetting
 nouveau-$(CONFIG_DRM_NOUVEAU_BACKLIGHT) += nouveau_backlight.o
diff --git a/drivers/gpu/drm/nouveau/nouveau_hms.c b/drivers/gpu/drm/nouveau/nouveau_hms.c
new file mode 100644
index 000000000000..52af9180e108
--- /dev/null
+++ b/drivers/gpu/drm/nouveau/nouveau_hms.c
@@ -0,0 +1,80 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+#include "nouveau_dmem.h"
+#include "nouveau_drv.h"
+#include "nouveau_hms.h"
+
+#include <linux/hms.h>
+
+static int nouveau_hms_migrate(struct hms_target *target, struct mm_struct *mm,
+			       unsigned long start, unsigned long end,
+			       unsigned natoms, uint32_t *atoms)
+{
+	struct nouveau_hms *hms = target->private;
+	struct nouveau_drm *drm = hms->drm;
+	unsigned long addr;
+	int ret = 0;
+
+	down_read(&mm->mmap_sem);
+
+	for (addr = start; addr < end;) {
+		struct vm_area_struct *vma;
+		unsigned long next;
+
+		vma = find_vma_intersection(mm, addr, end);
+		if (!vma)
+			break;
+
+		next = min(vma->vm_end, end);
+		ret = nouveau_dmem_migrate_vma(drm, vma, addr, next);
+		// FIXME ponder more on what to do
+		addr = next;
+	}
+
+	up_read(&mm->mmap_sem);
+
+	return ret;
+}
+
+static const struct hms_target_hbind nouveau_hms_target_hbind = {
+	.migrate = nouveau_hms_migrate,
+};
+
+
+void nouveau_hms_init(struct nouveau_drm *drm, struct nouveau_hms *hms)
+{
+	unsigned long vram_size = drm->gem.vram_available;
+	struct device *parent;
+
+	hms->drm = drm;
+	parent = drm->dev->pdev ? &drm->dev->pdev->dev : drm->dev->dev;
+	hms_target_register(&hms->target, parent, drm->dev->dev->numa_node,
+			    &nouveau_hms_target_hbind, vram_size, 0);
+	if (hms->target) {
+		hms->target->private = hms;
+	}
+}
+
+void nouveau_hms_fini(struct nouveau_drm *drm, struct nouveau_hms *hms)
+{
+	hms_target_unregister(&hms->target);
+}
diff --git a/drivers/gpu/drm/nouveau/nouveau_hms.h b/drivers/gpu/drm/nouveau/nouveau_hms.h
new file mode 100644
index 000000000000..cda111d7044b
--- /dev/null
+++ b/drivers/gpu/drm/nouveau/nouveau_hms.h
@@ -0,0 +1,46 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+#ifndef __NOUVEAU_HMS_H__
+#define __NOUVEAU_HMS_H__
+
+#if IS_ENABLED(CONFIG_HMS)
+
+#include <linux/hms.h>
+
+struct nouveau_hms {
+	struct hms_target *target;
+	struct nouveau_drm *drm;
+};
+
+void nouveau_hms_init(struct nouveau_drm *drm, struct nouveau_hms *hms);
+void nouveau_hms_fini(struct nouveau_drm *drm, struct nouveau_hms *hms);
+
+#else /* IS_ENABLED(CONFIG_HMS) */
+
+struct nouveau_hms {
+};
+
+#define nouveau_hms_init(drm, hms)
+#define nouveau_hms_fini(drm, hms)
+
+#endif /* IS_ENABLED(CONFIG_HMS) */
+#endif
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 23435ee27892..26daa6d50766 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -23,6 +23,7 @@
 #include "nouveau_drv.h"
 #include "nouveau_chan.h"
 #include "nouveau_dmem.h"
+#include "nouveau_hms.h"
 
 #include <nvif/notify.h>
 #include <nvif/object.h>
@@ -44,6 +45,8 @@ struct nouveau_svm {
 	int refs;
 	struct list_head inst;
 
+	struct nouveau_hms hms;
+
 	struct nouveau_svm_fault_buffer {
 		int id;
 		struct nvif_object object;
@@ -766,6 +769,7 @@ nouveau_svm_suspend(struct nouveau_drm *drm)
 void
 nouveau_svm_fini(struct nouveau_drm *drm)
 {
+	nouveau_hms_fini(drm, &drm->svm->hms);
 	kfree(drm->svm);
 }
 
@@ -776,6 +780,8 @@ nouveau_svm_init(struct nouveau_drm *drm)
 		drm->svm->drm = drm;
 		mutex_init(&drm->svm->mutex);
 		INIT_LIST_HEAD(&drm->svm->inst);
+
+		nouveau_hms_init(drm, &drm->svm->hms);
 	}
 }
 
-- 
2.17.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [RFC PATCH 14/14] test/hms: tests for heterogeneous memory system
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (12 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 13/14] drm/nouveau: register GPU under heterogeneous memory system jglisse
@ 2018-12-03 23:35 ` " jglisse
  2018-12-04  7:44 ` [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 94+ messages in thread
From: jglisse @ 2018-12-03 23:35 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Set of tests for heterogeneous memory system (migration, binding, ...)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
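Note (outside the commit message): hms_context_init() in test-hms.c
discovers objects by parsing directory names of the form
v<version>-<id>-<type> under /sys/bus/hms/devices/. The parsing step can
be exercised on its own with the sketch below. The names
sketch_parse_name() and enum sketch_type are hypothetical, and the
bounded %31s conversion is a choice made for the sketch, not something
taken from the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of enum hms_type from test-hms.h. */
enum sketch_type {
	SKETCH_LINK = 0,
	SKETCH_BRIDGE,
	SKETCH_TARGET,
	SKETCH_INITIATOR,
};

/*
 * Parse "v<version>-<id>-<type>". Returns 0 on success and -1 if the
 * name does not match or the type string is unknown.
 */
static int sketch_parse_name(const char *name, unsigned *version,
			     unsigned *id, enum sketch_type *type)
{
	/* Order must match enum sketch_type. */
	static const char *const names[] = {
		"link", "bridge", "target", "initiator",
	};
	char ctype[32];
	size_t i;

	/* %31s bounds the copy into ctype; %u matches the unsigned outputs. */
	if (sscanf(name, "v%u-%u-%31s", version, id, ctype) != 3)
		return -1;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); ++i) {
		if (!strcmp(ctype, names[i])) {
			*type = (enum sketch_type)i;
			return 0;
		}
	}
	return -1;
}
```

For example, "v1-3-target" parses to version 1, id 3 and type
SKETCH_TARGET, while "subsystem" and "v1-3-foo" are both rejected.
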
 tools/testing/hms/Makefile                    |  17 ++
 tools/testing/hms/hbind-create-device-file.sh |  11 +
 tools/testing/hms/test-hms-migrate.c          |  77 ++++++
 tools/testing/hms/test-hms.c                  | 237 ++++++++++++++++++
 tools/testing/hms/test-hms.h                  |  67 +++++
 5 files changed, 409 insertions(+)
 create mode 100644 tools/testing/hms/Makefile
 create mode 100755 tools/testing/hms/hbind-create-device-file.sh
 create mode 100644 tools/testing/hms/test-hms-migrate.c
 create mode 100644 tools/testing/hms/test-hms.c
 create mode 100644 tools/testing/hms/test-hms.h

diff --git a/tools/testing/hms/Makefile b/tools/testing/hms/Makefile
new file mode 100644
index 000000000000..57223a671cb0
--- /dev/null
+++ b/tools/testing/hms/Makefile
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0
+LDFLAGS += -fsanitize=address -fsanitize=undefined
+CFLAGS += -std=c99 -D_GNU_SOURCE -I. -I../../../include/uapi -g -Og -Wall
+LDLIBS += -lpthread
+TARGETS = test-hms-migrate
+OFILES = test-hms
+
+targets: $(TARGETS)
+
+$(TARGETS): $(OFILES:%=%.o) $(TARGETS:%=%.c)
+	$(CC) $(CFLAGS) -o $@ $(OFILES:%=%.o) $@.c
+
+clean:
+	$(RM) $(TARGETS) *.o
+
+%.o: Makefile *.h %.c
+	$(CC) $(CFLAGS) -o $@ -c $(@:%.o=%.c)
diff --git a/tools/testing/hms/hbind-create-device-file.sh b/tools/testing/hms/hbind-create-device-file.sh
new file mode 100755
index 000000000000..60c2533cc85d
--- /dev/null
+++ b/tools/testing/hms/hbind-create-device-file.sh
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+major=10
+minor=$(awk "\$2==\"hbind\" {print \$1}" /proc/misc)
+
+echo hbind device minor is $minor, creating device file:
+sudo rm /dev/hbind
+sudo mknod /dev/hbind c $major $minor
+sudo chmod 666 /dev/hbind
+echo /dev/hbind created
diff --git a/tools/testing/hms/test-hms-migrate.c b/tools/testing/hms/test-hms-migrate.c
new file mode 100644
index 000000000000..b90f701c0b75
--- /dev/null
+++ b/tools/testing/hms/test-hms-migrate.c
@@ -0,0 +1,77 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+#include <stdio.h>
+
+#include "test-hms.h"
+
+int main(int argc, char *argv[])
+{
+    struct hms_context ctx;
+    struct hms_object *target = NULL;
+    uint64_t targets[1], ntargets = 1;
+    unsigned long size = 64 << 10;
+    unsigned long start, end, i;
+    unsigned *ptr;
+    int ret;
+
+    if (argc != 2) {
+        printf("EE: usage: %s targetname\n", argv[0]);
+        return -1;
+    }
+
+    hms_context_init(&ctx);
+
+    /* Find target */
+    do {
+        target = hms_context_object_find_reference(&ctx, target, argv[1]);
+    } while (target && target->type != HMS_TARGET);
+    if (target == NULL) {
+        printf("EE: could not find %s target\n", argv[1]);
+        return -1;
+    }
+
+    /* Allocate memory */
+    ptr = hms_malloc(size);
+    for (i = 0; i < (size / 4); ++i) {
+        ptr[i] = i;
+    }
+
+    /* Migrate to target */
+    targets[0] = target->id;
+    start = (uintptr_t)ptr;
+    end = start + size;
+    ntargets = 1;
+    ret = hms_migrate(&ctx, start, end, targets, ntargets);
+    if (ret) {
+        printf("EE: migration failure (%d)\n", ret);
+    } else {
+        for (i = 0; i < (size / 4); ++i) {
+            if (ptr[i] != i) {
+                printf("EE: migration failure ptr[%lu] = %u\n", i, ptr[i]);
+                goto out;
+            }
+        }
+        printf("OK: migration successful\n");
+    }
+
+out:
+    /* Free */
+    hms_mfree(ptr, size);
+
+    hms_context_fini(&ctx);
+    return 0;
+}
diff --git a/tools/testing/hms/test-hms.c b/tools/testing/hms/test-hms.c
new file mode 100644
index 000000000000..0502f49198c4
--- /dev/null
+++ b/tools/testing/hms/test-hms.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <strings.h>
+#include <dirent.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+
+#include "test-hms.h"
+#include "linux/hbind.h"
+
+
+static unsigned long page_mask = 0;
+static int page_size = 0;
+static int page_shift = 0;
+
+static inline void page_shift_init(void)
+{
+    if (!page_shift) {
+        page_size = sysconf(_SC_PAGE_SIZE);
+
+        page_shift = ffs(page_size) - 1;
+        page_mask = ~((unsigned long)(page_size - 1));
+    }
+}
+
+static unsigned long page_align(unsigned long size)
+{
+    return (size + page_size - 1) & page_mask;
+}
+
+void hms_object_parse_dir(struct hms_object *object, const char *ctype)
+{
+    struct dirent *dirent;
+    char dirname[256];
+    DIR *dirp;
+
+    snprintf(dirname, 255, "/sys/bus/hms/devices/v%u-%u-%s",
+             object->version, object->id, ctype);
+    dirp = opendir(dirname);
+    if (dirp == NULL) {
+        return;
+    }
+    while ((dirent = readdir(dirp))) {
+        struct hms_reference *reference;
+
+        if (dirent->d_type != DT_LNK || !strcmp(dirent->d_name, "subsystem")) {
+            continue;
+        }
+
+        reference = malloc(sizeof(*reference));
+        strcpy(reference->name, dirent->d_name);
+        reference->object = NULL;
+
+        reference->next = object->references;
+        object->references = reference;
+    }
+    closedir(dirp);
+}
+
+void hms_object_free(struct hms_object *object)
+{
+    struct hms_reference *reference = object->references;
+
+    for (; reference; reference = object->references) {
+        object->references = reference->next;
+        free(reference);
+    }
+
+    free(object);
+}
+
+
+void hms_context_init(struct hms_context *ctx)
+{
+    struct dirent *dirent;
+    DIR *dirp;
+
+    ctx->objects = NULL;
+
+    /* Scan targets, initiators, links, bridges ... */
+    dirp = opendir("/sys/bus/hms/devices/");
+    if (dirp == NULL) {
+        printf("EE: could not open /sys/bus/hms/devices/\n");
+        exit(-1);
+    }
+    while ((dirent = readdir(dirp))) {
+        struct hms_object *object;
+        unsigned version, id;
+        enum hms_type type;
+        char ctype[256];
+
+        if (dirent->d_type != DT_LNK || dirent->d_name[0] != 'v') {
+            continue;
+        }
+        if (sscanf(dirent->d_name, "v%u-%u-%255s", &version, &id, ctype) != 3) {
+            continue;
+        }
+
+        if (!strcmp("link", ctype)) {
+            type = HMS_LINK;
+        } else if (!strcmp("bridge", ctype)) {
+            type = HMS_BRIDGE;
+        } else if (!strcmp("target", ctype)) {
+            type = HMS_TARGET;
+        } else if (!strcmp("initiator", ctype)) {
+            type = HMS_INITIATOR;
+        } else {
+            continue;
+        }
+
+        object = malloc(sizeof(*object));
+        object->references = NULL;
+        object->version = version;
+        object->type = type;
+        object->id = id;
+
+        object->next = ctx->objects;
+        ctx->objects = object;
+
+        hms_object_parse_dir(object, ctype);
+    }
+    closedir(dirp);
+
+    ctx->fd = open("/dev/hbind", O_RDWR);
+    if (ctx->fd < 0) {
+        printf("EE: could not open /dev/hbind\n");
+        exit(-1);
+    }
+}
+
+void hms_context_fini(struct hms_context *ctx)
+{
+    struct hms_object *object = ctx->objects;
+
+    for (; object; object = ctx->objects) {
+        ctx->objects = object->next;
+        hms_object_free(object);
+    }
+
+    close(ctx->fd);
+}
+
+struct hms_object *hms_context_object_find_reference(struct hms_context *ctx,
+                                                     struct hms_object *object,
+                                                     const char *name)
+{
+    object = object ? object->next : ctx->objects;
+    for (; object; object = object->next) {
+        struct hms_reference *reference = object->references;
+
+        for (; reference; reference = reference->next) {
+            if (!strcmp(reference->name, name)) {
+                return object;
+            }
+        }
+    }
+
+    return NULL;
+}
+
+
+int hms_migrate(struct hms_context *ctx,
+                unsigned long start,
+                unsigned long end,
+                uint64_t *targets,
+                unsigned ntargets)
+{
+    struct hbind_params params;
+    uint64_t atoms[2], natoms;
+    int ret;
+
+    atoms[0] = HBIND_ATOM_SET_CMD(HBIND_CMD_MIGRATE) |
+               HBIND_ATOM_SET_DWORDS(1);
+    atoms[1] = 0;
+    natoms = 2;
+
+    params.targets = (uintptr_t)targets;
+    params.atoms = (uintptr_t)atoms;
+
+    params.ntargets = ntargets;
+    params.natoms = natoms;
+    params.start = start;
+    params.end = end;
+
+    do {
+        ret = ioctl(ctx->fd, HBIND_IOCTL, &params);
+        printf("ret %d atoms %d\n", ret, (int)atoms[1]);
+    } while (ret && (errno == EINTR));
+
+    /* Result of migration is in the atoms after cmd dword */
+    printf("ret %d atoms %d\n", ret, (int)atoms[1]);
+    ret = ret ? ret : atoms[1];
+
+    return ret;
+}
+
+
+void *hms_malloc(unsigned long size)
+{
+    void *ptr;
+
+    page_shift_init();
+
+    ptr = mmap(0, page_align(size), PROT_READ | PROT_WRITE,
+               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (ptr == MAP_FAILED) {
+        return NULL;
+    }
+    return ptr;
+}
+
+void hms_mfree(void *ptr, unsigned long size)
+{
+    munmap(ptr, page_align(size));
+}
diff --git a/tools/testing/hms/test-hms.h b/tools/testing/hms/test-hms.h
new file mode 100644
index 000000000000..b5d625e18d59
--- /dev/null
+++ b/tools/testing/hms/test-hms.h
@@ -0,0 +1,67 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors:
+ * Jérôme Glisse <jglisse@redhat.com>
+ */
+#ifndef TEST_HMS_H
+#define TEST_HMS_H
+
+#include <stdint.h>
+
+enum hms_type {
+    HMS_LINK = 0,
+    HMS_BRIDGE,
+    HMS_TARGET,
+    HMS_INITIATOR,
+};
+
+struct hms_reference {
+    char name[256];
+    struct hms_object *object;
+    struct hms_reference *next;
+};
+
+struct hms_object {
+    struct hms_reference *references;
+    struct hms_object *next;
+    unsigned version;
+    unsigned id;
+    enum hms_type type;
+};
+
+struct hms_context {
+    struct hms_object *objects;
+    int fd;
+};
+
+void hms_context_init(struct hms_context *ctx);
+void hms_context_fini(struct hms_context *ctx);
+struct hms_object *hms_context_object_find_reference(struct hms_context *ctx,
+                                                     struct hms_object *object,
+                                                     const char *name);
+
+
+int hms_migrate(struct hms_context *ctx,
+                unsigned long start,
+                unsigned long end,
+                uint64_t *targets,
+                unsigned ntargets);
+
+
+/* Provide page-aligned memory allocations */
+void *hms_malloc(unsigned long size);
+void hms_mfree(void *ptr, unsigned long size);
+
+
+#endif
-- 
2.17.2


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (13 preceding siblings ...)
  2018-12-03 23:35 ` [RFC PATCH 14/14] test/hms: tests for " jglisse
@ 2018-12-04  7:44 ` Aneesh Kumar K.V
  2018-12-04 14:44   ` Jerome Glisse
  2018-12-04 18:02 ` Dave Hansen
  2018-12-04 23:54 ` Dave Hansen
  16 siblings, 1 reply; 94+ messages in thread
From: Aneesh Kumar K.V @ 2018-12-04  7:44 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Rafael J . Wysocki, Matthew Wilcox,
	Ross Zwisler, Keith Busch, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On 12/4/18 5:04 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Heterogeneous memory system are becoming more and more the norm, in
> those system there is not only the main system memory for each node,
> but also device memory and|or memory hierarchy to consider. Device
> memory can comes from a device like GPU, FPGA, ... or from a memory
> only device (persistent memory, or high density memory device).
> 
> Memory hierarchy is when you not only have the main memory but also
> other type of memory like HBM (High Bandwidth Memory often stack up
> on CPU die or GPU die), peristent memory or high density memory (ie
> something slower then regular DDR DIMM but much bigger).
> 
> On top of this diversity of memories you also have to account for the
> system bus topology ie how all CPUs and devices are connected to each
> others. Userspace do not care about the exact physical topology but
> care about topology from behavior point of view ie what are all the
> paths between an initiator (anything that can initiate memory access
> like CPU, GPU, FGPA, network controller ...) and a target memory and
> what are all the properties of each of those path (bandwidth, latency,
> granularity, ...).
> 
> This means that it is no longer sufficient to consider a flat view
> for each node in a system but for maximum performance we need to
> account for all of this new memory but also for system topology.
> This is why this proposal is unlike the HMAT proposal [1] which
> tries to extend the existing NUMA for new type of memory. Here we
> are tackling a much more profound change that depart from NUMA.
> 
> 
> One of the reasons for radical change is the advance of accelerator
> like GPU or FPGA means that CPU is no longer the only piece where
> computation happens. It is becoming more and more common for an
> application to use a mix and match of different accelerator to
> perform its computation. So we can no longer satisfy our self with
> a CPU centric and flat view of a system like NUMA and NUMA distance.
> 
> 
> This patchset is a proposal to tackle this problems through three
> aspects:
>      1 - Expose complex system topology and various kind of memory
>          to user space so that application have a standard way and
>          single place to get all the information it cares about.
>      2 - A new API for user space to bind/provide hint to kernel on
>          which memory to use for range of virtual address (a new
>          mbind() syscall).
>      3 - Kernel side changes for vm policy to handle this changes
> 
> This patchset is not and end to end solution but it provides enough
> pieces to be useful against nouveau (upstream open source driver for
> NVidia GPU). It is intended as a starting point for discussion so
> that we can figure out what to do. To avoid having too much topics
> to discuss i am not considering memory cgroup for now but it is
> definitely something we will want to integrate with.
> 
> The rest of this emails is splits in 3 sections, the first section
> talks about complex system topology: what it is, how it is use today
> and how to describe it tomorrow. The second sections talks about
> new API to bind/provide hint to kernel for range of virtual address.
> The third section talks about new mechanism to track bind/hint
> provided by user space or device driver inside the kernel.
> 
> 
> 1) Complex system topology and representing them
> ------------------------------------------------
> 
> Inside a node you can have a complex topology of memory, for instance
> you can have multiple HBM memory in a node, each HBM memory tie to a
> set of CPUs (all of which are in the same node). This means that you
> have a hierarchy of memory for CPUs. The local fast HBM but which is
> expected to be relatively small compare to main memory and then the
> main memory. New memory technology might also deepen this hierarchy
> with another level of yet slower memory but gigantic in size (some
> persistent memory technology might fall into that category). Another
> example is device memory, and device themself can have a hierarchy
> like HBM on top of device core and main device memory.
> 
> On top of that you can have multiple path to access each memory and
> each path can have different properties (latency, bandwidth, ...).
> Also there is not always symmetry ie some memory might only be
> accessible by some device or CPU ie not accessible by everyone.
> 
> So a flat hierarchy for each node is not capable of representing this
> kind of complexity. To simplify discussion and because we do not want
> to single out CPU from device, from here on out we will use initiator
> to refer to either CPU or device. An initiator is any kind of CPU or
> device that can access memory (ie initiate memory access).
> 
> At this point a example of such system might help:
>      - 2 nodes and for each node:
>          - 1 CPU per node with 2 complex of CPUs cores per CPU
>          - one HBM memory for each complex of CPUs cores (200GB/s)
>          - CPUs cores complex are linked to each other (100GB/s)
>          - main memory is (90GB/s)
>          - 4 GPUs each with:
>              - HBM memory for each GPU (1000GB/s) (not CPU accessible)
>              - GDDR memory for each GPU (500GB/s) (CPU accessible)
>              - connected to CPU root controller (60GB/s)
>              - connected to other GPUs (even GPUs from the second
>                node) with GPU link (400GB/s)
> 
> In this example we restrict our self to bandwidth and ignore bus width
> or latency, this is just to simplify discussions but obviously they
> also factor in.
> 
> 
> Userspace very much would like to know about this information, for
> instance HPC folks have develop complex library to manage this and
> there is wide research on the topics [2] [3] [4] [5]. Today most of
> the work is done by hardcoding thing for specific platform. Which is
> somewhat acceptable for HPC folks where the platform stays the same
> for a long period of time. But if we want a more ubiquituous support
> we should aim to provide the information needed through standard
> kernel API such as the one presented in this patchset.
> 
> Roughly speaking i see two broads use case for topology information.
> First is for virtualization and vm where you want to segment your
> hardware properly for each vm (binding memory, CPU and GPU that are
> all close to each others). Second is for application, many of which
> can partition their workload to minimize exchange between partition
> allowing each partition to be bind to a subset of device and CPUs
> that are close to each others (for maximum locality). Here it is much
> more than just NUMA distance, you can leverage the memory hierarchy
> and  the system topology all-together (see [2] [3] [4] [5] for more
> references and details).
> 
> So this is not exposing topology just for the sake of cool graph in
> userspace. They are active user today of such information and if we
> want to growth and broaden the usage we should provide a unified API
> to standardize how that information is accessible to every one.
> 
> 
> One proposal so far to handle new type of memory is to user CPU less
> node for those [6]. While same idea can apply for device memory, it is
> still hard to describe multiple path with different property in such
> scheme. While it is backward compatible and have minimum changes, it
> simplify can not convey complex topology (think any kind of random
> graph, not just a tree like graph).
> 
> Thus far this kind of system have been use through device specific API
> and rely on all kind of system specific quirks. To avoid this going out
> of hands and grow into a bigger mess than it already is, this patchset
> tries to provide a common generic API that should fit various devices
> (GPU, FPGA, ...).
> 
> So this patchset propose a new way to expose to userspace the system
> topology. It relies on 4 types of objects:
>      - target: any kind of memory (main memory, HBM, device, ...)
>      - initiator: CPU or device (anything that can access memory)
>      - link: anything that link initiator and target
>      - bridges: anything that allow group of initiator to access
>        remote target (ie target they are not connected with directly
>        through an link)
> 
> Properties like bandwidth, latency, ... are all sets per bridges and
> links. All initiators connected to an link can access any target memory
> also connected to the same link and all with the same link properties.
> 
> Link do not need to match physical hardware ie you can have a single
> physical link match a single or multiples software expose link. This
> allows to model device connected to same physical link (like PCIE
> for instance) but not with same characteristics (like number of lane
> or lane speed in PCIE). The reverse is also true ie having a single
> software expose link match multiples physical link.
> 
> Bridges allows initiator to access remote link. A bridges connect two
> links to each others and is also specific to list of initiators (ie
> not all initiators connected to each of the link can use the bridge).
> Bridges have their own properties (bandwidth, latency, ...) so that
> the actual property value for each property is the lowest common
> denominator between bridge and each of the links.
> 
> 
> This model allows to describe any kind of directed graph and thus
> allows to describe any kind of topology we might see in the future.
> It is also easier to add new properties to each object type.
> 
> Moreover it can be use to expose devices capable to do peer to peer
> between them. For that simply have all devices capable to peer to
> peer to have a common link or use the bridge object if the peer to
> peer capabilities is only one way for instance.
> 
> 
> This patchset use the above scheme to expose system topology through
> sysfs under /sys/bus/hms/ with:
>      - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>        each has a UID and you can usual value in that folder (node id,
>        size, ...)
> 
>      - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>        (CPU or device), each has a HMS UID but also a CPU id for CPU
>        (which match CPU id in (/sys/bus/cpu/). For device you have a
>        path that can be PCIE BUS ID for instance)
> 
>      - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
>        UID and a file per property (bandwidth, latency, ...) you also
>        find a symlink to every target and initiator connected to that
>        link.
> 
>      - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
>        a UID and a file per property (bandwidth, latency, ...) you
>        also find a symlink to all initiators that can use that bridge.

Is that version tagging really needed? What changes do you envision with
versions?

> 
> To help with forward compatibility each object as a version value and
> it is mandatory for user space to only use target or initiator with
> version supported by the user space. For instance if user space only
> knows about what version 1 means and sees a target with version 2 then
> the user space must ignore that target as if it does not exist.
> 
> Mandating that allows the additions of new properties that break back-
> ward compatibility ie user space must know how this new property affect
> the object to be able to use it safely.
> 
> This patchset expose main memory of each node under a common target.
> For now device driver are responsible to register memory they want to
> expose through that scheme but in the future that information might
> come from the system firmware (this is a different discussion).
> 
> 
> 
> 2) hbind() bind range of virtual address to heterogeneous memory
> ----------------------------------------------------------------
> 
> With this new topology description the mbind() API is too limited to
> handle which memory to picks. This is why this patchset introduce a new
> API: hbind() for heterogeneous bind. The hbind() API allows to bind any
> kind of target memory (using the HMS target uid), this can be any memory
> expose through HMS ie main memory, HBM, device memory ...
> 
> So instead of using a bitmap, hbind() take an array of uid and each uid
> is a unique memory target inside the new memory topology description.
> User space also provide an array of modifiers. This patchset only define
> some modifier. Modifier can be seen as the flags parameter of mbind()
> but here we use an array so that user space can not only supply a modifier
> but also value with it. This should allow the API to grow more features
> in the future. Kernel should return -EINVAL if it is provided with an
> unkown modifier and just ignore the call all together, forcing the user
> space to restrict itself to modifier supported by the kernel it is
> running on (i know i am dreaming about well behave user space).
> 
> 
> Note that none of this is exclusive of automatic memory placement like
> autonuma. I also believe that we will see something similar to autonuma
> for device memory. This patchset is just there to provide new API for
> process that wish to have a fine control over their memory placement
> because process should know better than the kernel on where to place
> thing.
> 
> This patchset also add necessary bits to the nouveau open source driver
> for it to expose its memory and to allow process to bind some range to
> the GPU memory. Note that on x86 the GPU memory is not accessible by
> CPU because PCIE does not allow cache coherent access to device memory.
> Thus when using PCIE device memory on x86 it is mapped as swap out from
> CPU POV and any CPU access will triger a migration back to main memory
> (this is all part of HMM and nouveau not in this patchset).
> 
> This is all done under staging so that we can experiment with the user-
> space API for a while before committing to anything. Getting this right
> is hard and it might not happen on the first try so instead of having to
> support forever an API i would rather have it leave behind staging for
> people to experiment with and once we feel confident we have something
> we can live with then convert it to a syscall.
> 
> 
> 3) Tracking and applying heterogeneous memory policies
> ------------------------------------------------------
> 
> Current memory policy infrastructure is node oriented, instead of
> changing that and risking breakage and regression this patchset add a
> new heterogeneous policy tracking infra-structure. The expectation is
> that existing application can keep using mbind() and all existing
> infrastructure under-disturb and unaffected, while new application
> will use the new API and should avoid mix and matching both (as they
> can achieve the same thing with the new API).
> 
> Also the policy is not directly tie to the vma structure for a few
> reasons:
>      - avoid having to split vma for policy that do not cover full vma
>      - avoid changing too much vma code
>      - avoid growing the vma structure with an extra pointer
> So instead this patchset use the mmu_notifier API to track vma liveness
> (munmap(),mremap(),...).
> 
> This patchset is not tie to process memory allocation either (like said
> at the begining this is not and end to end patchset but a starting
> point). It does however demonstrate how migration to device memory can
> work under this scheme (using nouveau as a demonstration vehicle).
> 
> The overall design is simple, on hbind() call a hms policy structure
> is created for the supplied range and hms use the callback associated
> with the target memory. This callback is provided by device driver
> for device memory or by core HMS for regular main memory. The callback
> can decide to migrate the range to the target memories or do nothing
> (this can be influenced by flags provided to hbind() too).
> 
> 
> Latter patches can tie page fault with HMS policy to direct memory
> allocation to the right target. For now i would rather postpone that
> discussion until a consensus is reach on how to move forward on all
> the topics presented in this email. Start smalls, grow big ;)
> 
>

I liked the simplicity of keeping it outside all the existing memory
management policy code. But that is also the drawback, isn't it?
We now have multiple entities tracking cpu and memory. (This reminded me
of how we started with memcg in the early days.)

Once we have these different types of targets, ideally the system should
be able to place them in the ideal location based on the affinity of the
access, i.e. we should automatically place the memory such that the
initiator can access the target optimally. That is what we try to do
with the current system with autonuma. (You did mention that you are not
looking at how this patch series will evolve to automatic handling of
placement right now.) But I guess we want to see if the framework indeed
helps in achieving that goal. Will having HMS outside the core memory
handling routines be a roadblock there?

-aneesh




* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04  7:44 ` [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() Aneesh Kumar K.V
@ 2018-12-04 14:44   ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 14:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Dave Hansen, Haggai Eran, Balbir Singh, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 01:14:14PM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 5:04 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >      - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >        each has a UID and you can usual value in that folder (node id,
> >        size, ...)
> > 
> >      - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >        (CPU or device), each has a HMS UID but also a CPU id for CPU
> >        (which match CPU id in (/sys/bus/cpu/). For device you have a
> >        path that can be PCIE BUS ID for instance)
> > 
> >      - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >        UID and a file per property (bandwidth, latency, ...) you also
> >        find a symlink to every target and initiator connected to that
> >        link.
> > 
> >      - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >        a UID and a file per property (bandwidth, latency, ...) you
> >        also find a symlink to all initiators that can use that bridge.
> 
> is that version tagging really needed? What changes do you envision with
> versions?

I kind of dislike it myself, but this is really to keep userspace from
inadvertently using some kind of memory/initiator/link/bridge that it
should not be using if it does not understand the implications.

If it was a file inside the directory there is a big chance that
userspace would overlook it. So an old program on a new platform with a
new kind of weird memory, like non-coherent memory, might start using it
and get weird results. If the version is in the directory name, it kind
of forces userspace to only look at the memory/initiator/link/bridge it
does understand and can use safely.

So I am doing this in the hope that it will protect applications when
new types of things pop up. We have too many examples where we cannot
evolve something because existing applications have baked-in assumptions
about it.
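
As a rough illustration of that filtering on the userspace side (a sketch
only: it assumes directory names follow the v%version-%id-<kind> pattern
quoted above with numeric version and id fields, which the patchset
description does not spell out; parse_hms_name(), usable_objects() and
SUPPORTED_VERSIONS are hypothetical names):

```python
import re

# Assumed name layout, taken from the "v%version-%id-<kind>" pattern quoted
# above; numeric version and id fields are an assumption of this sketch.
HMS_NAME = re.compile(r"^v(\d+)-(\d+)-(target|initiator|link|bridge)$")

# Hypothetical: the object versions this program was written against.
SUPPORTED_VERSIONS = {1}


def parse_hms_name(name):
    """Return (version, uid, kind), or None if the name does not match."""
    m = HMS_NAME.match(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)


def usable_objects(names, supported=SUPPORTED_VERSIONS):
    """Keep only objects whose version this program understands."""
    kept = []
    for name in names:
        parsed = parse_hms_name(name)
        if parsed is not None and parsed[0] in supported:
            kept.append(parsed)
    return kept


# Example listing, standing in for the entries of /sys/bus/hms/devices.
listing = ["v1-0-target", "v2-7-target", "v1-3-initiator", "v1-4-link"]
print(usable_objects(listing))
```

A program written against version 1 never sees the v2 target here, which
is the behavior the directory naming is meant to encourage.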


[...]

> > 3) Tracking and applying heterogeneous memory policies
> > ------------------------------------------------------
> > 
> > Current memory policy infrastructure is node oriented, instead of
> > changing that and risking breakage and regression this patchset add a
> > new heterogeneous policy tracking infra-structure. The expectation is
> > that existing application can keep using mbind() and all existing
> > infrastructure under-disturb and unaffected, while new application
> > will use the new API and should avoid mix and matching both (as they
> > can achieve the same thing with the new API).
> > 
> > Also the policy is not directly tie to the vma structure for a few
> > reasons:
> >      - avoid having to split vma for policy that do not cover full vma
> >      - avoid changing too much vma code
> >      - avoid growing the vma structure with an extra pointer
> > So instead this patchset use the mmu_notifier API to track vma liveness
> > (munmap(),mremap(),...).
> > 
> > This patchset is not tie to process memory allocation either (like said
> > at the begining this is not and end to end patchset but a starting
> > point). It does however demonstrate how migration to device memory can
> > work under this scheme (using nouveau as a demonstration vehicle).
> > 
> > The overall design is simple, on hbind() call a hms policy structure
> > is created for the supplied range and hms use the callback associated
> > with the target memory. This callback is provided by device driver
> > for device memory or by core HMS for regular main memory. The callback
> > can decide to migrate the range to the target memories or do nothing
> > (this can be influenced by flags provided to hbind() too).
> > 
> > 
> > Latter patches can tie page fault with HMS policy to direct memory
> > allocation to the right target. For now i would rather postpone that
> > discussion until a consensus is reach on how to move forward on all
> > the topics presented in this email. Start smalls, grow big ;)
> > 
> > 
> 
> I liked the simplicity of keeping it outside all the existing memory
> management policy code. But that that is also the drawback isn't it?
> We now have multiple entities tracking cpu and memory. (This reminded me of
> how we started with memcg in the early days).

This is a hard choice. The rationale is that an application either uses
this new API or uses the old one, so the expectation is that both should
not co-exist in a process. Eventually both can be consolidated into one
inside the kernel while maintaining the different userspace APIs. But I
feel that it is better to get to that point slowly while we experiment
with the new API. I feel that we need to gain some experience with the
new API on real workloads to convince ourselves that it is something we
can live with. If we reach that point then we can work on consolidating
the kernel code into one. In the meantime this experiment does not
disrupt or regress the existing API. I took the cautious road.


> Once we have these different types of targets, ideally the system should
> be able to place them in the ideal location based on the affinity of the
> access. ie. we should automatically place the memory such that
> initiator can access the target optimally. That is what we try to do with
> current system with autonuma. (You did mention that you are not looking at
> how this patch series will evolve to automatic handling of placement right
> now.) But i guess we want to see if the framework indeed help in achieving
> that goal. Having HMS outside the core memory
> handling routines will be a road blocker there?

Evolving autonuma is going to be a thing of its own. The issue is that
autonuma revolves around CPU id and uses a handful of bits to try to
catch CPU access patterns. With devices in the mix it is much harder:
first, using the page fault trick of autonuma might not be the best
idea; second, we can get a lot of information from the IOMMU, bridge
chipset or the device itself on what is accessed by whom.

So my belief on that front is that it is going to be something different,
like tracking ranges of virtual addresses and maintaining a data
structure per range (not per page).
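
To make "per range, not per page" concrete, here is a minimal userspace
sketch of such a structure. It is an illustration only and not what the
patchset implements: the RangePolicies class, its sorted-list layout and
the rule of rejecting overlapping ranges are all assumptions.

```python
import bisect


class RangePolicies:
    """Track policies per virtual address range rather than per page."""

    def __init__(self):
        self._starts = []   # sorted range start addresses
        self._ranges = []   # parallel list of (start, end, policy), end exclusive

    def add(self, start, end, policy):
        """Record a policy for [start, end); overlapping ranges are rejected."""
        if start >= end:
            raise ValueError("empty range")
        i = bisect.bisect_left(self._starts, start)
        if i > 0 and self._ranges[i - 1][1] > start:
            raise ValueError("overlaps previous range")
        if i < len(self._ranges) and self._ranges[i][0] < end:
            raise ValueError("overlaps next range")
        self._starts.insert(i, start)
        self._ranges.insert(i, (start, end, policy))

    def lookup(self, addr):
        """Return the policy covering addr, or None."""
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            start, end, policy = self._ranges[i]
            if start <= addr < end:
                return policy
        return None
```

One entry covers the whole range however many pages it spans, so the
bookkeeping cost follows the number of ranges rather than their size.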

All this is done in core mm code, I am just keeping it out of the vma
struct or other structs to avoid growing them and wasting space when
this is not in use. So it is very much inside the core handling routines,
it is just optional.

In any case I believe that explicit placement (where the application
hbind()s things) will be the first main use case. Once we have that
figured out (or at least once we believe we have it figured out :)) then
we can look into auto-heterogeneous.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-03 23:34 ` [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation jglisse
@ 2018-12-04 17:06   ` Andi Kleen
  2018-12-04 18:24     ` Jerome Glisse
  2018-12-05 10:52   ` Mike Rapoport
  1 sibling, 1 reply; 94+ messages in thread
From: Andi Kleen @ 2018-12-04 17:06 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Ross Zwisler, Dan Williams, Dave Hansen, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell

jglisse@redhat.com writes:

> +
> +To help with forward compatibility each object as a version value and
> +it is mandatory for user space to only use target or initiator with
> +version supported by the user space. For instance if user space only
> +knows about what version 1 means and sees a target with version 2 then
> +the user space must ignore that target as if it does not exist.

So once v2 is introduced all applications that only support v1 break.

That seems very un-Linux and will break Linus' "do not break existing
applications" rule.

The standard approach if you add something incompatible is to
add new fields, but keep the old ones.

> +2) hbind() bind range of virtual address to heterogeneous memory
> +================================================================
> +
> +So instead of using a bitmap, hbind() take an array of uid and each uid
> +is a unique memory target inside the new memory topology description.

You didn't define what a uid is?

user id?

Please use sensible terminology that doesn't conflict with existing
usages.

I assume it's some kind of number that identifies a node in your
graph. 

> +User space also provide an array of modifiers. Modifier can be seen as
> +the flags parameter of mbind() but here we use an array so that user
> +space can not only supply a modifier but also value with it. This should
> +allow the API to grow more features in the future. Kernel should return
> +-EINVAL if it is provided with an unkown modifier and just ignore the
> +call all together, forcing the user space to restrict itself to modifier
> +supported by the kernel it is running on (i know i am dreaming about well
> +behave user space).

It sounds like you're trying to define a system call with a built-in
ioctl? Is that really a good idea?

If you need ioctl you know where to find it.

Please don't over-design APIs like this.

> +3) Tracking and applying heterogeneous memory policies
> +======================================================
> +
> +Current memory policy infrastructure is node oriented, instead of
> +changing that and risking breakage and regression HMS adds a new
> +heterogeneous policy tracking infra-structure. The expectation is
> +that existing application can keep using mbind() and all existing
> +infrastructure under-disturb and unaffected, while new application
> +will use the new API and should avoid mix and matching both (as they
> +can achieve the same thing with the new API).

I think we need a stronger motivation to define a completely
parallel and somewhat redundant infrastructure. What breakage
are you worried about?

The obvious alternative would of course be to add some extra
enumeration to the existing nodes.

It's a strange document. It goes from very high level to low level
with nothing in between. I think you need a lot more details
in the middle, in particular how these new interfaces
should be used. For example, how should an application
know how to look for a specific type of device?
How is an automated tool supposed to use the enumeration?
etc.

-Andi


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (14 preceding siblings ...)
  2018-12-04  7:44 ` [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() Aneesh Kumar K.V
@ 2018-12-04 18:02 ` Dave Hansen
  2018-12-04 18:49   ` Jerome Glisse
  2018-12-04 23:54 ` Dave Hansen
  16 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-04 18:02 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Rafael J . Wysocki, Matthew Wilcox,
	Ross Zwisler, Keith Busch, Dan Williams, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> This means that it is no longer sufficient to consider a flat view
> for each node in a system but for maximum performance we need to
> account for all of this new memory but also for system topology.
> This is why this proposal is unlike the HMAT proposal [1] which
> tries to extend the existing NUMA for new type of memory. Here we
> are tackling a much more profound change that depart from NUMA.

The HMAT and its implications exist, in firmware, whether or not we do
*anything* in Linux to support it.  Any system with an HMAT
inherently reflects the new topology, via proximity domains, whether or
not we parse the HMAT table in Linux.

Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
that or embrace it.  Keith's HMAT patches are embracing it.  These
patches are appearing to fight it.  Agree?  Disagree?

Also, could you add a simple example program for how someone might use
this?  I got lost in all the new sysfs and ioctl gunk.  Can you
characterize how this would work with the *existing* NUMA interfaces that
we have?


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 17:06   ` Andi Kleen
@ 2018-12-04 18:24     ` Jerome Glisse
  2018-12-04 18:31       ` Dan Williams
                         ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 18:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Ross Zwisler, Dan Williams, Dave Hansen, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell

On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> jglisse@redhat.com writes:
> 
> > +
> > +To help with forward compatibility each object as a version value and
> > +it is mandatory for user space to only use target or initiator with
> > +version supported by the user space. For instance if user space only
> > +knows about what version 1 means and sees a target with version 2 then
> > +the user space must ignore that target as if it does not exist.
> 
> So once v2 is introduced all applications that only support v1 break.
> 
> That seems very un-Linux and will break Linus' "do not break existing
> applications" rule.
> 
> The standard approach that if you add something incompatible is to
> add new field, but keep the old ones.

No, that's not how it is supposed to work. Let's say it is 2018 and you
have v1 memory (like your regular main DDR memory for instance); then it
will always be exposed as v1 memory.

Fast forward to 2020 and you have this new type of memory that is not
cache coherent and you want to expose it to userspace through HMS. What
you do is a kernel patch that introduces the v2 type for targets and
defines a set of new sysfs files to describe what v2 is. On this new
computer you report your usual main memory as v1 and your new memory as
v2.

So an application that only knew about v1 will keep using any v1 memory
on your new platform but it will not use any of the new v2 memory, which
is what you want to happen. You do not have to break existing
applications while allowing new types of memory to be added.


Sorry if it was unclear. I will try to reformulate and give an example
as above.


> > +2) hbind() bind range of virtual address to heterogeneous memory
> > +================================================================
> > +
> > +So instead of using a bitmap, hbind() take an array of uid and each uid
> > +is a unique memory target inside the new memory topology description.
> 
> You didn't define what an uid is?
> 
> user id ?
> 
> Please use sensible terminology that doesn't conflict with existing
> usages.
> 
> I assume it's some kind of number that identifies a node in your
> graph. 

Correct, uid is a unique id given to each node in the graph. I will
clarify that.


> > +User space also provide an array of modifiers. Modifier can be seen as
> > +the flags parameter of mbind() but here we use an array so that user
> > +space can not only supply a modifier but also value with it. This should
> > +allow the API to grow more features in the future. Kernel should return
> > +-EINVAL if it is provided with an unkown modifier and just ignore the
> > +call all together, forcing the user space to restrict itself to modifier
> > +supported by the kernel it is running on (i know i am dreaming about well
> > +behave user space).
> 
> It sounds like you're trying to define a system call with built in
> ioctl? Is that really a good idea?
> 
> If you need ioctl you know where to find it.

Well, I would like to get things running in the wild with some guinea pig
users to get feedback from end users. It would be easier if I can do this
with an upstream kernel and not some random branch in my private repo.
While doing that I would like to avoid committing to a syscall upstream.
So the way I see around this is doing a driver under staging with an
ioctl, which will be turned into a syscall once some confidence in the
API is gained.

If you think I should do a syscall right away I am not against doing that.

> 
> Please don't over design APIs like this.

So there are 2 approaches here. I can define 2 syscalls, one for
migration and one for policy. Migration and policy are 2 different things
from the point of view of all existing users. By defining 2 syscalls I
can cut each of them down to do one thing and one thing only and make it
as simple and lean as possible.

In the present version I took the other approach of defining just one
API that can grow to do more things. I know the unix way is one simple
tool for one simple job. I can switch to one simple call per action.
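
For reference, the single-call variant relies on the modifier array
described in the quoted documentation: (modifier, value) pairs, with the
whole call rejected with -EINVAL when any modifier is unknown. A minimal
sketch of that check follows; the modifier names and numbers are
hypothetical, and the patchset defines its own set.

```python
import errno

# Hypothetical modifier identifiers; the patchset defines its own set.
HBIND_MOD_MIGRATE = 1
HBIND_MOD_POLICY = 2
KNOWN_MODIFIERS = {HBIND_MOD_MIGRATE, HBIND_MOD_POLICY}


def check_modifiers(modifiers):
    """Validate a list of (modifier, value) pairs before acting on any of them.

    Returns 0 if every modifier is known, -EINVAL otherwise, so that a call
    carrying an unknown modifier is rejected as a whole.
    """
    for modifier, _value in modifiers:
        if modifier not in KNOWN_MODIFIERS:
            return -errno.EINVAL
    return 0
```

Validating the whole array before doing any work is what lets an old
kernel refuse a request it only partly understands instead of applying
half of it.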


> > +3) Tracking and applying heterogeneous memory policies
> > +======================================================
> > +
> > +Current memory policy infrastructure is node oriented, instead of
> > +changing that and risking breakage and regression HMS adds a new
> > +heterogeneous policy tracking infra-structure. The expectation is
> > +that existing application can keep using mbind() and all existing
> > +infrastructure under-disturb and unaffected, while new application
> > +will use the new API and should avoid mix and matching both (as they
> > +can achieve the same thing with the new API).
> 
> I think we need a stronger motivation to define a completely
> parallel and somewhat redundant infrastructure. What breakage
> are you worried about?

Some memory exposed through HMS is not allocated by the regular memory
allocator. For instance GPU memory is managed by the GPU driver, so when
you want to use GPU memory (either as a policy or by migrating to it)
you need to use the GPU allocator to allocate that memory. HMS adds
a bunch of callbacks to the target structure so that a device driver can
expose a generic API to the core kernel to do such allocations.

Now I can change the existing code paths to use the target structure as
an intermediary for allocation, but this changes hot code paths and
I doubt it would be welcome today. Eventually I think we will want
that to happen, and we can work on minimizing the cost for users that do
not use things like GPUs.

The transition phase will take time (a couple of years) and I would like
to avoid disturbing existing workloads while we migrate GPU users to
this new API.


> The obvious alternative would of course be to add some extra
> enumeration to the existing nodes.

We cannot extend NUMA nodes to expose GPU memory. GPU memory on
current AMD and Intel platforms is not cache coherent and thus
should not be used for random memory allocations. It should really
stay a thing users have to explicitly select. Note that the usage
we have here is that when you use GPU memory it is as if the range
of virtual addresses is swapped out from the CPU point of view,
but the GPU can access it.

> It's a strange document. It goes from very high level to low level
> with nothing inbetween. I think you need a lot more details
> in the middle, in particularly how these new interfaces
> should be used. For example how should an application
> know how to look for a specific type of device?
> How is an automated tool supposed to use the enumeration?
> etc.

Today users use dedicated APIs (OpenCL, ROCm, CUDA, ...), and those
high-level APIs all have the API I present here in one form or another.
So I want to move this high-level API that is actively used by
programs today into the kernel. The end game is to create a common
infrastructure for various accelerator hardware (GPU, FPGA, ...)
to manage memory.

This is something asked for by end users for one simple reason. Today
users have to mix and match multiple APIs in their application, and
when they want to exchange data between one device that uses one API
and another device that uses another API, they have to do an explicit
copy and rebuild their data structure inside the new memory. When
you move over things like a tree or any complex data structure you have
to rebuild it, i.e. redo the pointer links between the nodes of your
data structure.

This is highly error prone, complex and wasteful (you have to burn
CPU cycles to do that). Now if you can use the same address space
as all the other memory allocations in your program and move data
around from one device to another with a common API that works on
all the various devices, you eliminate that complex step and make
the end user's life much easier.

So I am doing this to help existing users by addressing an issue
that is becoming harder and harder to solve for userspace. My end
game is to blur the boundary between the CPU and devices like GPU,
FPGA, ...


Thank you for taking the time to read this proposal and for your
feedback. Much appreciated. I will try to include your comments in my
v2.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 18:24     ` Jerome Glisse
@ 2018-12-04 18:31       ` Dan Williams
  2018-12-04 18:57         ` Jerome Glisse
  2018-12-04 20:12       ` Andi Kleen
  2018-12-05  4:36       ` Aneesh Kumar K.V
  2 siblings, 1 reply; 94+ messages in thread
From: Dan Williams @ 2018-12-04 18:31 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andi Kleen, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	Rafael J. Wysocki, Ross Zwisler, Dave Hansen, Haggai Eran,
	balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt, Kuehling,
	Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	Logan Gunthorpe, John Hubbard, rcampbell

On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > jglisse@redhat.com writes:
> >
> > > +
> > > +To help with forward compatibility each object as a version value and
> > > +it is mandatory for user space to only use target or initiator with
> > > +version supported by the user space. For instance if user space only
> > > +knows about what version 1 means and sees a target with version 2 then
> > > +the user space must ignore that target as if it does not exist.
> >
> > So once v2 is introduced all applications that only support v1 break.
> >
> > That seems very un-Linux and will break Linus' "do not break existing
> > applications" rule.
> >
> > The standard approach that if you add something incompatible is to
> > add new field, but keep the old ones.
>
> No that's not how it is suppose to work. So let says it is 2018 and you
> have v1 memory (like your regular main DDR memory for instance) then it
> will always be expose a v1 memory.
>
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
>
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.

That sounds needlessly restrictive. Let the kernel arbitrate what
memory an application gets, don't design a system where applications
are hard coded to a memory type. Applications can hint, or optionally
specify an override and the kernel can react accordingly.


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 18:02 ` Dave Hansen
@ 2018-12-04 18:49   ` Jerome Glisse
  2018-12-04 18:54     ` Dave Hansen
                       ` (2 more replies)
  0 siblings, 3 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 18:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system but for maximum performance we need to
> > account for all of this new memory but also for system topology.
> > This is why this proposal is unlike the HMAT proposal [1] which
> > tries to extend the existing NUMA for new type of memory. Here we
> > are tackling a much more profound change that depart from NUMA.
> 
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it or not.  Any system with an HMAT
> inherently reflects the new topology, via proximity domains, whether or
> not we parse the HMAT table in Linux or not.
> 
> Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
> that or embrace it.  Keith's HMAT patches are embracing it.  These
> patches are appearing to fight it.  Agree?  Disagree?

Disagree, and sorry if it felt that way; that was not my intention. The
ACPI HMAT information can be used to populate the HMS file system
representation. My intention is not to fight Keith's HMAT patches;
they are useful on their own. But I do not see how to evolve NUMA
to support device memory, so while Keith is taking a step in the
direction I want, I do not see how to cross to the place I need to
be. More on that below.

> 
> Also, could you add a simple, example program for how someone might use
> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> characterize how this would work with the *exiting* NUMA interfaces that
> we have?

That is the issue: I cannot expose device memory as a NUMA node, as
device memory is not cache coherent on AMD and Intel platforms today.

Moreover, in some cases that memory is not visible at all to the CPU,
which is not something you can express with the current NUMA nodes.
Here is an abbreviated list of the features I need to support:
    - device private memory (not accessible by CPU or anybody else)
    - non-coherent memory (PCIE is not cache coherent for CPU access)
    - multiple paths to access the same memory, either:
        - multiple _different_ physical address aliases to the same memory
        - device blocks can select which path they take to access some
          memory (it is not inside the page table but in how you program
          the device block)
    - complex topology that is not a tree, where device links can have
      better characteristics than the CPU interconnect between the
      nodes. There are users today that use topology information
      to partition their workload (HPC folks who have a fixed platform).
    - device memory needs to stay under device driver control, as some
      existing APIs (OpenGL, Vulkan) have different memory models, and if
      we want the device to be used for those too then we need to keep
      the device driver in control of the device memory allocation


There is an example userspace program with the last patch in the series.
But here is a high-level overview of how one application looks today:

    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocates memory on device A and copies over the dataset
    3) Application runs some CPU code to format the copy of the dataset
       inside device A memory (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    4) Application runs code on device A that uses the dataset
    5) Application allocates memory on device B and copies over the result
       from device A
    6) Application runs some CPU code to format the copy of the dataset
       inside device B (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    7) Application runs code on device B that uses the dataset
    8) Application copies the result over from device B and keeps on doing
       its thing

How it looks with HMS:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate to device A memory
    4) Application runs code on device A that uses the dataset
    5-6) Application calls HMS to migrate to device B memory
    7) Application runs code on device B that uses the dataset
    8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild the data
structure inside each device address space.
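
To make the cost of steps 3 and 6 in the first list concrete, here is a
minimal sketch in plain C. It is only an illustration: the second buffer
stands in for device memory, and the names copy_and_rebase() and
list_sum() are made up for this example, they are not part of HMS or of
any GPU API. Every pointer inside the copied structure has to be
recomputed because the copy lives at a different address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A list node that stores a raw pointer, as most host data structures do. */
struct node {
    int value;
    struct node *next;
};

/*
 * Copy n nodes from src to dst (a different buffer, standing in for device
 * memory) and rebuild every next pointer so that it points into dst.
 * This per-node fixup is the CPU work that the explicit-copy model pays.
 * All next pointers in src must point into src itself or be NULL.
 */
static void copy_and_rebase(struct node *dst, const struct node *src, size_t n)
{
    memcpy(dst, src, n * sizeof(*src));
    for (size_t i = 0; i < n; i++) {
        if (src[i].next != NULL) {
            size_t idx = (size_t)(src[i].next - src); /* index inside src */
            assert(idx < n);
            dst[i].next = &dst[idx];
        }
    }
}

/* Sum the list by following pointers, to check that the links are valid. */
static int list_sum(const struct node *head)
{
    int sum = 0;

    for (; head != NULL; head = head->next)
        sum += head->value;
    return sum;
}
```

With a shared virtual address space the nodes keep their addresses when
the backing memory is migrated, so the fixup loop in copy_and_rebase()
has nothing to do and disappears.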


The above example is for migration. Here is an example of how the
topology is used today:

    Application knows that the platform it is running on has 16
    GPUs split into 2 groups of 8 GPUs each. GPUs in each group can
    access each other's memory through dedicated mesh links between
    each other. Full speed, no traffic bottleneck.

    Application splits its GPU computation in 2 so that each
    partition runs on a group of interconnected GPUs, allowing
    them to share the dataset.

With HMS:
    Application can query the kernel to discover the topology of
    the system it is running on and use it to partition and balance
    its workload accordingly. The same application should now be able
    to run on a new platform without having to be adapted to it.

This is kind of naive; I expect topology to be hard to use, but maybe
it is just me being pessimistic. In any case today we have a chicken
and egg problem. We do not have a standard way to expose topology, so
programs that can leverage topology are only written for HPC, where the
platform stays the same for a few years. If we had a standard way to
expose the topology then maybe we would see more programs using it. At
the very least we could convert existing users.
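
As a rough sketch of what "use the topology to partition the workload"
could look like in userspace, the C code below groups devices that are
connected by peer links. Everything here is an assumption made for the
example: the link matrix, MAX_DEV and partition_devices() are
hypothetical, and how the matrix would be filled from the HMS sysfs
representation is left out. It also simplifies "dedicated mesh" to
"reachable over peer links" (connected components).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEV 64

/*
 * Hypothetical input: link[i][j] is true when device i and device j have a
 * direct peer link. Filling this matrix from the kernel is outside the
 * scope of this sketch.
 *
 * Output: group[i] receives the index of the group device i belongs to.
 * Returns the number of groups found.
 */
static size_t partition_devices(size_t n, bool link[MAX_DEV][MAX_DEV],
                                size_t group[MAX_DEV])
{
    bool seen[MAX_DEV] = { false };
    size_t stack[MAX_DEV];
    size_t ngroups = 0;

    assert(n <= MAX_DEV);
    for (size_t start = 0; start < n; start++) {
        size_t top = 0;

        if (seen[start])
            continue;
        /* Depth-first walk over everything reachable from start. */
        stack[top++] = start;
        seen[start] = true;
        while (top > 0) {
            size_t cur = stack[--top];

            group[cur] = ngroups;
            for (size_t peer = 0; peer < n; peer++) {
                if (!seen[peer] && (link[cur][peer] || link[peer][cur])) {
                    seen[peer] = true;
                    stack[top++] = peer;
                }
            }
        }
        ngroups++;
    }
    return ngroups;
}
```

For the 16 GPU platform described above, the function would report 2
groups of 8, and the application could then hand one half of the
computation to each group without hard coding the platform layout.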


Policy is same kind of story, this email is long enough now :) But
i can write one down if you want.


Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 18:49   ` Jerome Glisse
@ 2018-12-04 18:54     ` Dave Hansen
  2018-12-04 19:11       ` Jerome Glisse
  2018-12-04 21:37     ` Dave Hansen
  2018-12-05 11:27     ` Aneesh Kumar K.V
  2 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-04 18:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On 12/4/18 10:49 AM, Jerome Glisse wrote:
> Policy is same kind of story, this email is long enough now :) But
> i can write one down if you want.

Yes, please.  I'd love to see the code.

We'll do the same on the "HMAT" side and we can compare notes.


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 18:31       ` Dan Williams
@ 2018-12-04 18:57         ` Jerome Glisse
  2018-12-04 19:11           ` Logan Gunthorpe
  2018-12-04 19:19           ` Dan Williams
  0 siblings, 2 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 18:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andi Kleen, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	Rafael J. Wysocki, Ross Zwisler, Dave Hansen, Haggai Eran,
	balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt, Kuehling,
	Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	Logan Gunthorpe, John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > jglisse@redhat.com writes:
> > >
> > > > +
> > > > +To help with forward compatibility each object as a version value and
> > > > +it is mandatory for user space to only use target or initiator with
> > > > +version supported by the user space. For instance if user space only
> > > > +knows about what version 1 means and sees a target with version 2 then
> > > > +the user space must ignore that target as if it does not exist.
> > >
> > > So once v2 is introduced all applications that only support v1 break.
> > >
> > > That seems very un-Linux and will break Linus' "do not break existing
> > > applications" rule.
> > >
> > > The standard approach that if you add something incompatible is to
> > > add new field, but keep the old ones.
> >
> > No that's not how it is suppose to work. So let says it is 2018 and you
> > have v1 memory (like your regular main DDR memory for instance) then it
> > will always be expose a v1 memory.
> >
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> >
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> 
> That sounds needlessly restrictive. Let the kernel arbitrate what
> memory an application gets, don't design a system where applications
> are hard coded to a memory type. Applications can hint, or optionally
> specify an override and the kernel can react accordingly.

You do not want to randomly use non-cache-coherent memory inside your
application :) This is not going to go well with C++ or atomics :) Yes,
there are legitimate use cases where an application can decide to give up
cache coherency temporarily for a range of virtual addresses. But the
application needs to understand what it is doing and opt in knowing full
well what that means. The version thing allows for scenarios like that.
You do not have to define a new version with every new type of memory. If
your new memory has all the properties of v1 then you expose it as v1, and
old applications on the new platform will use your new memory type, being
none the wiser.

The version thing is really to keep users from using something they
do not want to use without understanding the consequences of doing so.
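
A minimal sketch of that rule as seen from userspace, in C. The struct
layout, APP_MAX_TARGET_VERSION and filter_known_targets() are all
hypothetical names for this illustration and do not come from the patch
series; how the version of each target is read from sysfs is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Highest target version this (hypothetical) application understands. */
#define APP_MAX_TARGET_VERSION 1u

/* Hypothetical userspace view of one HMS target; not the kernel's layout. */
struct hms_target_info {
    unsigned int id;
    unsigned int version;
};

/* A version is known if it is in the range the application was written for. */
static bool version_is_known(unsigned int version)
{
    return version >= 1u && version <= APP_MAX_TARGET_VERSION;
}

/*
 * Keep only the targets whose version the application understands,
 * compacting them to the front of the array. Targets with an unknown
 * version are dropped, as if they did not exist.
 * Returns how many targets remain usable.
 */
static size_t filter_known_targets(struct hms_target_info *targets, size_t n)
{
    size_t kept = 0;

    assert(targets != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (version_is_known(targets[i].version))
            targets[kept++] = targets[i];
    }
    return kept;
}
```

An application built against v1 that runs on a platform exposing both v1
and v2 targets keeps working with the v1 targets only, which is the
behavior described above.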

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 18:54     ` Dave Hansen
@ 2018-12-04 19:11       ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 19:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Keith Busch, Dan Williams, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 10:54:10AM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> > Policy is same kind of story, this email is long enough now :) But
> > i can write one down if you want.
> 
> Yes, please.  I'd love to see the code.
> 
> We'll do the same on the "HMAT" side and we can compare notes.

Example use cases? Example uses are:
    Application creates a range of virtual addresses with mmap() for the
    input dataset. Application knows it will use the GPU on it directly,
    so it calls hbind() to set a policy for the range to use GPU memory
    for any new allocation in the range.

    Application directly streams the dataset to GPU memory through the
    virtual address range thanks to the policy.


    Application creates a range of virtual addresses with mmap() to store
    the output result of the GPU jobs it is about to launch. It binds the
    range of virtual addresses to GPU memory so that allocations use GPU
    memory for the range.


    Application can also use policy binding as a slow migration path,
    i.e. set a policy to a new target memory so that new allocations are
    directed to this new target.

Or do you want an example userspace program like the one in the last
patch of this series?

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 18:57         ` Jerome Glisse
@ 2018-12-04 19:11           ` Logan Gunthorpe
  2018-12-04 19:22             ` Jerome Glisse
  2018-12-04 20:14             ` Andi Kleen
  2018-12-04 19:19           ` Dan Williams
  1 sibling, 2 replies; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-04 19:11 UTC (permalink / raw)
  To: Jerome Glisse, Dan Williams
  Cc: Andi Kleen, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	Rafael J. Wysocki, Ross Zwisler, Dave Hansen, Haggai Eran,
	balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt, Kuehling,
	Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-04 11:57 a.m., Jerome Glisse wrote:
>> That sounds needlessly restrictive. Let the kernel arbitrate what
>> memory an application gets, don't design a system where applications
>> are hard coded to a memory type. Applications can hint, or optionally
>> specify an override and the kernel can react accordingly.
> 
> You do not want to randomly use non cache coherent memory inside your
> application :) This is not gonna go well with C++ or atomic :) Yes they
> are legitimate use case where application can decide to give up cache
> coherency temporarily for a range of virtual address. But the application
> needs to understand what it is doing and opt in to do that knowing full
> well that. The version thing allows for scenario like. You do not have
> to define a new version with every new type of memory. If your new memory
> has all the properties of v1 than you expose it as v1 and old application
> on the new platform will use your new memory type being non the wiser.

I agree with Dan and the general idea that this version thing is really
ugly. Define some standard attributes so the application can say "I want
cache-coherent, high bandwidth memory". If there's some future
new-memory attribute, then the application needs to know about it to
request it.

Also, in the same vein, I think it's wrong to have the API enumerate all
the different memory available in the system. The API should simply
allow userspace to say it wants memory that can be accessed by a set of
initiators with a certain set of attributes and the bind call tries to
fulfill that or fallback on system memory/hmm migration/whatever.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 18:57         ` Jerome Glisse
  2018-12-04 19:11           ` Logan Gunthorpe
@ 2018-12-04 19:19           ` Dan Williams
  2018-12-04 19:32             ` Jerome Glisse
  1 sibling, 1 reply; 94+ messages in thread
From: Dan Williams @ 2018-12-04 19:19 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andi Kleen, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	Rafael J. Wysocki, Ross Zwisler, Dave Hansen, Haggai Eran,
	balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt, Kuehling,
	Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	Logan Gunthorpe, John Hubbard, rcampbell

On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > > jglisse@redhat.com writes:
> > > >
> > > > > +
> > > > > +To help with forward compatibility each object as a version value and
> > > > > +it is mandatory for user space to only use target or initiator with
> > > > > +version supported by the user space. For instance if user space only
> > > > > +knows about what version 1 means and sees a target with version 2 then
> > > > > +the user space must ignore that target as if it does not exist.
> > > >
> > > > So once v2 is introduced all applications that only support v1 break.
> > > >
> > > > That seems very un-Linux and will break Linus' "do not break existing
> > > > applications" rule.
> > > >
> > > > The standard approach that if you add something incompatible is to
> > > > add new field, but keep the old ones.
> > >
> > > No that's not how it is suppose to work. So let says it is 2018 and you
> > > have v1 memory (like your regular main DDR memory for instance) then it
> > > will always be expose a v1 memory.
> > >
> > > Fast forward 2020 and you have this new type of memory that is not cache
> > > coherent and you want to expose this to userspace through HMS. What you
> > > do is a kernel patch that introduce the v2 type for target and define a
> > > set of new sysfs file to describe what v2 is. On this new computer you
> > > report your usual main memory as v1 and your new memory as v2.
> > >
> > > So the application that only knew about v1 will keep using any v1 memory
> > > on your new platform but it will not use any of the new memory v2 which
> > > is what you want to happen. You do not have to break existing application
> > > while allowing to add new type of memory.
> >
> > That sounds needlessly restrictive. Let the kernel arbitrate what
> > memory an application gets, don't design a system where applications
> > are hard coded to a memory type. Applications can hint, or optionally
> > specify an override and the kernel can react accordingly.
>
> You do not want to randomly use non cache coherent memory inside your
> application :)

The kernel arbitrates memory, it's a bug if it hands out something
that exotic to an unaware application.


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 19:11           ` Logan Gunthorpe
@ 2018-12-04 19:22             ` Jerome Glisse
  2018-12-04 19:41               ` Logan Gunthorpe
  2018-12-04 20:14             ` Andi Kleen
  1 sibling, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 19:22 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 12:11:42PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 11:57 a.m., Jerome Glisse wrote:
> >> That sounds needlessly restrictive. Let the kernel arbitrate what
> >> memory an application gets, don't design a system where applications
> >> are hard coded to a memory type. Applications can hint, or optionally
> >> specify an override and the kernel can react accordingly.
> > 
> > You do not want to randomly use non cache coherent memory inside your
> > application :) This is not gonna go well with C++ or atomic :) Yes they
> > are legitimate use case where application can decide to give up cache
> > coherency temporarily for a range of virtual address. But the application
> > needs to understand what it is doing and opt in to do that knowing full
> > well that. The version thing allows for scenario like. You do not have
> > to define a new version with every new type of memory. If your new memory
> > has all the properties of v1 than you expose it as v1 and old application
> > on the new platform will use your new memory type being non the wiser.
> 
> I agree with Dan and the general idea that this version thing is really
> ugly. Define some standard attributes so the application can say "I want
> cache-coherent, high bandwidth memory". If there's some future
> new-memory attribute, then the application needs to know about it to
> request it.

So version is a bad prefix; what about type, i.e. prefixing the target
with a type id, so that applications that are looking for a certain type
of memory (which has a set of defined properties) can select it? Having
a type file inside the directory and hoping applications will read
that sysfs file is a recipe for failure from my point of view, while
having it in the directory name makes sure that the application
has some idea of what it is doing.

> 
> Also, in the same vein, I think it's wrong to have the API enumerate all
> the different memory available in the system. The API should simply
> allow userspace to say it wants memory that can be accessed by a set of
> initiators with a certain set of attributes and the bind call tries to
> fulfill that or fallback on system memory/hmm migration/whatever.

We have existing applications that use topology today to partition their
workload and do load balancing. Those applications leverage the fact that
they only run on a small set of known platforms with known topology.
Here I want to provide a common API so that topology can be queried in a
standard way by applications.

Yes, basic applications will not leverage all this information and will
be happy enough with "give me memory that will be fast for initiators A
and B". That can easily be implemented inside a userspace library which
dumbs down the topology on behalf of the application.

I believe that a new infrastructure should allow for maximum
expressiveness. The HMS API in this proposal allows expressing any kind
of directed graph, hence I do not see any limitation going forward. At
the same time a userspace library can easily dumb this down for the
average Joe/Jane application.
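
A rough sketch in C of what such a "dumbed down" library helper could do.
The struct, the bandwidth table and pick_target() are hypothetical and
only illustrate the idea; the real data would come from whatever
interface HMS ends up exposing. The helper picks the target whose
slowest path from the requested initiators is the fastest.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_INITIATORS 8

/* Hypothetical, already-parsed description of one target memory. */
struct target_desc {
    unsigned int id;
    /* bandwidth[i] in MB/s from initiator i; 0 means "cannot access". */
    unsigned long bandwidth[MAX_INITIATORS];
};

/*
 * Return the index of the target whose slowest path from the requested
 * initiators is the fastest, or -1 if no target is reachable by all of
 * them (or if no initiator was given).
 */
static int pick_target(const struct target_desc *targets, size_t ntargets,
                       const size_t *initiators, size_t ninitiators)
{
    int best = -1;
    unsigned long best_bw = 0;

    for (size_t t = 0; t < ntargets; t++) {
        unsigned long min_bw = ULONG_MAX;

        for (size_t k = 0; k < ninitiators; k++) {
            unsigned long bw;

            assert(initiators[k] < MAX_INITIATORS);
            bw = targets[t].bandwidth[initiators[k]];
            if (bw < min_bw)
                min_bw = bw;
        }
        /* A minimum of 0 means one initiator cannot reach this target. */
        if (ninitiators > 0 && min_bw > best_bw) {
            best_bw = min_bw;
            best = (int)t;
        }
    }
    return best;
}
```

An application that does not care about the full graph would call only
this kind of helper, while HPC style applications could still walk the
complete topology themselves.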

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 19:19           ` Dan Williams
@ 2018-12-04 19:32             ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 19:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andi Kleen, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	Rafael J. Wysocki, Ross Zwisler, Dave Hansen, Haggai Eran,
	balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt, Kuehling,
	Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	Logan Gunthorpe, John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 11:19:23AM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> > > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > > > jglisse@redhat.com writes:
> > > > >
> > > > > > +
> > > > > > +To help with forward compatibility each object as a version value and
> > > > > > +it is mandatory for user space to only use target or initiator with
> > > > > > +version supported by the user space. For instance if user space only
> > > > > > +knows about what version 1 means and sees a target with version 2 then
> > > > > > +the user space must ignore that target as if it does not exist.
> > > > >
> > > > > So once v2 is introduced all applications that only support v1 break.
> > > > >
> > > > > That seems very un-Linux and will break Linus' "do not break existing
> > > > > applications" rule.
> > > > >
> > > > > The standard approach that if you add something incompatible is to
> > > > > add new field, but keep the old ones.
> > > >
> > > > No that's not how it is suppose to work. So let says it is 2018 and you
> > > > have v1 memory (like your regular main DDR memory for instance) then it
> > > > will always be expose a v1 memory.
> > > >
> > > > Fast forward 2020 and you have this new type of memory that is not cache
> > > > coherent and you want to expose this to userspace through HMS. What you
> > > > do is a kernel patch that introduce the v2 type for target and define a
> > > > set of new sysfs file to describe what v2 is. On this new computer you
> > > > report your usual main memory as v1 and your new memory as v2.
> > > >
> > > > So the application that only knew about v1 will keep using any v1 memory
> > > > on your new platform but it will not use any of the new memory v2 which
> > > > is what you want to happen. You do not have to break existing application
> > > > while allowing to add new type of memory.
> > >
> > > That sounds needlessly restrictive. Let the kernel arbitrate what
> > > memory an application gets, don't design a system where applications
> > > are hard coded to a memory type. Applications can hint, or optionally
> > > specify an override and the kernel can react accordingly.
> >
> > You do not want to randomly use non cache coherent memory inside your
> > application :)
> 
> The kernel arbitrates memory, it's a bug if it hands out something
> that exotic to an unaware application.

In some cases, and for some period of time, an application would like
to use exotic memory for performance reasons. This exists today:
graphics APIs routinely expose uncached memory to applications and have
been doing so for many years.

Some compute folks would like to have some of that benefit at times.
The idea is that you malloc() some memory in your application and do
stuff on the CPU, business as usual. Then you are going to use that
memory on some exotic device, and for that device it would be best if
you migrated that memory to uncached/non-coherent memory. If the
application knows it is safe to do so, it can decide to pick such
memory with HMS and migrate its malloced data there.

This does not only happen in the application itself; it can happen
inside a library that the application uses, and the application might
be totally unaware of the library doing so. This is very common today
in AI/ML workloads, where the various libraries in your AI/ML stack do
things to the memory you handed over to them. It is all part of the
library API contract.

So there are legitimate use cases for this, which is why I would like
to be able to expose exotic memory to userspace so that it can migrate
regular allocations there when that makes sense.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 19:22             ` Jerome Glisse
@ 2018-12-04 19:41               ` Logan Gunthorpe
  2018-12-04 20:13                 ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-04 19:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell



On 2018-12-04 12:22 p.m., Jerome Glisse wrote:
> So version is a bad prefix, what about type, prefixing target with a
> type id. So that application that are looking for a certain type of
> memory (which has a set of define properties) can select them. Having
> a type file inside the directory and hopping application will read
> that sysfs file is a recipies for failure from my point of view. While
> having it in the directory name is making sure that the application
> has some idea of what it is doing.

Well I don't think it can be a prefix. It has to be a mask. It might be
things like cache coherency, persistence, bandwidth and none of those
things are mutually exclusive.

>> Also, in the same vein, I think it's wrong to have the API enumerate all
>> the different memory available in the system. The API should simply
>> allow userspace to say it wants memory that can be accessed by a set of
>> initiators with a certain set of attributes and the bind call tries to
>> fulfill that or fallback on system memory/hmm migration/whatever.
> 
> We have existing application that use topology today to partition their
> workload and do load balancing. Those application leverage the fact that
> they are only running on a small set of known platform with known topology
> here i want to provide a common API so that topology can be queried in a
> standard by application.

Existing applications are not a valid excuse for poor API design.
Remember, once this API is introduced and has real users, it has to be
maintained *forever*, so we need to get it right. Providing users with
more information than they need makes it exponentially harder to get
right and support.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 18:24     ` Jerome Glisse
  2018-12-04 18:31       ` Dan Williams
@ 2018-12-04 20:12       ` Andi Kleen
  2018-12-04 20:41         ` Jerome Glisse
  2018-12-05  4:36       ` Aneesh Kumar K.V
  2 siblings, 1 reply; 94+ messages in thread
From: Andi Kleen @ 2018-12-04 20:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Ross Zwisler, Dan Williams, Dave Hansen, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell

On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote:
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
> 
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.

That seems entirely like the wrong model. We don't want to rewrite every
application for adding a new memory type.

Rather there needs to be an abstract way to query memory of specific
behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency or similar

Sure there can be a name somewhere, but it should only be used
for identification purposes, not to hard code in applications.

Really you need to define some use cases and describe how your API
handles them.

> > 
> > It sounds like you're trying to define a system call with built in
> > ioctl? Is that really a good idea?
> > 
> > If you need ioctl you know where to find it.
> 
> Well i would like to get thing running in the wild with some guinea pig
> user to get feedback from end user. It would be easier if i can do this
> with upstream kernel and not some random branch in my private repo. While
> doing that i would like to avoid commiting to a syscall upstream. So the
> way i see around this is doing a driver under staging with an ioctl which
> will be turn into a syscall once some confidence into the API is gain.

Ok that's fine I guess.

But should be a clearly defined ioctl, not an ioctl with redefinable parameters
(but perhaps I misunderstood your description)

> In the present version i took the other approach of defining just one
> API that can grow to do more thing. I know the unix way is one simple
> tool for one simple job. I can switch to the simple call for one action.

Simple calls are better.

> > > +Current memory policy infrastructure is node oriented, instead of
> > > +changing that and risking breakage and regression HMS adds a new
> > > +heterogeneous policy tracking infra-structure. The expectation is
> > > +that existing application can keep using mbind() and all existing
> > > +infrastructure under-disturb and unaffected, while new application
> > > +will use the new API and should avoid mix and matching both (as they
> > > +can achieve the same thing with the new API).
> > 
> > I think we need a stronger motivation to define a completely
> > parallel and somewhat redundant infrastructure. What breakage
> > are you worried about?
> 
> Some memory expose through HMS is not allocated by regular memory
> allocator. For instance GPU memory is manage by GPU driver, so when
> you want to use GPU memory (either as a policy or by migrating to it)
> you need to use the GPU allocator to allocate that memory. HMS adds
> a bunch of callback to target structure so that device driver can
> expose a generic API to core kernel to do such allocation.

We already have nodes without memory.
We can also take nodes out of the normal fallback lists.
We also have nodes with special memory (e.g. DMA32)

Nothing you describe here cannot be handled with the existing nodes.

> > The obvious alternative would of course be to add some extra
> > enumeration to the existing nodes.
> 
> We can not extend NUMA node to expose GPU memory. GPU memory on
> current AMD and Intel platform is not cache coherent and thus
> should not be use for random memory allocation. It should really

Sure you don't expose it as normal memory, but it can be still
tied to a node. In fact you have to for the existing topology
interface to work.

> copy and rebuild their data structure inside the new memory. When
> you move over thing like tree or any complex data structure you have
> to rebuilt it ie redo the pointers link between the nodes of your
> data structure.
> 
> This is highly error prone complex and wasteful (you have to burn
> CPU cycles to do that). Now if you can use the same address space
> as all the other memory allocation in your program and move data
> around from one device to another with a common API that works on
> all the various devices, you are eliminating that complex step and
> making the end user life much easier.
> 
> So i am doing this to help existing users by addressing an issues
> that is becoming harder and harder to solve for userspace. My end
> game is to blur the boundary between CPU and device like GPU, FPGA,

This is just high level rationale. You already had that ...

What I was looking for is how applications actually use the 
API.

e.g. 

1. Compute application is looking for fast cache coherent memory 
for CPU usage.

What does it query and how does it decide and how does it allocate?

2. Allocator in OpenCL application is looking for memory to share
with OpenCL. How does it find memory?

3. Storage application is looking for larger but slower memory
for CPU usage. 

4. ...

Please work out some use cases like this.

-Andi


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 19:41               ` Logan Gunthorpe
@ 2018-12-04 20:13                 ` Jerome Glisse
  2018-12-04 20:30                   ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 20:13 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 12:41:39PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 12:22 p.m., Jerome Glisse wrote:
> > So version is a bad prefix, what about type, prefixing target with a
> > type id. So that application that are looking for a certain type of
> > memory (which has a set of define properties) can select them. Having
> > a type file inside the directory and hopping application will read
> > that sysfs file is a recipies for failure from my point of view. While
> > having it in the directory name is making sure that the application
> > has some idea of what it is doing.
> 
> Well I don't think it can be a prefix. It has to be a mask. It might be
> things like cache coherency, persistence, bandwidth and none of those
> things are mutually exclusive.

You are right, many are not exclusive. It is just my feeling that a
mask stored as a file inside the target directory might be overlooked
by the application, which might then start using things it should not.
At the same time, I guess if I write the userspace library that
abstracts this kernel API then I can force applications to select
things properly.

I will use a mask in v2.

> 
> >> Also, in the same vein, I think it's wrong to have the API enumerate all
> >> the different memory available in the system. The API should simply
> >> allow userspace to say it wants memory that can be accessed by a set of
> >> initiators with a certain set of attributes and the bind call tries to
> >> fulfill that or fallback on system memory/hmm migration/whatever.
> > 
> > We have existing application that use topology today to partition their
> > workload and do load balancing. Those application leverage the fact that
> > they are only running on a small set of known platform with known topology
> > here i want to provide a common API so that topology can be queried in a
> > standard by application.
> 
> Existing applications are not a valid excuse for poor API design.
> Remember, once this API is introduced and has real users, it has to be
> maintained *forever*, so we need to get it right. Providing users with
> more information than they need makes it exponentially harder to get
> right and support.

I am not disagreeing about the pain of maintaining an API forever, but
the fact remains that there are existing users, and without a standard
way of exposing this it is impossible to say whether we will see more
users for that information or whether only the existing users will
leverage it.

I do not think there is a way to answer that question. I side with the
view that this API can be dumbed down in userspace by a common library.
So let us expose the topology and let userspace dumb it down.

If we dumb it down in the kernel I see a few pitfalls:
    - the kernel dumbing it down badly
    - the kernel dumbing-down code can grow out of control with
      platform-specific gotchas
    - it is still harder to fix the kernel than userspace in commercial
      distributions (the whole RHEL business of slow-moving, long
      supported kernels). So on those, being able to fix things in
      userspace sounds pretty enticing

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 19:11           ` Logan Gunthorpe
  2018-12-04 19:22             ` Jerome Glisse
@ 2018-12-04 20:14             ` Andi Kleen
  2018-12-04 20:47               ` Logan Gunthorpe
  1 sibling, 1 reply; 94+ messages in thread
From: Andi Kleen @ 2018-12-04 20:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jerome Glisse, Dan Williams, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell

> Also, in the same vein, I think it's wrong to have the API enumerate all
> the different memory available in the system. The API should simply

We need an enumeration API too, just to display to the user what they
have, and possibly for applications to size their buffers 
(all we do with existing NUMA nodes)

But yes the default usage should be to query for necessary attributes

-Andi


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 20:13                 ` Jerome Glisse
@ 2018-12-04 20:30                   ` Logan Gunthorpe
  2018-12-04 20:59                     ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-04 20:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell



On 2018-12-04 1:13 p.m., Jerome Glisse wrote:
> You are right many are non exclusive. It is just my feeling that having
> a mask as a file inside the target directory might be overlook by the
> application which might start using things it should not. At same time
> i guess if i write the userspace library that abstract this kernel API
> then i can enforce application to properly select thing.

I think this is just evidence that this is not a good API. If the user
has the option to just ignore things or do it wrong that's a problem
with the API. Using a prefix for the name doesn't change that fact.

> I do not think there is a way to answer that question. I am siding on the
> side of this API can be dumb down in userspace by a common library. So let
> expose the topology and let userspace dumb it down.

I fundamentally disagree with this approach to designing APIs. Saying
"we'll give you the kitchen sink, add another layer to deal with the
complexity" is actually just eschewing API design and makes it harder
for kernel folks to know what userspace actually requires because they
are multiple layers away.

> If we dumb it down in the kernel i see few pitfalls:
>     - kernel dumbing it down badly
>     - kernel dumbing down code can grow out of control with gotcha
>       for platform

This is just a matter of designing the APIs well. Don't do it badly.

>     - it is still harder to fix kernel than userspace in commercial
>       user space (the whole RHEL business of slow moving and long
>       supported kernel). So on those being able to fix thing in
>       userspace sounds pretty enticing

I hear this argument a lot and it's not compelling to me. I don't think
we should make decisions in upstream code to allow RHEL to bypass the
kernel simply because it would be easier for them to distribute code
changes.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 20:12       ` Andi Kleen
@ 2018-12-04 20:41         ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 20:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Ross Zwisler, Dan Williams, Dave Hansen, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell

On Tue, Dec 04, 2018 at 12:12:26PM -0800, Andi Kleen wrote:
> On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote:
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> > 
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> 
> That seems entirely like the wrong model. We don't want to rewrite every
> application for adding a new memory type.
> 
> Rather there needs to be an abstract way to query memory of specific
> behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency or similar
> 
> Sure there can be a name somewhere, but it should only be used
> for identification purposes, not to hard code in applications.

The discussion with Logan convinced me to use a mask for properties like:
    - cache coherent
    - persistent
    ...

Then files for other properties like:
    - bandwidth (bytes/s)
    - latency
    - granularity (size of individual access or bus width)
    ...
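
As a rough sketch of how a userspace consumer might test such a mask
before touching a target: the flag names, bit values and struct layout
below are hypothetical, nothing in this series defines them yet, and the
latency unit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical property bits; real names and values are not defined yet. */
#define HMS_PROP_CACHE_COHERENT (UINT64_C(1) << 0)
#define HMS_PROP_PERSISTENT     (UINT64_C(1) << 1)

/* Hypothetical view of one target after a helper has read its sysfs files. */
struct hms_target_info {
    uint64_t props;        /* property mask (boolean properties) */
    uint64_t bandwidth;    /* bytes/s */
    uint64_t latency_ns;   /* unit is an assumption */
    uint64_t granularity;  /* bytes per individual access */
};

/* True when the target has every required bit and no forbidden bit. */
static bool hms_target_matches(const struct hms_target_info *t,
                               uint64_t required, uint64_t forbidden)
{
    assert(t != NULL);
    return (t->props & required) == required &&
           (t->props & forbidden) == 0;
}
```

An application that only wants ordinary coherent memory would pass
HMS_PROP_CACHE_COHERENT as required, so a target without that bit is
skipped no matter how attractive its bandwidth file looks.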

> 
> Really you need to define some use cases and describe how your API
> handles them.

I have given examples of how applications look today and how they
change with HMS in my email exchange with Dave Hansen. I will
add them to the documentation and to the cover letter in my next
posting.

> > > 
> > > It sounds like you're trying to define a system call with built in
> > > ioctl? Is that really a good idea?
> > > 
> > > If you need ioctl you know where to find it.
> > 
> > Well i would like to get thing running in the wild with some guinea pig
> > user to get feedback from end user. It would be easier if i can do this
> > with upstream kernel and not some random branch in my private repo. While
> > doing that i would like to avoid commiting to a syscall upstream. So the
> > way i see around this is doing a driver under staging with an ioctl which
> > will be turn into a syscall once some confidence into the API is gain.
> 
> Ok that's fine I guess.
> 
> But should be a clearly defined ioctl, not an ioctl with redefinable parameters
> (but perhaps I misunderstood your description)
>
> > In the present version i took the other approach of defining just one
> > API that can grow to do more thing. I know the unix way is one simple
> > tool for one simple job. I can switch to the simple call for one action.
> 
> Simple calls are better.

I will switch to one simple call for each individual action (policy and
migration).

> > > > +Current memory policy infrastructure is node oriented, instead of
> > > > +changing that and risking breakage and regression HMS adds a new
> > > > +heterogeneous policy tracking infra-structure. The expectation is
> > > > +that existing application can keep using mbind() and all existing
> > > > +infrastructure under-disturb and unaffected, while new application
> > > > +will use the new API and should avoid mix and matching both (as they
> > > > +can achieve the same thing with the new API).
> > > 
> > > I think we need a stronger motivation to define a completely
> > > parallel and somewhat redundant infrastructure. What breakage
> > > are you worried about?
> > 
> > Some memory expose through HMS is not allocated by regular memory
> > allocator. For instance GPU memory is manage by GPU driver, so when
> > you want to use GPU memory (either as a policy or by migrating to it)
> > you need to use the GPU allocator to allocate that memory. HMS adds
> > a bunch of callback to target structure so that device driver can
> > expose a generic API to core kernel to do such allocation.
> 
> We already have nodes without memory.
> We can also take out nodes out of the normal fall back lists.
> We also have nodes with special memory (e.g. DMA32)
> 
> Nothing you describe here cannot be handled with the existing nodes.

There have been patchsets in the past to exclude nodes from allocation.
Last time I checked they were all rejected, and people felt it was not
a good thing to do.

Also, IIRC, adding more nodes might be problematic, as I think we do
not have many bits left in the flags field of struct page. Right now
I do not believe in moving device memory into generic nodes inside the
Linux kernel, because for many folks that would just be a waste: people
who only do desktop work and do not use their GPU for compute would
never get good use out of it. Graphics memory allocation is wildly
different from compute allocation, which is more like CPU allocation.

So converting graphics drivers to register their memory as nodes does
not seem like a good idea at this time. I doubt the GPU folks upstream
would accept that (with my GPU hat on, I would not).


> > > The obvious alternative would of course be to add some extra
> > > enumeration to the existing nodes.
> > 
> > We can not extend NUMA node to expose GPU memory. GPU memory on
> > current AMD and Intel platform is not cache coherent and thus
> > should not be use for random memory allocation. It should really
> 
> Sure you don't expose it as normal memory, but it can be still
> tied to a node. In fact you have to for the existing topology
> interface to work.

The existing topology interface is not used today for that memory,
and people in the GPU world do not see it as an interface that can be
used. See the discussion of GPU memory above. This is the raison
d'être of this proposal: a new way to expose heterogeneous memory
to userspace.


> > copy and rebuild their data structure inside the new memory. When
> > you move over thing like tree or any complex data structure you have
> > to rebuilt it ie redo the pointers link between the nodes of your
> > data structure.
> > 
> > This is highly error prone complex and wasteful (you have to burn
> > CPU cycles to do that). Now if you can use the same address space
> > as all the other memory allocation in your program and move data
> > around from one device to another with a common API that works on
> > all the various devices, you are eliminating that complex step and
> > making the end user life much easier.
> > 
> > So i am doing this to help existing users by addressing an issues
> > that is becoming harder and harder to solve for userspace. My end
> > game is to blur the boundary between CPU and device like GPU, FPGA,
> 
> This is just high level rationale. You already had that ...
> 
> What I was looking for is how applications actually use the 
> API.
> 
> e.g. 
> 
> 1. Compute application is looking for fast cache coherent memory 
> for CPU usage.
> 
> What does it query and how does it decide and how does it allocate?

The application has an OpenCL context. From the context it gets the
device initiator's unique id, and from that id it looks at all the
links and bridges the initiator is connected to. This gives it a list
of links, which it can order by bandwidth first and latency second
(i.e. two links with the same bandwidth are ordered with the
lowest-latency one first). It goes over that list from best to worst
and for each link it looks at which targets are also connected to that
link. From that it builds an ordered list of targets, keeping only
cache coherent memory in that list.

It now uses this ordered list of targets to set a policy or to migrate
its buffer to the best memory. The kernel will first try to use the
first target; if it runs out of that memory it will use the next
target, and so on.
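
The ordering step above could look roughly like the following in a
helper library. This is only a sketch under simplifying assumptions:
struct hms_link and both function names are hypothetical, each link is
assumed to reach a single target, and duplicate targets are not merged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory view of one link, as read from sysfs. */
struct hms_link {
    uint64_t bandwidth;  /* bytes/s, higher is better */
    uint64_t latency;    /* lower is better */
    int      target;     /* id of the one target reachable over this link */
    int      coherent;   /* non-zero if that target is cache coherent */
};

/* Best link first: bandwidth descending, then latency ascending. */
static int hms_link_cmp(const void *a, const void *b)
{
    const struct hms_link *x = a, *y = b;

    if (x->bandwidth != y->bandwidth)
        return x->bandwidth > y->bandwidth ? -1 : 1;
    if (x->latency != y->latency)
        return x->latency < y->latency ? -1 : 1;
    return 0;
}

/*
 * Sort the links in place and write the ids of the cache coherent
 * targets, best first, into out[]. Returns the number of ids written
 * (at most max).
 */
static size_t hms_order_targets(struct hms_link *links, size_t n,
                                int *out, size_t max)
{
    size_t i, count = 0;

    assert(links != NULL && out != NULL);
    qsort(links, n, sizeof(*links), hms_link_cmp);
    for (i = 0; i < n && count < max; i++)
        if (links[i].coherent)
            out[count++] = links[i].target;
    return count;
}
```

The resulting array is what would be handed to the policy or migrate
call, with the kernel falling back from out[0] to out[1] and so on when
a target runs out of memory.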


This can all be done inside a common userspace helper library for
ease of use. More advanced applications will do finer-grained
allocation; for instance, they will partition their dataset by access
frequency. The most frequently accessed data will use the fastest
memory (which is likely to be somewhat small, i.e. a few gigabytes),
while data that is accessed more sparsely will be pushed to slower
memory (of which there is more).


> 2. Allocator in OpenCL application is looking for memory to share
> with OpenCL. How does it find memory?

Same process as above: start from the initiator id, build the list of
links, then build the list of all targets that the initiator can
access. Then order that list according to the property of interest to
the application (bandwidth, latency, ...). Once it has the target
list it can use either policy or migration: policy if it is for a new
allocation, migration if it is to move an existing buffer to memory
that is more appropriate for the OpenCL device in use.


> 3. Storage application is looking for larger but slower memory
> for CPU usage.

The application builds a list of initiators corresponding to the CPUs
it is using (is bound to). From that list of initiators it builds
a list of links (considering bridges too). From the list of links
it builds a list of targets (connected to those links).

Then it orders the list of targets by size (not by latency or
bandwidth). Once it has an ordered list of targets it uses
either the policy or the migrate API for the range of virtual
addresses it wants to affect.


> 
> 4. ...
> 
> Please work out some use cases like this.

Note that all of the list building in userspace described above is
intended to be done by a helper library, as this is really boilerplate
code. The last patch in my series has userspace helpers to parse the
sysfs; I will grow that into a mini library with examples to showcase
it.

More examples from another part of this email thread:

High level overview of how one application looks today:

    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocates memory on device A and copies over the
       dataset
    3) Application runs some CPU code to format the copy of the dataset
       inside device A memory (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    4) Application runs code on device A that uses the dataset
    5) Application allocates memory on device B and copies over the
       result from device A
    6) Application runs some CPU code to format the copy of the dataset
       inside device B (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    7) Application runs code on device B that uses the dataset
    8) Application copies the result over from device B and keeps on
       doing its thing

How it looks with HMS:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate to device A memory
    4) Application runs code on device A that uses the dataset
    5-6) Application calls HMS to migrate to device B memory
    7) Application runs code on device B that uses the dataset
    8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild the data
structure inside each device address space.
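
To make the cost of steps 3 and 6 concrete, here is a minimal sketch of
what "rebuild pointers" means for a linked structure that is copied to
a different address. struct node and copy_and_rebuild() are made up for
illustration, and the sketch assumes every next pointer refers to a
node inside the same array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct node {
    struct node *next;
    int value;
};

/*
 * Copy n nodes from src to dst, then redo every link so that it points
 * into dst instead of src. A plain memcpy() alone would leave dst full
 * of pointers into the old buffer.
 */
static void copy_and_rebuild(struct node *dst, const struct node *src,
                             size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n * sizeof(*src));
    for (i = 0; i < n; i++)
        if (src[i].next != NULL)
            dst[i].next = dst + (src[i].next - src);
}
```

With migration the virtual addresses do not change, so this fixup pass
(one operation per pointer in the dataset) is not needed at all.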


The example above is for migration. Here is an example of how the
topology is used today:

    The application knows that the platform it is running on has 16
    GPUs split into 2 groups of 8 GPUs each. The GPUs in each group
    can access each other's memory over dedicated mesh links between
    them. Full speed, no traffic bottleneck.

    The application splits its GPU computation in 2 so that each
    partition runs on a group of interconnected GPUs, allowing
    them to share the dataset.

With HMS:
    The application can query the kernel to discover the topology of
    the system it is running on and use it to partition and balance
    its workload accordingly. The same application should now be able
    to run on a new platform without having to be adapted to it.
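
One way an application (or the helper library) might turn queried links
into workload partitions is a small union-find over the GPU ids. This
is a sketch only: the type and function names are hypothetical, GPU ids
are assumed to be numbered 0..n-1, and a "link" here means a direct
peer link as reported by the topology.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GPUS 64

/* Union-find over GPU ids; each set is one group of connected GPUs. */
struct gpu_groups {
    int parent[MAX_GPUS];
    size_t n;
};

static void groups_init(struct gpu_groups *g, size_t n)
{
    size_t i;

    assert(g != NULL && n <= MAX_GPUS);
    g->n = n;
    for (i = 0; i < n; i++)
        g->parent[i] = (int)i;
}

/* Return the representative GPU of the group that gpu belongs to. */
static int groups_find(struct gpu_groups *g, int gpu)
{
    assert(gpu >= 0 && (size_t)gpu < g->n);
    while (g->parent[gpu] != gpu) {
        g->parent[gpu] = g->parent[g->parent[gpu]]; /* path halving */
        gpu = g->parent[gpu];
    }
    return gpu;
}

/* Record a direct peer link between GPUs a and b. */
static void groups_link(struct gpu_groups *g, int a, int b)
{
    int ra = groups_find(g, a), rb = groups_find(g, b);

    if (ra != rb)
        g->parent[rb] = ra;
}

/* Number of distinct groups, i.e. how many partitions to create. */
static size_t groups_count(struct gpu_groups *g)
{
    size_t i, count = 0;

    for (i = 0; i < g->n; i++)
        if (groups_find(g, (int)i) == (int)i)
            count++;
    return count;
}
```

On the 16 GPU platform described above, feeding in the mesh links would
yield two groups of 8, and the same code would adapt to a platform with
a different layout.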

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 20:14             ` Andi Kleen
@ 2018-12-04 20:47               ` Logan Gunthorpe
  2018-12-04 21:15                 ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-04 20:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jerome Glisse, Dan Williams, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell



On 2018-12-04 1:14 p.m., Andi Kleen wrote:
>> Also, in the same vein, I think it's wrong to have the API enumerate all
>> the different memory available in the system. The API should simply

> We need an enumeration API too, just to display to the user what they
> have, and possibly for applications to size their buffers 
> (all we do with existing NUMA nodes)

Yes, but I think my main concern is the conflation of the enumeration
API and the binding API. An application doesn't want to walk through all
the possible memory and types in the system just to get some memory that
will work with a couple initiators (which it somehow has to map to
actual resources, like fds). We also don't want userspace to police
itself on which memory works with which initiator.

Enumeration is definitely not the common use case. And if we create a
new enumeration API now, it may make it difficult or impossible to unify
these types of memory with the existing NUMA node hierarchies if/when
this gets more integrated with the mm core.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 20:30                   ` Logan Gunthorpe
@ 2018-12-04 20:59                     ` Jerome Glisse
  2018-12-04 21:19                       ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 20:59 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 01:30:01PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 1:13 p.m., Jerome Glisse wrote:
> > You are right many are non exclusive. It is just my feeling that having
> > a mask as a file inside the target directory might be overlook by the
> > application which might start using things it should not. At same time
> > i guess if i write the userspace library that abstract this kernel API
> > then i can enforce application to properly select thing.
> 
> I think this is just evidence that this is not a good API. If the user
> has the option to just ignore things or do it wrong that's a problem
> with the API. Using a prefix for the name doesn't change that fact.

How do we expose harmful memory to userspace then? How can I expose
non-cache-coherent memory? Because yes, there are applications out
there that use it today and would like to be able to migrate to and
from that memory dynamically during the lifetime of the application,
as the data set progresses through the application's processing
pipeline.

There are kinds of memory that violate the memory model you expect from
the architecture. This memory is still useful nonetheless and it has
other enticing properties (like bandwidth or latency). The whole point
of my proposal is to expose this memory in a generic way so that
applications that today rely on a gazillion device-specific APIs can
move over to a common kernel API and consolidate their memory
management on top of a common kernel layer.

The dilemma I am facing is exposing this memory while preventing
unaware applications from accidentally using it just because it is
there, without understanding the implications that come with it.

If you have any idea on how to expose this to userspace in a common
API, I would happily take any suggestion :) My idea is this patchset,
and I agree there are many things to improve and I have already taken
many of the suggestions given so far.


> 
> > I do not think there is a way to answer that question. I am siding on the
> > side of this API can be dumb down in userspace by a common library. So let
> > expose the topology and let userspace dumb it down.
> 
> I fundamentally disagree with this approach to designing APIs. Saying
> "we'll give you the kitchen sink, add another layer to deal with the
> complexity" is actually just eschewing API design and makes it harder
> for kernel folks to know what userspace actually requires because they
> are multiple layers away.

Note that I do not expose things like physical addresses or even split
memory in a node into individual devices; in fact I expose less
information than the existing NUMA (no zone, phys index, ...), as I do
not think those have any value to userspace. What matters to userspace
is where this memory is in my topology, so I can look at all the
initiator nodes that are close by. Or the reverse: I have a set of
initiators, what is the set of closest targets to all those initiators?
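
The two queries described above (initiators close to a target, and
targets close to a whole set of initiators) can be sketched as set
operations on a toy graph. This is only an illustration under stated
assumptions: the node names, the LINKS table and both helper functions
are hypothetical and are not part of the patchset, and "close" is
simplified to "attached to the same link".

```python
# Toy model of a link-based topology. Every name here is hypothetical
# and only serves to illustrate the two queries; it is not the HMS API.
LINKS = {
    "link0": {"cpu0", "gpu0", "ddr0", "hbm0"},
    "link1": {"cpu1", "gpu1", "ddr1"},
    "link2": {"cpu0", "cpu1", "ddr0", "ddr1"},
}
TARGETS = {"ddr0", "ddr1", "hbm0"}
INITIATORS = {"cpu0", "cpu1", "gpu0", "gpu1"}


def initiators_near(target):
    """Initiators that share at least one link with `target`."""
    near = set()
    for members in LINKS.values():
        if target in members:
            near |= members & INITIATORS
    return near


def targets_for(initiators):
    """Targets sitting on a link that connects every given initiator."""
    wanted = set(initiators)
    result = set()
    for members in LINKS.values():
        if wanted <= members:
            result |= members & TARGETS
    return result


if __name__ == "__main__":
    print(sorted(initiators_near("hbm0")))
    print(sorted(targets_for({"cpu0", "cpu1"})))
```

A real answer would also have to weigh per-link properties such as
bandwidth and latency; the sketch ignores them on purpose.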

I feel this is simple enough for anyone to understand. It allows
describing any topology; a libhms can dumb it down for the average
application and more advanced applications can use the full
description. There are examples of such applications today. I argue
that if we provide a common API we might see more applications, but I
won't pretend that I know that for a fact. I am just making an
assumption here.


> 
> > If we dumb it down in the kernel i see few pitfalls:
> >     - kernel dumbing it down badly
> >     - kernel dumbing down code can grow out of control with gotcha
> >       for platform
> 
> This is just a matter of designing the APIs well. Don't do it badly.

I am talking about the inevitable fact that at some point some system
firmware will misrepresent its platform. System firmware writers
usually copy and paste things with little regard to what has changed
from one platform to the next. So there will inevitably be workarounds,
and I would rather see those piling up inside a userspace library than
inside the kernel.

Note that I expect that the errors won't be fatal but more along the
lines of reporting wrong values for bandwidth, latency, ... So the
kernel will most likely be unaffected by system firmware errors, but
those will affect the performance of applications that are given
inaccurate information.


> >     - it is still harder to fix kernel than userspace in commercial
> >       user space (the whole RHEL business of slow moving and long
> >       supported kernel). So on those being able to fix thing in
> >       userspace sounds pretty enticing
> 
> I hear this argument a lot and it's not compelling to me. I don't think
> we should make decisions in upstream code to allow RHEL to bypass the
> kernel simply because it would be easier for them to distribute code
> changes.

Ok, I will not bring it up. I have suffered enough on that front that
I have a trauma about this ;)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 20:47               ` Logan Gunthorpe
@ 2018-12-04 21:15                 ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 21:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Andi Kleen, Dan Williams, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Ross Zwisler,
	Dave Hansen, Haggai Eran, balbirs, Aneesh Kumar K.V,
	Benjamin Herrenschmidt, Kuehling, Felix, Philip.Yang, Koenig,
	Christian, Blinzer, Paul, John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 01:47:17PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 1:14 p.m., Andi Kleen wrote:
> >> Also, in the same vein, I think it's wrong to have the API enumerate all
> >> the different memory available in the system. The API should simply
> 
> > We need an enumeration API too, just to display to the user what they
> > have, and possibly for applications to size their buffers 
> > (all we do with existing NUMA nodes)
> 
> Yes, but I think my main concern is the conflation of the enumeration
> API and the binding API. An application doesn't want to walk through all
> the possible memory and types in the system just to get some memory that
> will work with a couple initiators (which it somehow has to map to
> actual resources, like fds). We also don't want userspace to police
> itself on which memory works with which initiator.

How would an application police itself? The API I am proposing is best
effort, and as such the kernel can fully ignore a userspace request, as
it sometimes does now with mbind(). So the kernel always has the last
call and can always override the application's decision.

A device driver can also decide to override; anything that is kernel
side really has more power than userspace would have. So while we
give trust to userspace we do not abdicate control. That is not the
intention here.


> Enumeration is definitely not the common use case. And if we create a
> new enumeration API now, it may make it difficult or impossible to unify
> these types of memory with the existing NUMA node hierarchies if/when
> this gets more integrated with the mm core.

The point I am trying to make is that it cannot get integrated as a
regular NUMA node inside the mm core. Rather, the mm core can grow to
encompass non-NUMA-node memory. I explained why in other parts of this
thread, but roughly:

- Device drivers need to be in control of device memory allocation
  for backward compatibility reasons and to keep fulfilling things
  like graphics API constraints (OpenGL, Vulkan, X, ...).

- Adding a new node type is problematic inside mm as we are running
  out of bits in struct page.

- Excluding nodes from the regular allocation path was rejected by
  upstream previously (IBM did post a patchset for that IIRC).

I feel it is a safer path to avoid a one-model-fits-all here and
to accept that device memory will be represented and managed in
a different way from other memory. I believe the persistent memory
folks feel the same on that front.

Nonetheless I do want to expose this device memory in a standard
way so that we can consolidate and improve the user experience on
that front. Eventually I hope that more of the device memory
management can be turned into common device memory management
inside core mm, but I do not want to enforce that at first as it
is likely to fail (building a moonbase before you have a moon
rocket). I would rather grow organically from a high-level API that
will get used right away (it is a matter of converting existing
users to it, s/computeAPIBind/HMSBind).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 20:59                     ` Jerome Glisse
@ 2018-12-04 21:19                       ` Logan Gunthorpe
  2018-12-04 21:51                         ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-04 21:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-04 1:59 p.m., Jerome Glisse wrote:
> How to expose harmful memory to userspace then ? How can i expose
> non cache coherent memory because yes they are application out there
> that use that today and would like to be able to migrate to and from
> that memory dynamicly during lifetime of the application as the data
> set progress through the application processing pipeline.

I'm not arguing against the purpose or use cases. I'm being critical of
the API choices.

> Note that i do not expose things like physical address or even splits
> memory in a node into individual device, in fact in expose less
> information that the existing NUMA (no zone, phys index, ...). As i do
> not think those have any value to userspace. What matter to userspace
> is where is this memory is in my topology so i can look at all the
> initiators node that are close by. Or the reverse, i have a set of
> initiators what is the set of closest targets to all those initiators.

No, what matters to applications is getting memory that will work for
the initiators/resources they need it to work for. The specific topology
might be of interest to administrators but it is not what applications
need. And it should be relatively easy to flesh out the existing sysfs
device tree to provide the topology information administrators need.

> I am talking about the inevitable fact that at some point some system
> firmware will miss-represent their platform. System firmware writer
> usualy copy and paste thing with little regards to what have change
> from one platform to the new. So their will be inevitable workaround
> and i would rather see those piling up inside a userspace library than
> inside the kernel.

It's *absolutely* the kernel's responsibility to patch issues caused by
broken firmware. We have quirks all over the place for this. That's
never something userspace should be responsible for. Really, this is the
raison d'etre of the kernel: to provide userspace with a uniform
execution environment -- if every application had to deal with broken
firmware it would be a nightmare.

Logan


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 18:49   ` Jerome Glisse
  2018-12-04 18:54     ` Dave Hansen
@ 2018-12-04 21:37     ` Dave Hansen
  2018-12-04 21:57       ` Jerome Glisse
  2018-12-05 11:27     ` Aneesh Kumar K.V
  2 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-04 21:37 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On 12/4/18 10:49 AM, Jerome Glisse wrote:
>> Also, could you add a simple, example program for how someone might use
>> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
>> characterize how this would work with the *exiting* NUMA interfaces that
>> we have?
> That is the issue i can not expose device memory as NUMA node as
> device memory is not cache coherent on AMD and Intel platform today.
> 
> More over in some case that memory is not visible at all by the CPU
> which is not something you can express in the current NUMA node.

Yeah, our NUMA mechanisms are for managing memory that the kernel itself
manages in the "normal" allocator and supports a full feature set on.
That has a bunch of implications, like that the memory is cache coherent
and accessible from everywhere.

The HMAT patches only comprehend this "normal" memory, which is why
we're extending the existing /sys/devices/system/node infrastructure.

This series has a much more aggressive goal, which is comprehending the
connections of every memory-target to every memory-initiator, no matter
who is managing the memory, who can access it, or what it can be used for.

Theoretically, HMS could be used for everything that we're doing with
/sys/devices/system/node, as long as it's tied back into the existing
NUMA infrastructure _somehow_.

Right?

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 21:19                       ` Logan Gunthorpe
@ 2018-12-04 21:51                         ` Jerome Glisse
  2018-12-04 22:16                           ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 21:51 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 02:19:09PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 1:59 p.m., Jerome Glisse wrote:
> > How to expose harmful memory to userspace then ? How can i expose
> > non cache coherent memory because yes they are application out there
> > that use that today and would like to be able to migrate to and from
> > that memory dynamicly during lifetime of the application as the data
> > set progress through the application processing pipeline.
> 
> I'm not arguing against the purpose or use cases. I'm being critical of
> the API choices.
> 
> > Note that i do not expose things like physical address or even splits
> > memory in a node into individual device, in fact in expose less
> > information that the existing NUMA (no zone, phys index, ...). As i do
> > not think those have any value to userspace. What matter to userspace
> > is where is this memory is in my topology so i can look at all the
> > initiators node that are close by. Or the reverse, i have a set of
> > initiators what is the set of closest targets to all those initiators.
> 
> No, what matters to applications is getting memory that will work for
> the initiators/resources they need it to work for. The specific topology
> might be of interest to administrators but it is not what applications
> need. And it should be relatively easy to flesh out the existing sysfs
> device tree to provide the topology information administrators need.

Existing users would disagree: in my cover letter I have given
pointers to an existing library and papers from HPC folks that do
leverage system topology (among the few who do). So there are
applications _today_ that do use topology information to adapt their
workload to maximize performance for the platform they run on.

There are also some new platforms that have much more complex
topologies that definitely cannot be represented as a tree like the
sysfs we have today (I believe that even some of the HPC folks have
topologies _today_ that are not tree-like).

So existing users plus arbitrary graph topologies becoming more common
led me to the choice I made in this API. I believe a graph is something
that can easily be understood by people. I am not inventing some
weird new data structure, it is just a graph, and for the names I have
used the ACPI naming convention, but I am more than open to using
memory for target and differentiating cpu and device instead of using
initiator as a name. I do not have strong feelings on that. I would
however like to be able to represent any topology and to be able to use
device memory that is not managed by core mm, for reasons I explained
previously.

Note that if it turns out to be a bad idea, the kernel can decide to
dumb things down in a future version for new platforms. So it could
give a flat graph to userspace; there is nothing precluding that.


> > I am talking about the inevitable fact that at some point some system
> > firmware will miss-represent their platform. System firmware writer
> > usualy copy and paste thing with little regards to what have change
> > from one platform to the new. So their will be inevitable workaround
> > and i would rather see those piling up inside a userspace library than
> > inside the kernel.
> 
> It's *absolutely* the kernel's responsibility to patch issues caused by
> broken firmware. We have quirks all over the place for this. That's
> never something userspace should be responsible for. Really, this is the
> raison d'etre of the kernel: to provide userspace with a uniform
> execution environment -- if every application had to deal with broken
> firmware it would be a nightmare.

You cut the other paragraph that explained why they are unlikely
to be broken badly enough to break the kernel.

Anyway we can fix the topology in the kernel too ... that is fine with
me.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 21:37     ` Dave Hansen
@ 2018-12-04 21:57       ` Jerome Glisse
  2018-12-04 23:58         ` Dave Hansen
  2018-12-05  1:22         ` Kuehling, Felix
  0 siblings, 2 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 21:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> >> Also, could you add a simple, example program for how someone might use
> >> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> >> characterize how this would work with the *exiting* NUMA interfaces that
> >> we have?
> > That is the issue i can not expose device memory as NUMA node as
> > device memory is not cache coherent on AMD and Intel platform today.
> > 
> > More over in some case that memory is not visible at all by the CPU
> > which is not something you can express in the current NUMA node.
> 
> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
> manages in the "normal" allocator and supports a full feature set on.
> That has a bunch of implications, like that the memory is cache coherent
> and accessible from everywhere.
> 
> The HMAT patches only comprehend this "normal" memory, which is why
> we're extending the existing /sys/devices/system/node infrastructure.
> 
> This series has a much more aggressive goal, which is comprehending the
> connections of every memory-target to every memory-initiator, no matter
> who is managing the memory, who can access it, or what it can be used for.
> 
> Theoretically, HMS could be used for everything that we're doing with
> /sys/devices/system/node, as long as it's tied back into the existing
> NUMA infrastructure _somehow_.
> 
> Right?

Fully correct. Mind if I steal that perfect summary description next
time I post? I am so bad at explaining things :)

The intention is to allow programs to do everything they do with
mbind() today, and tomorrow with the HMAT patchset, and on top of that
to also be able to do what they do today through APIs like OpenCL,
ROCm, CUDA ... So it is one kernel API to rule them all ;)

Also, at first I intend to special-case vma page allocation when they
have an HMS policy; long term I would like to merge the code paths
inside the kernel. But I do not want to disrupt existing code paths
today, I would rather grow to that organically, step by step. mbind()
would still work unaffected in the end; just the plumbing would be
slightly different.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 21:51                         ` Jerome Glisse
@ 2018-12-04 22:16                           ` Logan Gunthorpe
  2018-12-04 23:56                             ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-04 22:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-04 2:51 p.m., Jerome Glisse wrote:
> Existing user would disagree in my cover letter i have given pointer
> to existing library and paper from HPC folks that do leverage system
> topology (among the few who are). So they are application _today_ that
> do use topology information to adapt their workload to maximize the
> performance for the platform they run on.

Well we need to give them what they actually need, not what they want to
shoot their foot with. And I imagine, much of what they actually do
right now belongs firmly in the kernel. Like I said, existing
applications are not justifications for bad API design or layering
violations.

You've even mentioned we'd need a simplified "libhms" interface for
applications. We should really just figure out what that needs to be and
make that the kernel interface.

> They are also some new platform that have much more complex topology
> that definitly can not be represented as a tree like today sysfs we
> have (i believe that even some of the HPC folks have _today_ topology
> that are not tree-like).

The sysfs tree already allows for a complex graph that describes
existing hardware very well. If there is hardware it cannot describe
then we should work to improve it and not just carve off a whole new
area for a special API. -- In fact, you are already using sysfs, just
under your own virtual/non-existent bus.

> Note that if it turn out to be a bad idea kernel can decide to dumb
> down thing in future version for new platform. So it could give a
> flat graph to userspace, there is nothing precluding that.

Uh... if it turns out to be a bad idea we are screwed because we have an
API existing applications are using. It's much easier to add features to
a simple (your word: "dumb") interface than it is to take options away
from one that is too broad.

> 
>>> I am talking about the inevitable fact that at some point some system
>>> firmware will miss-represent their platform. System firmware writer
>>> usualy copy and paste thing with little regards to what have change
>>> from one platform to the new. So their will be inevitable workaround
>>> and i would rather see those piling up inside a userspace library than
>>> inside the kernel.
>>
>> It's *absolutely* the kernel's responsibility to patch issues caused by
>> broken firmware. We have quirks all over the place for this. That's
>> never something userspace should be responsible for. Really, this is the
>> raison d'etre of the kernel: to provide userspace with a uniform
>> execution environment -- if every application had to deal with broken
>> firmware it would be a nightmare.
> 
> You cuted the other paragraph that explained why they will unlikely
> to be broken badly enough to break the kernel.

That was entirely beside the point. Just because it doesn't break the
kernel itself doesn't make it any less necessary for it to be fixed
inside the kernel. It must be done in a common place so every
application doesn't have to maintain a table of hardware quirks.

Logan

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-03 23:34 [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind() jglisse
                   ` (15 preceding siblings ...)
  2018-12-04 18:02 ` Dave Hansen
@ 2018-12-04 23:54 ` Dave Hansen
  2018-12-05  0:15   ` Jerome Glisse
  16 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-04 23:54 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Rafael J . Wysocki, Matthew Wilcox,
	Ross Zwisler, Keith Busch, Dan Williams, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> This patchset use the above scheme to expose system topology through
> sysfs under /sys/bus/hms/ with:
>     - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>       each has a UID and you can usual value in that folder (node id,
>       size, ...)
> 
>     - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>       (CPU or device), each has a HMS UID but also a CPU id for CPU
>       (which match CPU id in (/sys/bus/cpu/). For device you have a
>       path that can be PCIE BUS ID for instance)
> 
>     - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
>       UID and a file per property (bandwidth, latency, ...) you also
>       find a symlink to every target and initiator connected to that
>       link.
> 
>     - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
>       a UID and a file per property (bandwidth, latency, ...) you
>       also find a symlink to all initiators that can use that bridge.

We support 1024 NUMA nodes on x86.  The ACPI HMAT expresses the
connections between each node.  Let's suppose that each node has some
CPUs and some memory.

That means we'll have 1024 target directories in sysfs, 1024 initiator
directories in sysfs, and 1024*1024 link directories.  Or, would the
kernel be responsible for "compiling" the firmware-provided information
down into a more manageable number of links?
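
As a back-of-the-envelope check of the numbers above, here is a minimal
sketch that counts the directories under the worst-case reading of one
target and one initiator per node plus one link per ordered pair of
nodes. The function name is hypothetical, and the one-link-per-pair
assumption is just the pessimistic case raised here, not something the
patchset states.

```python
def sysfs_dir_count(nodes):
    """Directories needed if every node is both target and initiator
    and every ordered (initiator, target) pair gets its own link."""
    targets = nodes
    initiators = nodes
    links = nodes * nodes
    return targets + initiators + links


if __name__ == "__main__":
    # 1024 is the x86 NUMA node limit mentioned above.
    print(sysfs_dir_count(1024))
```

For 1024 nodes this gives 1024 + 1024 + 1024*1024 = 1050624
directories, nearly all of them links.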

Some idiot made the mistake of having one sysfs directory per 128MB of
memory way back when, and now we have hundreds of thousands of
/sys/devices/system/memory/memoryX directories.  That sucks to manage.
Isn't this potentially repeating that mistake?

Basically, is sysfs the right place to even expose this much data?

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 22:16                           ` Logan Gunthorpe
@ 2018-12-04 23:56                             ` Jerome Glisse
  2018-12-05  1:15                               ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-04 23:56 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 03:16:54PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 2:51 p.m., Jerome Glisse wrote:
> > Existing user would disagree in my cover letter i have given pointer
> > to existing library and paper from HPC folks that do leverage system
> > topology (among the few who are). So they are application _today_ that
> > do use topology information to adapt their workload to maximize the
> > performance for the platform they run on.
> 
> Well we need to give them what they actually need, not what they want to
> shoot their foot with. And I imagine, much of what they actually do
> right now belongs firmly in the kernel. Like I said, existing
> applications are not justifications for bad API design or layering
> violations.

One example I have is 4 nodes (CPU sockets), each node with 8 GPUs,
with two 8-GPU nodes connected to each other through a fast mesh (i.e.
each GPU can peer-to-peer with any other at the same bandwidth). Then
each such 2-node block is connected to the other block through a
shared link.

So it looks like:
    SOCKET0----SOCKET1-----SOCKET2----SOCKET3
    |          |           |          |
    S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
    ||     \\//            ||     \\//
    ||     //\\            ||     //\\
    ...    ====...    -----...    ====...
    ||     \\//            ||     \\//
    ||     //\\            ||     //\\
    S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7

The application partitions its workload in 2, i.e. it allocates the
dataset twice, once for each group of 16 GPUs. Each of the 2 partitions
is then further split in two for some of the buffers in the dataset but
not all.

So AFAICT they are using all the topology information. They see
that there are 4 groups of GPUs, that among those 4 groups there are
2 pairs of groups with a better interconnect, and then a shared,
slower interconnect between the 2 pairs.

From HMS point of view this looks like (ignoring CPU):
link0: S0-GPU0 ... S0-GPU7
link1: S1-GPU0 ... S1-GPU7
link2: S2-GPU0 ... S2-GPU7
link3: S3-GPU0 ... S3-GPU7

link4: S0-GPU0 ... S0-GPU7 S1-GPU0 ... S1-GPU7
link5: S2-GPU0 ... S2-GPU7 S3-GPU0 ... S3-GPU7

link6: S0-GPU0 ... S0-GPU7 S1-GPU0 ... S1-GPU7
       S2-GPU0 ... S2-GPU7 S3-GPU0 ... S3-GPU7

Dumb it down any further and they lose information they want. On top
of that there are also the NUMA CPU nodes (which are more symmetric).
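
To illustrate why the link listing above carries information the
application wants, here is a minimal sketch that models the seven links
as sets of GPUs and looks up the smallest link covering a given group.
Everything in it is an assumption made for illustration: the helper
names are hypothetical, and "smallest link" is only a stand-in for
"best interconnect", since no bandwidth figures are given here.

```python
def gpus(socket):
    """The 8 GPUs attached to one socket, named as in the listing."""
    return {f"S{socket}-GPU{i}" for i in range(8)}


# Same membership as link0..link6 in the listing (CPUs ignored).
LINKS = {
    "link0": gpus(0),
    "link1": gpus(1),
    "link2": gpus(2),
    "link3": gpus(3),
    "link4": gpus(0) | gpus(1),
    "link5": gpus(2) | gpus(3),
    "link6": gpus(0) | gpus(1) | gpus(2) | gpus(3),
}


def tightest_link(group):
    """Name of the smallest link containing every GPU in `group`,
    or None if no single link covers the whole group."""
    wanted = set(group)
    candidates = [(len(members), name)
                  for name, members in LINKS.items()
                  if wanted <= members]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(tightest_link({"S0-GPU0", "S0-GPU7"}))
    print(tightest_link({"S0-GPU0", "S1-GPU3"}))
    print(tightest_link({"S0-GPU0", "S2-GPU0"}))
```

Collapsing link0 through link5 into link6 would make all three lookups
return the same answer, which is the loss of information described
above.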

I do not see how this can be expressed in the sysfs we have today, but
maybe there is a way to shoehorn it in.

I expect more complex topologies to show up with a mix of different
devices (like GPUs and FPGAs).

> 
> You've even mentioned we'd need a simplified "libhms" interface for
> applications. We should really just figure out what that needs to be and
> make that the kernel interface.

No, I said that a libhms for the average application would totally make
sense to dumb things down. I do not expect all applications to use
the full extent of the information. One simple reason: desktop.
On desktop I don't expect the topology to grow too complex, and
thus desktop applications will not care about it (your blender,
libreoffice, ... which are using the GPU today).

But for people creating application that will run on big server,
yes i expect some of them will use that information if only the
existing people that already do use that information.


> > They are also some new platform that have much more complex topology
> > that definitly can not be represented as a tree like today sysfs we
> > have (i believe that even some of the HPC folks have _today_ topology
> > that are not tree-like).
> 
> The sysfs tree already allows for a complex graph that describes
> existing hardware very well. If there is hardware it cannot describe
> then we should work to improve it and not just carve off a whole new
> area for a special API. -- In fact, you are already using sysfs, just
> under your own virtual/non-existent bus.

What would the above example look like ? I fail to see how to do it
inside the current sysfs. Maybe by creating multiple virtual devices,
one for each of the interconnects ? So something like

link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 as children
link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 as children
link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 as children
link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 as children

Then for link4, link5 and link6 we would need symlinks to the GPU
devices. So it sounds like creating virtual devices for the sake of
doing it in the existing framework. Then userspace would have to
learn about these virtual devices to identify them as nodes of the
topology graph, and would have to differentiate them from non-node
devices. This sounds much more complex to me.

Also if doing nodes for those we would need CPU-less and memory-less
NUMA nodes, as the GPU memory is not usable by the CPU ... I am not
sure we want to go there. If that's what people want, fine, but
i personally don't think this is the right solution.


> > Note that if it turn out to be a bad idea kernel can decide to dumb
> > down thing in future version for new platform. So it could give a
> > flat graph to userspace, there is nothing precluding that.
> 
> Uh... if it turns out to be a bad idea we are screwed because we have an
> API existing applications are using. It's much easier to add features to
> a simple (your word: "dumb") interface than it is to take options away
> from one that is too broad.

We all fear that what we do will not get used, but i do not want to
stop making progress because of that. Like i said, i am doing all
this under staging to get the ball rolling, to test it with guinea
pigs and to gain some level of confidence that it is actually useful.
So i am providing evidence today (see all the research in HPC on
memory management, topology, placement, ... for which i have given
some links) and i want to gather more evidence before committing to
this.

I hope this sounds like a reasonable plan. What would you like me to
do differently ? Like i said, i feel that this is a chicken and egg
problem: today there is no standard way to get the topology, so there
is no way to know how many applications would use such information.
We know that very few applications, in special cases, use the
topology information. How to test whether more applications would use
that same information without providing some kind of standard API for
them to get it ?

It is also a system availability thing: right now there are very few
systems with such complex topology, but we are seeing more and more
GPU, TPU, FPGA in more and more environments. I want to be proactive
here and provide an API that would help leverage those new systems
for people experimenting with them.

My proposal is to do HMS behind staging for a while and also avoid
any disruption to existing code paths, and to see whether people
living on the bleeding edge get interested in that information. If
not, then i can strip down my thing to the bare minimum, which is
about device memory.


> >>> I am talking about the inevitable fact that at some point some system
> >>> firmware will miss-represent their platform. System firmware writer
> >>> usualy copy and paste thing with little regards to what have change
> >>> from one platform to the new. So their will be inevitable workaround
> >>> and i would rather see those piling up inside a userspace library than
> >>> inside the kernel.
> >>
> >> It's *absolutely* the kernel's responsibility to patch issues caused by
> >> broken firmware. We have quirks all over the place for this. That's
> >> never something userspace should be responsible for. Really, this is the
> >> raison d'etre of the kernel: to provide userspace with a uniform
> >> execution environment -- if every application had to deal with broken
> >> firmware it would be a nightmare.
> > 
> > You cuted the other paragraph that explained why they will unlikely
> > to be broken badly enough to break the kernel.
> 
> That was entirely beside the point. Just because it doesn't break the
> kernel itself doesn't make it any less necessary for it to be fixed
> inside the kernel. It must be done in a common place so every
> application doesn't have to maintain a table of hardware quirks.

Fine with quirks in kernel. It was just a personal taste thing ...
pure kernel vs ugly userspace :)

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 21:57       ` Jerome Glisse
@ 2018-12-04 23:58         ` Dave Hansen
  2018-12-05  0:29           ` Jerome Glisse
  2018-12-05  1:22         ` Kuehling, Felix
  1 sibling, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-04 23:58 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Keith Busch, Dan Williams, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On 12/4/18 1:57 PM, Jerome Glisse wrote:
> Fully correct mind if i steal that perfect summary description next time
> i post ? I am so bad at explaining thing :)

Go for it!

> Intention is to allow program to do everything they do with mbind() today
> and tomorrow with the HMAT patchset and on top of that to also be able to
> do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> kernel API to rule them all ;)

While I appreciate the exhaustive scope of such a project, I'm really
worried that if we decided to use this for our "HMAT" use cases, we'll
be bottlenecked behind this project while *it* goes through 25 revisions
over 4 or 5 years like HMM did.

So, should we just "park" the enhancements to the existing NUMA
interfaces and infrastructure (think /sys/devices/system/node) and wait
for this to go in?  Do we try to develop them in parallel and make them
consistent?  Or, do we just ignore each other and make Andrew sort it
out in a few years? :)


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 23:54 ` Dave Hansen
@ 2018-12-05  0:15   ` Jerome Glisse
  2018-12-05  1:06     ` Dave Hansen
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05  0:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >     - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >       each has a UID and you can usual value in that folder (node id,
> >       size, ...)
> > 
> >     - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >       (CPU or device), each has a HMS UID but also a CPU id for CPU
> >       (which match CPU id in (/sys/bus/cpu/). For device you have a
> >       path that can be PCIE BUS ID for instance)
> > 
> >     - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >       UID and a file per property (bandwidth, latency, ...) you also
> >       find a symlink to every target and initiator connected to that
> >       link.
> > 
> >     - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >       a UID and a file per property (bandwidth, latency, ...) you
> >       also find a symlink to all initiators that can use that bridge.
> 
> We support 1024 NUMA nodes on x86.  The ACPI HMAT expresses the
> connections between each node.  Let's suppose that each node has some
> CPUs and some memory.
> 
> That means we'll have 1024 target directories in sysfs, 1024 initiator
> directories in sysfs, and 1024*1024 link directories.  Or, would the
> kernel be responsible for "compiling" the firmware-provided information
> down into a more manageable number of links?
> 
> Some idiot made the mistake of having one sysfs directory per 128MB of
> memory way back when, and now we have hundreds of thousands of
> /sys/devices/system/memory/memoryX directories.  That sucks to manage.
> Isn't this potentially repeating that mistake?
> 
> Basically, is sysfs the right place to even expose this much data?

I definitely want to avoid the memoryX mistake. So i do not want to
see one link directory per device. Taking my simple laptop as an
example with 4 CPUs, a wifi and 2 GPUs (the integrated one and a
discrete one):

link0: cpu0 cpu1 cpu2 cpu3
link1: wifi (2 PCIe lanes)
link2: gpu0 (unknown number of lanes but i believe it has higher
             bandwidth to main memory)
link3: gpu1 (16 PCIe lanes)
link4: gpu1 and gpu memory

So one link directory per number of PCIe lanes your devices have,
so that you can differentiate on bandwidth. The main memory is
symlinked inside all the link directories except link4. The GPU
discrete memory is only in the link4 directory as it is only
accessible by the GPU (we could add it under link3 too, with the
non cache coherent property attached to it).


The issue then becomes how to boil down the overly verbose HMAT
information to populate some reasonable layout for HMS. For that
i would create a link directory for each distinct matrix cell value.
As an example, let's say that each entry in the matrix has bandwidth
and latency; then we create a link directory for each combination of
bandwidth and latency. On a simple system that should boil down to
a handful of combinations, roughly mirroring the example above of
one link directory per number of PCIe lanes.

I don't think i have a system with an HMAT table; if you have one
HMAT table to provide i could show the end result.

Note i believe the ACPI HMAT matrix is a bad design for that reason,
ie there is a lot of commonality between many of the matrix entries,
and many entries also do not make sense (ie an initiator not being
able to access all the targets). I feel that link/bridge is much more
compact and allows representing any directed graph, including
multiple arrows from one node to the same other node.

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 23:58         ` Dave Hansen
@ 2018-12-05  0:29           ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05  0:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Keith Busch, Dan Williams, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 03:58:23PM -0800, Dave Hansen wrote:
> On 12/4/18 1:57 PM, Jerome Glisse wrote:
> > Fully correct mind if i steal that perfect summary description next time
> > i post ? I am so bad at explaining thing :)
> 
> Go for it!
> 
> > Intention is to allow program to do everything they do with mbind() today
> > and tomorrow with the HMAT patchset and on top of that to also be able to
> > do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> > kernel API to rule them all ;)
> 
> While I appreciate the exhaustive scope of such a project, I'm really
> worried that if we decided to use this for our "HMAT" use cases, we'll
> be bottlenecked behind this project while *it* goes through 25 revisions
> over 4 or 5 years like HMM did.
> 
> So, should we just "park" the enhancements to the existing NUMA
> interfaces and infrastructure (think /sys/devices/system/node) and wait
> for this to go in?  Do we try to develop them in parallel and make them
> consistent?  Or, do we just ignore each other and make Andrew sort it
> out in a few years? :)

Let's have a battle with giant foam q-tips at next LSF/MM and see who wins ;)

More seriously, i think you should go ahead with Keith's HMAT patchset
and make progress there. In the HMAT case you can grow and evolve the
NUMA node infrastructure to address your needs, and i believe you are
doing it in a sensible way. But i do not see a path for what i am
trying to achieve in that framework. If anyone has a good idea i
would welcome it.

In the meantime i hope i can make progress with my proposal here under
staging. Once i get enough stuff working in userspace and convince
guinea pigs (i need to find a better name for those poor people i will
coerce into testing this ;)), i can have some hard evidence of which
things in my proposal are useful on some concrete cases, with an open
source stack from top to bottom. It might mean stripping down what i
am proposing today to what turns out to be useful. Then start a
discussion about merging the underlying kernel code into one (while
preserving all existing API) and getting out of staging with real
syscalls we will have to die with.

I know that at the very least the hbind() and hpolicy() syscalls would
be successful, as the HPC folks have been dreaming of this. The
topology thing is harder to know: there are some users today but i can
not say how much more interest it can spark outside of this very small
community that is HPC.

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-05  0:15   ` Jerome Glisse
@ 2018-12-05  1:06     ` Dave Hansen
  2018-12-05  2:13       ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-05  1:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
> 
> I definitly want to avoid the memoryX mistake. So i do not want to
> see one link directory per device. Taking my simple laptop as an
> example with 4 CPUs, a wifi and 2 GPU (the integrated one and a
> discret one):
> 
> link0: cpu0 cpu1 cpu2 cpu3
> link1: wifi (2 pcie lane)
> link2: gpu0 (unknown number of lane but i believe it has higher
>              bandwidth to main memory)
> link3: gpu1 (16 pcie lane)
> link4: gpu1 and gpu memory
> 
> So one link directory per number of pcie lane your device have
> so that you can differentiate on bandwidth. The main memory is
> symlinked inside all the link directory except link4. The GPU
> discret memory is only in link4 directory as it is only
> accessible by the GPU (we could add it under link3 too with the
> non cache coherent property attach to it).

I'm actually really interested in how this proposal scales.  It's quite
easy to represent a laptop, but can this scale to the largest systems
that we expect to encounter over the next 20 years that this ABI will live?

> The issue then becomes how to convert down the HMAT over verbose
> information to populate some reasonable layout for HMS. For that
> i would say that create a link directory for each different
> matrix cell. As an example let say that each entry in the matrix
> has bandwidth and latency then we create a link directory for
> each combination of bandwidth and latency. On simple system that
> should boils down to a handfull of combination roughly speaking
> mirroring the example above of one link directory per number of
> PCIE lane for instance.

OK, but there are 1024*1024 matrix cells on a system with 1024
proximity domains (ACPI term for NUMA node).  So it sounds like you are
proposing a million-directory approach.

We also can't simply say that two CPUs with the same connection to two
other CPUs (think a 4-socket QPI-connected system) share the same "link"
because they share the same combination of bandwidth and latency.  We
need to know that *each* has its own, unique link and do not share link
resources.

> I don't think i have a system with an HMAT table if you have one
> HMAT table to provide i could show up the end result.

It is new enough (ACPI 6.2) that no publicly-available hardware exists
that implements one (that I know of).  Keith Busch can probably
extract one and send it to you or show you how we're faking them with QEMU.

> Note i believe the ACPI HMAT matrix is a bad design for that
> reasons ie there is lot of commonality in many of the matrix
> entry and many entry also do not make sense (ie initiator not
> being able to access all the targets). I feel that link/bridge
> is much more compact and allow to represent any directed graph
> with multiple arrows from one node to another same node.

I don't disagree.  But, folks are building systems with them and we need
to either deal with it, or make its data manageable.  You saw our
approach: we cull the data and only expose the bare minimum in sysfs.


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 23:56                             ` Jerome Glisse
@ 2018-12-05  1:15                               ` Logan Gunthorpe
  2018-12-05  2:31                                 ` Jerome Glisse
  2018-12-05  2:34                                 ` Dan Williams
  0 siblings, 2 replies; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05  1:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> two 8 GPUs node connected through each other with fast mesh (ie each
> GPU can peer to peer to each other at the same bandwidth). Then this
> 2 blocks are connected to the other block through a share link.
> 
> So it looks like:
>     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
>     |          |           |          |
>     S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
>     ||     \\//            ||     \\//
>     ||     //\\            ||     //\\
>     ...    ====...    -----...    ====...
>     ||     \\//            ||     \\//
>     ||     //\\            ||     //\\
>     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7

Well the existing NUMA node stuff tells userspace which GPU belongs to
which socket (every device in sysfs already has a numa_node attribute).
And if that's not good enough we should work to improve how that works
for all devices. This problem isn't specific to GPUs or devices with
memory and seems rather orthogonal to an API to bind to device memory.
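
For illustration, reading that attribute from userspace could look like
the sketch below. It assumes the usual sysfs layout in which a PCI
device exposes a numa_node file that reads -1 when no node is known.
The helper name read_numa_node() and the NUMA_NODE_ERROR sentinel are
hypothetical and only exist for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Sentinel returned when the attribute cannot be read or parsed
 * (chosen for this sketch; it is not a kernel value). */
#define NUMA_NODE_ERROR (-2)

/*
 * Hypothetical helper: parse a sysfs "numa_node" attribute file.
 * Returns the node id found in the file (the kernel reports -1 when
 * the device has no known node) or NUMA_NODE_ERROR on failure.
 */
static int read_numa_node(const char *attr_path)
{
    FILE *f;
    int node;

    assert(attr_path != NULL);
    f = fopen(attr_path, "r");
    if (!f)
        return NUMA_NODE_ERROR;
    if (fscanf(f, "%d", &node) != 1)
        node = NUMA_NODE_ERROR;
    fclose(f);
    return node;
}
```

A caller would pass a path such as
/sys/bus/pci/devices/<address>/numa_node, where <address> is the PCI
address of the GPU in question.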

> How the above example would looks like ? I fail to see how to do it
> inside current sysfs. Maybe by creating multiple virtual device for
> each of the inter-connect ? So something like
> 
> link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child

I think the "links" between GPUs themselves would be a bus. In the same
way a NUMA node is a bus. Each device in sysfs would then need a
directory or something to describe what "link bus(es)" they are a part
of. Though there are other ways to do this: a GPU driver could simply
create symlinks to other GPUs inside a "neighbours" directory under the
device path or something like that.

The point is that this seems like it is specific to GPUs and could
easily be solved in the GPU community without any new universal concepts
or big APIs.

And for applications that need topology information, a lot of it is
already there, we just need to fill in the gaps with small changes that
would be much less controversial. Then if you want to create a libhms
(or whatever) to help applications parse this information out of
existing sysfs that would make sense.

> My proposal is to do HMS behind staging for a while and also avoid
> any disruption to existing code path. See with people living on the
> bleeding edge if they get interested in that informations. If not then
> i can strip down my thing to the bare minimum which is about device
> memory.

This isn't my area or decision to make, but it seemed to me like this is
not what staging is for. Staging is for introducing *drivers* that
aren't up to the Kernel's quality level and they all reside under the
drivers/staging path. It's not meant to introduce experimental APIs
around the kernel that might be revoked at anytime.

DAX introduced itself by marking the config option as EXPERIMENTAL and
printing warnings to dmesg when someone tries to use it. But, to my
knowledge, DAX also wasn't creating APIs with the intention of changing
or revoking them -- it was introducing features using largely existing
APIs that had many broken corner cases.

Do you know of any precedents where big APIs were introduced and then
later revoked or radically changed like you are proposing to do?

Logan





* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 21:57       ` Jerome Glisse
  2018-12-04 23:58         ` Dave Hansen
@ 2018-12-05  1:22         ` Kuehling, Felix
  1 sibling, 0 replies; 94+ messages in thread
From: Kuehling, Felix @ 2018-12-05  1:22 UTC (permalink / raw)
  To: Jerome Glisse, Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Yang, Philip, Koenig, Christian, Blinzer,
	Paul, Logan Gunthorpe, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi


On 2018-12-04 4:57 p.m., Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
>> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
>> manages in the "normal" allocator and supports a full feature set on.
>> That has a bunch of implications, like that the memory is cache coherent
>> and accessible from everywhere.
>>
>> The HMAT patches only comprehend this "normal" memory, which is why
>> we're extending the existing /sys/devices/system/node infrastructure.
>>
>> This series has a much more aggressive goal, which is comprehending the
>> connections of every memory-target to every memory-initiator, no matter
>> who is managing the memory, who can access it, or what it can be used for.
>>
>> Theoretically, HMS could be used for everything that we're doing with
>> /sys/devices/system/node, as long as it's tied back into the existing
>> NUMA infrastructure _somehow_.
>>
>> Right?
> Fully correct mind if i steal that perfect summary description next time
> i post ? I am so bad at explaining thing :)
>
> Intention is to allow program to do everything they do with mbind() today
> and tomorrow with the HMAT patchset and on top of that to also be able to
> do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> kernel API to rule them all ;)

As for ROCm, I'm looking forward to using hbind in our own APIs. It will
save us some time and trouble not having to implement all the low-level
policy and tracking of virtual address ranges in our device driver.
Going forward, having a common API to manage the topology and memory
affinity would also enable sane ways of having accelerators and memory
devices from different vendors interact under control of a
topology-aware application.

Disclaimer: I haven't had a chance to review the patches in detail yet.
Got caught up in the documentation and discussion ...

Regards,
  Felix


>
> Also at first i intend to special case vma alloc page when they are HMS
> policy, long term i would like to merge code path inside the kernel. But
> i do not want to disrupt existing code path today, i rather grow to that
> organicaly. Step by step. The mbind() would still work un-affected in
> the end just the plumbing would be slightly different.
>
> Cheers,
> Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-05  1:06     ` Dave Hansen
@ 2018-12-05  2:13       ` Jerome Glisse
  2018-12-05 17:27         ` Dave Hansen
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05  2:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> On 12/4/18 4:15 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> >> Basically, is sysfs the right place to even expose this much data?
> > 
> > I definitly want to avoid the memoryX mistake. So i do not want to
> > see one link directory per device. Taking my simple laptop as an
> > example with 4 CPUs, a wifi and 2 GPU (the integrated one and a
> > discret one):
> > 
> > link0: cpu0 cpu1 cpu2 cpu3
> > link1: wifi (2 pcie lane)
> > link2: gpu0 (unknown number of lane but i believe it has higher
> >              bandwidth to main memory)
> > link3: gpu1 (16 pcie lane)
> > link4: gpu1 and gpu memory
> > 
> > So one link directory per number of pcie lane your device have
> > so that you can differentiate on bandwidth. The main memory is
> > symlinked inside all the link directory except link4. The GPU
> > discret memory is only in link4 directory as it is only
> > accessible by the GPU (we could add it under link3 too with the
> > non cache coherent property attach to it).
> 
> I'm actually really interested in how this proposal scales.  It's quite
> easy to represent a laptop, but can this scale to the largest systems
> that we expect to encounter over the next 20 years that this ABI will live?
> 
> > The issue then becomes how to convert down the HMAT over verbose
> > information to populate some reasonable layout for HMS. For that
> > i would say that create a link directory for each different
> > matrix cell. As an example let say that each entry in the matrix
> > has bandwidth and latency then we create a link directory for
> > each combination of bandwidth and latency. On simple system that
> > should boils down to a handfull of combination roughly speaking
> > mirroring the example above of one link directory per number of
> > PCIE lane for instance.
> 
> OK, but there are 1024*1024 matrix cells on a systems with 1024
> proximity domains (ACPI term for NUMA node).  So it sounds like you are
> proposing a million-directory approach.

No, pseudo code:
    struct list links;

    for (unsigned r = 0; r < nrows; r++) {
        for (unsigned c = 0; c < ncolumns; c++) {
            link = link_find(links, hmat[r][c].bandwidth,
                             hmat[r][c].latency);
            if (!link) {
                // first cell with these properties: create the link
                link = link_new(hmat[r][c].bandwidth,
                                hmat[r][c].latency);
                list_add(link, links);
            }
            // add the initiator and target corresponding to that
            // row and column to the (new or existing) link
        }
    }

So all cells that have the same properties are under the same link. Do
you expect all the cells to always have different properties ? On
today's platforms that should not be the case. I do expect we will keep
seeing many initiator/target pairs that share the same properties as
other pairs.

But yes, if you have a system where no two initiator/target pairs have
the same properties, then you are in the worst case you are describing.
But hey, that is the hardware you have then :)

Note that userspace can parse all this once during its initialization
and create pools of targets to use.
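
To make the scaling argument concrete, here is a minimal,
self-contained sketch of the grouping described above. It only counts
how many distinct (bandwidth, latency) pairs a set of matrix cells
contains, which is the number of link directories that would result.
The struct cell_props type and the count_links() helper are
hypothetical names used for illustration, not part of the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cell properties; the units are left abstract. */
struct cell_props {
    unsigned bandwidth;
    unsigned latency;
};

/*
 * Count the distinct (bandwidth, latency) pairs among ncells cells,
 * i.e. the number of links the grouping would create.
 * uniq must have room for ncells entries; on return its first
 * entries hold one representative per link.
 */
static size_t count_links(const struct cell_props *cells, size_t ncells,
                          struct cell_props *uniq)
{
    size_t nlinks = 0;

    assert(cells != NULL && uniq != NULL);
    for (size_t i = 0; i < ncells; i++) {
        size_t j;

        /* Linear search is enough for a sketch. */
        for (j = 0; j < nlinks; j++) {
            if (uniq[j].bandwidth == cells[i].bandwidth &&
                uniq[j].latency == cells[i].latency)
                break;
        }
        if (j == nlinks)        /* properties not seen before */
            uniq[nlinks++] = cells[i];
    }
    return nlinks;
}
```

With this grouping the number of links depends on how many distinct
property pairs exist, not on the size of the matrix: a matrix whose
cells only take "local" and "remote" values yields 2 links, while a
matrix in which every cell differs yields one link per cell.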


> We also can't simply say that two CPUs with the same connection to two
> other CPUs (think a 4-socket QPI-connected system) share the same "link"
> because they share the same combination of bandwidth and latency.  We
> need to know that *each* has its own, unique link and do not share link
> resources.

That is the purpose of the bridge object, which inter-connects links.
To be more exact, a link is like saying you have 2 arrows (one in each
direction) with the same properties between every pair of nodes listed
in the link, while a bridge allows defining an arrow in just one
direction. Maybe i should define arrow and node instead of trying to
match some of the ACPI terminology. This might be easier for people to
follow than first having to understand the terminology.
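
The arrow view can be sketched as follows. This is only an
illustration under a simplifying assumption: a bridge is modeled here
as a single arrow between two node ids, whereas the bridge object in
the patchset connects links. The struct arrow type and both helper
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A directed arrow between two node ids (hypothetical representation). */
struct arrow {
    unsigned from;
    unsigned to;
};

/*
 * Expand a link into arrows: every ordered pair of distinct members
 * gets one arrow, so each pair is connected in both directions.
 * out must have room for n * (n - 1) arrows. Returns the arrow count.
 */
static size_t link_to_arrows(const unsigned *members, size_t n,
                             struct arrow *out)
{
    size_t k = 0;

    assert(members != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (i != j)
                out[k++] = (struct arrow){ members[i], members[j] };
    return k;
}

/* A bridge contributes a single arrow, in one direction only. */
static struct arrow bridge_to_arrow(unsigned from, unsigned to)
{
    return (struct arrow){ from, to };
}
```

A link with n members therefore stands for n * (n - 1) arrows sharing
the same properties, which is why listing members per link is more
compact than listing every initiator/target pair.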

The fear i have with HMAT culling is that HMAT does not have the
information to avoid such culling.

> > I don't think i have a system with an HMAT table if you have one
> > HMAT table to provide i could show up the end result.
> 
> It is new enough (ACPI 6.2) that no publicly-available hardware that
> exists that implements one (that I know of).  Keith Busch can probably
> extract one and send it to you or show you how we're faking them with QEMU.
> 
> > Note i believe the ACPI HMAT matrix is a bad design for that
> > reasons ie there is lot of commonality in many of the matrix
> > entry and many entry also do not make sense (ie initiator not
> > being able to access all the targets). I feel that link/bridge
> > is much more compact and allow to represent any directed graph
> > with multiple arrows from one node to another same node.
> 
> I don't disagree.  But, folks are building systems with them and we need
> to either deal with it, or make its data manageable.  You saw our
> approach: we cull the data and only expose the bare minimum in sysfs.

Yeah and i intend to cull data too inside HMS.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05  1:15                               ` Logan Gunthorpe
@ 2018-12-05  2:31                                 ` Jerome Glisse
  2018-12-05 17:41                                   ` Logan Gunthorpe
  2018-12-05  2:34                                 ` Dan Williams
  1 sibling, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05  2:31 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 06:15:08PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> > two 8 GPUs node connected through each other with fast mesh (ie each
> > GPU can peer to peer to each other at the same bandwidth). Then this
> > 2 blocks are connected to the other block through a share link.
> > 
> > So it looks like:
> >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> >     |          |           |          |
> >     S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     ...    ====...    -----...    ====...
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
> 
> Well the existing NUMA node stuff tells userspace which GPU belongs to
> which socket (every device in sysfs already has a numa_node attribute).
> And if that's not good enough we should work to improve how that works
> for all devices. This problem isn't specific to GPUS or devices with
> memory and seems rather orthogonal to an API to bind to device memory.

HMS is generic and not for GPUs only; i use GPUs as the example because
they are the first devices introducing this complexity. I believe some of
the FPGA folks are working on the same thing. I heard that more TPU-like
hardware might also grow such complexity.

What you are proposing just seems to me like redoing HMS under the node
directory in sysfs, which has the potential of confusing existing
applications while providing no benefits (at least i fail to see any).

> > How the above example would looks like ? I fail to see how to do it
> > inside current sysfs. Maybe by creating multiple virtual device for
> > each of the inter-connect ? So something like
> > 
> > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
> 
> I think the "links" between GPUs themselves would be a bus. In the same
> way a NUMA node is a bus. Each device in sysfs would then need a
> directory or something to describe what "link bus(es)" they are a part
> of. Though there are other ways to do this: a GPU driver could simply
> create symlinks to other GPUs inside a "neighbours" directory under the
> device path or something like that.
> 
> The point is that this seems like it is specific to GPUs and could
> easily be solved in the GPU community without any new universal concepts
> or big APIs.

So it would sprinkle all this information over various sub-directories.
To me this makes userspace's life harder. HMS has only one directory
hierarchy that userspace needs to parse to extract the information.
From my point of view it is much better, but this might be a matter
of taste.

> 
> And for applications that need topology information, a lot of it is
> already there, we just need to fill in the gaps with small changes that
> would be much less controversial. Then if you want to create a libhms
> (or whatever) to help applications parse this information out of
> existing sysfs that would make sense.

How can i express multiple links, or memory that is only accessible
by a subset of the devices/CPUs? Today's model has baked-in
assumptions, like everyone can access all the nodes, which do not hold
for what i am trying to do.

Yes, i can do it by adding an invalid peer node list inside each node,
but this is all more complex from my point of view. It is highly
confusing for existing applications and has the potential to break
existing applications on new platforms with such weird nodes.
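
A minimal Python sketch of the difference, under the assumption that each
target carries an explicit set of initiators allowed to reach it (the
mapping and all names below are hypothetical, not an existing interface):

```python
def reachable_targets(initiator, access):
    """Return the targets this initiator may use, sorted by name.

    `access` maps each target to the set of initiators that can reach it.
    """
    return sorted(t for t, who in access.items() if initiator in who)

# Hypothetical platform: GPU memory that the CPUs cannot access.
access = {
    "ddr0": {"cpu0", "cpu1", "gpu0"},
    "gpu0-mem": {"gpu0", "gpu1"},
}
```

Under the NUMA assumption every set would contain every initiator; here
the per-target set is what tells userspace that cpu0 must not pick
gpu0-mem.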


> > My proposal is to do HMS behind staging for a while and also avoid
> > any disruption to existing code path. See with people living on the
> > bleeding edge if they get interested in that informations. If not then
> > i can strip down my thing to the bare minimum which is about device
> > memory.
> 
> This isn't my area or decision to make, but it seemed to me like this is
> not what staging is for. Staging is for introducing *drivers* that
> aren't up to the Kernel's quality level and they all reside under the
> drivers/staging path. It's not meant to introduce experimental APIs
> around the kernel that might be revoked at anytime.
> 
> DAX introduced itself by marking the config option as EXPERIMENTAL and
> printing warnings to dmesg when someone tries to use it. But, to my
> knowledge, DAX also wasn't creating APIs with the intention of changing
> or revoking them -- it was introducing features using largely existing
> APIs that had many broken corner cases.
> 
> Do you know of any precedents where big APIs were introduced and then
> later revoked or radically changed like you are proposing to do?

Yeah, it is kind of an issue. I can go the experimental way; ideally
what i would like is a kernel option that enables it, with a kernel
boot parameter as an extra gatekeeper, so i can distribute a kernel
with that feature inside some distribution and then provide simple
instructions for people to test (much easier to give a kernel boot
parameter than to have people rebuild a kernel).

I am open to any suggestion on what would be the best guideline for
experimenting with an API. The issue is that the changes to userspace
are big and take time (months of work). So if i have to have everything
lined up and ready (userspace and kernel) in just one go then it is
gonna be painful. My pain i guess, so others don't care ... :)

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05  1:15                               ` Logan Gunthorpe
  2018-12-05  2:31                                 ` Jerome Glisse
@ 2018-12-05  2:34                                 ` Dan Williams
  2018-12-05  2:37                                   ` Jerome Glisse
  1 sibling, 1 reply; 94+ messages in thread
From: Dan Williams @ 2018-12-05  2:34 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jérôme Glisse, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> > two 8 GPUs node connected through each other with fast mesh (ie each
> > GPU can peer to peer to each other at the same bandwidth). Then this
> > 2 blocks are connected to the other block through a share link.
> >
> > So it looks like:
> >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> >     |          |           |          |
> >     S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     ...    ====...    -----...    ====...
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
>
> Well the existing NUMA node stuff tells userspace which GPU belongs to
> which socket (every device in sysfs already has a numa_node attribute).
> And if that's not good enough we should work to improve how that works
> for all devices. This problem isn't specific to GPUS or devices with
> memory and seems rather orthogonal to an API to bind to device memory.
>
> > How the above example would looks like ? I fail to see how to do it
> > inside current sysfs. Maybe by creating multiple virtual device for
> > each of the inter-connect ? So something like
> >
> > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
>
> I think the "links" between GPUs themselves would be a bus. In the same
> way a NUMA node is a bus. Each device in sysfs would then need a
> directory or something to describe what "link bus(es)" they are a part
> of. Though there are other ways to do this: a GPU driver could simply
> create symlinks to other GPUs inside a "neighbours" directory under the
> device path or something like that.
>
> The point is that this seems like it is specific to GPUs and could
> easily be solved in the GPU community without any new universal concepts
> or big APIs.
>
> And for applications that need topology information, a lot of it is
> already there, we just need to fill in the gaps with small changes that
> would be much less controversial. Then if you want to create a libhms
> (or whatever) to help applications parse this information out of
> existing sysfs that would make sense.
>
> > My proposal is to do HMS behind staging for a while and also avoid
> > any disruption to existing code path. See with people living on the
> > bleeding edge if they get interested in that informations. If not then
> > i can strip down my thing to the bare minimum which is about device
> > memory.
>
> This isn't my area or decision to make, but it seemed to me like this is
> not what staging is for. Staging is for introducing *drivers* that
> aren't up to the Kernel's quality level and they all reside under the
> drivers/staging path. It's not meant to introduce experimental APIs
> around the kernel that might be revoked at anytime.
>
> DAX introduced itself by marking the config option as EXPERIMENTAL and
> printing warnings to dmesg when someone tries to use it. But, to my
> knowledge, DAX also wasn't creating APIs with the intention of changing
> or revoking them -- it was introducing features using largely existing
> APIs that had many broken corner cases.
>
> Do you know of any precedents where big APIs were introduced and then
> later revoked or radically changed like you are proposing to do?

This came up before for APIs even better defined than HMS and more
limited in scope, i.e. experimental ABI availability only for -rc
kernels. Linus said this:

"There are no loopholes. No "but it's been only one release". No, no,
no. The whole point is that users are supposed to be able to *trust*
the kernel. If we do something, we keep on doing it.

And if it makes it harder to add new user-visible interfaces, then
that's a *good* thing." [1]

The takeaway being don't land work-in-progress ABIs in the kernel.
Once an application depends on it, there are no more incompatible
changes possible regardless of the warnings, experimental notices, or
"staging" designation. DAX is experimental because there are cases
where it currently does not work with respect to another kernel
feature like xfs-reflink, RDMA. The plan is to fix those, not continue
to hide behind an experimental designation, and fix them in a way that
preserves the user visible behavior that has already been exposed,
i.e. no regressions.

[1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05  2:34                                 ` Dan Williams
@ 2018-12-05  2:37                                   ` Jerome Glisse
  2018-12-05 17:25                                     ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05  2:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Logan Gunthorpe, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Tue, Dec 04, 2018 at 06:34:37PM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> >
> >
> >
> > On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> > > two 8 GPUs node connected through each other with fast mesh (ie each
> > > GPU can peer to peer to each other at the same bandwidth). Then this
> > > 2 blocks are connected to the other block through a share link.
> > >
> > > So it looks like:
> > >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> > >     |          |           |          |
> > >     S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
> > >     ||     \\//            ||     \\//
> > >     ||     //\\            ||     //\\
> > >     ...    ====...    -----...    ====...
> > >     ||     \\//            ||     \\//
> > >     ||     //\\            ||     //\\
> > >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
> >
> > Well the existing NUMA node stuff tells userspace which GPU belongs to
> > which socket (every device in sysfs already has a numa_node attribute).
> > And if that's not good enough we should work to improve how that works
> > for all devices. This problem isn't specific to GPUS or devices with
> > memory and seems rather orthogonal to an API to bind to device memory.
> >
> > > How the above example would looks like ? I fail to see how to do it
> > > inside current sysfs. Maybe by creating multiple virtual device for
> > > each of the inter-connect ? So something like
> > >
> > > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
> >
> > I think the "links" between GPUs themselves would be a bus. In the same
> > way a NUMA node is a bus. Each device in sysfs would then need a
> > directory or something to describe what "link bus(es)" they are a part
> > of. Though there are other ways to do this: a GPU driver could simply
> > create symlinks to other GPUs inside a "neighbours" directory under the
> > device path or something like that.
> >
> > The point is that this seems like it is specific to GPUs and could
> > easily be solved in the GPU community without any new universal concepts
> > or big APIs.
> >
> > And for applications that need topology information, a lot of it is
> > already there, we just need to fill in the gaps with small changes that
> > would be much less controversial. Then if you want to create a libhms
> > (or whatever) to help applications parse this information out of
> > existing sysfs that would make sense.
> >
> > > My proposal is to do HMS behind staging for a while and also avoid
> > > any disruption to existing code path. See with people living on the
> > > bleeding edge if they get interested in that informations. If not then
> > > i can strip down my thing to the bare minimum which is about device
> > > memory.
> >
> > This isn't my area or decision to make, but it seemed to me like this is
> > not what staging is for. Staging is for introducing *drivers* that
> > aren't up to the Kernel's quality level and they all reside under the
> > drivers/staging path. It's not meant to introduce experimental APIs
> > around the kernel that might be revoked at anytime.
> >
> > DAX introduced itself by marking the config option as EXPERIMENTAL and
> > printing warnings to dmesg when someone tries to use it. But, to my
> > knowledge, DAX also wasn't creating APIs with the intention of changing
> > or revoking them -- it was introducing features using largely existing
> > APIs that had many broken corner cases.
> >
> > Do you know of any precedents where big APIs were introduced and then
> > later revoked or radically changed like you are proposing to do?
> 
> This came up before for apis even better defined than HMS as well as
> more limited scope, i.e. experimental ABI availability only for -rc
> kernels. Linus said this:
> 
> "There are no loopholes. No "but it's been only one release". No, no,
> no. The whole point is that users are supposed to be able to *trust*
> the kernel. If we do something, we keep on doing it.
> 
> And if it makes it harder to add new user-visible interfaces, then
> that's a *good* thing." [1]
> 
> The takeaway being don't land work-in-progress ABIs in the kernel.
> Once an application depends on it, there are no more incompatible
> changes possible regardless of the warnings, experimental notices, or
> "staging" designation. DAX is experimental because there are cases
> where it currently does not work with respect to another kernel
> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> to hide behind an experimental designation, and fix them in a way that
> preserves the user visible behavior that has already been exposed,
> i.e. no regressions.
> 
> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html

So i guess i am heading down the vXX road ... such is my life :)

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-04 18:24     ` Jerome Glisse
  2018-12-04 18:31       ` Dan Williams
  2018-12-04 20:12       ` Andi Kleen
@ 2018-12-05  4:36       ` Aneesh Kumar K.V
  2018-12-05  4:41         ` Jerome Glisse
  2 siblings, 1 reply; 94+ messages in thread
From: Aneesh Kumar K.V @ 2018-12-05  4:36 UTC (permalink / raw)
  To: Jerome Glisse, Andi Kleen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Ross Zwisler, Dan Williams, Dave Hansen, Haggai Eran,
	Balbir Singh, Benjamin Herrenschmidt, Felix Kuehling,
	Philip Yang, Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell

On 12/4/18 11:54 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
>> jglisse@redhat.com writes:
>>
>>> +
>>> +To help with forward compatibility each object as a version value and
>>> +it is mandatory for user space to only use target or initiator with
>>> +version supported by the user space. For instance if user space only
>>> +knows about what version 1 means and sees a target with version 2 then
>>> +the user space must ignore that target as if it does not exist.
>>
>> So once v2 is introduced all applications that only support v1 break.
>>
>> That seems very un-Linux and will break Linus' "do not break existing
>> applications" rule.
>>
>> The standard approach that if you add something incompatible is to
>> add new field, but keep the old ones.
> 
> No that's not how it is suppose to work. So let says it is 2018 and you
> have v1 memory (like your regular main DDR memory for instance) then it
> will always be expose a v1 memory.
> 
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
> 
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.
> 

So the knowledge that v1 is coherent and v2 is non-coherent is within
the application? That seems really complicated from the application's
point of view. Will the v1 and v2 definitions be arch and system dependent?

If we want to encode properties of a target and initiator we should do
that as files within these directories. Something like 'is_cache_coherent'
in the target directory can be used to identify whether the target is
cache coherent or not?

-aneesh



* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05  4:36       ` Aneesh Kumar K.V
@ 2018-12-05  4:41         ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05  4:41 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andi Kleen, linux-mm, Andrew Morton, linux-kernel,
	Rafael J . Wysocki, Ross Zwisler, Dan Williams, Dave Hansen,
	Haggai Eran, Balbir Singh, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell

On Wed, Dec 05, 2018 at 10:06:02AM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 11:54 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > jglisse@redhat.com writes:
> > > 
> > > > +
> > > > +To help with forward compatibility each object as a version value and
> > > > +it is mandatory for user space to only use target or initiator with
> > > > +version supported by the user space. For instance if user space only
> > > > +knows about what version 1 means and sees a target with version 2 then
> > > > +the user space must ignore that target as if it does not exist.
> > > 
> > > So once v2 is introduced all applications that only support v1 break.
> > > 
> > > That seems very un-Linux and will break Linus' "do not break existing
> > > applications" rule.
> > > 
> > > The standard approach that if you add something incompatible is to
> > > add new field, but keep the old ones.
> > 
> > No that's not how it is suppose to work. So let says it is 2018 and you
> > have v1 memory (like your regular main DDR memory for instance) then it
> > will always be expose a v1 memory.
> > 
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> > 
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> > 
> 
> So the knowledge that v1 is coherent and v2 is non-coherent is within the
> application? That seems really complicated from application point of view.
> Rill that v1 and v2 definition be arch and system dependent?

No, the idea was that kernel version X, like 4.20, would define what v1
means. Then once v2 is added it would define what that means. Memory
that has v1 properties would get v1 and memory that has v2 properties
would get v2 as prefix.

An application that was written at 4.20 time, and thus only knew about
v1, would only look for v1 folders and thus only get memory it does
understand.

This is kind of a moot discussion as i will switch to a mask file inside
the directory per Logan's advice.

> 
> if we want to encode properties of a target and initiator we should do that
> as files within these directory. Something like 'is_cache_coherent'
> in the target director can be used to identify whether the target is cache
> coherent or not?

My objection and fear is that applications would overlook new properties
that they need to understand to safely use a new type of memory. Thus an
old application might start using weird memory on a new platform and
break in unexpected ways. This was the whole rationale and motivation
behind my choice.

I will switch to a set of flags in a file in the target directory and rely
on sane userspace behavior.
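
What i mean by sane userspace behavior could look like the Python sketch
below. The flag names and bit values are invented for the example; nothing
in this thread defines them.

```python
# Hypothetical flag bits; the real file format is not defined yet.
FLAG_NOT_CACHE_COHERENT = 1 << 0
FLAG_PERSISTENT = 1 << 1

# Flags this particular application knows how to handle.
KNOWN_FLAGS = FLAG_NOT_CACHE_COHERENT | FLAG_PERSISTENT

def usable(target_flags, known=KNOWN_FLAGS):
    """A target is safe to use only if every flag it sets is understood."""
    return (target_flags & ~known) == 0
```

An old application that sees a bit it does not know about skips the
target, which gives the same protection the version scheme was after.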

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-03 23:34 ` [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation jglisse
  2018-12-04 17:06   ` Andi Kleen
@ 2018-12-05 10:52   ` Mike Rapoport
  1 sibling, 0 replies; 94+ messages in thread
From: Mike Rapoport @ 2018-12-05 10:52 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Ross Zwisler, Dan Williams, Dave Hansen, Haggai Eran,
	Balbir Singh, Aneesh Kumar K . V, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli

On Mon, Dec 03, 2018 at 06:34:57PM -0500, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Add documentation to what is HMS and what it is for (see patch content).
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Haggai Eran <haggaie@mellanox.com>
> Cc: Balbir Singh <balbirs@au1.ibm.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: Philip Yang <Philip.Yang@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Paul Blinzer <Paul.Blinzer@amd.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: Vivek Kini <vkini@nvidia.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 246 insertions(+), 29 deletions(-)

This document describes a userspace API, so it would be better to put it
into Documentation/admin-guide/mm.
Documentation/vm is more for descriptions of design and implementation.

I've spotted a couple of typos, but I think it doesn't make sense to nitpick
about them before v10 or so ;-)
 
> diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
> index dbf0f71918a9..bd7c9e8e7077 100644
> --- a/Documentation/vm/hms.rst
> +++ b/Documentation/vm/hms.rst

-- 
Sincerely yours,
Mike.



* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-04 18:49   ` Jerome Glisse
  2018-12-04 18:54     ` Dave Hansen
  2018-12-04 21:37     ` Dave Hansen
@ 2018-12-05 11:27     ` Aneesh Kumar K.V
  2018-12-05 16:09       ` Jerome Glisse
  2 siblings, 1 reply; 94+ messages in thread
From: Aneesh Kumar K.V @ 2018-12-05 11:27 UTC (permalink / raw)
  To: Jerome Glisse, Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On 12/5/18 12:19 AM, Jerome Glisse wrote:

> Above example is for migrate. Here is an example for how the
> topology is use today:
> 
>      Application knows that the platform is running on have 16
>      GPU split into 2 group of 8 GPUs each. GPU in each group can
>      access each other memory with dedicated mesh links between
>      each others. Full speed no traffic bottleneck.
> 
>      Application splits its GPU computation in 2 so that each
>      partition runs on a group of interconnected GPU allowing
>      them to share the dataset.
> 
> With HMS:
>      Application can query the kernel to discover the topology of
>      system it is running on and use it to partition and balance
>      its workload accordingly. Same application should now be able
>      to run on new platform without having to adapt it to it.
> 

Will the kernel ever be involved in decision making here? As with the
scheduler, will we ever want to control how these computation units get
scheduled onto GPU groups or GPUs?

> This is kind of naive i expect topology to be hard to use but maybe
> it is just me being pesimistics. In any case today we have a chicken
> and egg problem. We do not have a standard way to expose topology so
> program that can leverage topology are only done for HPC where the
> platform is standard for few years. If we had a standard way to expose
> the topology then maybe we would see more program using it. At very
> least we could convert existing user.
> 
> 

I am wondering whether we should consider HMAT as a subset of the ideas
mentioned in this thread and see whether we can first achieve HMAT 
representation with your patch series?

-aneesh



* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-05 11:27     ` Aneesh Kumar K.V
@ 2018-12-05 16:09       ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 16:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Dave Hansen, linux-mm, Andrew Morton, linux-kernel,
	Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler, Keith Busch,
	Dan Williams, Haggai Eran, Balbir Singh, Benjamin Herrenschmidt,
	Felix Kuehling, Philip Yang, Christian König, Paul Blinzer,
	Logan Gunthorpe, John Hubbard, Ralph Campbell, Michal Hocko,
	Jonathan Cameron, Mark Hairgrove, Vivek Kini, Mel Gorman,
	Dave Airlie, Ben Skeggs, Andrea Arcangeli, Rik van Riel,
	Ben Woodard, linux-acpi

On Wed, Dec 05, 2018 at 04:57:17PM +0530, Aneesh Kumar K.V wrote:
> On 12/5/18 12:19 AM, Jerome Glisse wrote:
> 
> > Above example is for migrate. Here is an example for how the
> > topology is use today:
> > 
> >      Application knows that the platform is running on have 16
> >      GPU split into 2 group of 8 GPUs each. GPU in each group can
> >      access each other memory with dedicated mesh links between
> >      each others. Full speed no traffic bottleneck.
> > 
> >      Application splits its GPU computation in 2 so that each
> >      partition runs on a group of interconnected GPU allowing
> >      them to share the dataset.
> > 
> > With HMS:
> >      Application can query the kernel to discover the topology of
> >      system it is running on and use it to partition and balance
> >      its workload accordingly. Same application should now be able
> >      to run on new platform without having to adapt it to it.
> > 
> 
> Will the kernel be ever involved in decision making here? Like the scheduler
> will we ever want to control how there computation units get scheduled onto
> GPU groups or GPU?

I don't think you will ever see fine-grained control in software because
it would go against what GPUs fundamentally are. A GPU has thousands of
cores and usually 10 times more threads in flight than cores (depending
on the number of registers used by the program or the size of its
thread-local storage). By keeping many more threads in flight, the GPU
always has some threads that are not waiting on memory access and thus
always has something to schedule next on a core. This scheduling is all
done in real time, and I do not see it as a good fit for any kernel code
running on the CPU.

That being said, higher-level and coarser directives can be given
to the GPU hardware scheduler, like giving priorities to groups of
threads so that they always get scheduled first if ready. There is
a cgroup proposal that goes in the direction of exposing high-level
control over GPU resources like that. I think that is a better
venue to discuss such topics.

> 
> > This is kind of naive i expect topology to be hard to use but maybe
> > it is just me being pesimistics. In any case today we have a chicken
> > and egg problem. We do not have a standard way to expose topology so
> > program that can leverage topology are only done for HPC where the
> > platform is standard for few years. If we had a standard way to expose
> > the topology then maybe we would see more program using it. At very
> > least we could convert existing user.
> > 
> > 
> 
> I am wondering whether we should consider HMAT as a subset of the ideas
> mentioned in this thread and see whether we can first achieve HMAT
> representation with your patch series?

I do not want to block HMAT on that. What I am trying to do really
does not fit in the existing NUMA node model; this is what I have been
trying to show, even if not everyone is convinced by it. Some bullet
points on why:
    - the memory I care about is not accessible by everyone (a baked-in
      assumption of NUMA nodes)
    - the memory I care about might not be cache coherent (again a
      baked-in assumption of NUMA nodes)
    - topology matters, so that userspace knows which interconnects are
      shared and which are dedicated links to memory
    - there can be multiple paths between one device and one target
      memory, and each path has a different NUMA distance (or rather
      different properties like bandwidth, latency, ...); again this
      does not fit with the NUMA distance model
    - the memory is not managed by the core kernel, for reasons I have
      explained
    - ...

The HMAT proposal does not deal with such memory; it is much
closer to what the current model can describe.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05  2:37                                   ` Jerome Glisse
@ 2018-12-05 17:25                                     ` Logan Gunthorpe
  2018-12-05 18:01                                       ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 17:25 UTC (permalink / raw)
  To: Jerome Glisse, Dan Williams
  Cc: Andi Kleen, Linux MM, Andrew Morton, Linux Kernel Mailing List,
	Rafael J. Wysocki, Dave Hansen, Haggai Eran, balbirs,
	Aneesh Kumar K.V, Benjamin Herrenschmidt, Kuehling, Felix,
	Philip.Yang, Koenig, Christian, Blinzer, Paul, John Hubbard,
	rcampbell



On 2018-12-04 7:37 p.m., Jerome Glisse wrote:
>>
>> This came up before for apis even better defined than HMS as well as
>> more limited scope, i.e. experimental ABI availability only for -rc
>> kernels. Linus said this:
>>
>> "There are no loopholes. No "but it's been only one release". No, no,
>> no. The whole point is that users are supposed to be able to *trust*
>> the kernel. If we do something, we keep on doing it.
>>
>> And if it makes it harder to add new user-visible interfaces, then
>> that's a *good* thing." [1]
>>
>> The takeaway being don't land work-in-progress ABIs in the kernel.
>> Once an application depends on it, there are no more incompatible
>> changes possible regardless of the warnings, experimental notices, or
>> "staging" designation. DAX is experimental because there are cases
>> where it currently does not work with respect to another kernel
>> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
>> to hide behind an experimental designation, and fix them in a way that
>> preserves the user visible behavior that has already been exposed,
>> i.e. no regressions.
>>
>> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
> 
> So i guess i am heading down the vXX road ... such is my life :)

I recommend against it. I really haven't been convinced by any of your
arguments for having a second topology tree. The existing topology tree
in sysfs already better describes the links between hardware right now,
except for the missing GPU links (and those should be addressable within
the GPU community). Plus, maybe, some other enhancements to sockets/numa
node descriptions if there's something missing there.

Then, 'hbind' is another issue but I suspect it would be better
implemented as an ioctl on existing GPU interfaces. I certainly can't
see any benefit in using it myself.

It's better to take an approach that would be less controversial with
the community than to browbeat them with a patch set 20+ times until
they take it.

Logan



* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-05  2:13       ` Jerome Glisse
@ 2018-12-05 17:27         ` Dave Hansen
  2018-12-05 17:53           ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-05 17:27 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On 12/4/18 6:13 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
>> OK, but there are 1024*1024 matrix cells on a systems with 1024
>> proximity domains (ACPI term for NUMA node).  So it sounds like you are
>> proposing a million-directory approach.
> 
> No, pseudo code:
>     struct list links;
> 
>     for (unsigned r = 0; r < nrows; r++) {
>         for (unsigned c = 0; c < ncolumns; c++) {
>             if (!link_find(links, hmat[r][c].bandwidth,
>                            hmat[r][c].latency)) {
>                 link = link_new(hmat[r][c].bandwidth,
>                                 hmat[r][c].latency);
>                 // add initiator and target correspond to that row
>                 // and columns to this new link
>                 list_add(&link, links);
>             }
>         }
>     }
> 
> So all cells that have same property are under the same link. 
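
For reference, the quoted pseudo code could be written as compilable C
roughly as below. This is only a sketch under assumptions that are not in
the patchset: the struct cell and struct link layouts, the 64-bit
membership masks and the links_build(), links_count() and links_free()
helpers are all hypothetical names. Unlike the pseudo code, the row and
column are attached to the link whether or not the link already existed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical HMAT-like matrix cell: properties of one (row, column) pair. */
struct cell {
    unsigned bandwidth;
    unsigned latency;
};

/* Hypothetical deduplicated link: one per distinct (bandwidth, latency). */
struct link {
    unsigned bandwidth;
    unsigned latency;
    unsigned long long initiators; /* bit r set: row r is on this link */
    unsigned long long targets;    /* bit c set: column c is on this link */
    struct link *next;
};

/* Return the link with exactly these properties, or NULL if none exists. */
static struct link *link_find(struct link *head, unsigned bw, unsigned lat)
{
    for (struct link *l = head; l; l = l->next)
        if (l->bandwidth == bw && l->latency == lat)
            return l;
    return NULL;
}

/* Free every link in the list. */
static void links_free(struct link *head)
{
    while (head) {
        struct link *next = head->next;
        free(head);
        head = next;
    }
}

/* Number of links in the list. */
static size_t links_count(const struct link *head)
{
    size_t n = 0;
    for (; head; head = head->next)
        n++;
    return n;
}

/*
 * Walk a row-major nrows x ncols matrix and put all cells that share the
 * same (bandwidth, latency) under one link. Returns the list head, or NULL
 * on allocation failure. Limited to 64 rows/columns by the bitmasks.
 */
static struct link *links_build(const struct cell *m, size_t nrows,
                                size_t ncols)
{
    struct link *head = NULL;

    assert(nrows <= 64 && ncols <= 64);
    for (size_t r = 0; r < nrows; r++) {
        for (size_t c = 0; c < ncols; c++) {
            const struct cell *cell = &m[r * ncols + c];
            struct link *l = link_find(head, cell->bandwidth, cell->latency);

            if (!l) {
                l = calloc(1, sizeof(*l));
                if (!l) {
                    links_free(head);
                    return NULL;
                }
                l->bandwidth = cell->bandwidth;
                l->latency = cell->latency;
                l->next = head;
                head = l;
            }
            /* Attach this row and column even when the link already existed. */
            l->initiators |= 1ULL << r;
            l->targets |= 1ULL << c;
        }
    }
    return head;
}
```

The number of links is bounded by the number of distinct property pairs,
not by the number of matrix cells. As in the pseudo code, only membership
of a row or column in a link is recorded, not the individual pairs.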

OK, so the "link" here is like a cable.  It's like saying, "we have a
network and everything is connected with an ethernet cable that can do
1gbit/sec".

But, what actually connects an initiator to a target?  I assume we still
need to know which link is used for each target/initiator pair.  Where
is that enumerated?

I think this just means we need a million symlinks to a "link" instead
of a million link directories.  Still not great.

> Note that userspace can parse all this once during its initialization
> and create pools of target to use.

It sounds like you're agreeing that there is too much data in this
interface for applications to _regularly_ parse it.  We need some
central thing that parses it all and caches the results.


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05  2:31                                 ` Jerome Glisse
@ 2018-12-05 17:41                                   ` Logan Gunthorpe
  2018-12-05 18:07                                     ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 17:41 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-04 7:31 p.m., Jerome Glisse wrote:
> How can i express multiple link, or memory that is only accessible
> by a subset of the devices/CPUs. In today model they are back in
> assumption like everyone can access all the node which do not hold
> in what i am trying to do.

Well multiple links are easy when you have a 'link' bus. Just add
another link device under the bus.

Technically, the accessibility issue is already encoded in sysfs. For
example, through the PCI tree you can determine which ACS bits are set
and determine which devices are behind the same root bridge the same way
we do in the kernel p2pdma subsystem. This is all bus specific which is
fine, but if we want to change that, we should have a common way for
existing buses to describe these attributes in the existing tree. The
new 'link' bus devices would have to have some way to describe cases if
memory isn't accessible in some way across it.

But really, I would say the kernel is responsible for telling you when
memory is accessible to a list of initiators, so it should be part of
the checks in a theoretical hbind api. This is already the approach
p2pdma takes in-kernel: we have functions that tell you if two PCI
devices can talk to each other and we have functions to give you memory
accessible by a set of devices. What we don't have is a special tree
that p2pdma users have to walk through to determine accessibility.

In my eyes, you are just conflating a bunch of different issues that
are better solved independently in the existing frameworks we have. And
if they were tackled individually, you'd have a much easier time getting
them merged one by one.

Logan


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-05 17:27         ` Dave Hansen
@ 2018-12-05 17:53           ` Jerome Glisse
  2018-12-06 18:25             ` Dave Hansen
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 17:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Wed, Dec 05, 2018 at 09:27:09AM -0800, Dave Hansen wrote:
> On 12/4/18 6:13 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> >> OK, but there are 1024*1024 matrix cells on a systems with 1024
> >> proximity domains (ACPI term for NUMA node).  So it sounds like you are
> >> proposing a million-directory approach.
> > 
> > No, pseudo code:
> >     struct list links;
> > 
> >     for (unsigned r = 0; r < nrows; r++) {
> >         for (unsigned c = 0; c < ncolumns; c++) {
> >             if (!link_find(links, hmat[r][c].bandwidth,
> >                            hmat[r][c].latency)) {
> >                 link = link_new(hmat[r][c].bandwidth,
> >                                 hmat[r][c].latency);
> >                 // add initiator and target correspond to that row
> >                 // and columns to this new link
> >                 list_add(&link, links);
> >             }
> >         }
> >     }
> > 
> > So all cells that have same property are under the same link. 
> 
> OK, so the "link" here is like a cable.  It's like saying, "we have a
> network and everything is connected with an ethernet cable that can do
> 1gbit/sec".
> 
> But, what actually connects an initiator to a target?  I assume we still
> need to know which link is used for each target/initiator pair.  Where
> is that enumerated?

ls /sys/bus/hms/devices/v0-0-link/
node0           power           subsystem       uevent
uid             bandwidth       latency         v0-1-target
v0-15-initiator v0-21-target    v0-4-initiator  v0-7-initiator
v0-10-initiator v0-13-initiator v0-16-initiator v0-2-initiator
v0-11-initiator v0-14-initiator v0-17-initiator v0-3-initiator
v0-5-initiator  v0-8-initiator  v0-6-initiator  v0-9-initiator
v0-12-initiator v0-10-initiator

So the above is 16 CPUs (initiators*) and 2 targets, all connected
through a common link. This means that all the initiators
connected to this link can access all the targets connected to
this link. The bandwidth and latency are the best-case scenario,
for instance when only one initiator is accessing the target.
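
As an illustration of how userspace might sort the entries of such a link
directory, here is a minimal C sketch. It assumes that initiators and
targets can be recognized purely by the "-initiator" and "-target" name
suffixes seen in the listing; that convention is inferred from the example
output only, and hms_classify() and hms_count() are hypothetical helper
names, not part of the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum hms_kind { HMS_OTHER, HMS_INITIATOR, HMS_TARGET };

/* Return 1 if name ends with suffix, 0 otherwise. */
static int ends_with(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);

    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/* Classify one directory entry name by its suffix (assumed convention). */
static enum hms_kind hms_classify(const char *name)
{
    assert(name != NULL);
    if (ends_with(name, "-initiator"))
        return HMS_INITIATOR;
    if (ends_with(name, "-target"))
        return HMS_TARGET;
    return HMS_OTHER; /* attributes such as bandwidth, latency, uid, ... */
}

/* Count the initiators and targets among n entry names. */
static void hms_count(const char *const *names, size_t n,
                      size_t *ninit, size_t *ntgt)
{
    *ninit = 0;
    *ntgt = 0;
    for (size_t i = 0; i < n; i++) {
        switch (hms_classify(names[i])) {
        case HMS_INITIATOR:
            (*ninit)++;
            break;
        case HMS_TARGET:
            (*ntgt)++;
            break;
        default:
            break;
        }
    }
}
```

A real tool would read the names with opendir()/readdir() and then follow
each symlink; the sketch only covers the classification step.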

An initiator can only access targets it shares a link with, or
targets reachable over an extended path through a bridge. So if you
have an initiator connected to link0 and a target connected to link1,
and there is a bridge from link0 to link1, then the initiator can
access the target memory on link1, but the best-case path properties
are bounded by the worst hop:
    bandwidth: min(link0.bandwidth, link1.bandwidth, bridge.bandwidth)
    latency:   max(link0.latency, link1.latency, bridge.latency)
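
The combination of per-hop properties along such a path can be sketched
in C as below. The struct hop type and path_best_case() are hypothetical
names used only for illustration. Bandwidth is limited by the narrowest
hop. For latency the sketch assumes the slowest hop bounds the path (a max
over hops); that choice is an assumption here, and summing the per-hop
latencies would be another reasonable model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical best-case properties of one hop (a link or a bridge). */
struct hop {
    unsigned long bandwidth;
    unsigned long latency;
};

/*
 * Combine the hops of a path (e.g. link0, bridge, link1) into best-case
 * path properties: the narrowest bandwidth and the largest latency.
 */
static struct hop path_best_case(const struct hop *hops, size_t n)
{
    assert(hops != NULL && n > 0);

    struct hop path = hops[0];

    for (size_t i = 1; i < n; i++) {
        if (hops[i].bandwidth < path.bandwidth)
            path.bandwidth = hops[i].bandwidth;
        if (hops[i].latency > path.latency)
            path.latency = hops[i].latency;
    }
    return path;
}
```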

You can really match a link one-to-one with a bus in your
system. For instance with PCIE, if you only have 16-lane
PCIE devices you only define one link directory for all
your PCIE devices (ignore the PCIE peer-to-peer scenario
here). You add a bridge between your PCIE link and your
NUMA node link (the node to which this PCIE root complex
belongs); this means that PCIE devices can access the local
node memory with the given bandwidth and latency (best case).


> 
> I think this just means we need a million symlinks to a "link" instead
> of a million link directories.  Still not great.
> 
> > Note that userspace can parse all this once during its initialization
> > and create pools of target to use.
> 
> It sounds like you're agreeing that there is too much data in this
> interface for applications to _regularly_ parse it.  We need some
> central thing that parses it all and caches the results.

No. There are 2 kinds of applications:
    1) the average one: I am using devices {1, 3, 9}, give me the best
       memory for those devices
    2) the advanced one: what is the topology of this system? It parses
       the topology and partitions its workload accordingly

For case 1 you can pre-parse things, and this can be done by a helper
library. But for case 2 there is no amount of pre-parsing you can do in
the kernel: only the application knows its own architecture, and thus only
the application knows what matters in the topology. Is the application
looking for a big chunk of memory even if it is slow? Is it also looking
for fast memory close to X and Y? ...
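
Case 1 is the kind of query a helper library could answer. A minimal C
sketch follows, assuming the library has already parsed the topology into
a flat table; struct target, its reachable_by bitmask and best_target()
are hypothetical names, and "best" is taken to mean highest best-case
bandwidth, which is only one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-parsed description of one target memory. */
struct target {
    int id;
    uint64_t reachable_by; /* bit i set: initiator i can access this target */
    uint64_t bandwidth;    /* best-case bandwidth */
};

/*
 * Return the index of the highest-bandwidth target that every initiator
 * in the `wanted` bitmask can access, or -1 if no such target exists.
 */
static int best_target(const struct target *t, size_t n, uint64_t wanted)
{
    int best = -1;

    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((t[i].reachable_by & wanted) != wanted)
            continue; /* at least one wanted initiator cannot reach it */
        if (best < 0 || t[i].bandwidth > t[best].bandwidth)
            best = (int)i;
    }
    return best;
}
```

For devices {1, 3, 9} the caller would pass a mask with bits 1, 3 and 9
set. Case 2 has no such generic helper, since the selection criteria live
in the application.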

Each application will care about different things and there is no telling
what it's gonna be.

So what I am saying is that this information is likely to be parsed once
by the application during startup, i.e. the sysfs is not something that
is continuously read and parsed by the application (unless the application
also cares about hotplug, and then we are talking about the 1% of the 1%).

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 17:25                                     ` Logan Gunthorpe
@ 2018-12-05 18:01                                       ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 18:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 10:25:31AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 7:37 p.m., Jerome Glisse wrote:
> >>
> >> This came up before for apis even better defined than HMS as well as
> >> more limited scope, i.e. experimental ABI availability only for -rc
> >> kernels. Linus said this:
> >>
> >> "There are no loopholes. No "but it's been only one release". No, no,
> >> no. The whole point is that users are supposed to be able to *trust*
> >> the kernel. If we do something, we keep on doing it.
> >>
> >> And if it makes it harder to add new user-visible interfaces, then
> >> that's a *good* thing." [1]
> >>
> >> The takeaway being don't land work-in-progress ABIs in the kernel.
> >> Once an application depends on it, there are no more incompatible
> >> changes possible regardless of the warnings, experimental notices, or
> >> "staging" designation. DAX is experimental because there are cases
> >> where it currently does not work with respect to another kernel
> >> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> >> to hide behind an experimental designation, and fix them in a way that
> >> preserves the user visible behavior that has already been exposed,
> >> i.e. no regressions.
> >>
> >> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
> > 
> > So i guess i am heading down the vXX road ... such is my life :)
> 
> I recommend against it. I really haven't been convinced by any of your
> arguments for having a second topology tree. The existing topology tree
> in sysfs already better describes the links between hardware right now,
> except for the missing GPU links (and those should be addressable within
> the GPU community). Plus, maybe, some other enhancements to sockets/numa
> node descriptions if there's something missing there.
> 
> Then, 'hbind' is another issue but I suspect it would be better
> implemented as an ioctl on existing GPU interfaces. I certainly can't
> see any benefit in using it myself.
> 
> It's better to take an approach that would be less controversial with
> the community than to brow beat them with a patch set 20+ times until
> they take it.

So here is what I am gonna do, because I need this code now. I am gonna
split the helper code that does policy and hbind away from its sysfs
parts and turn it into helpers that each device driver can use. I will
move the sysfs and syscall parts into a patchset of their own, which
uses the exact same infrastructure.

This means that I am losing a feature, since userspace cannot provide
a list of multiple device memories to use (which is much more common
than you might think), but at least I can provide something for
the single-device case through ioctl.

I am not giving up on sysfs or the syscall, as this is needed long term.
So I am gonna improve it, port existing userspace (OpenCL, ROCm, ...) to
use it (in a branch) and demonstrate how it gets used by end applications.
I will beat it again and again until either I convince people through
hard evidence or I get bored. I do not get bored easily :)

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 17:41                                   ` Logan Gunthorpe
@ 2018-12-05 18:07                                     ` Jerome Glisse
  2018-12-05 18:20                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 18:07 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 10:41:56AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 7:31 p.m., Jerome Glisse wrote:
> > How can i express multiple link, or memory that is only accessible
> > by a subset of the devices/CPUs. In today model they are back in
> > assumption like everyone can access all the node which do not hold
> > in what i am trying to do.
> 
> Well multiple links are easy when you have a 'link' bus. Just add
> another link device under the bus.

So you are telling me to do what I am doing in this patchset, but not
under the HMS directory?

> 
> Technically, the accessibility issue is already encoded in sysfs. For
> example, through the PCI tree you can determine which ACS bits are set
> and determine which devices are behind the same root bridge the same way
> we do in the kernel p2pdma subsystem. This is all bus specific which is
> fine, but if we want to change that, we should have a common way for
> existing buses to describe these attributes in the existing tree. The
> new 'link' bus devices would have to have some way to describe cases if
> memory isn't accessible in some way across it.

What I am looking at is much more complex than just an access bit. It
is a whole set of properties attached to each path (can it be cache
coherent? can it do atomics? what is the access granularity? what
is the bandwidth? is it a dedicated link? ...)

> 
> But really, I would say the kernel is responsible for telling you when
> memory is accessible to a list of initiators, so it should be part of
> the checks in a theoretical hbind api. This is already the approach
> p2pdma takes in-kernel: we have functions that tell you if two PCI
> devices can talk to each other and we have functions to give you memory
> accessible by a set of devices. What we don't have is a special tree
> that p2pdma users have to walk through to determine accessibility.

You do not need it, but I do. There are users out there that already
depend on this information and get it through non-standard ways.
I do want to provide a standard way for userspace to get it.
There are real users out there, and I believe there would be more users
if we had a standard way to provide it. You do not believe in it, fine.
I will do more work in userspace, write more examples, and come back
with more hard evidence until I convince enough people.

> 
> In my eye's, you are just conflating a bunch of different issues that
> are better solved independently in the existing frameworks we have. And
> if they were tackled individually, you'd have a much easier time getting
> them merged one by one.

I don't think I can convince you otherwise. There are users that use
topology; please look at the links I provided. Those folks have running
programs _today_; they rely on non-standard APIs and would like to move
toward a standard API, as it would improve their lives.

On top of that I argue that more people would use that information if it
were available to them. I agree that I have no hard evidence to back that
up and that it is just a feeling, but you cannot disprove me either, as
this is a chicken-and-egg problem: you cannot prove people will not use
an API if the API is not there to be used.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 18:07                                     ` Jerome Glisse
@ 2018-12-05 18:20                                       ` Logan Gunthorpe
  2018-12-05 18:33                                         ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 18:20 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-05 11:07 a.m., Jerome Glisse wrote:
>> Well multiple links are easy when you have a 'link' bus. Just add
>> another link device under the bus.
> 
> So you are telling do what i am doing in this patchset but not under
> HMS directory ?

No, it's completely different. I'm talking about creating a bus to
describe only the real hardware that links GPUs. Not creating a new
virtual tree containing a bunch of duplicate bus and device information
that already exists currently in sysfs.

>>
>> Technically, the accessibility issue is already encoded in sysfs. For
>> example, through the PCI tree you can determine which ACS bits are set
>> and determine which devices are behind the same root bridge the same way
>> we do in the kernel p2pdma subsystem. This is all bus specific which is
>> fine, but if we want to change that, we should have a common way for
>> existing buses to describe these attributes in the existing tree. The
>> new 'link' bus devices would have to have some way to describe cases if
>> memory isn't accessible in some way across it.
> 
> What i am looking at is much more complex than just access bit. It
> is a whole set of properties attach to each path (can it be cache
> coherent ? can it do atomic ? what is the access granularity ? what
> is the bandwidth ? is it dedicated link ? ...)

I'm not talking about just an access bit. I'm talking about what you are
describing: standard ways for *existing* buses in the sysfs hierarchy to
describe things like cache coherency, atomics, granularity, etc without
creating a new hierarchy.

> On top of that i argue that more people would use that information if it
> were available to them. I agree that i have no hard evidence to back that
> up and that it is just a feeling but you can not disprove me either as
> this is a chicken and egg problem, you can not prove people will not use
> an API if the API is not there to be use.

And you miss my point that much of this information is already available
to them. And more can be added in the existing framework without
creating any brand new concepts. I haven't said anything about
chicken-and-egg problems -- I've given you a bunch of different
suggestions to split this up into more manageable problems and address
many of them within the APIs and frameworks we have already.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 18:20                                       ` Logan Gunthorpe
@ 2018-12-05 18:33                                         ` Jerome Glisse
  2018-12-05 18:48                                           ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 18:33 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 11:20:30AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 11:07 a.m., Jerome Glisse wrote:
> >> Well multiple links are easy when you have a 'link' bus. Just add
> >> another link device under the bus.
> > 
> > So you are telling do what i am doing in this patchset but not under
> > HMS directory ?
> 
> No, it's completely different. I'm talking about creating a bus to
> describe only the real hardware that links GPUs. Not creating a new
> virtual tree containing a bunch of duplicate bus and device information
> that already exists currently in sysfs.
> 
> >>
> >> Technically, the accessibility issue is already encoded in sysfs. For
> >> example, through the PCI tree you can determine which ACS bits are set
> >> and determine which devices are behind the same root bridge the same way
> >> we do in the kernel p2pdma subsystem. This is all bus specific which is
> >> fine, but if we want to change that, we should have a common way for
> >> existing buses to describe these attributes in the existing tree. The
> >> new 'link' bus devices would have to have some way to describe cases if
> >> memory isn't accessible in some way across it.
> > 
> > What i am looking at is much more complex than just access bit. It
> > is a whole set of properties attach to each path (can it be cache
> > coherent ? can it do atomic ? what is the access granularity ? what
> > is the bandwidth ? is it dedicated link ? ...)
> 
> I'm not talking about just an access bit. I'm talking about what you are
> describing: standard ways for *existing* buses in the sysfs hierarchy to
> describe things like cache coherency, atomics, granularity, etc without
> creating a new hierarchy.
> 
> > On top of that i argue that more people would use that information if it
> > were available to them. I agree that i have no hard evidence to back that
> > up and that it is just a feeling but you can not disprove me either as
> > this is a chicken and egg problem, you can not prove people will not use
> > an API if the API is not there to be use.
> 
> And you miss my point that much of this information is already available
> to them. And more can be added in the existing framework without
> creating any brand new concepts. I haven't said anything about
> chicken-and-egg problems -- I've given you a bunch of different
> suggestions to split this up into more managable problems and address
> many of them within the APIs and frameworks we have already.

The thing is that what I am considering is not in sysfs; it does not
even have a Linux kernel driver. It is just chips that connect devices
to each other, and there is nothing to do with those chips: it is all
hardware and they do not need a driver. So there is nothing existing
that addresses what I need to represent.

If I add a fake driver for those, what would I do? Under which
sub-system do I register them? How do I express the fact that they
connect devices X, Y and Z with some properties?

This is not PCIE ... you cannot discover bridges and children, it is
not a tree-like structure, it is a random graph (which depends
on how the OEM wires the ports on the chips).


So I have no pre-existing driver, they are not in sysfs today, and
they do not need a driver. Hence why I proposed what I proposed:
a sysfs hierarchy where I can add those "virtual" objects and show
how they connect existing devices for which we have a sysfs directory
to symlink.


Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 18:33                                         ` Jerome Glisse
@ 2018-12-05 18:48                                           ` Logan Gunthorpe
  2018-12-05 18:55                                             ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 18:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-05 11:33 a.m., Jerome Glisse wrote:
> If i add a a fake driver for those what would i do ? under which
> sub-system i register them ? How i express the fact that they
> connect device X,Y and Z with some properties ?

Yes this is exactly what I'm suggesting. I wouldn't call it a fake
driver, but a new struct device describing an actual device in the
system. It would be a feature of the GPU subsystem seeing this is a
feature of GPUs. Expressing that the new devices connect to a specific
set of GPUs is not a hard problem to solve.

> This is not PCIE ... you can not discover bridges and child, it
> not a tree like structure, it is a random graph (which depends
> on how the OEM wire port on the chips).

You must be able to discover that these links exist and register a
device with the system. Where else do you get the information currently?
The suggestion doesn't change anything to do with how you interact with
hardware, only how you describe the information within the kernel.

> So i have not pre-existing driver, they are not in sysfs today and
> they do not need a driver. Hence why i proposed what i proposed
> a sysfs hierarchy where i can add those "virtual" object and shows
> how they connect existing device for which we have a sysfs directory
> to symlink.

So add a new driver -- that's what I've been suggesting all along.
Having a driver not exist is no reason to not create one. I'd suggest
that if you want them to show up in the sysfs hierarchy then you do need
some kind of driver code to create a struct device. Just because the
kernel doesn't have to interact with them is no reason not to create a
struct device. It's *much* easier to create a new driver subsystem than
a whole new userspace API.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 18:48                                           ` Logan Gunthorpe
@ 2018-12-05 18:55                                             ` Jerome Glisse
  2018-12-05 19:10                                               ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 18:55 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 11:48:37AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 11:33 a.m., Jerome Glisse wrote:
> > If i add a a fake driver for those what would i do ? under which
> > sub-system i register them ? How i express the fact that they
> > connect device X,Y and Z with some properties ?
> 
> Yes this is exactly what I'm suggesting. I wouldn't call it a fake
> driver, but a new struct device describing an actual device in the
> system. It would be a feature of the GPU subsystem seeing this is a
> feature of GPUs. Expressing that the new devices connect to a specific
> set of GPUs is not a hard problem to solve.
> 
> > This is not PCIE ... you can not discover bridges and child, it
> > not a tree like structure, it is a random graph (which depends
> > on how the OEM wire port on the chips).
> 
> You must be able to discover that these links exist and register a
> device with the system. Where else do you get the information currently?
> The suggestion doesn't change anything to do with how you interact with
> hardware, only how you describe the information within the kernel.
> 
> > So i have not pre-existing driver, they are not in sysfs today and
> > they do not need a driver. Hence why i proposed what i proposed
> > a sysfs hierarchy where i can add those "virtual" object and shows
> > how they connect existing device for which we have a sysfs directory
> > to symlink.
> 
> So add a new driver -- that's what I've been suggesting all along.
> Having a driver not exist is no reason to not create one. I'd suggest
> that if you want them to show up in the sysfs hierarchy then you do need
> some kind of driver code to create a struct device. Just because the
> kernel doesn't have to interact with them is no reason not to create a
> struct device. It's *much* easier to create a new driver subsystem than
> a whole new userspace API.

So now once the next type of device shows up with the exact same thing,
let's say FPGA, we have to create a new subsystem for them too. Also
this makes userspace's life much, much harder. Now userspace must
go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and
merge all that different information together and rebuild in userspace
the representation I am putting forward in this patchset.

And the kernel will have no way to provide quirks and workarounds
when some merging is actually illegal on a given platform (like some
link from a subsystem not being accessible through the PCI connection
of one of the devices connected to that link).

So it means userspace will have to grow its own database of
workarounds and quirks and I am back in the situation I am in today.

Not very convincing to me. What I am proposing here is a new common
description provided by the kernel where we can reconcile weird
interactions.

But I doubt I can convince you. I will make progress on what I need
today and keep working on sysfs.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 18:55                                             ` Jerome Glisse
@ 2018-12-05 19:10                                               ` Logan Gunthorpe
  2018-12-05 22:58                                                 ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 19:10 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-05 11:55 a.m., Jerome Glisse wrote:
> So now once next type of device shows up with the exact same thing
> let say FPGA, we have to create a new subsystem for them too. Also
> this make the userspace life much much harder. Now userspace must
> go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and
> merge all that different information together and rebuild the
> representation i am putting forward in this patchset in userspace.

Yes. But seeing such FPGA links aren't common yet and there isn't really
much in terms of common FPGA infrastructure in the kernel (which is
hard seeing the hardware is infinitely customizable) you can let the
people developing FPGA code worry about it and come up with their own
solution. Buses between FPGAs may end up never being common enough for
people to care, or they may end up being so weird that they need their
own description independent of GPUs, or maybe when they become common
they find a way to use the GPU link subsystem -- who knows. Don't try to
design for use cases that don't exist yet.

Yes, userspace will have to know about all the buses it cares to find
links over. Sounds like a perfect thing for libhms to do.

> There is no telling that kernel won't be able to provide quirk and
> workaround because some merging is actually illegal on a given
> platform (like some link from a subsystem is not accessible through
> the PCI connection of one of the device connected to that link).

These are all just different individual problems which need different
solutions not grand new design concepts.

> So it means userspace will have to grow its own database or work-
> around and quirk and i am back in the situation i am in today.

No, as I've said, quirks are firmly the responsibility of kernels.
Userspace will need to know how to work with the different buses and
CPU/node information but there really isn't that many of these to deal
with and this is a much easier approach than trying to come up with a
new API that can wrap the nuances of all existing and potential future
bus types we may have to deal with.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 19:10                                               ` Logan Gunthorpe
@ 2018-12-05 22:58                                                 ` Jerome Glisse
  2018-12-05 23:09                                                   ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 22:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 12:10:10PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 11:55 a.m., Jerome Glisse wrote:
> > So now once next type of device shows up with the exact same thing
> > let say FPGA, we have to create a new subsystem for them too. Also
> > this make the userspace life much much harder. Now userspace must
> > go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and
> > merge all that different information together and rebuild the
> > representation i am putting forward in this patchset in userspace.
> 
> Yes. But seeing such FPGA links aren't common yet and there isn't really
> much in terms of common FPGA infrastructure in the kernel (which are
> hard seeing the hardware is infinitely customization) you can let the
> people developing FPGA code worry about it and come up with their own
> solution. Buses between FPGAs may end up never being common enough for
> people to care, or they may end up being so weird that they need their
> own description independent of GPUS, or maybe when they become common
> they find a way to use the GPU link subsystem -- who knows. Don't try to
> design for use cases that don't exist yet.
> 
> Yes, userspace will have to know about all the buses it cares to find
> links over. Sounds like a perfect thing for libhms to do.

So just to be clear here is how i understand your position:
"Single coherent sysfs hierarchy to describe something is useless
 let's git rm drivers/base/"

While I am arguing that "hey, /sys/bus/node/devices/* is nice
but it just does not cut it for all these new hardware platforms;
if I add new nodes there for my new memory I will break tons of
existing applications. So what about a new hierarchy that allows
describing those new hardware platforms in a single place, like
today's node thing"


> 
> > There is no telling that kernel won't be able to provide quirk and
> > workaround because some merging is actually illegal on a given
> > platform (like some link from a subsystem is not accessible through
> > the PCI connection of one of the device connected to that link).
> 
> These are all just different individual problems which need different
> solutions not grand new design concepts.
> 
> > So it means userspace will have to grow its own database or work-
> > around and quirk and i am back in the situation i am in today.
> 
> No, as I've said, quirks are firmly the responsibility of kernels.
> Userspace will need to know how to work with the different buses and
> CPU/node information but there really isn't that many of these to deal
> with and this is a much easier approach than trying to come up with a
> new API that can wrap the nuances of all existing and potential future
> bus types we may have to deal with.

No can do, that is what I am trying to explain. Say I have bus 1 in
sub-system A, and usually that kind of bus can serve as a bridge for
PCIE, i.e. a CPU can access a device behind it by going through a PCIE
device first. So now the userspace library has this knowledge
baked in. Now if a platform has a bug, for whatever reason, where
that does not hold, the kernel has no way to tell userspace that
there is an exception there. It is up to userspace to have a
database of quirks.

The kernel sees all those objects in isolation in your scheme. While
in what I am proposing there is only one place, and any device that
participates in this common place can report any quirks so that a
coherent view is given to userspace.

If we have a gazillion places where all this information is spread
around then we have no way to fix weird interactions between any
of those.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 22:58                                                 ` Jerome Glisse
@ 2018-12-05 23:09                                                   ` Logan Gunthorpe
  2018-12-05 23:20                                                     ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 23:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-05 3:58 p.m., Jerome Glisse wrote:
> So just to be clear here is how i understand your position:
> "Single coherent sysfs hierarchy to describe something is useless
>  let's git rm drivers/base/"

I have no idea what you're talking about. I'm saying the existing sysfs
hierarchy *should* be used for this application -- we shouldn't be
creating another hierarchy.

> While i am arguing that "hey the /sys/bus/node/devices/* is nice
> but it just does not cut it for all this new hardware platform
> if i add new nodes there for my new memory i will break tons of
> existing application. So what about a new hierarchy that allow
> to describe those new hardware platform in a single place like
> today node thing"

I'm talking about /sys/bus and all the bus information under there; not
just the node hierarchy. With this information, you can figure out how
any struct device is connected to another struct device. This has little
to do with a hypothetical memory device and what it might expose. You're
conflating memory devices with links between devices (ie. buses).


> No can do that is what i am trying to explain. So if i bus 1 in a
> sub-system A and usualy that kind of bus can serve a bridge for
> PCIE ie a CPU can access device behind it by going through a PCIE
> device first. So now the userspace libary have this knowledge
> bake in. Now if a platform has a bug for whatever reasons where
> that does not hold, the kernel has no way to tell userspace that
> there is an exception there. It is up to userspace to have a data
> base of quirks.

> Kernel see all those objects in isolation in your scheme. While in
> what i am proposing there is only one place and any device that
> participate in this common place can report any quirks so that a
> coherent view is given to user space.

The above makes no sense to me.


> If we have gazillion of places where all this informations is spread
> around than we have no way to fix weird inter-action between any
> of those.

So work to standardize it so that all buses present a consistent view of
what guarantees they provide for bus accesses. Quirks could then adjust
that information for systems that may be broken.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 23:09                                                   ` Logan Gunthorpe
@ 2018-12-05 23:20                                                     ` Jerome Glisse
  2018-12-05 23:23                                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 23:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 04:09:29PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 3:58 p.m., Jerome Glisse wrote:
> > So just to be clear here is how i understand your position:
> > "Single coherent sysfs hierarchy to describe something is useless
> >  let's git rm drivers/base/"
> 
> I have no idea what you're talking about. I'm saying the existing sysfs
> hierarchy *should* be used for this application -- we shouldn't be
> creating another hierarchy.
> 
> > While i am arguing that "hey the /sys/bus/node/devices/* is nice
> > but it just does not cut it for all this new hardware platform
> > if i add new nodes there for my new memory i will break tons of
> > existing application. So what about a new hierarchy that allow
> > to describe those new hardware platform in a single place like
> > today node thing"
> 
> I'm talking about /sys/bus and all the bus information under there; not
> just the node hierarchy. With this information, you can figure out how
> any struct device is connected to another struct device. This has little
> to do with a hypothetical memory device and what it might expose. You're
> conflating memory devices with links between devices (ie. buses).


And my proposal is under /sys/bus and has symlinks to all the existing
devices it aggregates in there.

For device memory I explained why it does not make sense to expose
it as a node. So now how do I expose it? Yes, I can expose it under
the device directory, but then I cannot present the properties of
that memory which depend on which bus and which bridges it is
accessed through.

So I need bus and bridge objects so that I can express the properties
that depend on the path between the initiator and the target memory.
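
To make the point about path-dependent properties concrete, here is a
minimal sketch in C. It is not code from the patchset: the names
sketch_link, sketch_path_props and sketch_path_properties are
hypothetical, and the composition rule (path bandwidth is the minimum
over the hops, path latency is the sum) is an assumption made purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one hop (a bus or a bridge) on a path. */
struct sketch_link {
    unsigned long bandwidth_mbps; /* bandwidth this hop can sustain */
    unsigned long latency_ns;     /* latency this hop adds */
};

/* Properties seen by one initiator for one target memory. */
struct sketch_path_props {
    unsigned long bandwidth_mbps;
    unsigned long latency_ns;
};

/*
 * Assumed rule: the slowest hop bounds the bandwidth of the whole
 * path, and every hop adds its own latency.
 */
static struct sketch_path_props
sketch_path_properties(const struct sketch_link *links, size_t nlinks)
{
    struct sketch_path_props props = { 0, 0 };
    size_t i;

    assert(links != NULL && nlinks > 0);
    props.bandwidth_mbps = links[0].bandwidth_mbps;
    for (i = 0; i < nlinks; ++i) {
        if (links[i].bandwidth_mbps < props.bandwidth_mbps)
            props.bandwidth_mbps = links[i].bandwidth_mbps;
        props.latency_ns += links[i].latency_ns;
    }
    return props;
}
```

The same target memory reached through two different lists of hops
yields two different results, which is why a single attribute under
the device directory cannot carry this information.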

I argue it is better to expose all this under the same directory.
You say it is not. We have NUMA as an example that shows everything
under a single hierarchy, so to me you are saying it is useless and has
no value. I say the NUMA thing has value and I would like something
like it, just with more stuff and with the capability of describing
any kind of graph.


I just do not see how I can achieve my objectives any differently.

I think we are just talking past each other and this is likely a
pointless conversation. I will keep working on this in the meantime.


> > No can do that is what i am trying to explain. So if i bus 1 in a
> > sub-system A and usualy that kind of bus can serve a bridge for
> > PCIE ie a CPU can access device behind it by going through a PCIE
> > device first. So now the userspace libary have this knowledge
> > bake in. Now if a platform has a bug for whatever reasons where
> > that does not hold, the kernel has no way to tell userspace that
> > there is an exception there. It is up to userspace to have a data
> > base of quirks.
> 
> > Kernel see all those objects in isolation in your scheme. While in
> > what i am proposing there is only one place and any device that
> > participate in this common place can report any quirks so that a
> > coherent view is given to user space.
> 
> The above makes no sense to me.
> 
> 
> > If we have gazillion of places where all this informations is spread
> > around than we have no way to fix weird inter-action between any
> > of those.
> 
> So work to standardize it so that all buses present a consistent view of
> what guarantees they provide for bus accesses. Quirks could then adjust
> that information for systems that may be broken.

So you agree with my proposal? A sysfs directory which describes all
the buses, how they are connected to each other, and what is connected
to each of them (device, CPU, memory).

This is really confusing.

Cheers,
Jérôme


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 23:20                                                     ` Jerome Glisse
@ 2018-12-05 23:23                                                       ` Logan Gunthorpe
  2018-12-05 23:27                                                         ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-05 23:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell



On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> And my proposal is under /sys/bus and have symlink to all existing
> device it agregate in there.

That's so not the point. Use the existing buses; don't invent some
virtual tree. I don't know how many times I have to say this or in how
many ways. I'm not responding anymore.

> So you agree with my proposal ? A sysfs directory in which all the
> bus and how they are connected to each other and what is connected
> to each of them (device, CPU, memory).

I'm fine with the motivation. What I'm arguing against is the
implementation and the fact you have to create a whole grand new
userspace API and hierarchy to accomplish it.

Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 23:23                                                       ` Logan Gunthorpe
@ 2018-12-05 23:27                                                         ` Jerome Glisse
  2018-12-06  0:08                                                           ` Dan Williams
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-05 23:27 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dan Williams, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> > And my proposal is under /sys/bus and have symlink to all existing
> > device it agregate in there.
> 
> That's so not the point. Use the existing buses don't invent some
> virtual tree. I don't know how many times I have to say this or in how
> many ways. I'm not responding anymore.

And how do I express interactions between different buses? I just
do not see how to do that in the existing scheme. It would be like
teaching each bus about all the other buses, versus having each bus
register itself under a common framework and having all the
interactions between buses mediated through that common framework,
avoiding code duplication across buses.

> 
> > So you agree with my proposal ? A sysfs directory in which all the
> > bus and how they are connected to each other and what is connected
> > to each of them (device, CPU, memory).
> 
> I'm fine with the motivation. What I'm arguing against is the
> implementation and the fact you have to create a whole grand new
> userspace API and hierarchy to accomplish it.
> 
> Logan


* Re: [RFC PATCH 02/14] mm/hms: heterogenenous memory system (HMS) documentation
  2018-12-05 23:27                                                         ` Jerome Glisse
@ 2018-12-06  0:08                                                           ` Dan Williams
  0 siblings, 0 replies; 94+ messages in thread
From: Dan Williams @ 2018-12-06  0:08 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Logan Gunthorpe, Andi Kleen, Linux MM, Andrew Morton,
	Linux Kernel Mailing List, Rafael J. Wysocki, Dave Hansen,
	Haggai Eran, balbirs, Aneesh Kumar K.V, Benjamin Herrenschmidt,
	Kuehling, Felix, Philip.Yang, Koenig, Christian, Blinzer, Paul,
	John Hubbard, rcampbell

On Wed, Dec 5, 2018 at 3:27 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> > > And my proposal is under /sys/bus and have symlink to all existing
> > > device it agregate in there.
> >
> > That's so not the point. Use the existing buses don't invent some
> > virtual tree. I don't know how many times I have to say this or in how
> > many ways. I'm not responding anymore.
>
> And how do i express interaction with different buses because i just
> do not see how to do that in the existing scheme. It would be like
> teaching to each bus about all the other bus versus having each bus
> register itself under a common framework and have all the interaction
> between bus mediated through that common framework avoiding code
> duplication accross buses.
>
> >
> > > So you agree with my proposal ? A sysfs directory in which all the
> > > bus and how they are connected to each other and what is connected
> > > to each of them (device, CPU, memory).
> >
> > I'm fine with the motivation. What I'm arguing against is the
> > implementation and the fact you have to create a whole grand new
> > userspace API and hierarchy to accomplish it.

Right, GPUs show up in /sys today. Don't register a whole new
hierarchy as an alias to what already exists, add a new attribute
scheme to the existing hierarchy. This is what the HMAT enabling is
doing, this is what p2pdma is doing.


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-05 17:53           ` Jerome Glisse
@ 2018-12-06 18:25             ` Dave Hansen
  2018-12-06 19:20               ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-06 18:25 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On 12/5/18 9:53 AM, Jerome Glisse wrote:
> No so there is 2 kinds of applications:
>     1) average one: i am using device {1, 3, 9} give me best memory for
>        those devices
...
> 
> For case 1 you can pre-parse stuff but this can be done by helper library

How would that work?  Would each user/container/whatever do this once?
Where would they keep the pre-parsed stuff?  How do they manage their
cache if the topology changes?


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 18:25             ` Dave Hansen
@ 2018-12-06 19:20               ` Jerome Glisse
  2018-12-06 19:31                 ` Dave Hansen
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-06 19:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote:
> On 12/5/18 9:53 AM, Jerome Glisse wrote:
> > No so there is 2 kinds of applications:
> >     1) average one: i am using device {1, 3, 9} give me best memory for
> >        those devices
> ...
> > 
> > For case 1 you can pre-parse stuff but this can be done by helper library
> 
> How would that work?  Would each user/container/whatever do this once?
> Where would they keep the pre-parsed stuff?  How do they manage their
> cache if the topology changes?

Short answer: I don't expect a cache. I expect that each program will
have an init function that queries the topology and updates the
application code accordingly. This is what people do today: query all
available devices, decide which ones to use and how, create a context
for each selected one, and define a memory migration job/memory policy
for each part of the program so that memory is migrated/has the proper
policy in place when the code that runs on some device is executed.


Long answer:

I cannot dictate how user folks write their programs, sadly :) I expect
that many applications will do it once during startup. Then you will
have all those container folks or VM folks that will get pressure to
react to hotplug. For instance if you upgrade your instance with your
cloud provider to have more GPUs or more TPUs ... It is likely to appear
as a hotplug from the VM/container point of view and thus as a hotplug
from the application point of view. So far the demonstrations I have
seen do that by relaunching the application ... More on that in the
live re-patching issues below.

Oh and I expect applications will crash if you hot-unplug anything they
are using (this is what happens now, I believe, in most APIs). Again I
expect that some pressure from cloud users and providers will force
programmers to be a bit more reactive to this kind of event.


Live re-patching application code can be difficult, I am told. Let's
say you have:

void compute_serious0_stuff(accelerator_t *accelerator, void *inputA,
                            size_t sinputA, void *inputB, size_t sinputB,
                            void *outputA, size_t soutputA)
{
    ...

    // Migrate the inputA to the accelerator memory
    api_migrate_memory_to_accelerator(accelerator, inputA, sinputA);

    // The inputB buffer is fine in its default placement

    // The output is assumed to be an empty vma, i.e. no page allocated yet
    // so set a policy to direct all allocation due to page fault to
    // use the accelerator memory
    api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA);

    ...
    for_parallel<accelerator> (i = 0; i < THEYAREAMILLIONSITEMS; ++i) {
        // Do something serious
    }
    ...
}

void serious0_orchestrator(topology topology, void *inputA,
                           void *inputB, void *outputA)
{
    static accelerator_t **selected = NULL;
    static serious0_job_partition *partition;
    ...
    if (selected == NULL) {
        serious0_select_and_partition(topology, &selected, &partition,
                                      inputA, inputB, outputA);
    }
    ...
    for(i = 0; i < nselected; ++i) {
        ...
        compute_serious0_stuff(selected[i],
                               inputA + partition[i].inputA_offset,
                               partition[i].inputA_size,
                               inputB + partition[i].inputB_offset,
                               partition[i].inputB_size,
                               outputA + partition[i].outputA_offset,
                               partition[i].outputA_size);
        ...
    }
    ...
    for(i = 0; i < nselected; ++i) {
        accelerator_wait_finish(selected[i]);
    }
    ...
    // outputA is ready to be used by the next function in the program
}

If you start without a GPU/TPU your for_parallel will use the CPU,
with the code the compiler emitted at build time. For GPU/TPU, at
build time you compile your for_parallel loop to some intermediate
representation (a virtual ISA); then at runtime, during application
initialization, that intermediate representation gets lowered to
all the available GPUs/TPUs on your system and each for_parallel loop
is patched to turn into a call to:

void dispatch_accelerator_function(accelerator_t *accelerator,
                                   void *function, ...)
{
}

So in the above example the for_parallel loop becomes:
dispatch_accelerator_function(accelerator, i_compute_serious_stuff,
                              inputA, inputB, outputA);

This hot patching of code is easy to do when no CPU thread is running
the code. However, when CPU threads are running it can be problematic.
I am sure you can do trickery like delaying the patching until the next
time the function gets called, by doing clever things at build time such
as prepending each for_parallel section with enough nops to allow
you to replace them with a call to the dispatch function and a jump
over the normal CPU code.
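
A minimal sketch of the "switch on the next call" idea, using an atomic
function pointer instead of patching nops in the instruction stream.
This is a simpler, swapped-in technique and not what a compiler runtime
necessarily does; every name below (loop_fn, sum_slot, retarget_sum,
...) is hypothetical, and the "accelerator" version simply runs on the
CPU so that the example stays self-contained.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this comes from the patchset. */
typedef long (*loop_fn)(const int *data, size_t n);

/* CPU version, i.e. the code the compiler emitted at build time. */
static long sum_on_cpu(const int *data, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

/* Stand-in for the accelerator path: it still computes on the CPU and
 * only records that this path was taken. */
static int accel_calls;

static long sum_on_accelerator(const int *data, size_t n)
{
    ++accel_calls;
    return sum_on_cpu(data, n);
}

/* One slot per for_parallel section, initially pointing at CPU code. */
static _Atomic(loop_fn) sum_slot = sum_on_cpu;

/* Called when a device shows up. Threads already inside the old
 * function finish there; later calls see the new target. */
static void retarget_sum(loop_fn fn)
{
    atomic_store(&sum_slot, fn);
}

/* What each call site does instead of running the loop inline. */
static long run_sum(const int *data, size_t n)
{
    loop_fn fn = atomic_load(&sum_slot);

    return fn(data, n);
}
```

The cost compared to real hot patching is one indirect call per
for_parallel section, in exchange for never touching code that another
thread may be executing.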


I think compiler people want to solve the static case first, i.e. during
application initialization decide which devices are going to be used and
then update the application accordingly. But I expect it will grow
to support hotplug, as relaunching the application is not that user
friendly, even in this day and age where people start millions of
containers with one mouse click.


Anyway, the above example is how it looks today, and the accelerator can
turn out to be just a regular CPU core if you do not have any devices.
The idea is that we would like a common API that covers both CPU threads
and device threads. Same for the migration/policy functions: if it
happens that the accelerator is just a plain old CPU then you want to
migrate memory to the CPU node and set the memory policy to that node
too.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 19:20               ` Jerome Glisse
@ 2018-12-06 19:31                 ` Dave Hansen
  2018-12-06 20:11                   ` Logan Gunthorpe
  2018-12-06 20:27                   ` Jerome Glisse
  0 siblings, 2 replies; 94+ messages in thread
From: Dave Hansen @ 2018-12-06 19:31 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On 12/6/18 11:20 AM, Jerome Glisse wrote:
>>> For case 1 you can pre-parse stuff but this can be done by helper library
>> How would that work?  Would each user/container/whatever do this once?
>> Where would they keep the pre-parsed stuff?  How do they manage their
>> cache if the topology changes?
> Short answer i don't expect a cache, i expect that each program will have
> a init function that query the topology and update the application codes
> accordingly.

My concern with having folks do per-program parsing, *and* having a huge
amount of data to parse makes it unusable.  The largest systems will
literally have hundreds of thousands of objects in /sysfs, even in a
single directory.  That makes readdir() basically impossible, and makes
even open() (if you already know the path you want somehow) hard to do fast.

I just don't think sysfs (or any filesystem, really) can scale to
express large, complicated topologies in a way that any normal program
can practically parse it.

My suspicion is that we're going to need to have the kernel parse and
cache these things.  We *might* have the data available in sysfs, but we
can't reasonably expect anyone to go parsing it.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 19:31                 ` Dave Hansen
@ 2018-12-06 20:11                   ` Logan Gunthorpe
  2018-12-06 22:04                     ` Dave Hansen
  2018-12-06 20:27                   ` Jerome Glisse
  1 sibling, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-06 20:11 UTC (permalink / raw)
  To: Dave Hansen, Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi



On 2018-12-06 12:31 p.m., Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
>>>> For case 1 you can pre-parse stuff but this can be done by helper library
>>> How would that work?  Would each user/container/whatever do this once?
>>> Where would they keep the pre-parsed stuff?  How do they manage their
>>> cache if the topology changes?
>> Short answer i don't expect a cache, i expect that each program will have
>> a init function that query the topology and update the application codes
>> accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.

Is this actually realistic? I find it hard to imagine an actual hardware
bus that can have even thousands of devices under a single node, let
alone hundreds of thousands. At some point the laws of physics apply.
For example, the most ports a single PCI switch can have these days is
under one hundred. I'd imagine any such large systems
would have a hierarchy of devices (ie. layers of switch-like devices)
which implies the existing sysfs bus/devices should have a path through
it without navigating a directory with that unreasonable a number of
objects in it. HMS, on the other hand, has all possible initiators
(etc.) under a single directory.

The caveat to this is, that to find an initial starting point in the bus
hierarchy you might have to go through /sys/dev/{block|char} or
/sys/class which may have directories with a large number of objects.
Though, such a system would necessarily have a similarly large number of
objects in /dev which means you will probably never get around the
readdir/open bottleneck you mention... and, thus, this doesn't seem
overly realistic to me.

Logan

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 19:31                 ` Dave Hansen
  2018-12-06 20:11                   ` Logan Gunthorpe
@ 2018-12-06 20:27                   ` Jerome Glisse
  2018-12-06 21:46                     ` Jerome Glisse
  1 sibling, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-06 20:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
> >>> For case 1 you can pre-parse stuff but this can be done by helper library
> >> How would that work?  Would each user/container/whatever do this once?
> >> Where would they keep the pre-parsed stuff?  How do they manage their
> >> cache if the topology changes?
> > Short answer i don't expect a cache, i expect that each program will have
> > a init function that query the topology and update the application codes
> > accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.
> 
> I just don't think sysfs (or any filesystem, really) can scale to
> express large, complicated topologies in a way that any normal program
> can practically parse it.
> 
> My suspicion is that we're going to need to have the kernel parse and
> cache these things.  We *might* have the data available in sysfs, but we
> can't reasonably expect anyone to go parsing it.

What I am failing to explain is that the kernel can not parse because
the kernel does not know what the application cares about, and every
single application will make different choices and thus select
different devices and memory.

It is not even going to be a thing like class A of applications will do
X and class B will do Y. Every single application in class A might do
something different because some care about the little details.

So any kind of pre-parsing in the kernel is defeated by the fact that the
kernel does not know what the application is looking for.

I do not see any way to express the application logic in something that
can be some kind of automaton or regular expression. The application can
literally introspect itself and the topology to partition its workload.
The topology and device selection is expected to be thousands of lines
of code in the most advanced applications.

Even worse, inside one and the same application, there might be
different device partitions and memory selections for different
functions in the application.


I am not scared about the amount of data to parse really; even on a big
node it is going to be a few dozen links and bridges, and a few dozen
devices. So we are talking about a hundred directories to parse and read.
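
To give a feel for what "parse and read" costs per directory, here is a
minimal POSIX sketch that counts the entries of one directory with
opendir()/readdir(). The function name count_dir_entries is made up for
illustration, and the path is left as a parameter on purpose, since the
final sysfs layout of HMS is not settled here.

```c
#include <dirent.h>
#include <string.h>

/* Count the entries of one directory, skipping "." and "..".
 * Returns -1 if the directory cannot be opened. */
static long count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *ent;
    long n = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 ||
            strcmp(ent->d_name, "..") == 0)
            continue;
        ++n;
    }
    closedir(dir);
    return n;
}
```

An init function would call something like this once per link, bridge,
initiator and target directory, so the total cost scales with the number
of directories and the number of entries in each of them.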


Maybe an example will help. Let's say we have an application with the
following pipeline:

    inA -> functionA -> outA -> functionB -> outB -> functionC -> result

    - inA 8 gigabytes
    - outA 8 gigabytes
    - outB one dword
    - result something small
    - functionA is doing heavy computation on inA (several thousands of
      instructions for each dword in inA).
    - functionB is doing heavy computation for each dword in outA (again
      thousands of instructions for each dword) and it is looking for a
      specific result that it knows will be unique among all the dword
      computations, i.e. it outputs only one dword in outB
    - functionC is something well suited for CPU that takes outB and
      turns it into the final result

Now let's look at a few different systems and their topologies:
    [T1] 1 GPU with 16GB of memory and a handful of CPU cores
    [T2] 1 GPU with 8GB of memory and a handful of CPU cores
    [T3] 2 GPUs with 8GB of memory each and a handful of CPU cores
    [T4] 2 GPUs with 8GB of memory each and a handful of CPU cores;
         the 2 GPUs have a very fast link between each other
         (400GBytes/s)

Now let's see how the program will partition itself for each topology:
    [T1] Application partitions its computation in 3 phases:
            P1: - migrate inA to GPU memory
            P2: - execute functionA on inA producing outA
            P3: - execute functionB on outA producing outB
                - run functionC and see if functionB has found the
                  thing and written it to outB; if so then kill all
                  GPU threads and return the result, we are done

    [T2] Application partitions its computation in 5 phases:
            P1: - migrate first 4GB of inA to GPU memory
            P2: - execute functionA for the 4GB and write the 4GB
                  outA result to the GPU memory
            P3: - execute functionB for the first 4GB of outA
                - while functionB is running, DMA in the background
                  the second 4GB of inA to the GPU memory
                - once one of the millions of threads running functionB
                  finds the result it is looking for, it writes it to
                  outB which is in main memory
                - run functionC and see if functionB has found the
                  thing and written it to outB; if so then kill all
                  GPU threads and DMA and return the result, we are
                  done
            P4: - run functionA on the second half of inA, i.e. we did
                  not find the result in the first half so we now
                  process the second half that has been migrated to
                  the GPU memory in the background (see above)
            P5: - run functionB on the second 4GB of outA like
                  above
                - run functionC on CPU and kill everything as soon
                  as one of the threads running functionB has found
                  the result
                - return the result

    [T3] Application partitions its computation in 3 phases:
            P1: - migrate first 4GB of inA to GPU1 memory
                - migrate last 4GB of inA to GPU2 memory
            P2: - execute functionA on GPU1 on the first 4GB -> outA
                - execute functionA on GPU2 on the last 4GB -> outA
            P3: - execute functionB on GPU1 on the first 4GB of outA
                - execute functionB on GPU2 on the last 4GB of outA
                - run functionC and see if functionB running on GPU1
                  and GPU2 has found the thing and written it to outB;
                  if so then kill all GPU threads and return the result,
                  we are done

    [T4] Application partitions its computation in 2 phases:
            P1: - migrate 8GB of inA to GPU1 memory
                - allocate 8GB for outA in GPU2 memory
            P2: - execute functionA on GPU1 on the inA 8GB and write
                  out result to GPU2 through the fast link
                - execute functionB on GPU2, with each functionB
                  thread looping over outA (busy-waiting while the
                  part of outA it needs is not yet valid)
                - run functionC and see if functionB running on GPU2
                  has found the thing and written it to outB; if so
                  then kill all GPU threads and return the result,
                  we are done


So these are wildly different partitions that all depend on the topology,
on how the accelerators are inter-connected and on how much memory they
have. This is a relatively simple example; there are people out there
spending months on designing adaptive partitioning algorithms for their
applications.
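
As a rough illustration of why this logic lives in the application,
here is a minimal sketch of the decision for the example above. It only
encodes the four plans described (3, 5, 3 and 2 phases); struct topo,
enum plan and choose_plan are hypothetical names, and a real
application would look at far more than GPU count, memory size and one
link flag.

```c
#include <stdbool.h>

/* Hypothetical summary of what the application read from the topology. */
struct topo {
    int ngpu;            /* number of usable GPUs */
    int gpu_mem_gb;      /* memory per GPU, in GB */
    bool fast_gpu_link;  /* GPUs joined by a very fast link */
};

enum plan {
    PLAN_CPU_ONLY,           /* no GPU at all */
    PLAN_WHOLE_ON_ONE_GPU,   /* inA and outA both fit: 3 phases */
    PLAN_HALVES_ON_ONE_GPU,  /* process inA in two halves: 5 phases */
    PLAN_SPLIT_ACROSS_GPUS,  /* each GPU takes half: 3 phases */
    PLAN_PIPELINE_OVER_LINK  /* GPU1 feeds GPU2 over the link: 2 phases */
};

static enum plan choose_plan(const struct topo *t, int in_gb, int out_gb)
{
    if (t->ngpu <= 0)
        return PLAN_CPU_ONLY;
    /* inA on one GPU, outA on the other, streamed through the link. */
    if (t->ngpu >= 2 && t->fast_gpu_link &&
        t->gpu_mem_gb >= in_gb && t->gpu_mem_gb >= out_gb)
        return PLAN_PIPELINE_OVER_LINK;
    /* Two GPUs that together hold inA and outA, half on each. */
    if (t->ngpu >= 2 && 2 * t->gpu_mem_gb >= in_gb + out_gb)
        return PLAN_SPLIT_ACROSS_GPUS;
    /* A single GPU large enough for inA and outA at once. */
    if (t->gpu_mem_gb >= in_gb + out_gb)
        return PLAN_WHOLE_ON_ONE_GPU;
    /* Simplification: fall back to halves without sizing the chunks. */
    return PLAN_HALVES_ON_ONE_GPU;
}

/* Number of phases each plan has in the description above. */
static int plan_phases(enum plan p)
{
    switch (p) {
    case PLAN_WHOLE_ON_ONE_GPU:   return 3;
    case PLAN_HALVES_ON_ONE_GPU:  return 5;
    case PLAN_SPLIT_ACROSS_GPUS:  return 3;
    case PLAN_PIPELINE_OVER_LINK: return 2;
    default:                      return 1; /* assumption: CPU runs it all */
    }
}
```

With inA and outA at 8GB each, one 16GB GPU gives the 3 phase plan, one
8GB GPU the 5 phase plan, two 8GB GPUs the split 3 phase plan, and two
8GB GPUs with the fast link the 2 phase pipeline.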

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 20:27                   ` Jerome Glisse
@ 2018-12-06 21:46                     ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-06 21:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, Logan Gunthorpe,
	John Hubbard, Ralph Campbell, Michal Hocko, Jonathan Cameron,
	Mark Hairgrove, Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs,
	Andrea Arcangeli, Rik van Riel, Ben Woodard, linux-acpi

On Thu, Dec 06, 2018 at 03:27:06PM -0500, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> > On 12/6/18 11:20 AM, Jerome Glisse wrote:
> > >>> For case 1 you can pre-parse stuff but this can be done by helper library
> > >> How would that work?  Would each user/container/whatever do this once?
> > >> Where would they keep the pre-parsed stuff?  How do they manage their
> > >> cache if the topology changes?
> > > Short answer i don't expect a cache, i expect that each program will have
> > > a init function that query the topology and update the application codes
> > > accordingly.
> > 
> > My concern with having folks do per-program parsing, *and* having a huge
> > amount of data to parse makes it unusable.  The largest systems will
> > literally have hundreds of thousands of objects in /sysfs, even in a
> > single directory.  That makes readdir() basically impossible, and makes
> > even open() (if you already know the path you want somehow) hard to do fast.
> > 
> > I just don't think sysfs (or any filesystem, really) can scale to
> > express large, complicated topologies in a way that any normal program
> > can practically parse it.
> > 
> > My suspicion is that we're going to need to have the kernel parse and
> > cache these things.  We *might* have the data available in sysfs, but we
> > can't reasonably expect anyone to go parsing it.
> 
> What i am failing to explain is that kernel can not parse because kernel
> does not know what the application cares about and every single applications
> will make different choices and thus select differents devices and memory.
> 
> It is not even gonna a thing like class A of application will do X and
> class B will do Y. Every single application in class A might do something
> different because somes care about the little details.
> 
> So any kind of pre-parsing in the kernel is defeated by the fact that the
> kernel does not know what the application is looking for.
> 
> I do not see anyway to express the application logic in something that
> can be some kind of automaton or regular expression. The application can
> litteraly intro-inspect itself and the topology to partition its workload.
> The topology and device selection is expected to be thousands of line of
> code in the most advance application.
> 
> Even worse inside one same application, they might be different device
> partition and memory selection for different function in the application.
> 
> 
> I am not scare about the anount of data to parse really, even on big node
> it is gonna be few dozens of links and bridges, and few dozens of devices.
> So we are talking hundred directories to parse and read.
> 
> 
> Maybe an example will help. Let say we have an application with the
> following pipeline:
> 
>     inA -> functionA -> outA -> functionB -> outB -> functionC -> result
> 
>     - inA 8 gigabytes
>     - outA 8 gigabytes
>     - outB one dword
>     - result something small
>     - functionA is doing heavy computation on inA (several thousands of
>       instructions for each dword in inA).
>     - functionB is doing heavy computation for each dword in outA (again
>       thousand of instruction for each dword) and it is looking for a
>       specific result that it knows will be unique among all the dword
>       computation ie it is output only one dword in outB
>     - functionC is something well suited for CPU that take outB and turns
>       it into the final result
> 
> Now let see few different system and their topologies:
>     [T2] 1 GPU with 16GB of memory and a handfull of CPU cores
>     [T1] 1 GPU with 8GB of memory and a handfull of CPU cores
>     [T3] 2 GPU with 8GB of memory and a handfull of CPU core
>     [T4] 2 GPU with 8GB of memory and a handfull of CPU core
>          the 2 GPU have a very fast link between each others
>          (400GBytes/s)
> 
> Now let see how the program will partition itself for each topology:
>     [T1] Application partition its computation in 3 phases:
>             P1: - migrate inA to GPU memory
>             P2: - execute functionA on inA producing outA
>             P3  - execute functionB on outA producing outB
>                 - run functionC and see if functionB have found the
>                   thing and written it to outB if so then kill all
>                   GPU threads and return the result we are done
> 
>     [T2] Application partition its computation in 5 phases:
>             P1: - migrate first 4GB of inA to GPU memory
>             P2: - execute functionA for the 4GB and write the 4GB
>                   outA result to the GPU memory
>             P3: - execute functionB for the first 4GB of outA
>                 - while functionB is running DMA in the background
>                   the the second 4GB of inA to the GPU memory
>                 - once one of the millions of thread running functionB
>                   find the result it is looking for it writes it to
>                   outB which is in main memory
>                 - run functionC and see if functionB have found the
>                   thing and written it to outB if so then kill all
>                   GPU thread and DMA and return the result we are
>                   done
>             P4: - run functionA on the second half of inA ie we did
>                   not find the result in the first half so we no
>                   process the second half that have been migrated to
>                   the GPU memory in the background (see above)
>             P5: - run functionB on the second 4GB of outA like
>                   above
>                 - run functionC on CPU and kill everything as soon
>                   as one of the thread running functionB has found
>                   the result
>                 - return the result
> 
>     [T3] Application partition its computation in 3 phases:
>             P1: - migrate first 4GB of inA to GPU1 memory
>                 - migrate last 4GB of inA to GPU2 memory
>             P2: - execute functionA on GPU1 on the first 4GB -> outA
>                 - execute functionA on GPU2 on the last 4GB -> outA
>             P3: - execute functionB on GPU1 on the first 4GB of outA
>                 - execute functionB on GPU2 on the last 4GB of outA
>                 - run functionC and see if functionB running on GPU1
>                   and GPU2 have found the thing and written it to outB
>                   if so then kill all GPU threads and return the result
>                   we are done
> 
>     [T4] Application partition its computation in 2 phases:
>             P1: - migrate 8GB of inA to GPU1 memory
>                 - allocate 8GB for outA in GPU2 memory
>             P2: - execute functionA on GPU1 on the inA 8GB and write
>                   out result to GPU2 through the fast link
>                 - execute functionB on GPU2 and look over each
>                   thread on functionB on outA (busy running even
>                   if outA is not valid for each thread running
>                   functionB)
>                 - run functionC and see if functionB running on GPU2
>                   have found the thing and written it to outB if so
>                   then kill all GPU threads and return the result
>                   we are done
> 
> 
> So this is widely different partition that all depends on the topology
> and how accelerator are inter-connected and how much memory they have.
> This is a relatively simple example, they are people out there spending
> month on designing adaptive partitioning algorithm for their application.
> 

And since I am writing examples, another funny one: let's say you have
a system with 2 nodes, and on each node 2 GPUs and one network adapter.
On each node the local network adapter can only access the memory of one
of the 2 GPUs. All the GPUs are connected to each other through a fully
symmetrical mesh inter-connect.

Now let's say your program has 4 functions back to back, each function
consuming the output of the previous one. Finally, you get your input
from the network and stream out the final function's output to the
network.
So what you can do is:
    Node0 Net0 -> write to Node0 GPU0 memory
    Node0 GPU0 -> run first function and write result to Node0 GPU1
    Node0 GPU1 -> run second function and write result to Node1 GPU3
    Node1 GPU3 -> run third function and write result to Node1 GPU2
    Node1 Net1 -> read result from Node1 GPU2 and stream it out


Yes, this kind of thing can be decided at application startup during
initialization. The idea is that you model your program as a computation
graph: each node is a function (or group of functions) and each arrow is
a data flow (input and output).

So you have a graph; now what you do is try to find a sub-graph of
your system topology that matches this graph, and for the system
topology you also have to check that each of your program nodes can run
on the specific accelerator node of your system (does the accelerator
have features X and Y?).

If you are not lucky and there is no 1-to-1 match, then you can
re-arrange/simplify your application's computation graph. For
instance, group several of your application's function nodes into just
one node to shrink your computation graph. Rinse and repeat.
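
A minimal sketch of that matching for the simplest case, a linear
pipeline placed on at most 4 GPUs by plain backtracking. Everything here
is hypothetical (struct topology, match_pipeline, the GPU numbering),
and the per-accelerator feature check is left out; a real matcher would
handle general graphs and much richer constraints.

```c
#include <stdbool.h>

#define NGPU 4

/* Hypothetical topology description built by the application. */
struct topology {
    bool gpu_link[NGPU][NGPU]; /* [a][b]: GPU a can write GPU b memory */
    bool net_in[NGPU];         /* input adapter can write this GPU */
    bool net_out[NGPU];        /* output adapter can read this GPU */
};

/* Try to place stage 'stage' and all later stages, one GPU per stage. */
static bool place_stage(const struct topology *t, int stage, int nstages,
                        bool used[NGPU], int placement[])
{
    if (stage == nstages)
        return t->net_out[placement[nstages - 1]];

    for (int g = 0; g < NGPU; ++g) {
        if (used[g])
            continue;
        /* First stage must be fed by the network, the others by the
         * GPU running the previous stage. */
        if (stage == 0 ? !t->net_in[g]
                       : !t->gpu_link[placement[stage - 1]][g])
            continue;
        used[g] = true;
        placement[stage] = g;
        if (place_stage(t, stage + 1, nstages, used, placement))
            return true;
        used[g] = false; /* backtrack and try the next GPU */
    }
    return false;
}

/* Returns true and fills placement[0..nstages-1] if a match exists. */
static bool match_pipeline(const struct topology *t, int nstages,
                           int placement[])
{
    bool used[NGPU] = { false };

    if (nstages < 1 || nstages > NGPU)
        return false;
    return place_stage(t, 0, nstages, used, placement);
}
```

With a full mesh between the 4 GPUs, input reaching only GPU0 and output
leaving only from GPU2, a 4 stage pipeline lands on GPU0, GPU1, GPU3,
GPU2 in that order. When match_pipeline() fails, the application would
merge stages (fewer, bigger nodes) and try again.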


Moreover, each application will have multiple separate computation
graphs, and the application will want to spread its workload as evenly
as possible and select the most powerful accelerator for the most
intensive computation ...


I do not see how to have a graph matching API with complex testing
where you need to query back a userspace library. Like querying whether
the userspace OpenCL driver for GPU A supports feature X, which
might depend not only on the device generation or kernel device
driver version but also on the version of the userspace driver.

I feel it would be a lot easier to provide a graph to userspace
and have userspace do this complex matching and adaptation of its
computation graph and load balance its computation at the same
time.


Of course not all applications will be that complex, and like I said
I believe the average app (especially desktop apps designed to run on
laptops) will just use a dumbed-down thing, i.e. they will only use
one or two devices at the most.


Yes, all this is hard, but easy problems are not interesting to
solve.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 20:11                   ` Logan Gunthorpe
@ 2018-12-06 22:04                     ` Dave Hansen
  2018-12-06 22:39                       ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-06 22:04 UTC (permalink / raw)
  To: Logan Gunthorpe, Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
>> My concern with having folks do per-program parsing, *and* having a huge
>> amount of data to parse makes it unusable.  The largest systems will
>> literally have hundreds of thousands of objects in /sysfs, even in a
>> single directory.  That makes readdir() basically impossible, and makes
>> even open() (if you already know the path you want somehow) hard to do fast.
> Is this actually realistic? I find it hard to imagine an actual hardware
> bus that can have even thousands of devices under a single node, let
> alone hundreds of thousands.

Jerome's proposal, as I understand it, would have generic "links".
They're not an instance of bus, but characterize a class of "link".  For
instance, a "link" might characterize the characteristics of the QPI bus
between two CPU sockets. The link directory would enumerate the list of
all *instances* of that link

So, a "link" directory for QPI would say Socket0<->Socket1,
Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc...  It
would have to enumerate the connections between every entity that shared
those link properties.

While there might not be millions of buses, there could be millions of
*paths* across all those buses, and that's what the HMAT describes, at
least: the net result of all those paths.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 22:04                     ` Dave Hansen
@ 2018-12-06 22:39                       ` Jerome Glisse
  2018-12-06 23:09                         ` Dave Hansen
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-06 22:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Logan Gunthorpe, linux-mm, Andrew Morton, linux-kernel,
	Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler, Keith Busch,
	Dan Williams, Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On Thu, Dec 06, 2018 at 02:04:46PM -0800, Dave Hansen wrote:
> On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
> >> My concern with having folks do per-program parsing, *and* having a huge
> >> amount of data to parse makes it unusable.  The largest systems will
> >> literally have hundreds of thousands of objects in /sysfs, even in a
> >> single directory.  That makes readdir() basically impossible, and makes
> >> even open() (if you already know the path you want somehow) hard to do fast.
> > Is this actually realistic? I find it hard to imagine an actual hardware
> > bus that can have even thousands of devices under a single node, let
> > alone hundreds of thousands.
> 
> Jerome's proposal, as I understand it, would have generic "links".
> They're not an instance of bus, but characterize a class of "link".  For
> instance, a "link" might characterize the characteristics of the QPI bus
> between two CPU sockets. The link directory would enumerate the list of
> all *instances* of that link
> 
> So, a "link" directory for QPI would say Socket0<->Socket1,
> Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc...  It
> would have to enumerate the connections between every entity that shared
> those link properties.
> 
> While there might not be millions of buses, there could be millions of
> *paths* across all those buses, and that's what the HMAT describes, at
> least: the net result of all those paths.

Sorry if I again mis-explained things. Links are arrows between nodes
(CPU or device or memory). An arrow/link has properties associated
with it: bandwidth, latency, cache-coherency, ...

So if your system has 4 sockets, each socket is connected to every
other (mesh), and all inter-connects in the mesh have the same
properties, then you only have 1 link directory with the 4 sockets in it.

Now if the 4 sockets are connected in a ring fashion, i.e.:
        Socket0 - Socket1
           |         |
        Socket3 - Socket2

Then you have 4 links:
link0: socket0 socket1
link1: socket1 socket2
link2: socket2 socket3
link3: socket3 socket0
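
One way to picture the difference between the mesh and the ring is a
link as "one set of properties plus the set of nodes sharing it". The
sketch below is only an in-memory illustration of that idea, not the
sysfs layout of the patchset; struct link_group and directly_connected
are made-up names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory form of one link directory. */
struct link_group {
    unsigned bandwidth;  /* arbitrary units, same for all members */
    unsigned latency;    /* arbitrary units, same for all members */
    unsigned members;    /* bit i set: socket i is on this link */
};

/* Two sockets are directly connected if some link contains both. */
static bool directly_connected(const struct link_group *links,
                               size_t nlinks, int a, int b)
{
    unsigned want = (1u << a) | (1u << b);

    for (size_t i = 0; i < nlinks; ++i)
        if ((links[i].members & want) == want)
            return true;
    return false;
}
```

A 4 socket mesh with uniform properties is a single link_group with
members 0xf, while the ring needs 4 groups of two sockets each, and in
the ring socket0 and socket2 share no link.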

I do not see how there can be an explosion of link directories; the
worst case is as many link directories as there are buses for a
CPU/device/target. So in the worst case, if you have N devices and each
device is connected to 2 buses (PCIE and QPI to go to another socket,
for instance) then you have 2*N link directories (again, this is the
worst case).

There is a lot of commonality that will remain, so I expect that quite
a few link directories will have many symlinks, i.e. you won't get close
to the worst case.


In the end really it is easier to think from the physical topology
and there a link correspond to an inter-connect between two device
or CPU. In all the systems i have seen even in the craziest roadmap
i have only seen thing like 128/256 inter-connect (4 socket 32/64
devices per socket) and many of which can be grouped under a common
link directory. Here worse case is 4 connection per device/CPU/
target so worse case of 128/256 * 4  = 512/1024 link directory
and that's a lot. Given regularity i have seen described on slides
i expect that it would need something like 30 link directory and
20 bridges directory.

On today's systems, with 8 GPUs per socket, GPUlink between each GPU, and
PCIE, all this with 4 sockets, it comes down to 20 link directories.

In any case each device/CPU/target has a limit on the number of
buses/interconnects it is connected to. I doubt anyone is
designing a device that will have much more than 4 external bus
connections.

So it is not a link per pair. It is a link per group of devices/CPUs/
targets. Is it any clearer?

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 22:39                       ` Jerome Glisse
@ 2018-12-06 23:09                         ` Dave Hansen
  2018-12-06 23:28                           ` Logan Gunthorpe
  2018-12-07  0:15                           ` Jerome Glisse
  0 siblings, 2 replies; 94+ messages in thread
From: Dave Hansen @ 2018-12-06 23:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Logan Gunthorpe, linux-mm, Andrew Morton, linux-kernel,
	Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler, Keith Busch,
	Dan Williams, Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On 12/6/18 2:39 PM, Jerome Glisse wrote:
> No if the 4 sockets are connect in a ring fashion ie:
>         Socket0 - Socket1
>            |         |
>         Socket3 - Socket2
> 
> Then you have 4 links:
> link0: socket0 socket1
> link1: socket1 socket2
> link3: socket2 socket3
> link4: socket3 socket0
> 
> I do not see how their can be an explosion of link directory, worse
> case is as many link directories as they are bus for a CPU/device/
> target.

This looks great.  But, we don't _have_ this kind of information for any
system that I know about or any system available in the near future.

We basically have two different world views:
1. The system is described point-to-point.  A connects to B @
   100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
   50GB/s.
   * Less information to convey
   * Potentially less precise if the properties are not perfectly
     additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
   * Costs must be calculated instead of being explicitly specified
2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
   B->C @ 50GB/s, A->C @ 50GB/s.
   * A *lot* more information to convey O(N^2)?
   * Potentially more precise.
   * Costs are explicitly specified, not calculated
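
To put rough numbers on "less information" versus "O(N^2)", here is a
minimal sketch. It assumes the simplest case of N nodes connected in a
chain, and the function names are hypothetical. It counts how many entries
each world view has to describe.

```python
def point_to_point_entries(n_nodes: int) -> int:
    """World view #1 on a chain of nodes: one entry per physical link."""
    return max(n_nodes - 1, 0)


def endpoint_to_endpoint_entries(n_nodes: int) -> int:
    """World view #2: one entry per unordered pair of endpoints."""
    return n_nodes * (n_nodes - 1) // 2


for n in (3, 4, 64, 256):
    print(n, point_to_point_entries(n), endpoint_to_endpoint_entries(n))
```

For the three-node A/B/C example both views need a similar number of
entries (2 versus 3); the gap only opens up as the node count grows.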

These patches are really tied to world view #1.  But, the HMAT is really
tied to world view #1.

I know you're not a fan of the HMAT.  But it is the firmware reality
that we are stuck with, until something better shows up.  I just don't
see a way to convert it into what you have described here.

I'm starting to think that, no matter if the HMAT or some other approach
gets adopted, we shouldn't be exposing this level of gunk to userspace
at *all* since it requires adopting one of the world views.


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 23:09                         ` Dave Hansen
@ 2018-12-06 23:28                           ` Logan Gunthorpe
  2018-12-06 23:34                             ` Dave Hansen
  2018-12-06 23:38                             ` Dave Hansen
  2018-12-07  0:15                           ` Jerome Glisse
  1 sibling, 2 replies; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-06 23:28 UTC (permalink / raw)
  To: Dave Hansen, Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi



On 2018-12-06 4:09 p.m., Dave Hansen wrote:
> This looks great.  But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.
> 
> We basically have two different world views:
> 1. The system is described point-to-point.  A connects to B @
>    100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
>    50GB/s.
>    * Less information to convey
>    * Potentially less precise if the properties are not perfectly
>      additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
>    * Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
>    B->C @ 50GB/s, A->C @ 50GB/s.
>    * A *lot* more information to convey O(N^2)?
>    * Potentially more precise.
>    * Costs are explicitly specified, not calculated
> 
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.

I didn't think this was meant to describe actual real world performance
between all of the links. If that's the case all of this seems like a
pipe dream to me.

Attributes like cache coherency, atomics, etc should fit well in world
view #1... and, at best, some kind of flag saying whether or not to use
a particular link if you care about transfer speed. -- But we don't need
special "link" directories to describe the properties of existing buses.

You're not *really* going to know bandwidth or latency for any of this
unless you actually measure it on the system in question.

Logan


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 23:28                           ` Logan Gunthorpe
@ 2018-12-06 23:34                             ` Dave Hansen
  2018-12-06 23:38                             ` Dave Hansen
  1 sibling, 0 replies; 94+ messages in thread
From: Dave Hansen @ 2018-12-06 23:34 UTC (permalink / raw)
  To: Logan Gunthorpe, Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.

Whoops, should have been "the HMAT is really tied to world view #2"


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 23:28                           ` Logan Gunthorpe
  2018-12-06 23:34                             ` Dave Hansen
@ 2018-12-06 23:38                             ` Dave Hansen
  2018-12-06 23:48                               ` Logan Gunthorpe
  1 sibling, 1 reply; 94+ messages in thread
From: Dave Hansen @ 2018-12-06 23:38 UTC (permalink / raw)
  To: Logan Gunthorpe, Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> I didn't think this was meant to describe actual real world performance
> between all of the links. If that's the case all of this seems like a
> pipe dream to me.

The HMAT discussions (that I was a part of at least) settled on just
trying to describe what we called "sticker speed".  Nobody had an
expectation that you *really* had to measure everything.

The best we can do for any of these approaches is approximate things.

> You're not *really* going to know bandwidth or latency for any of this
> unless you actually measure it on the system in question.

Yeah, agreed.


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 23:38                             ` Dave Hansen
@ 2018-12-06 23:48                               ` Logan Gunthorpe
  2018-12-07  0:20                                 ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Logan Gunthorpe @ 2018-12-06 23:48 UTC (permalink / raw)
  To: Dave Hansen, Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Rafael J . Wysocki,
	Matthew Wilcox, Ross Zwisler, Keith Busch, Dan Williams,
	Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi



On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
>> I didn't think this was meant to describe actual real world performance
>> between all of the links. If that's the case all of this seems like a
>> pipe dream to me.
> 
> The HMAT discussions (that I was a part of at least) settled on just
> trying to describe what we called "sticker speed".  Nobody had an
> expectation that you *really* had to measure everything.
> 
> The best we can do for any of these approaches is approximate things.

Yes, though there are a lot of caveats in this assumption alone.
Specifically with PCI: the bus may run at however many GB/s but P2P
through a CPU's root complexes can slow down significantly (like down to
MB/s).

I've seen similar things across QPI: I can sometimes do P2P from
PCI->QPI->PCI but the performance doesn't even come close to the sticker
speed of any of those buses.

I'm not sure how anyone is going to deal with those issues, but it does
firmly place us in world view #2 instead of #1. But, yes, I agree
exposing information like in #2 full out to userspace, especially
through sysfs, seems like a nightmare and I don't see anything in HMS to
help with that. Providing an API to ask for memory (or another resource)
that's accessible by a set of initiators and with a set of requirements
for capabilities seems more manageable.

Logan


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 23:09                         ` Dave Hansen
  2018-12-06 23:28                           ` Logan Gunthorpe
@ 2018-12-07  0:15                           ` Jerome Glisse
  1 sibling, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-07  0:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Logan Gunthorpe, linux-mm, Andrew Morton, linux-kernel,
	Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler, Keith Busch,
	Dan Williams, Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On Thu, Dec 06, 2018 at 03:09:21PM -0800, Dave Hansen wrote:
> On 12/6/18 2:39 PM, Jerome Glisse wrote:
> > No if the 4 sockets are connect in a ring fashion ie:
> >         Socket0 - Socket1
> >            |         |
> >         Socket3 - Socket2
> > 
> > Then you have 4 links:
> > link0: socket0 socket1
> > link1: socket1 socket2
> > link3: socket2 socket3
> > link4: socket3 socket0
> > 
> > I do not see how their can be an explosion of link directory, worse
> > case is as many link directories as they are bus for a CPU/device/
> > target.
> 
> This looks great.  But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.

We do not have it in any standard way; it is out there in either a
device driver database, an application database, or a special platform
OEM blob buried somewhere in the firmware ...

I want to solve the kernel side of the problem, i.e. how to expose
this to userspace. How the kernel gets that information is an
orthogonal problem. For now my intention is to have device drivers
register and create the links and bridges that are not enumerated
by standard firmware.

> 
> We basically have two different world views:
> 1. The system is described point-to-point.  A connects to B @
>    100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
>    50GB/s.
>    * Less information to convey
>    * Potentially less precise if the properties are not perfectly
>      additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
>    * Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
>    B->C @ 50GB/s, A->C @ 50GB/s.
>    * A *lot* more information to convey O(N^2)?
>    * Potentially more precise.
>    * Costs are explicitly specified, not calculated
> 
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.
                      ^#2

Note that there is also the bridge object in my proposal. So in my
proposal, in #1 you have:
link0: A <-> B with 100GB/s and 10ns latency
link1: B <-> C with 50GB/s and 20ns latency

Now if A can reach C through B then you have bridges (bridges are
unidirectional, unlike links, which are bidirectional, though that finer
point can be discussed; this is what allows any kind of directed graph to
be represented):
bridge2: link0 -> link1
bridge3: link1 -> link0

You can also associate properties with a bridge (but it is not mandatory).
So you can say that bridge2 and bridge3 have a latency of 50ns, and if
the addition of latencies is enough then you do not specify it in the
bridge. The rule is that a path's latency is the sum of its individual
link latencies. For bandwidth it is the minimum bandwidth, i.e. whatever
is the bottleneck for the path.
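
A minimal sketch of that rule, using the link0/link1 numbers from above.
The class and function names are hypothetical, and one point is an
assumption rather than something stated here: a latency set on a bridge
is treated as replacing the summed latency of the two links it joins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    name: str
    bandwidth_gbps: int
    latency_ns: int


@dataclass(frozen=True)
class Bridge:
    src: str                          # link the traffic arrives on
    dst: str                          # link the traffic leaves on
    latency_ns: Optional[int] = None  # optional explicit path latency


def path_cost(first: Link, second: Link, bridge: Bridge):
    """Return (latency_ns, bandwidth_gbps) for first -> bridge -> second."""
    if (bridge.src, bridge.dst) != (first.name, second.name):
        raise ValueError("bridge does not join these links in this direction")
    if bridge.latency_ns is not None:
        latency = bridge.latency_ns   # assumed: explicit value wins
    else:
        latency = first.latency_ns + second.latency_ns  # default: sum
    bandwidth = min(first.bandwidth_gbps, second.bandwidth_gbps)  # bottleneck
    return latency, bandwidth


link0 = Link("link0", bandwidth_gbps=100, latency_ns=10)  # A <-> B
link1 = Link("link1", bandwidth_gbps=50, latency_ns=20)   # B <-> C

# A -> C, no bridge properties: latency is 10 + 20, bandwidth is min(100, 50).
print(path_cost(link0, link1, Bridge("link0", "link1")))
# C -> A, with the bridge reporting 50ns for the crossing.
print(path_cost(link1, link0, Bridge("link1", "link0", latency_ns=50)))
```

Because bridges are directional, asking for link1 -> link0 with only the
link0 -> link1 bridge raises an error in this sketch.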


> I know you're not a fan of the HMAT.  But it is the firmware reality
> that we are stuck with, until something better shows up.  I just don't
> see a way to convert it into what you have described here.

Like I said, I am not targeting HMAT systems; I am targeting systems that
rely today on databases spread between drivers and applications. I want to
move that knowledge into drivers first so that they can teach the core
kernel and register things in the core. Providing a standard firmware
way to provide this information is a different problem (there are some
loose standards on non-ACPI platforms AFAIK).

> I'm starting to think that, no matter if the HMAT or some other approach
> gets adopted, we shouldn't be exposing this level of gunk to userspace
> at *all* since it requires adopting one of the world views.

I do not see this as exclusive. Yes, there are HMAT systems "soon" to
arrive, but we already have the more extended view, which is just buried
under a pile of different pieces. I do not see any exclusion between the 2.
If HMAT is good enough for a whole class of systems, fine, but there is also
a whole class of systems and users that do not fit in that paradigm, hence
my proposal.

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-06 23:48                               ` Logan Gunthorpe
@ 2018-12-07  0:20                                 ` Jerome Glisse
  2018-12-07 15:06                                   ` Jonathan Cameron
  0 siblings, 1 reply; 94+ messages in thread
From: Jerome Glisse @ 2018-12-07  0:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Dave Hansen, linux-mm, Andrew Morton, linux-kernel,
	Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler, Keith Busch,
	Dan Williams, Haggai Eran, Balbir Singh, Aneesh Kumar K . V,
	Benjamin Herrenschmidt, Felix Kuehling, Philip Yang,
	Christian König, Paul Blinzer, John Hubbard, Ralph Campbell,
	Michal Hocko, Jonathan Cameron, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> >> I didn't think this was meant to describe actual real world performance
> >> between all of the links. If that's the case all of this seems like a
> >> pipe dream to me.
> > 
> > The HMAT discussions (that I was a part of at least) settled on just
> > trying to describe what we called "sticker speed".  Nobody had an
> > expectation that you *really* had to measure everything.
> > 
> > The best we can do for any of these approaches is approximate things.
> 
> Yes, though there's a lot of caveats in this assumption alone.
> Specifically with PCI: the bus may run at however many GB/s but P2P
> through a CPU's root complexes can slow down significantly (like down to
> MB/s).
> 
> I've seen similar things across QPI: I can sometimes do P2P from
> PCI->QPI->PCI but the performance doesn't even come close to the sticker
> speed of any of those buses.
> 
> I'm not sure how anyone is going to deal with those issues, but it does
> firmly place us in world view #2 instead of #1. But, yes, I agree
> exposing information like in #2 full out to userspace, especially
> through sysfs, seems like a nightmare and I don't see anything in HMS to
> help with that. Providing an API to ask for memory (or another resource)
> that's accessible by a set of initiators and with a set of requirements
> for capabilities seems more manageable.

Note that in #1 you have bridges that fully allow expressing those path
limitations. So what you just described can be fully reported to userspace.

I explained and gave examples of how programs adapt their computation to
the system topology. It does exist today, and people are even developing new
programming languages with some of those ideas baked in.

So there are people out there who already rely on such information; they
just do not get it from the kernel but from a mix of various device-specific
APIs, and they have to stitch everything together themselves and develop a
database of quirks and gotchas. My proposal is to provide a coherent kernel
API where we can sanitize that information and report it to userspace in a
single and coherent description.

Cheers,
Jérôme


* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-07  0:20                                 ` Jerome Glisse
@ 2018-12-07 15:06                                   ` Jonathan Cameron
  2018-12-07 19:37                                     ` Jerome Glisse
  0 siblings, 1 reply; 94+ messages in thread
From: Jonathan Cameron @ 2018-12-07 15:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Logan Gunthorpe, Dave Hansen, linux-mm, Andrew Morton,
	linux-kernel, Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler,
	Keith Busch, Dan Williams, Haggai Eran, Balbir Singh,
	Aneesh Kumar K . V, Benjamin Herrenschmidt, Felix Kuehling,
	Philip Yang, Christian König, Paul Blinzer, John Hubbard,
	Ralph Campbell, Michal Hocko, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On Thu, 6 Dec 2018 19:20:45 -0500
Jerome Glisse <jglisse@redhat.com> wrote:

> On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2018-12-06 4:38 p.m., Dave Hansen wrote:  
> > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:  
> > >> I didn't think this was meant to describe actual real world performance
> > >> between all of the links. If that's the case all of this seems like a
> > >> pipe dream to me.  
> > > 
> > > The HMAT discussions (that I was a part of at least) settled on just
> > > trying to describe what we called "sticker speed".  Nobody had an
> > > expectation that you *really* had to measure everything.
> > > 
> > > The best we can do for any of these approaches is approximate things.  
> > 
> > Yes, though there's a lot of caveats in this assumption alone.
> > Specifically with PCI: the bus may run at however many GB/s but P2P
> > through a CPU's root complexes can slow down significantly (like down to
> > MB/s).
> > 
> > I've seen similar things across QPI: I can sometimes do P2P from
> > PCI->QPI->PCI but the performance doesn't even come close to the sticker
> > speed of any of those buses.
> > 
> > I'm not sure how anyone is going to deal with those issues, but it does
> > firmly place us in world view #2 instead of #1. But, yes, I agree
> > exposing information like in #2 full out to userspace, especially
> > through sysfs, seems like a nightmare and I don't see anything in HMS to
> > help with that. Providing an API to ask for memory (or another resource)
> > that's accessible by a set of initiators and with a set of requirements
> > for capabilities seems more manageable.  
> 
> Note that in #1 you have bridge that fully allow to express those path
> limitation. So what you just describe can be fully reported to userspace.
> 
> I explained and given examples on how program adapt their computation to
> the system topology it does exist today and people are even developing new
> programming langage with some of those idea baked in.
> 
> So they are people out there that already rely on such information they
> just do not get it from the kernel but from a mix of various device specific
> API and they have to stich everything themself and develop a database of
> quirk and gotcha. My proposal is to provide a coherent kernel API where
> we can sanitize that informations and report it to userspace in a single
> and coherent description.
> 
> Cheers,
> Jérôme

I know it doesn't work everywhere, but I think it's worth enumerating what
cases we can get some of these numbers for and where the complexity lies.
I.e. What can the really determined user space library do today?

So one open question is how close can we get in a userspace only prototype.
At the end of the day userspace can often read HMAT directly if it wants to,
via /sys/firmware/acpi/tables/HMAT.  Obviously that gets us only the end to
end view (world 2).  I dislike the limitations of that as much as the next
person. It is slowly improving with the word "Auditable" being
kicked around - btw anyone interested in ACPI who works for a UEFI
member, there are efforts going on and more viewpoints would be great.
Expect some baby steps shortly.

For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of
this is discoverable to some degree. 
* Link speed,
* Number of Lanes,
* Full topology.
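
As a rough illustration of what userspace can already collect for the
PCIe items above, here is a minimal sketch. It assumes the Linux sysfs
attributes `current_link_speed` and `current_link_width` under
/sys/bus/pci/devices; the function name and the returned tuple layout are
hypothetical. The parent directory of each resolved device path is used as
a crude stand-in for the upstream bridge.

```python
from pathlib import Path


def pcie_links(sysfs_root="/sys/bus/pci/devices"):
    """Return {device: (link speed, link width, parent directory name)}
    for every device that exposes both link attributes."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        try:
            speed = (dev / "current_link_speed").read_text().strip()
            width = (dev / "current_link_width").read_text().strip()
        except OSError:
            continue  # no PCIe link attributes, or not readable
        # Entries are symlinks into /sys/devices; the resolved parent
        # directory names the upstream bridge (or the PCI domain root).
        result[dev.name] = (speed, width, dev.resolve().parent.name)
    return result


if __name__ == "__main__":
    for name, (speed, width, parent) in pcie_links().items():
        print(f"{name}: {speed} x{width} (behind {parent})")
```

This only yields the negotiated link parameters and the tree shape; it says
nothing about in-component limits or credit effects.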

What isn't there (I think)
* In component latency / bandwidth limitations (some activity going
  on to improve that long term)
* Effect of credit allocations etc. on effective bandwidth - interconnect
  performance is a whole load of black magic.

Presumably there is some information available from NVLink etc?

So whilst I really like the proposal in some ways, I wonder how much exploration
could be done of the usefulness of the data without touching the kernel at all.

The other aspect that is needed to actually make this 'dynamically' useful is
to be able to map whatever Performance Counters are available to the relevant
'links', bridges etc.   Sticker numbers are not all that useful unfortunately
except for small amounts of data on lightly loaded buses.

The kernel ultimately only needs to have a model of this topology if:
1) It's going to use it itself
2) It's going to do something automatic with it.
3) It needs to fix garbage info or supplement with things only the kernel knows.

Jonathan



* Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
  2018-12-07 15:06                                   ` Jonathan Cameron
@ 2018-12-07 19:37                                     ` Jerome Glisse
  0 siblings, 0 replies; 94+ messages in thread
From: Jerome Glisse @ 2018-12-07 19:37 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Logan Gunthorpe, Dave Hansen, linux-mm, Andrew Morton,
	linux-kernel, Rafael J . Wysocki, Matthew Wilcox, Ross Zwisler,
	Keith Busch, Dan Williams, Haggai Eran, Balbir Singh,
	Aneesh Kumar K . V, Benjamin Herrenschmidt, Felix Kuehling,
	Philip Yang, Christian König, Paul Blinzer, John Hubbard,
	Ralph Campbell, Michal Hocko, Mark Hairgrove, Vivek Kini,
	Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
	Rik van Riel, Ben Woodard, linux-acpi

On Fri, Dec 07, 2018 at 03:06:36PM +0000, Jonathan Cameron wrote:
> On Thu, 6 Dec 2018 19:20:45 -0500
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> > > 
> > > 
> > > On 2018-12-06 4:38 p.m., Dave Hansen wrote:  
> > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:  
> > > >> I didn't think this was meant to describe actual real world performance
> > > >> between all of the links. If that's the case all of this seems like a
> > > >> pipe dream to me.  
> > > > 
> > > > The HMAT discussions (that I was a part of at least) settled on just
> > > > trying to describe what we called "sticker speed".  Nobody had an
> > > > expectation that you *really* had to measure everything.
> > > > 
> > > > The best we can do for any of these approaches is approximate things.  
> > > 
> > > Yes, though there's a lot of caveats in this assumption alone.
> > > Specifically with PCI: the bus may run at however many GB/s but P2P
> > > through a CPU's root complexes can slow down significantly (like down to
> > > MB/s).
> > > 
> > > I've seen similar things across QPI: I can sometimes do P2P from
> > > PCI->QPI->PCI but the performance doesn't even come close to the sticker
> > > speed of any of those buses.
> > > 
> > > I'm not sure how anyone is going to deal with those issues, but it does
> > > firmly place us in world view #2 instead of #1. But, yes, I agree
> > > exposing information like in #2 full out to userspace, especially
> > > through sysfs, seems like a nightmare and I don't see anything in HMS to
> > > help with that. Providing an API to ask for memory (or another resource)
> > > that's accessible by a set of initiators and with a set of requirements
> > > for capabilities seems more manageable.  
> > 
> > Note that in #1 you have bridge that fully allow to express those path
> > limitation. So what you just describe can be fully reported to userspace.
> > 
> > I explained and given examples on how program adapt their computation to
> > the system topology it does exist today and people are even developing new
> > programming langage with some of those idea baked in.
> > 
> > So they are people out there that already rely on such information they
> > just do not get it from the kernel but from a mix of various device specific
> > API and they have to stich everything themself and develop a database of
> > quirk and gotcha. My proposal is to provide a coherent kernel API where
> > we can sanitize that informations and report it to userspace in a single
> > and coherent description.
> > 
> > Cheers,
> > Jérôme
> 
> I know it doesn't work everywhere, but I think it's worth enumerating what
> cases we can get some of these numbers for and where the complexity lies.
> I.e. What can the really determined user space library do today?

I gave an example in an email in this thread:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1821872.html

Is this the kind of example you are looking for? :)

> 
> So one open question is how close can we get in a userspace only prototype.
> At the end of the day userspace can often read HMAT directly if it wants to
> /sys/firmware/acpi/tables/HMAT.  Obviously that gets us only the end to
> end view (world 2).  I dislike the limitations of that as much as the next
> person. It is slowly improving with the word "Auditable" being
> kicked around - btw anyone interested in ACPI who works for a UEFI
> member, there are efforts going on and more viewpoints would be great.
> Expect some baby steps shortly.
> 
> For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of
> this is discoverable to some degree. 
> * Link speed,
> * Number of Lanes,
> * Full topology.

Yes, for discoverable buses like PCIE and all its derivatives (CCIX,
OpenCAPI, ...) userspace will have a way to find the topology. The issue
lies with the orthogonal topology of extra buses that are not necessarily
enumerated or that have no device driver at present, and especially how
they interact with each other (can you cross them? ...)

> 
> What isn't there (I think)
> * In component latency / bandwidth limitations (some activity going
>   on to improve that long term)
> * Effect of credit allocations etc on effectively bandwidth - interconnect
>   performance is a whole load of black magic.
> 
> Presumably there is some information available from NVLink etc?

From my point of view we want to give the best-case sticker value to
userspace, i.e. the bandwidth the engineers who designed the bus swore
their hardware delivers :)

I believe it is the best approximation we can deliver.

> 
> So whilst I really like the proposal in some ways, I wonder how much exploration
> could be done of the usefulness of the data without touching the kernel at all.
> 
> The other aspect that is needed to actually make this 'dynamically' useful is
> to be able to map whatever Performance Counters are available to the relevant
> 'links', bridges etc.   Ticket numbers are not all that useful unfortunately
> except for small amounts of data on lightly loaded buses.
> 
> The kernel ultimately only needs to have a model of this topology if:
> 1) It's going to use it itself

I don't think this should be a criterion; the kernel is not using the GPU
or network adapter to browse the web for itself (at least I hope the
Linux kernel is not self-aware ;)). So this kind of topology is not
of big use to the kernel. The kernel will only care about CPUs and memory
that abide by the memory model of the platform. It will also care
about more irregular CPU interconnects, i.e. CPUs on the same mega
substrate likely have a faster interconnect between them than to
the ones in a different physical socket. NUMA distance can model
that. Dunno if more than that would be useful to the kernel.

> 2) Its going to do something automatic with it.

The information is intended for userspace, for applications that use
that information. Today applications get that information from
non-standard sources, and I would like to provide this in a standard
common place in the kernel for a few reasons:
    - Common model with explicit definitions of what is what and
      what the rules are. No need for userspace to understand the
      specificities of various kernel sub-systems.
    - Define a unique identifier for _every_ type of memory in the
      system, even device memory, so that I can define syscalls to
      operate on that memory (can not do that in a device driver)
    - Integrate with core mm so that long term we can move more
      of individual device memory management into core components.

> 3) It needs to fix garbage info or supplement with things only the kernel knows.

Yes, the kernel is expected to fix the information it gets and sanitize
it so that userspace does not have to grow a database of quirks and
workarounds. Moreover the kernel can also benchmark interconnects and
adapt the reported bandwidth and latency if this is ever something
people would like to see.


I will post two v2 where I split the common helpers from the sysfs
and syscall parts. I need the common helpers today in the case of a
single device and have users for that code (nouveau and amdgpu for
starters). I want to continue the sysfs and syscall discussion, and
I need to reformulate things and give a better explanation of why
I think the way I am doing things has more value than any other.

Dunno if I will have time to finish reworking all this before the
end of this year.

Cheers,
Jérôme
