linux-mm.kvack.org archive mirror
* [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
@ 2017-08-17  0:05 Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 01/19] hmm: heterogeneous memory management documentation v3 Jérôme Glisse
                   ` (20 more replies)
  0 siblings, 21 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse

Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so that I test
the same kernel as the kbuild system, git branch:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v25

Changes since v24 are:
  - more comments and more documentation
  - fix a dumb mistake when registering a device
  - fix a race with multiple concurrent device faults
  - merge the HMM-CDM patchset with the HMM patchset to avoid unnecessary
    code churn

The overall logic is the same. Below is the long description of what HMM
is about and why. At the end of this email I briefly describe each patch
and suggest reviewers for each of them.

Please consider this patchset for 4.14


Heterogeneous Memory Management (HMM) (description and justification)

Today device drivers expose a dedicated memory allocation API through their
device file, often relying on a combination of IOCTL and mmap calls. The
device can only access and use memory allocated through this API. This
effectively splits the program address space into objects allocated for and
usable by the device, and other regular memory (malloc, mmap of a file,
shared memory, ...) only accessible by the CPU (or in a very limited way by
a device, by pinning memory).

Allowing different isolated components of a program to use a device thus
requires duplicating the input data structures using the device memory
allocator. This is reasonable for simple data structures (array, grid,
image, ...) but it gets extremely complex with advanced data structures
(list, tree, graph, ...) that rely on a web of memory pointers. This is
becoming a serious limitation on the kind of workloads that can be
offloaded to devices like GPUs.

New industry standards like C++, OpenCL or CUDA are pushing to remove this
barrier. This requires a shared address space between GPU device and CPU so
that the GPU can access any memory of a process (while still obeying memory
protections like read only). This kind of feature is also appearing in
various other operating systems.

HMM is a set of helpers to facilitate several aspects of address space
sharing and device memory management. Unlike existing sharing mechanisms
that rely on pinning pages used by a device, HMM relies on mmu_notifier to
propagate CPU page table updates to the device page table.

Duplicating the CPU page table is only one aspect necessary for efficiently
using a device like a GPU. GPU local memory has bandwidth in the TeraBytes/
second range but it is connected to main memory through a system bus like
PCIE that is limited to 32GigaBytes/second (PCIE 4.0 16x). Thus it is
necessary to allow migration of process memory from main system memory to
device memory. The issue is that on platforms that only have PCIE the device
memory is not accessible by the CPU with the same properties as main
memory (cache coherency, atomic operations, ...).

To allow migration from main memory to device memory HMM provides a set
of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
which is un-addressable by the CPU but still has struct page representing it.
This allows most of the core kernel logic that deals with process memory
to stay oblivious to the peculiarities of device memory.

When a page backing an address of a process is migrated to device memory,
the CPU page table entry is set to a new specific swap entry. A CPU access
to such an address triggers a migration back to system memory, just as if
the page had been swapped out to disk. HMM also blocks anyone from pinning a
ZONE_DEVICE page so that it can always be migrated back to system memory
if the CPU accesses it. Conversely HMM does not migrate to device memory any
page that is pinned in system memory.

To allow efficient migration between device memory and main memory a new
migrate_vma() helper is added with this patchset. It allows leveraging the
device DMA engine to perform the copy operation.
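
As a rough sketch of how a driver is expected to tie these helpers together
(driver-side names below are hypothetical; hmm_mirror_register() and
migrate_vma() are the helpers added by this patchset):

    // bind: start mirroring the process address space
    // (mm->mmap_sem must be held in write mode)
    ret = hmm_mirror_register(&drv->mirror, mm);

    // later, when the driver decides to migrate a range to device memory:
    ret = migrate_vma(&drv->migrate_ops, vma, npages, start, end,
                      src_pfns, dst_pfns, drv);
    // drv->migrate_ops.alloc_and_copy() allocates device pages and uses
    // the device DMA engine to copy the data.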

This feature will be used by upstream drivers like nouveau and mlx5, and
probably others in the future (amdgpu is the next suspect in line). We are
actively working on nouveau and mlx5 support. To test this patchset we also
worked with the NVidia closed source driver team; they have more resources
than us to test this kind of infrastructure and also a bigger and better
userspace eco-system with various real industry workloads they can use to
test and profile HMM.

The expected workload is a program that builds a data set on the CPU (from
disk, from network, from sensors, ...). The program uses a GPU API (OpenCL,
CUDA, ...) to give hints on memory placement for the input data and also for
the output buffer. The program calls the GPU API to schedule a GPU job; this
happens using a device driver specific ioctl. All this is hidden from the
programmer's point of view in the case of a C++ compiler that transparently
offloads some parts of a program to the GPU. The program can keep doing other
stuff on the CPU while the GPU is crunching numbers.

It is expected that the CPU will not access the same data set as the GPU
while the GPU is working on it, but this is not mandatory. In fact we expect
some small memory objects to be actively accessed by both GPU and CPU
concurrently, as synchronization channels and/or for monitoring purposes.
Such objects will stay in system memory and should not be bottlenecked by
system bus bandwidth (rare write and read accesses from both CPU and GPU).

As we are relying on the device driver API, HMM does not introduce any new
syscall nor does it modify any existing ones. It does not change any POSIX
semantics or behaviors. For instance the child after a fork of a process
that is using HMM will not be impacted in any way, nor is there any data
hazard between child COW or parent COW of memory that was migrated to the
device prior to the fork.

HMM assumes a number of hardware features. The device must allow its page
table to be updated at any time (ie device jobs must be preemptable). The
device page table must provide memory protection such as read only. The
device must track write accesses (dirty bit). The device must have a minimum
granularity that matches PAGE_SIZE (ie 4k).


Reviewers (just hints):
Patch 1  HMM documentation
Patch 2  introduces the core infrastructure and definitions of HMM, a pretty
         small patch and easy to review
Patch 3  introduces the mirror functionality of HMM, it relies on
         mmu_notifier and thus someone familiar with that part would be
         in a better position to review
Patch 4  is a helper to snapshot the CPU page table while synchronizing with
         concurrent page table updates. Understanding mmu_notifier makes
         review easier.
Patch 5  is mostly a wrapper around handle_mm_fault()
Patch 6  adds a new add_pages() helper to avoid modifying each arch memory
         hot plug function
Patch 7  adds a new memory type for ZONE_DEVICE and also adds all the logic
         in various core mm code to support this new type. Dan Williams and
         any core mm contributor are the best people to review each half of
         this patch
Patch 8  special cases HMM ZONE_DEVICE pages inside put_page(). Kirill and
         Dan Williams are the best people to review this
Patch 9  allows uncharging a page from a memory cgroup without using the lru
         list field of struct page (best reviewer: Johannes Weiner or
         Vladimir Davydov or Michal Hocko)
Patch 10 adds support to uncharge a ZONE_DEVICE page from a memory cgroup (best
         reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko)
Patch 11 adds a helper to hotplug un-addressable device memory as a new type
         of ZONE_DEVICE memory (the new type introduced earlier in this
         series). This is boilerplate code around memory hotplug and it
         also picks a free range of physical addresses for the device memory.
         Note that the physical addresses do not point to anything (at least
         as far as the kernel knows).
Patch 12 introduces a new hmm_device class as a helper for device drivers
         that want to expose multiple device memories under a common fake
         device driver. This is useful for multi-GPU configurations.
         Anyone familiar with the device driver infrastructure can review
         this. Boilerplate code really.
Patch 13 adds a new migrate mode. Anyone familiar with page migration is
         welcome to review.
Patch 14 introduces a new migration helper (migrate_vma()) that allows
         migrating a range of virtual addresses of a process using a device
         DMA engine to perform the copy. It is not limited to copies from and
         to the device but can also copy between any kind of source and
         destination memory. Again anyone familiar with the migration code
         should be able to verify the logic.
Patch 15 optimizes the new migrate_vma() by unmapping pages while we are
         collecting them. This can be reviewed by any mm folks.
Patch 16 adds un-addressable memory migration to the migration helper
         introduced earlier in this series; this can be reviewed by anyone
         familiar with the migration code
Patch 17 adds a feature that allows the device to allocate non-present pages
         on the GPU when migrating a range of addresses to device memory. This
         is a helper for device drivers to avoid having to first allocate
         system memory before migrating to device memory
Patch 18 adds a new kind of ZONE_DEVICE memory for cache coherent device
         memory (CDM)
Patch 19 adds a helper to hotplug CDM memory


Previous patchset postings:
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/
v10 https://lwn.net/Articles/654430/
v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
v12 http://www.kernelhub.org/?msg=972982&p=2
v13 https://lwn.net/Articles/706856/
v14 https://lkml.org/lkml/2016/12/8/344
v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
v16 http://www.spinics.net/lists/linux-mm/msg119814.html
v17 https://lkml.org/lkml/2017/1/27/847
v18 https://lkml.org/lkml/2017/3/16/596
v19 https://lkml.org/lkml/2017/4/5/831
v20 https://lwn.net/Articles/720715/
v21 https://lkml.org/lkml/2017/4/24/747
v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html
v24 https://lwn.net/Articles/726691/

Jérôme Glisse (18):
  hmm: heterogeneous memory management documentation v3
  mm/hmm: heterogeneous memory management (HMM for short) v5
  mm/hmm/mirror: mirror process address space on device with HMM helpers
    v3
  mm/hmm/mirror: helper to snapshot CPU page table v4
  mm/hmm/mirror: device page fault handler
  mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
  mm/ZONE_DEVICE: special case put_page() for device private pages v4
  mm/memcontrol: allow to uncharge page without using page->lru field
  mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4
  mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7
  mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
  mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  mm/migrate: new memory migration helper for use with device memory v5
  mm/migrate: migrate_vma() unmap page from vma while collecting pages
  mm/migrate: support un-addressable ZONE_DEVICE page in migration v3
  mm/migrate: allow migrate_vma() to alloc new page on empty entry v4
  mm/device-public-memory: device memory cache coherent with CPU v5
  mm/hmm: add new helper to hotplug CDM memory region v3

Michal Hocko (1):
  mm/memory_hotplug: introduce add_pages

 Documentation/vm/hmm.txt       |  384 ++++++++++++
 MAINTAINERS                    |    7 +
 arch/x86/Kconfig               |    4 +
 arch/x86/mm/init_64.c          |   22 +-
 fs/aio.c                       |    8 +
 fs/f2fs/data.c                 |    5 +-
 fs/hugetlbfs/inode.c           |    5 +-
 fs/proc/task_mmu.c             |    9 +-
 fs/ubifs/file.c                |    5 +-
 include/linux/hmm.h            |  518 ++++++++++++++++
 include/linux/ioport.h         |    2 +
 include/linux/memory_hotplug.h |   11 +
 include/linux/memremap.h       |  107 ++++
 include/linux/migrate.h        |  124 ++++
 include/linux/migrate_mode.h   |    5 +
 include/linux/mm.h             |   33 +-
 include/linux/mm_types.h       |    6 +
 include/linux/swap.h           |   24 +-
 include/linux/swapops.h        |   68 +++
 kernel/fork.c                  |    3 +
 kernel/memremap.c              |   60 +-
 mm/Kconfig                     |   47 +-
 mm/Makefile                    |    2 +-
 mm/balloon_compaction.c        |    8 +
 mm/gup.c                       |    7 +
 mm/hmm.c                       | 1273 ++++++++++++++++++++++++++++++++++++++++
 mm/madvise.c                   |    2 +-
 mm/memcontrol.c                |  222 ++++---
 mm/memory.c                    |  107 +++-
 mm/memory_hotplug.c            |   10 +-
 mm/migrate.c                   |  928 ++++++++++++++++++++++++++++-
 mm/mprotect.c                  |   14 +
 mm/page_vma_mapped.c           |   10 +
 mm/rmap.c                      |   25 +
 mm/swap.c                      |   11 +
 mm/zsmalloc.c                  |    8 +
 36 files changed, 3964 insertions(+), 120 deletions(-)
 create mode 100644 Documentation/vm/hmm.txt
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.13.4


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [HMM-v25 01/19] hmm: heterogeneous memory management documentation v3
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 02/19] mm/hmm: heterogeneous memory management (HMM for short) v5 Jérôme Glisse
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse

This adds documentation for HMM (Heterogeneous Memory Management). It
presents the motivation behind it, the features necessary for it to
be useful, and gives an overview of how this is implemented.

Changed since v2:
  - add section about memory accounting cgroup (memcg)
Changed since v1:
  - removed outdated section

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 Documentation/vm/hmm.txt | 384 +++++++++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS              |   7 +
 2 files changed, 391 insertions(+)
 create mode 100644 Documentation/vm/hmm.txt

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
new file mode 100644
index 000000000000..4d3aac9f4a5d
--- /dev/null
+++ b/Documentation/vm/hmm.txt
@@ -0,0 +1,384 @@
+Heterogeneous Memory Management (HMM)
+
+Transparently allow any component of a program to use any memory region of said
+program with a device without using a device specific memory allocator. This is
+becoming a requirement to simplify the use of advanced heterogeneous computing
+where GPUs, DSPs or FPGAs are used to perform various computations.
+
+This document is divided as follows: in the first section I expose the problems
+related to the use of a device specific allocator. The second section exposes
+the hardware limitations that are inherent to many platforms. The third section
+gives an overview of the HMM design. The fourth section explains how CPU page-
+table mirroring works and what HMM's purpose is in this context. The fifth
+section deals with how device memory is represented inside the kernel. Finally
+the last section presents the new migration helper that allows leveraging the
+device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) System bus, device memory characteristics
+3) Shared address space and migration
+4) Address space mirroring implementation and API
+5) Represent and manage device memory from core kernel point of view
+6) Migrate to and from device memory
+7) Memory cgroup (memcg) and rss accounting
+
+
+-------------------------------------------------------------------------------
+
+1) Problems of using a device specific memory allocator:
+
+Devices with a large amount of on board memory (several gigabytes) like GPUs
+have historically managed their memory through a dedicated driver specific API.
+This creates a disconnect between memory allocated and managed by the device
+driver and regular application memory (private anonymous, shared memory or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation,
+ie one in which any memory region can be used by the device transparently.
+
+The address space is split because the device can only access memory allocated
+through the device specific API. This implies that all memory objects in a
+program are not equal from the device point of view, which complicates large
+programs that rely on a wide set of libraries.
+
+Concretely this means that code that wants to leverage a device like a GPU needs
+to copy objects between generically allocated memory (malloc, mmap private/shared)
+and memory allocated through the device driver API (this still ends up with an
+mmap, but of the device file).
+
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve, but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a complex
+data-set requires re-mapping all the pointer relations between each of its
+elements. This is error prone and the program gets harder to debug because of
+the duplicated data-set.
+
+A split address space also means that libraries can not transparently use data
+they are getting from the core program or another library, and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
+
+Duplicating each library API to accept as input or output memory allocated by
+each device specific allocator is not a viable option. It would lead to a
+combinatorial explosion in the library entry points.
+
+Finally, with the advance of high level language constructs (in C++ but in other
+languages too) it is now possible for the compiler to leverage GPUs or other
+devices without the programmer even knowing. Some compiler identified patterns
+are only doable with a shared address space. It is also more reasonable to use a
+shared address space for all the other patterns.
+
+
+-------------------------------------------------------------------------------
+
+2) System bus, device memory characteristics
+
+System buses cripple shared address spaces due to a few limitations. Most system
+buses only allow basic memory accesses from the device to main memory; even cache
+coherency is often optional. Access to device memory from the CPU is even more
+limited; more often than not it is not cache coherent.
+
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However it only allows
+a limited set of atomic operations from the device on main memory. This is worse
+in the other direction: the CPUs can only access a limited range of the device
+memory and can not perform atomic operations on it. Thus device memory can not
+be considered like regular memory from the kernel point of view.
+
+Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency: access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
+
+Some platforms are developing new system buses or additions/modifications to PCIE
+to address some of those limitations (OpenCAPI, CCIX). They mainly allow two-way
+cache coherency between CPU and device and allow all the atomic operations the
+architecture supports. Sadly not all platforms are following this trend and
+some major architectures are left without hardware solutions to those problems.
+
+So for a shared address space to make sense, not only must we allow the device to
+access any memory, but we must also permit any memory to be migrated to device
+memory while the device is using it (blocking CPU access while it happens).
+
+
+-------------------------------------------------------------------------------
+
+3) Shared address space and migration
+
+HMM intends to provide two main features. The first one is to share the address
+space by duplicating the CPU page table into the device page table, so the same
+address points to the same memory, and this for any valid main memory address in
+the process address space.
+
+To achieve this, HMM offers a set of helpers to populate the device page table
+while keeping track of CPU page table updates. Device page table updates are
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU specific
+commands in it to perform the update (unmap, cache invalidations and flush,
+...). This can not be done through common code for all devices. Hence why HMM
+provides helpers to factor out everything that can be while leaving the gory
+details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However they allow migrating
+main memory to device memory using the existing migration mechanism, and from
+the CPU point of view everything looks as if the page was swapped out to disk.
+Using a struct page gives the easiest and cleanest integration with existing mm
+mechanisms. Again here HMM only provides helpers, first to hotplug new
+ZONE_DEVICE memory for the device memory and second to perform migration.
+Policy decisions of what and when to migrate things are left to the device
+driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory, ie when a page backing a given address A is migrated from
+a main memory page to a device page, then any CPU access to address A triggers a
+page fault and initiates a migration back to main memory.
+
+
+With these two features, HMM not only allows a device to mirror a process
+address space and keeps both the CPU and device page tables synchronized, but
+also allows leveraging device memory by migrating the parts of a data-set that
+are actively used by a device.
+
+
+-------------------------------------------------------------------------------
+
+4) Address space mirroring implementation and API
+
+Address space mirroring's main objective is to allow duplicating a range of the
+CPU page table into a device page table, with HMM helping to keep both
+synchronized. A device driver that wants to mirror a process address space must
+start with the registration of an hmm_mirror struct:
+
+ int hmm_mirror_register(struct hmm_mirror *mirror,
+                         struct mm_struct *mm);
+ int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+                                struct mm_struct *mm);
+
+The locked variant is to be used when the driver is already holding the mmap_sem
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page table updates:
+
+ struct hmm_mirror_ops {
+     /* sync_cpu_device_pagetables() - synchronize page tables
+      *
+      * @mirror: pointer to struct hmm_mirror
+      * @update_type: type of update that occurred to the CPU page table
+      * @start: virtual start address of the range to update
+      * @end: virtual end address of the range to update
+      *
+      * This callback ultimately originates from mmu_notifiers when the CPU
+      * page table is updated. The device driver must update its page table
+      * in response to this callback. The update argument tells what action
+      * to perform.
+      *
+      * The device driver must not return from this callback until the device
+      * page tables are completely updated (TLBs flushed, etc); this is a
+      * synchronous call.
+      */
+      void (*update)(struct hmm_mirror *mirror,
+                     enum hmm_update action,
+                     unsigned long start,
+                     unsigned long end);
+ };
+
+The device driver must perform the update to the range following the action
+(turn the range read only, or fully unmap it, ...). Once the driver callback
+returns, the device must be done with the update.
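+
+As an illustration, a driver callback might look like the sketch below. The
+driver-side pieces (struct driver_mirror, its update lock and
+driver_invalidate_range()) are hypothetical; only the callback signature comes
+from the hmm_mirror_ops above:
+
+ static void driver_update(struct hmm_mirror *mirror,
+                           enum hmm_update action,
+                           unsigned long start,
+                           unsigned long end)
+ {
+     struct driver_mirror *dm = container_of(mirror,
+                                             struct driver_mirror, mirror);
+
+     take_lock(dm->update);
+     // Invalidate the device page table for [start, end) and wait for the
+     // device TLB flush before returning, as the callback is synchronous.
+     driver_invalidate_range(dm, action, start, end);
+     release_lock(dm->update);
+ }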
+
+
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
+ int hmm_vma_get_pfns(struct vm_area_struct *vma,
+                      struct hmm_range *range,
+                      unsigned long start,
+                      unsigned long end,
+                      hmm_pfn_t *pfns);
+ int hmm_vma_fault(struct vm_area_struct *vma,
+                   struct hmm_range *range,
+                   unsigned long start,
+                   unsigned long end,
+                   hmm_pfn_t *pfns,
+                   bool write,
+                   bool block);
+
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table entries
+and will not trigger a page fault on missing or non-present entries. The second
+one does trigger a page fault on missing or read only entries if the write
+parameter is true. Page faults use the generic mm page fault code path just like
+a CPU page fault.
+
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM provides
+a set of flags to help the driver identify special CPU page table entries.
+
+Locking with the update() callback is the most important aspect the driver must
+respect in order to keep things properly synchronized. The usage pattern is:
+
+ int driver_populate_range(...)
+ {
+      struct hmm_range range;
+      ...
+ again:
+      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
+      if (ret)
+          return ret;
+      take_lock(driver->update);
+      if (!hmm_vma_range_done(vma, &range)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      // Use pfns array content to update device page table
+
+      release_lock(driver->update);
+      return 0;
+ }
+
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before calling hmm_vma_range_done()
+to avoid any race with a concurrent CPU page table update.
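+
+For the "Use pfns array content" step in the pattern above, a minimal sketch
+could look like the following (the HMM_PFN_* flags and the hmm_pfn_t_to_page()
+helper are provided by HMM; driver_map_page() is a hypothetical driver-side
+function):
+
+ for (i = 0; i < (end - start) >> PAGE_SHIFT; i++) {
+     struct page *page = hmm_pfn_t_to_page(pfns[i]);
+
+     if (!page)
+         continue; // no valid entry, leave the device entry empty
+     // Map the page in the device page table, read only unless the CPU
+     // page table entry allows write access.
+     driver_map_page(start + (i << PAGE_SHIFT), page,
+                     pfns[i] & HMM_PFN_WRITE);
+ }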
+
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in a multi-device scenario.
+
+HMM also serves as an impedance mismatch between how CPU page table updates are
+done (by the CPU writing to the page table and flushing TLBs) and how devices
+update their own page table. A device update is a multi-step process: first the
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed the commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffers can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
+
+
+-------------------------------------------------------------------------------
+
+5) Represent and manage device memory from core kernel point of view
+
+Several different designs were tried to support device memory. The first one
+used a device specific data structure to keep information about migrated memory,
+and HMM hooked itself in various places of the mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
+
+The thing is, most kernel code paths never try to access the memory behind a
+page but only care about the struct page contents. Because of this, HMM switched
+to directly using struct page for device memory, which left most kernel code
+paths unaware of the difference. We only need to make sure that no one ever
+tries to map those pages from the CPU side.
+
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing struct page. This is offered through a very simple API:
+
+ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+                                   struct device *device,
+                                   unsigned long size);
+ void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+The hmm_devmem_ops is where most of the important things are:
+
+ struct hmm_devmem_ops {
+     void (*free)(struct hmm_devmem *devmem, struct page *page);
+     int (*fault)(struct hmm_devmem *devmem,
+                  struct vm_area_struct *vma,
+                  unsigned long addr,
+                  struct page *page,
+                  unsigned flags,
+                  pmd_t *pmdp);
+ };
+
+The first callback (free()) happens when the last reference on a device page is
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page,
+which it can not do. This second callback must trigger a migration back to
+system memory.
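+
+A minimal usage sketch (the driver_devmem_* callbacks, device and
+device_memory_size below are hypothetical; only hmm_devmem_add() and
+hmm_devmem_remove() come from HMM):
+
+ static const struct hmm_devmem_ops driver_devmem_ops = {
+     .free  = driver_devmem_free,
+     .fault = driver_devmem_fault,
+ };
+
+ struct hmm_devmem *devmem;
+
+ devmem = hmm_devmem_add(&driver_devmem_ops, device, device_memory_size);
+ // On success the driver owns a range of struct pages for its device memory
+ // and hands them out to back device memory allocations; the fault()
+ // callback above is invoked when the CPU touches one of those pages.
+ ...
+ hmm_devmem_remove(devmem);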
+
+
+-------------------------------------------------------------------------------
+
+6) Migrate to and from device memory
+
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform copies from and to device memory. For this we need a new
+migration helper:
+
+ int migrate_vma(const struct migrate_vma_ops *ops,
+                 struct vm_area_struct *vma,
+                 unsigned long mentries,
+                 unsigned long start,
+                 unsigned long end,
+                 unsigned long *src,
+                 unsigned long *dst,
+                 void *private);
+
+Unlike other migration functions it works on a range of virtual addresses; there
+are two reasons for that. First, device DMA copies have a high setup overhead
+cost and thus batching multiple pages is needed, as otherwise the migration
+overhead makes the whole exercise pointless. The second reason is that drivers
+trigger such migrations based on the range of addresses the device is actively
+accessing.
+
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
+
+ struct migrate_vma_ops {
+     void (*alloc_and_copy)(struct vm_area_struct *vma,
+                            const unsigned long *src,
+                            unsigned long *dst,
+                            unsigned long start,
+                            unsigned long end,
+                            void *private);
+     void (*finalize_and_map)(struct vm_area_struct *vma,
+                              const unsigned long *src,
+                              const unsigned long *dst,
+                              unsigned long start,
+                              unsigned long end,
+                              void *private);
+ };
+
+It is important to stress that this migration helper allows for holes in the
+virtual address range. Some pages in the range might not be migrated for all
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
+
+The alloc_and_copy() callback might also decide to not migrate all pages in the
+range (for reasons under the callback's control). For those, the callback just
+has to leave the corresponding dst entries empty.
+
+Finally the migration of the struct page might fail (for a file backed page) for
+various reasons (failure to freeze the reference, or to update the page cache,
+...). If that happens, then finalize_and_map() can catch any pages that were not
+migrated. Note that those pages were still copied to a new page and thus we
+wasted bandwidth, but this is considered a rare event and a price that we are
+willing to pay to keep all the code simpler.
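+
+Putting it together, a driver might drive a migration to device memory roughly
+as follows (the callback implementations and driver-side names are
+hypothetical; the migrate_vma() signature is the one shown above):
+
+ static const struct migrate_vma_ops driver_migrate_ops = {
+     // allocate device pages and use the device DMA engine to copy src to dst
+     .alloc_and_copy   = driver_alloc_and_copy,
+     // point the device page table at the new pages, release temporary state
+     .finalize_and_map = driver_finalize_and_map,
+ };
+
+ npages = (end - start) >> PAGE_SHIFT;
+ ret = migrate_vma(&driver_migrate_ops, vma, npages, start, end,
+                   src_pfns, dst_pfns, driver_private);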
+
+
+-------------------------------------------------------------------------------
+
+7) Memory cgroup (memcg) and rss accounting
+
+For now device memory is accounted as any regular page in the rss counters
+(either anonymous if the device page is used for anonymous memory, file if the
+device page is used for a file backed page, or shmem if the device page is used
+for shared memory). This is a deliberate choice to keep existing applications,
+which might start using device memory without knowing about it, running
+unimpacted.
+
+The drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory, and thus not free much
+system memory. We want to gather more real world experience on how applications
+and systems react under memory pressure in the presence of device memory before
+deciding to account device memory differently.
+
+
+The same decision was made for memory cgroups. Device memory pages are accounted
+against the same memory cgroup a regular page would be accounted to. This does
+simplify migration to and from device memory. This also means that migration
+back from device memory to regular memory can not fail because it would
+go above the memory cgroup limit. We might revisit this choice later on once we
+get more experience in how device memory is used and its impact on memory
+resource control.
+
+
+Note that device memory can never be pinned, neither by the device driver nor
+through GUP, and thus such memory is always freed upon process exit, or when the
+last reference is dropped in the case of shared memory or file backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index a982e0afd894..d8e43f32f246 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7744,6 +7744,13 @@ M:	Sasha Levin <alexander.levin@verizon.com>
 S:	Maintained
 F:	tools/lib/lockdep/
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 LIBNVDIMM BLK: MMIO-APERTURE DRIVER
 M:	Ross Zwisler <ross.zwisler@linux.intel.com>
 L:	linux-nvdimm@lists.01.org
-- 
2.13.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 02/19] mm/hmm: heterogeneous memory management (HMM for short) v5
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 01/19] hmm: heterogeneous memory management documentation v3 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 03/19] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions for all three of
those functionalities.
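
As an illustration of the helpers added here (driver-side snippet, the page
variable is hypothetical), a driver converts between struct page and the
flag-carrying pfn values like this:

    hmm_pfn_t hpfn = hmm_pfn_t_from_page(page); // pfn plus HMM_PFN_VALID
    struct page *p = hmm_pfn_t_to_page(hpfn);   // back to the struct page
    unsigned long pfn = hmm_pfn_t_to_pfn(hpfn); // raw pfn, or -1UL if invalid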

Changed since v4:
  - added hmm_mm_init() call to init mm's HMM fields
Changed since v3:
  - Unconditionally build hmm.c for static keys
Changed since v2:
  - s/device unaddressable/device private
Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h      | 152 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   6 ++
 kernel/fork.c            |   3 +
 mm/Kconfig               |  13 ++++
 mm/Makefile              |   2 +-
 mm/hmm.c                 |  74 +++++++++++++++++++++++
 6 files changed, 249 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 000000000000..ca60595ce784
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Heterogeneous Memory Management (HMM)
+ *
+ * See Documentation/vm/hmm.txt for reasons and overview of what HMM is and
+ * what it is for. Here we focus on the HMM API description, with some
+ * explanation of the underlying implementation.
+ *
+ * Short description: HMM provides a set of helpers to share a virtual address
+ * space between CPU and a device, so that the device can access any valid
+ * address of the process (while still obeying memory protection). HMM also
+ * provides helpers to migrate process memory to device memory, and back. Each
+ * set of functionality (address space mirroring, and migration to and from
+ * device memory) can be used independently of the other.
+ *
+ *
+ * HMM address space mirroring API:
+ *
+ * Use HMM address space mirroring if you want to mirror range of the CPU page
+ * table of a process into a device page table. Here, "mirror" means "keep
+ * synchronized". Prerequisites: the device must provide the ability to write-
+ * protect its page tables (at PAGE_SIZE granularity), and must be able to
+ * recover from the resulting potential page faults.
+ *
+ * HMM guarantees that at any point in time, a given virtual address points to
+ * either the same memory in both CPU and device page tables (that is: CPU and
+ * device page tables each point to the same pages), or that one page table (CPU
+ * or device) points to no entry, while the other still points to the old page
+ * for the address. The latter case happens when the CPU page table update
+ * happens first, and then the update is mirrored over to the device page table.
+ * This does not cause any issue, because the CPU page table cannot start
+ * pointing to a new page until the device page table is invalidated.
+ *
+ * HMM uses mmu_notifiers to monitor the CPU page tables, and forwards any
+ * updates to each device driver that has registered a mirror. It also provides
+ * some API calls to help with taking a snapshot of the CPU page table, and to
+ * synchronize with any updates that might happen concurrently.
+ *
+ *
+ * HMM migration to and from device memory:
+ *
+ * HMM provides a set of helpers to hotplug device memory as ZONE_DEVICE, with
+ * a new MEMORY_DEVICE_PRIVATE type. This provides a struct page for each page
+ * of the device memory, and allows the device driver to manage its memory
+ * using those struct pages. Having struct pages for device memory makes
+ * migration easier. Because that memory is not addressable by the CPU it must
+ * never be pinned to the device; in other words, any CPU page fault can always
+ * cause the device memory to be migrated (copied/moved) back to regular memory.
+ *
+ * A new migrate helper (migrate_vma()) has been added (see mm/migrate.c) that
+ * allows use of a device DMA engine to perform the copy operation between
+ * regular system memory and device memory.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table has write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_t_to_pfn() - return pfn value stored in a hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+static inline void hmm_mm_init(struct mm_struct *mm)
+{
+	mm->hmm = NULL;
+}
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only! Not to be used by device driver! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+static inline void hmm_mm_init(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f384bb62d8e..7f9a04fbbe01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -504,6 +505,11 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM needs to track a few things per mm */
+	struct hmm *hmm;
+#endif
 } __randomize_layout;
 
 extern struct mm_struct init_mm;
diff --git a/kernel/fork.c b/kernel/fork.c
index d747e14ce27b..2760f1b15538 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -37,6 +37,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -807,6 +808,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
+	hmm_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
@@ -885,6 +887,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
diff --git a/mm/Kconfig b/mm/Kconfig
index ab937c8d247f..f0b0f2d6c123 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -691,6 +691,19 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ARCH_HAS_HMM
+	bool
+	default y
+	depends on (X86_64 || PPC64)
+	depends on ZONE_DEVICE
+	depends on MMU && 64BIT
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on SPARSEMEM_VMEMMAP
+
+config HMM
+	bool
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 411bd24d4a7c..1cde2a8bed97 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   debug.o hmm.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 000000000000..88a7e10747d5
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,74 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+
+#ifdef CONFIG_HMM
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers. It allocates an
+ * HMM struct if mm does not have one, and initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	if (!mm->hmm) {
+		struct hmm *hmm = NULL;
+
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+
+		spin_lock(&mm->page_table_lock);
+		if (!mm->hmm)
+			mm->hmm = hmm;
+		else
+			kfree(hmm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	/*
+	 * The hmm struct can only be freed once the mm_struct goes away,
+	 * hence we should always have pre-allocated a new hmm struct
+	 * above.
+	 */
+	return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	kfree(mm->hmm);
+}
+#endif /* CONFIG_HMM */
-- 
2.13.4


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 03/19] mm/hmm/mirror: mirror process address space on device with HMM helpers v3
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 01/19] hmm: heterogeneous memory management documentation v3 Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 02/19] mm/hmm: heterogeneous memory management (HMM for short) v5 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 04/19] mm/hmm/mirror: helper to snapshot CPU page table v4 Jérôme Glisse
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This is the heterogeneous memory management (HMM) process address space
mirroring. In a nutshell this provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page tables
synchronized (we assume that both device and CPU are cache coherent, like
PCIe devices can be).

This patch provides a simple API for device drivers to achieve address
space mirroring, thus avoiding the need for each device driver to grow its
own CPU page table walker and its own CPU page table synchronization
mechanism.

This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.

Changed since v2:
  - s/device unaddressable/device private/
Changed since v1:
  - Kconfig logic (depend on x86-64 and use ARCH_HAS pattern)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 110 ++++++++++++++++++++++++++++++++++
 mm/Kconfig          |  12 ++++
 mm/hmm.c            | 168 +++++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 275 insertions(+), 15 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ca60595ce784..61c707970aa6 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
@@ -134,6 +135,115 @@ static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize device page table with CPU page table.
+ *
+ * A device driver that is participating in HMM mirroring must always
+ * synchronize with CPU page table updates. For this, device drivers can either
+ * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
+ * drivers can decide to register one mirror per device per process, or just
+ * one mirror per process for a group of devices. The pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contains an hmm_mirror struct as one of its fields.
+ *          ...
+ *
+ *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *
+ *          // Other device driver specific initialization
+ *          ...
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * will get callbacks through sync_cpu_device_pagetables() operation (see
+ * hmm_mirror_ops struct).
+ *
+ * Device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(). The expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          ...
+ *
+ *          hmm_mirror_unregister(&das->mirror);
+ *
+ *          // Other device driver specific cleanup, and now das can be freed
+ *          ...
+ *      }
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update_type - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update_type {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* sync_cpu_device_pagetables() - synchronize page tables
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update_type: type of update that occurred to the CPU page table
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback ultimately originates from mmu_notifiers when the CPU
+	 * page table is updated. The device driver must update its page table
+	 * in response to this callback. The update argument tells what action
+	 * to perform.
+	 *
+	 * The device driver must not return from this callback until the device
+	 * page tables are completely updated (TLBs flushed, etc); this is a
+	 * synchronous call.
+	 */
+	void (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
+					   enum hmm_update_type update_type,
+					   unsigned long start,
+					   unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * instance of an hmm_mirror struct with HMM. HMM will track the list of all
+ * mirrors for each mm_struct.
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index f0b0f2d6c123..87d530cbbbc7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -704,6 +704,18 @@ config ARCH_HAS_HMM
 config HMM
 	bool
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	depends on ARCH_HAS_HMM
+	select MMU_NOTIFIER
+	select HMM
+	help
+	  Select HMM_MIRROR if you want to mirror range of the CPU page table of a
+	  process into a device page table. Here, mirror means "keep synchronized".
+	  Prerequisites: the device must provide the ability to write-protect its
+	  page tables (at PAGE_SIZE granularity), and must be able to recover from
+	  the resulting potential page faults.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 88a7e10747d5..ff78c92cd34c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,16 +21,27 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 
 #ifdef CONFIG_HMM
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
+
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @sequence: we track updates to the CPU page table with a sequence number
+ * @mirrors: list of mirrors for this mm
+ * @mmu_notifier: mmu notifier to track updates to CPU page table
+ * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	atomic_t		sequence;
+	struct list_head	mirrors;
+	struct mmu_notifier	mmu_notifier;
+	struct rw_semaphore	mirrors_sem;
 };
 
 /*
@@ -43,27 +54,48 @@ struct hmm {
  */
 static struct hmm *hmm_register(struct mm_struct *mm)
 {
-	if (!mm->hmm) {
-		struct hmm *hmm = NULL;
-
-		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
-		if (!hmm)
-			return NULL;
-		hmm->mm = mm;
-
-		spin_lock(&mm->page_table_lock);
-		if (!mm->hmm)
-			mm->hmm = hmm;
-		else
-			kfree(hmm);
-		spin_unlock(&mm->page_table_lock);
-	}
+	struct hmm *hmm = READ_ONCE(mm->hmm);
+	bool cleanup = false;
 
 	/*
 	 * The hmm struct can only be freed once the mm_struct goes away,
 	 * hence we should always have pre-allocated a new hmm struct
 	 * above.
 	 */
+	if (hmm)
+		return hmm;
+
+	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+	if (!hmm)
+		return NULL;
+	INIT_LIST_HEAD(&hmm->mirrors);
+	init_rwsem(&hmm->mirrors_sem);
+	atomic_set(&hmm->sequence, 0);
+	hmm->mmu_notifier.ops = NULL;
+	hmm->mm = mm;
+
+	/*
+	 * We should only get here if we hold the mmap_sem in write mode ie on
+	 * registration of first mirror through hmm_mirror_register()
+	 */
+	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+		kfree(hmm);
+		return NULL;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (!mm->hmm)
+		mm->hmm = hmm;
+	else
+		cleanup = true;
+	spin_unlock(&mm->page_table_lock);
+
+	if (cleanup) {
+		mmu_notifier_unregister(&hmm->mmu_notifier, mm);
+		kfree(hmm);
+	}
+
 	return mm->hmm;
 }
 
@@ -72,3 +104,109 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	kfree(mm->hmm);
 }
 #endif /* CONFIG_HMM */
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update_type action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list)
+		mirror->ops->sync_cpu_device_pagetables(mirror, action,
+							start, end);
+	up_read(&hmm->mirrors_sem);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->sequence);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space, the device driver must register
+ * an HMM mirror struct.
+ *
+ * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	down_write(&mirror->hmm->mirrors_sem);
+	list_add(&mirror->list, &mirror->hmm->mirrors);
+	up_write(&mirror->hmm->mirrors_sem);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space, and cleanup.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	down_write(&hmm->mirrors_sem);
+	list_del(&mirror->list);
+	up_write(&hmm->mirrors_sem);
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.13.4


* [HMM-v25 04/19] mm/hmm/mirror: helper to snapshot CPU page table v4
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 03/19] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 05/19] mm/hmm/mirror: device page fault handler Jérôme Glisse
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to share
the same code with our page fault handler.
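
For illustration only, below is a minimal sketch of how a driver might use
the snapshot helpers added by this patch. Only hmm_vma_get_pfns(),
hmm_vma_range_done(), struct hmm_range and hmm_pfn_t come from this series;
the dummy_* names and the device page table lock are hypothetical, and the
caller is assumed to hold mmap_sem in read mode:

#include <linux/mm.h>
#include <linux/hmm.h>
#include <linux/mutex.h>

/* Hypothetical lock, also taken by the sync_cpu_device_pagetables() callback. */
static DEFINE_MUTEX(dummy_dev_pt_lock);
/* Hypothetical helper that writes the snapshot into the device page table. */
void dummy_update_device_ptes(hmm_pfn_t *pfns, unsigned long npages);

/* Caller holds mm->mmap_sem in read mode. */
static int dummy_mirror_range(struct vm_area_struct *vma,
			      unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	struct hmm_range range;
	hmm_pfn_t *pfns;
	int ret;

	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;

again:
	ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
	if (ret)
		goto out;

	mutex_lock(&dummy_dev_pt_lock);
	if (!hmm_vma_range_done(vma, &range)) {
		/* CPU page table changed under us, the snapshot is stale. */
		mutex_unlock(&dummy_dev_pt_lock);
		goto again;
	}
	dummy_update_device_ptes(pfns, npages);
	mutex_unlock(&dummy_dev_pt_lock);
out:
	kvfree(pfns);
	return ret;
}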

Changed since v3:
  - fix THP handling
Changed since v2:
  - s/device unaddressable/device private/
Changes since v1:
  - Use spinlock instead of rcu synchronized list traversal

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  55 +++++++++-
 mm/hmm.c            | 285 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 338 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 61c707970aa6..4cb6f9706c3c 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -79,13 +79,26 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
+ *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
+ *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
+ *      set and the pfn value is undefined.
+ * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_SPECIAL (1 << 5)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
+#define HMM_PFN_SHIFT 7
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -241,6 +254,44 @@ struct hmm_mirror {
 
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it was filled by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
+ * driver lock that serializes device page table updates, then call
+ * hmm_vma_range_done(), to check if the snapshot is still valid. The same
+ * device driver page table update lock must also be used in the
+ * hmm_mirror_ops.sync_cpu_device_pagetables() callback, so that CPU page
+ * table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index ff78c92cd34c..b8a24b80463e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,8 +19,12 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
 
@@ -31,14 +35,18 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting ranges list
  * @sequence: we track updates to the CPU page table with a sequence number
+ * @ranges: list of range being snapshotted
  * @mirrors: list of mirrors for this mm
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
 	atomic_t		sequence;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
@@ -72,6 +80,8 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 	init_rwsem(&hmm->mirrors_sem);
 	atomic_set(&hmm->sequence, 0);
 	hmm->mmu_notifier.ops = NULL;
+	INIT_LIST_HEAD(&hmm->ranges);
+	spin_lock_init(&hmm->lock);
 	hmm->mm = mm;
 
 	/*
@@ -112,6 +122,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	spin_lock(&hmm->lock);
+	list_for_each_entry(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	spin_unlock(&hmm->lock);
 
 	down_read(&hmm->mirrors_sem);
 	list_for_each_entry(mirror, &hmm->mirrors, list)
@@ -209,4 +235,263 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static int hmm_pfns_bad(unsigned long addr,
+			unsigned long end,
+			struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = HMM_PFN_ERROR;
+
+	return 0;
+}
+
+static int hmm_vma_walk_hole(unsigned long addr,
+			     unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = HMM_PFN_EMPTY;
+
+	return 0;
+}
+
+static int hmm_vma_walk_clear(unsigned long addr,
+			      unsigned long end,
+			      struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long i;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	for (; addr < end; addr += PAGE_SIZE, i++)
+		pfns[i] = 0;
+
+	return 0;
+}
+
+static int hmm_vma_walk_pmd(pmd_t *pmdp,
+			    unsigned long start,
+			    unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct hmm_range *range = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	hmm_pfn_t *pfns = range->pfns;
+	unsigned long addr = start, i;
+	hmm_pfn_t flag;
+	pte_t *ptep;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+again:
+	if (pmd_none(*pmdp))
+		return hmm_vma_walk_hole(start, end, walk);
+
+	if (pmd_huge(*pmdp) && vma->vm_flags & VM_HUGETLB)
+		return hmm_pfns_bad(start, end, walk);
+
+	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
+		unsigned long pfn;
+		pmd_t pmd;
+
+		/*
+		 * No need to take the pmd_lock here: even if some other thread
+		 * is splitting the huge pmd, we will get that event through
+		 * the mmu_notifier callback.
+		 *
+		 * So just read the pmd value and check again that it is a
+		 * transparent huge or device mapping, then compute the
+		 * corresponding pfn values.
+		 */
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
+			goto again;
+		if (pmd_protnone(pmd))
+			return hmm_vma_walk_clear(start, end, walk);
+
+		pfn = pmd_pfn(pmd) + pte_index(addr);
+		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
+		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
+			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+		return 0;
+	}
+
+	if (pmd_bad(*pmdp))
+		return hmm_pfns_bad(start, end, walk);
+
+	ptep = pte_offset_map(pmdp, addr);
+	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
+		pte_t pte = *ptep;
+
+		pfns[i] = 0;
+
+		if (pte_none(pte) || !pte_present(pte)) {
+			pfns[i] = HMM_PFN_EMPTY;
+			continue;
+		}
+
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	}
+	pte_unmap(ptep - 1);
+
+	return 0;
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t: provided by the caller, filled in by function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by range struct. See hmm_vma_range_done() for further
+ * information.
+ *
+ * The range struct is initialized here. It tracks the CPU page table, but only
+ * if the function returns success (0), in which case the caller must then call
+ * hmm_vma_range_done() to stop CPU page table update tracking on this range.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have registered a mirror, via hmm_mirror_register() ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.private = range;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	walk_page_range(start, end, &mm_walk);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being tracked
+ * Returns: false if range data has been invalidated, true otherwise
+ *
+ * Range struct is used to track updates to the CPU page table after a call to
+ * either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is done
+ * using the data, or wants to lock updates to the data it got from those
+ * functions, it must call the hmm_vma_range_done() function, which will then
+ * stop tracking CPU page table updates.
+ *
+ * Note that the device driver must still implement general CPU page table
+ * update tracking either by using hmm_mirror (see hmm_mirror_register()) or
+ * by using the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * to be used while trying to duplicate CPU page table contents for a range of
+ * virtual addresses.
+ *
+ * There are two ways to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.13.4


* [HMM-v25 05/19] mm/hmm/mirror: device page fault handler
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 04/19] mm/hmm/mirror: helper to snapshot CPU page table v4 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 06/19] mm/memory_hotplug: introduce add_pages Jérôme Glisse
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver; unlike handle_mm_fault(),
it does not trigger migration of device memory back to system memory.
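
For illustration only, a rough sketch of a driver fault path built around the
new hmm_vma_fault() helper, here with block == true so that mmap_sem is not
dropped (a non-blocking caller would additionally have to handle -EAGAIN as
described in the function documentation). The dummy_* names and the device
page table lock are hypothetical:

#include <linux/mm.h>
#include <linux/hmm.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(dummy_dev_pt_lock);		/* hypothetical */
/* Hypothetical helper that commits one entry to the device page table. */
void dummy_update_device_pte(unsigned long addr, hmm_pfn_t pfn);

static int dummy_handle_device_fault(struct mm_struct *mm,
				     unsigned long addr, bool write)
{
	unsigned long start = addr & PAGE_MASK;
	unsigned long end = start + PAGE_SIZE;
	struct vm_area_struct *vma;
	struct hmm_range range;
	hmm_pfn_t pfn;
	int ret;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (!vma || addr < vma->vm_start) {
		up_read(&mm->mmap_sem);
		return -EFAULT;
	}

	ret = hmm_vma_fault(vma, &range, start, end, &pfn, write, true);
	if (ret) {
		/* On error hmm_vma_fault() already dropped the range. */
		up_read(&mm->mmap_sem);
		return ret;
	}

	/* Same lock as the one used by sync_cpu_device_pagetables(). */
	mutex_lock(&dummy_dev_pt_lock);
	hmm_vma_range_done(vma, &range);
	dummy_update_device_pte(addr, pfn);
	mutex_unlock(&dummy_dev_pt_lock);
	up_read(&mm->mmap_sem);
	return 0;
}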

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  27 ++++++
 mm/hmm.c            | 256 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 271 insertions(+), 12 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4cb6f9706c3c..79f5d4fa2d4f 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -292,6 +292,33 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * be updated with the fault result and current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be
+ * dropped by the function if the block argument is false. In that case, the
+ * function returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successful for every single
+ * address or not. Therefore, the caller must inspect the hmm_pfn_t array to
+ * determine fault status for each address.
+ *
+ * Trying to fault inside an invalid vma will result in -EINVAL.
+ *
+ * See the function description in mm/hmm.c for further documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index b8a24b80463e..91592dae364e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -236,6 +236,36 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+struct hmm_vma_walk {
+	struct hmm_range	*range;
+	unsigned long		last;
+	bool			fault;
+	bool			block;
+	bool			write;
+};
+
+static int hmm_vma_do_fault(struct mm_walk *walk,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn)
+{
+	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int r;
+
+	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EBUSY;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return -EAGAIN;
+}
+
 static void hmm_pfns_special(hmm_pfn_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
@@ -259,34 +289,62 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = 0;
+}
+
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = HMM_PFN_EMPTY;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_clear(unsigned long addr,
 			      unsigned long end,
 			      struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
+	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++)
+	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = 0;
+		if (hmm_vma_walk->fault) {
+			int ret;
 
-	return 0;
+			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			if (ret != -EAGAIN)
+				return ret;
+		}
+	}
+
+	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -294,15 +352,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long end,
 			    struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
+	bool write_fault;
 	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
 	if (pmd_none(*pmdp))
@@ -331,6 +392,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (pmd_protnone(pmd))
 			return hmm_vma_walk_clear(start, end, walk);
 
+		if (write_fault && !pmd_write(pmd))
+			return hmm_vma_walk_clear(start, end, walk);
+
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
@@ -347,13 +411,55 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 		pfns[i] = 0;
 
-		if (pte_none(pte) || !pte_present(pte)) {
+		if (pte_none(pte)) {
 			pfns[i] = HMM_PFN_EMPTY;
+			if (hmm_vma_walk->fault)
+				goto fault;
 			continue;
 		}
 
+		if (!pte_present(pte)) {
+			swp_entry_t entry = pte_to_swp_entry(pte);
+
+			if (!non_swap_entry(entry)) {
+				if (hmm_vma_walk->fault)
+					goto fault;
+				continue;
+			}
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			 */
+			if (is_migration_entry(entry)) {
+				if (hmm_vma_walk->fault) {
+					pte_unmap(ptep);
+					hmm_vma_walk->last = addr;
+					migration_entry_wait(vma->vm_mm,
+							     pmdp, addr);
+					return -EAGAIN;
+				}
+				continue;
+			} else {
+				/* Report error for everything else */
+				pfns[i] = HMM_PFN_ERROR;
+			}
+			continue;
+		}
+
+		if (write_fault && !pte_write(pte))
+			goto fault;
+
 		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+		continue;
+
+fault:
+		pte_unmap(ptep);
+		/* Fault all pages in range */
+		return hmm_vma_walk_clear(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
@@ -386,6 +492,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns)
 {
+	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
@@ -417,9 +524,12 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
+	hmm_vma_walk.fault = false;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+
 	mm_walk.vma = vma;
 	mm_walk.mm = vma->vm_mm;
-	mm_walk.private = range;
 	mm_walk.pte_entry = NULL;
 	mm_walk.test_walk = NULL;
 	mm_walk.hugetlb_entry = NULL;
@@ -427,7 +537,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	walk_page_range(start, end, &mm_walk);
-
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -454,7 +563,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -465,7 +574,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -494,4 +603,127 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem has been dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, range, start, end, pfns, write, block);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_done()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem)
+ *     return;
+ *   }
+ *   // Take device driver lock that serialize device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURNS SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	struct hmm_vma_walk hmm_vma_walk;
+	struct mm_walk mm_walk;
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have registered a mirror using hmm_mirror_register() */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	hmm_vma_walk.fault = true;
+	hmm_vma_walk.write = write;
+	hmm_vma_walk.block = block;
+	hmm_vma_walk.range = range;
+	mm_walk.private = &hmm_vma_walk;
+	hmm_vma_walk.last = range->start;
+
+	mm_walk.vma = vma;
+	mm_walk.mm = vma->vm_mm;
+	mm_walk.pte_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.pmd_entry = hmm_vma_walk_pmd;
+	mm_walk.pte_hole = hmm_vma_walk_hole;
+
+	do {
+		ret = walk_page_range(start, end, &mm_walk);
+		start = hmm_vma_walk.last;
+	} while (ret == -EAGAIN);
+
+	if (ret) {
+		unsigned long i;
+
+		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
+		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
+		hmm_vma_range_done(vma, range);
+	}
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.13.4


* [HMM-v25 06/19] mm/memory_hotplug: introduce add_pages
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 05/19] mm/hmm/mirror: device page fault handler Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5 Jérôme Glisse
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Michal Hocko, Jérôme Glisse

From: Michal Hocko <mhocko@suse.com>

There are new users of memory hotplug emerging. Some of them require
a different subset of arch_add_memory. There are some which only require
allocation of struct pages without mapping those pages to the kernel
address space. We currently have __add_pages for that purpose. But this
is rather low-level and not very suitable for code outside of memory
hotplug. E.g. x86_64 wants to update max_pfn, which should be
done by the caller. Introduce add_pages() which should care about those
details if they are needed. Each architecture should define its
implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use
the currently existing __add_pages.
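
For illustration only, the intended split could be used by a hypothetical
caller like the sketch below; only add_pages(), arch_add_memory() and
__add_pages() are real, everything dummy_* is made up:

#include <linux/mm.h>
#include <linux/memory_hotplug.h>

/* Pick an entry point depending on whether a kernel mapping is wanted. */
static int dummy_hotplug_pages(int nid, u64 start, u64 size, bool map_kernel)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;

	if (map_kernel)
		/* Regular hotplug: struct pages plus kernel linear mapping. */
		return arch_add_memory(nid, start, size, true);

	/*
	 * Only struct pages are wanted; add_pages() still lets the
	 * architecture update things like max_pfn on x86-64, which a bare
	 * __add_pages() call would not do.
	 */
	return add_pages(nid, start_pfn, nr_pages, false);
}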

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/x86/Kconfig               |  4 ++++
 arch/x86/mm/init_64.c          | 22 +++++++++++++++-------
 include/linux/memory_hotplug.h | 11 +++++++++++
 3 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cf3012220d10..78492ac02779 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2319,6 +2319,10 @@ source "kernel/livepatch/Kconfig"
 
 endmenu
 
+config ARCH_HAS_ADD_PAGES
+	def_bool y
+	depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
+
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 136422d7d539..048fbe8fc274 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -761,7 +761,7 @@ void __init paging_init(void)
  * After memory hotplug the variables max_pfn, max_low_pfn and high_memory need
  * updating.
  */
-static void  update_end_of_memory_vars(u64 start, u64 size)
+static void update_end_of_memory_vars(u64 start, u64 size)
 {
 	unsigned long end_pfn = PFN_UP(start + size);
 
@@ -772,22 +772,30 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 	}
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+int add_pages(int nid, unsigned long start_pfn,
+	      unsigned long nr_pages, bool want_memblock)
 {
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	init_memory_mapping(start, start + size);
-
 	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
 	WARN_ON_ONCE(ret);
 
 	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start, size);
+	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
+				  nr_pages << PAGE_SHIFT);
 
 	return ret;
 }
+
+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	init_memory_mapping(start, start + size);
+
+	return add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
 #define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 5e6e4cc36ff4..0995e1a2b458 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -133,6 +133,17 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 extern int __add_pages(int nid, unsigned long start_pfn,
 	unsigned long nr_pages, bool want_memblock);
 
+#ifndef CONFIG_ARCH_HAS_ADD_PAGES
+static inline int add_pages(int nid, unsigned long start_pfn,
+			    unsigned long nr_pages, bool want_memblock)
+{
+	return __add_pages(nid, start_pfn, nr_pages, want_memblock);
+}
+#else /* ARCH_HAS_ADD_PAGES */
+int add_pages(int nid, unsigned long start_pfn,
+	      unsigned long nr_pages, bool want_memblock);
+#endif /* ARCH_HAS_ADD_PAGES */
+
 #ifdef CONFIG_NUMA
 extern int memory_add_physaddr_to_nid(u64 start);
 #else
-- 
2.13.4


* [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 06/19] mm/memory_hotplug: introduce add_pages Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2018-12-20  8:33   ` Dan Williams
  2017-08-17  0:05 ` [HMM-v25 08/19] mm/ZONE_DEVICE: special case put_page() for device private pages v4 Jérôme Glisse
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Ross Zwisler

HMM (heterogeneous memory management) needs struct page to support migration
from system main memory to device memory. The reasons for HMM and for
migration to device memory are explained in the HMM core patch.

This patch deals with device memory that is un-addressable (i.e. the CPU can
not access it). Hence we do not want those struct pages to be managed like
regular memory. That is why we extend ZONE_DEVICE to support different
types of memory.

A persistent memory type is defined for existing users of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory.
There is a clear separation between what is expected from each memory type,
and existing users of ZONE_DEVICE are un-affected by the new requirements
and by the new use of the un-addressable type. All type-specific code paths
are protected by tests against the memory type.

Because the memory is un-addressable we use a new special swap type for when
a page is migrated to device memory (this reduces the maximum number of
swap files).

The two main additions to ZONE_DEVICE, beside the memory type, are two
callbacks. The first one, page_free(), is called whenever the page refcount
reaches 1 (which means the page is free, as a ZONE_DEVICE page never reaches
a refcount of 0). This allows the device driver to manage its memory and the
associated struct pages.

The second callback, page_fault(), happens when there is a CPU access to an
address that is backed by a device page (which is un-addressable by the
CPU). This callback is responsible for migrating the page back to system
main memory. The device driver can not block migration back to system
memory; HMM makes sure that such pages can not be pinned in device memory.

If the device is in some error condition and can not migrate memory back,
then a CPU page fault to device memory should end with SIGBUS.
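
As a rough illustration of the new fields (not taken from this patch; the
dummy_* driver functions are hypothetical, and how a driver actually gets
hold of its dev_pagemap is left to the HMM device memory helpers later in
this series), a MEMORY_DEVICE_PRIVATE user ends up with something like:

#include <linux/mm.h>
#include <linux/memremap.h>

/* Hypothetical driver helpers. */
int dummy_migrate_to_ram(struct vm_area_struct *vma, unsigned long addr,
			 const struct page *page, pmd_t *pmdp);
void dummy_free_device_page(void *ddev, struct page *page);

static int dummy_devmem_fault(struct vm_area_struct *vma, unsigned long addr,
			      const struct page *page, unsigned int flags,
			      pmd_t *pmdp)
{
	/* Migrate the page back to system memory, or report SIGBUS. */
	if (dummy_migrate_to_ram(vma, addr, page, pmdp))
		return VM_FAULT_SIGBUS;
	return 0;
}

static void dummy_devmem_free(struct page *page, void *data)
{
	/* The page is free again from the device allocator's point of view. */
	dummy_free_device_page(data, page);
}

static void dummy_devmem_init_pagemap(struct dev_pagemap *pgmap, void *ddev)
{
	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->page_fault = dummy_devmem_fault;
	pgmap->page_free = dummy_devmem_free;
	pgmap->data = ddev;
}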

Changed since v4:
  - s/DEVICE_PUBLIC/DEVICE_HOST (to free DEVICE_PUBLIC for HMM-CDM)
Changed since v3:
  - fix comments that was still using UNADDRESSABLE as keyword
  - kernel configuration simplification
Changed since v2:
  - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
Changed since v1:
  - rename to device private memory (from device unaddressable)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       |  7 +++++
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h       | 12 ++++++++
 include/linux/swap.h     | 24 ++++++++++++++--
 include/linux/swapops.h  | 68 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c        | 34 ++++++++++++++++++++++
 mm/Kconfig               | 11 +++++++-
 mm/memory.c              | 61 ++++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c      | 10 +++++--
 mm/mprotect.c            | 14 ++++++++++
 11 files changed, 309 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 186e907187ab..e3ba93b35925 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -535,6 +535,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			}
 		} else if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		else if (is_device_private_entry(swpent))
+			page = device_private_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -698,6 +700,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		else if (is_device_private_entry(swpent))
+			page = device_private_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1198,6 +1202,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_private_entry(entry))
+			page = device_private_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064d7f95..3a4f69137bc2 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
 	IORES_DESC_ACPI_NV_STORAGE		= 3,
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
+	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 93416196ba64..8e164ec9eed0 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -4,6 +4,8 @@
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
+#include <asm/pgtable.h>
+
 struct resource;
 struct device;
 
@@ -35,18 +37,89 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+/*
+ * Specialize ZONE_DEVICE memory into multiple types, each having a different
+ * usage.
+ *
+ * MEMORY_DEVICE_HOST:
+ * Persistent device memory (pmem): struct page might be allocated in different
+ * memory and architecture might want to perform special actions. It is similar
+ * to regular memory, in that the CPU can access it transparently. However,
+ * it is likely to have different bandwidth and latency than regular memory.
+ * See Documentation/nvdimm/nvdimm.txt for more information.
+ *
+ * MEMORY_DEVICE_PRIVATE:
+ * Device memory that is not directly addressable by the CPU: CPU can neither
+ * read nor write private memory. In this case, we do still have struct pages
+ * backing the device memory. Doing so simplifies the implementation, but it is
+ * important to remember that there are certain points at which the struct page
+ * must be treated as an opaque object, rather than a "normal" struct page.
+ *
+ * A more complete discussion of unaddressable memory may be found in
+ * include/linux/hmm.h and Documentation/vm/hmm.txt.
+ */
+enum memory_type {
+	MEMORY_DEVICE_HOST = 0,
+	MEMORY_DEVICE_PRIVATE,
+};
+
+/*
+ * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two
+ * callbacks:
+ *   page_fault()
+ *   page_free()
+ *
+ * Additional notes about MEMORY_DEVICE_PRIVATE may be found in
+ * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief
+ * explanation in include/linux/memory_hotplug.h.
+ *
+ * The page_fault() callback must migrate page back, from device memory to
+ * system memory, so that the CPU can access it. This might fail for various
+ * reasons (device issues, device has been unplugged, ...). When such error
+ * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
+ * set the CPU page table entry to "poisoned".
+ *
+ * Note that because memory cgroup charges are transferred to the device memory,
+ * this should never fail due to memory restrictions. However, allocation
+ * of a regular system page might still fail because we are out of memory. If
+ * that happens, the page_fault() callback must return VM_FAULT_OOM.
+ *
+ * The page_fault() callback can also try to migrate back multiple pages in one
+ * chunk, as an optimization. It must, however, prioritize the faulting address
+ * over all the others.
+ *
+ *
+ * The page_free() callback is called once the page refcount reaches 1
+ * (ZONE_DEVICE pages never reach a refcount of 0 unless there is a refcount
+ * bug). This allows the device driver to implement its own memory management.
+ */
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+				unsigned long addr,
+				const struct page *page,
+				unsigned int flags,
+				pmd_t *pmdp);
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when CPU fault on an unaddressable device page
+ * @page_free: free page callback when page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @data: private data pointer for page_free()
+ * @type: memory type: see MEMORY_* in memory_hotplug.h
  */
 struct dev_pagemap {
+	dev_page_fault_t page_fault;
+	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
+	enum memory_type type;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9fee3213a75e..a59e149b958a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -788,11 +788,23 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	/* See MEMORY_DEVICE_PRIVATE in include/linux/memory_hotplug.h */
+	return ((page_zonenum(page) == ZONE_DEVICE) &&
+		(page->pgmap->type == MEMORY_DEVICE_PRIVATE));
+}
 #else
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 static inline void get_page(struct page *page)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 76f1632eea5a..12beb55ccdd4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -51,6 +51,23 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Unaddressable device memory support. See include/linux/hmm.h and
+ * Documentation/vm/hmm.txt. Short description is we need struct pages for
+ * device memory that is unaddressable (inaccessible) by CPU, so that we can
+ * migrate part of a process memory to device memory.
+ *
+ * When a page is migrated from CPU to device, we set the CPU page table entry
+ * to a special SWP_DEVICE_* entry.
+ */
+#ifdef CONFIG_DEVICE_PRIVATE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
+#define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -72,7 +89,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -435,8 +453,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_private_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_private_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 45b092aa6419..291c4b534658 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,74 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+{
+	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
+			 page_to_pfn(page));
+}
+
+static inline bool is_device_private_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_private_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+}
+
+static inline bool is_write_device_private_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_private_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_private_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned int flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_PRIVATE */
+static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_private_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_private_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_private_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_private_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_private_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned int flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_PRIVATE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 9afdc434fb49..d3241f51c1f0 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -194,6 +196,34 @@ struct page_map {
 	struct vmem_altmap altmap;
 };
 
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+int device_private_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned int flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_private_entry_to_page(entry);
+
+	/*
+	 * The page_fault() callback must migrate page back to system memory
+	 * so that CPU can access it. This might fail for various reasons
+	 * (device issue, device was unsafely unplugged, ...). When such
+	 * error conditions happen, the callback must return VM_FAULT_SIGBUS.
+	 *
+	 * Note that because memory cgroup charges are accounted to the device
+	 * memory, this should never fail because of memory restrictions (but
+	 * allocation of regular system page might still fail because we are
+	 * out of memory).
+	 *
+	 * There is a more in-depth description of what that callback can and
+	 * cannot do, in include/linux/memremap.h
+	 */
+	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_private_entry_fault);
+#endif /* CONFIG_DEVICE_PRIVATE */
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -333,6 +363,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->type = MEMORY_DEVICE_HOST;
+	pgmap->page_fault = NULL;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 87d530cbbbc7..c07cb590b2d6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -676,7 +676,7 @@ config ARCH_HAS_ZONE_DEVICE
 	bool
 
 config ZONE_DEVICE
-	bool "Device memory (pmem, etc...) hotplug support"
+	bool "Device memory (pmem, HMM, etc...) hotplug support"
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on SPARSEMEM_VMEMMAP
@@ -716,6 +716,15 @@ config HMM_MIRROR
 	  page tables (at PAGE_SIZE granularity), and must be able to recover from
 	  the resulting potential page faults.
 
+config DEVICE_PRIVATE
+	bool "Unaddressable device memory (GPU memory, ...)"
+	depends on ARCH_HAS_HMM
+
+	help
+	  Allows creation of struct pages to represent unaddressable device
+	  memory; i.e., memory that is only accessible from the device (or
+	  group of devices). You likely also want to select HMM_MIRROR.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index edabf6f03447..d1b2da5fe70e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -49,6 +49,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -927,6 +928,35 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_private_entry(entry)) {
+			page = device_private_entry_to_page(entry);
+
+			/*
+			 * Update rss count even for unaddressable pages, as
+			 * they should be treated just like normal pages in this
+			 * respect.
+			 *
+			 * We will likely want to have some new rss counters
+			 * for unaddressable pages, at some point. But for now
+			 * keep things as they are.
+			 */
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			/*
+			 * We do not preserve soft-dirty information, because so
+			 * far, checkpoint/restore is the only feature that
+			 * requires that. And checkpoint/restore does not work
+			 * when a device driver is involved (you cannot easily
+			 * save and restore device driver state).
+			 */
+			if (is_write_device_private_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_private_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1245,6 +1275,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
+
+		entry = pte_to_swp_entry(ptent);
+		if (non_swap_entry(entry) && is_device_private_entry(entry)) {
+			struct page *page = device_private_entry_to_page(entry);
+
+			if (unlikely(details && details->check_mapping)) {
+				/*
+				 * unmap_shared_mapping_pages() wants to
+				 * invalidate cache without truncating:
+				 * unmap shared but keep private pages.
+				 */
+				if (details->check_mapping !=
+				    page_rmapping(page))
+					continue;
+			}
+
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			rss[mm_counter(page)]--;
+			page_remove_rmap(page, false);
+			put_page(page);
+			continue;
+		}
+
 		/* If details->check_mapping, we leave swap entries. */
 		if (unlikely(details))
 			continue;
@@ -2740,6 +2793,14 @@ int do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_private_entry(entry)) {
+			/*
+			 * For un-addressable device memory we call the pgmap
+			 * fault handler callback. The callback must migrate
+			 * the page back to some CPU accessible page.
+			 */
+			ret = device_private_entry_fault(vma, vmf->address, entry,
+						 vmf->flags, vmf->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1de2f132bca3..459bbc182d10 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -99,7 +99,7 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res;
+	struct resource *res, *conflict;
 	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
 	if (!res)
 		return ERR_PTR(-ENOMEM);
@@ -108,7 +108,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 	res->start = start;
 	res->end = start + size - 1;
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	if (request_resource(&iomem_resource, res) < 0) {
+	conflict =  request_resource_conflict(&iomem_resource, res);
+	if (conflict) {
+		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
+			pr_debug("Device unaddressable memory block "
+				 "memory hotplug at %#010llx !\n",
+				 (unsigned long long)start);
+		}
 		pr_debug("System RAM resource %pR cannot be added\n", res);
 		kfree(res);
 		return ERR_PTR(-EEXIST);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 5be04e28709f..b70de621ee83 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -125,6 +125,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_private_entry(entry)) {
+				pte_t newpte;
+
+				/*
+				 * We do not preserve soft-dirtiness. See
+				 * copy_one_pte() for explanation.
+				 */
+				make_device_private_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
2.13.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 08/19] mm/ZONE_DEVICE: special case put_page() for device private pages v4
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 09/19] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Kirill A . Shutemov, Ross Zwisler

A ZONE_DEVICE page that reaches a refcount of 1 is free, i.e. it no
longer has any user. For device private pages it is important to catch
this transition, and thus we need to special case put_page() for them.
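
As an illustration only, here is a rough sketch of the kind of
page_free callback a driver could hook into struct dev_pagemap to be
told about that 2 -> 1 transition; the example_devmem structure, its
lock and its free bitmap are made up, only the callback signature
(page, data) comes from this series.

static void example_devmem_page_free(struct page *page, void *data)
{
	/* 'data' is whatever the driver stored in dev_pagemap->data. */
	struct example_devmem *edm = data;
	unsigned long pfn = page_to_pfn(page);

	/*
	 * Called from put_zone_device_private_page() once the refcount has
	 * dropped to 1, so nobody else holds a reference anymore and the
	 * backing device memory can go back to the driver's allocator.
	 */
	spin_lock(&edm->lock);
	__set_bit(pfn - edm->pfn_first, edm->free_bitmap);
	spin_unlock(&edm->lock);
}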

Changed since v3:
  - clear page mapping field
Changed since v2:
  - clear page active and waiters
Changed since v1:
  - use static key to disable special code path in put_page() by
    default
  - uninline put_zone_device_private_page()
  - fix build issues with some kernel config related to header
    inter-dependency

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h | 13 +++++++++++++
 include/linux/mm.h       | 31 ++++++++++++++++++++++---------
 kernel/memremap.c        | 25 ++++++++++++++++++++++++-
 mm/hmm.c                 |  8 ++++++++
 4 files changed, 67 insertions(+), 10 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8e164ec9eed0..8aa6b82679e2 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -126,6 +126,14 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool is_zone_device_page(const struct page *page);
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -144,6 +152,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool is_device_private_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a59e149b958a..515d4ae611b2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
 #include <linux/page_ext.h>
 #include <linux/err.h>
 #include <linux/page_ref.h>
+#include <linux/memremap.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -788,25 +789,25 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
-
-static inline bool is_device_private_page(const struct page *page)
-{
-	/* See MEMORY_DEVICE_PRIVATE in include/linux/memory_hotplug.h */
-	return ((page_zonenum(page) == ZONE_DEVICE) &&
-		(page->pgmap->type == MEMORY_DEVICE_PRIVATE));
-}
 #else
 static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
 }
+#endif
 
-static inline bool is_device_private_page(const struct page *page)
+#ifdef CONFIG_DEVICE_PRIVATE
+void put_zone_device_private_page(struct page *page);
+#else
+static inline void put_zone_device_private_page(struct page *page)
 {
-	return false;
 }
 #endif
 
+static inline bool is_device_private_page(const struct page *page);
+
+DECLARE_STATIC_KEY_FALSE(device_private_key);
+
 static inline void get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -822,6 +823,18 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
+	/*
+	 * For private device pages we need to catch the refcount transition
+	 * from 2 to 1: when the refcount reaches one the private device page
+	 * is free and we need to inform the device driver through a callback.
+	 * See include/linux/memremap.h and HMM for details.
+	 */
+	if (static_branch_unlikely(&device_private_key) &&
+	    unlikely(is_device_private_page(page))) {
+		put_zone_device_private_page(page);
+		return;
+	}
+
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/kernel/memremap.c b/kernel/memremap.c
index d3241f51c1f0..398630c1fba3 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -11,7 +11,6 @@
  * General Public License for more details.
  */
 #include <linux/radix-tree.h>
-#include <linux/memremap.h>
 #include <linux/device.h>
 #include <linux/types.h>
 #include <linux/pfn_t.h>
@@ -476,3 +475,27 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 	return pgmap ? pgmap->altmap : NULL;
 }
 #endif /* CONFIG_ZONE_DEVICE */
+
+
+#ifdef CONFIG_DEVICE_PRIVATE
+void put_zone_device_private_page(struct page *page)
+{
+	int count = page_ref_dec_return(page);
+
+	/*
+	 * If the refcount is 1 then the page is free and the refcount is
+	 * stable as nobody holds a reference on the page.
+	 */
+	if (count == 1) {
+		/* Clear Active bit in case of parallel mark_page_accessed */
+		__ClearPageActive(page);
+		__ClearPageWaiters(page);
+
+		page->mapping = NULL;
+
+		page->pgmap->page_free(page, page->pgmap->data);
+	} else if (!count)
+		__put_page(page);
+}
+EXPORT_SYMBOL(put_zone_device_private_page);
+#endif /* CONFIG_DEVICE_PRIVATE */
diff --git a/mm/hmm.c b/mm/hmm.c
index 91592dae364e..e615f337110a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -25,9 +25,17 @@
 #include <linux/sched.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
 
 
+/*
+ * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
+ */
+DEFINE_STATIC_KEY_FALSE(device_private_key);
+EXPORT_SYMBOL(device_private_key);
+
+
 #ifdef CONFIG_HMM
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-- 
2.13.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 09/19] mm/memcontrol: allow to uncharge page without using page->lru field
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 08/19] mm/ZONE_DEVICE: special case put_page() for device private pages v4 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4 Jérôme Glisse
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus you can not use the page->lru field of those pages. This patch
re-arranges the uncharge code so that a single page can be uncharged
without modifying the lru field of its struct page.

There is no change to the memcontrol logic; it is the same as it was
before this patch.
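
The resulting calling pattern, condensed into a sketch (the real code
is uncharge_list() and mem_cgroup_uncharge() in the diff below; the
wrapper name is made up and list_for_each_entry() is used here for
brevity where the real code walks the lru pointers by hand):

static void uncharge_pages_sketch(struct list_head *page_list)
{
	struct uncharge_gather ug;
	struct page *page;

	uncharge_gather_clear(&ug);		/* zero all the counters */
	list_for_each_entry(page, page_list, lru)
		uncharge_page(page, &ug);	/* flushes when memcg changes */
	if (ug.memcg)
		uncharge_batch(&ug);		/* one final batched uncharge */
}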

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 mm/memcontrol.c | 168 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 92 insertions(+), 76 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df6f63ee95d6..604fb3ca8028 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5533,48 +5533,102 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
 	cancel_charge(memcg, nr_pages);
 }
 
-static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
-			   unsigned long nr_anon, unsigned long nr_file,
-			   unsigned long nr_kmem, unsigned long nr_huge,
-			   unsigned long nr_shmem, struct page *dummy_page)
+struct uncharge_gather {
+	struct mem_cgroup *memcg;
+	unsigned long pgpgout;
+	unsigned long nr_anon;
+	unsigned long nr_file;
+	unsigned long nr_kmem;
+	unsigned long nr_huge;
+	unsigned long nr_shmem;
+	struct page *dummy_page;
+};
+
+static inline void uncharge_gather_clear(struct uncharge_gather *ug)
 {
-	unsigned long nr_pages = nr_anon + nr_file + nr_kmem;
+	memset(ug, 0, sizeof(*ug));
+}
+
+static void uncharge_batch(const struct uncharge_gather *ug)
+{
+	unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
 	unsigned long flags;
 
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_uncharge(&memcg->memory, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg)) {
+		page_counter_uncharge(&ug->memcg->memory, nr_pages);
 		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, nr_pages);
-		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && nr_kmem)
-			page_counter_uncharge(&memcg->kmem, nr_kmem);
-		memcg_oom_recover(memcg);
+			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+		memcg_oom_recover(ug->memcg);
 	}
 
 	local_irq_save(flags);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS], nr_anon);
-	__this_cpu_sub(memcg->stat->count[MEMCG_CACHE], nr_file);
-	__this_cpu_sub(memcg->stat->count[MEMCG_RSS_HUGE], nr_huge);
-	__this_cpu_sub(memcg->stat->count[NR_SHMEM], nr_shmem);
-	__this_cpu_add(memcg->stat->events[PGPGOUT], pgpgout);
-	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
-	memcg_check_events(memcg, dummy_page);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS], ug->nr_anon);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_CACHE], ug->nr_file);
+	__this_cpu_sub(ug->memcg->stat->count[MEMCG_RSS_HUGE], ug->nr_huge);
+	__this_cpu_sub(ug->memcg->stat->count[NR_SHMEM], ug->nr_shmem);
+	__this_cpu_add(ug->memcg->stat->events[PGPGOUT], ug->pgpgout);
+	__this_cpu_add(ug->memcg->stat->nr_page_events, nr_pages);
+	memcg_check_events(ug->memcg, ug->dummy_page);
 	local_irq_restore(flags);
 
-	if (!mem_cgroup_is_root(memcg))
-		css_put_many(&memcg->css, nr_pages);
+	if (!mem_cgroup_is_root(ug->memcg))
+		css_put_many(&ug->memcg->css, nr_pages);
+}
+
+static void uncharge_page(struct page *page, struct uncharge_gather *ug)
+{
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
+
+	if (!page->mem_cgroup)
+		return;
+
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * page->mem_cgroup at this point, we have fully
+	 * exclusive access to the page.
+	 */
+
+	if (ug->memcg != page->mem_cgroup) {
+		if (ug->memcg) {
+			uncharge_batch(ug);
+			uncharge_gather_clear(ug);
+		}
+		ug->memcg = page->mem_cgroup;
+	}
+
+	if (!PageKmemcg(page)) {
+		unsigned int nr_pages = 1;
+
+		if (PageTransHuge(page)) {
+			nr_pages <<= compound_order(page);
+			ug->nr_huge += nr_pages;
+		}
+		if (PageAnon(page))
+			ug->nr_anon += nr_pages;
+		else {
+			ug->nr_file += nr_pages;
+			if (PageSwapBacked(page))
+				ug->nr_shmem += nr_pages;
+		}
+		ug->pgpgout++;
+	} else {
+		ug->nr_kmem += 1 << compound_order(page);
+		__ClearPageKmemcg(page);
+	}
+
+	ug->dummy_page = page;
+	page->mem_cgroup = NULL;
 }
 
 static void uncharge_list(struct list_head *page_list)
 {
-	struct mem_cgroup *memcg = NULL;
-	unsigned long nr_shmem = 0;
-	unsigned long nr_anon = 0;
-	unsigned long nr_file = 0;
-	unsigned long nr_huge = 0;
-	unsigned long nr_kmem = 0;
-	unsigned long pgpgout = 0;
+	struct uncharge_gather ug;
 	struct list_head *next;
-	struct page *page;
+
+	uncharge_gather_clear(&ug);
 
 	/*
 	 * Note that the list can be a single page->lru; hence the
@@ -5582,57 +5636,16 @@ static void uncharge_list(struct list_head *page_list)
 	 */
 	next = page_list->next;
 	do {
+		struct page *page;
+
 		page = list_entry(next, struct page, lru);
 		next = page->lru.next;
 
-		VM_BUG_ON_PAGE(PageLRU(page), page);
-		VM_BUG_ON_PAGE(!PageHWPoison(page) && page_count(page), page);
-
-		if (!page->mem_cgroup)
-			continue;
-
-		/*
-		 * Nobody should be changing or seriously looking at
-		 * page->mem_cgroup at this point, we have fully
-		 * exclusive access to the page.
-		 */
-
-		if (memcg != page->mem_cgroup) {
-			if (memcg) {
-				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-					       nr_kmem, nr_huge, nr_shmem, page);
-				pgpgout = nr_anon = nr_file = nr_kmem = 0;
-				nr_huge = nr_shmem = 0;
-			}
-			memcg = page->mem_cgroup;
-		}
-
-		if (!PageKmemcg(page)) {
-			unsigned int nr_pages = 1;
-
-			if (PageTransHuge(page)) {
-				nr_pages <<= compound_order(page);
-				nr_huge += nr_pages;
-			}
-			if (PageAnon(page))
-				nr_anon += nr_pages;
-			else {
-				nr_file += nr_pages;
-				if (PageSwapBacked(page))
-					nr_shmem += nr_pages;
-			}
-			pgpgout++;
-		} else {
-			nr_kmem += 1 << compound_order(page);
-			__ClearPageKmemcg(page);
-		}
-
-		page->mem_cgroup = NULL;
+		uncharge_page(page, &ug);
 	} while (next != page_list);
 
-	if (memcg)
-		uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
-			       nr_kmem, nr_huge, nr_shmem, page);
+	if (ug.memcg)
+		uncharge_batch(&ug);
 }
 
 /**
@@ -5644,6 +5657,8 @@ static void uncharge_list(struct list_head *page_list)
  */
 void mem_cgroup_uncharge(struct page *page)
 {
+	struct uncharge_gather ug;
+
 	if (mem_cgroup_disabled())
 		return;
 
@@ -5651,8 +5666,9 @@ void mem_cgroup_uncharge(struct page *page)
 	if (!page->mem_cgroup)
 		return;
 
-	INIT_LIST_HEAD(&page->lru);
-	uncharge_list(&page->lru);
+	uncharge_gather_clear(&ug);
+	uncharge_page(page, &ug);
+	uncharge_batch(&ug);
 }
 
 /**
-- 
2.13.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 09/19] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-09-05 17:13   ` Laurent Dufour
  2017-08-17  0:05 ` [HMM-v25 11/19] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7 Jérôme Glisse
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, cgroups

HMM pages (private or public device pages) are ZONE_DEVICE pages and
thus need special handling when it comes to lru or refcount. This
patch makes sure that memcontrol properly handles them when it faces
them. Those pages are used like regular pages in a process address
space, either as anonymous pages or as file backed pages, so from the
memcg point of view we want to handle them like regular pages, for
now at least.
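
The one subtlety worth spelling out is the refcount convention: a free
device private page idles at refcount 1, so taking a reference must
refuse that value rather than the usual 0. Below is a sketch of the
idiom used by mc_handle_swap_pte() in the diff (the helper name is
made up):

static struct page *try_get_device_private_page(swp_entry_t ent)
{
	struct page *page;

	if (!is_device_private_entry(ent))
		return NULL;

	page = device_private_entry_to_page(ent);
	/* refcount 1 means "free" for ZONE_DEVICE private pages */
	if (!page_ref_add_unless(page, 1, 1))
		return NULL;
	return page;
}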

Changed since v3:
  - remove public support and move those chunk to separate patch
Changed since v2:
  - s/host/public
Changed since v1:
  - s/public/host
  - add comments explaining how device memory behave and why

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: cgroups@vger.kernel.org
---
 kernel/memremap.c |  1 +
 mm/memcontrol.c   | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 398630c1fba3..f42d7483e886 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -492,6 +492,7 @@ void put_zone_device_private_page(struct page *page)
 		__ClearPageWaiters(page);
 
 		page->mapping = NULL;
+		mem_cgroup_uncharge(page);
 
 		page->pgmap->page_free(page, page->pgmap->data);
 	} else if (!count)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 604fb3ca8028..977d1cf3493a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4407,12 +4407,13 @@ enum mc_target_type {
 	MC_TARGET_NONE = 0,
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
+	MC_TARGET_DEVICE,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
 {
-	struct page *page = vm_normal_page(vma, addr, ptent);
+	struct page *page = _vm_normal_page(vma, addr, ptent, true);
 
 	if (!page || !page_mapped(page))
 		return NULL;
@@ -4429,7 +4430,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 	return page;
 }
 
-#ifdef CONFIG_SWAP
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
 static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 			pte_t ptent, swp_entry_t *entry)
 {
@@ -4438,6 +4439,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 
 	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
 		return NULL;
+
+	/*
+	 * Handle MEMORY_DEVICE_PRIVATE, which are ZONE_DEVICE pages belonging
+	 * to a device; because they are not accessible by the CPU they are
+	 * stored as special swap entries in the CPU page table.
+	 */
+	if (is_device_private_entry(ent)) {
+		page = device_private_entry_to_page(ent);
+		/*
+		 * MEMORY_DEVICE_PRIVATE means a ZONE_DEVICE page, which has
+		 * a refcount of 1 when free (unlike a normal page)
+		 */
+		if (!page_ref_add_unless(page, 1, 1))
+			return NULL;
+		return page;
+	}
+
 	/*
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.
@@ -4598,6 +4616,12 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PRIVATE
+ *     (so a ZONE_DEVICE page and thus not on the lru). For now such a page is
+ *     charged like a regular page would be, as for all intents and purposes it
+ *     is just special memory taking the place of a regular page.
+ *
+ *     See Documentation/vm/hmm.txt and include/linux/hmm.h
  *
  * Called with pte lock held.
  */
@@ -4626,6 +4650,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
+			if (is_device_private_page(page))
+				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
 		}
@@ -4693,6 +4719,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
+		/*
+		 * Note there can not be MC_TARGET_DEVICE for now as we do not
+		 * support transparent huge pages with MEMORY_DEVICE_PUBLIC or
+		 * MEMORY_DEVICE_PRIVATE, but this might change.
+		 */
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -4908,6 +4939,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				putback_lru_page(page);
 			}
 			put_page(page);
+		} else if (target_type == MC_TARGET_DEVICE) {
+			page = target.page;
+			if (!mem_cgroup_move_account(page, true,
+						     mc.from, mc.to)) {
+				mc.precharge -= HPAGE_PMD_NR;
+				mc.moved_charge += HPAGE_PMD_NR;
+			}
+			put_page(page);
 		}
 		spin_unlock(ptl);
 		return 0;
@@ -4919,12 +4958,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
+		bool device = false;
 		swp_entry_t ent;
 
 		if (!mc.precharge)
 			break;
 
 		switch (get_mctgt_type(vma, addr, ptent, &target)) {
+		case MC_TARGET_DEVICE:
+			device = true;
+			/* fall through */
 		case MC_TARGET_PAGE:
 			page = target.page;
 			/*
@@ -4935,7 +4978,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (isolate_lru_page(page))
+			if (!device && isolate_lru_page(page))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
@@ -4943,7 +4986,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
 			}
-			putback_lru_page(page);
+			if (!device)
+				putback_lru_page(page);
 put:			/* get_mctgt_type() gets the page */
 			put_page(page);
 			break;
-- 
2.13.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 11/19] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 12/19] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a simple struct and associated helpers for device
drivers to use when hotplugging un-addressable device memory as
ZONE_DEVICE. It will find an unused physical address range and trigger
memory hotplug for it, which allocates and initializes struct pages
for the device memory.

A device driver should use this helper during device initialization to
hotplug the device memory. It should only need to remove the memory
once the device is going offline (shutdown or hotremove). There should
not be any userspace API to hotplug memory, except maybe for a host
device driver to allow adding more memory to a guest device driver.

The device's memory is managed by the device driver and HMM only
provides helpers to that effect.
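
Roughly, a driver is expected to wire this up in its probe path along
the following lines. This is a sketch only, built from the
hmm_devmem_ops and hmm_devmem_add() declarations added below; the my_*
names and the 1GB size are made up.

/* Hypothetical driver callbacks; signatures come from hmm_devmem_ops. */
static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* return the device page to the driver's own allocator */
}

static int my_devmem_fault(struct hmm_devmem *devmem,
			   struct vm_area_struct *vma, unsigned long addr,
			   const struct page *page, unsigned int flags,
			   pmd_t *pmdp)
{
	/* a real driver migrates the data back to CPU accessible memory */
	return VM_FAULT_SIGBUS;
}

static const struct hmm_devmem_ops my_devmem_ops = {
	.free	= my_devmem_free,
	.fault	= my_devmem_fault,
};

static int my_driver_init_devmem(struct device *dev, struct my_dev *mdev)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add(&my_devmem_ops, dev, SZ_1G);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* device pages are devmem->pfn_first .. devmem->pfn_last - 1 */
	mdev->devmem = devmem;
	return 0;
}

The matching hmm_devmem_remove() call then belongs in the driver's
teardown path, before the struct device goes away.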

Changed since v6:
  - fix start address calculation (Balbir Singh)
  - more comments and updated commit message.
Changed since v5:
  - kernel configuration simplification
  - remove now unused device driver helper
Changed since v4:
  - enable device_private_key static key when adding device memory
Changed since v3:
  - s/device unaddressable/device private/
Changed since v2:
  - s/SECTION_SIZE/PA_SECTION_SIZE
Changed since v1:
  - change to adapt to new add_pages() helper
  - make this x86-64 only for now

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 include/linux/hmm.h | 155 +++++++++++++++++++++
 mm/hmm.c            | 379 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 533 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 79f5d4fa2d4f..0a8e5e3688e8 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,6 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/migrate.h>
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -322,6 +327,156 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callbacks for ZONE_DEVICE memory events
+ *
+ * @free: called when the refcount on a page reaches 1 and it is no longer used
+ * @fault: called when there is a page fault to unaddressable memory
+ *
+ * Both callbacks happen from the page_free() and page_fault() callbacks of
+ * struct dev_pagemap respectively. See include/linux/memremap.h for more
+ * details on those.
+ *
+ * The hmm_devmem_ops callbacks are just here to provide a coherent and
+ * unique API to device drivers; a device driver should not register its
+ * own page_free() or page_fault() but rely on the hmm_devmem_ops call-
+ * backs.
+ */
+struct hmm_devmem_ops {
+	/*
+	 * free() - free a device page
+	 * @devmem: device memory structure (see struct hmm_devmem)
+	 * @page: pointer to struct page being freed
+	 *
+	 * The callback occurs whenever a device page's refcount reaches 1,
+	 * which means that no one is holding any reference on the page
+	 * anymore (ZONE_DEVICE pages have an elevated refcount of 1 by
+	 * default so that they are not released to the general allocator).
+	 *
+	 * Note that the callback has exclusive ownership of the page (as no
+	 * one is holding any reference).
+	 */
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	/*
+	 * fault() - CPU page fault or get user page (GUP)
+	 * @devmem: device memory structure (see struct hmm_devmem)
+	 * @vma: virtual memory area containing the virtual address
+	 * @addr: virtual address that faulted or for which there is a GUP
+	 * @page: pointer to struct page backing virtual address (unreliable)
+	 * @flags: FAULT_FLAG_* (see include/linux/mm.h)
+	 * @pmdp: page middle directory
+	 * Returns: VM_FAULT_MINOR/MAJOR on success or one of VM_FAULT_ERROR
+	 *   on error
+	 *
+	 * The callback occurs whenever there is a CPU page fault or GUP on a
+	 * virtual address. This means that the device driver must migrate the
+	 * page back to regular memory (CPU accessible).
+	 *
+	 * The device driver is free to migrate more than one page from the
+	 * fault() callback as an optimization. However, if the device decides
+	 * to migrate more than one page it must always prioritize the
+	 * faulting address over the others.
+	 *
+	 * The struct page pointer is only given as a hint to allow quick
+	 * lookup of internal device driver data. A concurrent migration
+	 * might have already freed that page and the virtual address might
+	 * no longer be backed by it, so it should not be modified by the
+	 * callback.
+	 *
+	 * Note that mmap semaphore is held in read mode at least when this
+	 * callback occurs, hence the vma is valid upon callback entry.
+	 */
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     const struct page *page,
+		     unsigned int flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ *
+ * This is a helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memory and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct device			*device;
+	struct dev_pagemap		pagemap;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+};
+
+/*
+ * To add (hotplug) device memory, HMM assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to be
+ * use by unaddressable device memory). It will reserve a physical range big
+ * enough and allocate struct page for it.
+ *
+ * The device driver can wrap the hmm_devmem struct inside a private device
+ * driver struct. The device driver must call hmm_devmem_remove() before the
+ * device goes away and before freeing the hmm_devmem struct memory.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size);
+void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per-page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page can not be on an lru we have an unsigned long that the
+ * driver can use to store a per-page field. This is just a simple helper.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+
+
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index e615f337110a..afc9ab22228c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,16 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/jump_label.h>
 #include <linux/mmu_notifier.h>
+#include <linux/memory_hotplug.h>
+
+#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
 
 /*
@@ -441,7 +447,15 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 * This is a special swap entry, ignore migration, use
 			 * device and report anything else as error.
 			 */
-			if (is_migration_entry(entry)) {
+			if (is_device_private_entry(entry)) {
+				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				if (is_write_device_private_entry(entry)) {
+					pfns[i] |= HMM_PFN_WRITE;
+				} else if (write_fault)
+					goto fault;
+				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
 					hmm_vma_walk->last = addr;
@@ -735,3 +749,366 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_ref_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+}
+
+static void hmm_devmem_ref_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
+}
+
+static void hmm_devmem_ref_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_kill(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    const struct page *page,
+			    unsigned int flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+static DEFINE_MUTEX(hmm_devmem_lock);
+static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
+
+static void hmm_devmem_radix_release(struct resource *resource)
+{
+	resource_size_t key, align_start, align_size, align_end;
+
+	align_start = resource->start & ~(PA_SECTION_SIZE - 1);
+	align_size = ALIGN(resource_size(resource), PA_SECTION_SIZE);
+	align_end = align_start + align_size - 1;
+
+	mutex_lock(&hmm_devmem_lock);
+	for (key = resource->start;
+	     key <= resource->end;
+	     key += PA_SECTION_SIZE)
+		radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
+	mutex_unlock(&hmm_devmem_lock);
+}
+
+static void hmm_devmem_release(struct device *dev, void *data)
+{
+	struct hmm_devmem *devmem = data;
+	struct resource *resource = devmem->resource;
+	unsigned long start_pfn, npages;
+	struct zone *zone;
+	struct page *page;
+
+	if (percpu_ref_tryget_live(&devmem->ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(&devmem->ref);
+	}
+
+	/* pages are dead and unused, undo the arch mapping */
+	start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT;
+	npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT;
+
+	page = pfn_to_page(start_pfn);
+	zone = page_zone(page);
+
+	mem_hotplug_begin();
+	__remove_pages(zone, start_pfn, npages);
+	mem_hotplug_done();
+
+	hmm_devmem_radix_release(resource);
+}
+
+static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
+}
+
+static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
+{
+	resource_size_t key, align_start, align_size, align_end;
+	struct device *device = devmem->device;
+	int ret, nid, is_ram;
+	unsigned long pfn;
+
+	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
+	align_size = ALIGN(devmem->resource->start +
+			   resource_size(devmem->resource),
+			   PA_SECTION_SIZE) - align_start;
+
+	is_ram = region_intersects(align_start, align_size,
+				   IORESOURCE_SYSTEM_RAM,
+				   IORES_DESC_NONE);
+	if (is_ram == REGION_MIXED) {
+		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
+				__func__, devmem->resource);
+		return -ENXIO;
+	}
+	if (is_ram == REGION_INTERSECTS)
+		return -ENXIO;
+
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.res = devmem->resource;
+	devmem->pagemap.page_fault = hmm_devmem_fault;
+	devmem->pagemap.page_free = hmm_devmem_free;
+	devmem->pagemap.dev = devmem->device;
+	devmem->pagemap.ref = &devmem->ref;
+	devmem->pagemap.data = devmem;
+
+	mutex_lock(&hmm_devmem_lock);
+	align_end = align_start + align_size - 1;
+	for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) {
+		struct hmm_devmem *dup;
+
+		rcu_read_lock();
+		dup = hmm_devmem_find(key);
+		rcu_read_unlock();
+		if (dup) {
+			dev_err(device, "%s: collides with mapping for %s\n",
+				__func__, dev_name(dup->device));
+			mutex_unlock(&hmm_devmem_lock);
+			ret = -EBUSY;
+			goto error;
+		}
+		ret = radix_tree_insert(&hmm_devmem_radix,
+					key >> PA_SECTION_SHIFT,
+					devmem);
+		if (ret) {
+			dev_err(device, "%s: failed: %d\n", __func__, ret);
+			mutex_unlock(&hmm_devmem_lock);
+			goto error_radix;
+		}
+	}
+	mutex_unlock(&hmm_devmem_lock);
+
+	nid = dev_to_node(device);
+	if (nid < 0)
+		nid = numa_mem_id();
+
+	mem_hotplug_begin();
+	/*
+	 * For device private memory we call add_pages() as we only need to
+	 * allocate and initialize struct pages for the device memory.
+	 * Moreover, the device memory is not accessible by the CPU, so we do
+	 * not want to create a linear mapping for it the way arch_add_memory()
+	 * would.
+	 */
+	ret = add_pages(nid, align_start >> PAGE_SHIFT,
+			align_size >> PAGE_SHIFT, false);
+	if (ret) {
+		mem_hotplug_done();
+		goto error_add_memory;
+	}
+	move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
+				align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT);
+	mem_hotplug_done();
+
+	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		page->pgmap = &devmem->pagemap;
+	}
+	return 0;
+
+error_add_memory:
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+error_radix:
+	hmm_devmem_radix_release(devmem->resource);
+error:
+	return ret;
+}
+
+static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
+{
+	struct hmm_devmem *devmem = data;
+
+	return devmem->resource == match_data;
+}
+
+static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
+{
+	devres_release(devmem->device, &hmm_devmem_release,
+		       &hmm_devmem_match, devmem->resource);
+}
+
+/*
+ * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory
+ *
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource too
+ * @size: size in bytes of the device memory to add
+ * Returns: pointer to new hmm_devmem struct ERR_PTR otherwise
+ *
+ * This function first finds an empty range of physical address big enough to
+ * contain the new resource, and then hotplugs it as ZONE_DEVICE memory, which
+ * in turn allocates struct pages. It does not do anything beyond that; all
+ * events affecting the memory will go through the various callbacks provided
+ * by hmm_devmem_ops struct.
+ *
+ * A device driver should call this function during device initialization and
+ * is then responsible for memory management. HMM only provides helpers.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size)
+{
+	struct hmm_devmem *devmem;
+	resource_size_t addr;
+	int ret;
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+	size = ALIGN(size, PA_SECTION_SIZE);
+	addr = min((unsigned long)iomem_resource.end,
+		   (1UL << MAX_PHYSMEM_BITS) - 1);
+	addr = addr - size + 1UL;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size && addr >= iomem_resource.start; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error_no_resource;
+		}
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error_no_resource;
+	}
+
+	devmem->resource->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_pages;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_pages:
+	devm_release_mem_region(device, devmem->resource->start,
+				resource_size(devmem->resource));
+error_no_resource:
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory
+ *
+ * This will hot-unplug memory that was hotplugged by hmm_devmem_add on behalf
+ * of the device driver. It will free struct page and remove the resource that
+ * reserved the physical address range for this device memory.
+ */
+void hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	resource_size_t start, size;
+	struct device *device;
+
+	if (!devmem)
+		return;
+
+	device = devmem->device;
+	start = devmem->resource->start;
+	size = resource_size(devmem->resource);
+
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+	hmm_devmem_pages_remove(devmem);
+
+	devm_release_mem_region(device, start, size);
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
-- 
2.13.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 12/19] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 11/19] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a dummy HMM device class so a device driver can use it
to create an hmm_device for the sole purpose of registering device
memory. It is useful to device drivers that want to manage multiple
physical device memories under the same struct device umbrella.
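
For example, a driver could combine this with the previous patch
roughly as follows (a sketch; only hmm_device_new(), hmm_device_put()
and hmm_devmem_add() are from this series, the wrapper is made up):

static int my_register_devmem(void *drvdata, const struct hmm_devmem_ops *ops,
			      unsigned long size, struct hmm_device **out)
{
	struct hmm_device *hmm_device;
	struct hmm_devmem *devmem;

	hmm_device = hmm_device_new(drvdata);
	if (IS_ERR(hmm_device))
		return PTR_ERR(hmm_device);

	/* anchor the ZONE_DEVICE memory on the fake device */
	devmem = hmm_devmem_add(ops, &hmm_device->device, size);
	if (IS_ERR(devmem)) {
		hmm_device_put(hmm_device);
		return PTR_ERR(devmem);
	}

	*out = hmm_device;
	return 0;
}

Once the last reference is dropped with hmm_device_put(), the release
callback frees the minor number and the structure.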

Changed since v2:
  - use device_initcall() and drop everything that is module specific
Changed since v1:
  - Improve commit message
  - Add drvdata parameter to set on struct device

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 ++++++++++++++-
 mm/hmm.c            | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0a8e5e3688e8..aeba696a4385 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -72,11 +72,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -474,6 +474,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 
 	return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned int		minor;
+};
+
+/*
+ * A device driver that wants to handle multiple devices' memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper and
+ * it is not strictly needed in order to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void *drvdata);
+void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index afc9ab22228c..8c5ce5f76b2e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,6 +19,7 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
@@ -1111,4 +1112,84 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 	devm_release_mem_region(device, start, size);
 }
 EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * A device driver that wants to handle multiple devices' memory through a
+ * single fake device can use hmm_device to do so. This is purely a helper
+ * and it is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void *drvdata)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor = find_first_zero_bit(hmm_device_mask, HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return ERR_PTR(-EBUSY);
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	dev_set_drvdata(&hmm_device->device, drvdata);
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+device_initcall(hmm_init);
 #endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
-- 
2.13.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 12/19] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17 21:12   ` Andrew Morton
  2017-08-17  0:05 ` [HMM-v25 14/19] mm/migrate: new memory migration helper for use with device memory v5 Jérôme Glisse
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse

Introduce a new migration mode that allows offloading the copy to a
device DMA engine. This changes the workflow of migration, and not all
address_space migratepage callbacks can support it, so the new mode
needs to be tested for in those callbacks.

This is intended to be used by migrate_vma(), which itself is used for
things like HMM (see include/linux/hmm.h).
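
The diff below uses two patterns, summarized in this sketch: a
migratepage callback that can defer the data copy only copies the page
state with migrate_page_states(), while one that must copy under an
internal lock (aio, zsmalloc, balloon) simply returns -EINVAL for the
new mode. The callback name is made up and the mapping move is elided.

static int example_migratepage(struct address_space *mapping,
			       struct page *newpage, struct page *page,
			       enum migrate_mode mode)
{
	/* ... move the mapping over as before (not shown) ... */

	if (mode != MIGRATE_SYNC_NO_COPY)
		migrate_page_copy(newpage, page);	/* state and data */
	else
		/* state only; the data is copied later, e.g. by a DMA
		 * engine driven by migrate_vma() */
		migrate_page_states(newpage, page);

	return MIGRATEPAGE_SUCCESS;
}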

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 fs/aio.c                     |  8 +++++++
 fs/f2fs/data.c               |  5 ++++-
 fs/hugetlbfs/inode.c         |  5 ++++-
 fs/ubifs/file.c              |  5 ++++-
 include/linux/migrate.h      |  5 +++++
 include/linux/migrate_mode.h |  5 +++++
 mm/balloon_compaction.c      |  8 +++++++
 mm/migrate.c                 | 52 ++++++++++++++++++++++++++++++++++----------
 mm/zsmalloc.c                |  8 +++++++
 9 files changed, 86 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8f0127526299..b5d69f28d8b1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -373,6 +373,14 @@ static int aio_migratepage(struct address_space *mapping, struct page *new,
 	pgoff_t idx;
 	int rc;
 
+	/*
+	 * We cannot support the _NO_COPY case here, because copy needs to
+	 * happen under the ctx->completion_lock. That does not work with the
+	 * migration workflow of MIGRATE_SYNC_NO_COPY.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index aefc2a5745d3..cf7930daa10e 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2257,7 +2257,10 @@ int f2fs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	set_page_private(newpage, page_private(page));
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 33961b35007b..59073e9f01a4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -830,7 +830,10 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	return MIGRATEPAGE_SUCCESS;
 }
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index f90a466ea5db..a02aa59d1e24 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1490,7 +1490,10 @@ static int ubifs_migrate_page(struct address_space *mapping,
 		SetPagePrivate(newpage);
 	}
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
 }
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce15989521a1..7db4c812a2a6 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -72,6 +72,7 @@ extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
+extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
@@ -92,6 +93,10 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
+static inline void migrate_page_states(struct page *newpage, struct page *page)
+{
+}
+
 static inline void migrate_page_copy(struct page *newpage,
 				     struct page *page) {}
 
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89a3919..bdf66af9b937 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,16 @@
  *	on most operations but not ->writepage as the potential stall time
  *	is too significant
  * MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_SYNC_NO_COPY will block when migrating pages but will not copy pages
+ *	with the CPU. Instead, page copy happens outside the migratepage()
+ *	callback and is likely using a DMA engine. See migrate_vma() and HMM
+ *	(mm/hmm.c) for users of this mode.
  */
 enum migrate_mode {
 	MIGRATE_ASYNC,
 	MIGRATE_SYNC_LIGHT,
 	MIGRATE_SYNC,
+	MIGRATE_SYNC_NO_COPY,
 };
 
 #endif		/* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 9075aa54e955..e78ae1255cb3 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -139,6 +139,14 @@ int balloon_page_migrate(struct address_space *mapping,
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
+	/*
+	 * We can not easily support the no-copy case here so we reject it, as
+	 * it is unlikely to be used with balloon pages. See include/linux/hmm.h
+	 * for users of the MIGRATE_SYNC_NO_COPY mode.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index fd33aa444f47..f8e1baea6c82 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -633,15 +633,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 /*
  * Copy the page to its new location
  */
-void migrate_page_copy(struct page *newpage, struct page *page)
+void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -695,6 +690,17 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+EXPORT_SYMBOL(migrate_page_states);
+
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -720,7 +726,10 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -770,12 +779,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (mode != MIGRATE_SYNC_NO_COPY)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -834,8 +846,13 @@ static int fallback_migrate_page(struct address_space *mapping,
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
-		if (mode != MIGRATE_SYNC)
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
 			return -EBUSY;
+		}
 		return writeout(mapping, page);
 	}
 
@@ -972,7 +989,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the retry loop is too short and in the sync-light case,
 		 * the overhead of stalling is too much
 		 */
-		if (mode != MIGRATE_SYNC) {
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
 			rc = -EBUSY;
 			goto out_unlock;
 		}
@@ -1242,8 +1263,15 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		return -ENOMEM;
 
 	if (!trylock_page(hpage)) {
-		if (!force || mode != MIGRATE_SYNC)
+		if (!force)
 			goto out;
+		switch (mode) {
+		case MIGRATE_SYNC:
+		case MIGRATE_SYNC_NO_COPY:
+			break;
+		default:
+			goto out;
+		}
 		lock_page(hpage);
 	}
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 62457eb82330..5ad75ec4151c 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1969,6 +1969,14 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	unsigned int obj_idx;
 	int ret = -EAGAIN;
 
+	/*
+	 * We cannot support the _NO_COPY case here, because copy needs to
+	 * happen under the zs lock, which does not work with
+	 * MIGRATE_SYNC_NO_COPY workflow.
+	 */
+	if (mode == MIGRATE_SYNC_NO_COPY)
+		return -EINVAL;
+
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-- 
2.13.4
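
For a driver or filesystem that implements its own migratepage() callback,
supporting the new mode follows the same pattern as the migrate_page() hunk
above: migrate the page state with the CPU and leave the data copy to someone
else. This is a minimal sketch only; the mydev_ names and the deferred DMA
copy are hypothetical and not part of this series.

#include <linux/migrate.h>

/* hypothetical driver helper, assumed to exist elsewhere in the driver */
void mydev_queue_dma_copy(struct page *dst, struct page *src);

/*
 * Sketch of a migratepage() callback that honours MIGRATE_SYNC_NO_COPY:
 * page state is migrated with the CPU, the data copy itself is deferred
 * (for example to a DMA engine).
 */
static int mydev_migratepage(struct address_space *mapping,
			     struct page *newpage, struct page *page,
			     enum migrate_mode mode)
{
	int rc;

	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
	if (rc != MIGRATEPAGE_SUCCESS)
		return rc;

	if (mode == MIGRATE_SYNC_NO_COPY) {
		/* copy flags, mappings, cgroup state, but not the data */
		migrate_page_states(newpage, page);
		mydev_queue_dma_copy(newpage, page);
	} else {
		migrate_page_copy(newpage, page);
	}
	return MIGRATEPAGE_SUCCESS;
}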

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 14/19] mm/migrate: new memory migration helper for use with device memory v5
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 15/19] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates the
memory backing a range of virtual addresses of a process to different
memory (which can be allocated through a special allocator). It
differs from NUMA migration by working on a range of virtual
addresses, and thus by doing the migration in chunks that can be large
enough to use a DMA engine or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memory has different characteristics (latency, bandwidth, ...). As an
example, an IBM platform with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high performance memory that is not
managed as a cache but presented as regular memory (while being faster
and lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that
have a large pool of such memory, like GPUs; NVIDIA plans to use HMM
for that. A minimal sketch of the driver-side usage is given after the
patch.

Changed since v4:
  - split THP instead of skipping them
Changes since v3:
  - Rebase
Changes since v2:
  - dropped HMM prefix and HMM specific code
Changes since v1:
  - typo fixes
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/migrate.h | 104 ++++++++++
 mm/migrate.c            | 492 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 596 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 7db4c812a2a6..8f73cebfc3f5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -156,4 +156,108 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#ifdef CONFIG_MIGRATION
+
+#define MIGRATE_PFN_VALID	(1UL << 0)
+#define MIGRATE_PFN_MIGRATE	(1UL << 1)
+#define MIGRATE_PFN_LOCKED	(1UL << 2)
+#define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_ERROR	(1UL << 4)
+#define MIGRATE_PFN_SHIFT	5
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+	if (!(mpfn & MIGRATE_PFN_VALID))
+		return NULL;
+	return pfn_to_page(mpfn >> MIGRATE_PFN_SHIFT);
+}
+
+static inline unsigned long migrate_pfn(unsigned long pfn)
+{
+	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memory and copy source memory to it
+ * @finalize_and_map: allow caller to map the successfully migrated pages
+ *
+ *
+ * The alloc_and_copy() callback happens once all source pages have been locked,
+ * unmapped and checked (checked whether pinned or not). All pages that can be
+ * migrated will have an entry in the src array set with the pfn value of the
+ * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set (other
+ * flags might be set but should be ignored by the callback).
+ *
+ * The alloc_and_copy() callback can then allocate destination memory and copy
+ * source memory to it for all those entries (ie with MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_MIGRATE flag set). Once these are allocated and copied, the
+ * callback must update each corresponding entry in the dst array with the pfn
+ * value of the destination page and with the MIGRATE_PFN_VALID and
+ * MIGRATE_PFN_LOCKED flags set (destination pages must have their struct pages
+ * locked, via lock_page()).
+ *
+ * At this point the alloc_and_copy() callback is done and returns.
+ *
+ * Note that the callback does not have to migrate all the pages that are
+ * marked with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration
+ * from device memory to system memory (ie the MIGRATE_PFN_DEVICE flag is also
+ * set in the src array entry). If the device driver cannot migrate a device
+ * page back to system memory, then it must set the corresponding dst array
+ * entry to MIGRATE_PFN_ERROR. This will trigger a SIGBUS if CPU tries to
+ * access any of the virtual addresses originally backed by this page. Because
+ * a SIGBUS is such a severe result for the userspace process, the device
+ * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
+ * unrecoverable state.
+ *
+ * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
+ * OR BAD THINGS WILL HAPPEN !
+ *
+ *
+ * The finalize_and_map() callback happens after struct page migration from
+ * source to destination (destination struct pages are the struct pages for the
+ * memory allocated by the alloc_and_copy() callback).  Migration can fail, and
+ * thus the finalize_and_map() allows the driver to inspect which pages were
+ * successfully migrated, and which were not. Successfully migrated pages will
+ * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
+ *
+ * It is safe to update device page table from within the finalize_and_map()
+ * callback because both destination and source page are still locked, and the
+ * mmap_sem is held in read mode (hence no one can unmap the range being
+ * migrated).
+ *
+ * Once the callback is done cleaning up and updating its page table (if it
+ * chose to do so; this is not an obligation), it returns. At this point,
+ * the HMM core will finish up the final steps, and the migration is complete.
+ *
+ * THE finalize_and_map() CALLBACK MUST NOT CHANGE ANY OF THE SRC OR DST ARRAY
+ * ENTRIES OR BAD THINGS WILL HAPPEN !
+ */
+struct migrate_vma_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const unsigned long *src,
+			       unsigned long *dst,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const unsigned long *src,
+				 const unsigned long *dst,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private);
+
+#endif /* CONFIG_MIGRATION */
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index f8e1baea6c82..60e2f8369cd7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -427,6 +427,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2117,3 +2125,487 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+	struct vm_area_struct	*vma;
+	unsigned long		*dst;
+	unsigned long		*src;
+	unsigned long		cpages;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int migrate_vma_collect_hole(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr;
+
+	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->dst[migrate->npages] = 0;
+		migrate->src[migrate->npages++] = 0;
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+again:
+	if (pmd_none(*pmdp))
+		return migrate_vma_collect_hole(start, end, walk);
+
+	if (pmd_trans_huge(*pmdp)) {
+		struct page *page;
+
+		ptl = pmd_lock(mm, pmdp);
+		if (unlikely(!pmd_trans_huge(*pmdp))) {
+			spin_unlock(ptl);
+			goto again;
+		}
+
+		page = pmd_page(*pmdp);
+		if (is_huge_zero_page(page)) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmdp, addr);
+			if (pmd_trans_unstable(pmdp))
+				return migrate_vma_collect_hole(start, end,
+								walk);
+		} else {
+			int ret;
+
+			get_page(page);
+			spin_unlock(ptl);
+			if (unlikely(!trylock_page(page)))
+				return migrate_vma_collect_hole(start, end,
+								walk);
+			ret = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			if (ret || pmd_none(*pmdp))
+				return migrate_vma_collect_hole(start, end,
+								walk);
+		}
+	}
+
+	if (unlikely(pmd_bad(*pmdp)))
+		return migrate_vma_collect_hole(start, end, walk);
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		unsigned long mpfn, pfn;
+		struct page *page;
+		pte_t pte;
+
+		pte = *ptep;
+		pfn = pte_pfn(pte);
+
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/* FIXME support THP */
+		page = vm_normal_page(migrate->vma, addr, pte);
+		if (!page || !page->mapping || PageTransCompound(page)) {
+			mpfn = pfn = 0;
+			goto next;
+		}
+
+		/*
+		 * By getting a reference on the page we pin it and that blocks
+		 * any kind of migration. Side effect is that it "freezes" the
+		 * pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non device page (device page are not on the lru and thus
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+		migrate->cpages++;
+		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+		migrate->src[migrate->npages++] = mpfn;
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will walk the CPU page table. For each virtual address backed by a
+ * valid page, it updates the src array and takes a reference on the page, in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = migrate_vma_collect_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = migrate_vma_collect_hole;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+
+	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages cannot be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of a
+ * ZONE_DEVICE page.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+	/*
+	 * One extra ref because caller holds an extra reference, either from
+	 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
+	 * a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is bit more complex to
+	 * check them than regular pages, because they can be mapped with a pmd
+	 * or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once each
+ * page is locked it is isolated from the lru (for non-device pages). Finally,
+ * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
+ * migrated by concurrent kernel threads.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	bool allow_drain = true;
+	unsigned long i;
+
+	lru_add_drain();
+
+	for (i = 0; (i < npages) && migrate->cpages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+
+		/*
+		 * Because we are migrating several pages there can be
+		 * a deadlock between 2 concurrent migration where each
+		 * are waiting on each other page lock.
+		 *
+		 * Make migrate_vma() a best effort thing and backoff
+		 * for any page we can not lock right away.
+		 */
+		if (!trylock_page(page)) {
+			migrate->src[i] = 0;
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec */
+			lru_add_drain_all();
+			allow_drain = false;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+
+		if (!migrate_vma_check_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+
+			putback_lru_page(page);
+		}
+	}
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with a special migration pte entry
+ * and check again if it has been pinned. Pinned pages are restored because we
+ * cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
+
+	for (i = 0; i < npages; i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
+		}
+	}
+
+	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i;
+
+	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+	}
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the extra
+ * refcount, for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	const unsigned long npages = migrate->npages;
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
+			newpage = page;
+		}
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->cpages--;
+
+		putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array of hmm_pfn_t containing source pfns
+ * @dst: array of hmm_pfn_t containing destination pfns
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, error code otherwise
+ *
+ * This function tries to migrate a range of virtual addresses, using
+ * callbacks to allocate and copy memory from source to destination. First it
+ * collects all the pages backing each virtual address in the range, saving this
+ * inside the src array. Then it locks those pages and unmaps them. Once the pages
+ * are locked and unmapped, it checks whether each page is pinned or not. Pages
+ * that aren't pinned have the MIGRATE_PFN_MIGRATE flag set (by this function)
+ * in the corresponding src array entry. It then restores any pages that are
+ * pinned, by remapping and unlocking those pages.
+ *
+ * At this point it calls the alloc_and_copy() callback. For documentation on
+ * what is expected from that callback, see struct migrate_vma_ops comments in
+ * include/linux/migrate.h
+ *
+ * After the alloc_and_copy() callback, this function goes over each entry in
+ * the src array that has the MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag
+ * set. If the corresponding entry in dst array has MIGRATE_PFN_VALID flag set,
+ * then the function tries to migrate struct page information from the source
+ * struct page to the destination struct page. If it fails to migrate the struct
+ * page information, then it clears the MIGRATE_PFN_MIGRATE flag in the src
+ * array.
+ *
+ * At this point all successfully migrated pages have an entry in the src
+ * array with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set and the dst
+ * array entry with MIGRATE_PFN_VALID flag set.
+ *
+ * It then calls the finalize_and_map() callback. See comments for "struct
+ * migrate_vma_ops", in include/linux/migrate.h for details about
+ * finalize_and_map() behavior.
+ *
+ * After the finalize_and_map() callback, for successfully migrated pages, this
+ * function updates the CPU page table to point to new pages, otherwise it
+ * restores the CPU page table to point to the original source pages.
+ *
+ * Function returns 0 after the above steps, even if no pages were migrated
+ * (The function only returns an error if any of the arguments are invalid.)
+ *
+ * Both src and dst array must be big enough for (end - start) >> PAGE_SHIFT
+ * unsigned long entries.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private)
+{
+	struct migrate_vma migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+	if (!ops || !src || !dst || start >= end)
+		return -EINVAL;
+
+	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+	migrate.src = src;
+	migrate.dst = dst;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.cpages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	migrate_vma_collect(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Lock and isolate page */
+	migrate_vma_prepare(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Unmap pages */
+	migrate_vma_unmap(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped, and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in migrate_vma_pages() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+	/* This does the real migration of struct page */
+	migrate_vma_pages(&migrate);
+
+	ops->finalize_and_map(vma, src, dst, start, end, private);
+
+	/* Unlock and remap pages */
+	migrate_vma_finalize(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
-- 
2.13.4
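
To make the calling convention of the new helper concrete, below is a minimal
driver-side sketch. The callback signatures, the MIGRATE_PFN_* flags and
migrate_vma() are the ones added by this patch (see the struct
migrate_vma_ops documentation in the hunk above for the full contract);
everything prefixed mydev_ (device page allocation, DMA copy, device page
table update) is hypothetical.

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

/* hypothetical driver helpers, assumed to exist elsewhere in the driver */
struct page *mydev_alloc_page(void *private);
void mydev_dma_copy(struct page *dst, struct page *src);

static void mydev_alloc_and_copy(struct vm_area_struct *vma,
				 const unsigned long *src,
				 unsigned long *dst,
				 unsigned long start,
				 unsigned long end,
				 void *private)
{
	unsigned long addr, i;

	for (i = 0, addr = start; addr < end; addr += PAGE_SIZE, i++) {
		struct page *spage, *dpage;

		/* only entries flagged by the core can be migrated */
		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		dpage = mydev_alloc_page(private);
		if (!dpage)
			continue;	/* dst[i] stays 0, page is skipped */

		spage = migrate_pfn_to_page(src[i]);
		lock_page(dpage);
		if (spage)
			mydev_dma_copy(dpage, spage);

		/* migrate_pfn() already sets MIGRATE_PFN_VALID */
		dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
	}
}

static void mydev_finalize_and_map(struct vm_area_struct *vma,
				   const unsigned long *src,
				   const unsigned long *dst,
				   unsigned long start,
				   unsigned long end,
				   void *private)
{
	/*
	 * Entries that still have MIGRATE_PFN_MIGRATE set in src[] did
	 * migrate; this is the place to point the device page table at
	 * the pages in dst[].
	 */
}

static const struct migrate_vma_ops mydev_migrate_ops = {
	.alloc_and_copy		= mydev_alloc_and_copy,
	.finalize_and_map	= mydev_finalize_and_map,
};

/* start/end must be page aligned, inside vma, mmap_sem held for read */
static int mydev_migrate_range(struct vm_area_struct *vma,
			       unsigned long start, unsigned long end,
			       void *private)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src, *dst;
	int ret = -ENOMEM;

	src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	if (!src || !dst)
		goto out;

	ret = migrate_vma(&mydev_migrate_ops, vma, start, end, src, dst,
			  private);
out:
	kfree(src);
	kfree(dst);
	return ret;
}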

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 15/19] mm/migrate: migrate_vma() unmap page from vma while collecting pages
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 14/19] mm/migrate: new memory migration helper for use with device memory v5 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 16/19] mm/migrate: support un-addressable ZONE_DEVICE page in migration v3 Jérôme Glisse
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

The common case for migration of a virtual address range is that pages
are mapped only once, inside the vma in which the migration is taking
place. Because we already walk the CPU page table for that range, we
can do the unmap directly there and set up the special migration swap
entry. (A condensed sketch of how such an entry is later recognised
follows the patch.)

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 mm/migrate.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 112 insertions(+), 29 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 60e2f8369cd7..57d1fa7a8e62 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2160,7 +2160,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	struct migrate_vma *migrate = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start;
+	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
 	pte_t *ptep;
 
@@ -2205,9 +2205,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		return migrate_vma_collect_hole(start, end, walk);
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
 		unsigned long mpfn, pfn;
 		struct page *page;
+		swp_entry_t entry;
 		pte_t pte;
 
 		pte = *ptep;
@@ -2239,11 +2242,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
+		/*
+		 * Optimize for the common case where page is only mapped once
+		 * in one process. If we can lock the page, then we can safely
+		 * set up a special migration page table entry now.
+		 */
+		if (trylock_page(page)) {
+			pte_t swp_pte;
+
+			mpfn |= MIGRATE_PFN_LOCKED;
+			ptep_get_and_clear(mm, addr, ptep);
+
+			/* Setup special migration page table entry */
+			entry = make_migration_entry(page, pte_write(pte));
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			/*
+			 * This is like regular unmap: we remove the rmap and
+			 * drop page refcount. Page won't be freed, as we took
+			 * a reference just above.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			unmapped++;
+		}
+
 next:
 		migrate->src[migrate->npages++] = mpfn;
 	}
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
+	/* Only flush the TLB if we actually modified any entries */
+	if (unmapped)
+		flush_tlb_range(walk->vma, start, end);
+
 	return 0;
 }
 
@@ -2268,7 +2304,13 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	mm_walk.mm = migrate->vma->vm_mm;
 	mm_walk.private = migrate;
 
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
 	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
 
 	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
 }
@@ -2316,32 +2358,37 @@ static bool migrate_vma_check_page(struct page *page)
 static void migrate_vma_prepare(struct migrate_vma *migrate)
 {
 	const unsigned long npages = migrate->npages;
+	const unsigned long start = migrate->start;
+	unsigned long addr, i, restore = 0;
 	bool allow_drain = true;
-	unsigned long i;
 
 	lru_add_drain();
 
 	for (i = 0; (i < npages) && migrate->cpages; i++) {
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		bool remap = true;
 
 		if (!page)
 			continue;
 
-		/*
-		 * Because we are migrating several pages there can be
-		 * a deadlock between 2 concurrent migration where each
-		 * are waiting on each other page lock.
-		 *
-		 * Make migrate_vma() a best effort thing and backoff
-		 * for any page we can not lock right away.
-		 */
-		if (!trylock_page(page)) {
-			migrate->src[i] = 0;
-			migrate->cpages--;
-			put_page(page);
-			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+			/*
+			 * Because we are migrating several pages there can be
+			 * a deadlock between 2 concurrent migration where each
+			 * are waiting on each other page lock.
+			 *
+			 * Make migrate_vma() a best effort thing and backoff
+			 * for any page we can not lock right away.
+			 */
+			if (!trylock_page(page)) {
+				migrate->src[i] = 0;
+				migrate->cpages--;
+				put_page(page);
+				continue;
+			}
+			remap = false;
+			migrate->src[i] |= MIGRATE_PFN_LOCKED;
 		}
-		migrate->src[i] |= MIGRATE_PFN_LOCKED;
 
 		if (!PageLRU(page) && allow_drain) {
 			/* Drain CPU's pagevec */
@@ -2350,21 +2397,50 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 		}
 
 		if (isolate_lru_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
-			put_page(page);
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+				put_page(page);
+			}
 			continue;
 		}
 
 		if (!migrate_vma_check_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
 
-			putback_lru_page(page);
+				get_page(page);
+				putback_lru_page(page);
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+
+				putback_lru_page(page);
+			}
 		}
 	}
+
+	for (i = 0, addr = start; i < npages && restore; i++, addr += PAGE_SIZE) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_pte(page, migrate->vma, addr, page);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		put_page(page);
+		restore--;
+	}
 }
 
 /*
@@ -2391,12 +2467,19 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
 			continue;
 
-		try_to_unmap(page, flags);
-		if (page_mapped(page) || !migrate_vma_check_page(page)) {
-			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-			migrate->cpages--;
-			restore++;
+		if (page_mapped(page)) {
+			try_to_unmap(page, flags);
+			if (page_mapped(page))
+				goto restore;
 		}
+
+		if (migrate_vma_check_page(page))
+			continue;
+
+restore:
+		migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+		migrate->cpages--;
+		restore++;
 	}
 
 	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
-- 
2.13.4
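
For completeness, here is a condensed sketch of how a migration entry
installed by the collection step is recognised when something else walks the
same page table before migrate_vma() finishes. It is an illustration built on
the existing swapops.h helpers, not code from this series.

#include <linux/swapops.h>

/*
 * Return the page a migration entry refers to, or NULL if the pte is
 * not a migration entry. do_swap_page() performs the same checks and
 * then waits in migration_entry_wait() until the migration finishes.
 */
static struct page *migration_entry_page_sketch(pte_t pte)
{
	swp_entry_t entry;

	if (!is_swap_pte(pte))
		return NULL;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		return NULL;

	return migration_entry_to_page(entry);
}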

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 16/19] mm/migrate: support un-addressable ZONE_DEVICE page in migration v3
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 15/19] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 17/19] mm/migrate: allow migrate_vma() to alloc new page on empty entry v4 Jérôme Glisse
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Kirill A . Shutemov

Allow unmapping and restoring of the special swap entries used for
un-addressable ZONE_DEVICE memory. (A driver-side sketch of how the
new MIGRATE_PFN_DEVICE flag can be consumed follows the patch.)

Changed since v2:
  - unconditionally allow device private memory to be migrated (it
    cannot be pinned, so it is pointless to check the reference count).
Changed since v1:
  - s/device unaddressable/device private/

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/migrate.h |  10 +++-
 mm/migrate.c            | 149 +++++++++++++++++++++++++++++++++++++++---------
 mm/page_vma_mapped.c    |  10 ++++
 mm/rmap.c               |  25 ++++++++
 4 files changed, 164 insertions(+), 30 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8f73cebfc3f5..8dc8f0a3f1af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -159,12 +159,18 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #ifdef CONFIG_MIGRATION
 
+/*
+ * Watch out for PAE architecture, which has an unsigned long, and might not
+ * have enough bits to store all physical address and flags. So far we have
+ * enough room for all our flags.
+ */
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_LOCKED	(1UL << 2)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
-#define MIGRATE_PFN_ERROR	(1UL << 4)
-#define MIGRATE_PFN_SHIFT	5
+#define MIGRATE_PFN_DEVICE	(1UL << 4)
+#define MIGRATE_PFN_ERROR	(1UL << 5)
+#define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 57d1fa7a8e62..6c8a9826da32 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
+#include <linux/memremap.h>
 #include <linux/balloon_compaction.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
@@ -236,7 +237,13 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
-		flush_dcache_page(new);
+		if (unlikely(is_zone_device_page(new)) &&
+		    is_device_private_page(new)) {
+			entry = make_device_private_entry(new, pte_write(pte));
+			pte = swp_entry_to_pte(entry);
+		} else
+			flush_dcache_page(new);
+
 #ifdef CONFIG_HUGETLB_PAGE
 		if (PageHuge(new)) {
 			pte = pte_mkhuge(pte);
@@ -2216,17 +2223,40 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pte = *ptep;
 		pfn = pte_pfn(pte);
 
-		if (!pte_present(pte)) {
+		if (pte_none(pte)) {
 			mpfn = pfn = 0;
 			goto next;
 		}
 
+		if (!pte_present(pte)) {
+			mpfn = pfn = 0;
+
+			/*
+			 * Only care about unaddressable device page special
+			 * page table entry. Other special swap entries are not
+			 * migratable, and we ignore regular swapped page.
+			 */
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_private_entry(entry))
+				goto next;
+
+			page = device_private_entry_to_page(entry);
+			mpfn = migrate_pfn(page_to_pfn(page))|
+				MIGRATE_PFN_DEVICE | MIGRATE_PFN_MIGRATE;
+			if (is_write_device_private_entry(entry))
+				mpfn |= MIGRATE_PFN_WRITE;
+		} else {
+			page = vm_normal_page(migrate->vma, addr, pte);
+			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
+			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+		}
+
 		/* FIXME support THP */
-		page = vm_normal_page(migrate->vma, addr, pte);
 		if (!page || !page->mapping || PageTransCompound(page)) {
 			mpfn = pfn = 0;
 			goto next;
 		}
+		pfn = page_to_pfn(page);
 
 		/*
 		 * By getting a reference on the page we pin it and that blocks
@@ -2239,8 +2269,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		 */
 		get_page(page);
 		migrate->cpages++;
-		mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
-		mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
 		/*
 		 * Optimize for the common case where page is only mapped once
@@ -2267,10 +2295,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 */
 			page_remove_rmap(page, false);
 			put_page(page);
-			unmapped++;
+
+			if (pte_present(pte))
+				unmapped++;
 		}
 
 next:
+		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = mpfn;
 	}
 	arch_leave_lazy_mmu_mode();
@@ -2340,6 +2371,28 @@ static bool migrate_vma_check_page(struct page *page)
 	if (PageCompound(page))
 		return false;
 
+	/* Page from ZONE_DEVICE have one extra reference */
+	if (is_zone_device_page(page)) {
+		/*
+		 * Private pages can never be pinned as they have no valid pte
+		 * and GUP will fail for those. Yet if there is a pending
+		 * migration, a thread might try to wait on the pte migration
+		 * entry and bump the page reference count. There is no way to
+		 * differentiate a regular pin from a migration wait, so to
+		 * avoid two racing threads (one stopping migration because the
+		 * other is waiting on the pte migration entry) looping forever
+		 * while migrating back to the CPU, we always return true here.
+		 *
+		 * FIXME proper solution is to rework migration_entry_wait() so
+		 * it does not need to take a reference on page.
+		 */
+		if (is_device_private_page(page))
+			return true;
+
+		/* Other ZONE_DEVICE memory type are not supported */
+		return false;
+	}
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2390,24 +2443,30 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 			migrate->src[i] |= MIGRATE_PFN_LOCKED;
 		}
 
-		if (!PageLRU(page) && allow_drain) {
-			/* Drain CPU's pagevec */
-			lru_add_drain_all();
-			allow_drain = false;
-		}
+		/* ZONE_DEVICE pages are not on LRU */
+		if (!is_zone_device_page(page)) {
+			if (!PageLRU(page) && allow_drain) {
+				/* Drain CPU's pagevec */
+				lru_add_drain_all();
+				allow_drain = false;
+			}
 
-		if (isolate_lru_page(page)) {
-			if (remap) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-				migrate->cpages--;
-				restore++;
-			} else {
-				migrate->src[i] = 0;
-				unlock_page(page);
-				migrate->cpages--;
-				put_page(page);
+			if (isolate_lru_page(page)) {
+				if (remap) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					migrate->cpages--;
+					restore++;
+				} else {
+					migrate->src[i] = 0;
+					unlock_page(page);
+					migrate->cpages--;
+					put_page(page);
+				}
+				continue;
 			}
-			continue;
+
+			/* Drop the reference we took in collect */
+			put_page(page);
 		}
 
 		if (!migrate_vma_check_page(page)) {
@@ -2416,14 +2475,19 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 				migrate->cpages--;
 				restore++;
 
-				get_page(page);
-				putback_lru_page(page);
+				if (!is_zone_device_page(page)) {
+					get_page(page);
+					putback_lru_page(page);
+				}
 			} else {
 				migrate->src[i] = 0;
 				unlock_page(page);
 				migrate->cpages--;
 
-				putback_lru_page(page);
+				if (!is_zone_device_page(page))
+					putback_lru_page(page);
+				else
+					put_page(page);
 			}
 		}
 	}
@@ -2494,7 +2558,10 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		unlock_page(page);
 		restore--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 	}
 }
 
@@ -2525,6 +2592,26 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 
 		mapping = page_mapping(page);
 
+		if (is_zone_device_page(newpage)) {
+			if (is_device_private_page(newpage)) {
+				/*
+				 * For now we only support private anonymous memory
+				 * when migrating to un-addressable device memory.
+				 */
+				if (mapping) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					continue;
+				}
+			} else {
+				/*
+				 * Other types of ZONE_DEVICE page are not
+				 * supported.
+				 */
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				continue;
+			}
+		}
+
 		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
@@ -2565,11 +2652,17 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		unlock_page(page);
 		migrate->cpages--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 
 		if (newpage != page) {
 			unlock_page(newpage);
-			putback_lru_page(newpage);
+			if (is_zone_device_page(newpage))
+				put_page(newpage);
+			else
+				putback_lru_page(newpage);
 		}
 	}
 }
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 3bd3008db4cb..6a03946469a9 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -48,6 +48,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		if (!is_swap_pte(*pvmw->pte))
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
+
 		if (!is_migration_entry(entry))
 			return false;
 		if (migration_entry_to_page(entry) - pvmw->page >=
@@ -60,6 +61,15 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		WARN_ON_ONCE(1);
 #endif
 	} else {
+		if (is_swap_pte(*pvmw->pte)) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(*pvmw->pte);
+			if (is_device_private_entry(entry) &&
+			    device_private_entry_to_page(entry) == pvmw->page)
+				return true;
+		}
+
 		if (!pte_present(*pvmw->pte))
 			return false;
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 2c55e7b8f8f4..60e47a96cdbd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -63,6 +63,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1330,6 +1331,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		return true;
 
+	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
+	    is_zone_device_page(page) && !is_device_private_page(page))
+		return true;
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_SPLIT_FREEZE, page);
@@ -1378,6 +1383,26 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (IS_ENABLED(CONFIG_MIGRATION) &&
+		    (flags & TTU_MIGRATION) &&
+		    is_zone_device_page(page)) {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			entry = make_migration_entry(page, 0);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			goto discard;
+		}
 
 		if (!(flags & TTU_IGNORE_ACCESS)) {
 			if (ptep_clear_flush_young_notify(vma, address,
-- 
2.13.4
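
With the new MIGRATE_PFN_DEVICE flag a driver can tell, inside its
finalize_and_map() callback, which of the migrated source pages already lived
in device private memory. Below is a fuller version of the finalize_and_map()
stub sketched after patch 14; mydev_map_page() is a hypothetical driver
helper.

#include <linux/migrate.h>

/* hypothetical driver helper, assumed to exist elsewhere in the driver */
void mydev_map_page(void *private, unsigned long addr, struct page *page);

static void mydev_finalize_and_map(struct vm_area_struct *vma,
				   const unsigned long *src,
				   const unsigned long *dst,
				   unsigned long start,
				   unsigned long end,
				   void *private)
{
	unsigned long addr, i;

	for (i = 0, addr = start; addr < end; addr += PAGE_SIZE, i++) {
		struct page *dpage = migrate_pfn_to_page(dst[i]);

		/* pages that failed to migrate have the flag cleared */
		if (!(src[i] & MIGRATE_PFN_MIGRATE) || !dpage)
			continue;

		if (src[i] & MIGRATE_PFN_DEVICE) {
			/*
			 * The source was device private memory, i.e. this
			 * entry migrated back to system memory; release the
			 * device page here instead of mapping a new one.
			 */
			continue;
		}

		/* system to device: point the device page table at dpage */
		mydev_map_page(private, addr, dpage);
	}
}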

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 17/19] mm/migrate: allow migrate_vma() to alloc new page on empty entry v4
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (15 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 16/19] mm/migrate: support un-addressable ZONE_DEVICE page in migration v3 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 18/19] mm/device-public-memory: device memory cache coherent with CPU v5 Jérôme Glisse
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse

This allows the caller of migrate_vma() to allocate a new page for an
empty CPU page table entry (pte_none or backed by the zero page). This
is only for anonymous memory, and no new page will be instantiated if
userfaultfd is armed.

This is useful to device drivers that want to migrate a range of
virtual addresses and would rather allocate new memory up front than
have to fault later on. (A sketch of how a driver handles such empty
entries in its alloc_and_copy() callback follows the patch.)

Changed since v3:
  - support zero pfn entry
  - improve commit message
Changed since v2:
  - differentiate between empty CPU page table entry and non empty
  - improve code comments explaining how this works
Changed since v1:
  - 5 level page table fix

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/migrate.h |   9 +++
 mm/migrate.c            | 205 +++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 205 insertions(+), 9 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 8dc8f0a3f1af..d4e6d12a0b40 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -218,6 +218,15 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
  * driver should avoid setting MIGRATE_PFN_ERROR unless it is really in an
  * unrecoverable state.
  *
+ * For an empty entry inside the CPU page table (pte_none() or pmd_none() is
+ * true) we do set the MIGRATE_PFN_MIGRATE flag inside the corresponding source
+ * array entry, thus allowing a device driver to allocate device memory for
+ * those unbacked virtual addresses. For this the device driver simply has to
+ * allocate device memory and properly set the destination entry like for
+ * regular migration. Note that this can still fail, and thus the device driver
+ * must check inside the finalize_and_map() callback whether the migration was
+ * successful for those entries, just like for regular migration.
+ *
  * THE alloc_and_copy() CALLBACK MUST NOT CHANGE ANY OF THE SRC ARRAY ENTRIES
  * OR BAD THINGS WILL HAPPEN !
  *
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c8a9826da32..5500e3738354 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -37,6 +37,7 @@
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
 #include <linux/memremap.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/balloon_compaction.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
@@ -2152,6 +2153,22 @@ static int migrate_vma_collect_hole(unsigned long start,
 	unsigned long addr;
 
 	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+		migrate->src[migrate->npages++] = MIGRATE_PFN_MIGRATE;
+		migrate->dst[migrate->npages] = 0;
+		migrate->cpages++;
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_skip(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr;
+
+	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = 0;
 	}
@@ -2189,7 +2206,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmdp, addr);
 			if (pmd_trans_unstable(pmdp))
-				return migrate_vma_collect_hole(start, end,
+				return migrate_vma_collect_skip(start, end,
 								walk);
 		} else {
 			int ret;
@@ -2197,19 +2214,22 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			get_page(page);
 			spin_unlock(ptl);
 			if (unlikely(!trylock_page(page)))
-				return migrate_vma_collect_hole(start, end,
+				return migrate_vma_collect_skip(start, end,
 								walk);
 			ret = split_huge_page(page);
 			unlock_page(page);
 			put_page(page);
-			if (ret || pmd_none(*pmdp))
+			if (ret)
+				return migrate_vma_collect_skip(start, end,
+								walk);
+			if (pmd_none(*pmdp))
 				return migrate_vma_collect_hole(start, end,
 								walk);
 		}
 	}
 
 	if (unlikely(pmd_bad(*pmdp)))
-		return migrate_vma_collect_hole(start, end, walk);
+		return migrate_vma_collect_skip(start, end, walk);
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
@@ -2224,7 +2244,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pfn = pte_pfn(pte);
 
 		if (pte_none(pte)) {
-			mpfn = pfn = 0;
+			mpfn = MIGRATE_PFN_MIGRATE;
+			migrate->cpages++;
+			pfn = 0;
 			goto next;
 		}
 
@@ -2246,6 +2268,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
+			if (is_zero_pfn(pfn)) {
+				mpfn = MIGRATE_PFN_MIGRATE;
+				migrate->cpages++;
+				pfn = 0;
+				goto next;
+			}
 			page = vm_normal_page(migrate->vma, addr, pte);
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
@@ -2565,6 +2593,135 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 	}
 }
 
+static void migrate_vma_insert_page(struct migrate_vma *migrate,
+				    unsigned long addr,
+				    struct page *page,
+				    unsigned long *src,
+				    unsigned long *dst)
+{
+	struct vm_area_struct *vma = migrate->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mem_cgroup *memcg;
+	bool flush = false;
+	spinlock_t *ptl;
+	pte_t entry;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+
+	/* Only allow populating anonymous memory */
+	if (!vma_is_anonymous(vma))
+		goto abort;
+
+	pgdp = pgd_offset(mm, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr);
+	if (!p4dp)
+		goto abort;
+	pudp = pud_alloc(mm, p4dp, addr);
+	if (!pudp)
+		goto abort;
+	pmdp = pmd_alloc(mm, pudp, addr);
+	if (!pmdp)
+		goto abort;
+
+	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+		goto abort;
+
+	/*
+	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
+	 * pte_offset_map() on pmds where a huge pmd might be created
+	 * from a different thread.
+	 *
+	 * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
+	 * parallel threads are excluded by other means.
+	 *
+	 * Here we only have down_read(mmap_sem).
+	 */
+	if (pte_alloc(mm, pmdp, addr))
+		goto abort;
+
+	/* See the comment in pte_alloc_one_map() */
+	if (unlikely(pmd_trans_unstable(pmdp)))
+		goto abort;
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto abort;
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+		goto abort;
+
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * preceding stores to the page contents become visible before
+	 * the set_pte_at() write.
+	 */
+	__SetPageUptodate(page);
+
+	if (is_zone_device_page(page) && is_device_private_page(page)) {
+		swp_entry_t swp_entry;
+
+		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+		entry = swp_entry_to_pte(swp_entry);
+	} else {
+		entry = mk_pte(page, vma->vm_page_prot);
+		if (vma->vm_flags & VM_WRITE)
+			entry = pte_mkwrite(pte_mkdirty(entry));
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+
+	if (pte_present(*ptep)) {
+		unsigned long pfn = pte_pfn(*ptep);
+
+		if (!is_zero_pfn(pfn)) {
+			pte_unmap_unlock(ptep, ptl);
+			mem_cgroup_cancel_charge(page, memcg, false);
+			goto abort;
+		}
+		flush = true;
+	} else if (!pte_none(*ptep)) {
+		pte_unmap_unlock(ptep, ptl);
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto abort;
+	}
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(ptep, ptl);
+		mem_cgroup_cancel_charge(page, memcg, false);
+		goto abort;
+	}
+
+	inc_mm_counter(mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, addr, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
+	if (!is_zone_device_page(page))
+		lru_cache_add_active_or_unevictable(page, vma);
+	get_page(page);
+
+	if (flush) {
+		flush_cache_page(vma, addr, pte_pfn(*ptep));
+		ptep_clear_flush_notify(vma, addr, ptep);
+		set_pte_at_notify(mm, addr, ptep, entry);
+		update_mmu_cache(vma, addr, ptep);
+	} else {
+		/* No need to invalidate - it was non-present before */
+		set_pte_at(mm, addr, ptep, entry);
+		update_mmu_cache(vma, addr, ptep);
+	}
+
+	pte_unmap_unlock(ptep, ptl);
+	*src = MIGRATE_PFN_MIGRATE;
+	return;
+
+abort:
+	*src &= ~MIGRATE_PFN_MIGRATE;
+}
+
 /*
  * migrate_vma_pages() - migrate meta-data from src page to dst page
  * @migrate: migrate struct containing all migration information
@@ -2577,7 +2734,10 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 {
 	const unsigned long npages = migrate->npages;
 	const unsigned long start = migrate->start;
-	unsigned long addr, i;
+	struct vm_area_struct *vma = migrate->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr, i, mmu_start;
+	bool notified = false;
 
 	for (i = 0, addr = start; i < npages; addr += PAGE_SIZE, i++) {
 		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
@@ -2585,10 +2745,27 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 		struct address_space *mapping;
 		int r;
 
-		if (!page || !newpage)
+		if (!newpage) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 			continue;
-		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+		}
+
+		if (!page) {
+			if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) {
+				continue;
+			}
+			if (!notified) {
+				mmu_start = addr;
+				notified = true;
+				mmu_notifier_invalidate_range_start(mm,
+								mmu_start,
+								migrate->end);
+			}
+			migrate_vma_insert_page(migrate, addr, newpage,
+						&migrate->src[i],
+						&migrate->dst[i]);
 			continue;
+		}
 
 		mapping = page_mapping(page);
 
@@ -2616,6 +2793,10 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 	}
+
+	if (notified)
+		mmu_notifier_invalidate_range_end(mm, mmu_start,
+						  migrate->end);
 }
 
 /*
@@ -2638,8 +2819,14 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
 
-		if (!page)
+		if (!page) {
+			if (newpage) {
+				unlock_page(newpage);
+				put_page(newpage);
+			}
 			continue;
+		}
+
 		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
 			if (newpage) {
 				unlock_page(newpage);
-- 
2.13.4
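
On the driver side, an empty CPU page table entry shows up in the src array
as MIGRATE_PFN_MIGRATE set without MIGRATE_PFN_VALID. The sketch below shows
how an alloc_and_copy() callback might handle such entries;
mydev_alloc_page() and mydev_clear_page() are hypothetical, and the new page
is zero-filled to preserve the semantics of never-written anonymous memory.

#include <linux/migrate.h>
#include <linux/pagemap.h>

/* hypothetical driver helpers, assumed to exist elsewhere in the driver */
struct page *mydev_alloc_page(void *private);
void mydev_clear_page(struct page *page);

static void mydev_alloc_and_copy_holes(struct vm_area_struct *vma,
				       const unsigned long *src,
				       unsigned long *dst,
				       unsigned long start,
				       unsigned long end,
				       void *private)
{
	unsigned long addr, i;

	for (i = 0, addr = start; addr < end; addr += PAGE_SIZE, i++) {
		struct page *dpage;

		/* an empty CPU entry: migrateable but with no source pfn */
		if (!(src[i] & MIGRATE_PFN_MIGRATE) ||
		    (src[i] & MIGRATE_PFN_VALID))
			continue;

		dpage = mydev_alloc_page(private);
		if (!dpage)
			continue;	/* dst[i] stays 0, nothing migrates */

		lock_page(dpage);
		mydev_clear_page(dpage);
		dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
	}
}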

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 18/19] mm/device-public-memory: device memory cache coherent with CPU v5
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (16 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 17/19] mm/migrate: allow migrate_vma() to alloc new page on empty entry v4 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-08-17  0:05 ` [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse, Aneesh Kumar, Paul E . McKenney,
	Benjamin Herrenschmidt, Ross Zwisler

Platforms with an advanced system bus (like CAPI or CCIX) allow device
memory to be accessible from the CPU in a cache coherent fashion. Add
a new type of ZONE_DEVICE to represent such memory. The use cases are
the same as for un-addressable device memory, but without all the
corner cases.

Changed since v4:
  - added memory cgroup change to this patch
Changed since v3:
  - s/host/public (going back)
Changed since v2:
  - s/public/host
  - add proper include in migrate.c and drop useless #if/#endif
Changed since v1:
  - Kconfig and #if/#else cleanup

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       |  2 +-
 include/linux/hmm.h      |  4 ++--
 include/linux/ioport.h   |  1 +
 include/linux/memremap.h | 21 ++++++++++++++++++
 include/linux/mm.h       | 20 ++++++++++-------
 kernel/memremap.c        |  8 +++----
 mm/Kconfig               | 11 ++++++++++
 mm/gup.c                 |  7 ++++++
 mm/hmm.c                 |  4 ++--
 mm/madvise.c             |  2 +-
 mm/memcontrol.c          | 12 +++++-----
 mm/memory.c              | 46 +++++++++++++++++++++++++++++++++-----
 mm/migrate.c             | 57 ++++++++++++++++++++++++++++++++----------------
 mm/swap.c                | 11 ++++++++++
 14 files changed, 159 insertions(+), 47 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e3ba93b35925..d3a57b43f9f1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1189,7 +1189,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
-		page = vm_normal_page(vma, addr, pte);
+		page = _vm_normal_page(vma, addr, pte, true);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index aeba696a4385..79e63178fd87 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -327,7 +327,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct hmm_devmem;
 
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
@@ -494,7 +494,7 @@ struct hmm_device {
  */
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 3a4f69137bc2..f5cf32e80041 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -131,6 +131,7 @@ enum {
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
 	IORES_DESC_DEVICE_PRIVATE_MEMORY	= 6,
+	IORES_DESC_DEVICE_PUBLIC_MEMORY		= 7,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8aa6b82679e2..f8ee1c73ad2d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -57,10 +57,18 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  *
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.txt.
+ *
+ * MEMORY_DEVICE_PUBLIC:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CCIX). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
+	MEMORY_DEVICE_PUBLIC,
 };
 
 /*
@@ -92,6 +100,8 @@ enum memory_type {
  * The page_free() callback is called once the page refcount reaches 1
  * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
  * This allows the device driver to implement its own memory management.)
+ *
+ * For MEMORY_DEVICE_PUBLIC only the page_free() callback matters.
  */
 typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
 				unsigned long addr,
@@ -134,6 +144,12 @@ static inline bool is_device_private_page(const struct page *page)
 	return is_zone_device_page(page) &&
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -157,6 +173,11 @@ static inline bool is_device_private_page(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 515d4ae611b2..d1d161efefa0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -796,15 +796,16 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page);
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+void put_zone_device_private_or_public_page(struct page *page);
 #else
-static inline void put_zone_device_private_page(struct page *page)
+static inline void put_zone_device_private_or_public_page(struct page *page)
 {
 }
-#endif
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
 
 static inline bool is_device_private_page(const struct page *page);
+static inline bool is_device_public_page(const struct page *page);
 
 DECLARE_STATIC_KEY_FALSE(device_private_key);
 
@@ -830,8 +831,9 @@ static inline void put_page(struct page *page)
 	 * include/linux/memremap.h and HMM for details.
 	 */
 	if (static_branch_unlikely(&device_private_key) &&
-	    unlikely(is_device_private_page(page))) {
-		put_zone_device_private_page(page);
+	    unlikely(is_device_private_page(page) ||
+		     is_device_public_page(page))) {
+		put_zone_device_private_or_public_page(page);
 		return;
 	}
 
@@ -1220,8 +1222,10 @@ struct zap_details {
 	pgoff_t last_index;			/* Highest page->index to unmap */
 };
 
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-		pte_t pte);
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device);
+#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index f42d7483e886..d5ce737af5d7 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -477,8 +477,8 @@ struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 #endif /* CONFIG_ZONE_DEVICE */
 
 
-#ifdef CONFIG_DEVICE_PRIVATE
-void put_zone_device_private_page(struct page *page)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
+void put_zone_device_private_or_public_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -498,5 +498,5 @@ void put_zone_device_private_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_page);
-#endif /* CONFIG_DEVICE_PRIVATE */
+EXPORT_SYMBOL(put_zone_device_private_or_public_page);
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/Kconfig b/mm/Kconfig
index c07cb590b2d6..5152959e1912 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -719,12 +719,23 @@ config HMM_MIRROR
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
+	select HMM
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
 	  memory; i.e., memory that is only accessible from the device (or
 	  group of devices). You likely also want to select HMM_MIRROR.
 
+config DEVICE_PUBLIC
+	bool "Addressable device memory (like GPU memory)"
+	depends on ARCH_HAS_HMM
+	select HMM
+
+	help
+	  Allows creation of struct pages to represent addressable device
+	  memory; i.e., memory that is accessible from both the device and
+	  the CPU
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/gup.c b/mm/gup.c
index 76fd199aaae2..b2b4d4263768 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -456,6 +456,13 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
 			goto unmap;
 		*page = pte_page(*pte);
+
+		/*
+		 * This should never happen (a device public page in the gate
+		 * area).
+		 */
+		if (is_device_public_page(*page))
+			goto unmap;
 	}
 	get_page(*page);
 out:
diff --git a/mm/hmm.c b/mm/hmm.c
index 8c5ce5f76b2e..1a1e79d390c1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -752,7 +752,7 @@ EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
+#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
 struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
 				       unsigned long addr)
 {
@@ -1192,4 +1192,4 @@ static int __init hmm_init(void)
 }
 
 device_initcall(hmm_init);
-#endif /* IS_ENABLED(CONFIG_DEVICE_PRIVATE) */
+#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
diff --git a/mm/madvise.c b/mm/madvise.c
index 47d8d8a25eae..6cf68d5c49f1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -344,7 +344,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}
 
-		page = vm_normal_page(vma, addr, ptent);
+		page = _vm_normal_page(vma, addr, ptent, true);
 		if (!page)
 			continue;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 977d1cf3493a..504eb015f3c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4616,10 +4616,11 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
- *     (so ZONE_DEVICE page and thus not on the lru). For now we such page is
- *     charge like a regular page would be as for all intent and purposes it is
- *     just special memory taking the place of a regular page.
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE but page is MEMORY_DEVICE_PUBLIC
+ *     or MEMORY_DEVICE_PRIVATE (so ZONE_DEVICE page and thus not on the lru).
+ *     For now such a page is charged like a regular page would be as for all
+ *     intents and purposes it is just special memory taking the place of a
+ *     regular page.
  *
  *     See Documentations/vm/hmm.txt and include/linux/hmm.h
  *
@@ -4650,7 +4651,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 */
 		if (page->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
-			if (is_device_private_page(page))
+			if (is_device_private_page(page) ||
+			    is_device_public_page(page))
 				ret = MC_TARGET_DEVICE;
 			if (target)
 				target->page = page;
diff --git a/mm/memory.c b/mm/memory.c
index d1b2da5fe70e..f3af4e21c8df 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -789,8 +789,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 #else
 # define HAVE_PTE_SPECIAL 0
 #endif
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-				pte_t pte)
+struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte, bool with_public_device)
 {
 	unsigned long pfn = pte_pfn(pte);
 
@@ -801,8 +801,31 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (!is_zero_pfn(pfn))
-			print_bad_pte(vma, addr, pte, NULL);
+		if (is_zero_pfn(pfn))
+			return NULL;
+
+		/*
+		 * Device public pages are special pages (they are ZONE_DEVICE
+		 * pages but different from persistent memory). They behave
+		 * almost like normal pages. The difference is that they are
+		 * not on the lru and thus should never be involved with any-
+		 * thing that involves lru manipulation (mlock, numa balancing,
+		 * ...).
+		 *
+		 * This is why we still want to return NULL for such pages from
+		 * vm_normal_page() so that we do not have to special case all
+		 * call sites of vm_normal_page().
+		 */
+		if (likely(pfn < highest_memmap_pfn)) {
+			struct page *page = pfn_to_page(pfn);
+
+			if (is_device_public_page(page)) {
+				if (with_public_device)
+					return page;
+				return NULL;
+			}
+		}
+		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
 
@@ -983,6 +1006,19 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
+	} else if (pte_devmap(pte)) {
+		page = pte_page(pte);
+
+		/*
+		 * Cache coherent device memory behaves like a regular page and
+		 * not like a persistent memory page. For more information see
+		 * MEMORY_DEVICE_PUBLIC in include/linux/memremap.h.
+		 */
+		if (is_device_public_page(page)) {
+			get_page(page);
+			page_dup_rmap(page, false);
+			rss[mm_counter(page)]++;
+		}
 	}
 
 out_set_pte:
@@ -1238,7 +1274,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_present(ptent)) {
 			struct page *page;
 
-			page = vm_normal_page(vma, addr, ptent);
+			page = _vm_normal_page(vma, addr, ptent, true);
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
diff --git a/mm/migrate.c b/mm/migrate.c
index 5500e3738354..850001bc3c12 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -36,6 +36,7 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 #include <linux/gfp.h>
+#include <linux/pfn_t.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/balloon_compaction.h>
@@ -238,10 +239,14 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 		if (is_write_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 
-		if (unlikely(is_zone_device_page(new)) &&
-		    is_device_private_page(new)) {
-			entry = make_device_private_entry(new, pte_write(pte));
-			pte = swp_entry_to_pte(entry);
+		if (unlikely(is_zone_device_page(new))) {
+			if (is_device_private_page(new)) {
+				entry = make_device_private_entry(new, pte_write(pte));
+				pte = swp_entry_to_pte(entry);
+			} else if (is_device_public_page(new)) {
+				pte = pte_mkdevmap(pte);
+				flush_dcache_page(new);
+			}
 		} else
 			flush_dcache_page(new);
 
@@ -436,12 +441,11 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	void **pslot;
 
 	/*
-	 * ZONE_DEVICE pages have 1 refcount always held by their device
-	 *
-	 * Note that DAX memory will never reach that point as it does not have
-	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 * Device public or private pages have an extra refcount as they are
+	 * ZONE_DEVICE pages.
 	 */
-	expected_count += is_zone_device_page(page);
+	expected_count += is_device_private_page(page);
+	expected_count += is_device_public_page(page);
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -2134,7 +2138,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 #endif /* CONFIG_NUMA */
 
-
 struct migrate_vma {
 	struct vm_area_struct	*vma;
 	unsigned long		*dst;
@@ -2274,7 +2277,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				pfn = 0;
 				goto next;
 			}
-			page = vm_normal_page(migrate->vma, addr, pte);
+			page = _vm_normal_page(migrate->vma, addr, pte, true);
 			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
@@ -2417,10 +2420,19 @@ static bool migrate_vma_check_page(struct page *page)
 		if (is_device_private_page(page))
 			return true;
 
-		/* Other ZONE_DEVICE memory type are not supported */
-		return false;
+		/*
+		 * Only allow device public pages to be migrated and account for
+		 * the extra reference count implied by ZONE_DEVICE pages.
+		 */
+		if (!is_device_public_page(page))
+			return false;
+		extra++;
 	}
 
+	/* For file backed pages */
+	if (page_mapping(page))
+		extra += 1 + page_has_private(page);
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2658,11 +2670,18 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	 */
 	__SetPageUptodate(page);
 
-	if (is_zone_device_page(page) && is_device_private_page(page)) {
-		swp_entry_t swp_entry;
-
-		swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
-		entry = swp_entry_to_pte(swp_entry);
+	if (is_zone_device_page(page)) {
+		if (is_device_private_page(page)) {
+			swp_entry_t swp_entry;
+
+			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			entry = swp_entry_to_pte(swp_entry);
+		} else if (is_device_public_page(page)) {
+			entry = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+			if (vma->vm_flags & VM_WRITE)
+				entry = pte_mkwrite(pte_mkdirty(entry));
+			entry = pte_mkdevmap(entry);
+		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
@@ -2779,7 +2798,7 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 					continue;
 				}
-			} else {
+			} else if (!is_device_public_page(newpage)) {
 				/*
 				 * Other types of ZONE_DEVICE page are not
 				 * supported.
diff --git a/mm/swap.c b/mm/swap.c
index 62d96b8e5eb3..9295ae960d66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -765,6 +765,17 @@ void release_pages(struct page **pages, int nr, bool cold)
 		if (is_huge_zero_page(page))
 			continue;
 
+		/* Device public page can not be huge page */
+		if (is_device_public_page(page)) {
+			if (locked_pgdat) {
+				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+						       flags);
+				locked_pgdat = NULL;
+			}
+			put_zone_device_private_or_public_page(page);
+			continue;
+		}
+
 		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
-- 
2.13.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (17 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 18/19] mm/device-public-memory: device memory cache coherent with CPU v5 Jérôme Glisse
@ 2017-08-17  0:05 ` Jérôme Glisse
  2017-09-04  3:09   ` Bob Liu
  2017-08-17 21:39 ` [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Andrew Morton
  2017-12-13 12:10 ` Figo.zhang
  20 siblings, 1 reply; 66+ messages in thread
From: Jérôme Glisse @ 2017-08-17  0:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Jérôme Glisse

Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.
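
A rough sketch of how a driver might use the new helper (only
hmm_devmem_add_resource() and IORES_DESC_DEVICE_PUBLIC_MEMORY come from
this series; the ops bodies and the probe function below are made up, and
the signatures follow the hmm_devmem_ops definition added earlier in the
series):

#include <linux/err.h>
#include <linux/hmm.h>
#include <linux/mm.h>

static void example_cdm_free(struct hmm_devmem *devmem, struct page *page)
{
	/* Hand the page back to the driver's own allocator. */
}

static int example_cdm_fault(struct hmm_devmem *devmem,
			     struct vm_area_struct *vma, unsigned long addr,
			     const struct page *page, unsigned int flags,
			     pmd_t *pmdp)
{
	/* CDM is CPU addressable, so this path is not expected to run. */
	return VM_FAULT_SIGBUS;
}

static const struct hmm_devmem_ops example_cdm_ops = {
	.free	= example_cdm_free,
	.fault	= example_cdm_fault,
};

static int example_cdm_probe(struct device *dev, struct resource *res)
{
	struct hmm_devmem *devmem;

	/* res->desc must be IORES_DESC_DEVICE_PUBLIC_MEMORY, else -EINVAL */
	devmem = hmm_devmem_add_resource(&example_cdm_ops, dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* Pages in [devmem->pfn_first, devmem->pfn_last) are now
	 * ZONE_DEVICE MEMORY_DEVICE_PUBLIC and usable as migration targets. */
	return 0;
}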

Changed since v2:
  - s/host/public
Changed since v1:
  - s/public/host

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
---
 include/linux/hmm.h |  3 ++
 mm/hmm.c            | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 79e63178fd87..5866f3194c26 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -443,6 +443,9 @@ struct hmm_devmem {
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 				  struct device *device,
 				  unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res);
 void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 1a1e79d390c1..3faa4d40295e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
 	zone = page_zone(page);
 
 	mem_hotplug_begin();
-	__remove_pages(zone, start_pfn, npages);
+	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+		__remove_pages(zone, start_pfn, npages);
+	else
+		arch_remove_memory(start_pfn << PAGE_SHIFT,
+				   npages << PAGE_SHIFT);
 	mem_hotplug_done();
 
 	hmm_devmem_radix_release(resource);
@@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	if (is_ram == REGION_INTERSECTS)
 		return -ENXIO;
 
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+	else
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
 	devmem->pagemap.res = devmem->resource;
 	devmem->pagemap.page_fault = hmm_devmem_fault;
 	devmem->pagemap.page_free = hmm_devmem_free;
@@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	 * over the device memory is un-accessible thus we do not want to
 	 * create a linear mapping for the memory like arch_add_memory()
 	 * would do.
+	 *
+	 * For device public memory, which is accessible by the CPU, we do
+	 * want the linear mapping and thus use arch_add_memory().
 	 */
-	ret = add_pages(nid, align_start >> PAGE_SHIFT,
-			align_size >> PAGE_SHIFT, false);
+	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+		ret = arch_add_memory(nid, align_start, align_size, false);
+	else
+		ret = add_pages(nid, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, false);
 	if (ret) {
 		mem_hotplug_done();
 		goto error_add_memory;
@@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res)
+{
+	struct hmm_devmem *devmem;
+	int ret;
+
+	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		return ERR_PTR(-EINVAL);
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = res;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_devm_add_action;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add_resource);
+
 /*
  * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
  *
@@ -1097,6 +1172,7 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 {
 	resource_size_t start, size;
 	struct device *device;
+	bool cdm = false;
 
 	if (!devmem)
 		return;
@@ -1105,11 +1181,13 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
 	start = devmem->resource->start;
 	size = resource_size(devmem->resource);
 
+	cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
 	hmm_devmem_ref_kill(&devmem->ref);
 	hmm_devmem_ref_exit(&devmem->ref);
 	hmm_devmem_pages_remove(devmem);
 
-	devm_release_mem_region(device, start, size);
+	if (!cdm)
+		devm_release_mem_region(device, start, size);
 }
 EXPORT_SYMBOL(hmm_devmem_remove);
 
-- 
2.13.4

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  2017-08-17  0:05 ` [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
@ 2017-08-17 21:12   ` Andrew Morton
  2017-08-17 21:44     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2017-08-17 21:12 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Wed, 16 Aug 2017 20:05:42 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:

> Introduce a new migration mode that allow to offload the copy to
> a device DMA engine. This changes the workflow of migration and
> not all address_space migratepage callback can support this. So
> it needs to be tested in those cases.

Can you please expand on this?  What additional testing must be
performed before we are able to merge this into mainline?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (18 preceding siblings ...)
  2017-08-17  0:05 ` [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
@ 2017-08-17 21:39 ` Andrew Morton
  2017-08-17 21:55   ` Jerome Glisse
  2017-12-13 12:10 ` Figo.zhang
  20 siblings, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2017-08-17 21:39 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Wed, 16 Aug 2017 20:05:29 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:

> Heterogeneous Memory Management (HMM) (description and justification)

The patchset adds 55 kbytes to x86_64's mm/*.o and there doesn't appear
to be any way of avoiding this overhead, or of avoiding whatever
runtime overheads are added.

It also adds 18k to arm's mm/*.o and arm doesn't support HMM at all.

So that's all quite a lot of bloat for systems which get no benefit from
the patchset.  What can we do to improve this situation (a lot)?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
  2017-08-17 21:12   ` Andrew Morton
@ 2017-08-17 21:44     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-08-17 21:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Thu, Aug 17, 2017 at 02:12:45PM -0700, Andrew Morton wrote:
> On Wed, 16 Aug 2017 20:05:42 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:
> 
> > Introduce a new migration mode that allow to offload the copy to
> > a device DMA engine. This changes the workflow of migration and
> > not all address_space migratepage callback can support this. So
> > it needs to be tested in those cases.
> 
> Can you please expand on this?  What additional testing must be
> performed before we are able to merge this into mainline?
> 

No additional testing needed. I disable MIGRATE_SYNC_NO_COPY in all
problematic migratepage() callbacks and I added comments in those to
explain why (part of this patch). The commit message is unclear: it
should say that any callback that wishes to support this new mode needs
to be aware of the difference in the migration flow from other modes.

Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operation. But in the problematic cases you can not
easily hold the extra lock across multiple calls to this callback.
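
To make that concrete, a minimal sketch of the opt-out (the callback name
and the fallback below are illustrative, not lifted from the patch):

#include <linux/fs.h>
#include <linux/migrate.h>

/*
 * Illustrative only: a migratepage() callback that needs extra locking
 * around the copy simply refuses the new no-copy mode.
 */
static int example_migratepage(struct address_space *mapping,
			       struct page *newpage, struct page *page,
			       enum migrate_mode mode)
{
	if (mode == MIGRATE_SYNC_NO_COPY)
		return -EINVAL;

	/* ... take the extra lock, then do a regular copying migration ... */
	return migrate_page(mapping, newpage, page, mode);
}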

Usual flow is:

For each page {
 1 - lock page
 2 - call migratepage() callback
 3 - (extra locking in some migratepage() callback)
 4 - migrate page state (freeze refcount, update page cache, buffer
     head, ...)
 5 - copy page
 6 - (unlock any extra lock of migratepage() callback)
 7 - return from migratepage() callback
 8 - unlock page
}

The new mode MIGRATE_SYNC_NO_COPY:
 1 - lock multiple pages
For each page {
 2 - call migratepage() callback
 3 - abort in all problematic migratepage() callback
 4 - migrate page state (freeze refcount, update page cache, buffer
     head, ...)
} // finished all calls to migratepage() callback
 5 - DMA copy multiple pages
 6 - unlock all the pages

To support MIGRATE_SYNC_NO_COPY in the problematic cases we would
need a new callback, migratepages() for instance, that deals with
multiple pages in one transaction.

Because the problematic cases are not important for current usage
I did not want to complicate this patchset even more for no good
reason.

I hope this clarifies things; the commit message and the comment in
migrate.h can probably use the extra description I just gave.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17 21:39 ` [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Andrew Morton
@ 2017-08-17 21:55   ` Jerome Glisse
  2017-08-17 21:59     ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-08-17 21:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Thu, Aug 17, 2017 at 02:39:16PM -0700, Andrew Morton wrote:
> On Wed, 16 Aug 2017 20:05:29 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:
> 
> > Heterogeneous Memory Management (HMM) (description and justification)
> 
> The patchset adds 55 kbytes to x86_64's mm/*.o and there doesn't appear
> to be any way of avoiding this overhead, or of avoiding whatever
> runtime overheads are added.

HMM has already been integrated in a couple of Red Hat kernels and AFAIK no
runtime performance issue has been reported. Though the RHEL version does not
use the static key as Dan asked.

> 
> It also adds 18k to arm's mm/*.o and arm doesn't support HMM at all.
> 
> So that's all quite a lot of bloat for systems which get no benefit from
> the patchset.  What can we do to improve this situation (a lot)?

I will look into why the object files grow so much on arm. My guess is that
the new migrate code is the bulk of that. I can hide the new page migration
code behind a kernel configuration flag.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17 21:55   ` Jerome Glisse
@ 2017-08-17 21:59     ` Dan Williams
  2017-08-17 22:02       ` Jerome Glisse
  2017-08-17 22:16       ` Andrew Morton
  0 siblings, 2 replies; 66+ messages in thread
From: Dan Williams @ 2017-08-17 21:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh

On Thu, Aug 17, 2017 at 2:55 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Thu, Aug 17, 2017 at 02:39:16PM -0700, Andrew Morton wrote:
>> On Wed, 16 Aug 2017 20:05:29 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:
>>
>> > Heterogeneous Memory Management (HMM) (description and justification)
>>
>> The patchset adds 55 kbytes to x86_64's mm/*.o and there doesn't appear
>> to be any way of avoiding this overhead, or of avoiding whatever
>> runtime overheads are added.
>
> HMM have already been integrated in couple of Red Hat kernel and AFAIK there
> is no runtime performance issue reported. Thought the RHEL version does not
> use static key as Dan asked.
>
>>
>> It also adds 18k to arm's mm/*.o and arm doesn't support HMM at all.
>>
>> So that's all quite a lot of bloat for systems which get no benefit from
>> the patchset.  What can we do to improve this situation (a lot)?
>
> I will look into why object file grow so much on arm. My guess is that the
> new migrate code is the bulk of that. I can hide the new page migration code
> behind a kernel configuration flag.

Shouldn't we completely disable all of it unless there is a driver in
the kernel that selects it?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17 21:59     ` Dan Williams
@ 2017-08-17 22:02       ` Jerome Glisse
  2017-08-17 22:06         ` Dan Williams
  2017-08-17 22:16       ` Andrew Morton
  1 sibling, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-08-17 22:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh

On Thu, Aug 17, 2017 at 02:59:20PM -0700, Dan Williams wrote:
> On Thu, Aug 17, 2017 at 2:55 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Thu, Aug 17, 2017 at 02:39:16PM -0700, Andrew Morton wrote:
> >> On Wed, 16 Aug 2017 20:05:29 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:
> >>
> >> > Heterogeneous Memory Management (HMM) (description and justification)
> >>
> >> The patchset adds 55 kbytes to x86_64's mm/*.o and there doesn't appear
> >> to be any way of avoiding this overhead, or of avoiding whatever
> >> runtime overheads are added.
> >
> > HMM have already been integrated in couple of Red Hat kernel and AFAIK there
> > is no runtime performance issue reported. Thought the RHEL version does not
> > use static key as Dan asked.
> >
> >>
> >> It also adds 18k to arm's mm/*.o and arm doesn't support HMM at all.
> >>
> >> So that's all quite a lot of bloat for systems which get no benefit from
> >> the patchset.  What can we do to improve this situation (a lot)?
> >
> > I will look into why object file grow so much on arm. My guess is that the
> > new migrate code is the bulk of that. I can hide the new page migration code
> > behind a kernel configuration flag.
> 
> Shouldn't we completely disable all of it unless there is a driver in
> the kernel that selects it?

At one point people asked to be able to use the new migrate helper without
HMM, which is why it is not behind any HMM kconfig.

IIRC even ARM folks were interested: pretty much all SOCs have several DMA
engines that sit idle, and I think people were toying with the idea of using
this new helper to make use of them. But I can add a different kconfig to
hide this code and if people want to use it they will have to select it.

Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17 22:02       ` Jerome Glisse
@ 2017-08-17 22:06         ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2017-08-17 22:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh

On Thu, Aug 17, 2017 at 3:02 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Thu, Aug 17, 2017 at 02:59:20PM -0700, Dan Williams wrote:
>> On Thu, Aug 17, 2017 at 2:55 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Thu, Aug 17, 2017 at 02:39:16PM -0700, Andrew Morton wrote:
>> >> On Wed, 16 Aug 2017 20:05:29 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:
>> >>
>> >> > Heterogeneous Memory Management (HMM) (description and justification)
>> >>
>> >> The patchset adds 55 kbytes to x86_64's mm/*.o and there doesn't appear
>> >> to be any way of avoiding this overhead, or of avoiding whatever
>> >> runtime overheads are added.
>> >
>> > HMM have already been integrated in couple of Red Hat kernel and AFAIK there
>> > is no runtime performance issue reported. Thought the RHEL version does not
>> > use static key as Dan asked.
>> >
>> >>
>> >> It also adds 18k to arm's mm/*.o and arm doesn't support HMM at all.
>> >>
>> >> So that's all quite a lot of bloat for systems which get no benefit from
>> >> the patchset.  What can we do to improve this situation (a lot)?
>> >
>> > I will look into why object file grow so much on arm. My guess is that the
>> > new migrate code is the bulk of that. I can hide the new page migration code
>> > behind a kernel configuration flag.
>>
>> Shouldn't we completely disable all of it unless there is a driver in
>> the kernel that selects it?
>
> At one point people asked to be able to use the new migrate helper without
> HMM and hence why it is not behind any HMM kconfig.
>
> IIRC even ARM folks were interested pretty much all SOC have several DMA
> engine that site idle and i think people where toying with the idea of using
> this new helper to make use of them. But i can add a different kconfig to
> hide this code and if people want to use it they will have to select it.

If they were interested then I would expect their use case would be
included in this patchset so the infrastructure and at least one
consumer can land together.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17 21:59     ` Dan Williams
  2017-08-17 22:02       ` Jerome Glisse
@ 2017-08-17 22:16       ` Andrew Morton
  1 sibling, 0 replies; 66+ messages in thread
From: Andrew Morton @ 2017-08-17 22:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh

On Thu, 17 Aug 2017 14:59:20 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> On Thu, Aug 17, 2017 at 2:55 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> > On Thu, Aug 17, 2017 at 02:39:16PM -0700, Andrew Morton wrote:
> >> On Wed, 16 Aug 2017 20:05:29 -0400 Jérôme Glisse <jglisse@redhat.com> wrote:
> >>
> >> > Heterogeneous Memory Management (HMM) (description and justification)
> >>
> >> The patchset adds 55 kbytes to x86_64's mm/*.o and there doesn't appear
> >> to be any way of avoiding this overhead, or of avoiding whatever
> >> runtime overheads are added.
> >
> > HMM have already been integrated in couple of Red Hat kernel and AFAIK there
> > is no runtime performance issue reported. Thought the RHEL version does not
> > use static key as Dan asked.
> >
> >>
> >> It also adds 18k to arm's mm/*.o and arm doesn't support HMM at all.
> >>
> >> So that's all quite a lot of bloat for systems which get no benefit from
> >> the patchset.  What can we do to improve this situation (a lot)?
> >
> > I will look into why object file grow so much on arm. My guess is that the
> > new migrate code is the bulk of that. I can hide the new page migration code
> > behind a kernel configuration flag.
> 
> Shouldn't we completely disable all of it unless there is a driver in
> the kernel that selects it?

That would be typical (and nice).

I'm not sure that Red Hat's decision is a broad enough guide here. 
Someone who is using Linux to make a cash register or a thermostat
faces different tradeoffs...

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-08-17  0:05 ` [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
@ 2017-09-04  3:09   ` Bob Liu
  2017-09-04 15:51     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Bob Liu @ 2017-09-04  3:09 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	majiuyue, xieyisheng (A)

On 2017/8/17 8:05, Jérôme Glisse wrote:
> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.
> 

Got a new question: coherent device (e.g. CCIX) memory is likely reported to the OS
through ACPI and recognized as a NUMA memory node.
Then how can its memory be captured and managed by the HMM framework?

--
Regards,
Bob Liu

> Changed since v2:
>   - s/host/public
> Changed since v1:
>   - s/public/host
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>
> ---
>  include/linux/hmm.h |  3 ++
>  mm/hmm.c            | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 86 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 79e63178fd87..5866f3194c26 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -443,6 +443,9 @@ struct hmm_devmem {
>  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>  				  struct device *device,
>  				  unsigned long size);
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +					   struct device *device,
> +					   struct resource *res);
>  void hmm_devmem_remove(struct hmm_devmem *devmem);
>  
>  /*
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 1a1e79d390c1..3faa4d40295e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  	zone = page_zone(page);
>  
>  	mem_hotplug_begin();
> -	__remove_pages(zone, start_pfn, npages);
> +	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> +		__remove_pages(zone, start_pfn, npages);
> +	else
> +		arch_remove_memory(start_pfn << PAGE_SHIFT,
> +				   npages << PAGE_SHIFT);
>  	mem_hotplug_done();
>  
>  	hmm_devmem_radix_release(resource);
> @@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	if (is_ram == REGION_INTERSECTS)
>  		return -ENXIO;
>  
> -	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
> +		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
> +	else
> +		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +
>  	devmem->pagemap.res = devmem->resource;
>  	devmem->pagemap.page_fault = hmm_devmem_fault;
>  	devmem->pagemap.page_free = hmm_devmem_free;
> @@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	 * over the device memory is un-accessible thus we do not want to
>  	 * create a linear mapping for the memory like arch_add_memory()
>  	 * would do.
> +	 *
> +	 * For device public memory, which is accesible by the CPU, we do
> +	 * want the linear mapping and thus use arch_add_memory().
>  	 */
> -	ret = add_pages(nid, align_start >> PAGE_SHIFT,
> -			align_size >> PAGE_SHIFT, false);
> +	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> +		ret = arch_add_memory(nid, align_start, align_size, false);
> +	else
> +		ret = add_pages(nid, align_start >> PAGE_SHIFT,
> +				align_size >> PAGE_SHIFT, false);
>  	if (ret) {
>  		mem_hotplug_done();
>  		goto error_add_memory;
> @@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>  }
>  EXPORT_SYMBOL(hmm_devmem_add);
>  
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +					   struct device *device,
> +					   struct resource *res)
> +{
> +	struct hmm_devmem *devmem;
> +	int ret;
> +
> +	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
> +		return ERR_PTR(-EINVAL);
> +
> +	static_branch_enable(&device_private_key);
> +
> +	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
> +				   GFP_KERNEL, dev_to_node(device));
> +	if (!devmem)
> +		return ERR_PTR(-ENOMEM);
> +
> +	init_completion(&devmem->completion);
> +	devmem->pfn_first = -1UL;
> +	devmem->pfn_last = -1UL;
> +	devmem->resource = res;
> +	devmem->device = device;
> +	devmem->ops = ops;
> +
> +	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
> +			      0, GFP_KERNEL);
> +	if (ret)
> +		goto error_percpu_ref;
> +
> +	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
> +	if (ret)
> +		goto error_devm_add_action;
> +
> +
> +	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
> +	devmem->pfn_last = devmem->pfn_first +
> +			   (resource_size(devmem->resource) >> PAGE_SHIFT);
> +
> +	ret = hmm_devmem_pages_create(devmem);
> +	if (ret)
> +		goto error_devm_add_action;
> +
> +	devres_add(device, devmem);
> +
> +	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
> +	if (ret) {
> +		hmm_devmem_remove(devmem);
> +		return ERR_PTR(ret);
> +	}
> +
> +	return devmem;
> +
> +error_devm_add_action:
> +	hmm_devmem_ref_kill(&devmem->ref);
> +	hmm_devmem_ref_exit(&devmem->ref);
> +error_percpu_ref:
> +	devres_free(devmem);
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL(hmm_devmem_add_resource);
> +
>  /*
>   * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
>   *
> @@ -1097,6 +1172,7 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
>  {
>  	resource_size_t start, size;
>  	struct device *device;
> +	bool cdm = false;
>  
>  	if (!devmem)
>  		return;
> @@ -1105,11 +1181,13 @@ void hmm_devmem_remove(struct hmm_devmem *devmem)
>  	start = devmem->resource->start;
>  	size = resource_size(devmem->resource);
>  
> +	cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY;
>  	hmm_devmem_ref_kill(&devmem->ref);
>  	hmm_devmem_ref_exit(&devmem->ref);
>  	hmm_devmem_pages_remove(devmem);
>  
> -	devm_release_mem_region(device, start, size);
> +	if (!cdm)
> +		devm_release_mem_region(device, start, size);
>  }
>  EXPORT_SYMBOL(hmm_devmem_remove);
>  
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-04  3:09   ` Bob Liu
@ 2017-09-04 15:51     ` Jerome Glisse
  2017-09-05  1:13       ` Bob Liu
  2017-09-05  3:36       ` Balbir Singh
  0 siblings, 2 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-09-04 15:51 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A)

On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> On 2017/8/17 8:05, Jerome Glisse wrote:
> > Unlike unaddressable memory, coherent device memory has a real
> > resource associated with it on the system (as CPU can address
> > it). Add a new helper to hotplug such memory within the HMM
> > framework.
> > 
> 
> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> through ACPI and recognized as NUMA memory node.
> Then how can their memory be captured and managed by HMM framework?
> 

The only platform that has such memory today is powerpc and it is not
reported as regular memory by the firmware, hence why they need this helper.

I don't think anyone has defined anything yet for x86 and ACPI. As this is
memory on a PCIE-like interface, I don't expect it to be reported as a NUMA
memory node but as an io range like any regular PCIE resource. The device
driver, through capability flags, would then figure out if the link between
the device and the CPU is CCIX capable; if so it can use this helper to
hotplug it as device memory.
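
Roughly, the driver side could then look something like this (the
capability check and the ops are placeholders; only the two HMM helpers
exist in this series):

#include <linux/err.h>
#include <linux/hmm.h>

/* Placeholders for whatever the real driver provides. */
extern const struct hmm_devmem_ops example_ops;
extern bool example_link_is_cache_coherent(struct device *dev);

static int example_register_device_memory(struct device *dev,
					   struct resource *res,
					   unsigned long size)
{
	struct hmm_devmem *devmem;

	if (example_link_is_cache_coherent(dev))
		/* CPU addressable: hotplug as MEMORY_DEVICE_PUBLIC. */
		devmem = hmm_devmem_add_resource(&example_ops, dev, res);
	else
		/* Not CPU addressable: MEMORY_DEVICE_PRIVATE. */
		devmem = hmm_devmem_add(&example_ops, dev, size);

	return PTR_ERR_OR_ZERO(devmem);
}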

Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-04 15:51     ` Jerome Glisse
@ 2017-09-05  1:13       ` Bob Liu
  2017-09-05  2:38         ` Jerome Glisse
  2017-09-05  3:36       ` Balbir Singh
  1 sibling, 1 reply; 66+ messages in thread
From: Bob Liu @ 2017-09-05  1:13 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A),
	ross.zwisler

On 2017/9/4 23:51, Jerome Glisse wrote:
> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>> On 2017/8/17 8:05, Jerome Glisse wrote:
>>> Unlike unaddressable memory, coherent device memory has a real
>>> resource associated with it on the system (as CPU can address
>>> it). Add a new helper to hotplug such memory within the HMM
>>> framework.
>>>
>>
>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
>> through ACPI and recognized as NUMA memory node.
>> Then how can their memory be captured and managed by HMM framework?
>>
> 
> Only platform that has such memory today is powerpc and it is not reported
> as regular memory by the firmware hence why they need this helper.
> 
> I don't think anyone has defined anything yet for x86 and acpi. As this is

Not yet, but the ACPI spec now has a Heterogeneous Memory Attribute
Table (HMAT) defined in ACPI 6.2.
The HMAT can cover CPU-addressable memory types (though not non-cache coherent on-device memory).

Ross from Intel has already done some work on this, see:
https://lwn.net/Articles/724562/

arm64 supports ACPI also; there will likely be more of this kind of device when CCIX is out
(should be very soon if on schedule).

> memory on PCIE like interface then i don't expect it to be reported as NUMA
> memory node but as io range like any regular PCIE resources. Device driver
> through capabilities flags would then figure out if the link between the
> device and CPU is CCIX capable if so it can use this helper to hotplug it
> as device memory.
> 

From my point of view, cache coherent device memory will be popular soon and reported through ACPI/UEFI.
Extending NUMA policy still sounds more reasonable to me.

--
Thanks,
Bob Liu

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05  1:13       ` Bob Liu
@ 2017-09-05  2:38         ` Jerome Glisse
  2017-09-05  3:50           ` Bob Liu
  2017-09-05 18:54           ` Ross Zwisler
  0 siblings, 2 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-09-05  2:38 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A),
	ross.zwisler

On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> On 2017/9/4 23:51, Jerome Glisse wrote:
> > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >> On 2017/8/17 8:05, Jerome Glisse wrote:
> >>> Unlike unaddressable memory, coherent device memory has a real
> >>> resource associated with it on the system (as CPU can address
> >>> it). Add a new helper to hotplug such memory within the HMM
> >>> framework.
> >>>
> >>
> >> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> >> through ACPI and recognized as NUMA memory node.
> >> Then how can their memory be captured and managed by HMM framework?
> >>
> > 
> > Only platform that has such memory today is powerpc and it is not reported
> > as regular memory by the firmware hence why they need this helper.
> > 
> > I don't think anyone has defined anything yet for x86 and acpi. As this is
> 
> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> Table (HMAT) table defined in ACPI 6.2.
> The HMAT can cover CPU-addressable memory types(though not non-cache
> coherent on-device memory).
> 
> Ross from Intel already done some work on this, see:
> https://lwn.net/Articles/724562/
> 
> arm64 supports APCI also, there is likely more this kind of device when CCIX
> is out (should be very soon if on schedule).

HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory, ie
when you have several kinds of memory each with different characteristics:
  - HBM: very fast (latency) and high bandwidth, non persistent, somewhat
    small (ie a few gigabytes)
  - Persistent memory: slower (both latency and bandwidth), big (terabytes)
  - DDR (good old memory): characteristics are between HBM and persistent

So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
that device memory can have a hierarchy of memory itself (HBM, GDDR and
maybe even persistent memory).

> > memory on PCIE like interface then i don't expect it to be reported as NUMA
> > memory node but as io range like any regular PCIE resources. Device driver
> > through capabilities flags would then figure out if the link between the
> > device and CPU is CCIX capable if so it can use this helper to hotplug it
> > as device memory.
> > 
> 
> From my point of view,  Cache coherent device memory will popular soon and
> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> to me.

Cache coherent devices will be reported through standard mechanisms defined by
the bus standard they are using. To my knowledge all the standards are either
on top of PCIE or are similar to PCIE.

It is true that on many platforms PCIE resources are managed/initialized by the
BIOS (UEFI), but it is platform specific. In some cases we reprogram what the
BIOS picks.

So like I was saying I don't expect the BIOS/UEFI to report device memory as
regular memory. It will be reported as a regular PCIE resource and then the
device driver will be able to determine through some flags if the link between
the CPU(s) and the device is cache coherent or not. At that point the device
driver can register it with the HMM helper.


The whole NUMA discussion happened several times in the past; I suggest looking
in the mm list archive for it. But it was ruled out for several reasons. Off the
top of my head:
  - people hate CPU-less nodes and device memory is inherently CPU-less
  - device drivers want total control over memory and thus to be isolated from
    mm mechanisms, and doing all those special cases was not welcome
  - existing NUMA migration mechanisms are ill suited for this memory as
    access by the device to the memory is unknown to core mm and there
    is no easy way to report it or track it (this kind of depends on the
    platform and hardware)

I am likely missing other big points.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-04 15:51     ` Jerome Glisse
  2017-09-05  1:13       ` Bob Liu
@ 2017-09-05  3:36       ` Balbir Singh
  1 sibling, 0 replies; 66+ messages in thread
From: Balbir Singh @ 2017-09-05  3:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, majiuyue, xieyisheng (A)

On Tue, Sep 5, 2017 at 1:51 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>> > Unlike unaddressable memory, coherent device memory has a real
>> > resource associated with it on the system (as CPU can address
>> > it). Add a new helper to hotplug such memory within the HMM
>> > framework.
>> >
>>
>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS
>> through ACPI and recognized as NUMA memory node.
>> Then how can their memory be captured and managed by HMM framework?
>>
>
> Only platform that has such memory today is powerpc and it is not reported
> as regular memory by the firmware hence why they need this helper.
>
> I don't think anyone has defined anything yet for x86 and acpi. As this is
> memory on PCIE like interface then i don't expect it to be reported as NUMA
> memory node but as io range like any regular PCIE resources. Device driver
> through capabilities flags would then figure out if the link between the
> device and CPU is CCIX capable if so it can use this helper to hotplug it
> as device memory.

Yep, the arch needs to do the right thing at hotplug time, which is:

1. Don't online the memory as a NUMA node
2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver

Like Jerome said and as we tried as well, the NUMA approach needs more
agreement, discussion and probably extensions.

Balbir Singh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05  2:38         ` Jerome Glisse
@ 2017-09-05  3:50           ` Bob Liu
  2017-09-05 13:50             ` Jerome Glisse
  2017-09-05 18:54           ` Ross Zwisler
  1 sibling, 1 reply; 66+ messages in thread
From: Bob Liu @ 2017-09-05  3:50 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A),
	ross.zwisler, Mel Gorman, Rik van Riel, Michal Hocko

On 2017/9/5 10:38, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>> On 2017/8/17 8:05, Jerome Glisse wrote:
>>>>> Unlike unaddressable memory, coherent device memory has a real
>>>>> resource associated with it on the system (as CPU can address
>>>>> it). Add a new helper to hotplug such memory within the HMM
>>>>> framework.
>>>>>
>>>>
>>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
>>>> through ACPI and recognized as NUMA memory node.
>>>> Then how can their memory be captured and managed by HMM framework?
>>>>
>>>
>>> Only platform that has such memory today is powerpc and it is not reported
>>> as regular memory by the firmware hence why they need this helper.
>>>
>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>>
>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>> Table (HMAT) table defined in ACPI 6.2.
>> The HMAT can cover CPU-addressable memory types(though not non-cache
>> coherent on-device memory).
>>
>> Ross from Intel already done some work on this, see:
>> https://lwn.net/Articles/724562/
>>
>> arm64 supports APCI also, there is likely more this kind of device when CCIX
>> is out (should be very soon if on schedule).
> 
> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> when you have several kind of memory each with different characteristics:
>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>     small (ie few giga bytes)
>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>   - DDR (good old memory) well characteristics are between HBM and persistent
> 

Okay, then how does the kernel handle the situation of "several kinds of memory,
each with different characteristics"?
Does anyone have a suggestion? I thought HMM could do this.
NUMA policy/node distance is good but perhaps requires some extending, e.g. an HBM
node's memory shouldn't be swapped out and shouldn't accept DDR fallback allocations.
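
As a reference point, a small userspace sketch of today's policy API (the node
number is made up): MPOL_BIND can restrict a mapping to a memory-only node, but
nothing in this interface can say "never swap this node" or "never let unrelated
allocations fall back here", which is exactly the missing extension:

/*
 * Userspace sketch of the current NUMA policy interface (mbind() from
 * libnuma's <numaif.h>). HBM_NODE is a made-up node number.
 * Build with: gcc bind.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

#define HBM_NODE 1	/* hypothetical memory-only node */

int main(void)
{
	size_t len = 2UL << 20;	/* 2MB */
	unsigned long nodemask = 1UL << HBM_NODE;
	void *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/*
	 * MPOL_BIND restricts where these pages may come from, but it
	 * cannot express "do not swap this node" or "do not let other
	 * allocations fall back onto it".
	 */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	return 0;
}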

> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> maybe even persistent memory).
> 

This looks like a subset of HMAT, for the case where the CPU can address device memory directly in a cache-coherent way.


>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>>> memory node but as io range like any regular PCIE resources. Device driver
>>> through capabilities flags would then figure out if the link between the
>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>>> as device memory.
>>>
>>
>> From my point of view,  Cache coherent device memory will popular soon and
>> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
>> to me.
> 
> Cache coherent device will be reported through standard mecanisms defined by
> the bus standard they are using. To my knowledge all the standard are either
> on top of PCIE or are similar to PCIE.
> 
> It is true that on many platform PCIE resource is manage/initialize by the
> bios (UEFI) but it is platform specific. In some case we reprogram what the
> bios pick.
> 
> So like i was saying i don't expect the BIOS/UEFI to report device memory as

But it's happening.
In my understanding, that's why HMAT was introduced:
for reporting device memory as regular memory (with different characteristics).

--
Regards,
Bob Liu

> regular memory. It will be reported as a regular PCIE resources and then the
> device driver will be able to determine through some flags if the link between
> the CPU(s) and the device is cache coherent or not. At that point the device
> driver can use register it with HMM helper.
> 
> 
> The whole NUMA discussion happen several time in the past i suggest looking
> on mm list archive for them. But it was rule out for several reasons. Top of
> my head:
>   - people hate CPU less node and device memory is inherently CPU less
>   - device driver want total control over memory and thus to be isolated from
>     mm mecanism and doing all those special cases was not welcome
>   - existing NUMA migration mecanism are ill suited for this memory as
>     access by the device to the memory is unknown to core mm and there
>     is no easy way to report it or track it (this kind of depends on the
>     platform and hardware)
> 
> I am likely missing other big points.
> 
> Cheers,
> Jerome
> 
> .
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05  3:50           ` Bob Liu
@ 2017-09-05 13:50             ` Jerome Glisse
  2017-09-05 16:18               ` Dan Williams
  2017-09-05 19:00               ` Ross Zwisler
  0 siblings, 2 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-09-05 13:50 UTC (permalink / raw)
  To: Bob Liu
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A),
	ross.zwisler, Mel Gorman, Rik van Riel, Michal Hocko

On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> On 2017/9/5 10:38, Jerome Glisse wrote:
> > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>>> On 2017/8/17 8:05, Jerome Glisse wrote:
> >>>>> Unlike unaddressable memory, coherent device memory has a real
> >>>>> resource associated with it on the system (as CPU can address
> >>>>> it). Add a new helper to hotplug such memory within the HMM
> >>>>> framework.
> >>>>>
> >>>>
> >>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> >>>> through ACPI and recognized as NUMA memory node.
> >>>> Then how can their memory be captured and managed by HMM framework?
> >>>>
> >>>
> >>> Only platform that has such memory today is powerpc and it is not reported
> >>> as regular memory by the firmware hence why they need this helper.
> >>>
> >>> I don't think anyone has defined anything yet for x86 and acpi. As this is
> >>
> >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> >> Table (HMAT) table defined in ACPI 6.2.
> >> The HMAT can cover CPU-addressable memory types(though not non-cache
> >> coherent on-device memory).
> >>
> >> Ross from Intel already done some work on this, see:
> >> https://lwn.net/Articles/724562/
> >>
> >> arm64 supports APCI also, there is likely more this kind of device when CCIX
> >> is out (should be very soon if on schedule).
> > 
> > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> > when you have several kind of memory each with different characteristics:
> >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> >     small (ie few giga bytes)
> >   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
> >   - DDR (good old memory) well characteristics are between HBM and persistent
> > 
> 
> Okay, then how the kernel handle the situation of "kind of memory each with different characteristics"?
> Does someone have any suggestion?  I thought HMM can do this.
> Numa policy/node distance is good but perhaps require a few extending, e.g a HBM node can't be
> swap, can't accept DDR fallback allocation.

I don't think there is any consensus on this. I put forward the idea that NUMA
needs to be extended, as with a deep hierarchy it is not only the distance between
two nodes that matters but also other factors like persistency, bandwidth, latency ...


> > So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> > that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> > maybe even persistent memory).
> > 
> 
> This looks like a subset of HMAT when CPU can address device memory directly in cache-coherent way.

It is not; it is much more complex than that. The Linux kernel has no idea what is
going on inside a device and thus does not have any useful information to make proper
decisions regarding device memory. Here a device means a real device, ie something with
processing capability, not something like HBM or persistent memory, even if the
latter is associated with a struct device inside the Linux kernel.

> 
> 
> >>> memory on PCIE like interface then i don't expect it to be reported as NUMA
> >>> memory node but as io range like any regular PCIE resources. Device driver
> >>> through capabilities flags would then figure out if the link between the
> >>> device and CPU is CCIX capable if so it can use this helper to hotplug it
> >>> as device memory.
> >>>
> >>
> >> From my point of view,  Cache coherent device memory will popular soon and
> >> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> >> to me.
> > 
> > Cache coherent device will be reported through standard mecanisms defined by
> > the bus standard they are using. To my knowledge all the standard are either
> > on top of PCIE or are similar to PCIE.
> > 
> > It is true that on many platform PCIE resource is manage/initialize by the
> > bios (UEFI) but it is platform specific. In some case we reprogram what the
> > bios pick.
> > 
> > So like i was saying i don't expect the BIOS/UEFI to report device memory as
> 
> But it's happening.
> In my understanding, that's why HMAT was introduced.
> For reporting device memory as regular memory(with different characteristics).

That is not my understanding, but only Intel can confirm. HMAT was introduced
for things like HBM or persistent memory, which I do not consider device
memory. Sure, persistent memory is assigned a device struct, because it is easier
for integration with the block system I assume. But that does not make it a
device in my view. For me a device is a piece of hardware that has some
processing capabilities (network adapter, sound card, GPU, ...).

But we can argue about semantics and what a device is. For all intents and purposes,
a device in the HMM context is some piece of hardware with processing capabilities
and local device memory.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05 13:50             ` Jerome Glisse
@ 2017-09-05 16:18               ` Dan Williams
  2017-09-05 19:00               ` Ross Zwisler
  1 sibling, 0 replies; 66+ messages in thread
From: Dan Williams @ 2017-09-05 16:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, Andrew Morton, linux-kernel, Linux MM, John Hubbard,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A),
	Ross Zwisler, Mel Gorman, Rik van Riel, Michal Hocko

On Tue, Sep 5, 2017 at 6:50 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
>> On 2017/9/5 10:38, Jerome Glisse wrote:
>> > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>> >> On 2017/9/4 23:51, Jerome Glisse wrote:
>> >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>> >>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>> >>>>> Unlike unaddressable memory, coherent device memory has a real
>> >>>>> resource associated with it on the system (as CPU can address
>> >>>>> it). Add a new helper to hotplug such memory within the HMM
>> >>>>> framework.
>> >>>>>
>> >>>>
>> >>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS
>> >>>> through ACPI and recognized as NUMA memory node.
>> >>>> Then how can their memory be captured and managed by HMM framework?
>> >>>>
>> >>>
>> >>> Only platform that has such memory today is powerpc and it is not reported
>> >>> as regular memory by the firmware hence why they need this helper.
>> >>>
>> >>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>> >>
>> >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>> >> Table (HMAT) table defined in ACPI 6.2.
>> >> The HMAT can cover CPU-addressable memory types(though not non-cache
>> >> coherent on-device memory).
>> >>
>> >> Ross from Intel already done some work on this, see:
>> >> https://lwn.net/Articles/724562/
>> >>
>> >> arm64 supports APCI also, there is likely more this kind of device when CCIX
>> >> is out (should be very soon if on schedule).
>> >
>> > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
>> > when you have several kind of memory each with different characteristics:
>> >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>> >     small (ie few giga bytes)
>> >   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>> >   - DDR (good old memory) well characteristics are between HBM and persistent
>> >
>>
>> Okay, then how the kernel handle the situation of "kind of memory each with different characteristics"?
>> Does someone have any suggestion?  I thought HMM can do this.
>> Numa policy/node distance is good but perhaps require a few extending, e.g a HBM node can't be
>> swap, can't accept DDR fallback allocation.
>
> I don't think there is any consensus for this. I put forward the idea that NUMA
> needed to be extended as with deep hierarchy it is not only the distance between
> two nodes but also others factors like persistency, bandwidth, latency ...
>
>
>> > So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
>> > that device memory can have a hierarchy of memory themself (HBM, GDDR and in
>> > maybe even persistent memory).
>> >
>>
>> This looks like a subset of HMAT when CPU can address device memory directly in cache-coherent way.
>
> It is not, it is much more complex than that. Linux kernel has no idea on what is
> going on a device and thus do not have any usefull informations to make proper
> decission regarding device memory. Here device is real device ie something with
> processing capability, not something like HBM or persistent memory even if the
> latter is associated with a struct device inside linux kernel.
>
>>
>>
>> >>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>> >>> memory node but as io range like any regular PCIE resources. Device driver
>> >>> through capabilities flags would then figure out if the link between the
>> >>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>> >>> as device memory.
>> >>>
>> >>
>> >> From my point of view,  Cache coherent device memory will popular soon and
>> >> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
>> >> to me.
>> >
>> > Cache coherent device will be reported through standard mecanisms defined by
>> > the bus standard they are using. To my knowledge all the standard are either
>> > on top of PCIE or are similar to PCIE.
>> >
>> > It is true that on many platform PCIE resource is manage/initialize by the
>> > bios (UEFI) but it is platform specific. In some case we reprogram what the
>> > bios pick.
>> >
>> > So like i was saying i don't expect the BIOS/UEFI to report device memory as
>>
>> But it's happening.
>> In my understanding, that's why HMAT was introduced.
>> For reporting device memory as regular memory(with different characteristics).
>
> That is not my understanding but only Intel can confirm. HMAT was introduced
> for things like HBM or persistent memory. Which i do not consider as device
> memory. Sure persistent memory is assign a device struct because it is easier
> for integration with the block system i assume. But it does not make it a
> device in my view. For me a device is a piece of hardware that has some
> processing capabilities (network adapter, sound card, GPU, ...)
>
> But we can argue about semantic and what a device is. For all intent and purposes
> device in HMM context is some piece of hardware with processing capabilities and
> local device memory.

I would say that device memory at its base-level is a memory range
whose availability is dependent on a device-driver. HMM layers some
additional functionality on top, but ZONE_DEVICE should only be seen
as the device-driver controlled lifetime and not conflated with the
incremental HMM functionality.

HMAT simply allows you to associate a memory range with a numa-node /
proximity-domain number that represents a set of performance / feature
characteristics.
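
As a rough sketch of that base-level, driver-controlled lifetime (not from this
patchset; the dummy_* names are hypothetical, assuming the dev/res/ref/altmap
signature of devm_memremap_pages() in this kernel generation): the struct pages
for the range exist only while the driver keeps the devm-managed mapping alive,
with no HMM involvement at all:

/*
 * Sketch only: a driver-owned ZONE_DEVICE range with no HMM layered on
 * top. Everything prefixed with dummy_ is hypothetical.
 */
#include <linux/device.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>

static struct percpu_ref dummy_ref;

static void dummy_ref_release(struct percpu_ref *ref)
{
	/* last page reference dropped; the range can now be torn down */
}

static int dummy_map_device_memory(struct device *dev, struct resource *res)
{
	void *addr;
	int rc;

	rc = percpu_ref_init(&dummy_ref, dummy_ref_release, 0, GFP_KERNEL);
	if (rc)
		return rc;

	/*
	 * struct pages for this range exist only while this devm-managed
	 * mapping (and thus the driver) is alive -- the "device-driver
	 * controlled lifetime" part, independent of HMM.
	 */
	addr = devm_memremap_pages(dev, res, &dummy_ref, NULL);
	if (IS_ERR(addr)) {
		percpu_ref_exit(&dummy_ref);
		return PTR_ERR(addr);
	}
	return 0;
}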


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4
  2017-08-17  0:05 ` [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4 Jérôme Glisse
@ 2017-09-05 17:13   ` Laurent Dufour
  2017-09-05 17:21     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Laurent Dufour @ 2017-09-05 17:13 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, cgroups

On 17/08/2017 02:05, Jérôme Glisse wrote:
> HMM pages (private or public device pages) are ZONE_DEVICE page and
> thus need special handling when it comes to lru or refcount. This
> patch make sure that memcontrol properly handle those when it face
> them. Those pages are use like regular pages in a process address
> space either as anonymous page or as file back page. So from memcg
> point of view we want to handle them like regular page for now at
> least.
> 
> Changed since v3:
>   - remove public support and move those chunk to separate patch
> Changed since v2:
>   - s/host/public
> Changed since v1:
>   - s/public/host
>   - add comments explaining how device memory behave and why
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Acked-by: Balbir Singh <bsingharora@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> ---
>  kernel/memremap.c |  1 +
>  mm/memcontrol.c   | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 49 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 398630c1fba3..f42d7483e886 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -492,6 +492,7 @@ void put_zone_device_private_page(struct page *page)
>  		__ClearPageWaiters(page);
> 
>  		page->mapping = NULL;
> +		mem_cgroup_uncharge(page);
> 
>  		page->pgmap->page_free(page, page->pgmap->data);
>  	} else if (!count)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 604fb3ca8028..977d1cf3493a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4407,12 +4407,13 @@ enum mc_target_type {
>  	MC_TARGET_NONE = 0,
>  	MC_TARGET_PAGE,
>  	MC_TARGET_SWAP,
> +	MC_TARGET_DEVICE,
>  };
> 
>  static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>  						unsigned long addr, pte_t ptent)
>  {
> -	struct page *page = vm_normal_page(vma, addr, ptent);
> +	struct page *page = _vm_normal_page(vma, addr, ptent, true);

Hi Jérôme,

As _vm_normal_page() is only defined later, in patch 18, this patch should
break bisectability.

Cheers,
Laurent.

> 
>  	if (!page || !page_mapped(page))
>  		return NULL;
> @@ -4429,7 +4430,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>  	return page;
>  }
> 
> -#ifdef CONFIG_SWAP
> +#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
>  static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>  			pte_t ptent, swp_entry_t *entry)
>  {
> @@ -4438,6 +4439,23 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> 
>  	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
>  		return NULL;
> +
> +	/*
> +	 * Handle MEMORY_DEVICE_PRIVATE which are ZONE_DEVICE page belonging to
> +	 * a device and because they are not accessible by CPU they are store
> +	 * as special swap entry in the CPU page table.
> +	 */
> +	if (is_device_private_entry(ent)) {
> +		page = device_private_entry_to_page(ent);
> +		/*
> +		 * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
> +		 * a refcount of 1 when free (unlike normal page)
> +		 */
> +		if (!page_ref_add_unless(page, 1, 1))
> +			return NULL;
> +		return page;
> +	}
> +
>  	/*
>  	 * Because lookup_swap_cache() updates some statistics counter,
>  	 * we call find_get_page() with swapper_space directly.
> @@ -4598,6 +4616,12 @@ static int mem_cgroup_move_account(struct page *page,
>   *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
>   *     target for charge migration. if @target is not NULL, the entry is stored
>   *     in target->ent.
> + *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is MEMORY_DEVICE_PRIVATE
> + *     (so ZONE_DEVICE page and thus not on the lru). For now we such page is
> + *     charge like a regular page would be as for all intent and purposes it is
> + *     just special memory taking the place of a regular page.
> + *
> + *     See Documentations/vm/hmm.txt and include/linux/hmm.h
>   *
>   * Called with pte lock held.
>   */
> @@ -4626,6 +4650,8 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>  		 */
>  		if (page->mem_cgroup == mc.from) {
>  			ret = MC_TARGET_PAGE;
> +			if (is_device_private_page(page))
> +				ret = MC_TARGET_DEVICE;
>  			if (target)
>  				target->page = page;
>  		}
> @@ -4693,6 +4719,11 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> 
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
> +		/*
> +		 * Note their can not be MC_TARGET_DEVICE for now as we do not
> +		 * support transparent huge page with MEMORY_DEVICE_PUBLIC or
> +		 * MEMORY_DEVICE_PRIVATE but this might change.
> +		 */
>  		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
>  			mc.precharge += HPAGE_PMD_NR;
>  		spin_unlock(ptl);
> @@ -4908,6 +4939,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  				putback_lru_page(page);
>  			}
>  			put_page(page);
> +		} else if (target_type == MC_TARGET_DEVICE) {
> +			page = target.page;
> +			if (!mem_cgroup_move_account(page, true,
> +						     mc.from, mc.to)) {
> +				mc.precharge -= HPAGE_PMD_NR;
> +				mc.moved_charge += HPAGE_PMD_NR;
> +			}
> +			put_page(page);
>  		}
>  		spin_unlock(ptl);
>  		return 0;
> @@ -4919,12 +4958,16 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; addr += PAGE_SIZE) {
>  		pte_t ptent = *(pte++);
> +		bool device = false;
>  		swp_entry_t ent;
> 
>  		if (!mc.precharge)
>  			break;
> 
>  		switch (get_mctgt_type(vma, addr, ptent, &target)) {
> +		case MC_TARGET_DEVICE:
> +			device = true;
> +			/* fall through */
>  		case MC_TARGET_PAGE:
>  			page = target.page;
>  			/*
> @@ -4935,7 +4978,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  			 */
>  			if (PageTransCompound(page))
>  				goto put;
> -			if (isolate_lru_page(page))
> +			if (!device && isolate_lru_page(page))
>  				goto put;
>  			if (!mem_cgroup_move_account(page, false,
>  						mc.from, mc.to)) {
> @@ -4943,7 +4986,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  				/* we uncharge from mc.from later. */
>  				mc.moved_charge++;
>  			}
> -			putback_lru_page(page);
> +			if (!device)
> +				putback_lru_page(page);
>  put:			/* get_mctgt_type() gets the page */
>  			put_page(page);
>  			break;
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4
  2017-09-05 17:13   ` Laurent Dufour
@ 2017-09-05 17:21     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-09-05 17:21 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, cgroups

On Tue, Sep 05, 2017 at 07:13:15PM +0200, Laurent Dufour wrote:
> On 17/08/2017 02:05, Jerome Glisse wrote:
> > HMM pages (private or public device pages) are ZONE_DEVICE page and
> > thus need special handling when it comes to lru or refcount. This
> > patch make sure that memcontrol properly handle those when it face
> > them. Those pages are use like regular pages in a process address
> > space either as anonymous page or as file back page. So from memcg
> > point of view we want to handle them like regular page for now at
> > least.
> > 
> > Changed since v3:
> >   - remove public support and move those chunk to separate patch
> > Changed since v2:
> >   - s/host/public
> > Changed since v1:
> >   - s/public/host
> >   - add comments explaining how device memory behave and why
> > 
> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> > Acked-by: Balbir Singh <bsingharora@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> > Cc: cgroups@vger.kernel.org
> > ---
> >  kernel/memremap.c |  1 +
> >  mm/memcontrol.c   | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 49 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 398630c1fba3..f42d7483e886 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -492,6 +492,7 @@ void put_zone_device_private_page(struct page *page)
> >  		__ClearPageWaiters(page);
> > 
> >  		page->mapping = NULL;
> > +		mem_cgroup_uncharge(page);
> > 
> >  		page->pgmap->page_free(page, page->pgmap->data);
> >  	} else if (!count)
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 604fb3ca8028..977d1cf3493a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4407,12 +4407,13 @@ enum mc_target_type {
> >  	MC_TARGET_NONE = 0,
> >  	MC_TARGET_PAGE,
> >  	MC_TARGET_SWAP,
> > +	MC_TARGET_DEVICE,
> >  };
> > 
> >  static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> >  						unsigned long addr, pte_t ptent)
> >  {
> > -	struct page *page = vm_normal_page(vma, addr, ptent);
> > +	struct page *page = _vm_normal_page(vma, addr, ptent, true);
> 
> Hi Jerome,
> 
> As _vm_normal_page() is defined later in the patch 18, so this patch should
>  break the bisectability.

Correct, it seems I missed that when reorganizing the patch order. The vm_normal_page()
changes can be moved to patch 18, as it would be more logical to change the call
site in the patch that adds the new special function.

Dunno if the patch can be edited now?

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05  2:38         ` Jerome Glisse
  2017-09-05  3:50           ` Bob Liu
@ 2017-09-05 18:54           ` Ross Zwisler
  2017-09-06  1:25             ` Bob Liu
  1 sibling, 1 reply; 66+ messages in thread
From: Ross Zwisler @ 2017-09-05 18:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A),
	ross.zwisler

On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > On 2017/9/4 23:51, Jerome Glisse wrote:
> > > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > >> On 2017/8/17 8:05, Jerome Glisse wrote:
> > >>> Unlike unaddressable memory, coherent device memory has a real
> > >>> resource associated with it on the system (as CPU can address
> > >>> it). Add a new helper to hotplug such memory within the HMM
> > >>> framework.
> > >>>
> > >>
> > >> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> > >> through ACPI and recognized as NUMA memory node.
> > >> Then how can their memory be captured and managed by HMM framework?
> > >>
> > > 
> > > Only platform that has such memory today is powerpc and it is not reported
> > > as regular memory by the firmware hence why they need this helper.
> > > 
> > > I don't think anyone has defined anything yet for x86 and acpi. As this is
> > 
> > Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> > Table (HMAT) table defined in ACPI 6.2.
> > The HMAT can cover CPU-addressable memory types(though not non-cache
> > coherent on-device memory).
> > 
> > Ross from Intel already done some work on this, see:
> > https://lwn.net/Articles/724562/
> > 
> > arm64 supports APCI also, there is likely more this kind of device when CCIX
> > is out (should be very soon if on schedule).
> 
> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> when you have several kind of memory each with different characteristics:
>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>     small (ie few giga bytes)
>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>   - DDR (good old memory) well characteristics are between HBM and persistent
> 
> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> maybe even persistent memory).
> 
> > > memory on PCIE like interface then i don't expect it to be reported as NUMA
> > > memory node but as io range like any regular PCIE resources. Device driver
> > > through capabilities flags would then figure out if the link between the
> > > device and CPU is CCIX capable if so it can use this helper to hotplug it
> > > as device memory.
> > > 
> > 
> > From my point of view,  Cache coherent device memory will popular soon and
> > reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> > to me.
> 
> Cache coherent device will be reported through standard mecanisms defined by
> the bus standard they are using. To my knowledge all the standard are either
> on top of PCIE or are similar to PCIE.
> 
> It is true that on many platform PCIE resource is manage/initialize by the
> bios (UEFI) but it is platform specific. In some case we reprogram what the
> bios pick.
> 
> So like i was saying i don't expect the BIOS/UEFI to report device memory as
> regular memory. It will be reported as a regular PCIE resources and then the
> device driver will be able to determine through some flags if the link between
> the CPU(s) and the device is cache coherent or not. At that point the device
> driver can use register it with HMM helper.
> 
> 
> The whole NUMA discussion happen several time in the past i suggest looking
> on mm list archive for them. But it was rule out for several reasons. Top of
> my head:
>   - people hate CPU less node and device memory is inherently CPU less

With the introduction of the HMAT in ACPI 6.2 one of the things that was added
was the ability to have an ACPI proximity domain that isn't associated with a
CPU.  This can be seen in the changes in the text of the "Proximity Domain"
field in table 5-73 which describes the "Memory Affinity Structure".  One of
the major features of the HMAT was the separation of "Initiator" proximity
domains (CPUs, devices that initiate memory transfers), and "target" proximity
domains (memory regions, be they attached to a CPU or some other device).

ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
already in a place where we have to support CPU-less NUMA nodes.

>   - device driver want total control over memory and thus to be isolated from
>     mm mecanism and doing all those special cases was not welcome

I agree that the kernel doesn't have enough information to be able to
accurately handle all the use cases for the various types of heterogeneous
memory.   The goal of my HMAT enabling is to allow that memory to be reserved
from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
Address Range Structure, then provide userspace with enough information to be
able to distinguish between the various types of memory in the system so it
can allocate & utilize it appropriately.

>   - existing NUMA migration mecanism are ill suited for this memory as
>     access by the device to the memory is unknown to core mm and there
>     is no easy way to report it or track it (this kind of depends on the
>     platform and hardware)
> 
> I am likely missing other big points.
> 
> Cheers,
> Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05 13:50             ` Jerome Glisse
  2017-09-05 16:18               ` Dan Williams
@ 2017-09-05 19:00               ` Ross Zwisler
  2017-09-05 19:20                 ` Jerome Glisse
  1 sibling, 1 reply; 66+ messages in thread
From: Ross Zwisler @ 2017-09-05 19:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Bob Liu, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A),
	ross.zwisler, Mel Gorman, Rik van Riel, Michal Hocko

On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > >>>> On 2017/8/17 8:05, Jerome Glisse wrote:
> > >>>>> Unlike unaddressable memory, coherent device memory has a real
> > >>>>> resource associated with it on the system (as CPU can address
> > >>>>> it). Add a new helper to hotplug such memory within the HMM
> > >>>>> framework.
> > >>>>>
> > >>>>
> > >>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> > >>>> through ACPI and recognized as NUMA memory node.
> > >>>> Then how can their memory be captured and managed by HMM framework?
> > >>>>
> > >>>
> > >>> Only platform that has such memory today is powerpc and it is not reported
> > >>> as regular memory by the firmware hence why they need this helper.
> > >>>
> > >>> I don't think anyone has defined anything yet for x86 and acpi. As this is
> > >>
> > >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> > >> Table (HMAT) table defined in ACPI 6.2.
> > >> The HMAT can cover CPU-addressable memory types(though not non-cache
> > >> coherent on-device memory).
> > >>
> > >> Ross from Intel already done some work on this, see:
> > >> https://lwn.net/Articles/724562/
> > >>
> > >> arm64 supports APCI also, there is likely more this kind of device when CCIX
> > >> is out (should be very soon if on schedule).
> > > 
> > > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> > > when you have several kind of memory each with different characteristics:
> > >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> > >     small (ie few giga bytes)
> > >   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
> > >   - DDR (good old memory) well characteristics are between HBM and persistent
> > > 
> > 
> > Okay, then how the kernel handle the situation of "kind of memory each with different characteristics"?
> > Does someone have any suggestion?  I thought HMM can do this.
> > Numa policy/node distance is good but perhaps require a few extending, e.g a HBM node can't be
> > swap, can't accept DDR fallback allocation.
> 
> I don't think there is any consensus for this. I put forward the idea that NUMA
> needed to be extended as with deep hierarchy it is not only the distance between
> two nodes but also others factors like persistency, bandwidth, latency ...
> 
> 
> > > So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> > > that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> > > maybe even persistent memory).
> > > 
> > 
> > This looks like a subset of HMAT when CPU can address device memory directly in cache-coherent way.
> 
> It is not, it is much more complex than that. Linux kernel has no idea on what is
> going on a device and thus do not have any usefull informations to make proper
> decission regarding device memory. Here device is real device ie something with
> processing capability, not something like HBM or persistent memory even if the
> latter is associated with a struct device inside linux kernel.
> 
> > 
> > 
> > >>> memory on PCIE like interface then i don't expect it to be reported as NUMA
> > >>> memory node but as io range like any regular PCIE resources. Device driver
> > >>> through capabilities flags would then figure out if the link between the
> > >>> device and CPU is CCIX capable if so it can use this helper to hotplug it
> > >>> as device memory.
> > >>>
> > >>
> > >> From my point of view,  Cache coherent device memory will popular soon and
> > >> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> > >> to me.
> > > 
> > > Cache coherent device will be reported through standard mecanisms defined by
> > > the bus standard they are using. To my knowledge all the standard are either
> > > on top of PCIE or are similar to PCIE.
> > > 
> > > It is true that on many platform PCIE resource is manage/initialize by the
> > > bios (UEFI) but it is platform specific. In some case we reprogram what the
> > > bios pick.
> > > 
> > > So like i was saying i don't expect the BIOS/UEFI to report device memory as
> > 
> > But it's happening.
> > In my understanding, that's why HMAT was introduced.
> > For reporting device memory as regular memory(with different characteristics).
> 
> That is not my understanding but only Intel can confirm. HMAT was introduced
> for things like HBM or persistent memory. Which i do not consider as device
> memory. Sure persistent memory is assign a device struct because it is easier
> for integration with the block system i assume. But it does not make it a
> device in my view. For me a device is a piece of hardware that has some
> processing capabilities (network adapter, sound card, GPU, ...)
> 
> But we can argue about semantic and what a device is. For all intent and purposes
> device in HMM context is some piece of hardware with processing capabilities and
> local device memory.

I personally don't see a reason why we couldn't use the HMAT to describe
device memory.  The idea of having memory-only NUMA nodes is already a reality
post-HMAT, and the HMAT is just there to give you information on the memory
ranges in the system.  I realize that you may need a different device driver
to set the memory up, but once you do set it up and it's cache coherent,
doesn't it just look like any other memory range where you can say things
like:

My memory starts at X
My memory has size Y
My memory's performance from CPU Z is XXX (latency, bandwidth, read & write)
etc?

- Ross


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05 19:00               ` Ross Zwisler
@ 2017-09-05 19:20                 ` Jerome Glisse
  2017-09-08 19:43                   ` Ross Zwisler
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-09-05 19:20 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Bob Liu, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A),
	Mel Gorman, Rik van Riel, Michal Hocko

On Tue, Sep 05, 2017 at 01:00:13PM -0600, Ross Zwisler wrote:
> On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> > On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > > >>>> On 2017/8/17 8:05, Jerome Glisse wrote:
> > > >>>>> Unlike unaddressable memory, coherent device memory has a real
> > > >>>>> resource associated with it on the system (as CPU can address
> > > >>>>> it). Add a new helper to hotplug such memory within the HMM
> > > >>>>> framework.
> > > >>>>>
> > > >>>>
> > > >>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> > > >>>> through ACPI and recognized as NUMA memory node.
> > > >>>> Then how can their memory be captured and managed by HMM framework?
> > > >>>>
> > > >>>
> > > >>> Only platform that has such memory today is powerpc and it is not reported
> > > >>> as regular memory by the firmware hence why they need this helper.
> > > >>>
> > > >>> I don't think anyone has defined anything yet for x86 and acpi. As this is
> > > >>
> > > >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> > > >> Table (HMAT) table defined in ACPI 6.2.
> > > >> The HMAT can cover CPU-addressable memory types(though not non-cache
> > > >> coherent on-device memory).
> > > >>
> > > >> Ross from Intel already done some work on this, see:
> > > >> https://lwn.net/Articles/724562/
> > > >>
> > > >> arm64 supports APCI also, there is likely more this kind of device when CCIX
> > > >> is out (should be very soon if on schedule).
> > > > 
> > > > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> > > > when you have several kind of memory each with different characteristics:
> > > >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> > > >     small (ie few giga bytes)
> > > >   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
> > > >   - DDR (good old memory) well characteristics are between HBM and persistent
> > > > 
> > > 
> > > Okay, then how the kernel handle the situation of "kind of memory each with different characteristics"?
> > > Does someone have any suggestion?  I thought HMM can do this.
> > > Numa policy/node distance is good but perhaps require a few extending, e.g a HBM node can't be
> > > swap, can't accept DDR fallback allocation.
> > 
> > I don't think there is any consensus for this. I put forward the idea that NUMA
> > needed to be extended as with deep hierarchy it is not only the distance between
> > two nodes but also others factors like persistency, bandwidth, latency ...
> > 
> > 
> > > > So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> > > > that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> > > > maybe even persistent memory).
> > > > 
> > > 
> > > This looks like a subset of HMAT when CPU can address device memory directly in cache-coherent way.
> > 
> > It is not, it is much more complex than that. Linux kernel has no idea on what is
> > going on a device and thus do not have any usefull informations to make proper
> > decission regarding device memory. Here device is real device ie something with
> > processing capability, not something like HBM or persistent memory even if the
> > latter is associated with a struct device inside linux kernel.
> > 
> > > 
> > > 
> > > >>> memory on PCIE like interface then i don't expect it to be reported as NUMA
> > > >>> memory node but as io range like any regular PCIE resources. Device driver
> > > >>> through capabilities flags would then figure out if the link between the
> > > >>> device and CPU is CCIX capable if so it can use this helper to hotplug it
> > > >>> as device memory.
> > > >>>
> > > >>
> > > >> From my point of view,  Cache coherent device memory will popular soon and
> > > >> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> > > >> to me.
> > > > 
> > > > Cache coherent device will be reported through standard mecanisms defined by
> > > > the bus standard they are using. To my knowledge all the standard are either
> > > > on top of PCIE or are similar to PCIE.
> > > > 
> > > > It is true that on many platform PCIE resource is manage/initialize by the
> > > > bios (UEFI) but it is platform specific. In some case we reprogram what the
> > > > bios pick.
> > > > 
> > > > So like i was saying i don't expect the BIOS/UEFI to report device memory as
> > > 
> > > But it's happening.
> > > In my understanding, that's why HMAT was introduced.
> > > For reporting device memory as regular memory(with different characteristics).
> > 
> > That is not my understanding but only Intel can confirm. HMAT was introduced
> > for things like HBM or persistent memory. Which i do not consider as device
> > memory. Sure persistent memory is assign a device struct because it is easier
> > for integration with the block system i assume. But it does not make it a
> > device in my view. For me a device is a piece of hardware that has some
> > processing capabilities (network adapter, sound card, GPU, ...)
> > 
> > But we can argue about semantic and what a device is. For all intent and purposes
> > device in HMM context is some piece of hardware with processing capabilities and
> > local device memory.
> 
> I personally don't see a reason why we couldn't use the HMAT to describe
> device memory.  The idea of having memory-only NUMA nodes is already a realty
> post-HMAT, and the HMAT is just there to give you information on the memory
> ranges in the system.  I realize that you may need a different device driver
> to set the memory up, but once you do set it up and it's cache coherent,
> doesn't it just look like any other memory range where you can say things
> like:
> 
> My memory starts at X
> My memory has size Y
> My memory's performance from CPU Z is XXX (latency, bandwidth, read & write)
> etc?

Does HMAT support device hotplug? I am unfamiliar with the whole inner workings
of ACPI versus PCIE. Anyway, I don't see any issue with device memory also showing up
through HMAT, but like I said, the device driver for the device will want to be in
total control of that memory.

Like I said, the issue here is that the core kernel is unaware of device activity, ie
which part of memory the device is actively working on. So core mm can not make
informed decisions on what should be migrated to device memory. Also, we do not want
regular memory allocations to end up in device memory unless explicitly asked for.
There are a few reasons for that. First, this memory might not only be used for compute
tasks but also for graphics, and in that case there are hard constraints on physically
contiguous memory allocation that require the GPU to move things around to make
room for graphic objects (so we can't allow GUP).

Second reason: the device memory is inherently unreliable. If there is a bug
in the device driver, or the user manages to trigger a faulty condition on the GPU,
the device might need a hard reset (ie cutting PCIE power to the device), which leads
to loss of memory content. While GPUs are becoming more and more resilient, they
are still prone to lockups.

Finally, for GPUs there is a common pattern of memory over-commit. You pretend to
each application that it is the only one, and allow each of them to allocate
all of the device memory, or more than it could under strict sharing. As GPUs have
long timeslices between switches to a different context/application, they can
easily move large chunks of process memory in and out at context/application
switch. This has proven to be a key aspect of achieving maximum performance
across several concurrent applications/contexts.

The easiest way to implement this is for the device to lie about how much memory
it has and use system memory as an overflow.


I am not saying that NUMA is not the way forward; I am saying that as it is today
it is not suited for this. It is lacking metrics, it is lacking logic, it is lacking
features. We could add all this, but it is a lot of work and I don't feel that we
have enough real world experience to do so now. I would rather have each device
grow proper infrastructure in its driver through device specific APIs.

Then identify common patterns and from there try to build a sane API (if any such
thing exists :)) rather than trying today to build the whole house from the ground
up with just a foggy idea of how it should look in the end.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05 18:54           ` Ross Zwisler
@ 2017-09-06  1:25             ` Bob Liu
  2017-09-06  2:12               ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Bob Liu @ 2017-09-06  1:25 UTC (permalink / raw)
  To: Ross Zwisler, Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh, majiuyue, xieyisheng (A)

On 2017/9/6 2:54, Ross Zwisler wrote:
> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>>> On 2017/8/17 8:05, Jerome Glisse wrote:
>>>>>> Unlike unaddressable memory, coherent device memory has a real
>>>>>> resource associated with it on the system (as CPU can address
>>>>>> it). Add a new helper to hotplug such memory within the HMM
>>>>>> framework.
>>>>>>
>>>>>
>>>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
>>>>> through ACPI and recognized as NUMA memory node.
>>>>> Then how can their memory be captured and managed by HMM framework?
>>>>>
>>>>
>>>> Only platform that has such memory today is powerpc and it is not reported
>>>> as regular memory by the firmware hence why they need this helper.
>>>>
>>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>>>
>>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>>> Table (HMAT) table defined in ACPI 6.2.
>>> The HMAT can cover CPU-addressable memory types(though not non-cache
>>> coherent on-device memory).
>>>
>>> Ross from Intel already done some work on this, see:
>>> https://lwn.net/Articles/724562/
>>>
>>> arm64 supports APCI also, there is likely more this kind of device when CCIX
>>> is out (should be very soon if on schedule).
>>
>> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
>> when you have several kind of memory each with different characteristics:
>>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>>     small (ie few giga bytes)
>>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>>   - DDR (good old memory) well characteristics are between HBM and persistent
>>
>> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
>> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
>> maybe even persistent memory).
>>
>>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>>>> memory node but as io range like any regular PCIE resources. Device driver
>>>> through capabilities flags would then figure out if the link between the
>>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>>>> as device memory.
>>>>
>>>
>>> From my point of view,  Cache coherent device memory will popular soon and
>>> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
>>> to me.
>>
>> Cache coherent device will be reported through standard mecanisms defined by
>> the bus standard they are using. To my knowledge all the standard are either
>> on top of PCIE or are similar to PCIE.
>>
>> It is true that on many platform PCIE resource is manage/initialize by the
>> bios (UEFI) but it is platform specific. In some case we reprogram what the
>> bios pick.
>>
>> So like i was saying i don't expect the BIOS/UEFI to report device memory as
>> regular memory. It will be reported as a regular PCIE resources and then the
>> device driver will be able to determine through some flags if the link between
>> the CPU(s) and the device is cache coherent or not. At that point the device
>> driver can use register it with HMM helper.
>>
>>
>> The whole NUMA discussion happen several time in the past i suggest looking
>> on mm list archive for them. But it was rule out for several reasons. Top of
>> my head:
>>   - people hate CPU less node and device memory is inherently CPU less
> 
> With the introduction of the HMAT in ACPI 6.2 one of the things that was added
> was the ability to have an ACPI proximity domain that isn't associated with a
> CPU.  This can be seen in the changes in the text of the "Proximity Domain"
> field in table 5-73 which describes the "Memory Affinity Structure".  One of
> the major features of the HMAT was the separation of "Initiator" proximity
> domains (CPUs, devices that initiate memory transfers), and "target" proximity
> domains (memory regions, be they attached to a CPU or some other device).
> 
> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> already in a place where we have to support CPU-less NUMA nodes.
> 
>>   - device driver want total control over memory and thus to be isolated from
>>     mm mecanism and doing all those special cases was not welcome
> 
> I agree that the kernel doesn't have enough information to be able to
> accurately handle all the use cases for the various types of heterogeneous
> memory.   The goal of my HMAT enabling is to allow that memory to be reserved
> from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
> Address Range Structure, then provide userspace with enough information to be
> able to distinguish between the various types of memory in the system so it
> can allocate & utilize it appropriately.
> 

Does this mean we require a userspace memory management library to deal with all the
alloc/free/defragment work? And how would the virtual <-> physical address mapping be
handled from userspace?

--
Regards,
Bob Liu

>>   - existing NUMA migration mecanism are ill suited for this memory as
>>     access by the device to the memory is unknown to core mm and there
>>     is no easy way to report it or track it (this kind of depends on the
>>     platform and hardware)
>>
>> I am likely missing other big points.
>>
>> Cheers,
>> Jerome



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-06  1:25             ` Bob Liu
@ 2017-09-06  2:12               ` Jerome Glisse
  2017-09-07  2:06                 ` Bob Liu
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-09-06  2:12 UTC (permalink / raw)
  To: Bob Liu
  Cc: Ross Zwisler, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A)

On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> On 2017/9/6 2:54, Ross Zwisler wrote:
> > On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> >> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >>> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>>>> On 2017/8/17 8:05, Jerome Glisse wrote:
> >>>>>> Unlike unaddressable memory, coherent device memory has a real
> >>>>>> resource associated with it on the system (as CPU can address
> >>>>>> it). Add a new helper to hotplug such memory within the HMM
> >>>>>> framework.
> >>>>>>
> >>>>>
> >>>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
> >>>>> through ACPI and recognized as NUMA memory node.
> >>>>> Then how can their memory be captured and managed by HMM framework?
> >>>>>
> >>>>
> >>>> Only platform that has such memory today is powerpc and it is not reported
> >>>> as regular memory by the firmware hence why they need this helper.
> >>>>
> >>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
> >>>
> >>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> >>> Table (HMAT) table defined in ACPI 6.2.
> >>> The HMAT can cover CPU-addressable memory types(though not non-cache
> >>> coherent on-device memory).
> >>>
> >>> Ross from Intel already done some work on this, see:
> >>> https://lwn.net/Articles/724562/
> >>>
> >>> arm64 supports APCI also, there is likely more this kind of device when CCIX
> >>> is out (should be very soon if on schedule).
> >>
> >> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> >> when you have several kind of memory each with different characteristics:
> >>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> >>     small (ie few giga bytes)
> >>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
> >>   - DDR (good old memory) well characteristics are between HBM and persistent
> >>
> >> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> >> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> >> maybe even persistent memory).
> >>
> >>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
> >>>> memory node but as io range like any regular PCIE resources. Device driver
> >>>> through capabilities flags would then figure out if the link between the
> >>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
> >>>> as device memory.
> >>>>
> >>>
> >>> From my point of view,  Cache coherent device memory will popular soon and
> >>> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> >>> to me.
> >>
> >> Cache coherent device will be reported through standard mecanisms defined by
> >> the bus standard they are using. To my knowledge all the standard are either
> >> on top of PCIE or are similar to PCIE.
> >>
> >> It is true that on many platform PCIE resource is manage/initialize by the
> >> bios (UEFI) but it is platform specific. In some case we reprogram what the
> >> bios pick.
> >>
> >> So like i was saying i don't expect the BIOS/UEFI to report device memory as
> >> regular memory. It will be reported as a regular PCIE resources and then the
> >> device driver will be able to determine through some flags if the link between
> >> the CPU(s) and the device is cache coherent or not. At that point the device
> >> driver can use register it with HMM helper.
> >>
> >>
> >> The whole NUMA discussion happen several time in the past i suggest looking
> >> on mm list archive for them. But it was rule out for several reasons. Top of
> >> my head:
> >>   - people hate CPU less node and device memory is inherently CPU less
> > 
> > With the introduction of the HMAT in ACPI 6.2 one of the things that was added
> > was the ability to have an ACPI proximity domain that isn't associated with a
> > CPU.  This can be seen in the changes in the text of the "Proximity Domain"
> > field in table 5-73 which describes the "Memory Affinity Structure".  One of
> > the major features of the HMAT was the separation of "Initiator" proximity
> > domains (CPUs, devices that initiate memory transfers), and "target" proximity
> > domains (memory regions, be they attached to a CPU or some other device).
> > 
> > ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> > already in a place where we have to support CPU-less NUMA nodes.
> > 
> >>   - device driver want total control over memory and thus to be isolated from
> >>     mm mecanism and doing all those special cases was not welcome
> > 
> > I agree that the kernel doesn't have enough information to be able to
> > accurately handle all the use cases for the various types of heterogeneous
> > memory.   The goal of my HMAT enabling is to allow that memory to be reserved
> > from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
> > Address Range Structure, then provide userspace with enough information to be
> > able to distinguish between the various types of memory in the system so it
> > can allocate & utilize it appropriately.
> > 
> 
> Does this mean require an user space memory management library to deal with all
> alloc/free/defragment.. But how to do with virtual <-> physical address mapping
> from userspace?

For HMM each process gives hints (somewhat similar to mbind) for ranges of virtual
addresses to the kernel device driver (through some API like OpenCL or CUDA for GPU
for instance). All of this goes through device driver specific ioctls.

The kernel device driver has an overall view of all the processes that use the device
and of the memory advice each of them gave. From that information the kernel device
driver decides which parts of each process address space to migrate to device memory.
This is obviously dynamic and likely to change over the process lifetime.
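
To make that concrete, here is a minimal sketch of what such a driver-private advice
ioctl might look like. This is not from the patchset; struct foo_device, struct
foo_madvise, FOO_ADVISE_PREFER_DEVICE and foo_record_advice() are made-up names for
illustration only.

#include <linux/types.h>
#include <linux/uaccess.h>
#include <linux/sched.h>

struct foo_device;	/* opaque, driver specific */

/* Made-up advice value and bookkeeping helper (body omitted). */
#define FOO_ADVISE_PREFER_DEVICE	1
static int foo_record_advice(struct foo_device *fdev, struct mm_struct *mm,
			     u64 addr, u64 size, u32 advice);

struct foo_madvise {
	__u64 addr;	/* start of the virtual address range */
	__u64 size;	/* length of the range in bytes */
	__u32 advice;	/* e.g. FOO_ADVISE_PREFER_DEVICE */
	__u32 pad;
};

static long foo_ioctl_madvise(struct foo_device *fdev,
			      struct foo_madvise __user *uarg)
{
	struct foo_madvise args;

	if (copy_from_user(&args, uarg, sizeof(args)))
		return -EFAULT;

	/*
	 * Only remember the hint here; later, based on all the hints it got
	 * and on device memory pressure, the driver picks which ranges to
	 * actually migrate with the migrate_vma() helper added by this series.
	 */
	return foo_record_advice(fdev, current->mm, args.addr, args.size,
				 args.advice);
}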


My understanding is that HMAT wants a similar API to allow a process to give directions
on where each range of virtual addresses should be allocated. It is expected that most
software can easily infer which parts of its address space will need more bandwidth or
smaller latency versus which parts are sparsely accessed ...

For HMAT I think the first targets are HBM and persistent memory; device memory might
be added later if that makes sense.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-06  2:12               ` Jerome Glisse
@ 2017-09-07  2:06                 ` Bob Liu
  2017-09-07 17:00                   ` Jerome Glisse
  2017-09-07 17:27                   ` Jerome Glisse
  0 siblings, 2 replies; 66+ messages in thread
From: Bob Liu @ 2017-09-07  2:06 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ross Zwisler, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A)

On 2017/9/6 10:12, Jerome Glisse wrote:
> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
>> On 2017/9/6 2:54, Ross Zwisler wrote:
>>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>>>>>> On 2017/8/17 8:05, Jerome Glisse wrote:
>>>>>>>> Unlike unaddressable memory, coherent device memory has a real
>>>>>>>> resource associated with it on the system (as CPU can address
>>>>>>>> it). Add a new helper to hotplug such memory within the HMM
>>>>>>>> framework.
>>>>>>>>
>>>>>>>
>>>>>>> Got an new question, coherent device( e.g CCIX) memory are likely reported to OS 
>>>>>>> through ACPI and recognized as NUMA memory node.
>>>>>>> Then how can their memory be captured and managed by HMM framework?
>>>>>>>
>>>>>>
>>>>>> Only platform that has such memory today is powerpc and it is not reported
>>>>>> as regular memory by the firmware hence why they need this helper.
>>>>>>
>>>>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>>>>>
>>>>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>>>>> Table (HMAT) table defined in ACPI 6.2.
>>>>> The HMAT can cover CPU-addressable memory types(though not non-cache
>>>>> coherent on-device memory).
>>>>>
>>>>> Ross from Intel already done some work on this, see:
>>>>> https://lwn.net/Articles/724562/
>>>>>
>>>>> arm64 supports APCI also, there is likely more this kind of device when CCIX
>>>>> is out (should be very soon if on schedule).
>>>>
>>>> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
>>>> when you have several kind of memory each with different characteristics:
>>>>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>>>>     small (ie few giga bytes)
>>>>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>>>>   - DDR (good old memory) well characteristics are between HBM and persistent
>>>>
>>>> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
>>>> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
>>>> maybe even persistent memory).
>>>>
>>>>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>>>>>> memory node but as io range like any regular PCIE resources. Device driver
>>>>>> through capabilities flags would then figure out if the link between the
>>>>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>>>>>> as device memory.
>>>>>>
>>>>>
>>>>> From my point of view,  Cache coherent device memory will popular soon and
>>>>> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
>>>>> to me.
>>>>
>>>> Cache coherent device will be reported through standard mecanisms defined by
>>>> the bus standard they are using. To my knowledge all the standard are either
>>>> on top of PCIE or are similar to PCIE.
>>>>
>>>> It is true that on many platform PCIE resource is manage/initialize by the
>>>> bios (UEFI) but it is platform specific. In some case we reprogram what the
>>>> bios pick.
>>>>
>>>> So like i was saying i don't expect the BIOS/UEFI to report device memory as
>>>> regular memory. It will be reported as a regular PCIE resources and then the
>>>> device driver will be able to determine through some flags if the link between
>>>> the CPU(s) and the device is cache coherent or not. At that point the device
>>>> driver can use register it with HMM helper.
>>>>
>>>>
>>>> The whole NUMA discussion happen several time in the past i suggest looking
>>>> on mm list archive for them. But it was rule out for several reasons. Top of
>>>> my head:
>>>>   - people hate CPU less node and device memory is inherently CPU less
>>>
>>> With the introduction of the HMAT in ACPI 6.2 one of the things that was added
>>> was the ability to have an ACPI proximity domain that isn't associated with a
>>> CPU.  This can be seen in the changes in the text of the "Proximity Domain"
>>> field in table 5-73 which describes the "Memory Affinity Structure".  One of
>>> the major features of the HMAT was the separation of "Initiator" proximity
>>> domains (CPUs, devices that initiate memory transfers), and "target" proximity
>>> domains (memory regions, be they attached to a CPU or some other device).
>>>
>>> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
>>> already in a place where we have to support CPU-less NUMA nodes.
>>>
>>>>   - device driver want total control over memory and thus to be isolated from
>>>>     mm mecanism and doing all those special cases was not welcome
>>>
>>> I agree that the kernel doesn't have enough information to be able to
>>> accurately handle all the use cases for the various types of heterogeneous
>>> memory.   The goal of my HMAT enabling is to allow that memory to be reserved
>>> from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
>>> Address Range Structure, then provide userspace with enough information to be
>>> able to distinguish between the various types of memory in the system so it
>>> can allocate & utilize it appropriately.
>>>
>>
>> Does this mean require an user space memory management library to deal with all
>> alloc/free/defragment.. But how to do with virtual <-> physical address mapping
>> from userspace?
> 
> For HMM each process give hint (somewhat similar to mbind) for range of virtual
> address to the device kernel driver (through some API like OpenCL or CUDA for GPU
> for instance). All this being device driver specific ioctl.
> 
> The kernel device driver have an overall view of all the process that use the device
> and each of the memory advise they gave. From that informations the kernel device
> driver decide what part of each process address space to migrate to device memory.

Oh, I mean CDM-HMM.  I'm fine with HMM.

> This obviously dynamic and likely to change over the process lifetime.
> 
> 
> My understanding is that HMAT want similar API to allow process to give direction on
> where each range of virtual address should be allocated. It is expected that most

Right, but it's not clear who should manage the physical memory allocation and set up the
page table mapping. A new driver or the kernel?

> software can easily infer what part of its address will need more bandwidth, smaller
> latency versus what part is sparsely accessed ...
> 
> For HMAT i think first target is HBM and persistent memory and device memory might
> be added latter if that make sense.
> 

Okay, so there are two potential ways to handle CPU-addressable cache-coherent device memory
(or CPU-less NUMA memory, or "target domain" memory in the ACPI spec)?
1. CDM-HMM
2. HMAT

--
Regards,
Bob Liu


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-07  2:06                 ` Bob Liu
@ 2017-09-07 17:00                   ` Jerome Glisse
  2017-09-07 17:27                   ` Jerome Glisse
  1 sibling, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-09-07 17:00 UTC (permalink / raw)
  To: Bob Liu
  Cc: Ross Zwisler, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A)

> On 2017/9/6 10:12, Jerome Glisse wrote:
> > On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> >> On 2017/9/6 2:54, Ross Zwisler wrote:
> >>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> >>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
> >>>>>>>> Unlike unaddressable memory, coherent device memory has a real
> >>>>>>>> resource associated with it on the system (as CPU can address
> >>>>>>>> it). Add a new helper to hotplug such memory within the HMM
> >>>>>>>> framework.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Got an new question, coherent device( e.g CCIX) memory are likely
> >>>>>>> reported to OS
> >>>>>>> through ACPI and recognized as NUMA memory node.
> >>>>>>> Then how can their memory be captured and managed by HMM framework?
> >>>>>>>
> >>>>>>
> >>>>>> Only platform that has such memory today is powerpc and it is not
> >>>>>> reported
> >>>>>> as regular memory by the firmware hence why they need this helper.
> >>>>>>
> >>>>>> I don't think anyone has defined anything yet for x86 and acpi. As
> >>>>>> this is
> >>>>>
> >>>>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> >>>>> Table (HMAT) table defined in ACPI 6.2.
> >>>>> The HMAT can cover CPU-addressable memory types(though not non-cache
> >>>>> coherent on-device memory).
> >>>>>
> >>>>> Ross from Intel already done some work on this, see:
> >>>>> https://lwn.net/Articles/724562/
> >>>>>
> >>>>> arm64 supports APCI also, there is likely more this kind of device when
> >>>>> CCIX
> >>>>> is out (should be very soon if on schedule).
> >>>>
> >>>> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy"
> >>>> memory ie
> >>>> when you have several kind of memory each with different
> >>>> characteristics:
> >>>>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> >>>>     small (ie few giga bytes)
> >>>>   - Persistent memory, slower (both latency and bandwidth) big (tera
> >>>>   bytes)
> >>>>   - DDR (good old memory) well characteristics are between HBM and
> >>>>   persistent
> >>>>
> >>>> So AFAICT this has nothing to do with what HMM is for, ie device memory.
> >>>> Note
> >>>> that device memory can have a hierarchy of memory themself (HBM, GDDR
> >>>> and in
> >>>> maybe even persistent memory).
> >>>>
> >>>>>> memory on PCIE like interface then i don't expect it to be reported as
> >>>>>> NUMA
> >>>>>> memory node but as io range like any regular PCIE resources. Device
> >>>>>> driver
> >>>>>> through capabilities flags would then figure out if the link between
> >>>>>> the
> >>>>>> device and CPU is CCIX capable if so it can use this helper to hotplug
> >>>>>> it
> >>>>>> as device memory.
> >>>>>>
> >>>>>
> >>>>> From my point of view,  Cache coherent device memory will popular soon
> >>>>> and
> >>>>> reported through ACPI/UEFI. Extending NUMA policy still sounds more
> >>>>> reasonable
> >>>>> to me.
> >>>>
> >>>> Cache coherent device will be reported through standard mecanisms
> >>>> defined by
> >>>> the bus standard they are using. To my knowledge all the standard are
> >>>> either
> >>>> on top of PCIE or are similar to PCIE.
> >>>>
> >>>> It is true that on many platform PCIE resource is manage/initialize by
> >>>> the
> >>>> bios (UEFI) but it is platform specific. In some case we reprogram what
> >>>> the
> >>>> bios pick.
> >>>>
> >>>> So like i was saying i don't expect the BIOS/UEFI to report device
> >>>> memory as
> >>>> regular memory. It will be reported as a regular PCIE resources and then
> >>>> the
> >>>> device driver will be able to determine through some flags if the link
> >>>> between
> >>>> the CPU(s) and the device is cache coherent or not. At that point the
> >>>> device
> >>>> driver can use register it with HMM helper.
> >>>>
> >>>>
> >>>> The whole NUMA discussion happen several time in the past i suggest
> >>>> looking
> >>>> on mm list archive for them. But it was rule out for several reasons.
> >>>> Top of
> >>>> my head:
> >>>>   - people hate CPU less node and device memory is inherently CPU less
> >>>
> >>> With the introduction of the HMAT in ACPI 6.2 one of the things that was
> >>> added
> >>> was the ability to have an ACPI proximity domain that isn't associated
> >>> with a
> >>> CPU.  This can be seen in the changes in the text of the "Proximity
> >>> Domain"
> >>> field in table 5-73 which describes the "Memory Affinity Structure".  One
> >>> of
> >>> the major features of the HMAT was the separation of "Initiator"
> >>> proximity
> >>> domains (CPUs, devices that initiate memory transfers), and "target"
> >>> proximity
> >>> domains (memory regions, be they attached to a CPU or some other device).
> >>>
> >>> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> >>> already in a place where we have to support CPU-less NUMA nodes.
> >>>
> >>>>   - device driver want total control over memory and thus to be isolated
> >>>>   from
> >>>>     mm mecanism and doing all those special cases was not welcome
> >>>
> >>> I agree that the kernel doesn't have enough information to be able to
> >>> accurately handle all the use cases for the various types of
> >>> heterogeneous
> >>> memory.   The goal of my HMAT enabling is to allow that memory to be
> >>> reserved
> >>> from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
> >>> Address Range Structure, then provide userspace with enough information
> >>> to be
> >>> able to distinguish between the various types of memory in the system so
> >>> it
> >>> can allocate & utilize it appropriately.
> >>>
> >>
> >> Does this mean require an user space memory management library to deal
> >> with all
> >> alloc/free/defragment.. But how to do with virtual <-> physical address
> >> mapping
> >> from userspace?
> > 
> > For HMM each process give hint (somewhat similar to mbind) for range of
> > virtual
> > address to the device kernel driver (through some API like OpenCL or CUDA
> > for GPU
> > for instance). All this being device driver specific ioctl.
> > 
> > The kernel device driver have an overall view of all the process that use
> > the device
> > and each of the memory advise they gave. From that informations the kernel
> > device
> > driver decide what part of each process address space to migrate to device
> > memory.
> 
> Oh, I mean CDM-HMM.  I'm fine with HMM.
> 
> > This obviously dynamic and likely to change over the process lifetime.
> > 
> > 
> > My understanding is that HMAT want similar API to allow process to give
> > direction on
> > where each range of virtual address should be allocated. It is expected
> > that most
> 
> Right, but not clear who should manage the physical memory allocation and
> setup the
> pagetable mapping. An new driver or the kernel?
> 
> > software can easily infer what part of its address will need more
> > bandwidth, smaller
> > latency versus what part is sparsely accessed ...
> > 
> > For HMAT i think first target is HBM and persistent memory and device
> > memory might
> > be added latter if that make sense.
> > 
> 
> Okay, so there are two potential ways for CPU-addressable cache-coherent
> device memory
> (or cpu-less numa memory or "target domain" memory in ACPI spec )?
> 1. CDM-HMM
> 2. HMAT
> 
> --
> Regards,
> Bob Liu
> 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-07  2:06                 ` Bob Liu
  2017-09-07 17:00                   ` Jerome Glisse
@ 2017-09-07 17:27                   ` Jerome Glisse
  2017-09-08  1:59                     ` Bob Liu
  1 sibling, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-09-07 17:27 UTC (permalink / raw)
  To: Bob Liu
  Cc: Ross Zwisler, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A)

> On 2017/9/6 10:12, Jerome Glisse wrote:
> > On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> >> On 2017/9/6 2:54, Ross Zwisler wrote:
> >>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> >>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:

[...]

> > For HMM each process give hint (somewhat similar to mbind) for range of
> > virtual address to the device kernel driver (through some API like OpenCL
> > or CUDA for GPU for instance). All this being device driver specific ioctl.
> > 
> > The kernel device driver have an overall view of all the process that use
> > the device and each of the memory advise they gave. From that informations
> > the kernel device driver decide what part of each process address space to
> > migrate to device memory.
> 
> Oh, I mean CDM-HMM.  I'm fine with HMM.

They are one and the same really. In both cases HMM is just a set of helpers
for device drivers.

> > This obviously dynamic and likely to change over the process lifetime.
> > 
> > My understanding is that HMAT want similar API to allow process to give
> > direction on
> > where each range of virtual address should be allocated. It is expected
> > that most
> 
> Right, but not clear who should manage the physical memory allocation and
> setup the pagetable mapping. An new driver or the kernel?

Physical device memory is managed by the kernel device driver, as it is today
and as it will be tomorrow. HMM does not change that, nor does it require any
change to that.

Migrating process memory to or from the device is done by the kernel through
the regular page migration. HMM provides a new helper for the device driver to
initiate such a migration. There is no mechanism like auto NUMA migration, for
the reasons I explained previously.

The kernel device driver uses all the knowledge it has to decide what to migrate
to device memory. Nothing new here either: it is what happens today for specially
allocated device objects, and it will happen all the same for regular mmap memory
(private anonymous memory or mmap of a regular file on a filesystem).
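
As a heavily trimmed sketch of that driver side, using the migrate_vma() helper from
this series; foo_alloc_device_page() and foo_dma_copy() stand in for the driver's own
device memory allocator and DMA engine, and all error handling is omitted:

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

struct foo_device;	/* opaque, driver specific */

/* Driver specific helpers, bodies omitted in this sketch. */
static struct page *foo_alloc_device_page(void *private);
static void foo_dma_copy(void *private, struct page *dpage, struct page *spage);

/* Copy the migrating pages of [start, end) into freshly allocated device pages. */
static void foo_alloc_and_copy(struct vm_area_struct *vma,
			       const unsigned long *src, unsigned long *dst,
			       unsigned long start, unsigned long end,
			       void *private)
{
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *spage, *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;
		spage = migrate_pfn_to_page(src[i]);
		dpage = foo_alloc_device_page(private);
		if (!dpage)
			continue;	/* this page simply stays in system memory */
		lock_page(dpage);
		/* spage may be NULL (hole); a real driver would clear dpage then */
		foo_dma_copy(private, dpage, spage);
		/* depending on the memory type MIGRATE_PFN_DEVICE may also be needed */
		dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
	}
}

/* Point the device page table at the new pages once the migration is final. */
static void foo_finalize_and_map(struct vm_area_struct *vma,
				 const unsigned long *src,
				 const unsigned long *dst,
				 unsigned long start, unsigned long end,
				 void *private)
{
	/* device specific page table update, omitted */
}

static const struct migrate_vma_ops foo_migrate_ops = {
	.alloc_and_copy		= foo_alloc_and_copy,
	.finalize_and_map	= foo_finalize_and_map,
};

/* Called once the driver decided to move [start, end) of a process to the device. */
static int foo_migrate_range(struct foo_device *fdev, struct vm_area_struct *vma,
			     unsigned long start, unsigned long end,
			     unsigned long *src, unsigned long *dst)
{
	return migrate_vma(&foo_migrate_ops, vma, start, end, src, dst, fdev);
}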


So every low level thing happens in the kernel. Userspace only provides directives
to the kernel device driver through a device specific API, and the kernel device
driver can ignore or override those directives.


> > software can easily infer what part of its address will need more
> > bandwidth, smaller
> > latency versus what part is sparsely accessed ...
> > 
> > For HMAT i think first target is HBM and persistent memory and device
> > memory might
> > be added latter if that make sense.
> > 
> 
> Okay, so there are two potential ways for CPU-addressable cache-coherent
> device memory
> (or cpu-less numa memory or "target domain" memory in ACPI spec )?
> 1. CDM-HMM
> 2. HMAT

No, these are 2 orthogonal things; they do not conflict with each other, quite
the contrary. HMM (the CDM part is no different) is a set of helpers, see it as
a toolbox, for device drivers.

HMAT is a way for firmware to report memory resources with more information than
just ranges of physical addresses. HMAT is specific to platforms that rely on
ACPI. HMAT does not provide any helpers to manage this memory.

So a device driver can get information about device memory from HMAT and then
use HMM to help in managing and using this memory.
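
As a rough sketch of that last step with the CDM helper this patch adds: struct
foo_device and the foo_devmem_* callbacks below are placeholders, and how the
struct resource describing the coherent range is discovered is left out.

#include <linux/hmm.h>
#include <linux/ioport.h>
#include <linux/err.h>

struct foo_device {
	struct device *dev;
	struct hmm_devmem *devmem;
	/* ... */
};

/* Per-page callbacks for the hotplugged range, bodies omitted. */
static void foo_devmem_free(struct hmm_devmem *devmem, struct page *page);
static int foo_devmem_fault(struct hmm_devmem *devmem,
			    struct vm_area_struct *vma, unsigned long addr,
			    const struct page *page, unsigned int flags,
			    pmd_t *pmdp);

static const struct hmm_devmem_ops foo_devmem_ops = {
	.free	= foo_devmem_free,
	.fault	= foo_devmem_fault,
};

/*
 * res describes the coherent (CPU addressable) device memory range; how it
 * is discovered (firmware table, bus probing, ...) is platform and device
 * specific and elided here.
 */
static int foo_hotplug_cdm(struct foo_device *fdev, struct resource *res)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add_resource(&foo_devmem_ops, fdev->dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	fdev->devmem = devmem;	/* pages of the range now have struct page */
	return 0;
}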

Jérôme


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-07 17:27                   ` Jerome Glisse
@ 2017-09-08  1:59                     ` Bob Liu
  2017-09-08 20:43                       ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Bob Liu @ 2017-09-08  1:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ross Zwisler, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A)

On 2017/9/8 1:27, Jerome Glisse wrote:
>> On 2017/9/6 10:12, Jerome Glisse wrote:
>>> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
>>>> On 2017/9/6 2:54, Ross Zwisler wrote:
>>>>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>>>>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>>>>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>>>>>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>>>>>>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
> 
> [...]
> 
>>> For HMM each process give hint (somewhat similar to mbind) for range of
>>> virtual address to the device kernel driver (through some API like OpenCL
>>> or CUDA for GPU for instance). All this being device driver specific ioctl.
>>>
>>> The kernel device driver have an overall view of all the process that use
>>> the device and each of the memory advise they gave. From that informations
>>> the kernel device driver decide what part of each process address space to
>>> migrate to device memory.
>>
>> Oh, I mean CDM-HMM.  I'm fine with HMM.
> 
> They are one and the same really. In both cases HMM is just a set of helpers
> for device driver.
> 
>>> This obviously dynamic and likely to change over the process lifetime.
>>>
>>> My understanding is that HMAT want similar API to allow process to give
>>> direction on
>>> where each range of virtual address should be allocated. It is expected
>>> that most
>>
>> Right, but not clear who should manage the physical memory allocation and
>> setup the pagetable mapping. An new driver or the kernel?
> 
> Physical device memory is manage by the kernel device driver as it is today
> and has it will be tomorrow. HMM does not change that, nor does it requires
> any change to that.
> 

Can someone from Intel give more information about the plan for managing HMAT-reported memory?

> Migrating process memory to or from device is done by the kernel through
> the regular page migration. HMM provides new helper for device driver to
> initiate such migration. There is no mechanisms like auto numa migration
> for the reasons i explain previously.
> 
> Kernel device driver use all knowledge it has to decide what to migrate to
> device memory. Nothing new here either, it is what happens today for special
> allocated device object and it will just happen all the same for regular
> mmap memory (private anonymous or mmap of a regular file of a filesystem).
> 
> 
> So every low level thing happen in the kernel. Userspace only provides
> directive to the kernel device driver through device specific API. But the
> kernel device driver can ignore or override those directive.
> 
> 
>>> software can easily infer what part of its address will need more
>>> bandwidth, smaller
>>> latency versus what part is sparsely accessed ...
>>>
>>> For HMAT i think first target is HBM and persistent memory and device
>>> memory might
>>> be added latter if that make sense.
>>>
>>
>> Okay, so there are two potential ways for CPU-addressable cache-coherent
>> device memory
>> (or cpu-less numa memory or "target domain" memory in ACPI spec )?
>> 1. CDM-HMM
>> 2. HMAT
> 
> No this are 2 orthogonal thing, they do not conflict with each others quite
> the contrary. HMM (the CDM part is no different) is a set of helpers, see
> it as a toolbox, for device driver.
> 
> HMAT is a way for firmware to report memory resources with more informations
> that just range of physical address. HMAT is specific to platform that rely
> on ACPI. HMAT does not provide any helpers to manage these memory.
> 
> So a device driver can get informations about device memory from HMAT and then
> use HMM to help in managing and using this memory.
> 

Yes, but as Balbir mentioned this requires:
1. Don't online the memory as a NUMA node
2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver

And I'm not sure whether Intel is going to use this HMM-CDM based method for their "target domain" memory,
or whether they prefer the NUMA approach?   Ross? Dan?

--
Thanks,
Bob Liu



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-05 19:20                 ` Jerome Glisse
@ 2017-09-08 19:43                   ` Ross Zwisler
  2017-09-08 20:29                     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Ross Zwisler @ 2017-09-08 19:43 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Ross Zwisler, Bob Liu, akpm, linux-kernel, linux-mm,
	John Hubbard, Dan Williams, David Nellans, Balbir Singh,
	majiuyue, xieyisheng (A),
	Mel Gorman, Rik van Riel, Michal Hocko

On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
<>
> Does HMAT support device hotplug ? I am unfamiliar with the whole inner working
> of ACPI versus PCIE. Anyway i don't see any issue with device memory also showing
> through HMAT but like i said device driver for the device will want to be in total
> control of that memory.

Yep, the HMAT will support device hotplug via the _HMA method (section 6.2.18
of ACPI 6.2).  This basically supplies an entirely new HMAT that the system
will use to replace the current one.

I don't yet have support for _HMA in my enabling, but I do intend to add
support for it once we settle on a sysfs API for the regular boot-time case.

> Like i said issue here is that core kernel is unaware of the device activity ie
> on what part of memory the device is actively working. So core mm can not make
> inform decision on what should be migrated to device memory. Also we do not want
> regular memory allocation to end in device memory unless explicitly ask for.
> Few reasons for that. First this memory might not only be use for compute task
> but also for graphic and in that case they are hard constraint on physically
> contiguous memory allocation that require the GPU to move thing around to make
> room for graphic object (can't allow GUP).
> 
> Second reasons, the device memory is inherently unreliable. If there is a bug
> in the device driver or the user manage to trigger a faulty condition on GPU
> the device might need a hard reset (ie cut PCIE power to device) which leads
> to loss of memory content. While GPU are becoming more and more resilient they
> are still prone to lockup.
> 
> Finaly for GPU there is a common pattern of memory over-commit. You pretend to
> each application as if they were the only one and allow each of them to allocate
> all of the device memory or more than could with strict sharing. As GPU have
> long timeslice between switching to different context/application they can
> easily move out and in large chunk of the process memory at context/application
> switching. This is have proven to be a key aspect to allow maximum performances
> accross several concurrent application/context.
> 
> To implement this easiest solution is for the device to lie about how much memory
> it has and use the system memory as an overflow.

I don't think any of this precludes the HMAT being involved.  This is all very
similar to what I think we need to do for high bandwidth memory, for example.
We don't want the OS to use it for anything, and we want all of it to be
available for applications to allocate and use for their specific workload.
We don't want to make any assumptions about how it can or should be used.

The HMAT is just there to give us a few things:

1) It provides us with an explicit way of telling the OS not to use the
memory, in the form of the "Reservation hint" flag in the Memory Subsystem
Address Range Structure (ACPI 6.2 section 5.2.27.3).  I expect that this will
be set for persistent memory and HBM, and it sounds like you'd expect it to be
set for your device memory as well.

2) It provides us with a way of telling userspace "hey, I know about some
memory, and I can tell you its performance characteristics".  All control of
how this memory is allocated and used is still left to userspace.

> I am not saying that NUMA is not the way forward, i am saying that as it is today
> it is not suited for this. It is lacking metric, it is lacking logic, it is lacking
> features. We could add all this but it is a lot of work and i don't feel that we
> have enough real world experience to do so now. I would rather have each devices
> grow proper infrastructure in their driver through device specific API.

To be clear, I'm not proposing that we teach the NUMA code how to
automatically allocate for a given numa node, balance, etc. memory described
by the HMAT.  All I want is an API that says "here is some memory, I'll tell
you all I can about it and let you do with it what you will", and perhaps a
way to manually allocate what you want.

And yes, this is very hand-wavy at this point. :)  After I get the sysfs
portion sussed out the next step is to work on enabling something like
libnuma to allow the memory to be manually allocated.

I think this works for both my use case and yours, correct?

> Then identify common pattern and from there try to build a sane API (if any such
> thing exist :)) rather than trying today to build the whole house from the ground
> up with just a foggy idea of how it should looks in the end.

Yea, I do see your point.  My worry is that if I define an API, and you define
an API, we'll end up in two different places with people using our different
APIs, then:

https://xkcd.com/927/

:)

The HMAT enabling I'm trying to do is very passive - it doesn't actively do
*anything* with the memory, its entire purpose is to give userspace more
information about the memory so userspace can make informed decisions.

Would you be willing to look at the sysfs API I have defined, and see if it
would work for you?

https://lkml.org/lkml/2017/7/6/749

I'll look harder at your enabling and see if we can figure out some common
ground.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-08 19:43                   ` Ross Zwisler
@ 2017-09-08 20:29                     ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-09-08 20:29 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Bob Liu, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A),
	Mel Gorman, Rik van Riel, Michal Hocko

On Fri, Sep 08, 2017 at 01:43:44PM -0600, Ross Zwisler wrote:
> On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
> <>
> > Does HMAT support device hotplug ? I am unfamiliar with the whole inner working
> > of ACPI versus PCIE. Anyway i don't see any issue with device memory also showing
> > through HMAT but like i said device driver for the device will want to be in total
> > control of that memory.
> 
> Yep, the HMAT will support device hotplug via the _HMA method (section 6.2.18
> of ACPI 6.2).  This basically supplies an entirely new HMAT that the system
> will use to replace the current one.
> 
> I don't yet have support for _HMA in my enabling, but I do intend to add
> support for it once we settle on a sysfs API for the regular boot-time case.
> 
> > Like i said issue here is that core kernel is unaware of the device activity ie
> > on what part of memory the device is actively working. So core mm can not make
> > inform decision on what should be migrated to device memory. Also we do not want
> > regular memory allocation to end in device memory unless explicitly ask for.
> > Few reasons for that. First this memory might not only be use for compute task
> > but also for graphic and in that case they are hard constraint on physically
> > contiguous memory allocation that require the GPU to move thing around to make
> > room for graphic object (can't allow GUP).
> > 
> > Second reasons, the device memory is inherently unreliable. If there is a bug
> > in the device driver or the user manage to trigger a faulty condition on GPU
> > the device might need a hard reset (ie cut PCIE power to device) which leads
> > to loss of memory content. While GPU are becoming more and more resilient they
> > are still prone to lockup.
> > 
> > Finaly for GPU there is a common pattern of memory over-commit. You pretend to
> > each application as if they were the only one and allow each of them to allocate
> > all of the device memory or more than could with strict sharing. As GPU have
> > long timeslice between switching to different context/application they can
> > easily move out and in large chunk of the process memory at context/application
> > switching. This is have proven to be a key aspect to allow maximum performances
> > accross several concurrent application/context.
> > 
> > To implement this easiest solution is for the device to lie about how much memory
> > it has and use the system memory as an overflow.
> 
> I don't think any of this precludes the HMAT being involved.  This is all very
> similar to what I think we need to do for high bandwidth memory, for example.
> We don't want the OS to use it for anything, and we want all of it to be
> available for applications to allocate and use for their specific workload.
> We don't want to make any assumptions about how it can or should be used.
> 
> The HMAT is just there to give us a few things:
> 
> 1) It provides us with an explicit way of telling the OS not to use the
> memory, in the form of the "Reservation hint" flag in the Memory Subsystem
> Address Range Structure (ACPI 6.2 section 5.2.27.3).  I expect that this will
> be set for persistent memory and HBM, and it sounds like you'd expect it to be
> set for your device memory as well.
> 
> 2) It provides us with a way of telling userspace "hey, I know about some
> memory, and I can tell you its performance characteristics".  All control of
> how this memory is allocated and used is still left to userspace.
> 
> > I am not saying that NUMA is not the way forward, i am saying that as it is today
> > it is not suited for this. It is lacking metric, it is lacking logic, it is lacking
> > features. We could add all this but it is a lot of work and i don't feel that we
> > have enough real world experience to do so now. I would rather have each devices
> > grow proper infrastructure in their driver through device specific API.
> 
> To be clear, I'm not proposing that we teach the NUMA code how to
> automatically allocate for a given numa node, balance, etc. memory described
> by the HMAT.  All I want is an API that says "here is some memory, I'll tell
> you all I can about it and let you do with it what you will", and perhaps a
> way to manually allocate what you want.
> 
> And yes, this is very hand-wavy at this point. :)  After I get the sysfs
> portion sussed out the next step is to work on enabling something like
> libnuma to allow the memory to be manually allocated.
> 
> I think this works for both my use case and yours, correct?

It depends what you mean. Using NUMA as it is today, no. Growing a new API on
the side of libnuma, maybe. It is hard to say. Right now GPUs do have a very
rich API (see the OpenCL or CUDA API) and anything less expressive than what
they offer would not work.

The existing libnuma API is ill-suited: it is too static. GPU workloads are
more dynamic; as a result the virtual address space and the type of memory
backing a range of virtual addresses are managed in isolation. You move ranges
of virtual addresses to different memory following application programmer
advice and overall resource usage.

libnuma is much more "allocate some memory on some node and that's where that
memory should stay". At least this is my feeling from the API (here I am not
talking about the mbind syscall but about userspace libnuma).
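
To illustrate that static model, a minimal userspace example against the libnuma
API (build with -lnuma; the node number and size are arbitrary):

#include <numa.h>
#include <stdlib.h>

/* The static placement model: pick a node up front, the buffer stays there. */
int main(void)
{
	size_t sz = 64UL << 20;
	void *buf;

	if (numa_available() < 0)
		return EXIT_FAILURE;

	buf = numa_alloc_onnode(sz, 1);		/* 64MB bound to node 1 */
	if (!buf)
		return EXIT_FAILURE;

	/* ... the workload uses buf, which is expected to stay on node 1 ... */

	numa_free(buf, sz);
	return EXIT_SUCCESS;
}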


But it could grow into something more along the lines of OpenCL/CUDA.


> > Then identify common pattern and from there try to build a sane API (if any such
> > thing exist :)) rather than trying today to build the whole house from the ground
> > up with just a foggy idea of how it should looks in the end.
> 
> Yea, I do see your point.  My worry is that if I define an API, and you define
> an API, we'll end up in two different places with people using our different
> APIs, then:
> 
> https://xkcd.com/927/
> 
> :)

I am well aware of that :) The thing is, those APIs already exist and there is
a large chunk of what we could call legacy applications that rely on them (again
the OpenCL and CUDA APIs). The easiest way is to evolve from these existing APIs
and factor out what is constant across devices.

Note that while the userspace side of the OpenCL/CUDA APIs is standardized and
well defined, the actual kernel API that the OpenCL/CUDA runtime uses to talk to
the kernel varies from one device to another. That is why I say there is no
consensus on what that API would look like. At the high level (OpenCL/CUDA),
sure, but how this could translate into a device agnostic kernel API is more
fuzzy at this point.

The easiest thing would be to create a kernel API with all the information that
those high level APIs offer, but this would mean moving a large chunk of device
specific userspace code into the kernel. Given that GPU device drivers are
already bigger on their own than all the rest of the kernel taken together, I am
not sure this would be welcome (also some of this code relies on floating point
operations, though that can be worked around).

> 
> The HMAT enabling I'm trying to do is very passive - it doesn't actively do
> *anything* with the memory, it's entire purpose is to give userspace more
> information about the memory so userspace can make informed decisions.
> 
> Would you be willing to look at the sysfs API I have defined, and see if it
> would work for you?
> 
> https://lkml.org/lkml/2017/7/6/749

For now the same kind of information about GPU memory is reported through APIs
like OpenCL/CUDA, but having one place to report it all might appeal to
developers. The issue becomes how to match that sysfs information with an
OpenCL/CUDA device, as those APIs are operating system agnostic.

Nonetheless I will go over the sysfs proposal and see how it could fit.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-08  1:59                     ` Bob Liu
@ 2017-09-08 20:43                       ` Dan Williams
  2017-11-17  3:47                         ` chetan L
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2017-09-08 20:43 UTC (permalink / raw)
  To: Bob Liu
  Cc: Jerome Glisse, Ross Zwisler, Andrew Morton, linux-kernel,
	Linux MM, John Hubbard, David Nellans, Balbir Singh, majiuyue,
	xieyisheng (A)

On Thu, Sep 7, 2017 at 6:59 PM, Bob Liu <liubo95@huawei.com> wrote:
> On 2017/9/8 1:27, Jerome Glisse wrote:
[..]
>> No this are 2 orthogonal thing, they do not conflict with each others quite
>> the contrary. HMM (the CDM part is no different) is a set of helpers, see
>> it as a toolbox, for device driver.
>>
>> HMAT is a way for firmware to report memory resources with more informations
>> that just range of physical address. HMAT is specific to platform that rely
>> on ACPI. HMAT does not provide any helpers to manage these memory.
>>
>> So a device driver can get informations about device memory from HMAT and then
>> use HMM to help in managing and using this memory.
>>
>
> Yes, but as Balbir mentioned requires :
> 1. Don't online the memory as a NUMA node
> 2. Use the HMM-CDM API's to map the memory to ZONE DEVICE via the driver
>
> And I'm not sure whether Intel going to use this HMM-CDM based method for their "target domain" memory ?
> Or they prefer to NUMA approach?   Ross? Dan?

The starting / strawman proposal for performance differentiated memory
ranges is to get platform firmware to mark them reserved by default.
Then, after we parse the HMAT, make them available via the device-dax
mechanism so that applications that need 100% guaranteed access to
these potentially high-value / limited-capacity ranges can be sure to
get them by default, i.e. before any random kernel objects are placed
in them. Otherwise, if there are no dedicated users for the memory
ranges via device-dax, or they don't need the total capacity, we want
to hotplug that memory into the general purpose memory allocator with
a numa node number so typical numactl and memory-management flows are
enabled.
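
From the application side that device-dax path looks roughly like the following
sketch (the device name and mapping size are examples only):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a performance differentiated range exposed as a device-dax device. */
int main(void)
{
	size_t sz = 1UL << 30;	/* must respect the device's alignment */
	void *addr;
	int fd;

	fd = open("/dev/dax0.0", O_RDWR);	/* example device name */
	if (fd < 0)
		return 1;

	addr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return 1;
	}

	/* The application owns this range outright: no random kernel objects
	 * and no other users end up in it. */

	munmap(addr, sz);
	close(fd);
	return 0;
}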

Ideally this would not be specific to HMAT and any agent that knows
differentiated performance characteristics of a memory range could use
this scheme.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
  2017-09-08 20:43                       ` Dan Williams
@ 2017-11-17  3:47                         ` chetan L
  0 siblings, 0 replies; 66+ messages in thread
From: chetan L @ 2017-11-17  3:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Bob Liu, Jerome Glisse, Ross Zwisler, Andrew Morton,
	linux-kernel, Linux MM, John Hubbard, David Nellans,
	Balbir Singh, majiuyue, xieyisheng (A)

On Fri, Sep 8, 2017 at 1:43 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Sep 7, 2017 at 6:59 PM, Bob Liu <liubo95@huawei.com> wrote:
>> On 2017/9/8 1:27, Jerome Glisse wrote:
> [..]
>>> No this are 2 orthogonal thing, they do not conflict with each others quite
>>> the contrary. HMM (the CDM part is no different) is a set of helpers, see
>>> it as a toolbox, for device driver.
>>>
>>> HMAT is a way for firmware to report memory resources with more informations
>>> that just range of physical address. HMAT is specific to platform that rely
>>> on ACPI. HMAT does not provide any helpers to manage these memory.
>>>
>>> So a device driver can get informations about device memory from HMAT and then
>>> use HMM to help in managing and using this memory.
>>>
>>
>> Yes, but as Balbir mentioned requires :
>> 1. Don't online the memory as a NUMA node
>> 2. Use the HMM-CDM API's to map the memory to ZONE DEVICE via the driver
>>
>> And I'm not sure whether Intel going to use this HMM-CDM based method for their "target domain" memory ?
>> Or they prefer to NUMA approach?   Ross? Dan?
>
> The starting / strawman proposal for performance differentiated memory
> ranges is to get platform firmware to mark them reserved by default.
> Then, after we parse the HMAT, make them available via the device-dax
> mechanism so that applications that need 100% guaranteed access to
> these potentially high-value / limited-capacity ranges can be sure to
> get them by default, i.e. before any random kernel objects are placed
> in them. Otherwise, if there are no dedicated users for the memory
> ranges via device-dax, or they don't need the total capacity, we want
> to hotplug that memory into the general purpose memory allocator with
> a numa node number so typical numactl and memory-management flows are
> enabled.
>
> Ideally this would not be specific to HMAT and any agent that knows
> differentiated performance characteristics of a memory range could use
> this scheme.

@Dan/Ross

With this approach, in an SVM environment, if you wanted a PRI (page grant)
request to be satisfied from this HMAT-indexed memory node, do you think we
could make that happen? If so, is that something you guys are currently
working on?


Chetan


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
                   ` (19 preceding siblings ...)
  2017-08-17 21:39 ` [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Andrew Morton
@ 2017-12-13 12:10 ` Figo.zhang
  2017-12-13 16:12   ` Jerome Glisse
  20 siblings, 1 reply; 66+ messages in thread
From: Figo.zhang @ 2017-12-13 12:10 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

[-- Attachment #1: Type: text/plain, Size: 15562 bytes --]

It mentions that HMM provides two functions:

1. One is to mirror the page table between the CPU and a device (like a GPU or
FPGA). I am confused here: does duplicating the CPU page table into a device
page table mean copying the CPU page table entries into the device page table,
so that the device can access the CPU's virtual addresses, and so that the
device can access the CPU-allocated physical memory mapped into this VMA?

For example:  VMA -> PA (CPU physical address)

mirror: fill the PTE entries of this VMA into the GPU's page table

so:

From the CPU's view: it can access the PA

From the GPU's view: it can access the CPU's VMA and PA

right? (a sketch of the driver side of mirroring follows after these questions)

2. The other function is to migrate CPU memory to device memory. What is the
application scenario? Some input data created by the GPU and migrated back to
CPU memory, for the CPU to access the GPU's data?

3. Function one helps the GPU access the CPU's VMAs and the CPU's physical
memory. If the CPU wants to access the GPU's memory, does it still need a
device specific driver API like IOCTL+mmap+DMA?

4. Is there any example? I remember there was a dummy driver in an older
patchset version; I cannot find it in this version.
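
On question 1, a rough sketch of what the mirroring side of a driver could look
like with the helpers in this series, as I read them; the foo_* types and
foo_invalidate_device_range() are placeholders for the device specific parts:

#include <linux/hmm.h>
#include <linux/kernel.h>
#include <linux/sched.h>

struct foo_mirror {
	struct hmm_mirror mirror;
	/* device specific state, e.g. a handle on the GPU page table */
};

/* Driver specific device page table invalidation, body omitted. */
static void foo_invalidate_device_range(struct foo_mirror *fm,
					unsigned long start, unsigned long end);

/*
 * Called by HMM (via mmu_notifier) whenever the CPU page table changes for
 * [start, end): the driver must invalidate the matching device page table
 * entries so the device never uses a stale translation.
 */
static void foo_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					   enum hmm_update_type update,
					   unsigned long start,
					   unsigned long end)
{
	struct foo_mirror *fm = container_of(mirror, struct foo_mirror, mirror);

	foo_invalidate_device_range(fm, start, end);
}

static const struct hmm_mirror_ops foo_mirror_ops = {
	.sync_cpu_device_pagetables = foo_sync_cpu_device_pagetables,
};

/* Start mirroring the current process address space onto the device. */
static int foo_mirror_current_mm(struct foo_mirror *fm)
{
	fm->mirror.ops = &foo_mirror_ops;
	return hmm_mirror_register(&fm->mirror, current->mm);
}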

2017-08-17 8:05 GMT+08:00 Jérôme Glisse <jglisse@redhat.com>:

> Patchset is on top of git://git.cmpxchg.org/linux-mmotm.git so i
> test same kernel as kbuild system, git branch:
>
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v25
>
> Change since v24 are:
>   - more comments and more documentations
>   - fix dumb mistake when registering device
>   - fix race when multiple concurrent device fault
>   - merge HMM-CDM patchset with HMM patchset to avoid unecessary code
>     churn
>
> The overall logic is the same. Below is the long description of what HMM
> is about and why. At the end of this email i describe briefly each patch
> and suggest reviewers for each of them.
>
> Please consider this patchset for 4.14
>
>
> Heterogeneous Memory Management (HMM) (description and justification)
>
> Today device driver expose dedicated memory allocation API through their
> device file, often relying on a combination of IOCTL and mmap calls. The
> device can only access and use memory allocated through this API. This
> effectively split the program address space into object allocated for the
> device and useable by the device and other regular memory (malloc, mmap
> of a file, share memory, ...) only accessible by CPU (or in a very limited
> way by a device by pinning memory).
>
> Allowing different isolated component of a program to use a device thus
> require duplication of the input data structure using device memory
> allocator. This is reasonable for simple data structure (array, grid,
> image, ...) but this get extremely complex with advance data structure
> (list, tree, graph, ...) that rely on a web of memory pointers. This is
> becoming a serious limitation on the kind of work load that can be
> offloaded to device like GPU.
>
> New industry standard like C++, OpenCL or CUDA are pushing to remove this
> barrier. This require a shared address space between GPU device and CPU so
> that GPU can access any memory of a process (while still obeying memory
> protection like read only). This kind of feature is also appearing in
> various other operating systems.
>
> HMM is a set of helpers to facilitate several aspects of address space
> sharing and device memory management. Unlike existing sharing mechanism
> that rely on pining pages use by a device, HMM relies on mmu_notifier to
> propagate CPU page table update to device page table.
>
> Duplicating CPU page table is only one aspect necessary for efficiently
> using device like GPU. GPU local memory have bandwidth in the TeraBytes/
> second range but they are connected to main memory through a system bus
> like PCIE that is limited to 32GigaBytes/second (PCIE 4.0 16x). Thus it
> is necessary to allow migration of process memory from main system memory
> to device memory. Issue is that on platform that only have PCIE the device
> memory is not accessible by the CPU with the same properties as main
> memory (cache coherency, atomic operations, ...).
>
> To allow migration from main memory to device memory HMM provides a set
> of helper to hotplug device memory as a new type of ZONE_DEVICE memory
> which is un-addressable by CPU but still has struct page representing it.
> This allow most of the core kernel logic that deals with a process memory
> to stay oblivious of the peculiarity of device memory.
>
> When page backing an address of a process is migrated to device memory
> the CPU page table entry is set to a new specific swap entry. CPU access
> to such address triggers a migration back to system memory, just like if
> the page was swap on disk. HMM also blocks any one from pinning a
> ZONE_DEVICE page so that it can always be migrated back to system memory
> if CPU access it. Conversely HMM does not migrate to device memory any
> page that is pin in system memory.
>
> To allow efficient migration between device memory and main memory a new
> migrate_vma() helpers is added with this patchset. It allows to leverage
> device DMA engine to perform the copy operation.
>
> This feature will be use by upstream driver like nouveau mlx5 and probably
> other in the future (amdgpu is next suspect in line). We are actively
> working on nouveau and mlx5 support. To test this patchset we also worked
> with NVidia close source driver team, they have more resources than us to
> test this kind of infrastructure and also a bigger and better userspace
> eco-system with various real industry workload they can be use to test and
> profile HMM.
>
> The expected workload is a program builds a data set on the CPU (from disk,
> from network, from sensors, ...). Program uses GPU API (OpenCL, CUDA, ...)
> to give hint on memory placement for the input data and also for the output
> buffer. Program call GPU API to schedule a GPU job, this happens using
> device driver specific ioctl. All this is hidden from programmer point of
> view in case of C++ compiler that transparently offload some part of a
> program to GPU. Program can keep doing other stuff on the CPU while the
> GPU is crunching numbers.
>
> It is expected that CPU will not access the same data set as the GPU while
> GPU is working on it, but this is not mandatory. In fact we expect some
> small memory object to be actively access by both GPU and CPU concurrently
> as synchronization channel and/or for monitoring purposes. Such object will
> stay in system memory and should not be bottlenecked by system bus
> bandwidth (rare write and read access from both CPU and GPU).
>
> As we are relying on device driver API, HMM does not introduce any new
> syscall nor does it modify any existing ones. It does not change any POSIX
> semantics or behaviors. For instance the child after a fork of a process
> that is using HMM will not be impacted in anyway, nor is there any data
> hazard between child COW or parent COW of memory that was migrated to
> device prior to fork.
>
> HMM assume a numbers of hardware features. Device must allow device page
> table to be updated at any time (ie device job must be preemptable). Device
> page table must provides memory protection such as read only. Device must
> track write access (dirty bit). Device must have a minimum granularity that
> match PAGE_SIZE (ie 4k).
>
>
> Reviewer (just hint):
> Patch 1  HMM documentation
> Patch 2  introduce core infrastructure and definition of HMM, pretty
>          small patch and easy to review
> Patch 3  introduce the mirror functionality of HMM, it relies on
>          mmu_notifier and thus someone familiar with that part would be
>          in better position to review
> Patch 4  is an helper to snapshot CPU page table while synchronizing with
>          concurrent page table update. Understanding mmu_notifier makes
>          review easier.
> Patch 5  is mostly a wrapper around handle_mm_fault()
> Patch 6  add new add_pages() helper to avoid modifying each arch memory
>          hot plug function
> Patch 7  add a new memory type for ZONE_DEVICE and also add all the logic
>          in various core mm to support this new type. Dan Williams and
>          any core mm contributor are best people to review each half of
>          this patchset
> Patch 8  special case HMM ZONE_DEVICE pages inside put_page() Kirill and
>          Dan Williams are best person to review this
> Patch 9  allow to uncharge a page from memory group without using the lru
>          list field of struct page (best reviewer: Johannes Weiner or
>          Vladimir Davydov or Michal Hocko)
> Patch 10 Add support to uncharge ZONE_DEVICE page from a memory cgroup
> (best
>          reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko)
> Patch 11 add helper to hotplug un-addressable device memory as new type
>          of ZONE_DEVICE memory (new type introducted in patch 3 of this
>          serie). This is boiler plate code around memory hotplug and it
>          also pick a free range of physical address for the device memory.
>          Note that the physical address do not point to anything (at least
>          as far as the kernel knows).
> Patch 12 introduce a new hmm_device class as an helper for device driver
>          that want to expose multiple device memory under a common fake
>          device driver. This is usefull for multi-gpu configuration.
>          Anyone familiar with device driver infrastructure can review
>          this. Boiler plate code really.
> Patch 13 add a new migrate mode. Any one familiar with page migration is
>          welcome to review.
> Patch 14 introduce a new migration helper (migrate_vma()) that allow to
>          migrate a range of virtual address of a process using device DMA
>          engine to perform the copy. It is not limited to do copy from and
>          to device but can also do copy between any kind of source and
>          destination memory. Again anyone familiar with migration code
>          should be able to verify the logic.
> Patch 15 optimize the new migrate_vma() by unmapping pages while we are
>          collecting them. This can be review by any mm folks.
> Patch 16 add unaddressable memory migration to helper introduced in patch
>          7, this can be review by anyone familiar with migration code
> Patch 17 add a feature that allow device to allocate non-present page on
>          the GPU when migrating a range of address to device memory. This
>          is an helper for device driver to avoid having to first allocate
>          system memory before migration to device memory
> Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device
>          memory (CDM)
> Patch 19 add an helper to hotplug CDM memory
>
>
> Previous patchset posting :
> v1 http://lwn.net/Articles/597289/
> v2 https://lkml.org/lkml/2014/6/12/559
> v3 https://lkml.org/lkml/2014/6/13/633
> v4 https://lkml.org/lkml/2014/8/29/423
> v5 https://lkml.org/lkml/2014/11/3/759
> v6 http://lwn.net/Articles/619737/
> v7 http://lwn.net/Articles/627316/
> v8 https://lwn.net/Articles/645515/
> v9 https://lwn.net/Articles/651553/
> v10 https://lwn.net/Articles/654430/
> v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
> v12 http://www.kernelhub.org/?msg=972982&p=2
> v13 https://lwn.net/Articles/706856/
> v14 https://lkml.org/lkml/2016/12/8/344
> v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/
> msg1304107.html
> v16 http://www.spinics.net/lists/linux-mm/msg119814.html
> v17 https://lkml.org/lkml/2017/1/27/847
> v18 https://lkml.org/lkml/2017/3/16/596
> v19 https://lkml.org/lkml/2017/4/5/831
> v20 https://lwn.net/Articles/720715/
> v21 https://lkml.org/lkml/2017/4/24/747
> v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
> v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/
> msg1404788.html
> v24 https://lwn.net/Articles/726691/
>
> Jérôme Glisse (18):
>   hmm: heterogeneous memory management documentation v3
>   mm/hmm: heterogeneous memory management (HMM for short) v5
>   mm/hmm/mirror: mirror process address space on device with HMM helpers
>     v3
>   mm/hmm/mirror: helper to snapshot CPU page table v4
>   mm/hmm/mirror: device page fault handler
>   mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
>   mm/ZONE_DEVICE: special case put_page() for device private pages v4
>   mm/memcontrol: allow to uncharge page without using page->lru field
>   mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4
>   mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7
>   mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3
>   mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY
>   mm/migrate: new memory migration helper for use with device memory v5
>   mm/migrate: migrate_vma() unmap page from vma while collecting pages
>   mm/migrate: support un-addressable ZONE_DEVICE page in migration v3
>   mm/migrate: allow migrate_vma() to alloc new page on empty entry v4
>   mm/device-public-memory: device memory cache coherent with CPU v5
>   mm/hmm: add new helper to hotplug CDM memory region v3
>
> Michal Hocko (1):
>   mm/memory_hotplug: introduce add_pages
>
>  Documentation/vm/hmm.txt       |  384 ++++++++++++
>  MAINTAINERS                    |    7 +
>  arch/x86/Kconfig               |    4 +
>  arch/x86/mm/init_64.c          |   22 +-
>  fs/aio.c                       |    8 +
>  fs/f2fs/data.c                 |    5 +-
>  fs/hugetlbfs/inode.c           |    5 +-
>  fs/proc/task_mmu.c             |    9 +-
>  fs/ubifs/file.c                |    5 +-
>  include/linux/hmm.h            |  518 ++++++++++++++++
>  include/linux/ioport.h         |    2 +
>  include/linux/memory_hotplug.h |   11 +
>  include/linux/memremap.h       |  107 ++++
>  include/linux/migrate.h        |  124 ++++
>  include/linux/migrate_mode.h   |    5 +
>  include/linux/mm.h             |   33 +-
>  include/linux/mm_types.h       |    6 +
>  include/linux/swap.h           |   24 +-
>  include/linux/swapops.h        |   68 +++
>  kernel/fork.c                  |    3 +
>  kernel/memremap.c              |   60 +-
>  mm/Kconfig                     |   47 +-
>  mm/Makefile                    |    2 +-
>  mm/balloon_compaction.c        |    8 +
>  mm/gup.c                       |    7 +
>  mm/hmm.c                       | 1273 ++++++++++++++++++++++++++++++
> ++++++++++
>  mm/madvise.c                   |    2 +-
>  mm/memcontrol.c                |  222 ++++---
>  mm/memory.c                    |  107 +++-
>  mm/memory_hotplug.c            |   10 +-
>  mm/migrate.c                   |  928 ++++++++++++++++++++++++++++-
>  mm/mprotect.c                  |   14 +
>  mm/page_vma_mapped.c           |   10 +
>  mm/rmap.c                      |   25 +
>  mm/swap.c                      |   11 +
>  mm/zsmalloc.c                  |    8 +
>  36 files changed, 3964 insertions(+), 120 deletions(-)
>  create mode 100644 Documentation/vm/hmm.txt
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
>
> --
> 2.13.4
>
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-13 12:10 ` Figo.zhang
@ 2017-12-13 16:12   ` Jerome Glisse
  2017-12-14  2:48     ` Figo.zhang
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-12-13 16:12 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
> it mention that HMM provided two functions:
> 
> 1. one is mirror the page table between CPU and a device(like GPU, FPGA),
> so i am confused that:  duplicating the CPU page table into a device page
> table means
> 
> that copy the CPU page table entry into device page table? so the device
> can access the CPU's virtual address? and that device can access the the CPU
> allocated physical memory which map into this VMA, right?
>
> for example:  VMA -> PA (CPU physical address)
> 
> mirror: fill the the PTE entry of this VMA into GPU's  page table
> 
> so:
> 
> For CPU's view: it can access the PA
> 
> For GPU's view: it can access the CPU's VMA and PA
> 
> right?

Correct. This is for platforms/devices without ATS/PASID. Note that
HMM only provides helpers to snapshot the CPU page table and properly
synchronize with concurrent CPU page table updates. Most of the code
is really inside the device driver, as each device driver has its own
architecture and its own page table format.
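
To make that split concrete, here is a rough driver-side sketch of the
mirror registration. It is only an illustration: every my_gpu_* name is
made up, and the hmm_mirror_* API should be read from include/linux/hmm.h
in this series, which this sketch only approximates.

    #include <linux/hmm.h>

    struct my_gpu {
        struct hmm_mirror mirror;
        /* ... device page table, DMA engine, ... (hypothetical) */
    };

    /* Hypothetical, driver specific: drop GPU PTEs covering [start, end). */
    static void my_gpu_invalidate_range(struct my_gpu *gpu,
                                        unsigned long start,
                                        unsigned long end);

    /* HMM calls this, via mmu_notifier, when the CPU page table changes. */
    static void my_gpu_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                                  enum hmm_update_type update,
                                                  unsigned long start,
                                                  unsigned long end)
    {
        struct my_gpu *gpu = container_of(mirror, struct my_gpu, mirror);

        /* Invalidate the matching GPU page table entries. */
        my_gpu_invalidate_range(gpu, start, end);
    }

    static const struct hmm_mirror_ops my_gpu_mirror_ops = {
        .sync_cpu_device_pagetables = my_gpu_sync_cpu_device_pagetables,
    };

    /* Bind a process address space to the GPU (one mirror per mm). */
    int my_gpu_bind_mm(struct my_gpu *gpu, struct mm_struct *mm)
    {
        gpu->mirror.ops = &my_gpu_mirror_ops;
        return hmm_mirror_register(&gpu->mirror, mm);
    }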


> 2. other function is migrate CPU memory to device memory, what is the
> application scenario ?

The second part of HMM is to allow registering "special" struct pages
(they are not on the LRU and are associated with a device). Having
struct page allows most of the kernel memory management to stay
oblivious to the underlying memory type (regular DDR memory or device
memory).
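
A rough sketch of what that registration might look like from a driver,
under the assumption that the hmm_devmem_add() helper matches patch 11 of
this series (the my_gpu_* names are hypothetical and the ops callbacks are
deliberately elided rather than guessed at):

    /*
     * A real driver must provide the .fault callback (migrate a page back
     * to regular memory on CPU access) and the .free callback (reclaim a
     * device page on last put); both are elided in this sketch.
     */
    static const struct hmm_devmem_ops my_gpu_devmem_ops = {
    };

    /* Assumes struct my_gpu (earlier sketch) also has a devmem pointer. */
    int my_gpu_register_memory(struct my_gpu *gpu, struct device *dev,
                               unsigned long size)
    {
        struct hmm_devmem *devmem;

        /*
         * Picks an unused physical address range and creates ZONE_DEVICE
         * struct pages for it; the CPU can not address this memory, but
         * the rest of mm now has struct page to reason about it.
         */
        devmem = hmm_devmem_add(&my_gpu_devmem_ops, dev, size);
        if (IS_ERR(devmem))
            return PTR_ERR(devmem);

        gpu->devmem = devmem;
        return 0;
    }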

The migrate helper introduced with HMM allows migrating to and from
device memory using the device DMA engine instead of CPU memcopy. It
is needed because on some platforms the CPU can not access the device
memory, and moreover a DMA engine reaches higher bandwidth more easily
than CPU memcopy.

Again this is a helper. The policy on what to migrate, when, ... is
outside HMM; for now we assume that the device driver is the best place
to have this logic. Maybe in a few years, once we have more device
drivers using that kind of feature, we may grow common code to expose a
common API to userspace for migration policy.
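
For reference, a hedged sketch of how a driver might drive that helper.
The migrate_vma() call and its ops come from patches 13/14 of this series;
the exact signatures should be checked against include/linux/migrate.h
there, and every my_gpu_* name below is hypothetical (struct my_gpu is the
imaginary driver state from the earlier sketch).

    #include <linux/migrate.h>

    /* Allocate device pages and kick the DMA engine for each source page. */
    static void my_gpu_alloc_and_copy(struct vm_area_struct *vma,
                                      const unsigned long *src,
                                      unsigned long *dst,
                                      unsigned long start,
                                      unsigned long end,
                                      void *private)
    {
        /* For each entry in src[], pick a device page, queue a DMA copy,
         * and report the chosen destination through dst[]. */
    }

    /* Wait for the DMA engine, then point the GPU page table at the pages. */
    static void my_gpu_finalize_and_map(struct vm_area_struct *vma,
                                        const unsigned long *src,
                                        const unsigned long *dst,
                                        unsigned long start,
                                        unsigned long end,
                                        void *private)
    {
    }

    static const struct migrate_vma_ops my_gpu_migrate_ops = {
        .alloc_and_copy   = my_gpu_alloc_and_copy,
        .finalize_and_map = my_gpu_finalize_and_map,
    };

    /* src/dst hold one migrate pfn entry per page of [start, end). */
    int my_gpu_migrate_range(struct my_gpu *gpu, struct vm_area_struct *vma,
                             unsigned long start, unsigned long end,
                             unsigned long *src, unsigned long *dst)
    {
        return migrate_vma(&my_gpu_migrate_ops, vma, start, end,
                           src, dst, gpu);
    }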


> some input data created by GPU and migrate back to CPU memory? use for CPU
> to access GPU's data?

It can be used in any number of ways. So yes, all the scenarios you
list do apply. On platforms where the CPU can not access device memory
you need to migrate back to regular memory for CPU access.

Note that the physical memory used behind a virtual address pointer
is transparent to the application, thus you do not need to modify the
application in any way. That is the whole point of HMM.


> 3. function one is help the GPU to access CPU's VMA and  CPU's physical
> memory, if CPU want to access GPU's memory, still need to
> specification device driver API like IOCTL+mmap+DMA?

No, you do not need a special memory allocator with an HMM capable
device (and device driver). The HMM mirror functionality allows any
pointer to point to the same memory on both a device and the CPU for
a given application. This is the fine-grained system SVM as specified
in the OpenCL 2.0 specification.

Basic example is without HMM:
    mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
    {
        gpu_buffer_t gpu_r, gpu_a, gpu_b;

        gpu_r = gpu_alloc(m*m*sizeof(float));
        gpu_a = gpu_alloc(m*m*sizeof(float));
        gpu_b = gpu_alloc(m*m*sizeof(float));
        gpu_copy_to(gpu_a, a, m*m*sizeof(float));
        gpu_copy_to(gpu_b, b, m*m*sizeof(float));

        gpu_mul_mat(gpu_r, gpu_a, gpu_b, m);

        gpu_copy_from(gpu_r, r, m*m*sizeof(float));
    }

With HMM:
    mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
    {
        gpu_mul_mat(r, a, b, m);
    }

So it is going from a world with device specific allocation to a model
where any regular process memory (the outcome of an mmap of a regular
file or anonymous private memory) can be accessed by both CPU and GPU
using the same pointer.


Now on platforms like PCIE, where the CPU can not access the device
memory in a cache coherent way (and with no guarantee regarding CPU
atomic operations), you can not have the CPU access the device memory
without breaking the memory model expected by the programmer.

Hence if some range of the virtual address space of a process has been
migrated to device memory, it must be migrated back to regular memory
on CPU access.

On platforms like CAPI or CCIX you do not need to migrate back to
regular memory on CPU access. HMM provides helpers to handle both
cases.


> 4. is it any example? i remember it has a dummy driver in older patchset
> version. i canot find in this version.

Dummy driver and userspace:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-next
https://github.com/glisse/hmm-dummy-test-suite

nouveau prototype:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau

I was waiting on a rework of nouveau memory management before working
on the final implementation of HMM inside nouveau. That rework is now
mostly ready:

https://github.com/skeggsb/nouveau/tree/devel-fault

I intend to start working on the final HMM support inside nouveau after
the end of year celebration and i hope to have it in some working state
in a couple of months. At the same time we are working on an open source
userspace to make use of that (probably an OpenCL runtime first but we
are looking into other things such as OpenMP, CUDA, ...).

The plan is to upstream all this next year; all the bits are slowly
cooking.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-13 16:12   ` Jerome Glisse
@ 2017-12-14  2:48     ` Figo.zhang
  2017-12-14  3:16       ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Figo.zhang @ 2017-12-14  2:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

[-- Attachment #1: Type: text/plain, Size: 7383 bytes --]

2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:

> On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
> > it mention that HMM provided two functions:
> >
> > 1. one is mirror the page table between CPU and a device(like GPU, FPGA),
> > so i am confused that:  duplicating the CPU page table into a device page
> > table means
> >
> > that copy the CPU page table entry into device page table? so the device
> > can access the CPU's virtual address? and that device can access the the
> CPU
> > allocated physical memory which map into this VMA, right?
> >
> > for example:  VMA -> PA (CPU physical address)
> >
> > mirror: fill the the PTE entry of this VMA into GPU's  page table
> >
> > so:
> >
> > For CPU's view: it can access the PA
> >
> > For GPU's view: it can access the CPU's VMA and PA
> >
> > right?
>
> Correct. This is for platform/device without ATS/PASID. Note that
> HMM only provide helpers to snapshot the CPU page table and properly
> synchronize with concurrent CPU page table update. Most of the code
> is really inside the device driver as each device driver has its own
> architecture and its own page table format.
>
>
> > 2. other function is migrate CPU memory to device memory, what is the
> > application scenario ?
>
> Second part of HMM is to allow to register "special" struct page
> (they are not on the LRU and are associated with a device). Having
> struct page allow most of the kernel memory management to be
> oblivous to the underlying memory type (regular DDR memort or device
> memory).
>
> The migrate helper introduced with HMM is to allow to migrate to
> and from device memory using DMA engine and not CPU memcopy. It
> is needed because on some platform CPU can not access the device
> memory and moreover DMA engine reach higher bandwidth more easily
> than CPU memcopy.
>
> Again this is an helper. The policy on what to migrate, when, ...
> is outside HMM for now we assume that the device driver is the
> best place to have this logic. Maybe in few year once we have more
> device driver using that kind of feature we may grow common code
> to expose common API to userspace for migration policy.


>
> > some input data created by GPU and migrate back to CPU memory? use for
> CPU
> > to access GPU's data?
>
> It can be use in any number of way. So yes all the scenario you
> list do apply. On platform where CPU can not access device memory
> you need to migrate back to regular memory for CPU access.
>
> Note that the physical memory use behind a virtual address pointer
> is transparent to the application thus you do not need to modify
> it in anyway. That is the whole point of HMM.
>
>
> > 3. function one is help the GPU to access CPU's VMA and  CPU's physical
> > memory, if CPU want to access GPU's memory, still need to
> > specification device driver API like IOCTL+mmap+DMA?
>
> No you do not need special memory allocator with an HMM capable
> device (and device driver). HMM mirror functionality is to allow
> any pointer to point to same memory on both a device and CPU for
> a given application. This is the Fine-Grained system SVM as
> specified in the OpenCL 2.0 specification.
>
> Basic example is without HMM:
>     mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
>     {
>         gpu_buffer_t gpu_r, gpu_a, gpu_b;
>
>         gpu_r = gpu_alloc(m*m*sizeof(float));
>         gpu_a = gpu_alloc(m*m*sizeof(float));
>         gpu_b = gpu_alloc(m*m*sizeof(float));
>         gpu_copy_to(gpu_a, a, m*m*sizeof(float));
>         gpu_copy_to(gpu_b, b, m*m*sizeof(float));
>
>         gpu_mul_mat(gpu_r, gpu_a, gpu_b, m);
>
>         gpu_copy_from(gpu_r, r, m*m*sizeof(float));
>     }
>

The traditional workflow is:
1. the pointers a, b and r all point to CPU memory
2. create/alloc three GPU buffers: gpu_a, gpu_b, gpu_r
3. copy CPU memory a and b to GPU memory gpu_a and gpu_b
4. let the GPU do the calculation
5. copy the result from the GPU buffer (gpu_r) to the CPU buffer (r)

is it right?


>
> With HMM:
>     mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
>     {
>         gpu_mul_mat(r, a, b, m);
>     }
>

with HMM, my guess at the workflow is:
1. the CPU has three buffers a, b, r with physical addresses pa, pb, pr,
   and the GPU has three physical buffers: gpu_a, gpu_b, gpu_r
2. the GPU wants to access buffers a and b, causing a GPU page fault
3. the GPU reports the page fault to the CPU
4. the CPU handles the GPU page fault:
   * unmap the buffers a, b, r (who does it? the GPU driver?)
   * copy the contents of buffers a and b to the GPU physical buffers
     gpu_a, gpu_b
   * fill the GPU page table entries for the CPU virtual addresses a, b, r
     with these pages (gpu_a, gpu_b, gpu_r)
5. the GPU does the calculation
6. the CPU wants to get the result from buffer r, which causes a CPU page
   fault
7. in the CPU page fault:
   * unmap the GPU page table entries for the virtual addresses a, b, r
     (who does the unmap? the GPU driver?)
   * copy the GPU's buffer contents (gpu_a, gpu_b, gpu_r) to the CPU
     buffers (a, b, r)
   * fill the CPU page table entries: virtual_addr -> buffer (pa, pb, pr)
8. so the CPU can get the result from buffer r.

is my guessed workflow right?
it seems to need two copies, from CPU to GPU, and then GPU to CPU for the
result.
* or do the CPU and GPU have the page table concurrently, so that no page
fault occurs?
* how about the performance? it sounds like it will create lots of page
faults.


> So it is going from a world with device specific allocation
> to a model where any regular process memory (outcome of an
> mmap to a regular file or for anonymous private memory). can
> be access by both CPU and GPU using same pointer.
>
>
> Now on platform like PCIE where CPU can not access the device
> memory in cache coherent way (also with no garanty regarding
> CPU atomic operations) you can not have the CPU access the device
> memory without breaking memory model expected by the programmer.
>
> Hence if some range of the virtual address space of a process
> has been migrated to device memory it must be migrated back to
> regular memory.
>
> On platform like CAPI or CCIX you do not need to migrate back
> to regular memory on CPU access. HMM provide helpers to handle
> both cases.
>
>
> > 4. is it any example? i remember it has a dummy driver in older patchset
> > version. i canot find in this version.
>
> Dummy driver and userspace:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-next
> https://github.com/glisse/hmm-dummy-test-suite
>
> nouveau prototype:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau
>
> I was waiting on rework of nouveau memory management before
> working on final implementation of HMM inside nouveau. That
> rework is now mostly ready:
>
> https://github.com/skeggsb/nouveau/tree/devel-fault
>
> I intend to start working on final HMM inside nouveau after
> the end of year celebration and i hope to have it in some
> working state in couple month. At the same time we are working
> on an open source userspace to make use of that (probably
> an OpenCL runtime first but we are looking into other thing
> such as OpenMP, CUDA, ...).
>
> Plans is to upstream all this next year, all the bits are
> slowly cooking.
>
> Cheers,
> Jérôme
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-14  2:48     ` Figo.zhang
@ 2017-12-14  3:16       ` Jerome Glisse
  2017-12-14  3:53         ` Figo.zhang
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-12-14  3:16 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> 2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> 
> > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:

[...]

> > Basic example is without HMM:
> >     mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
> >     {
> >         gpu_buffer_t gpu_r, gpu_a, gpu_b;
> >
> >         gpu_r = gpu_alloc(m*m*sizeof(float));
> >         gpu_a = gpu_alloc(m*m*sizeof(float));
> >         gpu_b = gpu_alloc(m*m*sizeof(float));
> >         gpu_copy_to(gpu_a, a, m*m*sizeof(float));
> >         gpu_copy_to(gpu_b, b, m*m*sizeof(float));
> >
> >         gpu_mul_mat(gpu_r, gpu_a, gpu_b, m);
> >
> >         gpu_copy_from(gpu_r, r, m*m*sizeof(float));
> >     }
> >
> 
> The traditional workflow is:
> 1. the pointer a, b and r are total point to the CPU memory
> 2. create/alloc three GPU buffers: gpu_a, gpu_b, gpu_r
> 3. copy CPU memory a and b to GPU memory gpu_b and gpu_b
> 4. let the GPU to do the calculation
> 5.  copy the result from GPU buffer (gpu_r) to CPU buffer (r)
> 
> is it right?

Right.


> > With HMM:
> >     mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
> >     {
> >         gpu_mul_mat(r, a, b, m);
> >     }
> >
> 
> with HMM workflow:
> 1. CPU has three buffer: a, b, r, and it is physical addr is : pa, pb, pr
>      and GPU has tree physical buffer: gpu_a, gpu_b, gpu_r
> 2. GPU want to access buffer a and b, cause a GPU page fault
> 3. GPU report a page fault to CPU
> 4. CPU got a GPU page fault:
>                 * unmap the buffer a,b,r (who do it? GPU driver?)
>                 * copy the buffer a ,b's content to GPU physical buffers:
> gpu_a, gpu_b
>                 * fill the GPU page table entry with these pages (gpu_a,
> gpu_b, gpu_r) of the CPU virtual address: a,b,r;
> 
> 5. GPU do the calculation
> 6. CPU want to get result from buffer r and will cause a CPU page fault:
> 7. in CPU page fault:
>              * unmap the GPU page table entry for virtual address a,b,r.
> (who do the unmap? GPU driver?)
>              * copy the GPU's buffer content (gpu_a, gpu_b, gpu_r) to
> CPU buffer (abr)
>              * fill the CPU page table entry: virtual_addr -> buffer
> (pa,pb,pr)
> 8. so the CPU can get the result form buffer r.
> 
> my guess workflow is right?
> it seems need two copy, from CPU to GPU, and then GPU to CPU for result.
> * is it CPU and GPU have the  page table concurrently, so
> no page fault occur?
> * how about the performance? it sounds will create lots of page fault.

This is not what happens. Here is the workflow with HMM mirror (note that
physical addresses do not matter here so i do not even reference them, it
is all about virtual addresses):
 1 There are 3 buffers a, b and r at given virtual addresses; both CPU and
   GPU can access them (concurrently or not, this does not matter).
 2 The GPU can fault, so if any virtual address does not have a page table
   entry inside the GPU page table this triggers a page fault that will
   call the HMM mirror helper to snapshot the CPU page table into the GPU
   page table. If there is no physical memory backing the virtual address
   (ie the CPU page table is also empty for the given virtual address) then
   the regular page fault handler of the kernel is invoked.
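
A very rough sketch of step 2 from the device driver side. Everything
named my_gpu_* below is hypothetical; the real snapshot/fault helpers are
the ones added in patches 4 and 5 of this series and their exact names and
arguments should be taken from include/linux/hmm.h, not from this email.

    /* Hypothetical wrappers around the HMM snapshot/fault helpers. */
    static int my_gpu_hmm_fault_and_snapshot(struct my_gpu *gpu,
                                             unsigned long addr,
                                             bool write, unsigned long *pfn);
    static bool my_gpu_commit_pte(struct my_gpu *gpu, unsigned long addr,
                                  unsigned long pfn, bool write);

    /* Handle one GPU page fault at virtual address addr of the mirrored mm. */
    int my_gpu_handle_fault(struct my_gpu *gpu, unsigned long addr, bool write)
    {
        unsigned long pfn;
        int ret;

        /*
         * Ask HMM to snapshot the CPU page table entry for addr, faulting
         * the page in through the regular kernel path (handle_mm_fault)
         * if there is no memory backing it yet.
         */
        ret = my_gpu_hmm_fault_and_snapshot(gpu, addr, write, &pfn);
        if (ret)
            return ret;

        /*
         * Fill the GPU page table from the snapshot. If HMM reports that
         * the CPU page table changed in the meantime (mmu_notifier fired),
         * drop the snapshot and retry.
         */
        if (!my_gpu_commit_pte(gpu, addr, pfn, write))
            return -EAGAIN;

        return 0;
    }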

Without HMM mirror but with ATS/PASID (CCIX or CAPI):
 1 There are 3 buffers a, b and r at given virtual addresses; both CPU and
   GPU can access them (concurrently or not, this does not matter).
 2 The GPU uses the exact same page table as the CPU and faults exactly
   like the CPU on an empty page table entry.

So in the end, with HMM mirror or ATS/PASID you get the same behavior.
There is no complexity like you seem to assume. This is all about virtual
addresses. At any point in time any given valid virtual address of a
process points to a given physical memory address, and that physical
memory address is the same on both the CPU and the GPU; they are never
out of sync (both in the HMM mirror and in the ATS/PASID case).

The exception is for platforms that do not have the CAPI or CCIX property,
ie cache coherency for CPU access to device memory. On such platforms,
when you migrate a virtual address to use device physical memory you
update the CPU page table with a special entry. If the CPU tries to access
the virtual address with the special entry it triggers a fault and HMM
will migrate the virtual address back to regular memory. But this does
not apply to CAPI or CCIX platforms.


To minimize page faults the device driver is encouraged to pre-fault and
prepopulate its page table (in the HMM mirror case). Often the device
driver has enough context information to guess what range of virtual
addresses is about to be accessed by the device and thus pre-fault it.
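
As an illustration only (CHUNK_SIZE and the my_gpu_* helpers, including
my_gpu_handle_fault() from the sketch above, are all made up), such a
pre-fault policy could look like:

    #define CHUNK_SIZE (2UL << 20)  /* mirror 2MB of virtual addresses at once */

    int my_gpu_prefault(struct my_gpu *gpu, unsigned long fault_addr, bool write)
    {
        unsigned long start = fault_addr & ~(CHUNK_SIZE - 1);
        unsigned long addr;
        int ret;

        for (addr = start; addr < start + CHUNK_SIZE; addr += PAGE_SIZE) {
            ret = my_gpu_handle_fault(gpu, addr, write);
            /* Only a failure on the address that actually faulted is fatal. */
            if (ret && addr == fault_addr)
                return ret;
        }
        return 0;
    }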


Hope this clarifies things for you.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-14  3:16       ` Jerome Glisse
@ 2017-12-14  3:53         ` Figo.zhang
  2017-12-14  4:16           ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Figo.zhang @ 2017-12-14  3:53 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

[-- Attachment #1: Type: text/plain, Size: 6335 bytes --]

2017-12-14 11:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:

> On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > 2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> >
> > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
>
> [...]
>
> > > Basic example is without HMM:
> > >     mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
> > >     {
> > >         gpu_buffer_t gpu_r, gpu_a, gpu_b;
> > >
> > >         gpu_r = gpu_alloc(m*m*sizeof(float));
> > >         gpu_a = gpu_alloc(m*m*sizeof(float));
> > >         gpu_b = gpu_alloc(m*m*sizeof(float));
> > >         gpu_copy_to(gpu_a, a, m*m*sizeof(float));
> > >         gpu_copy_to(gpu_b, b, m*m*sizeof(float));
> > >
> > >         gpu_mul_mat(gpu_r, gpu_a, gpu_b, m);
> > >
> > >         gpu_copy_from(gpu_r, r, m*m*sizeof(float));
> > >     }
> > >
> >
> > The traditional workflow is:
> > 1. the pointer a, b and r are total point to the CPU memory
> > 2. create/alloc three GPU buffers: gpu_a, gpu_b, gpu_r
> > 3. copy CPU memory a and b to GPU memory gpu_b and gpu_b
> > 4. let the GPU to do the calculation
> > 5.  copy the result from GPU buffer (gpu_r) to CPU buffer (r)
> >
> > is it right?
>
> Right.
>
>
> > > With HMM:
> > >     mul_mat_on_gpu(float *r, float *a, float *b, unsigned m)
> > >     {
> > >         gpu_mul_mat(r, a, b, m);
> > >     }
> > >
> >
> > with HMM workflow:
> > 1. CPU has three buffer: a, b, r, and it is physical addr is : pa, pb, pr
> >      and GPU has tree physical buffer: gpu_a, gpu_b, gpu_r
> > 2. GPU want to access buffer a and b, cause a GPU page fault
> > 3. GPU report a page fault to CPU
> > 4. CPU got a GPU page fault:
> >                 * unmap the buffer a,b,r (who do it? GPU driver?)
> >                 * copy the buffer a ,b's content to GPU physical buffers:
> > gpu_a, gpu_b
> >                 * fill the GPU page table entry with these pages (gpu_a,
> > gpu_b, gpu_r) of the CPU virtual address: a,b,r;
> >
> > 5. GPU do the calculation
> > 6. CPU want to get result from buffer r and will cause a CPU page fault:
> > 7. in CPU page fault:
> >              * unmap the GPU page table entry for virtual address a,b,r.
> > (who do the unmap? GPU driver?)
> >              * copy the GPU's buffer content (gpu_a, gpu_b, gpu_r) to
> > CPU buffer (abr)
> >              * fill the CPU page table entry: virtual_addr -> buffer
> > (pa,pb,pr)
> > 8. so the CPU can get the result form buffer r.
> >
> > my guess workflow is right?
> > it seems need two copy, from CPU to GPU, and then GPU to CPU for result.
> > * is it CPU and GPU have the  page table concurrently, so
> > no page fault occur?
> > * how about the performance? it sounds will create lots of page fault.
>
> This is not what happen. Here is the workflow with HMM mirror (note that
> physical address do not matter here so i do not even reference them it is
> all about virtual address):
>  1 They are 3 buffers a, b and r at given virtual address both CPU and
>    GPU can access them (concurently or not this does not matter).
>  2 GPU can fault so if any virtual address do not have a page table
>    entry inside the GPU page table this trigger a page fault that will
>    call HMM mirror helper to snapshot CPU page table into the GPU page
>    table. If there is no physical memory backing the virtual address
>    (ie CPU page table is also empty for the given virtual address) then
>    the regular page fault handler of the kernel is invoked.
>

so when the HMM mirror is done, the contents of the GPU page table entry
and the CPU page table entry are the same, right? so the GPU and CPU can
access the same physical address, and this physical address was allocated
by the CPU malloc syscall. is there a conflict or race condition if the
CPU and GPU write to this physical address concurrently?

i see these slides said:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf

on pages 22~23:
when a CPU page fault occurs:
* UM (unified memory driver) copies the page data to the CPU, unmaps from
  the GPU
* HMM maps the page to the CPU

when a GPU page fault occurs:
* HMM has a malloc record buffer, so UM copies the page data to the GPU
* HMM unmaps the page from the CPU

so these slides say there will be two copies, from CPU to GPU, and from
GPU to CPU. so in this case (mul_mat_on_gpu()), does it really need two
copies in kernel space?


>
> Without HMM mirror but ATS/PASI (CCIX or CAPI):
>  1 They are 3 buffers a, b and r at given virtual address both CPU and
>    GPU can access them (concurently or not this does not matter).
>  2 GPU use the exact same page table as the CPU and fault exactly like
>    CPU on empty page table entry
>
> So in the end with HMM mirror or ATS/PASID you get the same behavior.
> There is no complexity like you seem to assume. This all about virtual
> address. At any point in time any given valid virtual address of a process
> point to a given physical memory address and that physical memory address
> is the same on both the CPU and the GPU at any point in time they are
> never out of sync (both in HMM mirror and in ATS/PASID case).
>
> The exception is for platform that do not have CAPI or CCIX property ie
> cache coherency for CPU access to device memory. On such platform when
> you migrate a virtual address to use device physical memory you update
> the CPU page table with a special entry. If the CPU try to access the
> virtual address with special entry it trigger fault and HMM will migrate
> the virtual address back to regular memory. But this does not apply for
> CAPI or CCIX platform.
>

an example of a virtual address using device physical memory is:
gpu_r = gpu_alloc(m*m*sizeof(float)),
so when the CPU wants to access gpu_r it will trigger a migration back to
CPU memory, which will allocate CPU pages and copy gpu_r's content to
those CPU pages, right?



>
>
> Too minimize page fault the device driver is encourage to pre-fault and
> prepopulate its page table (the HMM mirror case). Often device driver has
> enough context information to guess what range of virtual address is
> about to be access by the device and thus pre-fault thing.
>
>
> Hope this clarify thing for you.
>
> Cheers,
> Jérôme
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-14  3:53         ` Figo.zhang
@ 2017-12-14  4:16           ` Jerome Glisse
  2017-12-14  7:05             ` Figo.zhang
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2017-12-14  4:16 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Thu, Dec 14, 2017 at 11:53:40AM +0800, Figo.zhang wrote:
> 2017-12-14 11:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> 
> > On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > > 2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > >
> > > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:

[...]

> > This is not what happen. Here is the workflow with HMM mirror (note that
> > physical address do not matter here so i do not even reference them it is
> > all about virtual address):
> >  1 They are 3 buffers a, b and r at given virtual address both CPU and
> >    GPU can access them (concurently or not this does not matter).
> >  2 GPU can fault so if any virtual address do not have a page table
> >    entry inside the GPU page table this trigger a page fault that will
> >    call HMM mirror helper to snapshot CPU page table into the GPU page
> >    table. If there is no physical memory backing the virtual address
> >    (ie CPU page table is also empty for the given virtual address) then
> >    the regular page fault handler of the kernel is invoked.
> >
> 
> so when HMM mirror done, the content of GPU page table entry and
> CPU page table entry
> are same, right? so the GPU and CPU can access the same physical address,
> this physical
> address is allocated by CPU malloc systemcall. is it conflict and race
> condition? CPU and GPU
> write to this physical address concurrently.

Correct, and yes it is conflict free. PCIE platforms already support
cache coherent access by a device to main memory (snoop transactions
in the PCIE specification). Accesses can happen concurrently to the same
byte and it behaves exactly the same as if two CPU cores tried to access
the same byte.
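
In other words, from the programmer's point of view the usual shared
memory rules apply. The toy program below (ordinary C11 plus pthreads,
nothing HMM specific, and the "GPU" is just a second CPU thread because a
real device kernel is out of scope for a sketch) shows the model: one
malloc'ed buffer, one pointer for both sides, and the normal atomics for
synchronization.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct shared {
        float *data;        /* plain malloc'ed memory, same pointer everywhere */
        atomic_int ready;   /* synchronization works as between two CPU cores */
    };

    static void *fake_gpu(void *arg)
    {
        struct shared *s = arg;

        while (!atomic_load(&s->ready))
            ;                       /* wait until the producer publishes */
        s->data[0] *= 2.0f;         /* "device" works on the very same bytes */
        return NULL;
    }

    int main(void)
    {
        struct shared s = { .data = malloc(4096 * sizeof(float)) };
        pthread_t t;

        atomic_init(&s.ready, 0);
        pthread_create(&t, NULL, fake_gpu, &s);
        s.data[0] = 21.0f;          /* producer fills the buffer */
        atomic_store(&s.ready, 1);  /* then publishes it */
        pthread_join(t, NULL);
        printf("%f\n", s.data[0]);  /* prints 42.000000 */
        free(s.data);
        return 0;
    }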

> 
> i see this slides said:
> http://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf
> 
> in page 22~23:
> When CPU page fault occurs:
> * UM (unified memory driver) copies page data to CPU, umaps from GPU
> *HMM maps page to CPU
> 
> when GPU page fault occurs:
> *HMM has a malloc record buffer, so UM copy page data to GPU
> *HMM unmaps page from CPU
> 
> so in this slides, it said it will has two copies, from CPU to GPU, and
> from GPU to CPU. so in this case (mul_mat_on_gpu()), is it really need two
> copies in kernel space?

This slide is for the case where you use device memory on a PCIE platform.
When that happens only the device can access the virtual addresses backed
by device memory. If the CPU tries to access such an address a page fault
is triggered and it migrates the data back to regular memory, where both
GPU and CPU can access it concurrently.

And again this behavior only happens if you use the HMM non cache coherent
device memory model. If you use the device cache coherent model with HMM
then the CPU can access the device memory directly too and the above
scenario never happens.

Note that memory copies when data moves from device to system or from
system to device memory are inevitable. This is exactly as with autoNUMA.
Also note that in some cases things can get allocated directly on the GPU
and never copied back to regular memory (only used by the GPU and freed
once the GPU is done with them): the zero copy case. But i want to stress
that the zero copy case is unlikely to happen for input buffers. Usually
you do not get your input data set directly on the GPU but from network or
disk, and you might do pre-processing on the CPU (uncompress input, or do
something that is better done on the CPU). Then you feed your data to the
GPU and you do the computation there.


> > Without HMM mirror but ATS/PASI (CCIX or CAPI):
> >  1 They are 3 buffers a, b and r at given virtual address both CPU and
> >    GPU can access them (concurently or not this does not matter).
> >  2 GPU use the exact same page table as the CPU and fault exactly like
> >    CPU on empty page table entry
> >
> > So in the end with HMM mirror or ATS/PASID you get the same behavior.
> > There is no complexity like you seem to assume. This all about virtual
> > address. At any point in time any given valid virtual address of a process
> > point to a given physical memory address and that physical memory address
> > is the same on both the CPU and the GPU at any point in time they are
> > never out of sync (both in HMM mirror and in ATS/PASID case).
> >
> > The exception is for platform that do not have CAPI or CCIX property ie
> > cache coherency for CPU access to device memory. On such platform when
> > you migrate a virtual address to use device physical memory you update
> > the CPU page table with a special entry. If the CPU try to access the
> > virtual address with special entry it trigger fault and HMM will migrate
> > the virtual address back to regular memory. But this does not apply for
> > CAPI or CCIX platform.
> >
> 
> the example of the virtual address using device physical memory is : gpu_r
> = gpu_alloc(m*m*sizeof(float)),
> so CPU want to access gpu_r will trigger migrate back to CPU memory,
> it will allocate CPU page and copy
> to gpu_r's content to CPU pages, right?

No. Here we are always talking about virtual addresses that are the
outcome of an mmap syscall, either as private anonymous memory or as an
mmap of a regular file (ie not a device file but a regular file on a
filesystem).

The device driver can migrate any virtual address to use device memory for
performance reasons (how, why and when such migration happens is totally
opaque to HMM, it is under the control of the device driver).

So if you do:
   BUFA = malloc(size);
Then do something with BUFA on the CPU (like reading input or network, ...)
the memory is likely to be allocated with regular main memory (like DDR).

Now if you start some job on your GPU that accesses BUFA, the device driver
might call the migrate_vma() helper to migrate the memory to device memory.
At that point the virtual address of BUFA points to physical device memory,
here CAPI or CCIX. If it is not CAPI/CCIX then the GPU page table points to
device memory while the CPU page table points to an invalid special entry.
The GPU can work on BUFA, which now resides inside the device memory.
Finally, in the non CAPI/CCIX case, if the CPU tries to access that memory
then a migration back to regular memory happens.


What you really need is to decouple the virtual address part from the
physical memory that is backing a virtual address. HMM provides helpers
for both aspects. The first is to mirror the page table so that every
virtual address points to the same physical address. The second side of
HMM is to allow using device memory transparently inside a process, by
allowing any virtual address to be migrated to device memory. Both aspects
are orthogonal to each other.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-14  4:16           ` Jerome Glisse
@ 2017-12-14  7:05             ` Figo.zhang
  2017-12-14 15:28               ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Figo.zhang @ 2017-12-14  7:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

[-- Attachment #1: Type: text/plain, Size: 8139 bytes --]

2017-12-14 12:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:

> On Thu, Dec 14, 2017 at 11:53:40AM +0800, Figo.zhang wrote:
> > 2017-12-14 11:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> >
> > > On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > > > 2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > > >
> > > > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:
>
> [...]
>
> > > This is not what happen. Here is the workflow with HMM mirror (note
> that
> > > physical address do not matter here so i do not even reference them it
> is
> > > all about virtual address):
> > >  1 They are 3 buffers a, b and r at given virtual address both CPU and
> > >    GPU can access them (concurently or not this does not matter).
> > >  2 GPU can fault so if any virtual address do not have a page table
> > >    entry inside the GPU page table this trigger a page fault that will
> > >    call HMM mirror helper to snapshot CPU page table into the GPU page
> > >    table. If there is no physical memory backing the virtual address
> > >    (ie CPU page table is also empty for the given virtual address) then
> > >    the regular page fault handler of the kernel is invoked.
> > >
> >
> > so when HMM mirror done, the content of GPU page table entry and
> > CPU page table entry
> > are same, right? so the GPU and CPU can access the same physical address,
> > this physical
> > address is allocated by CPU malloc systemcall. is it conflict and race
> > condition? CPU and GPU
> > write to this physical address concurrently.
>
> Correct and yes it is conflict free. PCIE platform already support
> cache coherent access by device to main memory (snoop transaction
> in PCIE specification). Access can happen concurently to same byte
> and it behave exactly the same as if two CPU core try to access the
> same byte.
>
> >
> > i see this slides said:
> > http://on-demand.gputechconf.com/gtc/2017/presentation/
> s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf
> >
> > in page 22~23:
> > When CPU page fault occurs:
> > * UM (unified memory driver) copies page data to CPU, umaps from GPU
> > *HMM maps page to CPU
> >
> > when GPU page fault occurs:
> > *HMM has a malloc record buffer, so UM copy page data to GPU
> > *HMM unmaps page from CPU
> >
> > so in this slides, it said it will has two copies, from CPU to GPU, and
> > from GPU to CPU. so in this case (mul_mat_on_gpu()), is it really need
> two
> > copies in kernel space?
>
> This slide is for the case where you use device memory on PCIE platform.
> When that happen only the device can access the virtual address back by
> device memory. If CPU try to access such address a page fault is trigger
> and it migrate the data back to regular memory where both GPU and CPU can
> access it concurently.
>
> And again this behavior only happen if you use HMM non cache coherent
> device memory model. If you use the device cache coherent model with HMM
> then CPU can access the device memory directly too and above scenario
> never happen.
>
> Note that memory copy when data move from device to system or from system
> to device memory are inevitable. This is exactly as with autoNUMA. Also
> note that in some case thing can get allocated directly on GPU and never
> copied back to regular memory (only use by GPU and freed once GPU is done
> with them) the zero copy case. But i want to stress that the zero copy
> case is unlikely to happen for input buffer. Usualy you do not get your
> input data set directly on the GPU but from network or disk and you might
> do pre-processing on CPU (uncompress input, or do something that is better
> done on the CPU). Then you feed your data to the GPU and you do computation
> there.
>

Great! Very detailed explanation of HMM, thanks a lot.
would you check whether my conclusions are correct?
* if CCIX/CAPI is supported, the CPU can access GPU memory directly, and
the GPU can also access CPU memory directly, so no copy is needed in
kernel space in the HMM solution.

* if CCIX/CAPI is not supported, the CPU cannot access GPU memory in a
cache coherent way, and the GPU also cannot access CPU memory with cache
coherency. it needs some copies, like in John Hubbard's slides:
   * when a GPU page fault occurs, copy the data from the CPU page to the
GPU page, and HMM unmaps the CPU page...
   * when a CPU page fault occurs, copy the data from the GPU page to the
CPU page, unmap the GPU page and map the CPU page...


>
>
> > > Without HMM mirror but ATS/PASI (CCIX or CAPI):
> > >  1 They are 3 buffers a, b and r at given virtual address both CPU and
> > >    GPU can access them (concurently or not this does not matter).
> > >  2 GPU use the exact same page table as the CPU and fault exactly like
> > >    CPU on empty page table entry
> > >
> > > So in the end with HMM mirror or ATS/PASID you get the same behavior.
> > > There is no complexity like you seem to assume. This all about virtual
> > > address. At any point in time any given valid virtual address of a
> process
> > > point to a given physical memory address and that physical memory
> address
> > > is the same on both the CPU and the GPU at any point in time they are
> > > never out of sync (both in HMM mirror and in ATS/PASID case).
> > >
> > > The exception is for platform that do not have CAPI or CCIX property ie
> > > cache coherency for CPU access to device memory. On such platform when
> > > you migrate a virtual address to use device physical memory you update
> > > the CPU page table with a special entry. If the CPU try to access the
> > > virtual address with special entry it trigger fault and HMM will
> migrate
> > > the virtual address back to regular memory. But this does not apply for
> > > CAPI or CCIX platform.
> > >
> >
> > the example of the virtual address using device physical memory is :
> gpu_r
> > = gpu_alloc(m*m*sizeof(float)),
> > so CPU want to access gpu_r will trigger migrate back to CPU memory,
> > it will allocate CPU page and copy
> > to gpu_r's content to CPU pages, right?
>
> No. Here we are always talking about virtual address that are the outcome
> of an mmap syscall either as private anonymous memory or as mmap of regular
> file (ie not a device file but a regular file on a filesystem).
>
> Device driver can migrate any virtual address to use device memory for
> performance reasons (how, why and when such migration happens is totaly
> opaque to HMM it is under the control of the device driver).
>
> So if you do:
>    BUFA = malloc(size);
> Then do something with BUFA on the CPU (like reading input or network, ...)
> the memory is likely to be allocated with regular main memory (like DDR).
>
> Now if you start some job on your GPU that access BUFA the device driver
> might call migrate_vma() helper to migrate the memory to device memory. At
> that point the virtual address of BUFA point to physical device memory here
> CAPI or CCIX. If it is not CAPI/CCIX than the GPU page table point to
> device
> memory while the CPU page table point to invalid special entry. The GPU can
> work on BUFA that now reside inside the device memory. Finaly, in the non
> CAPI/CCIX case, if CPU try to access that memory then a migration back to
> regular memory happen.
>

in this scenario:
* if CAPI/CCIX is supported, do the CPU's page table and the GPU's both
point to the device physical page? in this case, does it still need the
ZONE_DEVICE infrastructure for the CPU page table?

* if CAPI/CCIX is not supported, the CPU's page table is filled with an
invalid special pte.


>
> What you really need is to decouple the virtual address part from what is
> the physical memory that is backing a virtual address. HMM provide helpers
> for both aspect. First to mirror page table so that every virtual address
> point to same physical address. Second side of HMM is to allow to use
> device
> memory transparently inside a process by allowing to migrate any virtual
> address to use device memory. Both aspect are orthogonal to each others.
>
> Cheers,
> Jérôme
>


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25
  2017-12-14  7:05             ` Figo.zhang
@ 2017-12-14 15:28               ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2017-12-14 15:28 UTC (permalink / raw)
  To: Figo.zhang
  Cc: Andrew Morton, Linux MM, John Hubbard, Dan Williams,
	David Nellans, Balbir Singh

On Thu, Dec 14, 2017 at 03:05:39PM +0800, Figo.zhang wrote:
> 2017-12-14 12:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > On Thu, Dec 14, 2017 at 11:53:40AM +0800, Figo.zhang wrote:
> > > 2017-12-14 11:16 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > > > On Thu, Dec 14, 2017 at 10:48:36AM +0800, Figo.zhang wrote:
> > > > > 2017-12-14 0:12 GMT+08:00 Jerome Glisse <jglisse@redhat.com>:
> > > > > > On Wed, Dec 13, 2017 at 08:10:42PM +0800, Figo.zhang wrote:

[...]

> > This slide is for the case where you use device memory on a PCIE platform.
> > When that happens, only the device can access the virtual addresses backed
> > by device memory. If the CPU tries to access such an address, a page fault
> > is triggered and the data is migrated back to regular memory, where both
> > GPU and CPU can access it concurrently.
> >
> > And again, this behavior only happens if you use the HMM non cache coherent
> > device memory model. If you use the device cache coherent model with HMM,
> > then the CPU can access the device memory directly too and the above
> > scenario never happens.
> >
> > Note that memory copies when data moves from device to system memory, or
> > from system to device memory, are inevitable. This is exactly as with
> > autoNUMA. Also note that in some cases things can get allocated directly
> > on the GPU and never copied back to regular memory (only used by the GPU
> > and freed once the GPU is done with them): the zero copy case. But I want
> > to stress that the zero copy case is unlikely to happen for input buffers.
> > Usually you do not get your input data set directly on the GPU; it comes
> > from network or disk, and you might do pre-processing on the CPU
> > (uncompress the input, or do something else that is better done on the
> > CPU). Then you feed your data to the GPU and do the computation there.
> >
> 
> Great, a very detailed explanation of HMM, thanks a lot.
> So could you check whether my conclusions are correct?
> * If CCIX/CAPI is supported, the CPU can access GPU memory directly, and
> the GPU can also access CPU memory directly,
> so there is no need for a copy in kernel space in the HMM solution.

Yes, but migration does imply a copy. The physical address backing a virtual
address can change over the lifetime of that virtual address (between mmap and
munmap), as a result of various activity (auto NUMA, compaction, swap out then
swap back in, ...) and, in the case that interests us, as the result of a
device driver migrating things to its device memory.


> * If CCIX/CAPI is not supported, the CPU cannot access GPU memory in a
> cache coherent way, and the GPU also cannot access CPU memory with cache
> coherency. It needs some copies, as in John Hubbard's slides:
>    * when the GPU page faults, data needs to be copied from the CPU page to
> the GPU page, and HMM unmaps the CPU page...
>    * when the CPU page faults, data needs to be copied from the GPU page to
> the CPU page, and the GPU page is unmapped and the CPU page mapped...

No, the GPU can access main memory just fine (it snoops PCIE transactions and
is fully cache coherent with the CPU). Only the CPU can not access the device
memory. So there is a special case only when migrating some virtual address
to use device memory.

What is described in John's slides is what happens when you migrate some
virtual addresses to device memory where the CPU can not access them. This
migration is not necessary for the GPU to access memory. It only happens as
an optimization, when the device driver suspects it will make frequent
accesses to that memory and that the CPU will not try to access it.
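
To sketch what "under the control of the device driver" can look like with
the migrate_vma() helper from patch 14 (prototypes paraphrased from that
patch; the gpu_* policy and copy helpers are hypothetical and error handling
is elided):

#include <linux/migrate.h>      /* migrate_vma(), struct migrate_vma_ops */
#include <linux/mm.h>

/* Hypothetical driver helpers: a placement heuristic plus the two copy
 * callbacks that migrate_vma() requires (allocate device pages and copy the
 * data in, then fix up device mappings once the CPU side is committed). */
static bool gpu_expects_frequent_access(unsigned long start, unsigned long end);
static void gpu_alloc_and_copy(struct vm_area_struct *vma,
                               const unsigned long *src, unsigned long *dst,
                               unsigned long start, unsigned long end,
                               void *private);
static void gpu_finalize_and_map(struct vm_area_struct *vma,
                                 const unsigned long *src,
                                 const unsigned long *dst,
                                 unsigned long start, unsigned long end,
                                 void *private);

static const struct migrate_vma_ops gpu_migrate_ops = {
        .alloc_and_copy   = gpu_alloc_and_copy,
        .finalize_and_map = gpu_finalize_and_map,
};

/* Optionally migrate one page of a VMA to device memory.  Nothing breaks if
 * the driver decides not to: the GPU keeps using main memory through the
 * mirrored page table. */
static int gpu_maybe_migrate(struct vm_area_struct *vma, unsigned long addr)
{
        unsigned long src = 0, dst = 0;

        if (!gpu_expects_frequent_access(addr, addr + PAGE_SIZE))
                return 0;

        return migrate_vma(&gpu_migrate_ops, vma, addr, addr + PAGE_SIZE,
                           &src, &dst, NULL);
}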

[...]

> > No. Here we are always talking about virtual addresses that are the outcome
> > of an mmap syscall, either as private anonymous memory or as an mmap of a
> > regular file (ie not a device file but a regular file on a filesystem).
> >
> > The device driver can migrate any virtual address to use device memory for
> > performance reasons (how, why and when such migration happens is totally
> > opaque to HMM; it is under the control of the device driver).
> >
> > So if you do:
> >    BUFA = malloc(size);
> > and then do something with BUFA on the CPU (like reading input from disk or
> > network, ...), the memory is likely to be allocated from regular main memory
> > (like DDR).
> >
> > Now if you start some job on your GPU that accesses BUFA, the device driver
> > might call the migrate_vma() helper to migrate the memory to device memory.
> > At that point the virtual addresses of BUFA point to physical device memory
> > (CAPI or CCIX attached, for instance). If it is not CAPI/CCIX, then the GPU
> > page table points to device memory while the CPU page table points to an
> > invalid special entry. The GPU can work on BUFA, which now resides in device
> > memory. Finally, in the non CAPI/CCIX case, if the CPU tries to access that
> > memory, a migration back to regular memory happens.
> >
> 
> In this scenario:
> * With CAPI/CCIX support, do the CPU's page table and the GPU's both point
> to the device physical page?
> In that case, does the CPU page table still need the ZONE_DEVICE
> infrastructure?

Correct, in the CAPI/CCIX case there is only one page table, and thus after
migration both CPU and GPU point to the same physical addresses for the
virtual addresses of BUFA.

> * Without CAPI/CCIX support, the CPU's page table is filled with an invalid
> special pte.

Correct. This is the case described in John's slides.


The physical memory backing a virtual address can change at any time for many
different reasons (autonuma, compaction, swap out followed by swap in, ...),
and migration (from one physical memory type to another) for accelerator
purposes is just a new reason on that list.
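
In code terms, the non CAPI/CCIX case shows up as a device-private swap entry
in the CPU page table, and the fault path added by this series dispatches it
back to the owning driver. A simplified sketch (helper names follow the
swapops.h additions in this series; the real logic lives in do_swap_page()
and kernel/memremap.c):

#include <linux/mm.h>
#include <linux/memremap.h>
#include <linux/swapops.h>

/* Simplified: what happens when the CPU touches a virtual address whose pte
 * is a device-private entry (ie the data sits in un-addressable device
 * memory). */
static int handle_device_private_fault(struct vm_fault *vmf)
{
        swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
        struct page *page;

        /* do_swap_page() only takes this path for a device-private entry. */
        if (WARN_ON_ONCE(!is_device_private_entry(entry)))
                return VM_FAULT_SIGBUS;

        page = device_private_entry_to_page(entry);

        /* Ask the driver, through its dev_pagemap page_fault() callback, to
         * migrate the page back to system RAM so the access can proceed. */
        return page->pgmap->page_fault(vmf->vma, vmf->address, page,
                                       vmf->flags, vmf->pmd);
}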

Cheers,
Jérôme

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
  2017-08-17  0:05 ` [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5 Jérôme Glisse
@ 2018-12-20  8:33   ` Dan Williams
  2018-12-20 16:15     ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-12-20  8:33 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, John Hubbard,
	David Nellans, Balbir Singh, Ross Zwisler

On Wed, Aug 16, 2017 at 5:06 PM Jérôme Glisse <jglisse@redhat.com> wrote:
>
> HMM (heterogeneous memory management) needs struct page to support migration
> from system main memory to device memory.  The reasons for HMM and for
> migration to device memory are explained in the HMM core patch.
>
> This patch deals with device memory that is un-addressable (ie the CPU can
> not access it). Hence we do not want those struct page to be managed like
> regular memory. That is why we extend ZONE_DEVICE to support different types
> of memory.
>
> A persistent memory type is defined for existing users of ZONE_DEVICE, and a
> new device un-addressable type is added for the un-addressable memory. There
> is a clear separation between what is expected from each memory type, and
> existing users of ZONE_DEVICE are un-affected by the new requirements and the
> new use of the un-addressable type. All type-specific code paths are
> protected with tests against the memory type.
>
> Because the memory is un-addressable, we use a new special swap type for when
> a page is migrated to device memory (this reduces the maximum number of swap
> files).
>
> The two main additions beside the memory type to ZONE_DEVICE are two
> callbacks. The first one, page_free(), is called whenever the page refcount
> reaches 1 (which means the page is free, as a ZONE_DEVICE page never reaches
> a refcount of 0). This allows the device driver to manage its memory and the
> associated struct page.
>
> The second callback, page_fault(), happens when there is a CPU access to an
> address that is backed by a device page (which is un-addressable by the CPU).
> This callback is responsible for migrating the page back to system main
> memory. The device driver can not block migration back to system memory; HMM
> makes sure that such a page can not be pinned into device memory.
>
> If the device is in some error condition and can not migrate memory back,
> then a CPU page fault to device memory should end with SIGBUS.
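
A driver-side view of these two callbacks could look roughly like this; the
signatures match the typedefs in the hunks below, while the gpu_* helpers and
the gpu_devmem context are hypothetical:

#include <linux/memremap.h>
#include <linux/mm.h>

struct gpu_devmem;                              /* hypothetical driver state */
static void gpu_free_device_page(struct gpu_devmem *devmem, unsigned long pfn);
static int gpu_migrate_back_to_ram(struct vm_area_struct *vma,
                                   unsigned long addr, const struct page *page,
                                   unsigned int flags, pmd_t *pmdp);

/* Refcount dropped back to 1: the struct page stays (ZONE_DEVICE), only the
 * device backing storage is returned to the driver's allocator. */
static void gpu_devmem_page_free(struct page *page, void *data)
{
        struct gpu_devmem *devmem = data;

        gpu_free_device_page(devmem, page_to_pfn(page));
}

/* CPU touched un-addressable device memory: migrate it back to system RAM,
 * or fail the fault (SIGBUS) if the device can no longer do so. */
static int gpu_devmem_page_fault(struct vm_area_struct *vma,
                                 unsigned long addr,
                                 const struct page *page,
                                 unsigned int flags,
                                 pmd_t *pmdp)
{
        return gpu_migrate_back_to_ram(vma, addr, page, flags, pmdp);
}
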
>
> Changed since v4:
>   - s/DEVICE_PUBLIC/DEVICE_HOST (to free DEVICE_PUBLIC for HMM-CDM)
> Changed since v3:
>   - fix comments that were still using UNADDRESSABLE as a keyword
>   - kernel configuration simplification
> Changed since v2:
>   - s/DEVICE_UNADDRESSABLE/DEVICE_PRIVATE
> Changed since v1:
>   - rename to device private memory (from device unaddressable)
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[..]
>  fs/proc/task_mmu.c       |  7 +++++
>  include/linux/ioport.h   |  1 +
>  include/linux/memremap.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h       | 12 ++++++++
>  include/linux/swap.h     | 24 ++++++++++++++--
>  include/linux/swapops.h  | 68 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/memremap.c        | 34 ++++++++++++++++++++++
>  mm/Kconfig               | 11 +++++++-
>  mm/memory.c              | 61 ++++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c      | 10 +++++--
>  mm/mprotect.c            | 14 ++++++++++
>  11 files changed, 309 insertions(+), 6 deletions(-)
>
[..]
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 93416196ba64..8e164ec9eed0 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -4,6 +4,8 @@
>  #include <linux/ioport.h>
>  #include <linux/percpu-refcount.h>
>
> +#include <asm/pgtable.h>
> +

So it turns out, over a year later, that this include was a mistake
and makes the build fragile.

>  struct resource;
>  struct device;
>
[..]
> +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
> +                               unsigned long addr,
> +                               const struct page *page,
> +                               unsigned int flags,
> +                               pmd_t *pmdp);

I recently included this file somewhere that did not have a pile of
other mm headers included and 0day reports:

  In file included from arch/m68k/include/asm/pgtable_mm.h:148:0,
                    from arch/m68k/include/asm/pgtable.h:5,
                    from include/linux/memremap.h:7,
                    from drivers//dax/bus.c:3:
   arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset':
>> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct'
     return mm->pgd + pgd_index(address);
              ^~
I assume this pulls in the entirety of pgtable.h just to get the pmd_t
definition?

> +typedef void (*dev_page_free_t)(struct page *page, void *data);
> +
>  /**
>   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> + * @page_fault: callback when CPU fault on an unaddressable device page
> + * @page_free: free page callback when page refcount reaches 1
>   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>   * @res: physical address range covered by @ref
>   * @ref: reference count that pins the devm_memremap_pages() mapping
>   * @dev: host device of the mapping for debug
> + * @data: private data pointer for page_free()
> + * @type: memory type: see MEMORY_* in memory_hotplug.h
>   */
>  struct dev_pagemap {
> +       dev_page_fault_t page_fault;

Rather than try to figure out how to forward declare pmd_t, how about
just moving dev_page_fault_t out of the generic dev_pagemap and into the
HMM specific container structure? This should be straightforward on top
of the recent refactor.
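
One way the suggestion could land, sketched under the assumption that the HMM
container embeds the dev_pagemap so container_of() can recover it from
page->pgmap (names illustrative, not a final interface):

/* include/linux/memremap.h no longer needs asm/pgtable.h: dev_pagemap keeps
 * only the type-agnostic pieces. */
struct dev_pagemap {
        dev_page_free_t page_free;
        /* ... existing fields, minus page_fault ... */
};

/* The pmd_t-using typedef moves next to the HMM code (eg include/linux/hmm.h),
 * which already pulls in the page table types. */
typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
                                unsigned long addr,
                                const struct page *page,
                                unsigned int flags,
                                pmd_t *pmdp);

struct hmm_devmem {
        struct dev_pagemap pagemap;     /* embedded */
        dev_page_fault_t page_fault;    /* moved out of dev_pagemap */
        /* ... */
};

/* The device-private fault path then resolves the callback through the
 * container instead of through dev_pagemap itself: */
static int hmm_devmem_fault(struct vm_fault *vmf, struct page *page)
{
        struct hmm_devmem *devmem =
                container_of(page->pgmap, struct hmm_devmem, pagemap);

        return devmem->page_fault(vmf->vma, vmf->address, page,
                                  vmf->flags, vmf->pmd);
}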

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
  2018-12-20  8:33   ` Dan Williams
@ 2018-12-20 16:15     ` Jerome Glisse
  2018-12-20 16:47       ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, John Hubbard,
	David Nellans, Balbir Singh, Ross Zwisler

On Thu, Dec 20, 2018 at 12:33:47AM -0800, Dan Williams wrote:
> On Wed, Aug 16, 2017 at 5:06 PM Jérôme Glisse <jglisse@redhat.com> wrote:
[..]
> Rather than try to figure out how to forward declare pmd_t, how about
> just moving dev_page_fault_t out of the generic dev_pagemap and into the
> HMM specific container structure? This should be straightforward on top
> of the recent refactor.

Fine with me.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
  2018-12-20 16:15     ` Jerome Glisse
@ 2018-12-20 16:47       ` Dan Williams
  2018-12-20 16:57         ` Jerome Glisse
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2018-12-20 16:47 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, John Hubbard,
	David Nellans, Balbir Singh, Ross Zwisler

On Thu, Dec 20, 2018 at 8:15 AM Jerome Glisse <jglisse@redhat.com> wrote:
[..]
> > Rather than try to figure out how to forward declare pmd_t, how about
> > just moving dev_page_fault_t out of the generic dev_pagemap and into the
> > HMM specific container structure? This should be straightforward on top
> > of the recent refactor.
>
> Fine with me.

I was hoping you would reply with a patch. I'll take a look...

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5
  2018-12-20 16:47       ` Dan Williams
@ 2018-12-20 16:57         ` Jerome Glisse
  0 siblings, 0 replies; 66+ messages in thread
From: Jerome Glisse @ 2018-12-20 16:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Linux Kernel Mailing List, linux-mm, John Hubbard,
	David Nellans, Balbir Singh, Ross Zwisler

On Thu, Dec 20, 2018 at 08:47:39AM -0800, Dan Williams wrote:
> On Thu, Dec 20, 2018 at 8:15 AM Jerome Glisse <jglisse@redhat.com> wrote:
> [..]
> > > Rather than try to figure out how to forward declare pmd_t, how about
> > > just moving dev_page_fault_t out of the generic dev_pagemap and into the
> > > HMM specific container structure? This should be straightforward on top
> > > of the recent refactor.
> >
> > Fine with me.
> 
> I was hoping you would reply with a patch. I'll take a look...

Bit busy right now but I can do that after the new year :)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2018-12-20 16:57 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-17  0:05 [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 01/19] hmm: heterogeneous memory management documentation v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 02/19] mm/hmm: heterogeneous memory management (HMM for short) v5 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 03/19] mm/hmm/mirror: mirror process address space on device with HMM helpers v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 04/19] mm/hmm/mirror: helper to snapshot CPU page table v4 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 05/19] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 06/19] mm/memory_hotplug: introduce add_pages Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 07/19] mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory v5 Jérôme Glisse
2018-12-20  8:33   ` Dan Williams
2018-12-20 16:15     ` Jerome Glisse
2018-12-20 16:47       ` Dan Williams
2018-12-20 16:57         ` Jerome Glisse
2017-08-17  0:05 ` [HMM-v25 08/19] mm/ZONE_DEVICE: special case put_page() for device private pages v4 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 09/19] mm/memcontrol: allow to uncharge page without using page->lru field Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 10/19] mm/memcontrol: support MEMORY_DEVICE_PRIVATE v4 Jérôme Glisse
2017-09-05 17:13   ` Laurent Dufour
2017-09-05 17:21     ` Jerome Glisse
2017-08-17  0:05 ` [HMM-v25 11/19] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE v7 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 12/19] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 13/19] mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY Jérôme Glisse
2017-08-17 21:12   ` Andrew Morton
2017-08-17 21:44     ` Jerome Glisse
2017-08-17  0:05 ` [HMM-v25 14/19] mm/migrate: new memory migration helper for use with device memory v5 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 15/19] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 16/19] mm/migrate: support un-addressable ZONE_DEVICE page in migration v3 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 17/19] mm/migrate: allow migrate_vma() to alloc new page on empty entry v4 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 18/19] mm/device-public-memory: device memory cache coherent with CPU v5 Jérôme Glisse
2017-08-17  0:05 ` [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3 Jérôme Glisse
2017-09-04  3:09   ` Bob Liu
2017-09-04 15:51     ` Jerome Glisse
2017-09-05  1:13       ` Bob Liu
2017-09-05  2:38         ` Jerome Glisse
2017-09-05  3:50           ` Bob Liu
2017-09-05 13:50             ` Jerome Glisse
2017-09-05 16:18               ` Dan Williams
2017-09-05 19:00               ` Ross Zwisler
2017-09-05 19:20                 ` Jerome Glisse
2017-09-08 19:43                   ` Ross Zwisler
2017-09-08 20:29                     ` Jerome Glisse
2017-09-05 18:54           ` Ross Zwisler
2017-09-06  1:25             ` Bob Liu
2017-09-06  2:12               ` Jerome Glisse
2017-09-07  2:06                 ` Bob Liu
2017-09-07 17:00                   ` Jerome Glisse
2017-09-07 17:27                   ` Jerome Glisse
2017-09-08  1:59                     ` Bob Liu
2017-09-08 20:43                       ` Dan Williams
2017-11-17  3:47                         ` chetan L
2017-09-05  3:36       ` Balbir Singh
2017-08-17 21:39 ` [HMM-v25 00/19] HMM (Heterogeneous Memory Management) v25 Andrew Morton
2017-08-17 21:55   ` Jerome Glisse
2017-08-17 21:59     ` Dan Williams
2017-08-17 22:02       ` Jerome Glisse
2017-08-17 22:06         ` Dan Williams
2017-08-17 22:16       ` Andrew Morton
2017-12-13 12:10 ` Figo.zhang
2017-12-13 16:12   ` Jerome Glisse
2017-12-14  2:48     ` Figo.zhang
2017-12-14  3:16       ` Jerome Glisse
2017-12-14  3:53         ` Figo.zhang
2017-12-14  4:16           ` Jerome Glisse
2017-12-14  7:05             ` Figo.zhang
2017-12-14 15:28               ` Jerome Glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).