* [PATCH 0/4] hmm: fixes and documentations
@ 2018-03-15 18:36 ` jglisse
  0 siblings, 0 replies; 24+ messages in thread
From: jglisse @ 2018-03-15 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Ralph Campbell, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

All patches only impact HMM users; there are no implications outside HMM.

The first patch improves the documentation to better reflect what HMM is. The
second patch fixes #if/#else placement in hmm.h. The third patch adds a
callback on mm release, which helps device drivers that use HMM clean up early
when a process quits. Finally, the last patch modifies the CPU snapshot and
page fault helpers to simplify device drivers. The nouveau patchset I posted
last week already depends on all of these patches.

You can find them in a hmm-for-4.17 branch:

git://people.freedesktop.org/~glisse/linux
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-4.17

Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>

Jérôme Glisse (2):
  mm/hmm: fix header file if/else/endif maze
  mm/hmm: change CPU page table snapshot functions to simplify drivers

Ralph Campbell (2):
  mm/hmm: documentation editorial update to HMM documentation
  mm/hmm: HMM should have a callback before MM is destroyed

 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 include/linux/hmm.h      | 147 ++++++++++---------
 mm/hmm.c                 | 351 +++++++++++++++++++++++++--------------------
 4 files changed, 468 insertions(+), 391 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] mm/hmm: documentation editorial update to HMM documentation
  2018-03-15 18:36 ` jglisse
@ 2018-03-15 18:36   ` jglisse
  -1 siblings, 0 replies; 24+ messages in thread
From: jglisse @ 2018-03-15 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the HMM documentation to fix minor typos and to make
the phrasing a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Stephen Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..c7418c56a0ac 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non-conventional memory (device
+memory like GPU on-board memory) into regular kernel code paths. The cornerstone
+of this is a specialized struct page for such memory (see sections 5 to 7 of
+this document).
+
+HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e. allowing
+a device to transparently access program addresses coherently with the CPU,
+meaning that any valid pointer on the CPU is also a valid pointer for the
+device. This is becoming mandatory to simplify the use of advanced heterogeneous
+computing where GPUs, DSPs, or FPGAs are used to perform various computations
+on behalf of a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows
+leveraging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on board memory (several giga bytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretely, this means that code that wants to leverage devices like GPUs needs to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set requires re-mapping all the pointer relations between each of
+its elements. This is error prone and the program gets harder to debug because
+of the duplicate data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocted by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address spaces due to a few limitations. Most I/O buses only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from CPU is even more limited. More
+often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from the device on main memory. This is worse
+in the other direction: the CPU can only access a limited range of the device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Sadly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for shared address space to make sense, not only must we allow the device to
 access any memory memory but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This can not be done through common code for all devices. This is
+why HMM provides helpers to factor out everything that can be while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms, and everything
+looks as if the page were swapped out to disk from the CPU point of view. Using
+a struct page gives the easiest and cleanest integration with existing mm
+mechanisms. Here again, HMM only provides helpers, first to hotplug new
+ZONE_DEVICE memory for the device memory and second to perform migration.
+Policy decisions of what and when to migrate things are left to the device
+driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror a process
+address space, keeping both CPU and device page tables synchronized, but also
+allows leveraging device memory by migrating the part of the data-set that is
+actively being used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+the CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that want to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entries if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held when calling hmm_vma_range_done() to
+avoid any race with a concurrent CPU page table update.
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations later on, like doing
+concurrent device updates in multi-device scenarios.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedance mismatch between how CPU page table updates
+are done (by CPU write to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are written to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. The first used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page
+which it can not do. This second callback must trigger a migration back to
+system memory.
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform copies to and from device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions, it works on a range of virtual addresses.
+There are two reasons for that. First, a device DMA copy has a high setup cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is that the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() callback might decide not to migrate all pages in the
+range (for reasons under the callback's control). For those, the callback just
+has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for a file backed page) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if the device page is used for anonymous memory, file if the device
+page is used for file backed memory, or shmem if the device page is used for
+shared memory). This is a deliberate choice to keep existing applications, which
+might start using device memory without knowing about it, running unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory and thus not freeing much
+system memory. We want to gather more real world experience on how applications
+and systems react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+The same decision was made for memory cgroup. Device memory pages are accounted
 against same memory cgroup a regular page would be accounted to. This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory can not fail because it would
 go above memory cgroup limit. We might revisit this choice latter on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned by a device driver nor through GUP
 and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in case of shared memory or file backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3
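
As an illustration of the section 4 usage pattern described in the patch above,
here is a minimal, hypothetical driver sketch. It assumes the pre-series helper
signatures as they appear in the documentation hunk (patch 4/4 of this series
reworks them); the example_device structure, its lock, the pfns array type, and
any parameters not visible in the hunk are assumptions made for illustration:

  #include <linux/hmm.h>
  #include <linux/mm.h>
  #include <linux/mutex.h>

  /* Hypothetical per-device state; only the embedded hmm_mirror is HMM's. */
  struct example_device {
          struct hmm_mirror mirror;   /* registered with hmm_mirror_register() */
          struct mutex update_lock;   /* also taken in the update() callback */
  };

  static int example_mirror_range(struct example_device *edev,
                                  struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end,
                                  hmm_pfn_t *pfns)
  {
          struct hmm_range range;
          int ret;

  again:
          /* Snapshot CPU page table entries for [start, end) into pfns[]. */
          ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
          if (ret)
                  return ret;

          /*
           * Take the same lock the update() callback takes, so a concurrent
           * CPU page table invalidation cannot race with the device update.
           */
          mutex_lock(&edev->update_lock);
          if (!hmm_vma_range_done(vma, &range)) {
                  /* The snapshot was invalidated in the meantime; retry. */
                  mutex_unlock(&edev->update_lock);
                  goto again;
          }

          /* ... use the pfns[] snapshot to update the device page table ... */

          mutex_unlock(&edev->update_lock);
          return 0;
  }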

^ permalink raw reply related	[flat|nested] 24+ messages in thread
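
Similarly, for the migrate_vma() helper discussed in section 6, a hedged sketch
of how a driver might batch a range. The function and ops names come from the
documentation above; the callback bodies are left empty, the example_* names
are invented, and the const qualifiers plus the middle parameters of
migrate_vma() (start, end, src) are assumptions beyond what the hunk shows:

  #include <linux/migrate.h>
  #include <linux/mm.h>

  #define EXAMPLE_NPAGES 16   /* small batch for brevity; real drivers batch more */

  static void example_alloc_and_copy(struct vm_area_struct *vma,
                                     const unsigned long *src,
                                     unsigned long *dst,
                                     unsigned long start, unsigned long end,
                                     void *private)
  {
          /*
           * Allocate device pages and start DMA copies; leave dst[i] == 0
           * for any page the driver decides not to migrate.
           */
  }

  static void example_finalize_and_map(struct vm_area_struct *vma,
                                       const unsigned long *src,
                                       const unsigned long *dst,
                                       unsigned long start, unsigned long end,
                                       void *private)
  {
          /* Catch pages that were not migrated and clean up. */
  }

  static const struct migrate_vma_ops example_migrate_ops = {
          .alloc_and_copy   = example_alloc_and_copy,
          .finalize_and_map = example_finalize_and_map,
  };

  static int example_migrate_range(struct vm_area_struct *vma,
                                   unsigned long start)
  {
          unsigned long src[EXAMPLE_NPAGES] = {}, dst[EXAMPLE_NPAGES] = {};

          /* Works on a virtual address range; holes are simply skipped. */
          return migrate_vma(&example_migrate_ops, vma, start,
                             start + EXAMPLE_NPAGES * PAGE_SIZE,
                             src, dst, NULL);
  }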

* [PATCH 1/4] mm/hmm: documentation editorial update to HMM documentation
@ 2018-03-15 18:36   ` jglisse
  0 siblings, 0 replies; 24+ messages in thread
From: jglisse @ 2018-03-15 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the documentation for HMM to fix minor typos and
phrasing to be a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Cc: Stephen  Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..c7418c56a0ac 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non conventional memory (device
+memory like GPU on board memory) into regular kernel code path. Corner stone of
+this being specialize struct page for such memory (see sections 5 to 7 of this
+document).
+
+HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a
+device to transparently access program address coherently with the CPU meaning
+that any valid pointer on the CPU is also a valid pointer for the device. This
+is becoming a mandatory to simplify the use of advance heterogeneous computing
+where GPU, DSP, or FPGA are used to perform various computations on behalf of
+a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows lever-
+aging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on board memory (several giga bytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretly this means that code that wants to leverage devices like GPUs need to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set needs to re-map all the pointer relations between each of its
+elements. This is error prone and program gets harder to debug because of the
+duplicate data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocted by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address due to few limitations. Most I/O buses only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from CPU is even more limited. More
+often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from device on main memory. This is worse
+in the other direction, the CPU can only access a limited range of the device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Saddly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for shared address space to make sense, not only must we allow device to
 access any memory memory but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This can not be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides, is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms and everything
+looks like a page is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm mech-
+anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
+memory for the device memory and second to perform migration. Policy decisions
+of what and when to migrate things is left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror process address
+space and keeping both CPU and device page table synchronized, but also lever-
+ages device memory by migrating the part of the data-set that is actively being
+used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that want to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entry if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() to avoid
+any race with a concurrent CPU page table update.
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations latter on like doing
+concurrent device updates in multi-devices scenario.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedence missmatch between how CPU page table updates
+are done (by CPU write to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are writen to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. First one used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory, which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page,
+which it cannot do directly. This second callback must trigger a migration
+back to system memory.
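
 As a rough sketch only (driver_page_free() and driver_fault_to_ram() are
 hypothetical driver functions, and the trailing hmm_devmem_add() arguments
 are elided as in the prototype above):

  static const struct hmm_devmem_ops driver_devmem_ops = {
      /* called when the last reference on a device page is dropped */
      .free  = driver_page_free,
      /* called on CPU access to a device page; must migrate it back */
      .fault = driver_fault_to_ram,
  };

  devmem = hmm_devmem_add(&driver_devmem_ops, device, ...);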
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU cannot access device memory, migration must use the device DMA
+engine to copy data to and from device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions, it works on a range of virtual addresses;
+there are two reasons for that. First, a device DMA copy has a high setup cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exercise pointless. The second reason is that the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. The first one
+(alloc_and_copy()) controls destination memory allocation and the copy
+operation. The second one is there to allow the device driver to perform
+cleanup operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). The helper does not
+fail because of this; it just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() callback might decide not to migrate all pages in the
+range (for reasons under the callback's control). For those, the callback just
+has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for file backed pages) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then finalize_and_map() can catch any pages that were not
+migrated. Note that those pages were still copied to new pages, so we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
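
 Putting this together, a driver-side migration could be sketched as below
 (driver_alloc_and_copy() and driver_finalize_and_map() are hypothetical, and
 the arguments elided from the migrate_vma() prototype above are assumed to
 be the start/end addresses and the src pfn array):

  static const struct migrate_vma_ops driver_migrate_ops = {
      .alloc_and_copy   = driver_alloc_and_copy,
      .finalize_and_map = driver_finalize_and_map,
  };

  ret = migrate_vma(&driver_migrate_ops, vma, start, end, src, dst, private);
  /* Pages that were skipped, or whose migration failed, show up in the
   * src/dst arrays; the call itself does not fail because of them. */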
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if the device page is used for anonymous memory, file if the device
+page is used for a file backed page, or shmem if the device page is used for
+shared memory). This is a deliberate choice to keep existing applications,
+which might start using device memory without knowing about it, running
+unimpacted.
+
+A drawback is that the OOM killer might kill an application using a lot of
+device memory but not a lot of regular system memory, and thus not free much
+system memory. We want to gather more real-world experience on how applications
+and systems react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+The same decision was made for memory cgroup. Device memory pages are accounted
 against the same memory cgroup that a regular page would be accounted to. This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory cannot fail due to exceeding the
 memory cgroup limit. We might revisit this choice later on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned, neither by a device driver nor
 through GUP, and thus such memory is always freed upon process exit, or when
-is drop in case of share memory or file back memory.
+the last reference is dropped in the case of shared memory or file backed
+memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] mm/hmm: fix header file if/else/endif maze
  2018-03-15 18:36 ` jglisse
@ 2018-03-15 18:36   ` jglisse
  -1 siblings, 0 replies; 24+ messages in thread
From: jglisse @ 2018-03-15 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Ralph Campbell, John Hubbard, Evgeny Baskakov

From: Jérôme Glisse <jglisse@redhat.com>

The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
---
 include/linux/hmm.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 325017ad9311..ef6044d08cc5 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -498,6 +498,9 @@ struct hmm_device {
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+#else /* IS_ENABLED(CONFIG_HMM) */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM) */
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
@@ -513,8 +516,4 @@ static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
-
-#else /* IS_ENABLED(CONFIG_HMM) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
-static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* LINUX_HMM_H */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

  2018-03-15 18:36 ` jglisse
@ 2018-03-15 18:36   ` jglisse
  -1 siblings, 0 replies; 24+ messages in thread
From: jglisse @ 2018-03-15 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

The hmm_mirror_register() function registers a callback for when
the CPU pagetable is modified. Normally, the device driver will
call hmm_mirror_unregister() when the process using the device is
finished. However, if the process exits uncleanly, the mm_struct
can be destroyed with no warning to the device driver.
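
For illustration, a driver could hook the new callback roughly as follows
(struct driver_mirror and driver_stop_dma() are hypothetical, assuming the
driver embeds the hmm_mirror in its own mirror structure):

	static void driver_mirror_release(struct hmm_mirror *mirror)
	{
		struct driver_mirror *dm =
			container_of(mirror, struct driver_mirror, mirror);

		/* The mm is going away: stop device access and drop any
		 * remaining references to this mirror. */
		driver_stop_dma(dm);
	}

	static const struct hmm_mirror_ops driver_mirror_ops = {
		.release = driver_mirror_release,
		/* .sync_cpu_device_pagetables = ..., */
	};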

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 10 ++++++++++
 mm/hmm.c            | 20 +++++++++++++++++++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ef6044d08cc5..61b0e1c05ee1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,6 +218,16 @@ enum hmm_update_type {
  * @update: callback to update range on a device
  */
 struct hmm_mirror_ops {
+	/* release() - release hmm_mirror
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 *
+	 * This is called when the mm_struct is being released.
+	 * The callback should make sure no references to the mirror occur
+	 * after the callback returns.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
 	/* sync_cpu_device_pagetables() - synchronize page tables
 	 *
 	 * @mirror: pointer to struct hmm_mirror
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..db24d9f9f046 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -160,6 +160,23 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	up_read(&hmm->mirrors_sem);
 }
 
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm *hmm = mm->hmm;
+	struct hmm_mirror *mirror;
+	struct hmm_mirror *mirror_next;
+
+	VM_BUG_ON(!hmm);
+
+	down_write(&hmm->mirrors_sem);
+	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
+		list_del_init(&mirror->list);
+		if (mirror->ops->release)
+			mirror->ops->release(mirror);
+	}
+	up_write(&hmm->mirrors_sem);
+}
+
 static void hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
@@ -185,6 +202,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.release		= hmm_release,
 	.invalidate_range_start	= hmm_invalidate_range_start,
 	.invalidate_range_end	= hmm_invalidate_range_end,
 };
@@ -230,7 +248,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del(&mirror->list);
+	list_del_init(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] mm/hmm: change CPU page table snapshot functions to simplify drivers
  2018-03-15 18:36 ` jglisse
@ 2018-03-15 18:37   ` jglisse
  -1 siblings, 0 replies; 24+ messages in thread
From: jglisse @ 2018-03-15 18:37 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This changes the hmm_vma_fault() and hmm_vma_get_pfns() API to allow HMM
to directly write entries that can match any device page table entry
format. The device driver now provides an array of flag values and we use
an enum to index this array for each flag.

This also allows the device driver to ask for a write fault on a per page
basis, making the API more flexible to service multiple device page faults
in one go.
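
For example (illustrative only, the DRV_* values stand for whatever bits and
pfn shift the device page table format uses):

	static const uint64_t driver_pfn_flags[HMM_PFN_FLAG_MAX] = {
		[HMM_PFN_FLAG_VALID] = DRV_PTE_VALID,
		[HMM_PFN_FLAG_WRITE] = DRV_PTE_WRITE,
		/* ... remaining HMM_PFN_FLAG_* entries ... */
	};

	/* range.vma, range.start/end and range.pfns are set up as before */
	range.flags = driver_pfn_flags;
	range.pfn_shift = DRV_PTE_PFN_SHIFT;
	ret = hmm_vma_fault(&range, true);
	/* range.pfns[] now holds entries in the device page table format */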

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 130 +++++++++++----------
 mm/hmm.c            | 331 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 249 insertions(+), 212 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 61b0e1c05ee1..34e8a8c65bbd 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,11 +80,10 @@
 struct hmm;
 
 /*
- * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
+ * uint64_t - HMM uses its own pfn type to keep several flags per page
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
- * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
@@ -92,64 +91,94 @@ struct hmm;
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
+ * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
  */
-typedef unsigned long hmm_pfn_t;
+enum hmm_pfn_flag_e {
+	HMM_PFN_FLAG_VALID = 0,
+	HMM_PFN_FLAG_WRITE,
+	HMM_PFN_FLAG_ERROR,
+	HMM_PFN_FLAG_NONE,
+	HMM_PFN_FLAG_SPECIAL,
+	HMM_PFN_FLAG_DEVICE_PRIVATE,
+	HMM_PFN_FLAG_MAX
+};
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @vma: the vm area struct for the range
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @flags: pfn flags to match device driver page table
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+	struct vm_area_struct	*vma;
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	uint64_t		*pfns;
+	const uint64_t		*flags;
+	uint8_t			pfn_shift;
+	bool			valid;
+};
+#define HMM_RANGE_PFN_FLAG(f) (range->flags[HMM_PFN_FLAG_##f])
 
-#define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_READ (1 << 1)
-#define HMM_PFN_WRITE (1 << 2)
-#define HMM_PFN_ERROR (1 << 3)
-#define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_SPECIAL (1 << 5)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
-#define HMM_PFN_SHIFT 7
 
 /*
- * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
- * @pfn: hmm_pfn_t to convert to struct page
+ * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: uint64_t to convert to struct page
  * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
  *
- * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * If the uint64_t is valid (ie valid flag set) then return the struct page
  * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (!(pfn & HMM_RANGE_PFN_FLAG(VALID)))
 		return NULL;
-	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+	return pfn_to_page(pfn >> range->pfn_shift);
 }
 
 /*
- * hmm_pfn_t_to_pfn() - return pfn value store in a hmm_pfn_t
- * @pfn: hmm_pfn_t to extract pfn from
- * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ * hmm_pfn_to_pfn() - return pfn value stored in a hmm_pfn_t
+ * @pfn: uint64_t to extract pfn from
+ * Returns: pfn value if uint64_t is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (!(pfn & HMM_RANGE_PFN_FLAG(VALID)))
 		return -1UL;
-	return (pfn >> HMM_PFN_SHIFT);
+	return (pfn >> range->pfn_shift);
 }
 
 /*
- * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
+ * hmm_pfn_from_page() - create a valid uint64_t value from struct page
+ * @range: struct hmm_range pointer where pfn encoding constants are
  * @page: struct page pointer for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the page
+ * Returns: valid uint64_t for the page
  */
-static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(const struct hmm_range *range,
+					 struct page *page)
 {
-	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (page_to_pfn(page) << range->pfn_shift) |
+		HMM_RANGE_PFN_FLAG(VALID);
 }
 
 /*
- * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * hmm_pfn_from_pfn() - create a valid uint64_t value from pfn
+ * @range: struct hmm_range pointer where pfn encoding constants are
  * @pfn: pfn value for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the pfn
+ * Returns: valid uint64_t for the pfn
  */
-static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
+					unsigned long pfn)
 {
-	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (pfn << range->pfn_shift) | HMM_RANGE_PFN_FLAG(VALID);
 }
 
 
@@ -271,23 +300,6 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
-/*
- * struct hmm_range - track invalidation lock on virtual address range
- *
- * @list: all range lock are on a list
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @pfns: array of pfns (big enough for the range)
- * @valid: pfns array did not change since it has been fill by an HMM function
- */
-struct hmm_range {
-	struct list_head	list;
-	unsigned long		start;
-	unsigned long		end;
-	hmm_pfn_t		*pfns;
-	bool			valid;
-};
-
 /*
  * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
  * driver lock that serializes device page table updates, then call
@@ -301,17 +313,13 @@ struct hmm_range {
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns);
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+int hmm_vma_get_pfns(struct hmm_range *range);
+bool hmm_vma_range_done(struct hmm_range *range);
 
 
 /*
  * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * not migrate any device memory back to system memory. The uint64_t array will
  * be updated with the fault result and current snapshot of the CPU page table
  * for the range.
  *
@@ -320,20 +328,14 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
  * function returns -EAGAIN.
  *
  * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the hmm_pfn_t array to
+ * address or not. Therefore, the caller must inspect the uint64_t array to
  * determine fault status for each address.
  *
  * Trying to fault inside an invalid vma will result in -EINVAL.
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block);
+int hmm_vma_fault(struct hmm_range *range, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index db24d9f9f046..fe92a580e6af 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -258,60 +258,63 @@ struct hmm_vma_walk {
 	unsigned long		last;
 	bool			fault;
 	bool			block;
-	bool			write;
 };
 
 static int hmm_vma_do_fault(struct mm_walk *walk,
 			    unsigned long addr,
-			    hmm_pfn_t *pfn)
+			    uint64_t *pfn)
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	int r;
 
 	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
-	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	flags |= (*pfn) & HMM_RANGE_PFN_FLAG(WRITE) ? FAULT_FLAG_WRITE : 0;
 	r = handle_mm_fault(vma, addr, flags);
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
 	if (r & VM_FAULT_ERROR) {
-		*pfn = HMM_PFN_ERROR;
+		*pfn = HMM_RANGE_PFN_FLAG(ERROR);
 		return -EFAULT;
 	}
 
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(hmm_pfn_t *pfns,
+static void hmm_pfns_special(const struct hmm_range *range,
+			     uint64_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
 {
 	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = HMM_PFN_SPECIAL;
+		*pfns = HMM_RANGE_PFN_FLAG(SPECIAL);
 }
 
 static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
-	hmm_pfn_t *pfns = range->pfns;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++)
-		pfns[i] = HMM_PFN_ERROR;
+		pfns[i] = HMM_RANGE_PFN_FLAG(ERROR);
 
 	return 0;
 }
 
-static void hmm_pfns_clear(hmm_pfn_t *pfns,
+static void hmm_pfns_clear(struct hmm_range *range,
+			   uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
 	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
+		*pfns = HMM_RANGE_PFN_FLAG(NONE);
 }
 
 static int hmm_vma_walk_hole(unsigned long addr,
@@ -320,13 +323,13 @@ static int hmm_vma_walk_hole(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = HMM_PFN_EMPTY;
+		pfns[i] = HMM_RANGE_PFN_FLAG(NONE);
 		if (hmm_vma_walk->fault) {
 			int ret;
 
@@ -339,29 +342,146 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
-static int hmm_vma_walk_clear(unsigned long addr,
+static bool hmm_pfn_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
+			       const uint64_t *pfns, unsigned long npages,
+			       uint64_t cpu_flags)
+{
+	struct hmm_range *range = hmm_vma_walk->range;
+	uint64_t mask_valid, mask_write, mask_device;
+	unsigned long i;
+
+	if (!hmm_vma_walk->fault)
+		return false;
+
+	/* Mask flags we care about for fault */
+	mask_valid = HMM_RANGE_PFN_FLAG(VALID);
+	mask_write = HMM_RANGE_PFN_FLAG(WRITE);
+	mask_device = HMM_RANGE_PFN_FLAG(DEVICE_PRIVATE);
+
+	for (i = 0; i < npages; ++i) {
+		/* We aren't asked to do anything ... */
+		if (!(pfns[i] & mask_valid))
+			continue;
+		/* Need to write fault ? */
+		if ((pfns[i] & mask_write) && !(cpu_flags & mask_write))
+			return true;
+		/* Do we fault on device memory ? */
+		if ((pfns[i] & mask_device) && (cpu_flags & mask_device))
+			return true;
+	}
+	return false;
+}
+
+static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
+{
+	if (pmd_protnone(pmd))
+		return 0;
+	return pmd_write(pmd) ? HMM_RANGE_PFN_FLAG(VALID) |
+				HMM_RANGE_PFN_FLAG(WRITE) :
+				HMM_RANGE_PFN_FLAG(VALID);
+}
+
+static int hmm_vma_handle_pmd(struct mm_walk *walk,
+			      unsigned long addr,
 			      unsigned long end,
-			      struct mm_walk *walk)
+			      uint64_t *pfns,
+			      pmd_t pmd)
 {
+	unsigned long npages = (end - addr) >> PAGE_SHIFT, pfn = 0;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
-	unsigned long i;
+	uint64_t cpu_flags, i;
 
-	hmm_vma_walk->last = addr;
-	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = 0;
-		if (hmm_vma_walk->fault) {
-			int ret;
+	cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
+	pfn = cpu_flags ? pmd_pfn(pmd) + pte_index(addr) : 0;
+	pfn = cpu_flags ? hmm_pfn_from_pfn(range, pfn) | cpu_flags : 0;
 
-			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
-			if (ret != -EAGAIN)
-				return ret;
+	if (hmm_pfn_need_fault(hmm_vma_walk, pfns, npages, cpu_flags)) {
+		int ret;
+
+		hmm_vma_walk->last = addr;
+		ret = hmm_vma_do_fault(walk, addr, pfns);
+		return ret ? ret : -EAGAIN;
+	}
+
+	for (i = 0; i < npages; i++) {
+		pfns[i] = pfn;
+		pfn = pfn ? (pfn + (1ULL << range->pfn_shift)) : 0;
+	}
+
+	hmm_vma_walk->last = end;
+	return 0;
+}
+
+static inline uint64_t hmm_vma_handle_pte(struct mm_walk *walk,
+					  unsigned long addr, pmd_t *pmdp,
+					  pte_t *ptep, uint64_t *pfns)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	uint64_t pfn = HMM_RANGE_PFN_FLAG(VALID);
+	struct vm_area_struct *vma = walk->vma;
+	pte_t pte = *ptep;
+	int ret;
+
+	if (pte_none(pte)) {
+		*pfns = HMM_RANGE_PFN_FLAG(NONE);
+		return 0;
+	}
+
+	if (!pte_present(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if (!non_swap_entry(entry)) {
+			if (hmm_pfn_need_fault(hmm_vma_walk, pfns, 1, pfn))
+				goto fault;
+			*pfns = HMM_RANGE_PFN_FLAG(NONE);
+			return 0;
 		}
+
+		if (is_device_private_entry(entry)) {
+			pfn |= range->vma->vm_flags & VM_WRITE ?
+			       HMM_RANGE_PFN_FLAG(DEVICE_PRIVATE) |
+			       HMM_RANGE_PFN_FLAG(WRITE) :
+			       HMM_RANGE_PFN_FLAG(DEVICE_PRIVATE);
+			pfn |= hmm_pfn_from_pfn(range, swp_offset(entry));
+
+			if (hmm_pfn_need_fault(hmm_vma_walk, pfns, 1, pfn))
+				goto fault;
+
+			*pfns = pfn;
+			return 0;
+		}
+
+		if (is_migration_entry(entry)) {
+			if (hmm_vma_walk->fault) {
+				pte_unmap(ptep);
+				hmm_vma_walk->last = addr;
+				migration_entry_wait(vma->vm_mm,
+						     pmdp, addr);
+				return -EAGAIN;
+			}
+
+			*pfns = HMM_RANGE_PFN_FLAG(NONE);
+			return 0;
+		}
+
+		/* Report error for everything else */
+		*pfns = HMM_RANGE_PFN_FLAG(ERROR);
+		return 0;
 	}
 
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
+	pfn |= pte_write(pte) ? HMM_RANGE_PFN_FLAG(WRITE) : 0;
+	pfn |= hmm_pfn_from_pfn(range, pte_pfn(pte));
+	if (!hmm_pfn_need_fault(hmm_vma_walk, pfns, 1, pfn)) {
+		*pfns = pfn;
+		return 0;
+	}
+
+fault:
+	pte_unmap(ptep);
+	ret = hmm_vma_do_fault(walk, addr, pfns);
+	return ret ? ret : -EAGAIN;
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -372,15 +492,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
-	bool write_fault;
-	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
-	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
 	if (pmd_none(*pmdp))
@@ -390,7 +506,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_bad(start, end, walk);
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
-		unsigned long pfn;
 		pmd_t pmd;
 
 		/*
@@ -406,17 +521,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
-		if (pmd_protnone(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
 
-		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
-
-		pfn = pmd_pfn(pmd) + pte_index(addr);
-		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
-		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
-		return 0;
+		return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd);
 	}
 
 	if (pmd_bad(*pmdp))
@@ -424,78 +530,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	ptep = pte_offset_map(pmdp, addr);
 	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
-		pte_t pte = *ptep;
-
-		pfns[i] = 0;
-
-		if (pte_none(pte)) {
-			pfns[i] = HMM_PFN_EMPTY;
-			if (hmm_vma_walk->fault)
-				goto fault;
-			continue;
-		}
-
-		if (!pte_present(pte)) {
-			swp_entry_t entry = pte_to_swp_entry(pte);
-
-			if (!non_swap_entry(entry)) {
-				if (hmm_vma_walk->fault)
-					goto fault;
-				continue;
-			}
+		int ret;
 
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			 */
-			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
-				if (is_write_device_private_entry(entry)) {
-					pfns[i] |= HMM_PFN_WRITE;
-				} else if (write_fault)
-					goto fault;
-				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-				pfns[i] |= flag;
-			} else if (is_migration_entry(entry)) {
-				if (hmm_vma_walk->fault) {
-					pte_unmap(ptep);
-					hmm_vma_walk->last = addr;
-					migration_entry_wait(vma->vm_mm,
-							     pmdp, addr);
-					return -EAGAIN;
-				}
-				continue;
-			} else {
-				/* Report error for everything else */
-				pfns[i] = HMM_PFN_ERROR;
-			}
-			continue;
+		ret = hmm_vma_handle_pte(walk, addr, pmdp, ptep, &pfns[i]);
+		if (ret) {
+			hmm_vma_walk->last = addr;
+			return ret;
 		}
-
-		if (write_fault && !pte_write(pte))
-			goto fault;
-
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
-		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-		continue;
-
-fault:
-		pte_unmap(ptep);
-		/* Fault all pages in range */
-		return hmm_vma_walk_clear(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
+	hmm_vma_walk->last = addr;
 	return 0;
 }
 
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @vma: virtual memory area containing the virtual address range
  * @range: used to track snapshot validity
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
@@ -509,26 +560,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
  * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
  * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns)
+int hmm_vma_get_pfns(struct hmm_range *range)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range, range->pfns, range->start, range->end);
 		return -EINVAL;
 	}
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
@@ -539,9 +587,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -559,14 +604,13 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
-	walk_page_range(start, end, &mm_walk);
+	walk_page_range(range->start, range->end, &mm_walk);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @vma: virtual memory area containing the virtual address range
  * @range: range being tracked
  * Returns: false if range data has been invalidated, true otherwise
  *
@@ -586,10 +630,10 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
- *   if (!hmm_vma_range_done(vma, range)) {
+ *   if (!hmm_vma_range_done(range)) {
  *     device_page_table_unlock();
  *     goto again;
  *   }
@@ -597,15 +641,16 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   device_page_table_lock();
 *   hmm_vma_range_done(range);
  *   device_update_page_table(pfns);
  *   device_page_table_unlock();
  */
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+bool hmm_vma_range_done(struct hmm_range *range)
 {
 	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct vm_area_struct *vma = range->vma;
 	struct hmm *hmm;
 
 	if (range->end <= range->start) {
@@ -629,12 +674,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
- * @vma: virtual memory area containing the virtual address range
  * @range: use to track pfns array content validity
- * @start: fault range virtual start address (inclusive)
- * @end: fault range virtual end address (exclusive)
- * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
- * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
  *
@@ -642,14 +682,14 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  * any memory migration if the memory being faulted is not accessible by CPUs.
  *
  * On error, for one virtual address in the range, the function will set the
- * hmm_pfn_t error flag for the corresponding pfn entry.
+ * uint64_t error flag for the corresponding pfn entry.
  *
  * Expected use pattern:
  * retry:
  *   down_read(&mm->mmap_sem);
  *   // Find vma and address device wants to fault, initialize hmm_pfn_t
  *   // array accordingly
- *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   ret = hmm_vma_fault(range, block);
  *   switch (ret) {
  *   case -EAGAIN:
  *     hmm_vma_range_done(vma, range);
@@ -666,8 +706,9 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   }
  *   // Take device driver lock that serialize device page table update
  *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(vma, range);
- *   // Commit pfns we got from hmm_vma_fault()
+ *   if (hmm_vma_range_done(range)) {
+ *     // Commit pfns we got from hmm_vma_fault()
+ *   }
  *   driver_unlock_device_page_table_update();
  *   up_read(&mm->mmap_sem)
  *
@@ -676,28 +717,24 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block)
+int hmm_vma_fault(struct hmm_range *range, bool block)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
+	unsigned long start;
 	struct hmm *hmm;
 	int ret;
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(pfns, start, end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -705,9 +742,6 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -715,12 +749,11 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range, range->pfns, range->start, range->end);
 		return 0;
 	}
 
 	hmm_vma_walk.fault = true;
-	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
@@ -734,8 +767,9 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
+	start = range->start;
 	do {
-		ret = walk_page_range(start, end, &mm_walk);
+		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
 	} while (ret == -EAGAIN);
 
@@ -743,8 +777,9 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
-		hmm_vma_range_done(vma, range);
+		hmm_pfns_clear(range, &range->pfns[i],
+			       hmm_vma_walk.last, range->end);
+		hmm_vma_range_done(range);
 	}
 	return ret;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

+						     pmdp, addr);
+				return -EAGAIN;
+			}
+
+			*pfns = HMM_RANGE_PFN_FLAG(NONE);
+			return 0;
+		}
+
+		/* Report error for everything else */
+		*pfns = HMM_RANGE_PFN_FLAG(ERROR);
+		return 0;
 	}
 
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
+	pfn |= pte_write(pte) ? HMM_RANGE_PFN_FLAG(WRITE) : 0;
+	pfn |= hmm_pfn_from_pfn(range, pte_pfn(pte));
+	if (!hmm_pfn_need_fault(hmm_vma_walk, pfns, 1, pfn)) {
+		*pfns = pfn;
+		return 0;
+	}
+
+fault:
+	pte_unmap(ptep);
+	ret = hmm_vma_do_fault(walk, addr, pfns);
+	return ret ? ret : -EAGAIN;
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -372,15 +492,11 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
-	bool write_fault;
-	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
-	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
 	if (pmd_none(*pmdp))
@@ -390,7 +506,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_bad(start, end, walk);
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
-		unsigned long pfn;
 		pmd_t pmd;
 
 		/*
@@ -406,17 +521,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
-		if (pmd_protnone(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
 
-		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
-
-		pfn = pmd_pfn(pmd) + pte_index(addr);
-		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
-		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
-		return 0;
+		return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd);
 	}
 
 	if (pmd_bad(*pmdp))
@@ -424,78 +530,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	ptep = pte_offset_map(pmdp, addr);
 	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
-		pte_t pte = *ptep;
-
-		pfns[i] = 0;
-
-		if (pte_none(pte)) {
-			pfns[i] = HMM_PFN_EMPTY;
-			if (hmm_vma_walk->fault)
-				goto fault;
-			continue;
-		}
-
-		if (!pte_present(pte)) {
-			swp_entry_t entry = pte_to_swp_entry(pte);
-
-			if (!non_swap_entry(entry)) {
-				if (hmm_vma_walk->fault)
-					goto fault;
-				continue;
-			}
+		int ret;
 
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			 */
-			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
-				if (is_write_device_private_entry(entry)) {
-					pfns[i] |= HMM_PFN_WRITE;
-				} else if (write_fault)
-					goto fault;
-				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-				pfns[i] |= flag;
-			} else if (is_migration_entry(entry)) {
-				if (hmm_vma_walk->fault) {
-					pte_unmap(ptep);
-					hmm_vma_walk->last = addr;
-					migration_entry_wait(vma->vm_mm,
-							     pmdp, addr);
-					return -EAGAIN;
-				}
-				continue;
-			} else {
-				/* Report error for everything else */
-				pfns[i] = HMM_PFN_ERROR;
-			}
-			continue;
+		ret = hmm_vma_handle_pte(walk, addr, pmdp, ptep, &pfns[i]);
+		if (ret) {
+			hmm_vma_walk->last = addr;
+			return ret;
 		}
-
-		if (write_fault && !pte_write(pte))
-			goto fault;
-
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
-		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-		continue;
-
-fault:
-		pte_unmap(ptep);
-		/* Fault all pages in range */
-		return hmm_vma_walk_clear(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
+	hmm_vma_walk->last = addr;
 	return 0;
 }
 
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @vma: virtual memory area containing the virtual address range
  * @range: used to track snapshot validity
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
@@ -509,26 +560,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
  * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
  * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns)
+int hmm_vma_get_pfns(struct hmm_range *range)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range, range->pfns, range->start, range->end);
 		return -EINVAL;
 	}
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
@@ -539,9 +587,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -559,14 +604,13 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
-	walk_page_range(start, end, &mm_walk);
+	walk_page_range(range->start, range->end, &mm_walk);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @vma: virtual memory area containing the virtual address range
  * @range: range being tracked
  * Returns: false if range data has been invalidated, true otherwise
  *
@@ -586,10 +630,10 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
- *   if (!hmm_vma_range_done(vma, range)) {
+ *   if (!hmm_vma_range_done(range)) {
  *     device_page_table_unlock();
  *     goto again;
  *   }
@@ -597,15 +641,16 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
  *   device_page_table_unlock();
  */
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+bool hmm_vma_range_done(struct hmm_range *range)
 {
 	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct vm_area_struct *vma = range->vma;
 	struct hmm *hmm;
 
 	if (range->end <= range->start) {
@@ -629,12 +674,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
- * @vma: virtual memory area containing the virtual address range
  * @range: use to track pfns array content validity
- * @start: fault range virtual start address (inclusive)
- * @end: fault range virtual end address (exclusive)
- * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
- * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
  *
@@ -642,14 +682,14 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  * any memory migration if the memory being faulted is not accessible by CPUs.
  *
  * On error, for one virtual address in the range, the function will set the
- * hmm_pfn_t error flag for the corresponding pfn entry.
+ * uint64_t error flag for the corresponding pfn entry.
  *
  * Expected use pattern:
  * retry:
  *   down_read(&mm->mmap_sem);
  *   // Find vma and address device wants to fault, initialize hmm_pfn_t
  *   // array accordingly
- *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   ret = hmm_vma_fault(range, block);
  *   switch (ret) {
  *   case -EAGAIN:
  *     hmm_vma_range_done(vma, range);
@@ -666,8 +706,9 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   }
  *   // Take device driver lock that serialize device page table update
  *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(vma, range);
- *   // Commit pfns we got from hmm_vma_fault()
+ *   if (hmm_vma_range_done(range)) {
+ *     // Commit pfns we got from hmm_vma_fault()
+ *   }
  *   driver_unlock_device_page_table_update();
  *   up_read(&mm->mmap_sem)
  *
@@ -676,28 +717,24 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block)
+int hmm_vma_fault(struct hmm_range *range, bool block)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
+	unsigned long start;
 	struct hmm *hmm;
 	int ret;
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(pfns, start, end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -705,9 +742,6 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -715,12 +749,11 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range, range->pfns, range->start, range->end);
 		return 0;
 	}
 
 	hmm_vma_walk.fault = true;
-	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
@@ -734,8 +767,9 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
+	start = range->start;
 	do {
-		ret = walk_page_range(start, end, &mm_walk);
+		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
 	} while (ret == -EAGAIN);
 
@@ -743,8 +777,9 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
-		hmm_vma_range_done(vma, range);
+		hmm_pfns_clear(range, &range->pfns[i],
+			       hmm_vma_walk.last, range->end);
+		hmm_vma_range_done(range);
 	}
 	return ret;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread
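
Concretely, the snapshot pattern documented in the hmm_vma_range_done()
comment above works out to roughly the following on the driver side with
the new range-based API. This is only a sketch: the dev_* helpers, the
dev_hmm_flags[] table and DEV_PFN_SHIFT are hypothetical placeholders for
driver code, while struct hmm_range, its vma/flags/pfn_shift fields and
the hmm_vma_get_pfns()/hmm_vma_range_done() calls are the ones introduced
by this patch.

static int dev_mirror_snapshot(struct vm_area_struct *vma,
			       unsigned long start, unsigned long end,
			       uint64_t *pfns)
{
	struct hmm_range range;
	int ret;

	/* The caller now fills the range before calling into HMM. */
	range.vma = vma;
	range.start = start;
	range.end = end;
	range.pfns = pfns;
	range.flags = dev_hmm_flags;		/* hypothetical driver table */
	range.pfn_shift = DEV_PFN_SHIFT;	/* hypothetical */

again:
	ret = hmm_vma_get_pfns(&range);
	if (ret)
		return ret;

	dev_page_table_lock();			/* driver's own lock */
	if (!hmm_vma_range_done(&range)) {
		/* CPU page table changed while we looked at it, redo. */
		dev_page_table_unlock();
		goto again;
	}
	dev_update_page_table(&range);		/* commit pfns[] to the device */
	dev_page_table_unlock();
	return 0;
}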

* Re: [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed
  2018-03-15 18:36   ` jglisse
  (?)
@ 2018-03-15 22:48   ` Andrew Morton
  2018-03-16  0:54     ` Jerome Glisse
  2018-03-20 11:33     ` Michal Hocko
  -1 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2018-03-15 22:48 UTC (permalink / raw)
  To: jglisse
  Cc: linux-mm, linux-kernel, Ralph Campbell, Evgeny Baskakov,
	Mark Hairgrove, John Hubbard

On Thu, 15 Mar 2018 14:36:59 -0400 jglisse@redhat.com wrote:

> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> The hmm_mirror_register() function registers a callback for when
> the CPU pagetable is modified. Normally, the device driver will
> call hmm_mirror_unregister() when the process using the device is
> finished. However, if the process exits uncleanly, the struct_mm
> can be destroyed with no warning to the device driver.

The changelog doesn't tell us what the runtime effects of the bug are. 
This makes it hard for me to answer the "did Jerome consider doing
cc:stable" question.

> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -160,6 +160,23 @@ static void hmm_invalidate_range(struct hmm *hmm,
>  	up_read(&hmm->mirrors_sem);
>  }
>  
> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +	struct hmm *hmm = mm->hmm;
> +	struct hmm_mirror *mirror;
> +	struct hmm_mirror *mirror_next;
> +
> +	VM_BUG_ON(!hmm);

This doesn't add much value.  We'll reliably oops on the next statement
anyway, which will provide the same info.  And Linus gets all upset at
new BUG_ON() instances.

> +	down_write(&hmm->mirrors_sem);
> +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> +		list_del_init(&mirror->list);
> +		if (mirror->ops->release)
> +			mirror->ops->release(mirror);
> +	}
> +	up_write(&hmm->mirrors_sem);
> +}
> +

^ permalink raw reply	[flat|nested] 24+ messages in thread
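
The driver-side counterpart of the hmm_release() hook quoted above is the
new ->release() callback in hmm_mirror_ops. A rough sketch of how a mirror
driver might use it, where struct dev_mirror and the dev_* helpers are
hypothetical driver code and only struct hmm_mirror, hmm_mirror_ops and
the ->release() signature come from this patch:

struct dev_mirror {
	struct hmm_mirror	mirror;
	/* ... device page table state for this mm ... */
};

static void dev_mirror_release(struct hmm_mirror *mirror)
{
	struct dev_mirror *dmirror;

	dmirror = container_of(mirror, struct dev_mirror, mirror);
	/*
	 * The mm is going away: stop all device access to it right now,
	 * do not wait for the eventual hmm_mirror_unregister().
	 */
	dev_invalidate_device_page_table(dmirror);
	dev_mark_mirror_dead(dmirror);
}

static const struct hmm_mirror_ops dev_mirror_ops = {
	/* .sync_cpu_device_pagetables = dev_sync_cpu_device_pagetables, */
	.release = dev_mirror_release,
};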

* Re: [PATCH 2/4] mm/hmm: fix header file if/else/endif maze
  2018-03-15 18:36   ` jglisse
  (?)
@ 2018-03-16  0:11   ` Balbir Singh
  -1 siblings, 0 replies; 24+ messages in thread
From: Balbir Singh @ 2018-03-16  0:11 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Ralph Campbell,
	John Hubbard, Evgeny Baskakov

On Fri, Mar 16, 2018 at 5:36 AM,  <jglisse@redhat.com> wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
>
> The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> ---

Acked-by: Balbir Singh <bsingharora@gmail.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed
  2018-03-15 22:48   ` Andrew Morton
@ 2018-03-16  0:54     ` Jerome Glisse
  2018-03-16  1:17         ` John Hubbard
  2018-03-20 11:33     ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: Jerome Glisse @ 2018-03-16  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Ralph Campbell, Evgeny Baskakov,
	Mark Hairgrove, John Hubbard

On Thu, Mar 15, 2018 at 03:48:29PM -0700, Andrew Morton wrote:
> On Thu, 15 Mar 2018 14:36:59 -0400 jglisse@redhat.com wrote:
> 
> > From: Ralph Campbell <rcampbell@nvidia.com>
> > 
> > The hmm_mirror_register() function registers a callback for when
> > the CPU pagetable is modified. Normally, the device driver will
> > call hmm_mirror_unregister() when the process using the device is
> > finished. However, if the process exits uncleanly, the struct_mm
> > can be destroyed with no warning to the device driver.
> 
> The changelog doesn't tell us what the runtime effects of the bug are. 
> This makes it hard for me to answer the "did Jerome consider doing
> cc:stable" question.

The impact is low: issues can only arise if the application is killed,
and we don't have any upstream user yet, which is why I did not cc
stable.

> 
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -160,6 +160,23 @@ static void hmm_invalidate_range(struct hmm *hmm,
> >  	up_read(&hmm->mirrors_sem);
> >  }
> >  
> > +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > +	struct hmm *hmm = mm->hmm;
> > +	struct hmm_mirror *mirror;
> > +	struct hmm_mirror *mirror_next;
> > +
> > +	VM_BUG_ON(!hmm);
> 
> This doesn't add much value.  We'll reliably oops on the next statement
> anyway, which will provide the same info.  And Linus gets all upset at
> new BUG_ON() instances.

It is true, this BUG_ON can be dropped. Do you want me to respin?

> 
> > +	down_write(&hmm->mirrors_sem);
> > +	list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> > +		list_del_init(&mirror->list);
> > +		if (mirror->ops->release)
> > +			mirror->ops->release(mirror);
> > +	}
> > +	up_write(&hmm->mirrors_sem);
> > +}
> > +
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed
  2018-03-16  0:54     ` Jerome Glisse
@ 2018-03-16  1:17         ` John Hubbard
  0 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2018-03-16  1:17 UTC (permalink / raw)
  To: Jerome Glisse, Andrew Morton
  Cc: linux-mm, linux-kernel, Ralph Campbell, Evgeny Baskakov, Mark Hairgrove

On 03/15/2018 05:54 PM, Jerome Glisse wrote:
> On Thu, Mar 15, 2018 at 03:48:29PM -0700, Andrew Morton wrote:
>> On Thu, 15 Mar 2018 14:36:59 -0400 jglisse@redhat.com wrote:
>>
>>> From: Ralph Campbell <rcampbell@nvidia.com>
>>>
>>> The hmm_mirror_register() function registers a callback for when
>>> the CPU pagetable is modified. Normally, the device driver will
>>> call hmm_mirror_unregister() when the process using the device is
>>> finished. However, if the process exits uncleanly, the struct_mm
>>> can be destroyed with no warning to the device driver.
>>
>> The changelog doesn't tell us what the runtime effects of the bug are. 
>> This makes it hard for me to answer the "did Jerome consider doing
>> cc:stable" question.
> 
> The impact is low: issues can only arise if the application is killed,
> and we don't have any upstream user yet, which is why I did not cc
> stable.
> 

Hi Jerome and Andrew,

I'd claim that it is not possible to make a safe and correct device
driver without this patch. That's because, without the .release callback
that you're adding here, the driver could end up doing operations on a
stale struct_mm, leading to crashes and other disasters.

Even if people think that window is "small", it's not really any smaller
than lots of race condition problems that we've seen. And it is
definitely not that hard to hit: a good directed stress test involving
multiple threads doing early process termination while also doing lots
of migrations and page faults should suffice.

It is probably best to add this patch to stable for that reason.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] mm/hmm: change CPU page table snapshot functions to simplify drivers
  2018-03-15 18:37   ` jglisse
@ 2018-03-16  5:08     ` John Hubbard
  -1 siblings, 0 replies; 24+ messages in thread
From: John Hubbard @ 2018-03-16  5:08 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/15/2018 11:37 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This change hmm_vma_fault() and hmm_vma_get_pfns() API to allow HMM
> to directly write entry that can match any device page table entry
> format. Device driver now provide an array of flags value and we use
> enum to index this array for each flag.
> 
> This also allow the device driver to ask for write fault on a per page
> basis making API more flexible to service multiple device page faults
> in one go.
> 

Hi Jerome,

This is a large patch, so I'm going to review it in two passes. The first 
pass is just an overview plus the hmm.h changes (now), and tomorrow I will
review the hmm.c, which is where the real changes are.

Overview: the hmm.c changes are doing several things, and it is difficult to
review, because refactoring, plus new behavior, makes diffs less useful here.
It would probably be good to split the hmm.c changes into a few patches, such
as:

	-- HMM_PFN_FLAG_* changes, plus function signature changes (mm_range* 
           being passed to functions), and
        -- New behavior in the page handling loops, and 
	-- Refactoring into new routines (hmm_vma_handle_pte, and others)

That way, reviewers can see more easily that things are correct. 

> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 130 +++++++++++----------
>  mm/hmm.c            | 331 +++++++++++++++++++++++++++++-----------------------
>  2 files changed, 249 insertions(+), 212 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 61b0e1c05ee1..34e8a8c65bbd 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -80,11 +80,10 @@
>  struct hmm;
>  
>  /*
> - * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
> + * uint64_t - HMM uses its own pfn type to keep several flags per page

This line now is a little odd, because it looks like it's trying to document
uint64_t as an HMM pfn type. :) Maybe:

* HMM pfns are of type uint64_t

...or else just delete it, either way.

>   *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid

All of these are missing a _FLAG_ piece. The above should be HMM_PFN_FLAG_VALID,
to match the enum below.

> - * HMM_PFN_READ:  CPU page table has read permission set

So why is it that we don't need the _READ flag anymore? I looked at the corresponding
hmm.c but still don't quite get it. Is it that we just expect that _READ is
always set if there is an entry at all? Or something else?

>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
>   * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
> @@ -92,64 +91,94 @@ struct hmm;
>   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
>   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
>   *      set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
> + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
>   */
> -typedef unsigned long hmm_pfn_t;
> +enum hmm_pfn_flag_e {
> +	HMM_PFN_FLAG_VALID = 0,
> +	HMM_PFN_FLAG_WRITE,
> +	HMM_PFN_FLAG_ERROR,
> +	HMM_PFN_FLAG_NONE,
> +	HMM_PFN_FLAG_SPECIAL,
> +	HMM_PFN_FLAG_DEVICE_PRIVATE,
> +	HMM_PFN_FLAG_MAX
> +};
> +
> +/*
> + * struct hmm_range - track invalidation lock on virtual address range
> + *
> + * @vma: the vm area struct for the range
> + * @list: all range lock are on a list
> + * @start: range virtual start address (inclusive)
> + * @end: range virtual end address (exclusive)
> + * @pfns: array of pfns (big enough for the range)
> + * @flags: pfn flags to match device driver page table
> + * @valid: pfns array did not change since it has been fill by an HMM function
> + */
> +struct hmm_range {
> +	struct vm_area_struct	*vma;
> +	struct list_head	list;
> +	unsigned long		start;
> +	unsigned long		end;
> +	uint64_t		*pfns;
> +	const uint64_t		*flags;
> +	uint8_t			pfn_shift;
> +	bool			valid;
> +};
> +#define HMM_RANGE_PFN_FLAG(f) (range->flags[HMM_PFN_FLAG_##f])

Please please please no. :)  This breaks grep without actually adding any value.
It's not as if you need to build up a whole set of symmetric macros like
the Page* flags do, after all. So we can keep this very simple, instead.

I've looked through the hmm.c and it's always just something like
HMM_RANGE_PFN_FLAG(WRITE), so there really is no need for this macro at all.

Just use HMM_PFN_FLAG_WRITE and friends directly, and enjoy the resulting clarity.


>  
> -#define HMM_PFN_VALID (1 << 0)
> -#define HMM_PFN_READ (1 << 1)
> -#define HMM_PFN_WRITE (1 << 2)
> -#define HMM_PFN_ERROR (1 << 3)
> -#define HMM_PFN_EMPTY (1 << 4)
> -#define HMM_PFN_SPECIAL (1 << 5)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
> -#define HMM_PFN_SHIFT 7
>  
>  /*
> - * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
> - * @pfn: hmm_pfn_t to convert to struct page
> + * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
> + * @pfn: uint64_t to convert to struct page
>   * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
>   *
> - * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
> + * If the uint64_t is valid (ie valid flag set) then return the struct page

I realize that the "uint64_t" above is one of many search-and-replace effects,
but it really should be "if the HMM pfn is valid". Otherwise it's weird--who
ever considered whether a uint64_t is "valid"? heh

>   * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
>   */
> -static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
> +static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
> +					   uint64_t pfn)
>  {
> -	if (!(pfn & HMM_PFN_VALID))
> +	if (!(pfn & HMM_RANGE_PFN_FLAG(VALID)))
>  		return NULL;
> -	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
> +	return pfn_to_page(pfn >> range->pfn_shift);
>  }
>  
>  /*
> - * hmm_pfn_t_to_pfn() - return pfn value store in a hmm_pfn_t
> - * @pfn: hmm_pfn_t to extract pfn from
> - * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
> + * hmm_pfn_to_pfn() - return pfn value store in a hmm_pfn_t
> + * @pfn: uint64_t to extract pfn from

Same as above for the uint64_t that used to be a hmm_pfn_t (I haven't tagged
all of these, but they are all in need of a tweak).

> + * Returns: pfn value if uint64_t is valid, -1UL otherwise
>   */
> -static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
> +static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
> +					   uint64_t pfn)
>  {
> -	if (!(pfn & HMM_PFN_VALID))
> +	if (!(pfn & HMM_RANGE_PFN_FLAG(VALID)))
>  		return -1UL;
> -	return (pfn >> HMM_PFN_SHIFT);
> +	return (pfn >> range->pfn_shift);
>  }
>  
>  /*
> - * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
> + * hmm_pfn_from_page() - create a valid uint64_t value from struct page
> + * @range: struct hmm_range pointer where pfn encoding constant are
>   * @page: struct page pointer for which to create the hmm_pfn_t
> - * Returns: valid hmm_pfn_t for the page
> + * Returns: valid uint64_t for the page
>   */
> -static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
> +static inline uint64_t hmm_pfn_from_page(const struct hmm_range *range,
> +					 struct page *page)
>  {
> -	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> +	return (page_to_pfn(page) << range->pfn_shift) |
> +		HMM_RANGE_PFN_FLAG(VALID);
>  }
>  
>  /*
> - * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
> + * hmm_pfn_from_pfn() - create a valid uint64_t value from pfn
> + * @range: struct hmm_range pointer where pfn encoding constant are
>   * @pfn: pfn value for which to create the hmm_pfn_t
> - * Returns: valid hmm_pfn_t for the pfn
> + * Returns: valid uint64_t for the pfn
>   */
> -static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
> +static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> +					unsigned long pfn)
>  {
> -	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> +	return (pfn << range->pfn_shift) | HMM_RANGE_PFN_FLAG(VALID);
>  }
>  
>  
> @@ -271,23 +300,6 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
>  
> -/*
> - * struct hmm_range - track invalidation lock on virtual address range
> - *
> - * @list: all range lock are on a list
> - * @start: range virtual start address (inclusive)
> - * @end: range virtual end address (exclusive)
> - * @pfns: array of pfns (big enough for the range)
> - * @valid: pfns array did not change since it has been fill by an HMM function
> - */
> -struct hmm_range {
> -	struct list_head	list;
> -	unsigned long		start;
> -	unsigned long		end;
> -	hmm_pfn_t		*pfns;
> -	bool			valid;
> -};
> -
>  /*
>   * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
>   * driver lock that serializes device page table updates, then call
> @@ -301,17 +313,13 @@ struct hmm_range {
>   *
>   * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
>   */
> -int hmm_vma_get_pfns(struct vm_area_struct *vma,
> -		     struct hmm_range *range,
> -		     unsigned long start,
> -		     unsigned long end,
> -		     hmm_pfn_t *pfns);
> -bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
> +int hmm_vma_get_pfns(struct hmm_range *range);
> +bool hmm_vma_range_done(struct hmm_range *range);
>  
>  
>  /*
>   * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
> - * not migrate any device memory back to system memory. The hmm_pfn_t array will
> + * not migrate any device memory back to system memory. The uint64_t array will
>   * be updated with the fault result and current snapshot of the CPU page table
>   * for the range.
>   *
> @@ -320,20 +328,14 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
>   * function returns -EAGAIN.
>   *
>   * Return value does not reflect if the fault was successful for every single
> - * address or not. Therefore, the caller must to inspect the hmm_pfn_t array to
> + * address or not. Therefore, the caller must to inspect the uint64_t array to
>   * determine fault status for each address.
>   *
>   * Trying to fault inside an invalid vma will result in -EINVAL.
>   *
>   * See the function description in mm/hmm.c for further documentation.
>   */
> -int hmm_vma_fault(struct vm_area_struct *vma,
> -		  struct hmm_range *range,
> -		  unsigned long start,
> -		  unsigned long end,
> -		  hmm_pfn_t *pfns,
> -		  bool write,
> -		  bool block);
> +int hmm_vma_fault(struct hmm_range *range, bool block);

OK, even though we're breaking the device driver API, I agree that it is a little 
easier to just pass around the hmm_range* everywhere, so I guess it's worth it.

Like I mentioned above, this is as far as I'm going, tonight. I'll look at 
the hmm.c part tomorrow.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 24+ messages in thread
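
To make the flags discussion above concrete: the core of this patch is
that the driver hands HMM an array indexed by the HMM_PFN_FLAG_* enum, so
HMM can write entries directly in the device's own page table format. A
sketch of that driver-side setup, indexing with the enum values directly
as suggested above; the DEV_PTE_* bits, DEV_PFN_SHIFT and the
dev_hmm_flags[] name are hypothetical:

/* Hypothetical device page table encoding. */
#define DEV_PTE_VALID		(1ULL << 0)
#define DEV_PTE_WRITE		(1ULL << 1)
#define DEV_PTE_ERROR		(1ULL << 2)
#define DEV_PTE_SPECIAL		(1ULL << 3)
#define DEV_PTE_DEVICE_PRIV	(1ULL << 4)
#define DEV_PFN_SHIFT		8

static const uint64_t dev_hmm_flags[HMM_PFN_FLAG_MAX] = {
	[HMM_PFN_FLAG_VALID]		= DEV_PTE_VALID,
	[HMM_PFN_FLAG_WRITE]		= DEV_PTE_WRITE,
	[HMM_PFN_FLAG_ERROR]		= DEV_PTE_ERROR,
	[HMM_PFN_FLAG_NONE]		= 0,
	[HMM_PFN_FLAG_SPECIAL]		= DEV_PTE_SPECIAL,
	[HMM_PFN_FLAG_DEVICE_PRIVATE]	= DEV_PTE_DEVICE_PRIV,
};

/* Handed to HMM once per range, before hmm_vma_get_pfns()/hmm_vma_fault():
 *	range->flags = dev_hmm_flags;
 *	range->pfn_shift = DEV_PFN_SHIFT;
 */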

* Re: [PATCH 4/4] mm/hmm: change CPU page table snapshot functions to simplify drivers
  2018-03-16  5:08     ` John Hubbard
@ 2018-03-16 19:19       ` Jerome Glisse
  -1 siblings, 0 replies; 24+ messages in thread
From: Jerome Glisse @ 2018-03-16 19:19 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Thu, Mar 15, 2018 at 10:08:21PM -0700, John Hubbard wrote:
> On 03/15/2018 11:37 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This change hmm_vma_fault() and hmm_vma_get_pfns() API to allow HMM
> > to directly write entry that can match any device page table entry
> > format. Device driver now provide an array of flags value and we use
> > enum to index this array for each flag.
> > 
> > This also allow the device driver to ask for write fault on a per page
> > basis making API more flexible to service multiple device page faults
> > in one go.
> > 
> 
> Hi Jerome,
> 
> This is a large patch, so I'm going to review it in two passes. The first 
> pass is just an overview plus the hmm.h changes (now), and tomorrow I will
> review the hmm.c, which is where the real changes are.
> 
> Overview: the hmm.c changes are doing several things, and it is difficult to
> review, because refactoring, plus new behavior, makes diffs less useful here.
> It would probably be good to split the hmm.c changes into a few patches, such
> as:
> 
> 	-- HMM_PFN_FLAG_* changes, plus function signature changes (mm_range* 
>            being passed to functions), and
>         -- New behavior in the page handling loops, and 
> 	-- Refactoring into new routines (hmm_vma_handle_pte, and others)
> 
> That way, reviewers can see more easily that things are correct. 

So I resent the patchset and split this patch into many (11 patches now).
I included your comments so far in the new version, so it is probably
better if you look at the new one.

[...]

> > - * HMM_PFN_READ:  CPU page table has read permission set
> 
> So why is it that we don't need the _READ flag anymore? I looked at the corresponding
> hmm.c but still don't quite get it. Is it that we just expect that _READ is
> always set if there is an entry at all? Or something else?

I explained why in the commit message: !READ with WRITE makes no sense,
so now VALID implies READ, as does WRITE (write by itself is meaningless
without valid).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 24+ messages in thread
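
Put differently, with this convention the consuming side of the snapshot
never looks for a separate read bit: anything valid is at least readable,
and the write flag only adds permission. A small sketch of that consuming
side, where dev_map_page() is a hypothetical driver helper and
hmm_pfn_to_page() plus the flag indices come from the patch:

static void dev_mirror_one_entry(struct hmm_range *range, unsigned long i)
{
	uint64_t entry = range->pfns[i];
	struct page *page;

	/* Not valid: nothing is mapped (or the fault failed for it). */
	if (!(entry & range->flags[HMM_PFN_FLAG_VALID]))
		return;

	page = hmm_pfn_to_page(range, entry);
	/* VALID implies read permission; WRITE adds write permission. */
	dev_map_page(range, i, page,
		     !!(entry & range->flags[HMM_PFN_FLAG_WRITE]));
}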

* Re: [PATCH 2/4] mm/hmm: fix header file if/else/endif maze
  2018-03-15 18:36   ` jglisse
@ 2018-03-17  0:53     ` kbuild test robot
  -1 siblings, 0 replies; 24+ messages in thread
From: kbuild test robot @ 2018-03-17  0:53 UTC (permalink / raw)
  To: jglisse
  Cc: kbuild-all, linux-mm, Andrew Morton, linux-kernel,
	Jérôme Glisse, Ralph Campbell, John Hubbard,
	Evgeny Baskakov

[-- Attachment #1: Type: text/plain, Size: 2517 bytes --]

Hi Jérôme,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16-rc5 next-20180316]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/jglisse-redhat-com/mm-hmm-documentation-editorial-update-to-HMM-documentation/20180317-074102
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from kernel/fork.c:40:0:
>> include/linux/hmm.h:515:20: error: redefinition of 'hmm_mm_destroy'
    static inline void hmm_mm_destroy(struct mm_struct *mm) {}
                       ^~~~~~~~~~~~~~
   include/linux/hmm.h:502:20: note: previous definition of 'hmm_mm_destroy' was here
    static inline void hmm_mm_destroy(struct mm_struct *mm) {}
                       ^~~~~~~~~~~~~~
>> include/linux/hmm.h:516:20: error: redefinition of 'hmm_mm_init'
    static inline void hmm_mm_init(struct mm_struct *mm) {}
                       ^~~~~~~~~~~
   include/linux/hmm.h:503:20: note: previous definition of 'hmm_mm_init' was here
    static inline void hmm_mm_init(struct mm_struct *mm) {}
                       ^~~~~~~~~~~

vim +/hmm_mm_destroy +515 include/linux/hmm.h

133ff0eac Jérôme Glisse 2017-09-08  509  
133ff0eac Jérôme Glisse 2017-09-08  510  static inline void hmm_mm_init(struct mm_struct *mm)
133ff0eac Jérôme Glisse 2017-09-08  511  {
133ff0eac Jérôme Glisse 2017-09-08  512  	mm->hmm = NULL;
133ff0eac Jérôme Glisse 2017-09-08  513  }
6b368cd4a Jérôme Glisse 2017-09-08  514  #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
6b368cd4a Jérôme Glisse 2017-09-08 @515  static inline void hmm_mm_destroy(struct mm_struct *mm) {}
6b368cd4a Jérôme Glisse 2017-09-08 @516  static inline void hmm_mm_init(struct mm_struct *mm) {}
6b368cd4a Jérôme Glisse 2017-09-08  517  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
133ff0eac Jérôme Glisse 2017-09-08  518  

:::::: The code at line 515 was first introduced by commit
:::::: 6b368cd4a44ce95b33f1d31f2f932e6ae707f319 mm/hmm: avoid bloating arch that do not make use of HMM

:::::: TO: Jérôme Glisse <jglisse@redhat.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6733 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread
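
A redefinition error like this is the usual symptom of the if/else/endif
maze the patch title refers to: the !CONFIG stubs end up visible at the
same time as another set of definitions. Purely as an illustration (this
is not the real hmm.h layout), the guards are meant to leave exactly one
definition per configuration, and a misplaced #else or #endif lets both
branches survive preprocessing:

#if IS_ENABLED(CONFIG_HMM_MIRROR)
void hmm_mm_destroy(struct mm_struct *mm);

static inline void hmm_mm_init(struct mm_struct *mm)
{
	mm->hmm = NULL;
}
#else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
static inline void hmm_mm_destroy(struct mm_struct *mm) {}
static inline void hmm_mm_init(struct mm_struct *mm) {}
#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */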

* Re: [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed
  2018-03-15 22:48   ` Andrew Morton
  2018-03-16  0:54     ` Jerome Glisse
@ 2018-03-20 11:33     ` Michal Hocko
  2018-03-20 14:45         ` Jerome Glisse
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2018-03-20 11:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jglisse, linux-mm, linux-kernel, Ralph Campbell, Evgeny Baskakov,
	Mark Hairgrove, John Hubbard

On Thu 15-03-18 15:48:29, Andrew Morton wrote:
> On Thu, 15 Mar 2018 14:36:59 -0400 jglisse@redhat.com wrote:
> 
> > From: Ralph Campbell <rcampbell@nvidia.com>
> > 
> > The hmm_mirror_register() function registers a callback for when
> > the CPU pagetable is modified. Normally, the device driver will
> > call hmm_mirror_unregister() when the process using the device is
> > finished. However, if the process exits uncleanly, the mm_struct
> > can be destroyed with no warning to the device driver.
> 
> The changelog doesn't tell us what the runtime effects of the bug are. 
> This makes it hard for me to answer the "did Jerome consider doing
> cc:stable" question.

There is no upstream user of this code IIRC, so does it make sense to
mark anything for stable trees?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread
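
The driver-side pattern the quoted changelog describes looks roughly like the
sketch below. It is an illustration only: the demo_* names are hypothetical,
and the ops signatures, including the release callback this patch adds, follow
the include/linux/hmm.h of this era and should be checked against the series
itself.

#include <linux/hmm.h>
#include <linux/sched.h>

/* Hypothetical per-process state a driver might keep around its mirror. */
struct demo_mirror {
        struct hmm_mirror mirror;
        /* device page tables, locks, ... */
};

/* CPU page tables changed: invalidate the matching device mappings. */
static void demo_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                            enum hmm_update_type update,
                                            unsigned long start,
                                            unsigned long end)
{
        /* tear down or update device PTEs covering [start, end) */
}

/*
 * The callback this patch is about: the mm is going away and the driver
 * never called hmm_mirror_unregister(), so stop device access now.
 */
static void demo_release(struct hmm_mirror *mirror)
{
        /* stop the device and free device page tables for this process */
}

static const struct hmm_mirror_ops demo_mirror_ops = {
        .sync_cpu_device_pagetables = demo_sync_cpu_device_pagetables,
        .release = demo_release,
};

/* Normal lifetime: register when the process starts using the device... */
static int demo_bind(struct demo_mirror *dm)
{
        dm->mirror.ops = &demo_mirror_ops;
        return hmm_mirror_register(&dm->mirror, current->mm);
}

/* ...and unregister on clean teardown. */
static void demo_unbind(struct demo_mirror *dm)
{
        hmm_mirror_unregister(&dm->mirror);
}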

* Re: [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed
  2018-03-20 11:33     ` Michal Hocko
@ 2018-03-20 14:45         ` Jerome Glisse
  0 siblings, 0 replies; 24+ messages in thread
From: Jerome Glisse @ 2018-03-20 14:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Ralph Campbell,
	Evgeny Baskakov, Mark Hairgrove, John Hubbard

On Tue, Mar 20, 2018 at 12:33:26PM +0100, Michal Hocko wrote:
> On Thu 15-03-18 15:48:29, Andrew Morton wrote:
> > On Thu, 15 Mar 2018 14:36:59 -0400 jglisse@redhat.com wrote:
> > 
> > > From: Ralph Campbell <rcampbell@nvidia.com>
> > > 
> > > The hmm_mirror_register() function registers a callback for when
> > > the CPU pagetable is modified. Normally, the device driver will
> > > call hmm_mirror_unregister() when the process using the device is
> > > finished. However, if the process exits uncleanly, the mm_struct
> > > can be destroyed with no warning to the device driver.
> > 
> > The changelog doesn't tell us what the runtime effects of the bug are. 
> > This makes it hard for me to answer the "did Jerome consider doing
> > cc:stable" question.
> 
> There is no upstream user of this code IIRC, so does it make sense to
> mark anything for stable trees?

I am fine with dropping stable; distributions that care about out-of-tree
drivers can easily backport this themselves. I am hoping to have the nouveau
part upstream in 4.18/4.19 ...

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-03-20 14:45 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-15 18:36 [PATCH 0/4] hmm: fixes and documentations jglisse
2018-03-15 18:36 ` jglisse
2018-03-15 18:36 ` [PATCH 1/4] mm/hmm: documentation editorial update to HMM documentation jglisse
2018-03-15 18:36   ` jglisse
2018-03-15 18:36 ` [PATCH 2/4] mm/hmm: fix header file if/else/endif maze jglisse
2018-03-15 18:36   ` jglisse
2018-03-16  0:11   ` Balbir Singh
2018-03-17  0:53   ` kbuild test robot
2018-03-17  0:53     ` kbuild test robot
2018-03-15 18:36 ` [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed jglisse
2018-03-15 18:36   ` jglisse
2018-03-15 22:48   ` Andrew Morton
2018-03-16  0:54     ` Jerome Glisse
2018-03-16  1:17       ` John Hubbard
2018-03-16  1:17         ` John Hubbard
2018-03-20 11:33     ` Michal Hocko
2018-03-20 14:45       ` Jerome Glisse
2018-03-20 14:45         ` Jerome Glisse
2018-03-15 18:37 ` [PATCH 4/4] mm/hmm: change CPU page table snapshot functions to simplify drivers jglisse
2018-03-15 18:37   ` jglisse
2018-03-16  5:08   ` John Hubbard
2018-03-16  5:08     ` John Hubbard
2018-03-16 19:19     ` Jerome Glisse
2018-03-16 19:19       ` Jerome Glisse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.