linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/15] hmm: fixes and documentations v4
@ 2018-03-23  0:55 jglisse
  2018-03-23  0:55 ` [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation jglisse
                   ` (14 more replies)
  0 siblings, 15 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, linux-kernel, Jérôme Glisse

From: Jérôme Glisse <jglisse@redhat.com>

Fixes and improvements to HMM only impact HMM users. Changes since the
last posting:

Allow the release callback to wait on any device driver workqueue
which is processing faults, without having to worry about deadlock.
Some drivers call migrate_vma() from their page fault workqueue, which
itself triggers calls to mmu_notifier callbacks. If we took and kept
the HMM semaphore in write mode while calling the device driver
release callback, then that callback could not wait on its page fault
workqueue, as the workqueue might end up deadlocking (waiting to take
the HMM semaphore in read mode). Avoid this by dropping the HMM
semaphore while calling the device driver release callback.
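
For illustration, a driver release callback can now safely flush the
workqueue that processes its device page faults. The sketch below is
not part of this series; struct my_mirror (embedding the hmm_mirror as
"base" and owning fault_wq) and my_release() are hypothetical driver
names:

 struct my_mirror {
 	struct hmm_mirror base;
 	struct workqueue_struct *fault_wq;	/* runs device fault handlers */
 };

 static void my_release(struct hmm_mirror *mirror)
 {
 	struct my_mirror *mine = container_of(mirror, struct my_mirror, base);

 	/*
 	 * Pending faults may call migrate_vma(), which triggers
 	 * mmu_notifier callbacks; this no longer deadlocks because HMM
 	 * drops its semaphore around this callback.
 	 */
 	flush_workqueue(mine->fault_wq);
 }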

Other changes from the previous post are solely typo fixes (in
comments, not in code), s/pfns/pfn for one function argument, and the
usual patch formatting fixes (also added review tags for the reviews
received so far).

Below are the previous cover letters (everything in them is still true):

----------------------------------------------------------------------
cover letter for v3:

Added a patch to fix the zombie mm_struct (a missing call to mmu
notifier unregister); this was lost in translation at some point.
Incorporated all typo and comment fixes received so far (and even more
typo fixes). Added more comments. Updated individual patch versions to
reflect the changes.

----------------------------------------------------------------------
cover letter for v2:

Removed a pointless VM_BUG_ON(), cc'ed stable where appropriate, and
split the last patch into _many_ smaller patches to make it easier to
review. The end result is the same modulo the comments I received so
far and the extra documentation I added while splitting things up.
Below is the previous cover letter (everything in it is still true):
----------------------------------------------------------------------
cover letter for v1:

All patches only impact HMM users; there is no implication outside HMM.

The first patch improves the documentation to better reflect what HMM
is. The second patch fixes the #if/#else placement in hmm.h. The third
patch adds a callback on mm release which helps device drivers that
use HMM clean up early when a process quits. Finally, the last patch
modifies the CPU snapshot and page fault helpers to simplify device
drivers. The nouveau patchset I posted last week already depends on
all of these patches.

You can find them in a hmm-for-4.17 branch:

git://people.freedesktop.org/~glisse/linux
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-4.17

Jérôme Glisse (13):
  mm/hmm: fix header file if/else/endif maze v2
  mm/hmm: unregister mmu_notifier when last HMM client quit v3
  mm/hmm: hmm_pfns_bad() was accessing wrong struct
  mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
    v2
  mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2
  mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to
    ulong v2
  mm/hmm: cleanup special vma handling (VM_SPECIAL)
  mm/hmm: do not differentiate between empty entry or missing directory
    v3
  mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
  mm/hmm: move hmm_pfns_clear() closer to where it is use
  mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
    v2
  mm/hmm: change hmm_vma_fault() to allow write fault on page basis
  mm/hmm: use device driver encoding for HMM pfn v2

Ralph Campbell (2):
  mm/hmm: documentation editorial update to HMM documentation
  mm/hmm: HMM should have a callback before MM is destroyed v3

 Documentation/vm/hmm.txt | 360 +++++++++++++++---------------
 MAINTAINERS              |   1 +
 include/linux/hmm.h      | 201 ++++++++++-------
 mm/hmm.c                 | 556 +++++++++++++++++++++++++++++++----------------
 4 files changed, 675 insertions(+), 443 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-04-08  3:19   ` Randy Dunlap
  2018-03-23  0:55 ` [PATCH 02/15] mm/hmm: fix header file if/else/endif maze v2 jglisse
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Stephen Bates, Jason Gunthorpe,
	Logan Gunthorpe, Evgeny Baskakov, Mark Hairgrove, John Hubbard

From: Ralph Campbell <rcampbell@nvidia.com>

This patch updates the HMM documentation to fix minor typos and
improve the phrasing so it is a bit more readable.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Stephen  Bates <sbates@raithlin.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
 MAINTAINERS              |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..e99b97003982 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non conventional memory (device
+memory like GPU on board memory) into regular kernel code path. Corner stone of
+this being specialize struct page for such memory (see sections 5 to 7 of this
+document).
+
+HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a
+device to transparently access program address coherently with the CPU meaning
+that any valid pointer on the CPU is also a valid pointer for the device. This
+is becoming a mandatory to simplify the use of advance heterogeneous computing
+where GPU, DSP, or FPGA are used to perform various computations on behalf of
+a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows lever-
+aging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 -------------------------------------------------------------------------------
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on board memory (several giga bytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not
+equal from the device point of view which complicates large programs that rely
+on a wide set of libraries.
 
-Concretly this means that code that wants to leverage device like GPU need to
+Concretly this means that code that wants to leverage devices like GPUs need to
 copy object between genericly allocated memory (malloc, mmap private/share/)
 and memory allocated through the device driver API (this still end up with an
 mmap but of the device file).
 
-For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
-complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
-data-set need to re-map all the pointer relations between each of its elements.
-This is error prone and program gets harder to debug because of the duplicate
-data-set.
+For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but
+complex data-sets (list, tree, ...) are hard to get right. Duplicating a
+complex data-set needs to re-map all the pointer relations between each of its
+elements. This is error prone and program gets harder to debug because of the
+duplicate data-set and addresses.
 
-Split address space also means that library can not transparently use data they
-are getting from core program or other library and thus each library might have
-to duplicate its input data-set using specific memory allocator. Large project
-suffer from this and waste resources because of the various memory copy.
+Split address space also means that libraries can not transparently use data
+they are getting from the core program or another library and thus each library
+might have to duplicate its input data-set using the device specific memory
+allocator. Large projects suffer from this and waste resources because of the
+various memory copies.
 
 Duplicating each library API to accept as input or output memory allocted by
 each device specific allocator is not a viable option. It would lead to a
-combinatorial explosions in the library entry points.
+combinatorial explosion in the library entry points.
 
-Finaly with the advance of high level language constructs (in C++ but in other
-language too) it is now possible for compiler to leverage GPU or other devices
-without even the programmer knowledge. Some of compiler identified patterns are
-only do-able with a share address. It is as well more reasonable to use a share
-address space for all the other patterns.
+Finally, with the advance of high level language constructs (in C++ but in
+other languages too) it is now possible for the compiler to leverage GPUs and
+other devices without programmer knowledge. Some compiler identified patterns
+are only do-able with a shared address space. It is also more reasonable to use
+a shared address space for all other patterns.
 
 
 -------------------------------------------------------------------------------
 
-2) System bus, device memory characteristics
+2) I/O bus, device memory characteristics
 
-System bus cripple share address due to few limitations. Most system bus only
+I/O buses cripple shared address due to few limitations. Most I/O buses only
 allow basic memory access from device to main memory, even cache coherency is
-often optional. Access to device memory from CPU is even more limited, most
-often than not it is not cache coherent.
+often optional. Access to device memory from CPU is even more limited. More
+often than not, it is not cache coherent.
 
-If we only consider the PCIE bus than device can access main memory (often
-through an IOMMU) and be cache coherent with the CPUs. However it only allows
-a limited set of atomic operation from device on main memory. This is worse
-in the other direction the CPUs can only access a limited range of the device
+If we only consider the PCIE bus, then a device can access main memory (often
+through an IOMMU) and be cache coherent with the CPUs. However, it only allows
+a limited set of atomic operations from device on main memory. This is worse
+in the other direction, the CPU can only access a limited range of the device
 memory and can not perform atomic operations on it. Thus device memory can not
-be consider like regular memory from kernel point of view.
+be considered the same as regular memory from the kernel point of view.
 
 Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
-and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
-The final limitation is latency, access to main memory from the device has an
-order of magnitude higher latency than when the device access its own memory.
+and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
+The final limitation is latency. Access to main memory from the device has an
+order of magnitude higher latency than when the device accesses its own memory.
 
-Some platform are developing new system bus or additions/modifications to PCIE
-to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
+Some platforms are developing new I/O buses or additions/modifications to PCIE
+to address some of these limitations (OpenCAPI, CCIX). They mainly allow two
 way cache coherency between CPU and device and allow all atomic operations the
-architecture supports. Saddly not all platform are following this trends and
-some major architecture are left without hardware solutions to those problems.
+architecture supports. Saddly, not all platforms are following this trend and
+some major architectures are left without hardware solutions to these problems.
 
-So for share address space to make sense not only we must allow device to
+So for shared address space to make sense, not only must we allow device to
 access any memory memory but we must also permit any memory to be migrated to
 device memory while device is using it (blocking CPU access while it happens).
 
 
 -------------------------------------------------------------------------------
 
-3) Share address space and migration
+3) Shared address space and migration
 
 HMM intends to provide two main features. First one is to share the address
-space by duplication the CPU page table into the device page table so same
-address point to same memory and this for any valid main memory address in
+space by duplicating the CPU page table in the device page table so the same
+address points to the same physical memory for any valid main memory address in
 the process address space.
 
-To achieve this, HMM offer a set of helpers to populate the device page table
+To achieve this, HMM offers a set of helpers to populate the device page table
 while keeping track of CPU page table updates. Device page table updates are
-not as easy as CPU page table updates. To update the device page table you must
-allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
-commands in it to perform the update (unmap, cache invalidations and flush,
-...). This can not be done through common code for all device. Hence why HMM
-provides helpers to factor out everything that can be while leaving the gory
-details to the device driver.
-
-The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
-allow to allocate a struct page for each page of the device memory. Those page
-are special because the CPU can not map them. They however allow to migrate
-main memory to device memory using exhisting migration mechanism and everything
-looks like if page was swap out to disk from CPU point of view. Using a struct
-page gives the easiest and cleanest integration with existing mm mechanisms.
-Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
-for the device memory and second to perform migration. Policy decision of what
-and when to migrate things is left to the device driver.
-
-Note that any CPU access to a device page trigger a page fault and a migration
-back to main memory ie when a page backing an given address A is migrated from
-a main memory page to a device page then any CPU access to address A trigger a
-page fault and initiate a migration back to main memory.
-
-
-With this two features, HMM not only allow a device to mirror a process address
-space and keeps both CPU and device page table synchronize, but also allow to
-leverage device memory by migrating part of data-set that is actively use by a
-device.
+not as easy as CPU page table updates. To update the device page table, you must
+allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
+specific commands in it to perform the update (unmap, cache invalidations, and
+flush, ...). This can not be done through common code for all devices. Hence
+why HMM provides helpers to factor out everything that can be while leaving the
+hardware specific details to the device driver.
+
+The second mechanism HMM provides, is a new kind of ZONE_DEVICE memory that
+allows allocating a struct page for each page of the device memory. Those pages
+are special because the CPU can not map them. However, they allow migrating
+main memory to device memory using existing migration mechanisms and everything
+looks like a page is swapped out to disk from the CPU point of view. Using a
+struct page gives the easiest and cleanest integration with existing mm mech-
+anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
+memory for the device memory and second to perform migration. Policy decisions
+of what and when to migrate things is left to the device driver.
+
+Note that any CPU access to a device page triggers a page fault and a migration
+back to main memory. For example, when a page backing a given CPU address A is
+migrated from a main memory page to a device page, then any CPU access to
+address A triggers a page fault and initiates a migration back to main memory.
+
+With these two features, HMM not only allows a device to mirror process address
+space and keeping both CPU and device page table synchronized, but also lever-
+ages device memory by migrating the part of the data-set that is actively being
+used by the device.
 
 
 -------------------------------------------------------------------------------
 
 4) Address space mirroring implementation and API
 
-Address space mirroring main objective is to allow to duplicate range of CPU
-page table into a device page table and HMM helps keeping both synchronize. A
+Address space mirroring's main objective is to allow duplication of a range of
+CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that want to mirror a process address space must start with the
 registration of an hmm_mirror struct:
 
@@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
                                 struct mm_struct *mm);
 
 The locked variant is to be use when the driver is already holding the mmap_sem
-of the mm in write mode. The mirror struct has a set of callback that are use
-to propagate CPU page table:
+of the mm in write mode. The mirror struct has a set of callbacks that are used
+to propagate CPU page tables:
 
  struct hmm_mirror_ops {
      /* sync_cpu_device_pagetables() - synchronize page tables
@@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
                      unsigned long end);
  };
 
-Device driver must perform update to the range following action (turn range
-read only, or fully unmap, ...). Once driver callback returns the device must
-be done with the update.
+The device driver must perform the update action to the range (mark range
+read only, or fully unmap, ...). The device must be done with the update before
+the driver callback returns.
 
 
-When device driver wants to populate a range of virtual address it can use
-either:
+When the device driver wants to populate a range of virtual addresses, it can
+use either:
  int hmm_vma_get_pfns(struct vm_area_struct *vma,
                       struct hmm_range *range,
                       unsigned long start,
@@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
                    bool write,
                    bool block);
 
-First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
-will not trigger a page fault on missing or non present entry. The second one
-do trigger page fault on missing or read only entry if write parameter is true.
-Page fault use the generic mm page fault code path just like a CPU page fault.
+The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
+entries and will not trigger a page fault on missing or non present entries.
+The second one does trigger a page fault on missing or read only entry if the
+write parameter is true. Page faults use the generic mm page fault code path
+just like a CPU page fault.
 
-Both function copy CPU page table into their pfns array argument. Each entry in
-that array correspond to an address in the virtual range. HMM provide a set of
-flags to help driver identify special CPU page table entries.
+Both functions copy CPU page table entries into their pfns array argument. Each
+entry in that array corresponds to an address in the virtual range. HMM
+provides a set of flags to help the driver identify special CPU page table
+entries.
 
 Locking with the update() callback is the most important aspect the driver must
-respect in order to keep things properly synchronize. The usage pattern is :
+respect in order to keep things properly synchronized. The usage pattern is:
 
  int driver_populate_range(...)
  {
@@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
       return 0;
  }
 
-The driver->update lock is the same lock that driver takes inside its update()
-callback. That lock must be call before hmm_vma_range_done() to avoid any race
-with a concurrent CPU page table update.
+The driver->update lock is the same lock that the driver takes inside its
+update() callback. That lock must be held before hmm_vma_range_done() to avoid
+any race with a concurrent CPU page table update.
 
-HMM implements all this on top of the mmu_notifier API because we wanted to a
-simpler API and also to be able to perform optimization latter own like doing
-concurrent device update in multi-devices scenario.
+HMM implements all this on top of the mmu_notifier API because we wanted a
+simpler API and also to be able to perform optimizations latter on like doing
+concurrent device updates in multi-devices scenario.
 
-HMM also serve as an impedence missmatch between how CPU page table update are
-done (by CPU write to the page table and TLB flushes) from how device update
-their own page table. Device update is a multi-step process, first appropriate
-commands are write to a buffer, then this buffer is schedule for execution on
-the device. It is only once the device has executed commands in the buffer that
-the update is done. Creating and scheduling update command buffer can happen
-concurrently for multiple devices. Waiting for each device to report commands
-as executed is serialize (there is no point in doing this concurrently).
+HMM also serves as an impedence mismatch between how CPU page table updates
+are done (by CPU write to the page table and TLB flushes) and how devices
+update their own page table. Device updates are a multi-step process. First,
+appropriate commands are writen to a buffer, then this buffer is scheduled for
+execution on the device. It is only once the device has executed commands in
+the buffer that the update is done. Creating and scheduling the update command
+buffer can happen concurrently for multiple devices. Waiting for each device to
+report commands as executed is serialized (there is no point in doing this
+concurrently).
 
 
 -------------------------------------------------------------------------------
 
 5) Represent and manage device memory from core kernel point of view
 
-Several differents design were try to support device memory. First one use
-device specific data structure to keep information about migrated memory and
-HMM hooked itself in various place of mm code to handle any access to address
-that were back by device memory. It turns out that this ended up replicating
-most of the fields of struct page and also needed many kernel code path to be
-updated to understand this new kind of memory.
+Several different designs were tried to support device memory. First one used
+a device specific data structure to keep information about migrated memory and
+HMM hooked itself in various places of mm code to handle any access to
+addresses that were backed by device memory. It turns out that this ended up
+replicating most of the fields of struct page and also needed many kernel code
+paths to be updated to understand this new kind of memory.
 
-Thing is most kernel code path never try to access the memory behind a page
-but only care about struct page contents. Because of this HMM switchted to
-directly using struct page for device memory which left most kernel code path
-un-aware of the difference. We only need to make sure that no one ever try to
-map those page from the CPU side.
+Most kernel code paths never try to access the memory behind a page
+but only care about struct page contents. Because of this, HMM switched to
+directly using struct page for device memory which left most kernel code paths
+unaware of the difference. We only need to make sure that no one ever tries to
+map those pages from the CPU side.
 
-HMM provide a set of helpers to register and hotplug device memory as a new
-region needing struct page. This is offer through a very simple API:
+HMM provides a set of helpers to register and hotplug device memory as a new
+region needing a struct page. This is offered through a very simple API:
 
  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                    struct device *device,
@@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
  };
 
 The first callback (free()) happens when the last reference on a device page is
-drop. This means the device page is now free and no longer use by anyone. The
-second callback happens whenever CPU try to access a device page which it can
-not do. This second callback must trigger a migration back to system memory.
+dropped. This means the device page is now free and no longer used by anyone.
+The second callback happens whenever the CPU tries to access a device page
+which it can not do. This second callback must trigger a migration back to
+system memory.
 
 
 -------------------------------------------------------------------------------
 
-6) Migrate to and from device memory
+6) Migration to and from device memory
 
-Because CPU can not access device memory, migration must use device DMA engine
-to perform copy from and to device memory. For this we need a new migration
-helper:
+Because the CPU can not access device memory, migration must use the device DMA
+engine to perform copy from and to device memory. For this we need a new
+migration helper:
 
  int migrate_vma(const struct migrate_vma_ops *ops,
                  struct vm_area_struct *vma,
@@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
                  unsigned long *dst,
                  void *private);
 
-Unlike other migration function it works on a range of virtual address, there
-is two reasons for that. First device DMA copy has a high setup overhead cost
+Unlike other migration functions it works on a range of virtual address, there
+are two reasons for that. First, device DMA copy has a high setup overhead cost
 and thus batching multiple pages is needed as otherwise the migration overhead
-make the whole excersie pointless. The second reason is because driver trigger
-such migration base on range of address the device is actively accessing.
+makes the whole exersize pointless. The second reason is because the
+migration might be for a range of addresses the device is actively accessing.
 
-The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
-control destination memory allocation and copy operation. Second one is there
-to allow device driver to perform cleanup operation after migration.
+The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
+controls destination memory allocation and copy operation. Second one is there
+to allow the device driver to perform cleanup operations after migration.
 
  struct migrate_vma_ops {
      void (*alloc_and_copy)(struct vm_area_struct *vma,
@@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
                               void *private);
  };
 
-It is important to stress that this migration helpers allow for hole in the
+It is important to stress that these migration helpers allow for holes in the
 virtual address range. Some pages in the range might not be migrated for all
-the usual reasons (page is pin, page is lock, ...). This helper does not fail
-but just skip over those pages.
+the usual reasons (page is pinned, page is locked, ...). This helper does not
+fail but just skips over those pages.
 
-The alloc_and_copy() might as well decide to not migrate all pages in the
-range (for reasons under the callback control). For those the callback just
-have to leave the corresponding dst entry empty.
+The alloc_and_copy() might decide to not migrate all pages in the
+range (for reasons under the callback control). For those, the callback just
+has to leave the corresponding dst entry empty.
 
-Finaly the migration of the struct page might fails (for file back page) for
+Finally, the migration of the struct page might fail (for file backed page) for
 various reasons (failure to freeze reference, or update page cache, ...). If
-that happens then the finalize_and_map() can catch any pages that was not
-migrated. Note those page were still copied to new page and thus we wasted
+that happens, then the finalize_and_map() can catch any pages that were not
+migrated. Note those pages were still copied to a new page and thus we wasted
 bandwidth but this is considered as a rare event and a price that we are
 willing to pay to keep all the code simpler.
 
@@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
 7) Memory cgroup (memcg) and rss accounting
 
 For now device memory is accounted as any regular page in rss counters (either
-anonymous if device page is use for anonymous, file if device page is use for
-file back page or shmem if device page is use for share memory). This is a
-deliberate choice to keep existing application that might start using device
-memory without knowing about it to keep runing unimpacted.
-
-Drawbacks is that OOM killer might kill an application using a lot of device
-memory and not a lot of regular system memory and thus not freeing much system
-memory. We want to gather more real world experience on how application and
-system react under memory pressure in the presence of device memory before
+anonymous if device page is used for anonymous, file if device page is used for
+file backed page or shmem if device page is used for shared memory). This is a
+deliberate choice to keep existing applications, that might start using device
+memory without knowing about it, running unimpacted.
+
+A Drawback is that the OOM killer might kill an application using a lot of
+device memory and not a lot of regular system memory and thus not freeing much
+system memory. We want to gather more real world experience on how applications
+and system react under memory pressure in the presence of device memory before
 deciding to account device memory differently.
 
 
-Same decision was made for memory cgroup. Device memory page are accounted
+Same decision was made for memory cgroup. Device memory pages are accounted
 against same memory cgroup a regular page would be accounted to. This does
 simplify migration to and from device memory. This also means that migration
 back from device memory to regular memory can not fail because it would
 go above memory cgroup limit. We might revisit this choice latter on once we
-get more experience in how device memory is use and its impact on memory
+get more experience in how device memory is used and its impact on memory
 resource control.
 
 
-Note that device memory can never be pin nor by device driver nor through GUP
+Note that device memory can never be pinned by device driver nor through GUP
 and thus such memory is always free upon process exit. Or when last reference
-is drop in case of share memory or file back memory.
+is dropped in case of shared memory or file backed memory.
diff --git a/MAINTAINERS b/MAINTAINERS
index 8be7d1382ce9..4db4d4a820b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6466,6 +6466,7 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	Documentation/vm/hmm.txt
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 02/15] mm/hmm: fix header file if/else/endif maze v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
  2018-03-23  0:55 ` [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3 jglisse
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Ralph Campbell, John Hubbard, Evgeny Baskakov

From: Jérôme Glisse <jglisse@redhat.com>

The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong. Because of
this, after multiple includes there were multiple definitions of both
hmm_mm_init() and hmm_mm_destroy(), leading to a build failure if HMM
was enabled (CONFIG_HMM set).
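
For reference, the tail of include/linux/hmm.h after the diff below is
applied looks roughly like this:

 #if IS_ENABLED(CONFIG_HMM)
 ...
 /* Below are for HMM internal use only! Not to be used by device driver! */
 void hmm_mm_destroy(struct mm_struct *mm);

 static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
 #else /* IS_ENABLED(CONFIG_HMM) */
 static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM) */
 #endif /* LINUX_HMM_H */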

Changed since v1:
  - Fix the maze when CONFIG_HMM is disabled, not just when it is
    enabled. This fixes the bot-reported build failure.
  - Improved commit message.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
---
 include/linux/hmm.h | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 325017ad9311..36dd21fe5caf 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -498,23 +498,16 @@ struct hmm_device {
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-#endif /* IS_ENABLED(CONFIG_HMM) */
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
-#if IS_ENABLED(CONFIG_HMM_MIRROR)
 void hmm_mm_destroy(struct mm_struct *mm);
 
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
-#else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
-static inline void hmm_mm_init(struct mm_struct *mm) {}
-#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-
-
 #else /* IS_ENABLED(CONFIG_HMM) */
 static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
+#endif /* IS_ENABLED(CONFIG_HMM) */
 #endif /* LINUX_HMM_H */
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
  2018-03-23  0:55 ` [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation jglisse
  2018-03-23  0:55 ` [PATCH 02/15] mm/hmm: fix header file if/else/endif maze v2 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v3 jglisse
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove

From: Ralph Campbell <rcampbell@nvidia.com>

The hmm_mirror_register() function registers a callback for when
the CPU pagetable is modified. Normally, the device driver will
call hmm_mirror_unregister() when the process using the device is
finished. However, if the process exits uncleanly, the mm_struct
can be destroyed with no warning to the device driver.
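
A driver opting in wires the new callback into its mirror ops before
registering; my_release() and my_sync_cpu_device_pagetables() below
are hypothetical driver functions, the rest is the existing API:

 static const struct hmm_mirror_ops my_mirror_ops = {
 	.release                    = my_release,
 	.sync_cpu_device_pagetables = my_sync_cpu_device_pagetables,
 };

 	/* at device/process bind time */
 	mirror->ops = &my_mirror_ops;
 	ret = hmm_mirror_register(mirror, mm);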

Changed since v1:
  - dropped VM_BUG_ON()
  - cc stable
Changed since v2:
  - drop stable
  - Split the list removal and the call to the driver release callback.
    This allows the release callback to wait on any pending fault
    handler without deadlocking.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 10 ++++++++++
 mm/hmm.c            | 29 ++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 36dd21fe5caf..fa7b51f65905 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,6 +218,16 @@ enum hmm_update_type {
  * @update: callback to update range on a device
  */
 struct hmm_mirror_ops {
+	/* release() - release hmm_mirror
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 *
+	 * This is called when the mm_struct is being released.
+	 * The callback should make sure no references to the mirror occur
+	 * after the callback returns.
+	 */
+	void (*release)(struct hmm_mirror *mirror);
+
 	/* sync_cpu_device_pagetables() - synchronize page tables
 	 *
 	 * @mirror: pointer to struct hmm_mirror
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..8116727766f7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -160,6 +160,32 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	up_read(&hmm->mirrors_sem);
 }
 
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct hmm_mirror *mirror;
+	struct hmm *hmm = mm->hmm;
+
+	down_write(&hmm->mirrors_sem);
+	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
+					  list);
+	while (mirror) {
+		list_del_init(&mirror->list);
+		if (mirror->ops->release) {
+			/*
+			 * Drop mirrors_sem so callback can wait on any pending
+			 * work that might itself trigger mmu_notifier callback
+			 * and thus would deadlock with us.
+			 */
+			up_write(&hmm->mirrors_sem);
+			mirror->ops->release(mirror);
+			down_write(&hmm->mirrors_sem);
+		}
+		mirror = list_first_entry_or_null(&hmm->mirrors,
+						  struct hmm_mirror, list);
+	}
+	up_write(&hmm->mirrors_sem);
+}
+
 static void hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
@@ -185,6 +211,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.release		= hmm_release,
 	.invalidate_range_start	= hmm_invalidate_range_start,
 	.invalidate_range_end	= hmm_invalidate_range_end,
 };
@@ -230,7 +257,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del(&mirror->list);
+	list_del_init(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v3
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (2 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 05/15] mm/hmm: hmm_pfns_bad() was accessing wrong struct jglisse
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This code was lost in translation at one point. This patch properly
calls mmu_notifier_unregister_no_release() once the last user is gone.
This fixes the zombie mm_struct, as without this patch we do not drop
the refcount we hold on it.

Changed since v1:
  - close the race window between a last mirror unregistering and a
    new mirror registering, which could have led to a use-after-free
    kind of bug
Changed since v2:
  - Avoid an issue when there are multiple calls to
    hmm_mirror_register() (which led to a use after free).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 8116727766f7..2d00769e8985 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -233,13 +233,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
 
+again:
 	mirror->hmm = hmm_register(mm);
 	if (!mirror->hmm)
 		return -ENOMEM;
 
 	down_write(&mirror->hmm->mirrors_sem);
-	list_add(&mirror->list, &mirror->hmm->mirrors);
-	up_write(&mirror->hmm->mirrors_sem);
+	if (mirror->hmm->mm == NULL) {
+		/*
+		 * A racing hmm_mirror_unregister() is about to destroy the hmm
+		 * struct. Try again to allocate a new one.
+		 */
+		up_write(&mirror->hmm->mirrors_sem);
+		mirror->hmm = NULL;
+		goto again;
+	} else {
+		list_add(&mirror->list, &mirror->hmm->mirrors);
+		up_write(&mirror->hmm->mirrors_sem);
+	}
 
 	return 0;
 }
@@ -254,11 +265,32 @@ EXPORT_SYMBOL(hmm_mirror_register);
  */
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
-	struct hmm *hmm = mirror->hmm;
+	bool should_unregister = false;
+	struct mm_struct *mm;
+	struct hmm *hmm;
+
+	if (mirror->hmm == NULL)
+		return;
 
+	hmm = mirror->hmm;
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
+	should_unregister = list_empty(&hmm->mirrors);
+	mirror->hmm = NULL;
+	mm = hmm->mm;
+	hmm->mm = NULL;
 	up_write(&hmm->mirrors_sem);
+
+	if (!should_unregister || mm == NULL)
+		return;
+
+	spin_lock(&mm->page_table_lock);
+	if (mm->hmm == hmm)
+		mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
+	kfree(hmm);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 05/15] mm/hmm: hmm_pfns_bad() was accessing wrong struct
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (3 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v3 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 06/15] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters v2 jglisse
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse, stable,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

The private field of the mm_walk struct points to an hmm_vma_walk
struct and not to the desired hmm_range struct. Fix this to get the
proper struct pointer.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: stable@vger.kernel.org
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 mm/hmm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 2d00769e8985..812a66997627 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -336,7 +336,8 @@ static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
 {
-	struct hmm_range *range = walk->private;
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long i;
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 06/15] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (4 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 05/15] mm/hmm: hmm_pfns_bad() was accessing wrong struct jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 07/15] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2 jglisse
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Both hmm_vma_fault() and hmm_vma_get_pfns() were taking an hmm_range
struct as a parameter and initializing that struct with their other
parameters. Have the callers of those functions do this instead, as
they likely already do, and only pass the struct to both functions.
This shortens the function signatures and makes it easier to add new
parameters in the future by simply adding them to the structure.
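
A caller now looks roughly like this (sketch only; pfns is still a
caller-allocated array of hmm_pfn_t and mmap_sem is held as before):

 	struct hmm_range range = {
 		.vma   = vma,
 		.start = start,
 		.end   = end,
 		.pfns  = pfns,
 	};

 	ret = hmm_vma_fault(&range, write, block);
 	/* or: ret = hmm_vma_get_pfns(&range); */
 	...
 	if (!hmm_vma_range_done(&range))
 		goto again;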

Changed since v1:
  - Improved comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 18 ++++---------
 mm/hmm.c            | 78 +++++++++++++++++++----------------------------------
 2 files changed, 33 insertions(+), 63 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index fa7b51f65905..d0d6760cdada 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -274,6 +274,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @vma: the vm area struct for the range
  * @list: all range lock are on a list
  * @start: range virtual start address (inclusive)
  * @end: range virtual end address (exclusive)
@@ -281,6 +282,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+	struct vm_area_struct	*vma;
 	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
@@ -301,12 +303,8 @@ struct hmm_range {
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns);
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+int hmm_vma_get_pfns(struct hmm_range *range);
+bool hmm_vma_range_done(struct hmm_range *range);
 
 
 /*
@@ -327,13 +325,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block);
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 812a66997627..fc5057d7aa05 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -533,11 +533,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
- * @vma: virtual memory area containing the virtual address range
- * @range: used to track snapshot validity
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
+ * @range: range being snapshotted
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
@@ -551,26 +547,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
  * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
  * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-		     struct hmm_range *range,
-		     unsigned long start,
-		     unsigned long end,
-		     hmm_pfn_t *pfns)
+int hmm_vma_get_pfns(struct hmm_range *range)
 {
+	struct vm_area_struct *vma = range->vma;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range->pfns, range->start, range->end);
 		return -EINVAL;
 	}
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
@@ -581,9 +574,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -601,14 +591,13 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	mm_walk.pmd_entry = hmm_vma_walk_pmd;
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
-	walk_page_range(start, end, &mm_walk);
+	walk_page_range(range->start, range->end, &mm_walk);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
 
 /*
  * hmm_vma_range_done() - stop tracking change to CPU page table over a range
- * @vma: virtual memory area containing the virtual address range
  * @range: range being tracked
  * Returns: false if range data has been invalidated, true otherwise
  *
@@ -628,10 +617,10 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *
  * There are two ways to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
- *   if (!hmm_vma_range_done(vma, range)) {
+ *   if (!hmm_vma_range_done(range)) {
  *     device_page_table_unlock();
  *     goto again;
  *   }
@@ -639,13 +628,13 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
+ *   hmm_vma_get_pfns(range); or hmm_vma_fault(...);
  *   device_page_table_lock();
- *   hmm_vma_range_done(vma, range);
- *   device_update_page_table(pfns);
+ *   hmm_vma_range_done(range);
+ *   device_update_page_table(range->pfns);
  *   device_page_table_unlock();
  */
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+bool hmm_vma_range_done(struct hmm_range *range)
 {
 	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
 	struct hmm *hmm;
@@ -655,7 +644,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 		return false;
 	}
 
-	hmm = hmm_register(vma->vm_mm);
+	hmm = hmm_register(range->vma->vm_mm);
 	if (!hmm) {
 		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
 		return false;
@@ -671,11 +660,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
- * @vma: virtual memory area containing the virtual address range
- * @range: use to track pfns array content validity
- * @start: fault range virtual start address (inclusive)
- * @end: fault range virtual end address (exclusive)
- * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @range: range being faulted
  * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
@@ -691,10 +676,10 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   down_read(&mm->mmap_sem);
  *   // Find vma and address device wants to fault, initialize hmm_pfn_t
  *   // array accordingly
- *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   ret = hmm_vma_fault(range, write, block);
  *   switch (ret) {
  *   case -EAGAIN:
- *     hmm_vma_range_done(vma, range);
+ *     hmm_vma_range_done(range);
  *     // You might want to rate limit or yield to play nicely, you may
  *     // also commit any valid pfn in the array assuming that you are
  *     // getting true from hmm_vma_range_monitor_end()
@@ -708,7 +693,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *   }
  *   // Take device driver lock that serialize device page table update
  *   driver_lock_device_page_table_update();
- *   hmm_vma_range_done(vma, range);
+ *   hmm_vma_range_done(range);
  *   // Commit pfns we got from hmm_vma_fault()
  *   driver_unlock_device_page_table_update();
  *   up_read(&mm->mmap_sem)
@@ -718,28 +703,24 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
-		  struct hmm_range *range,
-		  unsigned long start,
-		  unsigned long end,
-		  hmm_pfn_t *pfns,
-		  bool write,
-		  bool block)
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 {
+	struct vm_area_struct *vma = range->vma;
+	unsigned long start = range->start;
 	struct hmm_vma_walk hmm_vma_walk;
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 	int ret;
 
 	/* Sanity check, this really should not happen ! */
-	if (start < vma->vm_start || start >= vma->vm_end)
+	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
-	if (end < vma->vm_start || end > vma->vm_end)
+	if (range->end < vma->vm_start || range->end > vma->vm_end)
 		return -EINVAL;
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(pfns, start, end);
+		hmm_pfns_clear(range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -747,9 +728,6 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	/* Initialize range to track CPU page table update */
-	range->start = start;
-	range->pfns = pfns;
-	range->end = end;
 	spin_lock(&hmm->lock);
 	range->valid = true;
 	list_add_rcu(&range->list, &hmm->ranges);
@@ -757,7 +735,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(pfns, start, end);
+		hmm_pfns_special(range->pfns, range->start, range->end);
 		return 0;
 	}
 
@@ -777,7 +755,7 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 	mm_walk.pte_hole = hmm_vma_walk_hole;
 
 	do {
-		ret = walk_page_range(start, end, &mm_walk);
+		ret = walk_page_range(start, range->end, &mm_walk);
 		start = hmm_vma_walk.last;
 	} while (ret == -EAGAIN);
 
@@ -785,8 +763,8 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&pfns[i], hmm_vma_walk.last, end);
-		hmm_vma_range_done(vma, range);
+		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_vma_range_done(range);
 	}
 	return ret;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 07/15] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (5 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 06/15] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters v2 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 08/15] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong v2 jglisse
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Only peculiar architectures allow write without read, so assume that
any valid pfn also allows read. Note that we do not care about
write-only, because operations such as atomic compare-and-exchange (or
any other operation that returns the memory value) let you read the
memory through them anyway.
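
As an illustration only (not part of this patch), a driver consuming
the snapshot can now treat any valid entry as readable; i is a
driver-side loop index and my_device_map_page() is a hypothetical
driver helper:

  hmm_pfn_t entry = range->pfns[i];

  if (entry & HMM_PFN_VALID) {
          /* HMM_PFN_VALID implies at least read permission */
          bool writable = entry & HMM_PFN_WRITE;

          my_device_map_page(entry, writable);
  }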

Changed since v1:
  - Fail early on vma without read permission and return an error
  - Improve comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 16 +++++++---------
 mm/hmm.c            | 44 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index d0d6760cdada..dd907f614dfe 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -83,8 +83,7 @@ struct hmm;
  * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
  *
  * Flags:
- * HMM_PFN_VALID: pfn is valid
- * HMM_PFN_READ:  CPU page table has read permission set
+ * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
@@ -97,13 +96,12 @@ struct hmm;
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_READ (1 << 1)
-#define HMM_PFN_WRITE (1 << 2)
-#define HMM_PFN_ERROR (1 << 3)
-#define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_SPECIAL (1 << 5)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
-#define HMM_PFN_SHIFT 7
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_ERROR (1 << 2)
+#define HMM_PFN_EMPTY (1 << 3)
+#define HMM_PFN_SPECIAL (1 << 4)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
+#define HMM_PFN_SHIFT 6
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
diff --git a/mm/hmm.c b/mm/hmm.c
index fc5057d7aa05..5da0f852a7aa 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -417,11 +417,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	hmm_pfn_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
-	hmm_pfn_t flag;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
 	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
@@ -433,6 +431,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
+		hmm_pfn_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -497,7 +496,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 				} else if (write_fault)
 					goto fault;
 				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-				pfns[i] |= flag;
 			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
@@ -517,7 +515,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -534,7 +532,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshotted
- * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, -EPERM invalid
+ *          vma permission, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
  * validity is tracked by range struct. See hmm_vma_range_done() for further
@@ -573,6 +572,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If vma do not allow read access, then assume that it does
+		 * not allow write access, either. Architecture that allow
+		 * write without read access are not supported by HMM, because
+		 * operations such has atomic access would not work.
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return -EPERM;
+	}
+
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
@@ -686,6 +696,9 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *     goto retry;
  *   case 0:
  *     break;
+ *   case -ENOMEM:
+ *   case -EINVAL:
+ *   case -EPERM:
  *   default:
  *     // Handle error !
  *     up_read(&mm->mmap_sem)
@@ -727,11 +740,16 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
-	/* Initialize range to track CPU page table update */
-	spin_lock(&hmm->lock);
-	range->valid = true;
-	list_add_rcu(&range->list, &hmm->ranges);
-	spin_unlock(&hmm->lock);
+	if (!(vma->vm_flags & VM_READ)) {
+		/*
+		 * If vma do not allow read access, then assume that it does
+		 * not allow write access, either. Architecture that allow
+		 * write without read access are not supported by HMM, because
+		 * operations such has atomic access would not work.
+		 */
+		hmm_pfns_clear(range->pfns, range->start, range->end);
+		return -EPERM;
+	}
 
 	/* FIXME support hugetlb fs */
 	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
@@ -739,6 +757,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 		return 0;
 	}
 
+	/* Initialize range to track CPU page table update */
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
 	hmm_vma_walk.fault = true;
 	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 08/15] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (6 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 07/15] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 09/15] mm/hmm: cleanup special vma handling (VM_SPECIAL) jglisse
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

All device drivers we care about use 64-bit page table entries. To
match this, and to avoid a useless typedef, convert all HMM pfns to
plain uint64_t. This is a first step toward letting drivers directly
use the pfn values returned by HMM (saving the memory and CPU cycles
spent converting between the two).
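
As an illustration only (not part of this patch), a driver now simply
allocates a plain uint64_t array for the range; npages is assumed to be
computed by the driver:

  uint64_t *pfns;

  pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
  if (!pfns)
          return -ENOMEM;
  range->pfns = pfns;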

Changed since v1:
  - Fix comments typos

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 46 +++++++++++++++++++++-------------------------
 mm/hmm.c            | 26 +++++++++++++-------------
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index dd907f614dfe..54d684fe3b90 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,8 +80,6 @@
 struct hmm;
 
 /*
- * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
- *
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
@@ -93,8 +91,6 @@ struct hmm;
  *      set and the pfn value is undefined.
  * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
-typedef unsigned long hmm_pfn_t;
-
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
@@ -104,14 +100,14 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_SHIFT 6
 
 /*
- * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
- * @pfn: hmm_pfn_t to convert to struct page
- * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @pfn: HMM pfn value to get corresponding struct page from
+ * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
- * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
- * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ * If the HMM pfn is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+static inline struct page *hmm_pfn_to_page(uint64_t pfn)
 {
 	if (!(pfn & HMM_PFN_VALID))
 		return NULL;
@@ -119,11 +115,11 @@ static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
 }
 
 /*
- * hmm_pfn_t_to_pfn() - return pfn value store in a hmm_pfn_t
- * @pfn: hmm_pfn_t to extract pfn from
- * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
+ * @pfn: HMM pfn value to extract pfn from
+ * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
 {
 	if (!(pfn & HMM_PFN_VALID))
 		return -1UL;
@@ -131,21 +127,21 @@ static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
 }
 
 /*
- * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
- * @page: struct page pointer for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the page
+ * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @page: struct page pointer for which to create the HMM pfn
+ * Returns: valid HMM pfn for the page
  */
-static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(struct page *page)
 {
 	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
 
 /*
- * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
- * @pfn: pfn value for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the pfn
+ * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @pfn: pfn value for which to create the HMM pfn
+ * Returns: valid HMM pfn for the pfn
  */
-static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
 {
 	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
@@ -284,7 +280,7 @@ struct hmm_range {
 	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
-	hmm_pfn_t		*pfns;
+	uint64_t		*pfns;
 	bool			valid;
 };
 
@@ -307,7 +303,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
 
 /*
  * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The hmm_pfn_t array will
+ * not migrate any device memory back to system memory. The HMM pfn array will
  * be updated with the fault result and current snapshot of the CPU page table
  * for the range.
  *
@@ -316,7 +312,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
  * function returns -EAGAIN.
  *
  * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the hmm_pfn_t array to
+ * address or not. Therefore, the caller must to inspect the HMM pfn array to
  * determine fault status for each address.
  *
  * Trying to fault inside an invalid vma will result in -EINVAL.
diff --git a/mm/hmm.c b/mm/hmm.c
index 5da0f852a7aa..b69f30fc064b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -304,7 +304,7 @@ struct hmm_vma_walk {
 
 static int hmm_vma_do_fault(struct mm_walk *walk,
 			    unsigned long addr,
-			    hmm_pfn_t *pfn)
+			    uint64_t *pfn)
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
@@ -324,7 +324,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(hmm_pfn_t *pfns,
+static void hmm_pfns_special(uint64_t *pfns,
 			     unsigned long addr,
 			     unsigned long end)
 {
@@ -338,7 +338,7 @@ static int hmm_pfns_bad(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
@@ -348,7 +348,7 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
-static void hmm_pfns_clear(hmm_pfn_t *pfns,
+static void hmm_pfns_clear(uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
@@ -362,7 +362,7 @@ static int hmm_vma_walk_hole(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
@@ -387,7 +387,7 @@ static int hmm_vma_walk_clear(unsigned long addr,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	hmm_vma_walk->last = addr;
@@ -414,7 +414,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
-	hmm_pfn_t *pfns = range->pfns;
+	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
 	bool write_fault;
 	pte_t *ptep;
@@ -431,7 +431,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
 		unsigned long pfn;
-		hmm_pfn_t flag = 0;
+		uint64_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -456,7 +456,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_t_from_pfn(pfn) | flag;
+			pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
 		return 0;
 	}
 
@@ -490,7 +490,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			 * device and report anything else as error.
 			 */
 			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_t_from_pfn(swp_offset(entry));
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_private_entry(entry)) {
 					pfns[i] |= HMM_PFN_WRITE;
 				} else if (write_fault)
@@ -515,7 +515,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (write_fault && !pte_write(pte))
 			goto fault;
 
-		pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
+		pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte));
 		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
 		continue;
 
@@ -678,8 +678,8 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  * This is similar to a regular CPU page fault except that it will not trigger
  * any memory migration if the memory being faulted is not accessible by CPUs.
  *
- * On error, for one virtual address in the range, the function will set the
- * hmm_pfn_t error flag for the corresponding pfn entry.
+ * On error, for one virtual address in the range, the function will mark the
+ * corresponding HMM pfn entry with an error flag.
  *
  * Expected use pattern:
  * retry:
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 09/15] mm/hmm: cleanup special vma handling (VM_SPECIAL)
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (7 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 08/15] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong v2 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v3 jglisse
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

A special vma (one with any of the VM_SPECIAL flags) can not be
accessed by a device because there is no consistent model across device
drivers for those vmas and their backing memory.

This patch passes the hmm_range struct directly as the argument of
hmm_pfns_special(), as the helper always affects the whole vma and thus
the whole range.

It also makes the behavior consistent: after this patch both
hmm_vma_fault() and hmm_vma_get_pfns() return -EINVAL when facing such
a vma. Previously hmm_vma_fault() returned 0 while hmm_vma_get_pfns()
returned -EINVAL, but both filled the HMM pfn array with the special
entry.
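
As an illustration only (not part of this patch), a caller can now
treat both helpers uniformly; the error handling below is a sketch:

  ret = hmm_vma_get_pfns(range);
  if (ret == -EINVAL) {
          /* hugetlbfs or VM_SPECIAL vma (or bad range): can not mirror */
          return ret;
  }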

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 mm/hmm.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index b69f30fc064b..a93c1e35df91 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -324,14 +324,6 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	return -EAGAIN;
 }
 
-static void hmm_pfns_special(uint64_t *pfns,
-			     unsigned long addr,
-			     unsigned long end)
-{
-	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = HMM_PFN_SPECIAL;
-}
-
 static int hmm_pfns_bad(unsigned long addr,
 			unsigned long end,
 			struct mm_walk *walk)
@@ -529,6 +521,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static void hmm_pfns_special(struct hmm_range *range)
+{
+	unsigned long addr = range->start, i = 0;
+
+	for (; addr < range->end; addr += PAGE_SIZE, i++)
+		range->pfns[i] = HMM_PFN_SPECIAL;
+}
+
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshotted
@@ -553,12 +553,6 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	struct mm_walk mm_walk;
 	struct hmm *hmm;
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(range->pfns, range->start, range->end);
-		return -EINVAL;
-	}
-
 	/* Sanity check, this really should not happen ! */
 	if (range->start < vma->vm_start || range->start >= vma->vm_end)
 		return -EINVAL;
@@ -572,6 +566,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(range);
+		return -EINVAL;
+	}
+
 	if (!(vma->vm_flags & VM_READ)) {
 		/*
 		 * If vma do not allow read access, then assume that it does
@@ -740,6 +740,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	if (!hmm->mmu_notifier.ops)
 		return -EINVAL;
 
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(range);
+		return -EINVAL;
+	}
+
 	if (!(vma->vm_flags & VM_READ)) {
 		/*
 		 * If vma do not allow read access, then assume that it does
@@ -751,12 +757,6 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 		return -EPERM;
 	}
 
-	/* FIXME support hugetlb fs */
-	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-		hmm_pfns_special(range->pfns, range->start, range->end);
-		return 0;
-	}
-
 	/* Initialize range to track CPU page table update */
 	spin_lock(&hmm->lock);
 	range->valid = true;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v3
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (8 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 09/15] mm/hmm: cleanup special vma handling (VM_SPECIAL) jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 11/15] mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE jglisse
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

There is no point in differentiating between a range for which there
is not even a page directory (and thus no entries) and an empty entry
(pte_none() or pmd_none() returns true).

Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag and
merge the now-duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear()
functions.
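
As an illustration only (not part of this patch), a driver walking the
result no longer needs to test HMM_PFN_EMPTY; an entry of 0 simply
means there is nothing to mirror (npages and i are assumed driver-side
variables):

  for (i = 0; i < npages; i++) {
          if (!range->pfns[i])
                  continue;       /* hole or pte_none(): nothing mapped */
          /* ... mirror the valid entry ... */
  }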

Changed since v1:
  - Improved comments
Changed since v2:
  - Typo in comments

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h |  8 +++-----
 mm/hmm.c            | 45 +++++++++++++++------------------------------
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 54d684fe3b90..cf283db22106 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,7 +84,6 @@ struct hmm;
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
- * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
@@ -94,10 +93,9 @@ struct hmm;
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_EMPTY (1 << 3)
-#define HMM_PFN_SPECIAL (1 << 4)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
-#define HMM_PFN_SHIFT 6
+#define HMM_PFN_SPECIAL (1 << 3)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
+#define HMM_PFN_SHIFT 5
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
diff --git a/mm/hmm.c b/mm/hmm.c
index a93c1e35df91..b8affe0bf4eb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -348,6 +348,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
 		*pfns = 0;
 }
 
+/*
+ * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @walk: mm_walk structure
+ * Returns: 0 on success, -EAGAIN after page fault, or page fault error
+ *
+ * This function will be called whenever pmd_none() or pte_none() returns true,
+ * or whenever there is no page directory covering the virtual address range.
+ */
 static int hmm_vma_walk_hole(unsigned long addr,
 			     unsigned long end,
 			     struct mm_walk *walk)
@@ -357,31 +367,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
-	hmm_vma_walk->last = addr;
-	i = (addr - range->start) >> PAGE_SHIFT;
-	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = HMM_PFN_EMPTY;
-		if (hmm_vma_walk->fault) {
-			int ret;
-
-			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
-			if (ret != -EAGAIN)
-				return ret;
-		}
-	}
-
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
-}
-
-static int hmm_vma_walk_clear(unsigned long addr,
-			      unsigned long end,
-			      struct mm_walk *walk)
-{
-	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	struct hmm_range *range = hmm_vma_walk->range;
-	uint64_t *pfns = range->pfns;
-	unsigned long i;
-
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
@@ -440,10 +425,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
 		if (pmd_protnone(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
+			return hmm_vma_walk_hole(start, end, walk);
 
 		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_clear(start, end, walk);
+			return hmm_vma_walk_hole(start, end, walk);
 
 		pfn = pmd_pfn(pmd) + pte_index(addr);
 		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
@@ -462,7 +447,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		pfns[i] = 0;
 
 		if (pte_none(pte)) {
-			pfns[i] = HMM_PFN_EMPTY;
+			pfns[i] = 0;
 			if (hmm_vma_walk->fault)
 				goto fault;
 			continue;
@@ -513,8 +498,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 fault:
 		pte_unmap(ptep);
-		/* Fault all pages in range */
-		return hmm_vma_walk_clear(start, end, walk);
+		/* Fault any virtual address we were asked to fault */
+		return hmm_vma_walk_hole(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 11/15] mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (9 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v3 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 12/15] mm/hmm: move hmm_pfns_clear() closer to where it is use jglisse
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Make naming consistent across the code: DEVICE_PRIVATE is the name
used outside HMM code, so use that one.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 include/linux/hmm.h | 4 ++--
 mm/hmm.c            | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index cf283db22106..e8515cad5a00 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,13 +88,13 @@ struct hmm;
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
+ * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
  */
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
 #define HMM_PFN_SPECIAL (1 << 3)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
+#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
 #define HMM_PFN_SHIFT 5
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index b8affe0bf4eb..c287fbbbf088 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -472,7 +472,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 					pfns[i] |= HMM_PFN_WRITE;
 				} else if (write_fault)
 					goto fault;
-				pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
+				pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
 			} else if (is_migration_entry(entry)) {
 				if (hmm_vma_walk->fault) {
 					pte_unmap(ptep);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 12/15] mm/hmm: move hmm_pfns_clear() closer to where it is use
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (10 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 11/15] mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() v2 jglisse
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

Move hmm_pfns_clear() closer to where it is used, to make it clear
that it is not used by the page table walkers.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 mm/hmm.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index c287fbbbf088..05b49a5d6674 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -340,14 +340,6 @@ static int hmm_pfns_bad(unsigned long addr,
 	return 0;
 }
 
-static void hmm_pfns_clear(uint64_t *pfns,
-			   unsigned long addr,
-			   unsigned long end)
-{
-	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
-}
-
 /*
  * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
  * @start: range virtual start address (inclusive)
@@ -506,6 +498,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static void hmm_pfns_clear(uint64_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = 0;
+}
+
 static void hmm_pfns_special(struct hmm_range *range)
 {
 	unsigned long addr = range->start, i = 0;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (11 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 12/15] mm/hmm: move hmm_pfns_clear() closer to where it is use jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 14/15] mm/hmm: change hmm_vma_fault() to allow write fault on page basis jglisse
  2018-03-23  0:55 ` [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2 jglisse
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove

From: Jérôme Glisse <jglisse@redhat.com>

No functional change; just create one function to handle the pmd case
and one to handle the pte case (hmm_vma_handle_pmd() and
hmm_vma_handle_pte()).

Changed since v1:
  - s/pfns/pfn for pte as in that case we are dealing with a single pfn

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
---
 mm/hmm.c | 174 +++++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 102 insertions(+), 72 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 05b49a5d6674..2cc4dda1fd2e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -375,6 +375,99 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	return hmm_vma_walk->fault ? -EAGAIN : 0;
 }
 
+static int hmm_vma_handle_pmd(struct mm_walk *walk,
+			      unsigned long addr,
+			      unsigned long end,
+			      uint64_t *pfns,
+			      pmd_t pmd)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	unsigned long pfn, i;
+	uint64_t flag = 0;
+
+	if (pmd_protnone(pmd))
+		return hmm_vma_walk_hole(addr, end, walk);
+
+	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pmd_write(pmd))
+		return hmm_vma_walk_hole(addr, end, walk);
+
+	pfn = pmd_pfn(pmd) + pte_index(addr);
+	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
+		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
+	hmm_vma_walk->last = end;
+	return 0;
+}
+
+static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
+			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
+			      uint64_t *pfn)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t pte = *ptep;
+
+	*pfn = 0;
+
+	if (pte_none(pte)) {
+		*pfn = 0;
+		if (hmm_vma_walk->fault)
+			goto fault;
+		return 0;
+	}
+
+	if (!pte_present(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if (!non_swap_entry(entry)) {
+			if (hmm_vma_walk->fault)
+				goto fault;
+			return 0;
+		}
+
+		/*
+		 * This is a special swap entry, ignore migration, use
+		 * device and report anything else as error.
+		 */
+		if (is_device_private_entry(entry)) {
+			*pfn = hmm_pfn_from_pfn(swp_offset(entry));
+			if (is_write_device_private_entry(entry)) {
+				*pfn |= HMM_PFN_WRITE;
+			} else if ((hmm_vma_walk->fault & hmm_vma_walk->write))
+				goto fault;
+			*pfn |= HMM_PFN_DEVICE_PRIVATE;
+			return 0;
+		}
+
+		if (is_migration_entry(entry)) {
+			if (hmm_vma_walk->fault) {
+				pte_unmap(ptep);
+				hmm_vma_walk->last = addr;
+				migration_entry_wait(vma->vm_mm,
+						pmdp, addr);
+				return -EAGAIN;
+			}
+			return 0;
+		}
+
+		/* Report error for everything else */
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pte_write(pte))
+		goto fault;
+
+	*pfn = hmm_pfn_from_pfn(pte_pfn(pte));
+	*pfn |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	return 0;
+
+fault:
+	pte_unmap(ptep);
+	/* Fault any virtual address we were asked to fault */
+	return hmm_vma_walk_hole(addr, end, walk);
+}
+
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long start,
 			    unsigned long end,
@@ -382,25 +475,20 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	struct vm_area_struct *vma = walk->vma;
 	uint64_t *pfns = range->pfns;
 	unsigned long addr = start, i;
-	bool write_fault;
 	pte_t *ptep;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
-	write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
 	if (pmd_none(*pmdp))
 		return hmm_vma_walk_hole(start, end, walk);
 
-	if (pmd_huge(*pmdp) && vma->vm_flags & VM_HUGETLB)
+	if (pmd_huge(*pmdp) && (range->vma->vm_flags & VM_HUGETLB))
 		return hmm_pfns_bad(start, end, walk);
 
 	if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
-		unsigned long pfn;
-		uint64_t flag = 0;
 		pmd_t pmd;
 
 		/*
@@ -416,17 +504,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		barrier();
 		if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
 			goto again;
-		if (pmd_protnone(pmd))
-			return hmm_vma_walk_hole(start, end, walk);
-
-		if (write_fault && !pmd_write(pmd))
-			return hmm_vma_walk_hole(start, end, walk);
 
-		pfn = pmd_pfn(pmd) + pte_index(addr);
-		flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
-		for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-			pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
-		return 0;
+		return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd);
 	}
 
 	if (pmd_bad(*pmdp))
@@ -434,67 +513,18 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 	ptep = pte_offset_map(pmdp, addr);
 	for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
-		pte_t pte = *ptep;
-
-		pfns[i] = 0;
-
-		if (pte_none(pte)) {
-			pfns[i] = 0;
-			if (hmm_vma_walk->fault)
-				goto fault;
-			continue;
-		}
-
-		if (!pte_present(pte)) {
-			swp_entry_t entry = pte_to_swp_entry(pte);
-
-			if (!non_swap_entry(entry)) {
-				if (hmm_vma_walk->fault)
-					goto fault;
-				continue;
-			}
+		int r;
 
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			 */
-			if (is_device_private_entry(entry)) {
-				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
-				if (is_write_device_private_entry(entry)) {
-					pfns[i] |= HMM_PFN_WRITE;
-				} else if (write_fault)
-					goto fault;
-				pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
-			} else if (is_migration_entry(entry)) {
-				if (hmm_vma_walk->fault) {
-					pte_unmap(ptep);
-					hmm_vma_walk->last = addr;
-					migration_entry_wait(vma->vm_mm,
-							     pmdp, addr);
-					return -EAGAIN;
-				}
-				continue;
-			} else {
-				/* Report error for everything else */
-				pfns[i] = HMM_PFN_ERROR;
-			}
-			continue;
+		r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, &pfns[i]);
+		if (r) {
+			/* hmm_vma_handle_pte() did unmap pte directory */
+			hmm_vma_walk->last = addr;
+			return r;
 		}
-
-		if (write_fault && !pte_write(pte))
-			goto fault;
-
-		pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte));
-		pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-		continue;
-
-fault:
-		pte_unmap(ptep);
-		/* Fault any virtual address we were asked to fault */
-		return hmm_vma_walk_hole(start, end, walk);
 	}
 	pte_unmap(ptep - 1);
 
+	hmm_vma_walk->last = addr;
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 14/15] mm/hmm: change hmm_vma_fault() to allow write fault on page basis
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (12 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() v2 jglisse
@ 2018-03-23  0:55 ` jglisse
  2018-03-23  0:55 ` [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2 jglisse
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

This changes hmm_vma_fault() to no longer take a global write fault
flag for the range. Instead it relies on the caller to populate the
HMM pfns array with the proper fault flags: HMM_PFN_VALID if the
driver wants a read fault for that address, or HMM_PFN_VALID and
HMM_PFN_WRITE for a write fault.

Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask
for device private memory to be migrated back to system memory through
a page fault.

This is a more flexible API and it better reflects how devices handle
and report faults.
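
As an illustration only (not part of this patch), the caller now
encodes the fault it wants on a per-page basis before calling
hmm_vma_fault(); device_wants_write() is a hypothetical driver
predicate and npages/i are assumed driver-side variables:

  for (i = 0; i < npages; i++) {
          range->pfns[i] = HMM_PFN_VALID;          /* request read fault */
          if (device_wants_write(i))               /* hypothetical */
                  range->pfns[i] |= HMM_PFN_WRITE; /* request write fault */
  }
  ret = hmm_vma_fault(range, true /* block */);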

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h |   2 +-
 mm/hmm.c            | 151 ++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 119 insertions(+), 34 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index e8515cad5a00..0f7ea3074175 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -317,7 +317,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct hmm_range *range, bool write, bool block);
+int hmm_vma_fault(struct hmm_range *range, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 2cc4dda1fd2e..290c872062a1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -299,12 +299,10 @@ struct hmm_vma_walk {
 	unsigned long		last;
 	bool			fault;
 	bool			block;
-	bool			write;
 };
 
-static int hmm_vma_do_fault(struct mm_walk *walk,
-			    unsigned long addr,
-			    uint64_t *pfn)
+static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
+			    bool write_fault, uint64_t *pfn)
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
@@ -312,7 +310,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
 	int r;
 
 	flags |= hmm_vma_walk->block ? 0 : FAULT_FLAG_ALLOW_RETRY;
-	flags |= hmm_vma_walk->write ? FAULT_FLAG_WRITE : 0;
+	flags |= write_fault ? FAULT_FLAG_WRITE : 0;
 	r = handle_mm_fault(vma, addr, flags);
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
@@ -344,15 +342,17 @@ static int hmm_pfns_bad(unsigned long addr,
  * hmm_vma_walk_hole() - handle a range lacking valid pmd or pte(s)
  * @start: range virtual start address (inclusive)
  * @end: range virtual end address (exclusive)
+ * @fault: should we fault or not ?
+ * @write_fault: write fault ?
  * @walk: mm_walk structure
  * Returns: 0 on success, -EAGAIN after page fault, or page fault error
  *
  * This function will be called whenever pmd_none() or pte_none() returns true,
  * or whenever there is no page directory covering the virtual address range.
  */
-static int hmm_vma_walk_hole(unsigned long addr,
-			     unsigned long end,
-			     struct mm_walk *walk)
+static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
+			      bool fault, bool write_fault,
+			      struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
@@ -363,16 +363,89 @@ static int hmm_vma_walk_hole(unsigned long addr,
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
 		pfns[i] = 0;
-		if (hmm_vma_walk->fault) {
+		if (fault || write_fault) {
 			int ret;
 
-			ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
+			ret = hmm_vma_do_fault(walk, addr, write_fault,
+					       &pfns[i]);
 			if (ret != -EAGAIN)
 				return ret;
 		}
 	}
 
-	return hmm_vma_walk->fault ? -EAGAIN : 0;
+	return (fault || write_fault) ? -EAGAIN : 0;
+}
+
+static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
+				      uint64_t pfns, uint64_t cpu_flags,
+				      bool *fault, bool *write_fault)
+{
+	*fault = *write_fault = false;
+	if (!hmm_vma_walk->fault)
+		return;
+
+	/* We aren't ask to do anything ... */
+	if (!(pfns & HMM_PFN_VALID))
+		return;
+	/* If CPU page table is not valid then we need to fault */
+	*fault = cpu_flags & HMM_PFN_VALID;
+	/* Need to write fault ? */
+	if ((pfns & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE)) {
+		*fault = *write_fault = false;
+		return;
+	}
+	/* Do we fault on device memory ? */
+	if ((pfns & HMM_PFN_DEVICE_PRIVATE) &&
+	    (cpu_flags & HMM_PFN_DEVICE_PRIVATE)) {
+		*write_fault = pfns & HMM_PFN_WRITE;
+		*fault = true;
+	}
+}
+
+static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
+				 const uint64_t *pfns, unsigned long npages,
+				 uint64_t cpu_flags, bool *fault,
+				 bool *write_fault)
+{
+	unsigned long i;
+
+	if (!hmm_vma_walk->fault) {
+		*fault = *write_fault = false;
+		return;
+	}
+
+	for (i = 0; i < npages; ++i) {
+		hmm_pte_need_fault(hmm_vma_walk, pfns[i], cpu_flags,
+				   fault, write_fault);
+		if ((*fault) || (*write_fault))
+			return;
+	}
+}
+
+static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	bool fault, write_fault;
+	unsigned long i, npages;
+	uint64_t *pfns;
+
+	i = (addr - range->start) >> PAGE_SHIFT;
+	npages = (end - addr) >> PAGE_SHIFT;
+	pfns = &range->pfns[i];
+	hmm_range_need_fault(hmm_vma_walk, pfns, npages,
+			     0, &fault, &write_fault);
+	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
+}
+
+static inline uint64_t pmd_to_hmm_pfn_flags(pmd_t pmd)
+{
+	if (pmd_protnone(pmd))
+		return 0;
+	return pmd_write(pmd) ? HMM_PFN_VALID |
+				HMM_PFN_WRITE :
+				HMM_PFN_VALID;
 }
 
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
@@ -382,14 +455,17 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      pmd_t pmd)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	unsigned long pfn, i;
-	uint64_t flag = 0;
+	unsigned long pfn, npages, i;
+	uint64_t flag = 0, cpu_flags;
+	bool fault, write_fault;
 
-	if (pmd_protnone(pmd))
-		return hmm_vma_walk_hole(addr, end, walk);
+	npages = (end - addr) >> PAGE_SHIFT;
+	cpu_flags = pmd_to_hmm_pfn_flags(pmd);
+	hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags,
+			     &fault, &write_fault);
 
-	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pmd_write(pmd))
-		return hmm_vma_walk_hole(addr, end, walk);
+	if (pmd_protnone(pmd) || fault || write_fault)
+		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
 	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
@@ -399,19 +475,32 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 	return 0;
 }
 
+static inline uint64_t pte_to_hmm_pfn_flags(pte_t pte)
+{
+	if (pte_none(pte) || !pte_present(pte))
+		return 0;
+	return pte_write(pte) ? HMM_PFN_VALID |
+				HMM_PFN_WRITE :
+				HMM_PFN_VALID;
+}
+
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
 			      uint64_t *pfn)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct vm_area_struct *vma = walk->vma;
+	bool fault, write_fault;
+	uint64_t cpu_flags;
 	pte_t pte = *ptep;
 
 	*pfn = 0;
+	cpu_flags = pte_to_hmm_pfn_flags(pte);
+	hmm_pte_need_fault(hmm_vma_walk, *pfn, cpu_flags,
+			   &fault, &write_fault);
 
 	if (pte_none(pte)) {
-		*pfn = 0;
-		if (hmm_vma_walk->fault)
+		if (fault || write_fault)
 			goto fault;
 		return 0;
 	}
@@ -420,7 +509,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		if (!non_swap_entry(entry)) {
-			if (hmm_vma_walk->fault)
+			if (fault || write_fault)
 				goto fault;
 			return 0;
 		}
@@ -430,21 +519,20 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * device and report anything else as error.
 		 */
 		if (is_device_private_entry(entry)) {
+			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
+			cpu_flags |= is_write_device_private_entry(entry) ?
+					HMM_PFN_WRITE : 0;
 			*pfn = hmm_pfn_from_pfn(swp_offset(entry));
-			if (is_write_device_private_entry(entry)) {
-				*pfn |= HMM_PFN_WRITE;
-			} else if ((hmm_vma_walk->fault & hmm_vma_walk->write))
-				goto fault;
 			*pfn |= HMM_PFN_DEVICE_PRIVATE;
 			return 0;
 		}
 
 		if (is_migration_entry(entry)) {
-			if (hmm_vma_walk->fault) {
+			if (fault || write_fault) {
 				pte_unmap(ptep);
 				hmm_vma_walk->last = addr;
 				migration_entry_wait(vma->vm_mm,
-						pmdp, addr);
+						     pmdp, addr);
 				return -EAGAIN;
 			}
 			return 0;
@@ -455,17 +543,16 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		return -EFAULT;
 	}
 
-	if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pte_write(pte))
+	if (fault || write_fault)
 		goto fault;
 
-	*pfn = hmm_pfn_from_pfn(pte_pfn(pte));
-	*pfn |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+	*pfn = hmm_pfn_from_pfn(pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
 	pte_unmap(ptep);
 	/* Fault any virtual address we were asked to fault */
-	return hmm_vma_walk_hole(addr, end, walk);
+	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 }
 
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
@@ -686,7 +773,6 @@ EXPORT_SYMBOL(hmm_vma_range_done);
 /*
  * hmm_vma_fault() - try to fault some address in a virtual address range
  * @range: range being faulted
- * @write: is it a write fault
  * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
  * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
  *
@@ -731,7 +817,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
  *
  * YOU HAVE BEEN WARNED !
  */
-int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
+int hmm_vma_fault(struct hmm_range *range, bool block)
 {
 	struct vm_area_struct *vma = range->vma;
 	unsigned long start = range->start;
@@ -779,7 +865,6 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
 	spin_unlock(&hmm->lock);
 
 	hmm_vma_walk.fault = true;
-	hmm_vma_walk.write = write;
 	hmm_vma_walk.block = block;
 	hmm_vma_walk.range = range;
 	mm_walk.private = &hmm_vma_walk;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
                   ` (13 preceding siblings ...)
  2018-03-23  0:55 ` [PATCH 14/15] mm/hmm: change hmm_vma_fault() to allow write fault on page basis jglisse
@ 2018-03-23  0:55 ` jglisse
  14 siblings, 0 replies; 21+ messages in thread
From: jglisse @ 2018-03-23  0:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
and a pfn shift value, allowing them to define their own encoding for
the HMM pfns that are filled in the pfns array of the hmm_range
struct. With this, device drivers can get pfns matching their own
private encoding out of HMM without having to do any conversion.
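
As an illustration only (not part of this patch), a driver whose own
page table encoding uses MY_PTE_* bits (all hypothetical values) could
describe that encoding to HMM like this:

  static const uint64_t my_flags[HMM_PFN_FLAG_MAX] = {
          [HMM_PFN_VALID]          = MY_PTE_VALID,
          [HMM_PFN_WRITE]          = MY_PTE_WRITE,
          [HMM_PFN_DEVICE_PRIVATE] = MY_PTE_DEVICE,
  };
  static const uint64_t my_values[HMM_PFN_VALUE_MAX] = {
          [HMM_PFN_ERROR]   = MY_PTE_ERROR,
          [HMM_PFN_NONE]    = 0,
          [HMM_PFN_SPECIAL] = MY_PTE_SPECIAL,
  };

  range.flags = my_flags;
  range.values = my_values;
  range.pfn_shift = MY_PTE_PFN_SHIFT;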

Changed since v1:
  - Split flags and special values for clarification
  - Improved comments and provide examples

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 130 +++++++++++++++++++++++++++++++++++++---------------
 mm/hmm.c            |  84 ++++++++++++++++++---------------
 2 files changed, 142 insertions(+), 72 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0f7ea3074175..5d26e0a223d9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,68 +80,145 @@
 struct hmm;
 
 /*
+ * hmm_pfn_flag_e - HMM flag enums
+ *
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
+ *
+ * The driver provide a flags array, if driver valid bit for an entry is bit
+ * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
+ * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
+ * Same logic apply to all flags. This is same idea as vm_page_prot in vma
+ * except that this is per device driver rather than per architecture.
+ */
+enum hmm_pfn_flag_e {
+	HMM_PFN_VALID = 0,
+	HMM_PFN_WRITE,
+	HMM_PFN_DEVICE_PRIVATE,
+	HMM_PFN_FLAG_MAX
+};
+
+/*
+ * hmm_pfn_value_e - HMM pfn special value
+ *
+ * Flags:
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
+ *
+ * Driver provide entry value for none entry, error entry and special entry,
+ * driver can alias (ie use same value for error and special for instance). It
+ * should not alias none and error or special.
+ *
+ * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
+ * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
+ * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
+ * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
  */
-#define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_SPECIAL (1 << 3)
-#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
-#define HMM_PFN_SHIFT 5
+enum hmm_pfn_value_e {
+	HMM_PFN_ERROR,
+	HMM_PFN_NONE,
+	HMM_PFN_SPECIAL,
+	HMM_PFN_VALUE_MAX
+};
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @vma: the vm area struct for the range
+ * @list: all range lock are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @flags: pfn flags to match device driver page table
+ * @values: pfn value for some special case (none, special, error, ...)
+ * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
+ * @valid: pfns array did not change since it has been fill by an HMM function
+ */
+struct hmm_range {
+	struct vm_area_struct	*vma;
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	uint64_t		*pfns;
+	const uint64_t		*flags;
+	const uint64_t		*values;
+	uint8_t			pfn_shift;
+	bool			valid;
+};
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @range: range use to decode HMM pfn value
  * @pfn: HMM pfn value to get corresponding struct page from
  * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
  * If the HMM pfn is valid (ie valid flag set) then return the struct page
  * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_to_page(uint64_t pfn)
+static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return NULL;
-	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return NULL;
+	return pfn_to_page(pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
+ * @range: range use to decode HMM pfn value
  * @pfn: HMM pfn value to extract pfn from
  * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return -1UL;
-	return (pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return -1UL;
+	return (pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @range: range use to encode HMM pfn value
  * @page: struct page pointer for which to create the HMM pfn
  * Returns: valid HMM pfn for the page
  */
-static inline uint64_t hmm_pfn_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(const struct hmm_range *range,
+					 struct page *page)
 {
-	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (page_to_pfn(page) << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 /*
  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @range: range use to encode HMM pfn value
  * @pfn: pfn value for which to create the HMM pfn
  * Returns: valid HMM pfn for the pfn
  */
-static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
+					unsigned long pfn)
 {
-	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (pfn << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 
@@ -263,25 +340,6 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
-/*
- * struct hmm_range - track invalidation lock on virtual address range
- *
- * @vma: the vm area struct for the range
- * @list: all range lock are on a list
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @pfns: array of pfns (big enough for the range)
- * @valid: pfns array did not change since it has been fill by an HMM function
- */
-struct hmm_range {
-	struct vm_area_struct	*vma;
-	struct list_head	list;
-	unsigned long		start;
-	unsigned long		end;
-	uint64_t		*pfns;
-	bool			valid;
-};
-
 /*
  * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
  * driver lock that serializes device page table updates, then call
diff --git a/mm/hmm.c b/mm/hmm.c
index 290c872062a1..e4742f6f1e05 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -306,6 +306,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	int r;
 
@@ -315,7 +316,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
 	if (r & VM_FAULT_ERROR) {
-		*pfn = HMM_PFN_ERROR;
+		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
@@ -333,7 +334,7 @@ static int hmm_pfns_bad(unsigned long addr,
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++)
-		pfns[i] = HMM_PFN_ERROR;
+		pfns[i] = range->values[HMM_PFN_ERROR];
 
 	return 0;
 }
@@ -362,7 +363,7 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = 0;
+		pfns[i] = range->values[HMM_PFN_NONE];
 		if (fault || write_fault) {
 			int ret;
 
@@ -380,24 +381,27 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 				      uint64_t pfns, uint64_t cpu_flags,
 				      bool *fault, bool *write_fault)
 {
+	struct hmm_range *range = hmm_vma_walk->range;
+
 	*fault = *write_fault = false;
 	if (!hmm_vma_walk->fault)
 		return;
 
 	/* We aren't ask to do anything ... */
-	if (!(pfns & HMM_PFN_VALID))
+	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
 	/* If CPU page table is not valid then we need to fault */
-	*fault = cpu_flags & HMM_PFN_VALID;
+	*fault = cpu_flags & range->flags[HMM_PFN_VALID];
 	/* Need to write fault ? */
-	if ((pfns & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE)) {
+	if ((pfns & range->flags[HMM_PFN_WRITE]) &&
+	    !(cpu_flags & range->flags[HMM_PFN_WRITE])) {
 		*fault = *write_fault = false;
 		return;
 	}
 	/* Do we fault on device memory ? */
-	if ((pfns & HMM_PFN_DEVICE_PRIVATE) &&
-	    (cpu_flags & HMM_PFN_DEVICE_PRIVATE)) {
-		*write_fault = pfns & HMM_PFN_WRITE;
+	if ((pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) &&
+	    (cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
+		*write_fault = pfns & range->flags[HMM_PFN_WRITE];
 		*fault = true;
 	}
 }
@@ -439,13 +443,13 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 }
 
-static inline uint64_t pmd_to_hmm_pfn_flags(pmd_t pmd)
+static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
 {
 	if (pmd_protnone(pmd))
 		return 0;
-	return pmd_write(pmd) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
@@ -455,12 +459,13 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      pmd_t pmd)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long pfn, npages, i;
-	uint64_t flag = 0, cpu_flags;
 	bool fault, write_fault;
+	uint64_t cpu_flags;
 
 	npages = (end - addr) >> PAGE_SHIFT;
-	cpu_flags = pmd_to_hmm_pfn_flags(pmd);
+	cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
 	hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags,
 			     &fault, &write_fault);
 
@@ -468,20 +473,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
-	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
+		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
 	hmm_vma_walk->last = end;
 	return 0;
 }
 
-static inline uint64_t pte_to_hmm_pfn_flags(pte_t pte)
+static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
 {
 	if (pte_none(pte) || !pte_present(pte))
 		return 0;
-	return pte_write(pte) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
@@ -489,13 +493,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      uint64_t *pfn)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	bool fault, write_fault;
 	uint64_t cpu_flags;
 	pte_t pte = *ptep;
 
-	*pfn = 0;
-	cpu_flags = pte_to_hmm_pfn_flags(pte);
+	*pfn = range->values[HMM_PFN_NONE];
+	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
 	hmm_pte_need_fault(hmm_vma_walk, *pfn, cpu_flags,
 			   &fault, &write_fault);
 
@@ -519,11 +524,16 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * device and report anything else as error.
 		 */
 		if (is_device_private_entry(entry)) {
-			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
+			cpu_flags = range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_DEVICE_PRIVATE];
 			cpu_flags |= is_write_device_private_entry(entry) ?
-					HMM_PFN_WRITE : 0;
-			*pfn = hmm_pfn_from_pfn(swp_offset(entry));
-			*pfn |= HMM_PFN_DEVICE_PRIVATE;
+				range->flags[HMM_PFN_WRITE] : 0;
+			hmm_pte_need_fault(hmm_vma_walk, *pfn, cpu_flags,
+					   &fault, &write_fault);
+			if (fault || write_fault)
+				goto fault;
+			*pfn = hmm_pfn_from_pfn(range, swp_offset(entry));
+			*pfn |= cpu_flags;
 			return 0;
 		}
 
@@ -539,14 +549,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		}
 
 		/* Report error for everything else */
-		*pfn = HMM_PFN_ERROR;
+		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
 	if (fault || write_fault)
 		goto fault;
 
-	*pfn = hmm_pfn_from_pfn(pte_pfn(pte)) | cpu_flags;
+	*pfn = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
@@ -615,12 +625,13 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
-static void hmm_pfns_clear(uint64_t *pfns,
+static void hmm_pfns_clear(struct hmm_range *range,
+			   uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
 	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
+		*pfns = range->values[HMM_PFN_NONE];
 }
 
 static void hmm_pfns_special(struct hmm_range *range)
@@ -628,7 +639,7 @@ static void hmm_pfns_special(struct hmm_range *range)
 	unsigned long addr = range->start, i = 0;
 
 	for (; addr < range->end; addr += PAGE_SIZE, i++)
-		range->pfns[i] = HMM_PFN_SPECIAL;
+		range->pfns[i] = range->values[HMM_PFN_SPECIAL];
 }
 
 /*
@@ -681,7 +692,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -834,7 +845,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -854,7 +865,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -887,7 +898,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
+			       range->end);
 		hmm_vma_range_done(range);
 	}
 	return ret;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread
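
To make the new encoding concrete, the sketch below shows how a driver
might consume the interface added by this patch. It is an illustration
only: the mydev_* names, the bit positions and the special values are
invented, and hmm_mirror registration, locking and error handling are
left out.

    #include <linux/hmm.h>

    /* Invented encoding: which bit(s) of the driver's own page table
     * entry format stand for each HMM flag. */
    static const uint64_t mydev_flags[HMM_PFN_FLAG_MAX] = {
            [HMM_PFN_VALID]          = 1ULL << 3,
            [HMM_PFN_WRITE]          = 1ULL << 6,
            [HMM_PFN_DEVICE_PRIVATE] = 1ULL << 5,
    };

    /* Values HMM writes for empty, poisoned and special CPU entries.
     * Error and special may alias each other but must not alias none. */
    static const uint64_t mydev_values[HMM_PFN_VALUE_MAX] = {
            [HMM_PFN_NONE]    = 0,
            [HMM_PFN_ERROR]   = ~0ULL,
            [HMM_PFN_SPECIAL] = ~0ULL,
    };

    static int mydev_fault_range(struct vm_area_struct *vma, uint64_t *pfns,
                                 unsigned long start, unsigned long end)
    {
            struct hmm_range range = {
                    .vma       = vma,
                    .start     = start,
                    .end       = end,
                    .pfns      = pfns,
                    .flags     = mydev_flags,
                    .values    = mydev_values,
                    .pfn_shift = PAGE_SHIFT,
            };
            unsigned long i, npages = (end - start) >> PAGE_SHIFT;
            int ret;

            /* The fault request uses the same encoding: ask for read
             * and write access on every page of the range. */
            for (i = 0; i < npages; i++)
                    pfns[i] = mydev_flags[HMM_PFN_VALID] |
                              mydev_flags[HMM_PFN_WRITE];

            ret = hmm_vma_fault(&range, true);
            if (ret)
                    return ret;

            for (i = 0; i < npages; i++) {
                    /* pfns[i] is already in the driver encoding; decode
                     * the struct page only where it is needed. */
                    struct page *page = hmm_pfn_to_page(&range, pfns[i]);

                    if (!page)
                            continue;
                    /* ... program pfns[i] into the device page table ... */
            }
            return 0;
    }

A real driver would additionally hold its page table lock and check
hmm_vma_range_done() before trusting the result, following the usage
pattern described in the HMM documentation reviewed later in this
thread.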

* Re: [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation
  2018-03-23  0:55 ` [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation jglisse
@ 2018-04-08  3:19   ` Randy Dunlap
  0 siblings, 0 replies; 21+ messages in thread
From: Randy Dunlap @ 2018-04-08  3:19 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Ralph Campbell, Stephen Bates,
	Jason Gunthorpe, Logan Gunthorpe, Evgeny Baskakov,
	Mark Hairgrove, John Hubbard

On 03/22/2018 05:55 PM, jglisse@redhat.com wrote:
> From: Ralph Campbell <rcampbell@nvidia.com>
> 
> This patch updates the documentation for HMM to fix minor typos and
> phrasing to be a bit more readable.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Stephen  Bates <sbates@raithlin.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  Documentation/vm/hmm.txt | 360 ++++++++++++++++++++++++-----------------------
>  MAINTAINERS              |   1 +
>  2 files changed, 187 insertions(+), 174 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> index 4d3aac9f4a5d..e99b97003982 100644
> --- a/Documentation/vm/hmm.txt
> +++ b/Documentation/vm/hmm.txt
> @@ -1,151 +1,159 @@
>  Heterogeneous Memory Management (HMM)
>  
> -Transparently allow any component of a program to use any memory region of said
> -program with a device without using device specific memory allocator. This is
> -becoming a requirement to simplify the use of advance heterogeneous computing
> -where GPU, DSP or FPGA are use to perform various computations.
> -
> -This document is divided as follow, in the first section i expose the problems
> -related to the use of a device specific allocator. The second section i expose
> -the hardware limitations that are inherent to many platforms. The third section
> -gives an overview of HMM designs. The fourth section explains how CPU page-
> -table mirroring works and what is HMM purpose in this context. Fifth section
> -deals with how device memory is represented inside the kernel. Finaly the last
> -section present the new migration helper that allow to leverage the device DMA
> -engine.
> -
> -
> -1) Problems of using device specific memory allocator:
> -2) System bus, device memory characteristics
> -3) Share address space and migration
> +Provide infrastructure and helpers to integrate non conventional memory (device

                                                   non-conventional

> +memory like GPU on board memory) into regular kernel code path. Corner stone of

                                                             path, with the cornerstone of

> +this being specialize struct page for such memory (see sections 5 to 7 of this

              specialized

> +document).
> +
> +HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a

            provides                                        Memory), i.e.,

> +device to transparently access program address coherently with the CPU meaning
> +that any valid pointer on the CPU is also a valid pointer for the device. This
> +is becoming a mandatory to simplify the use of advance heterogeneous computing

      becoming mandatory                          advanced

> +where GPU, DSP, or FPGA are used to perform various computations on behalf of
> +a process.
> +
> +This document is divided as follows: in the first section I expose the problems
> +related to using device specific memory allocators. In the second section, I
> +expose the hardware limitations that are inherent to many platforms. The third
> +section gives an overview of the HMM design. The fourth section explains how
> +CPU page-table mirroring works and what is HMM's purpose in this context. The

                                  and the purpose of HMM in this context.

> +fifth section deals with how device memory is represented inside the kernel.
> +Finally, the last section presents a new migration helper that allows lever-
> +aging the device DMA engine.
> +
> +
> +1) Problems of using a device specific memory allocator:
> +2) I/O bus, device memory characteristics
> +3) Shared address space and migration
>  4) Address space mirroring implementation and API
>  5) Represent and manage device memory from core kernel point of view
> -6) Migrate to and from device memory
> +6) Migration to and from device memory
>  7) Memory cgroup (memcg) and rss accounting
>  
>  
>  -------------------------------------------------------------------------------
>  
> -1) Problems of using device specific memory allocator:
> +1) Problems of using a device specific memory allocator:
>  
> -Device with large amount of on board memory (several giga bytes) like GPU have
> -historically manage their memory through dedicated driver specific API. This
> -creates a disconnect between memory allocated and managed by device driver and
> -regular application memory (private anonymous, share memory or regular file
> -back memory). From here on i will refer to this aspect as split address space.
> -I use share address space to refer to the opposite situation ie one in which
> -any memory region can be use by device transparently.
> +Devices with a large amount of on board memory (several giga bytes) like GPUs

                                                           gigabytes)

> +have historically managed their memory through dedicated driver specific APIs.
> +This creates a disconnect between memory allocated and managed by a device
> +driver and regular application memory (private anonymous, shared memory, or
> +regular file backed memory). From here on I will refer to this aspect as split
> +address space. I use shared address space to refer to the opposite situation:
> +i.e., one in which any application memory region can be used by a device
> +transparently.
>  
>  Split address space because device can only access memory allocated through the

Awkward sentence: maybe:
   Split address space happens because

> -device specific API. This imply that all memory object in a program are not
> -equal from device point of view which complicate large program that rely on a
> -wide set of libraries.
> +device specific API. This implies that all memory objects in a program are not
> +equal from the device point of view which complicates large programs that rely
> +on a wide set of libraries.
>  
> -Concretly this means that code that wants to leverage device like GPU need to
> +Concretly this means that code that wants to leverage devices like GPUs need to

   Concretely                                                              needs

>  copy object between genericly allocated memory (malloc, mmap private/share/)

        object [or an object] between generically

>  and memory allocated through the device driver API (this still end up with an

                                                                  ends up

>  mmap but of the device file).
>  
> -For flat dataset (array, grid, image, ...) this isn't too hard to achieve but
> -complex data-set (list, tree, ...) are hard to get right. Duplicating a complex
> -data-set need to re-map all the pointer relations between each of its elements.
> -This is error prone and program gets harder to debug because of the duplicate
> -data-set.
> +For flat data-sets (array, grid, image, ...) this isn't too hard to achieve but

            data sets

> +complex data-sets (list, tree, ...) are hard to get right. Duplicating a

           data sets

> +complex data-set needs to re-map all the pointer relations between each of its

           data set
> +elements. This is error prone and program gets harder to debug because of the
> +duplicate data-set and addresses.

             data set
>  
> -Split address space also means that library can not transparently use data they
> -are getting from core program or other library and thus each library might have
> -to duplicate its input data-set using specific memory allocator. Large project
> -suffer from this and waste resources because of the various memory copy.
> +Split address space also means that libraries can not transparently use data

                                                 cannot

> +they are getting from the core program or another library and thus each library
> +might have to duplicate its input data-set using the device specific memory

                                     data set

> +allocator. Large projects suffer from this and waste resources because of the
> +various memory copies.
>  
>  Duplicating each library API to accept as input or output memory allocted by

                                                                    allocated

>  each device specific allocator is not a viable option. It would lead to a
> -combinatorial explosions in the library entry points.
> +combinatorial explosion in the library entry points.
>  
> -Finaly with the advance of high level language constructs (in C++ but in other
> -language too) it is now possible for compiler to leverage GPU or other devices
> -without even the programmer knowledge. Some of compiler identified patterns are
> -only do-able with a share address. It is as well more reasonable to use a share
> -address space for all the other patterns.
> +Finally, with the advance of high level language constructs (in C++ but in
> +other languages too) it is now possible for the compiler to leverage GPUs and
> +other devices without programmer knowledge. Some compiler identified patterns
> +are only do-able with a shared address space. It is also more reasonable to use
> +a shared address space for all other patterns.
>  
>  
>  -------------------------------------------------------------------------------
>  
> -2) System bus, device memory characteristics
> +2) I/O bus, device memory characteristics
>  
> -System bus cripple share address due to few limitations. Most system bus only
> +I/O buses cripple shared address due to few limitations. Most I/O buses only

                     shared address spaces due to a few limitations.

>  allow basic memory access from device to main memory, even cache coherency is

                                                 memory; even

> -often optional. Access to device memory from CPU is even more limited, most
> -often than not it is not cache coherent.
> +often optional. Access to device memory from CPU is even more limited. More
> +often than not, it is not cache coherent.
>  
> -If we only consider the PCIE bus than device can access main memory (often
> -through an IOMMU) and be cache coherent with the CPUs. However it only allows
> -a limited set of atomic operation from device on main memory. This is worse
> -in the other direction the CPUs can only access a limited range of the device
> +If we only consider the PCIE bus, then a device can access main memory (often
> +through an IOMMU) and be cache coherent with the CPUs. However, it only allows
> +a limited set of atomic operations from device on main memory. This is worse
> +in the other direction, the CPU can only access a limited range of the device

          other direction:

>  memory and can not perform atomic operations on it. Thus device memory can not

              cannot                                                      cannot

> -be consider like regular memory from kernel point of view.
> +be considered the same as regular memory from the kernel point of view.
>  
>  Another crippling factor is the limited bandwidth (~32GBytes/s with PCIE 4.0
> -and 16 lanes). This is 33 times less that fastest GPU memory (1 TBytes/s).
> -The final limitation is latency, access to main memory from the device has an
> -order of magnitude higher latency than when the device access its own memory.
> +and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
> +The final limitation is latency. Access to main memory from the device has an
> +order of magnitude higher latency than when the device accesses its own memory.
>  
> -Some platform are developing new system bus or additions/modifications to PCIE
> -to address some of those limitations (OpenCAPI, CCIX). They mainly allow two
> +Some platforms are developing new I/O buses or additions/modifications to PCIE
> +to address some of these limitations (OpenCAPI, CCIX). They mainly allow two

                                                                            two-

>  way cache coherency between CPU and device and allow all atomic operations the
> -architecture supports. Saddly not all platform are following this trends and
> -some major architecture are left without hardware solutions to those problems.
> +architecture supports. Saddly, not all platforms are following this trend and

                          Sadly,

> +some major architectures are left without hardware solutions to these problems.
>  
> -So for share address space to make sense not only we must allow device to
> +So for shared address space to make sense, not only must we allow device to

                                                                     devices to

>  access any memory memory but we must also permit any memory to be migrated to

          any memory but

>  device memory while device is using it (blocking CPU access while it happens).
>  
>  
>  -------------------------------------------------------------------------------
>  
> -3) Share address space and migration
> +3) Shared address space and migration
>  
>  HMM intends to provide two main features. First one is to share the address
> -space by duplication the CPU page table into the device page table so same
> -address point to same memory and this for any valid main memory address in
> +space by duplicating the CPU page table in the device page table so the same
> +address points to the same physical memory for any valid main memory address in
>  the process address space.
>  
> -To achieve this, HMM offer a set of helpers to populate the device page table
> +To achieve this, HMM offers a set of helpers to populate the device page table
>  while keeping track of CPU page table updates. Device page table updates are
> -not as easy as CPU page table updates. To update the device page table you must
> -allow a buffer (or use a pool of pre-allocated buffer) and write GPU specifics
> -commands in it to perform the update (unmap, cache invalidations and flush,
> -...). This can not be done through common code for all device. Hence why HMM
> -provides helpers to factor out everything that can be while leaving the gory
> -details to the device driver.
> -
> -The second mechanism HMM provide is a new kind of ZONE_DEVICE memory that does
> -allow to allocate a struct page for each page of the device memory. Those page
> -are special because the CPU can not map them. They however allow to migrate
> -main memory to device memory using exhisting migration mechanism and everything
> -looks like if page was swap out to disk from CPU point of view. Using a struct
> -page gives the easiest and cleanest integration with existing mm mechanisms.
> -Again here HMM only provide helpers, first to hotplug new ZONE_DEVICE memory
> -for the device memory and second to perform migration. Policy decision of what
> -and when to migrate things is left to the device driver.
> -
> -Note that any CPU access to a device page trigger a page fault and a migration
> -back to main memory ie when a page backing an given address A is migrated from
> -a main memory page to a device page then any CPU access to address A trigger a
> -page fault and initiate a migration back to main memory.
> -
> -
> -With this two features, HMM not only allow a device to mirror a process address
> -space and keeps both CPU and device page table synchronize, but also allow to
> -leverage device memory by migrating part of data-set that is actively use by a
> -device.
> +not as easy as CPU page table updates. To update the device page table, you must
> +allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
> +specific commands in it to perform the update (unmap, cache invalidations, and
> +flush, ...). This can not be done through common code for all devices. Hence

                     cannot

> +why HMM provides helpers to factor out everything that can be while leaving the
> +hardware specific details to the device driver.
> +
> +The second mechanism HMM provides, is a new kind of ZONE_DEVICE memory that

                            provides is

> +allows allocating a struct page for each page of the device memory. Those pages
> +are special because the CPU can not map them. However, they allow migrating

                               cannot

> +main memory to device memory using existing migration mechanisms and everything
> +looks like a page is swapped out to disk from the CPU point of view. Using a
> +struct page gives the easiest and cleanest integration with existing mm mech-
> +anisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE
> +memory for the device memory and second to perform migration. Policy decisions
> +of what and when to migrate things is left to the device driver.
> +
> +Note that any CPU access to a device page triggers a page fault and a migration
> +back to main memory. For example, when a page backing a given CPU address A is
> +migrated from a main memory page to a device page, then any CPU access to
> +address A triggers a page fault and initiates a migration back to main memory.
> +
> +With these two features, HMM not only allows a device to mirror process address
> +space and keeping both CPU and device page table synchronized, but also lever-
> +ages device memory by migrating the part of the data-set that is actively being

                                                   data set

> +used by the device.
>  
>  
>  -------------------------------------------------------------------------------
>  
>  4) Address space mirroring implementation and API
>  
> -Address space mirroring main objective is to allow to duplicate range of CPU
> -page table into a device page table and HMM helps keeping both synchronize. A
> +Address space mirroring's main objective is to allow duplication of a range of
> +CPU page table into a device page table; HMM helps keep both synchronized. A
>  device driver that want to mirror a process address space must start with the

                      wants

>  registration of an hmm_mirror struct:
>  
> @@ -155,8 +163,8 @@ device driver that want to mirror a process address space must start with the
>                                  struct mm_struct *mm);
>  
>  The locked variant is to be use when the driver is already holding the mmap_sem

                         to be used

> -of the mm in write mode. The mirror struct has a set of callback that are use
> -to propagate CPU page table:
> +of the mm in write mode. The mirror struct has a set of callbacks that are used
> +to propagate CPU page tables:
>  
>   struct hmm_mirror_ops {
>       /* sync_cpu_device_pagetables() - synchronize page tables
> @@ -181,13 +189,13 @@ of the mm in write mode. The mirror struct has a set of callback that are use
>                       unsigned long end);
>   };
>  
> -Device driver must perform update to the range following action (turn range
> -read only, or fully unmap, ...). Once driver callback returns the device must
> -be done with the update.
> +The device driver must perform the update action to the range (mark range
> +read only, or fully unmap, ...). The device must be done with the update before
> +the driver callback returns.
>  
>  
> -When device driver wants to populate a range of virtual address it can use
> -either:
> +When the device driver wants to populate a range of virtual addresses, it can
> +use either:
>   int hmm_vma_get_pfns(struct vm_area_struct *vma,
>                        struct hmm_range *range,
>                        unsigned long start,
> @@ -201,17 +209,19 @@ When device driver wants to populate a range of virtual address it can use
>                     bool write,
>                     bool block);
>  
> -First one (hmm_vma_get_pfns()) will only fetch present CPU page table entry and
> -will not trigger a page fault on missing or non present entry. The second one
> -do trigger page fault on missing or read only entry if write parameter is true.
> -Page fault use the generic mm page fault code path just like a CPU page fault.
> +The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
> +entries and will not trigger a page fault on missing or non present entries.

                                                           non-present

> +The second one does trigger a page fault on missing or read only entry if the

                                                          read-only

> +write parameter is true. Page faults use the generic mm page fault code path
> +just like a CPU page fault.
>  
> -Both function copy CPU page table into their pfns array argument. Each entry in
> -that array correspond to an address in the virtual range. HMM provide a set of
> -flags to help driver identify special CPU page table entries.
> +Both functions copy CPU page table entries into their pfns array argument. Each
> +entry in that array corresponds to an address in the virtual range. HMM
> +provides a set of flags to help the driver identify special CPU page table
> +entries.
>  
>  Locking with the update() callback is the most important aspect the driver must
> -respect in order to keep things properly synchronize. The usage pattern is :
> +respect in order to keep things properly synchronized. The usage pattern is:
>  
>   int driver_populate_range(...)
>   {
> @@ -233,43 +243,44 @@ Locking with the update() callback is the most important aspect the driver must
>        return 0;
>   }
>  
> -The driver->update lock is the same lock that driver takes inside its update()
> -callback. That lock must be call before hmm_vma_range_done() to avoid any race
> -with a concurrent CPU page table update.
> +The driver->update lock is the same lock that the driver takes inside its
> +update() callback. That lock must be held before hmm_vma_range_done() to avoid
> +any race with a concurrent CPU page table update.
>  
> -HMM implements all this on top of the mmu_notifier API because we wanted to a
> -simpler API and also to be able to perform optimization latter own like doing
> -concurrent device update in multi-devices scenario.
> +HMM implements all this on top of the mmu_notifier API because we wanted a
> +simpler API and also to be able to perform optimizations latter on like doing
> +concurrent device updates in multi-devices scenario.
>  
> -HMM also serve as an impedence missmatch between how CPU page table update are
> -done (by CPU write to the page table and TLB flushes) from how device update
> -their own page table. Device update is a multi-step process, first appropriate
> -commands are write to a buffer, then this buffer is schedule for execution on
> -the device. It is only once the device has executed commands in the buffer that
> -the update is done. Creating and scheduling update command buffer can happen
> -concurrently for multiple devices. Waiting for each device to report commands
> -as executed is serialize (there is no point in doing this concurrently).
> +HMM also serves as an impedence mismatch between how CPU page table updates

                         impedance

> +are done (by CPU write to the page table and TLB flushes) and how devices
> +update their own page table. Device updates are a multi-step process. First,
> +appropriate commands are writen to a buffer, then this buffer is scheduled for

                            written

> +execution on the device. It is only once the device has executed commands in
> +the buffer that the update is done. Creating and scheduling the update command
> +buffer can happen concurrently for multiple devices. Waiting for each device to
> +report commands as executed is serialized (there is no point in doing this
> +concurrently).
>  
>  
>  -------------------------------------------------------------------------------
>  
>  5) Represent and manage device memory from core kernel point of view
>  
> -Several differents design were try to support device memory. First one use
> -device specific data structure to keep information about migrated memory and
> -HMM hooked itself in various place of mm code to handle any access to address
> -that were back by device memory. It turns out that this ended up replicating
> -most of the fields of struct page and also needed many kernel code path to be
> -updated to understand this new kind of memory.
> +Several different designs were tried to support device memory. First one used
> +a device specific data structure to keep information about migrated memory and
> +HMM hooked itself in various places of mm code to handle any access to
> +addresses that were backed by device memory. It turns out that this ended up
> +replicating most of the fields of struct page and also needed many kernel code
> +paths to be updated to understand this new kind of memory.
>  
> -Thing is most kernel code path never try to access the memory behind a page
> -but only care about struct page contents. Because of this HMM switchted to
> -directly using struct page for device memory which left most kernel code path
> -un-aware of the difference. We only need to make sure that no one ever try to
> -map those page from the CPU side.
> +Most kernel code paths never try to access the memory behind a page
> +but only care about struct page contents. Because of this, HMM switched to
> +directly using struct page for device memory which left most kernel code paths
> +unaware of the difference. We only need to make sure that no one ever tries to
> +map those pages from the CPU side.
>  
> -HMM provide a set of helpers to register and hotplug device memory as a new
> -region needing struct page. This is offer through a very simple API:
> +HMM provides a set of helpers to register and hotplug device memory as a new
> +region needing a struct page. This is offered through a very simple API:
>  
>   struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>                                     struct device *device,
> @@ -289,18 +300,19 @@ HMM provide a set of helpers to register and hotplug device memory as a new
>   };
>  
>  The first callback (free()) happens when the last reference on a device page is
> -drop. This means the device page is now free and no longer use by anyone. The
> -second callback happens whenever CPU try to access a device page which it can
> -not do. This second callback must trigger a migration back to system memory.
> +dropped. This means the device page is now free and no longer used by anyone.
> +The second callback happens whenever the CPU tries to access a device page
> +which it can not do. This second callback must trigger a migration back to

            cannot

> +system memory.
>  
>  
>  -------------------------------------------------------------------------------
>  
> -6) Migrate to and from device memory
> +6) Migration to and from device memory
>  
> -Because CPU can not access device memory, migration must use device DMA engine
> -to perform copy from and to device memory. For this we need a new migration
> -helper:
> +Because the CPU can not access device memory, migration must use the device DMA

                   cannot

> +engine to perform copy from and to device memory. For this we need a new
> +migration helper:
>  
>   int migrate_vma(const struct migrate_vma_ops *ops,
>                   struct vm_area_struct *vma,
> @@ -311,15 +323,15 @@ to perform copy from and to device memory. For this we need a new migration
>                   unsigned long *dst,
>                   void *private);
>  
> -Unlike other migration function it works on a range of virtual address, there
> -is two reasons for that. First device DMA copy has a high setup overhead cost
> +Unlike other migration functions it works on a range of virtual address, there
> +are two reasons for that. First, device DMA copy has a high setup overhead cost
>  and thus batching multiple pages is needed as otherwise the migration overhead
> -make the whole excersie pointless. The second reason is because driver trigger
> -such migration base on range of address the device is actively accessing.
> +makes the whole exersize pointless. The second reason is because the

                   exercise

> +migration might be for a range of addresses the device is actively accessing.
>  
> -The migrate_vma_ops struct define two callbacks. First one (alloc_and_copy())
> -control destination memory allocation and copy operation. Second one is there
> -to allow device driver to perform cleanup operation after migration.
> +The migrate_vma_ops struct defines two callbacks. First one (alloc_and_copy())
> +controls destination memory allocation and copy operation. Second one is there
> +to allow the device driver to perform cleanup operations after migration.
>  
>   struct migrate_vma_ops {
>       void (*alloc_and_copy)(struct vm_area_struct *vma,
> @@ -336,19 +348,19 @@ to allow device driver to perform cleanup operation after migration.
>                                void *private);
>   };
>  
> -It is important to stress that this migration helpers allow for hole in the
> +It is important to stress that these migration helpers allow for holes in the
>  virtual address range. Some pages in the range might not be migrated for all
> -the usual reasons (page is pin, page is lock, ...). This helper does not fail
> -but just skip over those pages.
> +the usual reasons (page is pinned, page is locked, ...). This helper does not
> +fail but just skips over those pages.
>  
> -The alloc_and_copy() might as well decide to not migrate all pages in the
> -range (for reasons under the callback control). For those the callback just
> -have to leave the corresponding dst entry empty.
> +The alloc_and_copy() might decide to not migrate all pages in the
> +range (for reasons under the callback control). For those, the callback just
> +has to leave the corresponding dst entry empty.
>  
> -Finaly the migration of the struct page might fails (for file back page) for
> +Finally, the migration of the struct page might fail (for file backed page) for
>  various reasons (failure to freeze reference, or update page cache, ...). If
> -that happens then the finalize_and_map() can catch any pages that was not
> -migrated. Note those page were still copied to new page and thus we wasted
> +that happens, then the finalize_and_map() can catch any pages that were not
> +migrated. Note those pages were still copied to a new page and thus we wasted
>  bandwidth but this is considered as a rare event and a price that we are
>  willing to pay to keep all the code simpler.
>  
> @@ -358,27 +370,27 @@ willing to pay to keep all the code simpler.
>  7) Memory cgroup (memcg) and rss accounting
>  
>  For now device memory is accounted as any regular page in rss counters (either
> -anonymous if device page is use for anonymous, file if device page is use for
> -file back page or shmem if device page is use for share memory). This is a
> -deliberate choice to keep existing application that might start using device
> -memory without knowing about it to keep runing unimpacted.
> -
> -Drawbacks is that OOM killer might kill an application using a lot of device
> -memory and not a lot of regular system memory and thus not freeing much system
> -memory. We want to gather more real world experience on how application and
> -system react under memory pressure in the presence of device memory before
> +anonymous if device page is used for anonymous, file if device page is used for
> +file backed page or shmem if device page is used for shared memory). This is a
> +deliberate choice to keep existing applications, that might start using device
> +memory without knowing about it, running unimpacted.
> +
> +A Drawback is that the OOM killer might kill an application using a lot of

     drawback

> +device memory and not a lot of regular system memory and thus not freeing much
> +system memory. We want to gather more real world experience on how applications
> +and system react under memory pressure in the presence of device memory before
>  deciding to account device memory differently.
>  
>  
> -Same decision was made for memory cgroup. Device memory page are accounted
> +Same decision was made for memory cgroup. Device memory pages are accounted
>  against same memory cgroup a regular page would be accounted to. This does
>  simplify migration to and from device memory. This also means that migration
>  back from device memory to regular memory can not fail because it would

                                             cannot

>  go above memory cgroup limit. We might revisit this choice latter on once we
> -get more experience in how device memory is use and its impact on memory
> +get more experience in how device memory is used and its impact on memory
>  resource control.
>  
>  
> -Note that device memory can never be pin nor by device driver nor through GUP
> +Note that device memory can never be pinned by device driver nor through GUP
>  and thus such memory is always free upon process exit. Or when last reference
> -is drop in case of share memory or file back memory.
> +is dropped in case of shared memory or file backed memory.


-- 
~Randy

^ permalink raw reply	[flat|nested] 21+ messages in thread
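
The locking rule in the text quoted above (hold the same lock the
update() callback takes before calling hmm_vma_range_done(), and retry
if the snapshot was invalidated) is easier to see as code. Below is a
minimal sketch using the struct hmm_range based calls from the patch
earlier in this thread; my_driver, driver_lock()/driver_unlock() and
the device page table update step are placeholders, not real
interfaces.

    static int driver_populate_range(struct my_driver *drv,
                                     struct hmm_range *range)
    {
            int ret;

    again:
            ret = hmm_vma_get_pfns(range);
            if (ret)
                    return ret;

            /* Same lock the update() callback takes. */
            driver_lock(drv);
            if (!hmm_vma_range_done(range)) {
                    /* A concurrent CPU page table update invalidated
                     * the snapshot: drop the lock and start over. */
                    driver_unlock(drv);
                    goto again;
            }

            /* Snapshot is stable while the lock is held: mirror
             * range->pfns into the device page table (placeholder). */
            driver_update_device_page_table(drv, range);
            driver_unlock(drv);

            return 0;
    }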

* Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-21 15:52     ` Jerome Glisse
@ 2018-03-21 23:19       ` John Hubbard
  0 siblings, 0 replies; 21+ messages in thread
From: John Hubbard @ 2018-03-21 23:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On 03/21/2018 08:52 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 09:39:27PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
> 
> [...]
> 

<snip>

>>
>> Let's just keep it simple, and go back to the bitmap flags!
> 
> This simplifies the nouveau code, and it is the reason why I did this
> patch. I am sure it can simplify the NVidia uvm code as well; I can look
> into it if you want to give pointers. The idea here is that HMM can fill
> the array with something that matches the device driver's internal
> format and avoid the conversion step from HMM format to driver format
> (saving CPU cycles and memory in the process). I am open to alternatives
> that give the same end result.
> 
> [Just because code is worth 2^32 words :)
> 
> Without this patch:
>     int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
>     {
>         uint64_t *hmm_pfns, *nouveau_pfns;
> 
>         hmm_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
>         nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
>         hmm_vma_fault(..., hmm_pfns, ...);
> 
>         for (i = 0; i < npages; ++i) {
>             nouveau_pfns[i] = nouveau_pfn_from_hmm_pfn(hmm_pfns[i]);
>         }
>         ...
>     }
> 
> With this patch:
>     int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
>     {
>         uint64_t *nouveau_pfns;
> 
>         nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
>         hmm_vma_fault(..., nouveau_pfns, ...);
> 
>         ...
>     }
> 
> The benefit from this patch is quite obvious to me. Down the road, with
> a bit more integration between HMM and IOMMU/DMA, this can turn into
> something directly ready for hardware consumption.
> 
> Note that you could argue that I could convert nouveau to use the HMM
> format, but this would not work: first because it requires a lot of
> changes in nouveau, second because HMM does not have all the flags
> needed by the drivers (nor does HMM need them). HMM being the helper
> here, I feel it is up to HMM to adapt to the drivers rather than the
> other way around.]
> 

OK, if this simplifies Nouveau and potentially other drivers, then I'll 
drop my earlier objections! Thanks for explaining what's going on, in detail.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-21  4:39   ` John Hubbard
@ 2018-03-21 15:52     ` Jerome Glisse
  2018-03-21 23:19       ` John Hubbard
  0 siblings, 1 reply; 21+ messages in thread
From: Jerome Glisse @ 2018-03-21 15:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Andrew Morton, linux-kernel, Evgeny Baskakov,
	Ralph Campbell, Mark Hairgrove

On Tue, Mar 20, 2018 at 09:39:27PM -0700, John Hubbard wrote:
> On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 0f7ea3074175..5d26e0a223d9 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -80,68 +80,145 @@
> >  struct hmm;
> >  
> >  /*
> > + * hmm_pfn_flag_e - HMM flag enums
> > + *
> >   * Flags:
> >   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
> >   * HMM_PFN_WRITE: CPU page table has write permission set
> > + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
> > + *
> > + * The driver provide a flags array, if driver valid bit for an entry is bit
> > + * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
> > + * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
> > + * Same logic apply to all flags. This is same idea as vm_page_prot in vma
> > + * except that this is per device driver rather than per architecture.
> 
> Hi Jerome,
> 
> If we go with this approach--and I hope not, I'll try to talk you down from the
> ledge, in a moment--then maybe we should add the following to the comments: 
> 
> "There is only one bit ever set in each hmm_range.flags[entry]." 

This is not necessarily true: a driver can have multiple bits set
matching one HMM flag. Actually I do expect to see that (in nouveau).
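
For example, a device whose page table entries carry separate read and
write permission bits could describe that to HMM like this (the bit
layout is invented for the sketch, not taken from any real driver):

    /* Hypothetical device entry layout: bit 0 = read allowed,
     * bit 1 = write allowed, bit 2 = entry valid. */
    static const uint64_t dev_flags[HMM_PFN_FLAG_MAX] = {
            /* One HMM flag expands to several driver bits: a valid
             * entry is always at least readable on this device. */
            [HMM_PFN_VALID]          = (1 << 2) | (1 << 0),
            [HMM_PFN_WRITE]          = 1 << 1,
            [HMM_PFN_DEVICE_PRIVATE] = 1 << 3,
    };

HMM just ORs these masks into each entry it fills, so the core never
needs to know how many hardware bits stand behind one HMM flag.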


> Or maybe we'll get pushback that the code shows that already, but IMHO
> this is a strange way to do things (especially when there is a much
> easier way), and it deserves that extra bit of helpful documentation.
> 
> More below...
> 
> > + */
> > +enum hmm_pfn_flag_e {
> > +	HMM_PFN_VALID = 0,
> > +	HMM_PFN_WRITE,
> > +	HMM_PFN_DEVICE_PRIVATE,
> > +	HMM_PFN_FLAG_MAX
> > +};
> > +
> > +/*
> > + * hmm_pfn_value_e - HMM pfn special value
> > + *
> > + * Flags:
> >   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> > + * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
> >   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
> >   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
> >   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
> >   *      set and the pfn value is undefined.
> > - * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
> > + *
> > + * Driver provide entry value for none entry, error entry and special entry,
> > + * driver can alias (ie use same value for error and special for instance). It
> > + * should not alias none and error or special.
> > + *
> > + * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
> > + * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
> > + * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
> > + * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
> >   */
> > -#define HMM_PFN_VALID (1 << 0)
> > -#define HMM_PFN_WRITE (1 << 1)
> > -#define HMM_PFN_ERROR (1 << 2)
> > -#define HMM_PFN_SPECIAL (1 << 3)
> > -#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
> > -#define HMM_PFN_SHIFT 5
> > +enum hmm_pfn_value_e {
> > +	HMM_PFN_ERROR,
> > +	HMM_PFN_NONE,
> > +	HMM_PFN_SPECIAL,
> > +	HMM_PFN_VALUE_MAX
> > +};
> 
> I can think of perhaps two good solid ways to get what you want, without
> moving to what I consider an unnecessary excursion into arrays of flags. 
> If I understand correctly, you want to let each architecture
> specify which bit to use for each of the above HMM_PFN_* flags. 
> 
> The way you have it now, the code does things like this:
> 
>         cpu_flags & range->flags[HMM_PFN_WRITE]
> 
> but that array entry is mostly empty space, and it's confusing. It would
> be nicer to see:
> 
>         cpu_flags & HMM_PFN_WRITE
> 
> ...which you can easily do, by defining HMM_PFN_WRITE and friends in an
> arch-specific header file.

An arch-specific header does not work: HMM can be used in the same
kernel by different device drivers (AMD and NVidia for instance), and
each has its own page table entry format, so the formats won't match.
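
Concretely, two such drivers loaded into the same kernel each hand HMM
their own table, and a single compile-time constant could match at most
one of them (the bit positions below are invented purely for
illustration):

    static const uint64_t amd_example_flags[HMM_PFN_FLAG_MAX] = {
            [HMM_PFN_VALID]          = 1ULL << 0,
            [HMM_PFN_WRITE]          = 1ULL << 1,
            [HMM_PFN_DEVICE_PRIVATE] = 1ULL << 2,
    };

    static const uint64_t nv_example_flags[HMM_PFN_FLAG_MAX] = {
            [HMM_PFN_VALID]          = 1ULL << 3,
            [HMM_PFN_WRITE]          = 1ULL << 6,
            [HMM_PFN_DEVICE_PRIVATE] = 1ULL << 5,
    };

Which table applies is decided per hmm_range at run time, not per
architecture at build time.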


> The other way to make this more readable would be to use helper routines
> similar to what the vm_pgprot* routines do:
> 
> static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
> {
> 	return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
> }
> 
> ...but that's also unnecessary.
> 
> Let's just keep it simple, and go back to the bitmap flags!

This simplifies the nouveau code, and it is the reason why I did this
patch. I am sure it can simplify the NVidia uvm code as well; I can look
into it if you want to give pointers. The idea here is that HMM can fill
the array with something that matches the device driver's internal
format and avoid the conversion step from HMM format to driver format
(saving CPU cycles and memory in the process). I am open to alternatives
that give the same end result.

[Just because code is worth 2^32 words :)

Without this patch:
    int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
    {
        uint64_t *hmm_pfns, *nouveau_pfns;

        hmm_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
        nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
        hmm_vma_fault(..., hmm_pfns, ...);

        for (i = 0; i < npages; ++i) {
            nouveau_pfns[i] = nouveau_pfn_from_hmm_pfn(hmm_pfns[i]);
        }
        ...
    }

With this patch:
    int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
    {
        uint64_t *nouveau_pfns;

        nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
        hmm_vma_fault(..., nouveau_pfns, ...);

        ...
    }

The benefit from this patch is quite obvious to me. Down the road, with
a bit more integration between HMM and IOMMU/DMA, this can turn into
something directly ready for hardware consumption.

Note that you could argue that I could convert nouveau to use the HMM
format, but this would not work: first because it requires a lot of
changes in nouveau, second because HMM does not have all the flags
needed by the drivers (nor does HMM need them). HMM being the helper
here, I feel it is up to HMM to adapt to the drivers rather than the
other way around.]

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-20  2:00 ` [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2 jglisse
@ 2018-03-21  4:39   ` John Hubbard
  2018-03-21 15:52     ` Jerome Glisse
  0 siblings, 1 reply; 21+ messages in thread
From: John Hubbard @ 2018-03-21  4:39 UTC (permalink / raw)
  To: jglisse, linux-mm
  Cc: Andrew Morton, linux-kernel, Evgeny Baskakov, Ralph Campbell,
	Mark Hairgrove

On 03/19/2018 07:00 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> User of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
> and pfn shift value allowing them to define their own encoding for HMM
> pfn that are fill inside the pfns array of the hmm_range struct. With
> this device driver can get pfn that match their own private encoding
> out of HMM without having to do any conversion.
> 
> Changed since v1:
>   - Split flags and special values for clarification
>   - Improved comments and provide examples
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/hmm.h | 130 +++++++++++++++++++++++++++++++++++++---------------
>  mm/hmm.c            |  85 +++++++++++++++++++---------------
>  2 files changed, 142 insertions(+), 73 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 0f7ea3074175..5d26e0a223d9 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -80,68 +80,145 @@
>  struct hmm;
>  
>  /*
> + * hmm_pfn_flag_e - HMM flag enums
> + *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
>   * HMM_PFN_WRITE: CPU page table has write permission set
> + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
> + *
> + * The driver provide a flags array, if driver valid bit for an entry is bit
> + * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
> + * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
> + * Same logic apply to all flags. This is same idea as vm_page_prot in vma
> + * except that this is per device driver rather than per architecture.

Hi Jerome,

If we go with this approach--and I hope not, I'll try to talk you down from the
ledge, in a moment--then maybe we should add the following to the comments: 

"There is only one bit ever set in each hmm_range.flags[entry]." 

Or maybe we'll get pushback that the code shows that already, but IMHO
this is a strange way to do things (especially when there is a much
easier way), and it deserves that extra bit of helpful documentation.

More below...

> + */
> +enum hmm_pfn_flag_e {
> +	HMM_PFN_VALID = 0,
> +	HMM_PFN_WRITE,
> +	HMM_PFN_DEVICE_PRIVATE,
> +	HMM_PFN_FLAG_MAX
> +};
> +
> +/*
> + * hmm_pfn_value_e - HMM pfn special value
> + *
> + * Flags:
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
> + * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
>   *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
>   *      set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
> + *
> + * The driver provides entry values for the none, error and special entries.
> + * The driver can alias some of them (i.e. use the same value for error and
> + * special, for instance), but it must not alias none with error or special.
> + *
> + * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
> + * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisoned,
> + * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
> + * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
>   */
> -#define HMM_PFN_VALID (1 << 0)
> -#define HMM_PFN_WRITE (1 << 1)
> -#define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_SPECIAL (1 << 3)
> -#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
> -#define HMM_PFN_SHIFT 5
> +enum hmm_pfn_value_e {
> +	HMM_PFN_ERROR,
> +	HMM_PFN_NONE,
> +	HMM_PFN_SPECIAL,
> +	HMM_PFN_VALUE_MAX
> +};

I can think of perhaps two good solid ways to get what you want, without
moving to what I consider an unnecessary excursion into arrays of flags. 
If I understand correctly, you want to let each architecture
specify which bit to use for each of the above HMM_PFN_* flags. 

The way you have it now, the code does things like this:

        cpu_flags & range->flags[HMM_PFN_WRITE]

but that array entry is mostly empty space, and it's confusing. It would
be nicer to see:

        cpu_flags & HMM_PFN_WRITE

...which you can easily do, by defining HMM_PFN_WRITE and friends in an
arch-specific header file.
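
Roughly like this, for instance (just a sketch, the header name and bit
values here are invented):

/* in an arch- or driver-specific header, e.g. a hypothetical asm/hmm.h */
#define HMM_PFN_VALID           (1ULL << 0)
#define HMM_PFN_WRITE           (1ULL << 1)
#define HMM_PFN_DEVICE_PRIVATE  (1ULL << 2)

That keeps every call site a plain bit test, with no array lookup in the way.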

The other way to make this more readable would be to use helper routines
similar to what the vm_pgprot* routines do:

static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
	return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
}

...but that's also unnecessary.
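
For HMM such a helper might look like the sketch below (hmm_pfn_is_write()
is a made-up name, not something this patch adds):

static inline bool hmm_pfn_is_write(const struct hmm_range *range,
				    uint64_t pfn)
{
	return pfn & range->flags[HMM_PFN_WRITE];
}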

Let's just keep it simple, and go back to the bitmap flags!

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2
  2018-03-20  2:00 [PATCH 00/15] hmm: fixes and documentations v3 jglisse
@ 2018-03-20  2:00 ` jglisse
  2018-03-21  4:39   ` John Hubbard
  0 siblings, 1 reply; 21+ messages in thread
From: jglisse @ 2018-03-20  2:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, linux-kernel, Jérôme Glisse,
	Evgeny Baskakov, Ralph Campbell, Mark Hairgrove, John Hubbard

From: Jérôme Glisse <jglisse@redhat.com>

Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
and a pfn shift value, allowing them to define their own encoding for
the HMM pfns that are filled into the pfns array of the hmm_range
struct. With this, device drivers can get pfns that match their own
private encoding out of HMM without having to do any conversion.
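
As an illustration only (this is not part of the patch), a driver that keeps
its flag bits in bits 3-5 of each entry and reserves small integers for the
special values could set up its range roughly like this; my_flags, my_values
and the bit layout are invented for the example:

static const uint64_t my_flags[HMM_PFN_FLAG_MAX] = {
	[HMM_PFN_VALID]          = 1ULL << 3,
	[HMM_PFN_WRITE]          = 1ULL << 4,
	[HMM_PFN_DEVICE_PRIVATE] = 1ULL << 5,
};

static const uint64_t my_values[HMM_PFN_VALUE_MAX] = {
	[HMM_PFN_NONE]    = 0,
	[HMM_PFN_ERROR]   = 0x1,
	[HMM_PFN_SPECIAL] = 0x2,
};

/* vma, start, end and pfns[] come from the driver's fault path */
struct hmm_range range = {
	.vma       = vma,
	.start     = start,
	.end       = end,
	.pfns      = pfns,
	.flags     = my_flags,
	.values    = my_values,
	.pfn_shift = 6,	/* raw pfn stored above the flag bits */
};

ret = hmm_vma_fault(&range, true);	/* 0 on success, negative errno otherwise */

hmm_vma_fault() then fills pfns[] with entries that are already in the
driver's own format, so no conversion is needed on the way out.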

Changed since v1:
  - Split flags and special values for clarification
  - Improved comments and provided examples

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/hmm.h | 130 +++++++++++++++++++++++++++++++++++++---------------
 mm/hmm.c            |  85 +++++++++++++++++++---------------
 2 files changed, 142 insertions(+), 73 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 0f7ea3074175..5d26e0a223d9 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,68 +80,145 @@
 struct hmm;
 
 /*
+ * hmm_pfn_flag_e - HMM flag enums
+ *
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
+ * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
+ *
+ * The driver provides a flags array. For example, if the driver's valid bit
+ * for an entry is bit 3, i.e. (entry & (1 << 3)) is true when the entry is
+ * valid, then the driver must set hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
+ * The same logic applies to all flags. This is the same idea as vm_page_prot
+ * in the vma, except that it is per device driver rather than per architecture.
+ */
+enum hmm_pfn_flag_e {
+	HMM_PFN_VALID = 0,
+	HMM_PFN_WRITE,
+	HMM_PFN_DEVICE_PRIVATE,
+	HMM_PFN_FLAG_MAX
+};
+
+/*
+ * hmm_pfn_value_e - HMM pfn special value
+ *
+ * Flags:
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *      result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *      be mirrored by a device, because the entry will never have HMM_PFN_VALID
  *      set and the pfn value is undefined.
- * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
+ *
+ * The driver provides entry values for the none, error and special entries.
+ * The driver can alias some of them (i.e. use the same value for error and
+ * special, for instance), but it must not alias none with error or special.
+ *
+ * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
+ * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisoned,
+ * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
+ * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
  */
-#define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_SPECIAL (1 << 3)
-#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
-#define HMM_PFN_SHIFT 5
+enum hmm_pfn_value_e {
+	HMM_PFN_ERROR,
+	HMM_PFN_NONE,
+	HMM_PFN_SPECIAL,
+	HMM_PFN_VALUE_MAX
+};
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @vma: the vm area struct for the range
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @flags: pfn flags to match device driver page table
+ * @values: pfn values for special cases (none, special, error, ...)
+ * @pfn_shift: pfn shift value (should be <= PAGE_SHIFT)
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+	struct vm_area_struct	*vma;
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	uint64_t		*pfns;
+	const uint64_t		*flags;
+	const uint64_t		*values;
+	uint8_t			pfn_shift;
+	bool			valid;
+};
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @range: range used to decode HMM pfn value
  * @pfn: HMM pfn value to get corresponding struct page from
  * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
  * If the HMM pfn is valid (ie valid flag set) then return the struct page
  * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_to_page(uint64_t pfn)
+static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return NULL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return NULL;
-	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return NULL;
+	return pfn_to_page(pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
+ * @range: range used to decode HMM pfn value
  * @pfn: HMM pfn value to extract pfn from
  * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range,
+					   uint64_t pfn)
 {
-	if (!(pfn & HMM_PFN_VALID))
+	if (pfn == range->values[HMM_PFN_NONE])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_ERROR])
+		return -1UL;
+	if (pfn == range->values[HMM_PFN_SPECIAL])
 		return -1UL;
-	return (pfn >> HMM_PFN_SHIFT);
+	if (!(pfn & range->flags[HMM_PFN_VALID]))
+		return -1UL;
+	return (pfn >> range->pfn_shift);
 }
 
 /*
  * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @range: range used to encode HMM pfn value
  * @page: struct page pointer for which to create the HMM pfn
  * Returns: valid HMM pfn for the page
  */
-static inline uint64_t hmm_pfn_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(const struct hmm_range *range,
+					 struct page *page)
 {
-	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (page_to_pfn(page) << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 /*
  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @range: range used to encode HMM pfn value
  * @pfn: pfn value for which to create the HMM pfn
  * Returns: valid HMM pfn for the pfn
  */
-static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
+					unsigned long pfn)
 {
-	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+	return (pfn << range->pfn_shift) |
+		range->flags[HMM_PFN_VALID];
 }
 
 
@@ -263,25 +340,6 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
-/*
- * struct hmm_range - track invalidation lock on virtual address range
- *
- * @vma: the vm area struct for the range
- * @list: all range lock are on a list
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @pfns: array of pfns (big enough for the range)
- * @valid: pfns array did not change since it has been fill by an HMM function
- */
-struct hmm_range {
-	struct vm_area_struct	*vma;
-	struct list_head	list;
-	unsigned long		start;
-	unsigned long		end;
-	uint64_t		*pfns;
-	bool			valid;
-};
-
 /*
  * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take a device
  * driver lock that serializes device page table updates, then call
diff --git a/mm/hmm.c b/mm/hmm.c
index 956689804600..99473ca37b4c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -282,6 +282,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 {
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	int r;
 
@@ -291,7 +292,7 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	if (r & VM_FAULT_RETRY)
 		return -EBUSY;
 	if (r & VM_FAULT_ERROR) {
-		*pfn = HMM_PFN_ERROR;
+		*pfn = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
@@ -309,7 +310,7 @@ static int hmm_pfns_bad(unsigned long addr,
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++)
-		pfns[i] = HMM_PFN_ERROR;
+		pfns[i] = range->values[HMM_PFN_ERROR];
 
 	return 0;
 }
@@ -338,7 +339,7 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
 	hmm_vma_walk->last = addr;
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++) {
-		pfns[i] = 0;
+		pfns[i] = range->values[HMM_PFN_NONE];
 		if (fault || write_fault) {
 			int ret;
 
@@ -356,24 +357,27 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 				      uint64_t pfns, uint64_t cpu_flags,
 				      bool *fault, bool *write_fault)
 {
+	struct hmm_range *range = hmm_vma_walk->range;
+
 	*fault = *write_fault = false;
 	if (!hmm_vma_walk->fault)
 		return;
 
 	/* We aren't ask to do anything ... */
-	if (!(pfns & HMM_PFN_VALID))
+	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
 	/* If CPU page table is not valid then we need to fault */
-	*fault = cpu_flags & HMM_PFN_VALID;
+	*fault = cpu_flags & range->flags[HMM_PFN_VALID];
 	/* Need to write fault ? */
-	if ((pfns & HMM_PFN_WRITE) && !(cpu_flags & HMM_PFN_WRITE)) {
+	if ((pfns & range->flags[HMM_PFN_WRITE]) &&
+	    !(cpu_flags & range->flags[HMM_PFN_WRITE])) {
 		*fault = *write_fault = false;
 		return;
 	}
 	/* Do we fault on device memory ? */
-	if ((pfns & HMM_PFN_DEVICE_PRIVATE) &&
-	    (cpu_flags & HMM_PFN_DEVICE_PRIVATE)) {
-		*write_fault = pfns & HMM_PFN_WRITE;
+	if ((pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) &&
+	    (cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
+		*write_fault = pfns & range->flags[HMM_PFN_WRITE];
 		*fault = true;
 	}
 }
@@ -415,13 +419,13 @@ static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 }
 
-static inline uint64_t pmd_to_hmm_pfn_flags(pmd_t pmd)
+static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
 {
 	if (pmd_protnone(pmd))
 		return 0;
-	return pmd_write(pmd) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pmd(struct mm_walk *walk,
@@ -431,12 +435,13 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 			      pmd_t pmd)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long pfn, npages, i;
-	uint64_t flag = 0, cpu_flags;
 	bool fault, write_fault;
+	uint64_t cpu_flags;
 
 	npages = (end - addr) >> PAGE_SHIFT;
-	cpu_flags = pmd_to_hmm_pfn_flags(pmd);
+	cpu_flags = pmd_to_hmm_pfn_flags(range, pmd);
 	hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags,
 			     &fault, &write_fault);
 
@@ -444,20 +449,19 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
 
 	pfn = pmd_pfn(pmd) + pte_index(addr);
-	flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
 	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
+		pfns[i] = hmm_pfn_from_pfn(range, pfn) | cpu_flags;
 	hmm_vma_walk->last = end;
 	return 0;
 }
 
-static inline uint64_t pte_to_hmm_pfn_flags(pte_t pte)
+static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
 {
 	if (pte_none(pte) || !pte_present(pte))
 		return 0;
-	return pte_write(pte) ? HMM_PFN_VALID |
-				HMM_PFN_WRITE :
-				HMM_PFN_VALID;
+	return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_WRITE] :
+				range->flags[HMM_PFN_VALID];
 }
 
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
@@ -465,18 +469,18 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      uint64_t *pfns)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
 	struct vm_area_struct *vma = walk->vma;
 	bool fault, write_fault;
 	uint64_t cpu_flags;
 	pte_t pte = *ptep;
 
-	*pfns = 0;
-	cpu_flags = pte_to_hmm_pfn_flags(pte);
+	*pfns = range->values[HMM_PFN_NONE];
+	cpu_flags = pte_to_hmm_pfn_flags(range, pte);
 	hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
 			   &fault, &write_fault);
 
 	if (pte_none(pte)) {
-		*pfns = 0;
 		if (fault || write_fault)
 			goto fault;
 		return 0;
@@ -496,11 +500,16 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * device and report anything else as error.
 		 */
 		if (is_device_private_entry(entry)) {
-			cpu_flags = HMM_PFN_VALID | HMM_PFN_DEVICE_PRIVATE;
+			cpu_flags = range->flags[HMM_PFN_VALID] |
+				range->flags[HMM_PFN_DEVICE_PRIVATE];
 			cpu_flags |= is_write_device_private_entry(entry) ?
-					HMM_PFN_WRITE : 0;
-			*pfns = hmm_pfn_from_pfn(swp_offset(entry));
-			*pfns |= HMM_PFN_DEVICE_PRIVATE;
+				range->flags[HMM_PFN_WRITE] : 0;
+			hmm_pte_need_fault(hmm_vma_walk, *pfns, cpu_flags,
+					   &fault, &write_fault);
+			if (fault || write_fault)
+				goto fault;
+			*pfns = hmm_pfn_from_pfn(range, swp_offset(entry));
+			*pfns |= cpu_flags;
 			return 0;
 		}
 
@@ -516,14 +525,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		}
 
 		/* Report error for everything else */
-		*pfns = HMM_PFN_ERROR;
+		*pfns = range->values[HMM_PFN_ERROR];
 		return -EFAULT;
 	}
 
 	if (fault || write_fault)
 		goto fault;
 
-	*pfns = hmm_pfn_from_pfn(pte_pfn(pte)) | cpu_flags;
+	*pfns = hmm_pfn_from_pfn(range, pte_pfn(pte)) | cpu_flags;
 	return 0;
 
 fault:
@@ -592,12 +601,13 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
-static void hmm_pfns_clear(uint64_t *pfns,
+static void hmm_pfns_clear(struct hmm_range *range,
+			   uint64_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
 {
 	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = 0;
+		*pfns = range->values[HMM_PFN_NONE];
 }
 
 static void hmm_pfns_special(struct hmm_range *range)
@@ -605,7 +615,7 @@ static void hmm_pfns_special(struct hmm_range *range)
 	unsigned long addr = range->start, i = 0;
 
 	for (; addr < range->end; addr += PAGE_SIZE, i++)
-		range->pfns[i] = HMM_PFN_SPECIAL;
+		range->pfns[i] = range->values[HMM_PFN_SPECIAL];
 }
 
 /*
@@ -658,7 +668,7 @@ int hmm_vma_get_pfns(struct hmm_range *range)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -811,7 +821,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 
 	hmm = hmm_register(vma->vm_mm);
 	if (!hmm) {
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -ENOMEM;
 	}
 	/* Caller must have registered a mirror using hmm_mirror_register() */
@@ -831,7 +841,7 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		 * write without read access are not supported by HMM, because
 		 * operations such has atomic access would not work.
 		 */
-		hmm_pfns_clear(range->pfns, range->start, range->end);
+		hmm_pfns_clear(range, range->pfns, range->start, range->end);
 		return -EPERM;
 	}
 
@@ -864,7 +874,8 @@ int hmm_vma_fault(struct hmm_range *range, bool block)
 		unsigned long i;
 
 		i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-		hmm_pfns_clear(&range->pfns[i], hmm_vma_walk.last, range->end);
+		hmm_pfns_clear(range, &range->pfns[i], hmm_vma_walk.last,
+			       range->end);
 		hmm_vma_range_done(range);
 	}
 	return ret;
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-04-08  3:20 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-23  0:55 [PATCH 00/15] hmm: fixes and documentations v4 jglisse
2018-03-23  0:55 ` [PATCH 01/15] mm/hmm: documentation editorial update to HMM documentation jglisse
2018-04-08  3:19   ` Randy Dunlap
2018-03-23  0:55 ` [PATCH 02/15] mm/hmm: fix header file if/else/endif maze v2 jglisse
2018-03-23  0:55 ` [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3 jglisse
2018-03-23  0:55 ` [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v3 jglisse
2018-03-23  0:55 ` [PATCH 05/15] mm/hmm: hmm_pfns_bad() was accessing wrong struct jglisse
2018-03-23  0:55 ` [PATCH 06/15] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters v2 jglisse
2018-03-23  0:55 ` [PATCH 07/15] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture v2 jglisse
2018-03-23  0:55 ` [PATCH 08/15] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong v2 jglisse
2018-03-23  0:55 ` [PATCH 09/15] mm/hmm: cleanup special vma handling (VM_SPECIAL) jglisse
2018-03-23  0:55 ` [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v3 jglisse
2018-03-23  0:55 ` [PATCH 11/15] mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE jglisse
2018-03-23  0:55 ` [PATCH 12/15] mm/hmm: move hmm_pfns_clear() closer to where it is use jglisse
2018-03-23  0:55 ` [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd() v2 jglisse
2018-03-23  0:55 ` [PATCH 14/15] mm/hmm: change hmm_vma_fault() to allow write fault on page basis jglisse
2018-03-23  0:55 ` [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2 jglisse
  -- strict thread matches above, loose matches on Subject: below --
2018-03-20  2:00 [PATCH 00/15] hmm: fixes and documentations v3 jglisse
2018-03-20  2:00 ` [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2 jglisse
2018-03-21  4:39   ` John Hubbard
2018-03-21 15:52     ` Jerome Glisse
2018-03-21 23:19       ` John Hubbard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).